Freshworks is seeking a Lead Data Engineer for their Machine Learning engineering team. This role involves gathering requirements from ML/DS teams, designing and implementing scalable distributed big data pipelines for ML use-cases, and working with Data Scientists to train and serve models.
About the role
Overview
We are looking for a Lead Data Engineer for the Machine Learning (ML) engineering development team. The primary focus will be to gather requirements from ML/DS teams, identify the optimal solution, and then design, implement, monitor, and maintain scalable distributed big data pipelines for different ML use cases. You will work with Data Scientists to train, refresh, and serve models using these pipelines.
Responsibilities
Collaborate with ML engineers and Data Scientists to gather requirements.
Design and implement ETL big data pipelines to train ML models.
Build stream-processing and batch pipelines using UDFs and ML libraries, and load processed data into multiple distributed data sources.
Apply API programming knowledge to train and serve the ML models.
Select and integrate the big data tools and frameworks required for processing.
Responsible for availability, scalability, reliability, and performance of the big data platform.
Skills And Qualifications
Minimum of 6 years of relevant experience.
Proven background in ETL development and large scale data processing.
Proficiency with the Big Data ecosystem: Spark (PySpark), Hadoop, HDFS, Hive, NoSQL, and modern cloud data lakes (Cloudera Data Platform or Delta Lake).
Strong SQL expertise, including optimizing complex joins, and a solid grasp of database concepts.
Strong programming development experience in languages like Python and Java.
Experience building stream-processing systems using Spark Streaming.
Experience with workflow orchestration tools such as Oozie or Airflow.
Experience with Unix/Shell or Python scripting.
Knowledge of AWS is a plus.
Knowledge of AI/ML and MLOps is a plus.
Skills
ETL, Spark, PySpark, Hadoop, HDFS, Hive, NoSQL, Cloudera Data Platform, Delta Lake, SQL, Python, Java, Spark Streaming, Oozie, Airflow, Unix/Shell, AWS, AI/ML, MLOps