Overview
We are looking for a Lead Data Engineer for the Machine Learning (ML) engineering development team. The primary focus will be to gather requirements from ML and Data Science (DS) teams, identify the optimal solution, and then design, implement, monitor and maintain scalable distributed big data pipelines for a range of ML use cases. You will work with Data Scientists to train, refresh and serve models using these pipelines.
Responsibilities
- Collaborate with ML engineers and Data Scientists to gather requirements.
- Design and implement ETL big data pipelines to train ML models.
- Build stream-processing and batch pipelines using UDFs and ML libraries, and load the processed data into multiple distributed data stores.
- Apply API programming knowledge to train and serve ML models.
- Responsible for availability, scalability, reliability, and performance of the big data platform.
Skills and Qualifications
- Minimum of 6 years of relevant experience.
- Proven background in ETL development and large-scale data processing.
- Proficiency with the Big Data ecosystem: Spark (PySpark), Hadoop, HDFS, Hive, NoSQL, and modern cloud data lakes (Cloudera Data Platform or Delta Lake).
- Strong SQL expertise, including optimizing complex joins, and a solid grasp of database concepts.
- Knowledge of AWS is a plus.
- Knowledge of AI/ML and MLOps is a plus.