Data Science

Overview

We are looking for a Lead Data Engineer for the Machine Learning (ML) engineering development team. The primary focus will be to gather requirements from ML and Data Science (DS) teams and identify the optimal solution, then design, implement, monitor, and maintain scalable, distributed big data pipelines for a range of ML use cases. You will work with Data Scientists to train, refresh, and serve models using these pipelines.

Responsibilities

  • Collaborate with ML engineers and Data Scientists to gather requirements.
  • Design and implement big data ETL pipelines to train ML models.
  • Build stream-processing and batch pipelines using UDFs and ML libraries, and load processed data into multiple distributed data sources.
  • Apply API programming knowledge to train and serve ML models.
  • Responsible for availability, scalability, reliability, and performance of the big data platform.

Skills and Qualifications

  • Minimum of 6 years of relevant experience.
  • Proven background in ETL development and large-scale data processing.
  • Proficiency with the big data ecosystem: Spark (PySpark), Hadoop, HDFS, Hive, NoSQL, and modern cloud data lakes (Cloudera Data Platform or Delta Lake).
  • Strong SQL expertise, including optimizing complex joins, and a solid grasp of database concepts.
  • Knowledge of AWS is a plus.
  • Knowledge of AI/ML and MLOps is a plus.