SMARTDATA.WORKS

Overview

We are looking for a Lead Data Engineer for the Machine Learning (ML) engineering development team. The primary focus will be to gather requirements from ML and Data Science (DS) teams, identify the optimal solution, and then design, implement, monitor, and maintain scalable distributed big data pipelines for a range of ML use cases. You will work with Data Scientists to train, refresh, and serve models using these pipelines.

Responsibilities

Collaborate with ML engineers and Data Scientists to gather requirements.
Design and implement ETL big data pipelines to train ML models.
Build stream-processing and batch pipelines using UDFs and ML libraries, and load processed data to multiple distributed data sources.
Apply API programming knowledge to train and serve ML models.
Responsible for availability, scalability, reliability, and performance of the big data platform.

Skills and Qualifications

Minimum of 6 years of relevant experience.
Proven background in ETL development and large-scale data processing.
Proficiency with the big data ecosystem: Spark (PySpark), Hadoop, HDFS, Hive, NoSQL, and modern cloud data lakes (Cloudera Data Platform or Delta Lake).
Strong SQL expertise, including optimizing complex joins, and a solid grasp of database concepts.
Knowledge of AWS is a plus.
Knowledge of AI/ML and MLOps is a plus.
