

Data Engineer
We are seeking a highly skilled Data Engineer to set up Change Data Capture (CDC) across multiple database types in support of data lake hydration. The ideal candidate has hands-on experience with Debezium or a comparable CDC framework and strong expertise in ETL transformations using Apache Spark for both streaming and batch processing.
Key Responsibilities
• Implement Change Data Capture (CDC) for diverse databases to enable real-time and batch data ingestion.
• Develop ETL pipelines using Apache Spark (PySpark/Java) to transform raw CDC data into structured, analytics-ready datasets (see the CDC-to-lake sketch after this list).
• Work with Apache Spark DataFrames, Spark SQL, and Spark Streaming to build scalable data pipelines.
• Optimize data workflows for performance, reliability, and scalability in a big data environment.
• Utilize Apache Airflow to orchestrate data pipelines and schedule workflows (a minimal DAG sketch also follows this list).
• Leverage AWS services for data ingestion, storage, transformation, and processing (e.g., S3, Glue, EMR, Lambda, Step Functions, MWAA).
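To make the CDC-to-lake responsibility concrete, here is a minimal PySpark Structured Streaming sketch. It assumes Debezium is already publishing change events to Kafka with its JSON converter (embedded schemas disabled); the broker address, topic, table columns, and S3 paths are hypothetical, not part of the role's actual stack.

```python
# Minimal sketch, assuming Debezium publishes "orders" change events to
# Kafka with its JSON converter (embedded schemas disabled). Requires the
# spark-sql-kafka-0-10 connector package on the classpath.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("cdc-hydration").getOrCreate()

# The slice of the Debezium change-event envelope this sketch uses;
# the "orders" columns are hypothetical.
after_schema = StructType([
    StructField("order_id", StringType()),
    StructField("status", StringType()),
])
envelope = StructType([
    StructField("op", StringType()),     # c=create, u=update, d=delete
    StructField("after", after_schema),  # row state after the change
    StructField("ts_ms", LongType()),    # change timestamp (ms)
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical
       .option("subscribe", "dbserver1.inventory.orders")  # hypothetical
       .load())

changes = (raw
           .select(F.from_json(F.col("value").cast("string"), envelope).alias("e"))
           .select("e.op", "e.after.order_id", "e.after.status", "e.ts_ms"))

# Land raw change events on S3; analytics-ready transforms run downstream.
(changes.writeStream
 .format("parquet")
 .option("path", "s3://data-lake/raw/orders/")             # hypothetical
 .option("checkpointLocation", "s3://data-lake/checkpoints/orders/")
 .start())
```

Landing raw change events first keeps replays cheap; the analytics-ready tables are then built by separate Spark batch jobs.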
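Likewise, a minimal Airflow DAG sketch for the orchestration responsibility; the DAG id, schedule, and spark-submit scripts are placeholders.

```python
# Minimal Airflow DAG sketch; dag_id, schedule, and the spark-submit
# scripts are placeholders. Assumes Airflow 2.x (as on MWAA).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="cdc_batch_transform",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # Compact raw CDC files, then rebuild the analytics-ready table.
    compact = BashOperator(
        task_id="compact_raw_cdc",
        bash_command="spark-submit /jobs/compact_cdc.py",     # placeholder
    )
    publish = BashOperator(
        task_id="publish_analytics_table",
        bash_command="spark-submit /jobs/publish_orders.py",  # placeholder
    )
    compact >> publish  # run compaction before publishing
```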
Required Skills
• Java: Mid- to senior-level experience.
• Python (PySpark): Mid-level experience.
• Apache Spark: Proficiency in DataFrames, Spark SQL, Spark Streaming, and ETL pipelines.
• Apache Airflow: Experience managing and scheduling workflows.
• AWS Expertise:
  • S3 (CRUD operations)
  • EMR & EMR Serverless (see the job-submission sketch after this list)
  • Glue Data Catalog
  • Step Functions
  • MWAA (Managed Workflows for Apache Airflow)
  • AWS Lambda (Python-based)
  • AWS Batch
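For the EMR Serverless item above, a hedged boto3 sketch of submitting a Spark job; the application ID, execution role, and S3 paths are placeholders.

```python
# Hedged sketch of submitting a PySpark job to EMR Serverless with boto3;
# the application ID, execution role, and S3 paths are placeholders.
import boto3

client = boto3.client("emr-serverless", region_name="us-east-1")

response = client.start_job_run(
    applicationId="00example123",  # placeholder application ID
    executionRoleArn="arn:aws:iam::123456789012:role/emr-job-role",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://data-lake/jobs/publish_orders.py",
            "sparkSubmitParameters": "--conf spark.executor.memory=4g",
        }
    },
    configurationOverrides={
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {"logUri": "s3://data-lake/emr-logs/"}
        }
    },
)
print(response["jobRunId"])  # track the run for polling or alerting
```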
Nice-to-Have Skills
• Scala for Spark development.
• Apache Hudi for incremental data processing and ACID transactions (a minimal upsert sketch follows this list).
• Apache Griffin for data quality and validation.
• Performance tuning and optimization in big data environments.
• Deequ (AWS Labs) for automated data quality checks on Spark datasets.
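For the Apache Hudi item, a minimal upsert sketch; it assumes the Hudi Spark bundle is on the classpath, and the table name, columns, and paths are illustrative only.

```python
# Minimal Hudi upsert sketch; assumes the hudi-spark bundle is on the
# classpath, and the table, columns, and paths are illustrative only.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-upsert")
         # Hudi requires Kryo serialization.
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# Placeholder input: change rows with order_id, order_date, and ts_ms columns.
updates = spark.read.parquet("s3://data-lake/raw/orders/")

(updates.write
 .format("hudi")
 .option("hoodie.table.name", "orders")
 .option("hoodie.datasource.write.recordkey.field", "order_id")
 .option("hoodie.datasource.write.partitionpath.field", "order_date")
 .option("hoodie.datasource.write.precombine.field", "ts_ms")  # latest wins
 .option("hoodie.datasource.write.operation", "upsert")
 .mode("append")
 .save("s3://data-lake/hudi/orders/"))
```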