Data Engineer

This role is for a Data Engineer with an unspecified contract length and pay rate. Key skills include Java, Python (PySpark), Apache Spark, Apache Airflow, and AWS services. Industry experience with Change Data Capture (CDC) is required.
🌎 - Country
United States
💱 - Currency
$ USD
💰 - Day rate
Unknown
🗓️ - Date discovered
February 18, 2025
🕒 - Project duration
Unknown
🏝️ - Location type
Unknown
📄 - Contract type
Unknown
🔒 - Security clearance
Unknown
📍 - Location detailed
Reston, VA
🧠 - Skills detailed
#Data Ingestion #Big Data #S3 (Amazon Simple Storage Service) #Data Processing #ACID (Atomicity, Consistency, Isolation, Durability) #Batch #SQL (Structured Query Language) #Spark (Apache Spark) #Apache Spark #Java #Data Lake #AWS Lambda #Python #Data Catalog #Spark SQL #Storage #Data Engineering #Datasets #ETL (Extract, Transform, Load) #Airflow #Data Quality #Data Pipeline #AWS (Amazon Web Services) #PySpark #Lambda (AWS Lambda) #Scala #Apache Airflow #Databases
Role description

Role: Data Engineer

We are seeking a highly skilled Data Engineer to set up Change Data Capture (CDC) for multiple database types to support data lake hydration. The ideal candidate should have hands-on experience with Debezium or other CDC frameworks and strong expertise in ETL transformations using Apache Spark for both streaming and batch data processing.
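
To make the shape of the pipeline concrete, below is a minimal PySpark sketch of the streaming half, assuming Debezium publishes CDC events to Kafka and the raw zone lives on S3. The broker, topic, JSON path, and bucket paths are illustrative assumptions, not details from the posting.

```python
# Minimal sketch: read Debezium CDC events from Kafka with Spark Structured
# Streaming and land them, unmodified, in a raw S3 zone for later batch
# processing. Requires the spark-sql-kafka connector on the classpath.
# Broker, topic, and S3 paths are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cdc-raw-ingest").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
    .option("subscribe", "cdc.inventory.customers")      # assumed Debezium topic
    .option("startingOffsets", "latest")
    .load()
)

# Keep the full Debezium envelope as a string and pull out the operation
# code (c=create, u=update, d=delete). The JSON path depends on the
# connector's converter settings; "$.payload.op" assumes the default JSON
# converter with schemas enabled.
changes = (
    raw.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "timestamp",
    )
    .withColumn("op", F.get_json_object("value", "$.payload.op"))
)

query = (
    changes.writeStream.format("parquet")
    .option("path", "s3://example-data-lake/raw/customers/")                     # assumed path
    .option("checkpointLocation", "s3://example-data-lake/checkpoints/customers/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```

A downstream batch job would then merge these raw change events into structured, analytics-ready tables, which is what the responsibilities below describe.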

Key Responsibilities
• Implement Change Data Capture (CDC) for diverse databases to enable real-time and batch data ingestion.
• Develop ETL pipelines using Apache Spark (PySpark/Java) to transform raw CDC data into structured, analytics-ready datasets.
• Work with Apache Spark DataFrames, Spark SQL, and Spark Streaming to build scalable data pipelines.
• Optimize data workflows for performance, reliability, and scalability in a big data environment.
• Utilize Apache Airflow to orchestrate data pipelines and schedule workflows (see the DAG sketch after this list).
• Leverage AWS services for data ingestion, storage, transformation, and processing (e.g., S3, Glue, EMR, Lambda, Step Functions, MWAA).
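
As referenced above, here is a minimal Airflow 2.x DAG sketch of that orchestration step; the DAG id, schedule, and spark-submit invocations are illustrative assumptions, not details from the posting.

```python
# Minimal Airflow 2.x DAG sketch: a nightly batch job that compacts the raw
# CDC landing zone into analytics-ready tables, followed by a data quality
# check. DAG id, schedule, and the spark-submit commands are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="cdc_batch_compaction",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",      # Airflow 2.4+; use schedule_interval on older versions
    catchup=False,
    tags=["cdc", "data-lake"],
) as dag:

    compact_customers = BashOperator(
        task_id="compact_customers",
        bash_command=(
            "spark-submit --deploy-mode cluster "
            "s3://example-data-lake/jobs/compact_cdc.py "
            "--table customers --ds {{ ds }}"
        ),
    )

    data_quality_check = BashOperator(
        task_id="data_quality_check",
        bash_command=(
            "spark-submit --deploy-mode cluster "
            "s3://example-data-lake/jobs/dq_checks.py "
            "--table customers --ds {{ ds }}"
        ),
    )

    compact_customers >> data_quality_check
```

In an MWAA setup, these BashOperator tasks would typically be swapped for operators from the Amazon provider package (e.g., an EMR Serverless job-run operator), but the orchestration pattern is the same.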

Required Skills
• Java: Mid to senior-level experience.
• Python (PySpark): Mid-level experience.
• Apache Spark: Proficiency in DataFrames, Spark SQL, Spark Streaming, and ETL pipelines.
• Apache Airflow: Experience managing and scheduling workflows.
• AWS Expertise:
  • S3 (CRUD operations)
  • EMR & EMR Serverless
  • Glue Data Catalog
  • Step Functions
  • MWAA (Managed Workflows for Apache Airflow)
  • AWS Lambda (Python-based; see the sketch after this list)
  • AWS Batch
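
For illustration, here is a small boto3 sketch of the kind of Python-based Lambda the S3 and Lambda items above imply; the bucket name and event shape are assumptions, not part of the posting.

```python
# Minimal Python Lambda sketch covering basic S3 CRUD with boto3 -- the kind
# of glue code the S3 and Lambda skills above imply. Bucket name and event
# shape are placeholder assumptions.
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"   # assumed bucket name


def lambda_handler(event, context):
    """Route simple create/read/update/delete actions against S3 objects."""
    action = event.get("action")
    key = event["key"]

    if action in ("create", "update"):
        # Create and update are both PutObject calls; S3 overwrites in place.
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(event["body"]))
        return {"status": "written", "key": key}

    if action == "read":
        obj = s3.get_object(Bucket=BUCKET, Key=key)
        return {"status": "ok", "body": obj["Body"].read().decode("utf-8")}

    if action == "delete":
        s3.delete_object(Bucket=BUCKET, Key=key)
        return {"status": "deleted", "key": key}

    raise ValueError(f"Unsupported action: {action}")
```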

Nice-to-Have Skills (Bonus)
• Scala for Spark development.
• Apache Hudi for incremental data processing and ACID transactions (see the sketch below).
• Apache Griffin for data quality and validation.
• Performance tuning and optimization in big data environments.
• AWS Deequ for automated data quality checks on Spark datasets (not required, but a plus).
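
As a sketch of the Hudi bonus skill, the snippet below upserts deduplicated CDC records into a Hudi table so updates and deletes land as ACID transactions on the data lake; the table name, record key, precombine field, and paths are assumptions.

```python
# Sketch: upsert deduplicated CDC records into an Apache Hudi table so that
# updates and deletes are applied as ACID transactions on the data lake.
# Requires the hudi-spark bundle on the classpath; names and paths are
# illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("cdc-hudi-upsert")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Assume this DataFrame holds the latest record per primary key, produced
# by the batch compaction step.
latest = spark.read.parquet("s3://example-data-lake/staging/customers_latest/")

hudi_options = {
    "hoodie.table.name": "customers",
    "hoodie.datasource.write.recordkey.field": "customer_id",   # assumed key
    "hoodie.datasource.write.precombine.field": "ts_ms",        # latest wins
    "hoodie.datasource.write.operation": "upsert",
}

(
    latest.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-data-lake/curated/customers/")
)
```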