PySpark Data Engineer

This role is for a PySpark Data Engineer, offering more than 6 months of onsite work in Irving, TX; Jacksonville, FL; or Jersey City, NJ, with an undisclosed pay rate. It requires 10+ years of data management experience, plus Hadoop, SQL, and banking domain expertise.
🌎 - Country
United States
💱 - Currency
$ USD
💰 - Day rate
Unknown
🗓️ - Date discovered
January 16, 2025
🕒 - Project duration
More than 6 months
🏝️ - Location type
On-site
📄 - Contract type
Fixed Term
🔒 - Security clearance
Unknown
📍 - Location detailed
Texas, United States
🧠 - Skills detailed
#Data Lake #Scripting #Sqoop (Apache Sqoop) #PySpark #Jenkins #Shell Scripting #Batch #Apache Kafka #Agile #Spark (Apache Spark) #Big Data #NoSQL #Regression #Data Warehouse #RDBMS (Relational Database Management System) #HDFS (Hadoop Distributed File System) #Kafka (Apache Kafka) #Unix #Data Pipeline #Scrum #BitBucket #Data Management #Python #SQL (Structured Query Language) #Deployment #Libraries #Data Engineering #Data Access #GIT #Hadoop #Databases #Teradata #ETL (Extract, Transform, Load) #Documentation
Role description

Job Description

JOB TITLE: PySpark Data Engineer

LOCATION: Irving, TX / Jacksonville, FL / Jersey City, NJ (ONSITE / FULL TIME) (GC / USC / GC-EAD / OPT)

Desired Skills

Big Data, Hadoop, SQL

Skill: Big Data (PySpark) Tech Lead

10+ years of overall experience in Data Management, Data Lake, and Data Warehouse environments.

6+ years with Hadoop, Hive, Sqoop, SQL, and Teradata.

6+ years with PySpark (Python and Spark) and Unix.

Good to have: experience with industry-leading ETL tools.

Banking domain experience.

Key Responsibilities:

Ability to design, build, and unit test applications on the Spark framework in Python.
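
For illustration, a minimal sketch of a unit-testable PySpark transformation; the function, test, and column names here are hypothetical:

```python
# Minimal sketch of a unit-testable PySpark transformation (names are illustrative).
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F


def add_full_name(df: DataFrame) -> DataFrame:
    """Concatenate first and last name into a single column."""
    return df.withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))


def test_add_full_name():
    # A local session is enough for unit tests; no cluster required.
    spark = SparkSession.builder.master("local[1]").appName("unit-test").getOrCreate()
    df = spark.createDataFrame([("Ada", "Lovelace")], ["first_name", "last_name"])
    result = add_full_name(df).collect()
    assert result[0]["full_name"] == "Ada Lovelace"
    spark.stop()
```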

Build PySpark-based applications for both batch and streaming requirements, which requires in-depth knowledge of most of the Hadoop ecosystem and NoSQL databases as well.
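
As a rough sketch of the batch-versus-streaming distinction, the following reads the same placeholder Parquet path once as a bounded batch job and once as an unbounded stream:

```python
# Sketch contrasting a batch and a streaming read; all paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-and-streaming").getOrCreate()

# Batch: bounded read, processed once.
batch_df = spark.read.parquet("/data/landing/events")
batch_df.groupBy("event_type").count().write.mode("overwrite").parquet("/data/curated/event_counts")

# Streaming: unbounded read over the same schema, processed incrementally
# as new files arrive (file streams require an explicit schema).
stream_df = spark.readStream.schema(batch_df.schema).parquet("/data/landing/events")

query = (
    stream_df.groupBy("event_type").count()
    .writeStream.outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```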

Develop and execute data pipeline testing processes and validate business rules and policies.
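
A minimal sketch of what such a validation step might look like, assuming illustrative rules (non-negative balances, unique account IDs) and a placeholder input path:

```python
# Sketch of a business-rule validation step in a pipeline (rules and columns are illustrative).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()
df = spark.read.parquet("/data/curated/accounts")  # placeholder path

# Example rules: balances must be non-negative and account_id must be unique.
negative = df.filter(F.col("balance") < 0).count()
duplicates = df.groupBy("account_id").count().filter(F.col("count") > 1).count()

# Fail the pipeline run loudly if either rule is violated.
if negative or duplicates:
    raise ValueError(f"Validation failed: {negative} negative balances, {duplicates} duplicate ids")
```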

Optimize the performance of Spark applications in Hadoop using configurations around the Spark context, Spark SQL, DataFrames, and pair RDDs.
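
A sketch of common tuning knobs at both the Spark SQL and pair-RDD levels; the configuration values are illustrative, not recommendations:

```python
# Sketch of common tuning knobs; values are illustrative, not recommendations.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("tuned-job")
    # Shuffle parallelism for DataFrame / Spark SQL operations.
    .config("spark.sql.shuffle.partitions", "400")
    # Adaptive query execution coalesces shuffle partitions at runtime.
    .config("spark.sql.adaptive.enabled", "true")
    # Broadcast small dimension tables to avoid shuffle joins.
    .config("spark.sql.autoBroadcastJoinThreshold", 64 * 1024 * 1024)
    .getOrCreate()
)

# Pair-RDD level: reduceByKey aggregates map-side before shuffling,
# unlike groupByKey, which ships every record across the network.
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
counts = rdd.reduceByKey(lambda x, y: x + y).collect()
```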

Optimize performance for data access requirements by choosing the appropriate native Hadoop file formats (Avro, Parquet, ORC, etc.) and compression codecs.
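
For example, the same DataFrame could be written in each format with an explicit codec; paths are placeholders, and the Avro writer assumes the spark-avro package is available:

```python
# Sketch of writing one DataFrame in different formats/codecs; paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("formats").getOrCreate()
df = spark.read.json("/data/raw/events")  # placeholder source

# Columnar formats suit analytical scans; pick the codec per format.
df.write.option("compression", "snappy").parquet("/data/parquet/events")
df.write.option("compression", "zlib").orc("/data/orc/events")

# Avro is row-oriented and suits record-at-a-time pipelines
# (requires the spark-avro package on the classpath).
df.write.format("avro").save("/data/avro/events")
```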

Ability to design and build real-time applications using Apache Kafka and Spark Streaming.
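
A minimal Structured Streaming sketch reading from Kafka; the broker address and topic name are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath:

```python
# Sketch of a Structured Streaming read from Kafka; broker and topic are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
    # Kafka delivers key/value as binary; cast before parsing.
    .select(F.col("value").cast("string").alias("payload"))
)

query = (
    events.writeStream.format("console")
    .option("checkpointLocation", "/tmp/chk/transactions")
    .start()
)
query.awaitTermination()
```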

Build integrated solutions leveraging Unix shell scripting, RDBMS, Hive, HDFS, HDFS file types, and HDFS compression codecs.
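
One possible shape of such an integration, pulling a placeholder RDBMS table over JDBC and landing it as an ORC-backed Hive table (all connection details are illustrative):

```python
# Sketch of tying an RDBMS and Hive together from PySpark; connection details are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("rdbms-to-hive")
    .enableHiveSupport()  # lets Spark read/write Hive metastore tables
    .getOrCreate()
)

# Pull a table over JDBC (the driver jar must be on the classpath).
src = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/bank")
    .option("dbtable", "public.accounts")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Land it as a compressed ORC-backed Hive table on HDFS.
src.write.mode("overwrite").format("orc").saveAsTable("curated.accounts")
```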

Build data tokenization libraries and integrate with Hive & Spark for column-level obfuscation.
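
A simplified sketch of column-level obfuscation via a hashing UDF; a production tokenization library would typically call a token vault rather than hash in place:

```python
# Sketch of column-level obfuscation with a hashing UDF (stand-in for real tokenization).
import hashlib

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("tokenize").getOrCreate()


@F.udf(returnType=StringType())
def tokenize(value: str) -> str:
    # One-way SHA-256 digest as a stand-in for a reversible token.
    return hashlib.sha256(value.encode("utf-8")).hexdigest() if value else None


df = spark.createDataFrame([("123-45-6789",)], ["ssn"])
df.withColumn("ssn", tokenize("ssn")).show(truncate=False)
```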

Experience in processing large amounts of structured and unstructured data, including integrating data from multiple sources.

Create and maintain an integration and regression testing framework on Jenkins integrated with Bitbucket and/or Git repositories.
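
The Jenkins wiring itself is pipeline configuration, but the regression suite it runs might resemble this pytest sketch (fixture scope and paths are illustrative):

```python
# Sketch of a pytest-based regression check a Jenkins job could run on each push.
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    session = SparkSession.builder.master("local[2]").appName("regression").getOrCreate()
    yield session
    session.stop()


def test_row_counts_stable(spark):
    # Compare current pipeline output against a pinned baseline snapshot.
    current = spark.read.parquet("tests/output/events")    # placeholder paths
    baseline = spark.read.parquet("tests/baseline/events")
    assert current.count() == baseline.count()
    assert current.schema == baseline.schema
```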

Participate in the agile development process, and document and communicate issues and bugs relative to data standards in scrum meetings.

Work collaboratively with onsite and offshore teams.

Develop and review technical documentation for delivered artifacts.

Ability to work through complex data-driven scenarios and triage defects and production issues.

Ability to learn-unlearn-relearn concepts with an open and analytical mindset.

Participate in code release and production deployment.

Challenge and inspire team members to achieve business results in a fast-paced and quickly changing environment.