Data Engineer

Location: McLean, VA


Seeking a Data Engineer to assist in modernizing our Collateral Modeling process. This modernization involves porting SAS code to Python (Pandas and PySpark) code that will run in Amazon Web Services (AWS) Elastic MapReduce (EMR). The project is in progress, with some of the pieces already in production, and we are looking to add an engineer to accelerate the process.

Responsibilities

Translate existing SAS code into Python code. We use both Pandas DataFrames and PySpark DataFrames, so knowledge of both is required.

Verify that the Python port of the SAS code produces output equivalent to the original SAS version. This involves running both processes, comparing the output, and resolving any differences.

Leverage PySpark and AWS EMR to parallelize the process and reduce the runtime.

Optimize the Python code to reduce the runtime.

Make the Python process fault-tolerant and add checkpoints so that a subset of the process can be rerun efficiently.

Write automated tests for Python code.

Peer-review code and automated tests, and help team members with design and implementation challenges.

Qualifications

At least 3 years of experience developing production Python code

A strong understanding of Pandas and PySpark

A strong understanding of SQL

Experience with SAS

Solid understanding of software design principles

Preferred Skills

BS in Computer Science or equivalent experience

Experience with cloud computing and storage services, particularly AWS EMR

Experience writing automated unit, integration, regression, performance and acceptance tests

Strong quantitative skills (statistics, econometrics, linear algebra)