Data Engineer (Python, Pandas, PySpark, SAS)
Location: McLean, VA
Seeking a Data Engineer to help modernize our Collateral Modeling process. The modernization involves porting SAS code to Python (Pandas and PySpark) code that will run on Amazon Web Services (AWS) Elastic MapReduce (EMR). The project is in progress, with some pieces already in production, and we are looking to add an engineer to accelerate the work.
Responsibilities:
• Translate existing SAS code into Python code. We use both Pandas and PySpark DataFrames, so knowledge of both is required.
• Verify that the Python translation produces output equivalent to the original SAS code. This involves running both processes, comparing the output, and resolving any differences.
• Leverage PySpark and AWS EMR to parallelize the process and reduce the runtime.
• Optimize the Python code to reduce the runtime.
• Enhance the Python process to be fault-tolerant and contain checkpoints to make rerunning a subset of the process more efficient.
• Write automated tests for Python code.
• Peer-review code and automated tests, and help team members with design and implementation challenges.
Qualifications:
• At least 3 years of experience developing production Python code
• A strong understanding of Pandas and PySpark
• A strong understanding of SQL
• Experience with SAS
• Solid understanding of software design principles
• BS in Computer Science or equivalent experience
• Experience with cloud computing and storage services, particularly AWS EMR
• Experience writing automated unit, integration, regression, performance and acceptance tests
• Strong quantitative skills (statistics, econometrics, linear algebra)