Data Engineer (Python/PySpark/SAS) - Open to C2C
Location: McLean, VA
Responsibilities:
• Translate existing SAS code into Python. We use both pandas DataFrames and PySpark DataFrames, so knowledge of both is required.
• Verify that the Python translation produces results equivalent to the original SAS code. This involves running both processes, comparing the output, and resolving any differences.
• Leverage PySpark and AWS EMR to parallelize the process and reduce the runtime.
• Optimize the Python code to reduce the runtime.
• Make the Python process fault-tolerant and add checkpoints so that a subset of the process can be rerun efficiently.
• Write automated tests for Python code.
• Peer-review code and automated tests, and help team members with design and implementation challenges.
Qualifications:
• At least 3 years of experience developing production Python code
• A strong understanding of pandas and PySpark
• A strong understanding of SQL
• Experience with SAS
• Solid understanding of software design principles
• BS in Computer Science or equivalent experience
• Experience with cloud computing and storage services, particularly AWS EMR
• Experience writing automated unit, integration, regression, performance, and acceptance tests
• Strong quantitative skills (statistics, econometrics, linear algebra)
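
To give a sense of the SAS-to-Python verification work described above, here is a minimal sketch of an output comparison. It assumes both outputs have already been loaded into pandas DataFrames with identical column names and a shared key. The function name `compare_outputs`, the key column `id`, the tolerance, and the sample values are all illustrative assumptions, not the team's actual code or data.

```python
import numpy as np
import pandas as pd


def compare_outputs(sas_df, py_df, keys, rtol=1e-9):
    """Compare two outputs row by row on `keys` (hypothetical helper).

    Returns (unmatched, diffs):
      unmatched - rows whose key appears in only one of the two outputs
      diffs     - dict mapping column name to the rows where values disagree
    Assumes both frames have the same non-key columns.
    """
    keys = list(keys)
    # Outer join so rows missing from either side are kept and flagged.
    merged = sas_df.merge(
        py_df, on=keys, how="outer", suffixes=("_sas", "_py"), indicator=True
    )
    unmatched = merged.loc[merged["_merge"] != "both", keys + ["_merge"]]
    both = merged[merged["_merge"] == "both"]

    diffs = {}
    for col in [c for c in sas_df.columns if c not in keys]:
        a, b = both[f"{col}_sas"], both[f"{col}_py"]
        if pd.api.types.is_numeric_dtype(a) and pd.api.types.is_numeric_dtype(b):
            # Floating-point results rarely match bit for bit, so compare
            # numeric columns within a relative tolerance.
            same = np.isclose(a.to_numpy(), b.to_numpy(), rtol=rtol, equal_nan=True)
        else:
            # Non-numeric columns: exact match, treating two missing values as equal.
            same = ((a == b) | (a.isna() & b.isna())).to_numpy()
        bad = both.loc[~same, keys + [f"{col}_sas", f"{col}_py"]]
        if not bad.empty:
            diffs[col] = bad
    return unmatched, diffs


# Made-up sample data standing in for the two process outputs.
sas = pd.DataFrame({"id": [1, 2, 3], "amt": [10.0, 20.0, 30.0]})
py = pd.DataFrame({"id": [1, 2, 4], "amt": [10.0, 20.5, 40.0]})

unmatched, diffs = compare_outputs(sas, py, keys=["id"])
print(unmatched)
print(diffs)
```

In this sketch, the outer join with `indicator=True` surfaces rows that exist in only one output, while the per-column check isolates value differences. Resolving each difference, such as a rounding convention or missing-value handling that differs between SAS and Python, is the larger part of the task.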