8076 - Tool Implementation - Engineer

Location: Reston, VA


We are looking for a Senior Site Reliability Engineer (SRE) to help change the way organizations monitor their applications and infrastructure on-prem and in the Cloud. The ideal candidate will be a full-stack Software Engineer and a Systems Engineer with a passion for making tools and services fast and reliable. Come join our team if you take pride in solving hard engineering problems, automating everything, and making systems scale.

We are the Monitoring team at Freddie Mac, which provides an enterprise-wide capability to monitor fault, capacity, availability, and performance for infrastructure and applications. We currently use Nimsoft, Spectrum, SiteScope, Icinga, OpNet/RiverBed, and Splunk. We are adopting the SRE model. Our push is to correlate infrastructure and application alerting into events and to provide comprehensive dashboards consumable at all levels of the organization, with a view to monitoring our Cloud services and containerized solutions.

Responsibilities include:
• Support the existing monitoring portal, the scripts that reconcile the environment against the CMDB, and other tools. Work toward enhancing these capabilities and identify additional monitoring-related opportunities throughout the organization.
• Lead the design of automation solutions that include considerations for infrastructure, operating systems, platforms, and services. Collaborate with all other technical teams to complete the solutions.
• Lead development of the Java- and Python-based monitoring self-service portal.
• Create reusable libraries and patterns in an open-source contribution model.
• Design and develop web applications, backend services, build and test automation, and prototyping.
• Develop or integrate monitoring dashboards that can be used by the network operations center, the CIO, and everyone in between.
• Support other engineers, perform pair programming and peer code reviews, and help ensure the team is growing.
• Take ownership of the quality, reliability, and performance of our tools and services.
• Participate in an on-call production support rotation within a team of six engineers.
• Work closely with our business stakeholders to understand their application/platform monitoring needs and configure monitoring to meet those requirements, which is the team's core function.

Qualifications include:
• Automation experience with Chef, Puppet, Salt, Ansible, or similar technologies.
• Expert in Java and Python programming.
• Excellent scripting skills in shell and Perl, and optionally in BladeLogic.
• Hands-on experience with Linux and Windows administration.
• Experience with database management and programming.
• Understanding of virtualization and networking is a big plus.
• Experience in configuration of and integration with external systems and third-party tools such as ServiceNow, Nimsoft, Spectrum, SiteScope, Icinga, OpNet/RiverBed, and Splunk.
• Strong organizational and communication skills.

Java: High, 5 - 7 Years