Site Reliability Engineer
November 19, 2016

Are you excited to be on the front line of securing Microsoft and the customers who entrust their security to Microsoft? Are you passionate about enabling customers and team members to deliver reliable, high-performance solutions at scale with agility? Do you enjoy optimizing reliability, performance, and efficiency, and solving complex problems with software and systems solutions?

Microsoft’s C+E Security Division builds and operates a diverse set of services across the stack, from foundational dial-tone platforms in Azure to the systems that back offerings like Office 365 ATP and Azure IP. We work as one team of development, site reliability engineering (SRE), program management, security, and ML specialists to protect against, detect, and remediate adapting and evolving threats. Speed and agility done correctly enable higher quality, so anything that eases the path to shipping secure, reliable, and effective features is critical.

Our mission in the C+E Security Division’s SRE team is to optimize and protect the software and systems behind our internal and customer offerings, keeping an ever-watchful eye on their reliability, latency, performance, and capacity. SRE is responsible for bulletproofing, reinforcing, ruggedizing, and generally improving the quality of service and innovation throughput of the services under our purview. The team is staffed with engineers who have an affinity for quality of service, improving operability, and driving high reliability. The team owns its services in production and drives reliability and performance at massive scale by mastering the full depth and breadth of the stack. As an engineer on the SRE team, you will have full access to the technology stack and be responsible for hardening, scaling, monitoring, and ensuring world-class uptime.


• Contribute to designing, writing, and delivering software that improves the reliability, scalability, latency, and efficiency of your services.

• Solve problems relating to mission-critical services and create solutions that prevent recurrence, with the goal of automating the response to all non-exceptional service conditions.

• Influence and contribute to new designs, architectures, standards, and methods for large-scale distributed systems.

• Engage in service capacity planning, demand forecasting, software performance analysis, and system tuning.

• Perform periodic on-call duties using a follow-the-sun model (on an as-needed basis).

Skills & Qualifications:

• BS degree in Computer Science, Electrical & Computer Engineering, or Mathematics, or equivalent experience

• 2+ years of experience and outstanding coding skills in C, C++, C#, Java, Python, or similar languages

• 1+ years of experience in testing, deploying, and supporting services

• Experience in SDLC and Agile projects

• A passion for building and participating in highly effective teams and development processes

• Expertise in problem solving and in analyzing distributed systems and critical production service environments

• Debugging, testing/validation, and analytics/SQL skills

• Big data experience preferred (COSMOS, Hadoop)

• Fundamental understanding of the OSI model/stack

• Firm sense of accountability, ownership, and initiative for the end-to-end project lifecycle, with solid project management skills

• Strong communication and collaboration skills to work with people from a variety of technical backgrounds

• Experience defining and measuring service key performance indicators

• Statistics experience and a bias for measurement and data-driven improvements

• The ability to drive Live Site issues and repair items to resolution

A little about us:
Microsoft offers training and employment opportunities to help you turn your military experience and skills into a civilian technology career.

