Are you passionate about applying your skills to continuously improve a large-scale service? Do you enjoy working in a diverse environment where individuals collaborate to accomplish projects that will have a measurable impact on service reliability? If so, the SPARC Site Reliability Engineering team within OneDrive Consumer is looking for you. We are seeking experienced individuals passionate about data analysis, with emphasis on service telemetry and reliability metrics, capable of delivering live-site and security engineering projects and running large-scale internet services.
As a Service Engineer on the OneDrive Consumer team, you’ll be responsible for broad online service engineering areas including, but not limited to, Incident, Capacity and Problem Engineering, and Compliance and Security. Such responsibilities include specifying, designing, coding, and delivering solutions for managing the service. The position requires the flexibility and aptitude to traverse, conceptually and technically, intra- and inter-service boundaries down to protocol-level details while keeping the customer experience as the focal point. This position provides an opportunity to learn how large services work in Microsoft infrastructure initiatives including Azure and Autopilot, as well as the opportunity to influence service design, drive automation, and create modern, mature telemetry systems. The ideal candidate should have experience with large-scale services and their data processing capabilities. Experience with Cosmos or other similar MapReduce systems and/or Azure SQL is ideal. Experience in a development role is required, as is aptitude in multiple programming languages such as Java and/or C#.
Responsibilities can include:
• Manage availability, latency, scalability and efficiency of services by engineering reliability into software and systems.
• Focus on operability of the service including security, privacy, resiliency, business continuity and disaster recovery.
• Analyze Azure SQL performance and capacity efficiency and drive improvements therein.
• Maintain an unwavering focus on Quality of Service.
• Execute with high accountability for schedule and quality, and bring a passion to improve, evolve, and revolutionize the systems you manage.
• Respond to and resolve emergent service problems; debug software and build automation to prevent problem recurrence.
• Participate in the incident management lifecycle, including escalation, communication, debugging, resolution, and problem management.
• Lead service teams through live site reviews and corrective action plans.
• Participate in 24x7 on-call on a rotational basis.
• Capture and analyze performance, utilization, and health data for trending, server capacity planning, and threshold testing.
• Develop a data-driven approach to prepare for feature growth.
• Collaborate closely with partner teams where engineering and business dependencies exist.
Qualifications:
• 3+ years of software industry experience running a service, preferably in the consumer Internet space, with service reliability exceeding 99.9%.
• 3+ years of experience managing and engineering solutions using SQL Server or Azure SQL at scale and in Highly Available (HA) environments, with a preferred focus on performance and capacity analysis.
• 1+ years of software engineering with an emphasis on system level automation.
• 1+ years of experience with a service running in the Microsoft Cloud or similar medium-scale services.
• Great oral and written communication skills.
• Understanding of large-scale online service network architectures including load balancing, GTM, ACLs, routing, network captures, etc.
• Strong working knowledge of Windows Server 2012 and IIS, including Active Directory, TCP/IP protocols, and security hardening procedures.
• Experience in data structures, algorithms and complexity analysis.
• Experience with Microsoft internal tools and processes and with deployment technologies (Autopilot, Azure) is a plus.
• BA/BS/MS degree in CS or a related discipline, or equivalent work experience.
Basic Qualifications: 1+ years of programming, software engineering, service engineering, software development, and/or site reliability engineering experience.
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request to firstname.lastname@example.org.