Site Reliability Engineer
Posted: November 07, 2017
Reference ID: 115186BR
Are you passionate about technology? Do you love building new things? Do you want to develop the future of IBM's Cloud offerings? If you answered YES, then we have the right opportunity for you!
The shift toward the consumption of IT as a service, i.e., the cloud, is one of the most important changes to happen to our industry in decades. At IBM, we are driven to shift our technology to an as-a-service model and to help our clients transform themselves to take full advantage of the cloud. With industry leadership in analytics, security, commerce, and cognitive computing and with unmatched hardware and software design and industrial research capabilities, no other company is as well positioned to address the full opportunity of cloud computing.
The Next Generation Cloud Network Engineering (NextGenCloud) team is dedicated to ensuring that the IBM Cloud is at the forefront of cloud technology, from data center design to network architecture to storage and compute clusters to flexible infrastructure services. While our focus is on Network as a Service (NaaS), we are part of the team building IBM's next generation cloud platform to deliver performance and predictability for our customers' most demanding workloads, at global scale and with leadership efficiency, resiliency, and security. It is an exciting time, and as a team we are driven by this incredible opportunity to thrill our clients. We are looking for a Site Reliability Engineer who innovates and shares our passion for winning in the cloud marketplace to join our team.
This position is for a mid- to senior-level Reliability Engineer with at least 7 years of industry experience maintaining or assisting in maintaining site reliability. In this role, you will engage in all aspects of the lifecycle of IBM's NaaS, from idea to architecture and through deployment, operation, and improvement, ensuring that our clients have the most reliable and performant experience possible.
As a member of the Site Reliability team, you will have the following key responsibilities:
- Troubleshoot and debug software delivered by various development teams within NextGenCloud.
- Provide detailed trouble reports back to the development teams including automated methods to reproduce any defects.
- Assist in troubleshooting and maintaining pre-production CI/CD systems in support of deployment.
- Work with the team to ensure automation and the highest level of determinism possible in the installation and configuration of new systems (software and hardware).
- Document automation and the interaction of software and systems as necessary to enable others.
- Maintain services post-deployment through data collection and monitoring, ensuring the overall health of the services provided.
- Support other teams in resolving issues when on call.
- Participate in retrospectives.
- Engage and encourage collaboration and a focus on issue resolution.
- Engage in meaningful planning to improve software, systems, and processes.
This opportunity is for someone in the continental United States.
Job Requirements
- 7+ years' experience with systems and/or software engineering.
- 2+ years' experience with software development.
- 2+ years' experience with systems engineering.
- 2+ years' experience troubleshooting software.
- Experience in a DevOps environment.
- Experience with Git.
- Experience with OpenStack or a similar proprietary cloud such as Azure or AWS.
- Familiarity with CI/CD systems and their pipelines; experience with Zuul or Jenkins a plus.
- Familiarity with containers and HA clusters; experience with Docker and Kubernetes a plus.
- Excellent knowledge of TCP/IP networking.
- Strong background in network engineering a plus.
- Hands-on data center operational experience a big plus.
- Proven ability to collaborate and work well within a team.
- Ability to communicate effectively both verbally and in writing.