Manager Hardware Fleet Reliability
Location:
San Jose, California
Posted:
November 21, 2017
Reference:
R0020651

Looking for a company that inspires passion, courage and imagination, where you can be part of the team shaping the future of global commerce? Want to shape how millions of people buy, sell, connect, and share around the world? If you’re interested in joining a purpose driven community that is dedicated to creating an ambitious and inclusive workplace, join eBay – a company you can be proud to be a part of.

Job Description

OPPORTUNITY

At eBay, we are starting a new chapter in our iconic internet history as the largest online marketplace in the world. We have more than a billion listings, 80% of them for new items, in over 400 markets around the world. This collection of services runs on a significant server and storage infrastructure, and the hardware engineering team is chartered to drive the reliability, efficiency and performance of this layer.

We are looking for a leader to drive the hardware and firmware reliability of the server fleet once deployed. This person will also help eBay adopt best practices for monitoring hardware health, and will coordinate with server vendors, with internal eBay data center and platform teams, and with monitoring and remediation systems.


KEY RESPONSIBILITIES

  • This role owns the fleet reliability discipline, including topics that may not be described below. The team's responsibilities are:

    • Own and drive to resolution L2/L3 fleet issue escalations.

    • Organize and arrange training of L1 data center technicians.

    • Work with the cloud infrastructure team to implement correct hardware health monitoring and remediation states in our data center automation system.

    • Review and use an automated regression system for full-stack testing of various hardware, firmware, OS and key application combinations.

    • Provide scripts and processes to enhance intake, server verification, burn-in and decommissioning of hyperscale servers.

    • Participate in the optimization of the technology refresh program, including DC migrations.

    • Triage L2/L3 HW incidents and epidemics as they occur so that they are addressed and resolved quickly.

  • Lead by example with root cause analysis, experimentation and resolution of key hardware reliability issues.

  • Work with supply chain to evaluate and track the quality of servers and components, and publish quality results to customer teams in the company and to vendors in QBR/EBR.


DESIRED SKILLS AND EXPERIENCE

  • At least 10 years of system and/or hardware engineering of server and storage systems, of which 3-5 years must be in a scale-out environment. Experience with a large server fleet and with process automation is highly desired.

  • At least 3 years as a manager or team lead.

  • Must have deep knowledge of CPUs, servers, memory and disks, as well as BIOS, BMC and Linux. This deep knowledge may be best demonstrated by having led the development of a hardware, driver or firmware component or project.

  • Expertise in testing and debugging various aspects of server hardware and firmware.

  • Working familiarity with some of the following areas: storage subsystem hardware, networking systems, power supplies and distribution, mechanical/thermal testing.

  • Working familiarity with the Linux OS, hardware test utilities, and shell and/or Python scripting.

  • Demonstrated technical and people leadership abilities with good communication skills.

  • Bonus: First-hand exposure to platforms for compute or storage services.

  • Bonus: Performance testing of compute servers, storage subsystems or networking.

  • Bonus: Exposure to statistical reliability testing of hardware systems and components.

  • BS in EE or CS with continued formal or informal education.

  • Position ideally based in San Jose, CA, with a small amount of travel required.

A little about us:
eBay is a global commerce platform and payments leader connecting millions of buyers and sellers.