Looking for a company that inspires passion, courage and imagination, where you can be part of the team shaping the future of global commerce? Want to shape how millions of people buy, sell, connect, and share around the world? If you’re interested in joining a purpose driven community that is dedicated to creating an ambitious and inclusive workplace, join eBay – a company you can be proud to be a part of.
Manager, Hardware Fleet Reliability
At eBay, we are starting a new chapter in our iconic internet history as the largest online marketplace in the world. We have more than a billion listings, 80% of them for new items, in over 400 markets around the world. This collection of services runs on a significant server and storage infrastructure, and the hardware engineering team is chartered to drive the reliability, efficiency and performance of this layer.
We are looking for a leader to drive the hardware and firmware reliability of servers across the fleet once they are deployed, and to help eBay adopt best practices for hardware health monitoring. This person will coordinate with server vendors, with internal eBay data center and platform teams, and with monitoring and remediation systems.
This role owns the fleet reliability discipline including topics that may not be described below. The responsibilities of the team are below.
Own and drive to resolution L2/L3 fleet issue escalations.
Organize and arrange training of L1 data center technicians.
Work with cloud infrastructure team to implement correct hardware health monitoring and remediation states in our data center automation system.
Be the reviewer for and a user of an automated regression system to do full stack testing of various hardware/firmware/OS/key applications.
Provide scripts and processes to enhance intake, server verification, burn in and decommissioning of hyperscale servers.
Participate in the optimization of technology refresh program, including DC migrations.
Triage L2/L3 HW incidents and epidemics as they occur so that they are addressed and resolved quickly.
Lead by example with root cause analysis, experimentation and resolution of key hardware reliability issues.
Work with supply chain to evaluate and track the quality of servers and components, and publish quality results to customer teams in the company and to vendors in QBR/EBR.
DESIRED SKILLS AND EXPERIENCE
At least 10 years of system and/or hardware engineering of server and storage systems, of which 3-5 years need to be in a scale-out environment. Experience with a large server fleet and with process automation is highly desired.
At least 3 years of being a manager or a team lead
Must have deep knowledge of CPUs, servers, memory and disks, as well as BIOS, BMC and Linux. This deep knowledge may be best demonstrated by having led the development of a hardware, driver or firmware component or project.
Expertise in testing and debug of various aspects of server hardware and firmware.
Working familiarity with some of the following areas: storage subsystem hardware, networking systems, power supplies & distribution, mechanical / thermal testing.
Working familiarity with Linux OS, hardware test utilities and shell and/or Python scripting.
Demonstrated technical and people leadership abilities with good communication skills
Bonus: First-hand exposure to platforms for compute or storage services.
Bonus: Performance testing of compute servers, storage subsystems or networking
Bonus: Exposure to statistical reliability testing of hardware systems and components
BS EE or CS with continued formal or informal education
Position ideally based in San Jose, CA with a small amount of travel required.