Serve Robotics

Director of Systems Reliability & Field Resilience

Los Angeles, CA, US; San Francisco, CA, US

Hybrid, Remote
Full time role
Senior Level, Director / Executive

9 days ago

About the Job

At Serve Robotics, we’re reimagining how things move in cities. Our personable sidewalk robot is our vision for the future. It’s designed to take deliveries away from congested streets, make deliveries available to more people, and benefit local businesses.

The Serve fleet has been making commercial deliveries in Los Angeles, delighting merchants, customers, and pedestrians along the way. We’re looking for talented individuals who will grow robotic deliveries from surprising novelty to efficient ubiquity.

Who We Are

We are tech industry veterans in software, hardware, and design who are pooling our skills to build the future we want to live in. We are solving real-world problems leveraging robotics, machine learning and computer vision, among other disciplines, with a mindful eye towards the end-to-end user experience. Our team is agile, diverse, and driven. We believe that the best way to solve complicated dynamic problems is collaboratively and respectfully.

Serve Robotics is seeking a Director of Systems Reliability & Field Resilience to continuously improve end-to-end reliability across our robotic delivery infrastructure. In this role, you and your team will proactively identify, triage, and resolve complex, cross-domain issues that affect delivery service quality and efficiency, and will work cross-functionally to build monitoring, alerting, automation, and resiliency into our platform.

In this role, you will provide leadership and direction to your team while also contributing directly to defining, building, and deploying solutions. You will work closely with engineering, product, and operations to prioritize work, and you will hire, allocate resources, and support your team as it delivers capabilities from concept to production.

The Serve Robotics delivery platform spans a wide range of technologies: from the cloud and networking infrastructure that powers delivery matching, to front-end tools for robot fleet supervisors and field agents, to on-robot embedded and autonomous systems. All of these must work seamlessly together to support our daily delivery growth and economics. You will lead a team of experts with backgrounds in SRE, DevOps, and cloud infrastructure, and partner across the entire engineering organization to ensure a robust and resilient delivery infrastructure.

The ideal candidate has a strong track record of hands-on leadership of small, highly technical software engineering teams, including hiring, mentoring, and coaching senior-level engineers and building a high-performing, collaborative team. You are a highly capable technical generalist who is comfortable working across all components of a complex system and partnering with domain experts and functional teams to identify issues, perform detailed root cause analysis, and develop short- and long-term solutions. Delivering those solutions will often require close technical collaboration between your team and other engineering teams.

Responsibilities

Full-Stack Troubleshooting & System Deep Dives: Become the go-to expert for identifying root causes of service issues—whether they're in cloud APIs, robot hardware, network layers, or operational workflows—and coordinate with the respective owning teams to resolve and prevent them.
Build and Lead a Global Systems Reliability Team: Hire, mentor, and grow a multidisciplinary team of high-context generalists who can investigate system-wide failures, document their learnings, and drive improvements across organizational boundaries.
Own the On-Call & Incident Management Process: Take over and evolve the company's on-call process into a mature, well-documented, and inspectable system. Define SLAs, escalation policies, and a best-in-class paging infrastructure that aligns with our service goals.
Establish and Maintain a Knowledge Base: Ensure on-call responders have access to actionable documentation, playbooks, and troubleshooting guides. Make knowledge capture a core part of incident response.
Reliability Analytics & Intuition Building: Use incident and operational data to build a deep intuition about where our systems are most fragile. Create predictive frameworks and reliability metrics that help the organization stay ahead of failures.
Service Health & Performance Dashboards: Build and maintain dashboards that monitor the health of end-to-end services—not just software, but everything that supports customer delivery. Highlight systemic issues, performance regressions, and areas needing investment.
Cross-Functional Collaboration: Work closely with engineering, infrastructure, hardware, field ops, customer support, and leadership to align on reliability priorities and drive systemic improvement efforts.

Qualifications

8+ years of experience in a technical engineering or operations role, with at least 3 years in a leadership position. A background in both software engineering and IT/DevOps is a plus.
Deep experience with complex distributed systems, infrastructure, and system debugging, triage, and root cause analysis. Familiarity with observability tools such as Datadog, Grafana, Prometheus, ELK, or similar is a plus.
Strong understanding of hardware/software integration, particularly in cloud-connected device infrastructure, including robotics, consumer electronics, and embedded systems.
Proven success leading incident response or SRE-style functions and managing on-call teams.
Ability to drive organization-wide improvements by building trusted cross-functional relationships and technical collaboration across teams.
Strong data and dashboarding skills; can translate operational data into clear insights and action plans.
Excellent communication and organizational skills; comfortable writing high-quality docs and leading blameless postmortems.

What Makes You Stand Out

Relentless Drive for Quality: You set high standards for code and system design, continually raising the bar for your team and the organization.
Strong Cross-Functional Communicator: You effectively collaborate with product, operations, and executive teams to ensure technology and business goals are aligned.
Strategic Vision Paired with Execution: You think beyond immediate tasks to chart a roadmap that ensures platform longevity and innovation. You excel at driving changes that boost overall team cohesion and performance.
Passion for Innovation: You bring curiosity and enthusiasm for solving complex challenges in delivery and fleet management, keeping up with the latest trends and technologies in the space.

About the Company

Serve Robotics

Los Angeles, CA, USA

101-250 employees

Why deliver a 2-pound burrito in a 2-ton car? Serve is the future of sustainable, self-driving delivery. Our zero-emissions robots are designed to serve people in public spaces, starting with food delivery. We partner with platforms and merchants to help local businesses reach more customers.
