About the Job
Essential Job Duties and Responsibilities
- Partner with engineering, DevOps, and product teams to understand system requirements, communicate reliability best practices, and embed a culture of shared ownership. Strong communication, empathy, and influence are key to success.
- Lead incident response efforts, facilitate root cause analysis, and drive continuous improvements post-incident. Requires composure under pressure, clear decision-making, and the ability to bring teams together in critical moments.
- Identify opportunities to reduce manual work by building and maintaining internal tools and automation pipelines. Emphasizes problem-solving, initiative, and a continuous improvement mindset.
- Leverage Datadog to enhance system visibility, improve alerting strategies, and ensure observability across services. Requires proactive thinking, a focus on end-user impact, and the ability to coach teams on effective usage of monitoring tools.
- Develop and maintain documentation including runbooks, service readiness guides, and knowledge articles to support operational excellence. Strong written communication and a focus on clarity are essential.
- Collaborate with teams to support scaling initiatives and optimize system performance using data-informed insights. Requires strategic thinking, collaboration, and attention to long-term growth needs.
Required Skills, Knowledge and Abilities
- Solid understanding of the Software Development Lifecycle (SDLC), including source control, defect tracking, automated build systems, and production control processes
- Strong knowledge of CI/CD and DevOps principles, tools, and integrations
- Hands-on experience with Amazon Web Services (AWS), including services such as DynamoDB, CloudFormation, CloudFront, S3, Route 53, and Lambda, as well as YAML configuration
- Proficiency with containerization and serverless technologies
- Experience with infrastructure as code and container orchestration tools, particularly Terraform and Kubernetes
- Strong understanding of observability concepts, including tracing, structured logging, and metrics
- Experience using application and infrastructure monitoring tools, specifically Datadog, to ensure system health and performance
- Familiarity with designing and implementing self-healing, fault-tolerant, and autoscaling systems
- Experience working with SQL and relational databases; familiarity with MongoDB Atlas is a plus
- Proficiency with Git and source control workflows; understanding of change management best practices
- Demonstrated problem-solving and analytical skills in fast-paced environments
- Excellent verbal and written communication skills, with the ability to explain complex technical topics to both technical and non-technical stakeholders
- Self-motivated with a strong sense of ownership, accountability, and follow-through
About the Company

GoodLeap
GoodLeap is a technology company delivering best-in-class financing and software products for sustainable solutions, from solar panels and batteries to energy-efficient HVAC, heat pumps, roofing, windows, and more. Over 1 million homeowners have benefited from our simple, fast, and frictionless technology that makes the adoption of these products more affordable, accessible, and easier to understand. Thousands of professionals deploying home efficiency and solar solutions rely on GoodLeap’s proprietary, AI-powered applications and developer tools to drive more transparent customer communication, deeper business intelligence, and streamlined payment and operations. Our platform has led to more than $27 billion in financing for sustainable solutions since 2018. GoodLeap is also proud to support our award-winning nonprofit, GivePower, which is building and deploying life-saving water and clean electricity systems, changing the lives of more than 1.6 million people across Africa, Asia, and South America.