Confluent

Staff Site Reliability Engineer - Incident Management & Reliability (Remote - Canada)

Remote

2 Locations

Expert/Leader

This role focuses on driving reliability improvements in a multi-cloud environment, involving hands-on engineering, strategic program ownership, and coaching teams through incident management processes.

We’re not just building better tech. We’re rewriting how data moves and what the world can do with it. With Confluent, data doesn’t sit still. Our platform puts information in motion, streaming in near real-time so companies can react faster, build smarter, and deliver experiences as dynamic as the world around them.

It takes a certain kind of person to join this team. Those who ask hard questions, give honest feedback, and show up for each other. No egos, no solo acts. Just smart, curious humans pushing toward something bigger, together.

One Confluent. One Team. One Data Streaming Platform.

About the Role:

Confluent Cloud processes millions of events per second across AWS, GCP, and Azure. When incidents happen in a multi-cloud streaming platform, they happen at scale—data in motion, exactly-once semantics, and cascading failure modes that require deep systems thinking. We need an expert-level engineer who can drive proactive reliability improvements that prevent these incidents before they occur.

This role combines hands-on technical work with strategic program ownership. You'll spend roughly 75% of your time on engineering: building automation, improving tooling, analyzing systemic failure patterns, and designing reliability improvements. The remaining 25% is teaching and coordination: coaching teams through post-mortems, training incident commanders, and evolving our incident response practices.

You'll be part of a global team with follow-the-sun coverage and clean handoffs that keep everyone working sustainable hours. This role sits within Cloud Architecture and Reliability - Supportability, a horizontal team that owns reliability standards and tooling across engineering. You're the person who makes us need incident management less.

What You Will Do:

Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
Own standards, practices, and continuous improvement of incident response across engineering
Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity
Develop and deliver training programs; coach teams through post-mortems
Partner with engineering leaders to elevate reliability practices org-wide

What You Will Bring:

10+ years of relevant experience in SRE, incident management, or reliability engineering
Cloud experience with at least one of AWS, GCP, or Azure (we run all three)
Experience navigating reliability/incident programs at organizations with 500+ engineers
Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)
Strong understanding of distributed systems and failure modes at scale
Deep experience with observability: metrics, logging, tracing
Kubernetes and container orchestration experience
Understanding of CI/CD pipelines and release processes
Strong written communication (design docs, runbooks, post-mortems)
Experience driving org-wide process and cultural changes
Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems

Ready to build what's next? Let’s get in motion.

Come As You Are

Belonging isn’t a perk here. It’s the baseline. We work across time zones and backgrounds, knowing the best ideas come from different perspectives. And we make space for everyone to lead, grow, and challenge what’s possible.

We’re proud to be an equal opportunity workplace. Employment decisions are based on job-related criteria, without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other classification protected by law.

Top Skills: AWS, Azure, CI/CD, Confluence, GCP, Jira, Kafka, Kubernetes, PagerDuty, Rootly, Slack
