Business Analyst IV - Alert Management & Observability Standards Lead

Astreya

Posted 10 Hours Ago

Remote

Hiring Remotely in CA

Senior level

The Business Analyst IV leads alert management and observability standards, ensuring effective alert governance and operational reliability across IT operations by defining standards, evaluating alerts, and maintaining response instructions.

The summary above was generated by AI

What this Job Entails:

The Business Analyst IV will provide solutions that help attain business outcomes. The Alert Management & Observability Standards Lead is responsible for rationalizing and governing all system alerts to ensure they align with department priorities, operational coverage models, and service reliability goals. This role defines alerting standards, reviews and approves alerts before they are routed to the 24x7 Eyes-on-Glass Operations team, and establishes a scalable approach to cataloging alert response instructions (runbooks/playbooks) so responders can take consistent, high-quality actions.

This position operates at the intersection of the IT Operations Command Center (OCC), engineering/application teams, platform/monitoring tool owners, and service owners, ensuring alerts are actionable, prioritized, and paired with clear response guidance.

Your Roles and Responsibilities:

1) Alert Rationalization & Prioritization (Core)

Establish and maintain a department-wide alert rationalization framework that evaluates alerts for:

Business/service criticality and operational priority
Actionability (clear operator action available)
Signal-to-noise (duplicate/low-value alerts removed or suppressed)
Ownership and escalation paths

Perform regular alert reviews (new + existing) to ensure alert quality, correct routing, and alignment with operational coverage.

Lead continuous improvement efforts to reduce alert fatigue while preserving detection of true incidents and high-impact degradation.

2) Standards, Policies, and Guardrails

Define and enforce alerting standards including:

Severity definitions and thresholds
Required metadata (service, CI, owner, runbook link, escalation)
Naming conventions and tagging taxonomy
Routing rules and “when to page vs. when to ticket”

Create a standardized Alert Design Checklist and approval workflow (e.g., “Definition of Done” for alert onboarding).

Partner with tool/platform owners to ensure standards are embedded in monitoring tooling (templates, required fields, automated validation).

3) Routing Decisions to 24x7 Eyes-on-Glass

Act as gatekeeper (or lead the governance process) for determining which alerts should:

Go to 24x7 Eyes-on-Glass for immediate triage
Route to on-call engineering directly
Create tickets for business-hours handling
Be suppressed, aggregated, or converted to dashboards/health indicators

Ensure routing aligns with:

Operational responsibilities and skills of the Eyes-on-Glass team
Department priorities (e.g., safety, reliability, customer impact)
Service ownership and support models

4) Runbook / Response Instruction Cataloging (Knowledge System)

Establish a consistent approach to cataloging response instructions for every actionable alert, including:

“What does this alert mean?” (symptoms + impact)
“What to check first” (triage steps)
“What actions to take” (standard remediation)
“When to escalate and to whom” (clear escalation triggers)
Links to dashboards, logs, SOPs, and known issues

Own the runbook template and ensure runbooks are versioned, maintained, and reviewed on a defined cadence.

Partner with service owners to ensure runbooks stay current as systems change.

5) Reporting & Operational Outcomes

Define and publish KPIs that demonstrate alerting health and operational performance, such as:

Alert volume trends by service and severity
Percentage of alerts with runbooks and valid ownership
Alert “actionability rate” and noise reduction
Mean time to acknowledge / triage effectiveness (as applicable)

Facilitate governance forums (weekly/monthly) with service owners and engineering leads to review alert quality and backlog.

6) Cross-Functional Enablement

Coach service teams on best practices: SLIs/SLOs, alert thresholds, dependency monitoring, and incident correlation.

Drive adoption of observability patterns (golden signals, health indicators, multi-signal alerting).

Support major incident learning by feeding post-incident insights back into improved alerts and runbooks.

7) Deliverables in the First 45 Days

Alerting standards (severity model, metadata, naming, routing policy) published and adopted

Intake and approval workflow established for new/changed alerts

Top 20 noisy services rationalized (dedupe/suppress/threshold tuning) with measurable noise reduction

Runbook template launched; minimum runbook coverage targets set (e.g., 80% of paged alerts)

Central alert catalog created (ownership + routing + runbook link + last review date)

Required Qualifications/Skills:

5+ years in IT Operations, SRE, Observability, Monitoring Engineering, or Incident Management

Demonstrated success reducing noise and improving actionability across enterprise alerting ecosystems

Experience with common monitoring/observability tools (e.g., Splunk, AppDynamics, Dynatrace, Datadog, Prometheus/Grafana, Azure Monitor, CloudWatch, ServiceNow Event Management, or similar)

Strong understanding of:

Incident response workflows and operational coverage models (24x7 vs. business hours)
CMDB/service ownership concepts and dependency mapping
Standard operating procedures/runbooks and knowledge management

Excellent stakeholder management and ability to drive standards across teams

Preferred Qualifications:

Experience designing or operating an Operations Command Center / NOC / SOC-style “eyes-on-glass” model
Familiarity with ITIL Event Management, SRE principles, and service reliability practices
Experience with automation for alert enrichment, correlation, and routing (e.g., event correlation, deduplication, noise suppression)
Background in governance frameworks and operating rhythm design (cadences, controls, compliance traceability)

Physical Demand & Work Environment:

Must have the ability to perform office-related tasks which may include prolonged sitting or standing
Must have the ability to move from place to place within an office environment
Must be able to use a computer
Must have the ability to communicate effectively
Some positions may require occasional repetitive motion or movements of the wrists, hands, and/or fingers

What can Astreya offer you?

Employment in the fast-growing IT space providing you with a variety of career options
Opportunity to work with some of the biggest firms in the world as part of the Astreya delivery network
Introduction to new ways of working and awesome technologies
Career paths to help you establish where you want to go
Focus on internal promotion and internal mobility - we love to build teams from within
Free 24/7 accessible Professional Development through LinkedIn Learning and other online courses to give you opportunities to upskill at your own pace
Education Assistance
Dedicated management to provide you with on-point leadership and care
Numerous on-the-job perks
Market-competitive compensation and insurance, health, and wellness benefits

Salary Range

$98,040.00 - $154,800.00 USD (Salary)

Please note that the salary information provided herein is base pay only (gross); it does not include other forms of compensation which may or may not apply to this specific position, namely, performance-based bonuses, benefits-related payments, or other general incentives - none of which are guaranteed, may be subject to specific eligibility requirements, and are wholly within the discretion of Astreya to remit.
Further, the salary information noted above is a range that consists of a minimum and maximum rate of pay for this specific position. Where an applicant or employee is placed within this range will depend on objective, documented, work-related considerations such as education, experience, certifications, licenses, and preferred qualifications, among other factors.

Astreya offers comprehensive benefits to all Regular, Full-Time Employees, including:

Medical provided through UHC (PPO, HSA, and Surest options) or, for California employees only, through Kaiser (HMO option only)
Dental provided through UHC
Nationwide Vision provided by UHC
Flexible Spending Account for Health & Dependent Care
Pre-Tax Account for Commuter Benefit/Parking & Transit (location-specific)
Continuing Education and Professional Development via various integrated platforms, e.g. Udemy and Coursera
Corporate Wellness Program provided by Goomi Group
Employee Assistance Program
Wellness Days
401k Plan
Basic and Supplemental Life Insurance
Short Term & Long Term Disability
Critical Illness, Critical Hospital, and Voluntary Accident Insurance
Tuition Reimbursement (available 6 months after start date, capped)
Paid Time Off (accrued and prorated, maximum of 120 hours annually)
Paid Holidays
Any other statutory leaves, paid time, or other ancillary benefits required under state and federal law
