As a Site Reliability Engineer at Gauss Labs, you will ensure system reliability and performance through monitoring, automation, and incident response while collaborating with software engineering, data science, and customer-facing teams to optimize operations.
Gauss Labs is seeking a highly skilled Site Reliability Engineer to join our team in Vancouver. As an SRE at Gauss Labs, you will play a critical role in ensuring our industrial AI platform's reliability, performance, and scalability. You will be responsible for building and maintaining a robust solution that supports our growing business at customer sites. This role requires a high level of technical expertise, a collaborative mindset, and a strong desire to continuously improve systems and processes.
Responsibilities
- Monitoring and Alerting: Creating and maintaining robust monitoring systems to proactively identify and resolve issues before they impact customers. Implementing effective alerting mechanisms to ensure timely response to critical events.
- Incident Response: Participating in on-call rotations and leading incident response efforts to minimize downtime and restore service quickly.
- Automation: Developing and implementing automation tools and scripts to streamline operations, reduce manual effort, and improve efficiency.
- Capacity Planning: Forecasting resource needs, optimizing resource utilization, and ensuring customers' infrastructure can handle increasing workloads.
- Performance Optimization: Identifying and resolving performance bottlenecks, optimizing system performance, and improving response times.
- Collaboration: Partnering with software engineers, data scientists, and other teams to ensure alignment and efficient operations.
- Customer Focus: Working closely with the AI Program Manager and Technical Account Manager to understand customer issues, provide technical support, and improve customer satisfaction.
- Continuous Improvement: Driving a culture of continuous improvement by identifying opportunities to enhance system reliability, performance, and efficiency.
Basic Qualifications
- Bachelor's degree in computer science, engineering, or a related discipline
- 5+ years of industry experience as a Site Reliability Engineer
- Experience with cloud platforms (AWS, GCP, Azure), containerization technologies (Docker, Kubernetes), and observability and alerting tools (Prometheus, Grafana, Elasticsearch, Jaeger)
- Experience with scripting languages (Python, Bash)
- Working knowledge of GitHub, GitHub Actions, and CI/CD concepts
- Experience in ticket management, issue resolution, and troubleshooting
- Strong problem-solving and troubleshooting skills
- Excellent customer communication and interpersonal skills, with fluency in spoken and written English
Preferred Qualifications
- Knowledge of AI/ML infrastructure and workloads
- Knowledge of big data technologies (Kafka, Flink)
- Knowledge of database technologies (MongoDB, PostgreSQL)
Hiring Process
Application review - Phone interview - Virtual onsite interview - VP interview/Core Value interview
Top Skills
AWS
Azure
Bash
CI/CD
Docker
Elasticsearch
Flink
GCP
Git
Grafana
Jaeger
Kafka
Kubernetes
MongoDB
Postgres
Prometheus
Python