McCain Foods

Sr Eng Manager, SRE & Observability

Posted 11 Days Ago

Toronto, ON

Senior level

The Sr Engineering Manager, SRE & Observability will lead the design, implementation, and monitoring of secure, fault-tolerant SRE and Observability infrastructure. Responsibilities include developing strategies, collaborating with teams, mentoring engineers, and driving operational excellence through advanced monitoring and automation techniques.

The summary above was generated by AI

Position Title: Sr Eng Manager, SRE & Observability
Position Type: Regular - Full-Time
Position Location: Toronto HQ
Requisition ID: 31044
At McCain, our Digital Technology team is dedicated to leveraging technology and data to drive profitable growth, enhance customer experience, and further our purpose of 'Celebrating real connections through delicious, planet-friendly food.' We have embarked on an ambitious digital transformation journey that spans our entire business, from Agriculture to Manufacturing, all with a focus on deepening our customer obsession.
As part of this transformation, we are making substantial investments in digital platforms, technology advancements, and fostering a data-driven culture. Our goal is to develop digital products that serve our customers, suppliers/growers, and McCain team members, enabling digital processes and data-driven automation. Through these investments, we aim to transform McCain into a company that empowers its teams with intuitive systems designed to enhance collaboration, productivity, and informed decision-making.
Are you ready to be part of this exciting journey? Join us and help shape the future of McCain through innovative technology solutions.
Leadership Principles.
Our principles, each with related practices, guide our actions across the organization. Together, they address how McCain interacts with our customers and employees, and how we work as individuals and collectively to find success. While each role adheres to the Leadership Principles, individual roles may focus more on a specific principle or principles.

We are customer obsessed. Customers are our starting point. By understanding their needs and leveraging data and consumer insights, we drive mutual success.
We think big and plan ahead. Through ambition, curiosity, and smart risks, we can accomplish goals, refine processes, and innovate to scale success.
We bring out the best in our people. We create safe spaces for our people so that trust and empowerment come naturally. Inclusion is about listening first, showing humility, and working together.
We act like owners. Together, we clear obstacles and do the work that makes us all successful and proud to be part of McCain.

JOB PURPOSE:
Reporting to the Director, Infrastructure Operations, the Sr Engineering Manager, SRE & Observability will be responsible for designing, implementing, and monitoring enterprise-grade, secure, fault-tolerant SRE and Observability infrastructure.
The Senior Manager is an engineering leader who will lead engineering staff working across the organization to provide a frictionless experience to our customers and maintain the highest standards of reliability and availability. Our team thrives and succeeds in delivering high-quality technology products and services in a hyper-growth environment where priorities shift quickly. The ideal candidate has the broad and deep technical knowledge and experience needed to improve application performance and capacity benchmarking; improve availability, security, and reliability; design and evolve cloud/infrastructure architecture; and leverage engineering solutions to solve operational problems. The candidate should also have deep technical expertise in software engineering, Kubernetes, metrics, logs, traces, synthetics, Digital Experience Monitoring, DevOps, big data processing, and the open-source observability platform domain.
JOB RESPONSIBILITIES:

Develop and implement an Observability and SRE strategy.
Collaborate with the Infrastructure, Applications, and Data teams to understand their pain points around monitoring, performance, efficiency, reliability, and availability, and formulate strategies to address recurring issues in a sustainable way.
Influence and build vision with application owners to ship quality products at a faster pace.
Ownership of the end-to-end delivery of team strategy and execution
Develop and motivate teams to solve complex problems and be a strong advocate for open-source technologies and solutions.
Be technically hands-on in coding as well as building highly available systems.
Be responsible for building and mentoring a new team of software engineers
Drive the team to build solutions that serve long-term goals while ensuring that high-priority tech debt is resolved efficiently.
Be a strong thought leader in Site Reliability engineering, Observability, Operational excellence, Big Data processing, and DevOps Principles.
Consistently share best practices and improve processes within and across teams.
Hands-on Software engineering manager with strong understanding of Site Reliability Engineering, Big Data processing, Observability and DevOps principles.
Fluency with at least one modern language such as Python, Java, Go and experience with open-source software is a big plus.
Hands-on experience in managing infrastructure components through Infrastructure as Code using Terraform, Ansible
Strong technical acumen in Cloud Architecture, Observability, Performance Benchmarking, Capacity planning and Reliability tools.
Expert in Container orchestration (e.g., Kubernetes), container runtimes and OS (Operating System) optimization.
Experience in Observability platforms, application monitoring tools and performance analysis techniques.
Experience managing & growing technical leaders and teams.
In-depth knowledge of data structures and algorithms.
Expert in Open-source observability software like Grafana, Prometheus, and OTEL
Knowledge in ML and AI technologies
Develop and improve instrumentation for monitoring and logging the health and availability of services.
Proactively monitor systems, networks, and applications to provide input in improving the stability, security, efficiency, and scalability of systems.
Develop and maintain Monitoring and Logging Frameworks for all of ITX
Take personal responsibility for the quality, reliability and availability of global IT corporate infrastructure.
Own operations documentation of monitoring and logging for global IT production infrastructure.
Participate in rotating on-call incident response on the weekdays and on the weekends.
Improve operational efficiencies via scripting, bots and integrations.
Participate cross functionally with vendors and other IT engineering teams to ensure smooth service delivery.
Network and systems troubleshooting, fault analysis, and resolution.
Collaborate with Incident and Problem Management to reduce MTTR and Incident volume.
Design, implement, and maintain AIOps solutions to monitor and analyze IT systems, applications, and networks.
Deploy machine learning algorithms for anomaly detection, root cause analysis, and incident prediction.
Configure and manage observability tools and platforms to gain real-time visibility into system health and performance.
Develop monitoring dashboards, alerts, and reports to provide comprehensive insights into the IT environment.
Conduct root cause analysis for incidents using data from AIOps and observability tools to identify underlying issues.
Work closely with software engineers to instrument applications with appropriate logging, metrics, and tracing capabilities
Continuously analyze monitoring data to identify trends, anomalies, and opportunities for optimization.
Stay updated with industry trends and advancements in AIOps and observability practices, and recommend new tools or methodologies for adoption
Designing, developing, and implementing AI models and algorithms utilizing state-of-the-art techniques such as GPT, VAE, and GANs.
Collaborating with cross-functional teams to define AI project requirements and objectives, ensuring alignment with overall business goals.
Conducting research to stay up-to-date with the latest advancements in generative AI, machine learning, and deep learning techniques and identify opportunities to integrate them into our products and services.
Optimizing existing generative AI models for improved performance, scalability, and efficiency.
Developing and maintaining AI pipelines, including data preprocessing, feature extraction, model training, and evaluation.
Developing clear and concise documentation, including technical specifications, user guides, and presentations, to communicate complex AI concepts to both technical and non-technical stakeholders.
Contributing to the establishment of best practices and standards for generative AI development within the organization.
Providing technical mentorship and guidance to junior team members.
Apply trusted AI practices to ensure fairness, transparency, and accountability in AI models and systems
Drive DevOps and MLOps practices, covering continuous integration, deployment, and monitoring of AI
Utilize tools such as Docker, Kubernetes, and Git to build and manage AI pipelines
Implement monitoring and logging tools to ensure AI model performance and reliability
Collaborate seamlessly with software engineering and operations teams for efficient AI model integration and deployment.
Familiarity with DevOps and MLOps practices, including continuous integration, deployment, and monitoring of AI models.

KEY QUALIFICATION & EXPERIENCES:

Minimum 10 years of experience in Observability/Monitoring tools
Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field.
5+ years of industry experience in software development.
In-depth experience designing at scale monitoring and logging for corporate infrastructure services.
Expert-level experience in monitoring and logging technologies, both open source and closed source (e.g. AppDynamics, New Relic, Datadog, Prometheus, Grafana, LogicMonitor, Sumo Logic, ELK)
Experience in implementing Metrics, Logs and Tracing for E2E observability
Experience in RBAC and user based security services such as ISE, Radius, LDAP, and AD.
Must have strong automation/scripting skills - proficiency in Python or Golang is a plus.
Proficient in developing and maintaining technical documentation, runbooks, and procedures.
A working knowledge of networking is needed. Fundamental knowledge of the TCP/IP stack, application protocols (DHCP/DNS/HTTPS) and networking concepts (HSRP/NAT/VPN/VLANs/802.1x/Wireless/Clustering/High Availability/Load Balancing).
Understanding of enterprise networks using Cisco IOS/NXOS with a working knowledge of IP Protocols (TCP/UDP/ICMP) and Routing Protocols (BGP/OSPF/IS-IS).
Technology understanding of Cisco, Cloud Native Firewalls, including Firewall Policy Rules, URL-Filtering, App-ID, User-ID, etc.
Experience interacting with Telco and Global ISPs (WAN/DIA) and the monitoring of those services.
A working knowledge of systems is needed. Fundamental knowledge of Configuration Management and Automation tools, with experience in:
* Terraform, Ansible, Chef, Puppet, Jenkins
* Designing and implementing CI/CD pipelines
* Infrastructure provisioning and management
Strong in troubleshooting incidents in production environment.
A strong ownership attitude and a track record of taking responsibility for problems and pushing through to resolution.
Ability to communicate and coordinate with cross-functional engineering teams across multiple geographic regions.
Experience with AIOps and machine learning is highly desirable.
Knowledge of OpenTelemetry is an added advantage.
Experience with other monitoring tools like Prometheus, Grafana, etc.
Experience with Observability solutions like Dynatrace, Datadog, Instana, etc. is highly desirable.
Experience working with mainframe systems is a plus (willingness to learn is also acceptable).
Excellent problem-solving and analytical skills.
Strong communication and collaboration skills.
Ability to work independently and manage multiple projects simultaneously.
Passion for learning new technologies and continuous improvement.
In-depth knowledge of machine learning, deep learning, and generative AI techniques
Knowledge and experience in Generative AI
Proficiency in programming languages such as Python, R, and frameworks like TensorFlow or PyTorch
Strong understanding of NLP techniques and frameworks such as BERT, GPT, or Transformer models
Familiarity with computer vision techniques for image recognition, object detection, or image generation
Experience with cloud platforms such as Azure or AWS
Knowledge of IT operations concepts and processes, such as monitoring, incident management, root cause analysis, remediation.
Strong problem solving and analytical skills.
Strong interpersonal and written and verbal communication skills.
Highly adaptable to changing circumstances. Interest in continuously learning new skills and technologies.
Experience with programming and scripting languages (e.g. Java, C#, C++, Python, Bash, PowerShell).
Experience with incident and response management.

Qualifications

Bachelor's degree (or equivalent years of experience).
5+ years of relevant work experience. SRE experience required.
Background in Manufacturing, Platform/Tech companies is preferred.
Must have Public Cloud provider certifications (Azure, GCP or AWS)
Having a CNCF certification is a plus.

OTHER INFORMATION

Travel: as required.
Job is primarily performed in a Hybrid office environment.

Key SRE and Observability Overview and Boundaries
Infrastructure Design: Requires knowledge of: Software architecture; Distributed systems; Scalability; Design patterns; Disaster Recovery; Tech Stacks
Non-Functional Requirements : Security standards, frameworks, and methodologies (System Security Plan - SSP, Security Risk and Compliance Review - SRCR, etc.). To assist in creation of simple, modular, extensible and functional design for the product/solution in adherence to the requirements. Evaluate trade-offs while designing across multiple components in a system based on the business requirements. Convert HLD to create detailed design for specific modules / components of a product/system. Understand nuances of designing for disaster recovery. Undertake infrastructure coding automation.
Performance and Optimization : Requires knowledge of: Unix/Linux performance optimization tuning; Java/NodeJS/Tomcat/Apache tuning and optimization; open-source chaos tools (for example, Openblade, Chaos Monkey, Pumba, Chaos Mesh, Litmus, Chaos Toolkit, ToxiProxy). To evaluate appropriate reliability models for estimating complex reliability parameters. Designs and develops a reliability program plan for a complex site environment. Facilitates reliability testing procedures. Ensures reliability testing procedures align with site environment changes.
Integration : Integrates the business goals of site reliability engineering and site safety engineering. Trains team members on the development and implementation of tools and applications for reliability predictions and improvements. Decides criteria selection and evaluation for site reliability analysis and assessment. Facilitates open-source chaos experiments to test and validate the resiliency of applications.
Solution Design : Requires knowledge of: Software architecture; Distributed systems; Scalability; Design patterns; Disaster Recovery; Tech Stacks; Minimum Viable Product (MVP); Non-Functional Requirements; Telemetry. To create simple, modular, extensible and functional design in adherence to the requirements for multiple products/solutions within a domain. Understand customer requirements and analyze the gaps between existing architecture and customer requirements. Analyze system performance impacting the complete product for non-functional requirements like reliability, operability, performance efficiency and security. Create detailed design using mock screens, pseudocode and detailed functional logic of the modules for an entire product. Finalize the tech stack (for example, MEAN, LAMP, etc.) for products/systems based on the business needs. Review the MVP to uncover risks and check for performance and usability; guide the team during MVP creation. Drive design of software, production and preproduction environments and deployment pipeline to continuously generate records for telemetry.
Coding : Requires knowledge of: Coding standards and guidelines; Coding languages (e.g. JavaScript, Python, C#, etc.), frameworks (e.g. ActiveX, .NET, Cocoa, Android application framework, etc.), tools (e.g. Monday.com, Linx, Embold, etc.) and platforms (e.g. Microsoft Azure, AWS, Apple iOS, etc.); Quality, Safety and Security (PCI, etc.) standards; Emerging tools and technologies; Telemetry. To create/configure minimalistic code for entire component/application and ensure the components are meeting business/technical requirements, non-functional requirements, and low-maintenance, high-availability and high-scalability needs. Assist in the selection of appropriate languages (e.g. JavaScript, Python, C#, etc.), development standards and tools (e.g. Monday.com, Linx, Embold, etc.) for software coding/configuration. Take initiative to learn the fundamentals of different coding languages and frameworks that would be useful for future scope of work. Build scripts for automation of repetitive and routine tasks in CI/CD (Continuous Integration/Continuous Delivery), testing or any other process (as applicable). Implement telemetry features independently as required. Ensure security policy requirements are properly applied to components/application during code development/configuration.
Triaging and Troubleshooting : Possesses knowledge of: Regression testing; Root cause analysis (RCA); Root cause corrective action (RCCA). To analyze defects from past projects/solutions to avoid recurrence. Troubleshoots performance and availability bottlenecks for assigned application independently. Triages to detect and determine symptom versus cause of defects. Actively provides data for and participates in RCA.
Disaster Recovery Planning : Requires knowledge of: Disaster recovery procedures and processes; Enterprise disaster recovery systems. To work with business partners to identify and document critical applications. Interprets and follows procedures in contingency plans. Explains the contingency and disaster recovery plans for assigned environment. Executes established procedures necessary to continue operations in an emergency. Participates in the design of a minimum operating environment for a computer-based facility.
Monitoring and Alerting : Requires knowledge of: Monitoring and alerting tools; Monitoring metrics and key performance indicators (for example, availability, MTBF, MTTR); SLIs and SLOs (for example, request latency, availability, error rates, saturation); Distributed tracing; Alerting logic. To suggest metrics to monitor software or system performance. Monitors current performance data to ensure compliance with defined SLOs for multiple applications/systems. Determines thresholds for monitoring metrics and triggers alerts based on thresholds. Supervises specific procedures to proactively check the health of applications and infrastructure, including a variety of operating systems, hardware, and software. Makes recommendations regarding situational awareness, instrumentation gaps, and alerting logic.

Drives the execution of multiple business plans and projects by identifying customer and operational needs; developing and communicating business plans and priorities; removing barriers and obstacles that impact performance; providing resources; identifying performance standards; measuring progress and adjusting performance accordingly; developing contingency plans; and demonstrating adaptability and supporting continuous learning.
Provides supervision and development opportunities for associates by selecting and training; mentoring; assigning duties; building a team-based work environment; establishing performance expectations and conducting regular performance evaluations; providing recognition and rewards; coaching for success and improvement; and ensuring diversity awareness.
Promotes and supports company policies, procedures, mission, values, and standards of ethics and integrity by training and providing direction to others in their use and application; ensuring compliance with them; and utilizing and supporting the Open Door Policy.
Ensures business needs are being met by evaluating the ongoing effectiveness of current plans, programs, and initiatives; consulting with business partners, managers, co-workers, or other key stakeholders; soliciting, evaluating, and applying suggestions for improving efficiency and cost effectiveness; and participating in and supporting community outreach events.

The above information indicates the general nature and level of work performed by employees within this classification. It is not a comprehensive inventory of all duties, responsibilities and qualifications required of employees assigned to this job.
Compensation Package : 111,700 - 149,000 CAD annually + bonus eligibility
The above reflects the target compensation range for the position at the time of posting. Hiring compensation will be determined based on experience, skill set, education/training, and other organizational needs.
Benefits : At McCain, we're on a mission to create a winning culture that puts employee safety and wellbeing at the heart of what we do, every day. We understand and appreciate that each person's needs are unique and ensure our benefits & wellbeing programs reflect that. Employees are eligible for the following benefits: health coverage (medical, dental, vision, prescription drug), retirement savings benefits, and leave support including medical, family and bereavement. Wellbeing programs include vacation and holidays, company-supported volunteering time, and mental health resources. Coverages are aligned to country, provincial and state governing plans and can vary by work level, location and nature of the role. Additional benefit details available during the application process.
Your well-being matters to us, and we're here to provide you with the necessary resources to support you in being your best self at work - and at home.
McCain Foods is an equal opportunity employer. We see value in ensuring we have a diverse, antiracist, inclusive, merit-based, and equitable workplace. As a global family-owned company we are proud to reflect the diverse communities around the world in which we live and work. We recognize that diversity drives our creativity, resilience, and success and makes our business stronger.
McCain is an accessible employer. If you require an accommodation throughout the recruitment process (including alternate formats of materials or accessible meeting rooms), please let us know and we will work with you to meet your needs.
Your privacy is important to us. By submitting personal data or information to us, you agree this will be handled in accordance with the Global Employee Privacy Policy
Job Family: Information Technology
Division: Global Digital Technology
Department: Infrastructure and Operations
Location(s): CA - Canada : New Brunswick : Florenceville-Bristol || CA - Canada : Ontario : Toronto
Company: McCain Foods (Canada)

Top Skills

Java

Python
