Voltage Park

Infrastructure Operations Engineer

Posted Yesterday
Remote
Hiring Remotely in USA
Senior level
The Infrastructure Operations Engineer is responsible for ensuring the stability and performance of AI compute infrastructure, collaborating with various teams, and deploying system updates while participating in an on-call rotation.
The summary above was generated by AI

Voltage Park is your enterprise AI factory. We offer scalable compute power and on-demand and reserved bare metal AI infrastructure using NVIDIA GPUs, with world-class service, performance, and value. Founded with the mission of making AI computing accessible to all, we provide flexible, affordable GPU solutions that power everyone from builders to enterprises.

We are seeking a highly skilled and proactive Infrastructure Operations Engineer to be part of our 24/7 Infrastructure Operations team responsible for the stability, scalability, and performance of compute, storage, and platform infrastructure. This role plays a key part in delivering always-on, high-performance environments that support AI/ML training, inference, and HPC workloads at scale. The ideal candidate combines technical depth with strong interpersonal skills and a passion for operational excellence. 

This position offers full remote flexibility, although candidates must be based in the continental US and available to work during PST hours. Unfortunately, we are unable to provide sponsorship for this role.

Responsibilities

  • At the direction of the Manager of Infrastructure Operations, design, build, and roll out new platforms and patterns to minimize incidents and enable customer-facing and internal features.

  • Deploy updates and improvements to support both Voltage Park’s internal and end-customer use cases.

  • Collaborate with colleagues in Infrastructure Engineering, Network Operations, Customer Success, and the Software and Platform Development teams.

  • Participate in the on-call rotation, which is evenly distributed across all team members in a primary / secondary pattern: you serve as primary, then move to the secondary position.

Qualifications

  • 8+ years working with Linux as a server / hosting platform; extra points for Ubuntu experience.

  • 5+ years of experience with AWS.

  • 2+ years of experience with Kubernetes and strong container fundamentals.

  • 2+ years of experience with Terraform and Ansible.

  • 2+ years with network-attached storage management (via NFS, Ceph, or other protocols). Extra points for experience with VAST storage systems.

  • Experience working in a Slack-first, asynchronous remote work environment.

  • Experience with monitoring systems (Prometheus, ELK stack).

  • Familiarity with the GitOps workflow.

  • Software development experience using Python, Go, Bash, or other languages for automation and for connecting systems and APIs.

  • Deep networking fundamentals; extra points for experience with datacenter-level networks, 400Gb Ethernet, and InfiniBand.

  • Experience building and delivering complex systems.

  • Effective at navigating tradeoffs between design, risk, cost, and outcomes.

  • Comfortable with navigating ambiguity.

  • Strong written and oral communication.

Ideal Experiences

  • Experience with bare metal hardware troubleshooting and provisioning; extra points for working with Dell hardware.

  • Experience with GPU servers, either in bare metal form or under virtualization.

  • Deep experience with network switches, routers, and firewalls, particularly SONiC switches, Palo Alto firewalls, and Juniper Networks equipment.

  • Experience with VAST storage systems.

Culture

  • You enjoy working with a small group of friendly, highly motivated, execution-focused colleagues.

  • You’re comfortable with a high degree of autonomy. We expect you to independently prioritize your work and understand how it maps to the overall needs and goals of the company.

  • You’re knowledgeable in your domain but also enjoy wearing multiple hats and venturing outside of your comfort zone when the need arises.

  • You value the ability to write well and understand the importance of good documentation.

Voltage Park is an equal opportunity employer and makes employment decisions on the basis of merit. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic protected under federal, state, or local law. If you require an accommodation during the job application process, please notify your recruiter.

Compensation Range: $140K - $200K

Top Skills

Ansible
AWS
Bash
Ceph
ELK Stack
Go
GPU
Kubernetes
Linux
NFS
Prometheus
Python
Terraform
VAST Storage

