Job Description

Join a stealth-mode startup building out its AI and cloud platform, powered by thousands of H100s, H200s, and B200s, ready for experimentation, full-scale model training, or inference. As a Platform Engineer/Senior Site Reliability Engineer, you’ll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access. You’ll also support the company’s new products as they come to market.

This is a rare opportunity to work at the intersection of infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment.

If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it. Get in touch and apply today!

Responsibilities:

Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.
Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have:

7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Salary & Benefits:

$300,000 gross per year
Equity

Job Tags

Permanent employment

Senior Site Reliability Engineer (SRE) - AI Infrastructure Job at Confidential, San Francisco, CA