Infrastructure & Platform Engineer

Engineers in this role architect and operate the systems that power AI research and product development at scale. They design distributed infrastructure for training, serving, and orchestrating AI workloads across GPU clusters, build internal platforms that accelerate developer velocity, and optimize the critical path from code to production. The role pairs deep systems engineering expertise (in areas like Kubernetes, build systems, data pipelines, and performance tuning) with the unique demands of AI workloads. It combines hands-on infrastructure work with close collaboration with researchers and product teams to eliminate bottlenecks that slow down innovation.

$ titles --canonical
Senior Software Engineer, Infrastructure
Software Engineer, Platform
Software Engineer, AI Platform

Open Jobs: 555
Companies Hiring: 87
$02

Skills

What companies are looking for in this role.

$ skills --core

Designing and deploying cloud-based machine learning training and inference clusters at scale

95%

Designing and operating Kubernetes clusters, including schedulers, control planes, and custom controllers for specialized workloads

92%

Implementing Infrastructure as Code for reproducible resource provisioning and configuration management

90%

Building and maintaining CI/CD pipelines for machine learning workflows and distributed systems

88%

Optimizing system performance including GPU utilization, latency, and throughput at scale

88%

Diagnosing and resolving distributed systems issues including performance bottlenecks and hardware failures

87%

Managing and optimizing network-based distributed file systems and blob storage solutions for machine learning workloads

85%

Designing and building tools for monitoring, observability, and operational visibility across infrastructure

82%

Provisioning bare metal servers and managing hardware lifecycle across data centers and edge environments

82%

Developing custom autoscaling solutions for machine learning and compute-intensive workloads

80%

Implementing security best practices across infrastructure stacks without impeding research velocity

75%
$ skills --emerging

Building abstractions and developer-friendly tools that accelerate research iteration and reduce infrastructure friction

78%

Architecting multi-region and multi-cloud infrastructure for distributed training and inference

70%

Designing systems for measuring and evaluating large-scale machine learning workloads to determine production readiness

68%

Integrating artificial intelligence capabilities into developer workflows and productivity tools

62%
$ skills --soft

Collaborating with research and product teams to translate workload requirements into infrastructure solutions

85%

Owning technical strategy, roadmaps, and long-term architectural decisions for infrastructure systems

82%

Taking ownership of production systems and participating in incident diagnosis and resolution

80%

Communicating complex technical concepts across teams with different expertise and priorities

78%

Mentoring engineers and establishing best practices for building and operating large-scale systems

72%
$03

Technology

The tools and technologies that define this role.

$ tech --language
Python: high
Go: moderate
$ tech --framework
CUDA: moderate
$ tech --platform
Kubernetes: very high
Linux: very high
AWS: high
Slurm: high
GCP: moderate
NVIDIA: moderate
Azure: low
GitLab: low
$ tech --tool
Docker: high
BMC: moderate
etcd: moderate
Git: moderate
Grafana: moderate
Helm: moderate
IPMI: moderate
Prometheus: moderate
S3: moderate
Terraform: moderate
Ansible: low
Datadog: low
Jenkins: low
MAAS: low
$ tech --concept
Distributed systems: very high
Infrastructure as Code: very high
Cloud-native: high
Data center: high
High-performance computing: high
Networking: high
Scheduler: high
Storage systems: high
API server: moderate
Edge computing: moderate
Load balancing: moderate
Multi-cloud: moderate
PXE: moderate
Service discovery: moderate
$04

Open Jobs

555 open Infrastructure & Platform Engineer jobs across 87 companies.

Graphcore · 1d
Staff Engineer
Austin, Texas, United States; US - Milpitas · Engineering
Graphcore · 2d
Observability, Staff Telemetry Engineer
Gdańsk, Pomeranian Voivodeship, Poland · Engineering
Notion · 2d
Software Engineer, Infrastructure
Hyderabad, India · Engineering
OpenAI · 4d
Software Engineer, RL Training Infra
San Francisco · Engineering
Twelve Labs · 4d
Staff Site Reliability Engineer
San Francisco · Engineering
Graphcore · 4d
Staff Engineer
Austin, Texas, United States; US - Milpitas · Engineering
OpenAI · 4d
Software Engineer, ML Systems & Training Architecture
San Francisco · Engineering
Graphcore · 5d
AI Platform Architect
Austin, Texas, United States · Engineering
Crusoe · 5d
Senior Staff Network Engineer, Deployment
San Francisco, CA - US · Engineering
Graphcore · 5d
Staff Cloud Engineer
London, UK · Engineering
Graphcore · 5d
Staff Cloud Engineer
Bristol, UK · Engineering
Isomorphic Labs · 5d
Software Engineer (TechOps), London
London · Engineering
MongoDB · 5d
Senior Platform Engineer
Gurugram · Engineering
Nscale · 5d
Senior Principal Network Engineer, AI Infrastructure
US · Engineering
Graphcore · 5d
Senior Cloud Platform Engineer
Bristol, UK · Engineering
Graphcore · 5d
Senior Cloud Platform Engineer
London, UK · Engineering
Graphcore · 5d
Senior Cloud Network Engineer
Bristol, UK · Engineering
Graphcore · 5d
Senior Cloud Network Engineer
London, UK · Engineering
Graphcore · 5d
Senior Cloud Engineer (K8S)
London, UK · Engineering
Graphcore · 5d
Senior Cloud Engineer (K8S)
Bristol, UK · Engineering