An NVIDIA A100 on AWS costs roughly $3.50/hour on-demand. Run 8 of them 24/7 for training workloads, and you're looking at $20K+/month in GPU compute alone. At that scale, on-premises bare metal starts looking very attractive — if you can manage the infrastructure.
We've set up GPU clusters in data centers for several AI-focused companies. Here's what the architecture looks like, what the pitfalls are, and when it actually makes financial sense to go bare metal.
When bare metal GPUs make sense
The break-even point depends on your utilization. If your GPU utilization is consistently above 60%, bare metal is almost certainly cheaper. Below 40%, cloud spot instances are hard to beat. The gray zone (40–60%) depends on your tolerance for spot interruptions and your workload's ability to checkpoint.
- Training workloads that run for hours or days → bare metal wins
- Inference with consistent traffic → bare metal wins
- Burst inference or experimentation → cloud wins
- Hybrid — steady-state on bare metal, burst to cloud → optimal for most
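The utilization break-even can be sketched with quick back-of-envelope math. All dollar figures below (on-demand GPU rate, hardware capex, colo/ops line item) are illustrative assumptions, not quotes; cheaper spot pricing pushes the break-even utilization higher than this on-demand comparison suggests:

```python
# Back-of-envelope: cloud spend scales with utilization, bare metal is a
# fixed monthly cost. All dollar figures are illustrative assumptions.

def monthly_cloud_cost(n_gpus, utilization, rate_per_gpu_hour=3.50, hours=730):
    """Cloud (ideal case): you only pay for the hours you actually run."""
    return n_gpus * rate_per_gpu_hour * hours * utilization

def monthly_bare_metal_cost(capex=250_000, amort_months=36, colo_ops=1_000):
    """Bare metal: hardware amortized over 3 years plus colo/ops, fixed."""
    return capex / amort_months + colo_ops

for util in (0.3, 0.5, 0.7, 0.9):
    cloud, metal = monthly_cloud_cost(8, util), monthly_bare_metal_cost()
    winner = "bare metal" if metal < cloud else "cloud"
    print(f"util={util:.0%}: cloud ~${cloud:,.0f}, bare metal ~${metal:,.0f} -> {winner}")
```

Under these assumptions the crossover sits in the 40–60% band; plug in your own rates, since spot discounts and power costs move it substantially.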
The architecture
Hardware layer
A typical setup we deploy:
- 4–8 GPU nodes (NVIDIA A100 or H100, 4–8 GPUs per node)
- High-bandwidth networking (InfiniBand or 100GbE RoCE for multi-node training)
- Shared storage (NVMe-oF or parallel filesystem like BeeGFS/Lustre for datasets)
- Management nodes (2–3 nodes for Kubernetes control plane, monitoring, storage controllers)
Kubernetes with GPU support
We use vanilla Kubernetes (kubeadm or RKE2) with:
- NVIDIA GPU Operator — automatically installs and manages GPU drivers, container runtime, and device plugins across all nodes
- NVIDIA Network Operator — manages RDMA and GPUDirect for multi-node training
- Time-slicing or MIG — for inference workloads, share A100s across pods (time-slicing, no isolation) or partition them into hardware-isolated instances (MIG)
- Topology-aware scheduling — ensures multi-GPU jobs land on GPUs connected via NVLink rather than PCIe
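Once the GPU Operator's device plugin is running, workloads request GPUs as Kubernetes extended resources. A minimal sketch of the pod spec, written as a Python dict for illustration (the image tag and MIG profile name are examples; actual MIG resource names depend on the profiles you enable):

```python
# Sketch of a GPU pod spec as a Python dict (what you'd submit as YAML).
# The device plugin advertises full GPUs as nvidia.com/gpu; with MIG
# enabled, slices appear under names like nvidia.com/mig-2g.10gb.
pod_spec = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "train-job"},
    "spec": {
        "containers": [{
            "name": "trainer",
            "image": "nvcr.io/nvidia/pytorch:24.01-py3",  # example image
            "resources": {
                "limits": {
                    "nvidia.com/gpu": 4,  # or "nvidia.com/mig-2g.10gb": 1
                }
            },
        }],
    },
}
print(pod_spec["spec"]["containers"][0]["resources"]["limits"])
```

Note that topology-aware placement (NVLink vs. PCIe) is handled by the scheduler configuration, not by anything in the pod spec itself.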
ML platform layer
On top of Kubernetes, we deploy:
- Kubeflow or MLflow for experiment tracking and pipeline orchestration
- JupyterHub for interactive notebooks with GPU access
- Volcano or Kueue for batch job scheduling and fair-share queuing
- Prometheus + DCGM Exporter for GPU utilization, temperature, and memory monitoring
- Harbor for container image registry (large ML images shouldn't traverse the WAN)
Common pitfalls
1. Underestimating networking
Multi-node training (e.g., training a large model across 32 GPUs on 4 nodes) is bottlenecked by inter-node communication. Standard 10GbE Ethernet won't cut it. You need InfiniBand or 100GbE with RDMA. This is often the most expensive and hardest-to-get-right part of the setup.
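A back-of-envelope estimate shows why. In ring all-reduce, each node sends and receives roughly 2(N−1)/N times the gradient payload per step; the model size and fp16 gradients below are assumptions for illustration:

```python
# Back-of-envelope: time per gradient all-reduce for a model with P
# parameters (fp16 gradients) using ring all-reduce across N nodes.
# Each node moves ~2*(N-1)/N * payload bytes per step.

def allreduce_seconds(params, n_nodes, link_gbit_s, bytes_per_grad=2):
    payload = params * bytes_per_grad              # gradient bytes (fp16)
    wire_bytes = 2 * (n_nodes - 1) / n_nodes * payload
    return wire_bytes / (link_gbit_s * 1e9 / 8)    # convert Gbit/s to bytes/s

params = 7e9  # e.g. a 7B-parameter model
for gbit in (10, 100, 400):
    t = allreduce_seconds(params, n_nodes=4, link_gbit_s=gbit)
    print(f"{gbit:>3} Gbit/s link: ~{t:.1f} s per all-reduce")
```

On 10GbE that's tens of seconds of pure communication per step — the GPUs would spend most of each step waiting on the network — while 100+ Gbit/s RDMA brings it down to something overlap and gradient compression can hide.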
2. Storage performance
ML training reads datasets in tight loops. A slow storage layer means GPUs sit idle waiting for data. We typically deploy a parallel filesystem (BeeGFS or Lustre) on NVMe drives with at least 10 GB/s aggregate read throughput per training node.
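As a rough sizing sketch (the samples/s per GPU and bytes per sample are illustrative assumptions; profile your own input pipeline):

```python
# Rough estimate of the read throughput needed to keep one 8-GPU node fed.
# samples/s per GPU and bytes per sample are illustrative assumptions.

def required_read_gbps(gpus_per_node, samples_per_sec_per_gpu, bytes_per_sample):
    return gpus_per_node * samples_per_sec_per_gpu * bytes_per_sample / 1e9

# e.g. image training: 8 GPUs * 2000 samples/s each * 600 KB JPEGs
need = required_read_gbps(8, 2000, 600_000)
print(f"~{need:.1f} GB/s sustained reads per node")
```

Which is why a single NFS server over 10GbE (~1.25 GB/s theoretical) starves the node, and a striped parallel filesystem on NVMe is the usual answer.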
3. Power and cooling
An 8xA100 node draws 5–6 kW. A rack of 4 such nodes draws 20–24 kW. Most standard data center racks are provisioned for 6–10 kW. You need high-density racks with adequate cooling — often liquid cooling for the GPU nodes. Plan this with your data center provider before ordering hardware.
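The rack math, with an assumed all-in energy rate (colo pricing is often quoted per kW/month instead, but the shape is the same):

```python
# Power sketch for a dense GPU rack. kW per node and $/kWh are assumptions.

def rack_draw_kw(nodes, kw_per_node=6.0):
    return nodes * kw_per_node

def monthly_energy_cost(kw, usd_per_kwh=0.12, hours=730):
    return kw * hours * usd_per_kwh

draw = rack_draw_kw(4)  # four 8xA100 nodes in one rack
print(f"rack draw: {draw:.0f} kW (vs. a typical 6-10 kW rack budget)")
print(f"energy: ~${monthly_energy_cost(draw):,.0f}/month at $0.12/kWh")
```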
Cost comparison
For a typical 8xA100 setup running 24/7:
- Cloud (AWS p4d.24xlarge on-demand): ~$24K/month
- Cloud (reserved 1-year): ~$16K/month
- Bare metal (colocation + hardware amortized over 3 years): ~$6K–$8K/month
The bare metal option is 2–3x cheaper at high utilization. But it requires upfront capital ($200K–$400K for hardware) and ongoing management. That's where we come in — we handle the infrastructure so your ML team can focus on models.
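The monthly figures above can be sanity-checked; the $32.77/hr p4d.24xlarge rate and the colo line item are assumptions (check current pricing for your region), and the bare metal band shown covers the lower end of the capex range:

```python
# Sanity check on the monthly figures. Rates are assumptions, not quotes.
HOURS_PER_MONTH = 730

on_demand = 32.77 * HOURS_PER_MONTH   # assumed p4d.24xlarge on-demand rate
bm_low = 200_000 / 36 + 750           # capex over 36 months + assumed colo
bm_high = 250_000 / 36 + 750
print(f"cloud on-demand: ~${on_demand:,.0f}/month")
print(f"bare metal:      ~${bm_low:,.0f}-${bm_high:,.0f}/month")
```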