An NVIDIA A100 on AWS costs roughly $3.50/hour on-demand. Run 8 of them 24/7 for training workloads, and you're looking at $20K+/month in GPU compute alone. At that scale, on-premises bare metal starts looking very attractive — if you can manage the infrastructure.
We've set up GPU clusters in data centers for several AI-focused companies. Here's what the architecture looks like, what the pitfalls are, and when it actually makes financial sense to go bare metal.
When bare metal GPUs make sense
The break-even point depends on your utilization. If your GPU utilization is consistently above 60%, bare metal is almost certainly cheaper. Below 40%, cloud spot instances are hard to beat. The gray zone (40–60%) depends on your tolerance for spot interruptions and your workload's ability to checkpoint.
- Training workloads that run for hours or days → bare metal wins
- Inference with consistent traffic → bare metal wins
- Burst inference or experimentation → cloud wins
- Hybrid — steady-state on bare metal, burst to cloud → optimal for most
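The utilization break-even can be sketched with quick back-of-envelope math. All dollar figures below (on-demand GPU rate, hardware capex, colo/ops line item) are illustrative assumptions, not quotes; cheaper spot pricing pushes the break-even utilization higher than this on-demand comparison suggests:

```python
# Back-of-envelope: cloud spend scales with utilization, bare metal is a
# fixed monthly cost. All dollar figures are illustrative assumptions.

def monthly_cloud_cost(n_gpus, utilization, rate_per_gpu_hour=3.50, hours=730):
    """Cloud (ideal case): you only pay for the hours you actually run."""
    return n_gpus * rate_per_gpu_hour * hours * utilization

def monthly_bare_metal_cost(capex=250_000, amort_months=36, colo_ops=1_000):
    """Bare metal: hardware amortized over 3 years plus colo/ops, fixed."""
    return capex / amort_months + colo_ops

for util in (0.3, 0.5, 0.7, 0.9):
    cloud, metal = monthly_cloud_cost(8, util), monthly_bare_metal_cost()
    winner = "bare metal" if metal < cloud else "cloud"
    print(f"util={util:.0%}: cloud ~${cloud:,.0f}, bare metal ~${metal:,.0f} -> {winner}")
```

Under these assumptions the crossover sits in the 40–60% band; plug in your own rates, since spot discounts and power costs move it substantially.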
The architecture
Hardware layer
A typical setup we deploy:
- 4–8 GPU nodes (NVIDIA A100 or H100, 4–8 GPUs per node)
- High-bandwidth networking (InfiniBand or 100GbE RoCE for multi-node training)
- Shared storage (NVMe-oF or parallel filesystem like BeeGFS/Lustre for datasets)
- Management nodes (2–3 nodes for Kubernetes control plane, monitoring, storage controllers)
Kubernetes with GPU support
We use vanilla Kubernetes (kubeadm or RKE2) with:
- NVIDIA GPU Operator — automatically installs and manages GPU drivers, container runtime, and device plugins across all nodes
- NVIDIA Network Operator — manages RDMA and GPUDirect for multi-node training
- Time-slicing or MIG — for inference workloads, share A100s across pods (time-slicing, no isolation) or partition them into hardware-isolated instances (MIG)
- Topology-aware scheduling — ensures multi-GPU jobs land on GPUs connected via NVLink rather than PCIe
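Once the GPU Operator's device plugin is running, workloads request GPUs as Kubernetes extended resources. A minimal sketch of the pod spec, written as a Python dict for illustration (the image tag and MIG profile name are examples; actual MIG resource names depend on the profiles you enable):

```python
# Sketch of a GPU pod spec as a Python dict (what you'd submit as YAML).
# The device plugin advertises full GPUs as nvidia.com/gpu; with MIG
# enabled, slices appear under names like nvidia.com/mig-2g.10gb.
pod_spec = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "train-job"},
    "spec": {
        "containers": [{
            "name": "trainer",
            "image": "nvcr.io/nvidia/pytorch:24.01-py3",  # example image
            "resources": {
                "limits": {
                    "nvidia.com/gpu": 4,  # or "nvidia.com/mig-2g.10gb": 1
                }
            },
        }],
    },
}
print(pod_spec["spec"]["containers"][0]["resources"]["limits"])
```

Note that topology-aware placement (NVLink vs. PCIe) is handled by the scheduler configuration, not by anything in the pod spec itself.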
ML platform layer
On top of Kubernetes, we deploy:
- Kubeflow or MLflow for experiment tracking and pipeline orchestration
- JupyterHub for interactive notebooks with GPU access
- Volcano or Kueue for batch job scheduling and fair-share queuing
- Prometheus + DCGM Exporter for GPU utilization, temperature, and memory monitoring
- Harbor for container image registry (large ML images shouldn't traverse the WAN)
Common pitfalls
1. Underestimating networking
Multi-node training (e.g., training a large model across 32 GPUs on 4 nodes) is bottlenecked by inter-node communication. Standard 10GbE Ethernet won't cut it. You need InfiniBand or 100GbE with RDMA. This is often the most expensive and hardest-to-get-right part of the setup.
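A back-of-envelope estimate shows why. In ring all-reduce, each node sends and receives roughly 2(N−1)/N times the gradient payload per step; the model size and fp16 gradients below are assumptions for illustration:

```python
# Back-of-envelope: time per gradient all-reduce for a model with P
# parameters (fp16 gradients) using ring all-reduce across N nodes.
# Each node moves ~2*(N-1)/N * payload bytes per step.

def allreduce_seconds(params, n_nodes, link_gbit_s, bytes_per_grad=2):
    payload = params * bytes_per_grad              # gradient bytes (fp16)
    wire_bytes = 2 * (n_nodes - 1) / n_nodes * payload
    return wire_bytes / (link_gbit_s * 1e9 / 8)    # convert Gbit/s to bytes/s

params = 7e9  # e.g. a 7B-parameter model
for gbit in (10, 100, 400):
    t = allreduce_seconds(params, n_nodes=4, link_gbit_s=gbit)
    print(f"{gbit:>3} Gbit/s link: ~{t:.1f} s per all-reduce")
```

On 10GbE that's tens of seconds of pure communication per step — the GPUs would spend most of each step waiting on the network — while 100+ Gbit/s RDMA brings it down to something overlap and gradient compression can hide.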
2. Storage performance
ML training reads datasets in tight loops. A slow storage layer means GPUs sit idle waiting for data. We typically deploy a parallel filesystem (BeeGFS or Lustre) on NVMe drives with at least 10 GB/s aggregate read throughput per training node.
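As a rough sizing sketch (the samples/s per GPU and bytes per sample are illustrative assumptions; profile your own input pipeline):

```python
# Rough estimate of the read throughput needed to keep one 8-GPU node fed.
# samples/s per GPU and bytes per sample are illustrative assumptions.

def required_read_gbps(gpus_per_node, samples_per_sec_per_gpu, bytes_per_sample):
    return gpus_per_node * samples_per_sec_per_gpu * bytes_per_sample / 1e9

# e.g. image training: 8 GPUs * 2000 samples/s each * 600 KB JPEGs
need = required_read_gbps(8, 2000, 600_000)
print(f"~{need:.1f} GB/s sustained reads per node")
```

Which is why a single NFS server over 10GbE (~1.25 GB/s theoretical) starves the node, and a striped parallel filesystem on NVMe is the usual answer.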
3. Power and cooling
An 8xA100 node draws 5–6 kW. A rack of 4 such nodes draws 20–24 kW. Most standard data center racks are provisioned for 6–10 kW. You need high-density racks with adequate cooling — often liquid cooling for the GPU nodes. Plan this with your data center provider before ordering hardware.
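The rack math, with an assumed all-in energy rate (colo pricing is often quoted per kW/month instead, but the shape is the same):

```python
# Power sketch for a dense GPU rack. kW per node and $/kWh are assumptions.

def rack_draw_kw(nodes, kw_per_node=6.0):
    return nodes * kw_per_node

def monthly_energy_cost(kw, usd_per_kwh=0.12, hours=730):
    return kw * hours * usd_per_kwh

draw = rack_draw_kw(4)  # four 8xA100 nodes in one rack
print(f"rack draw: {draw:.0f} kW (vs. a typical 6-10 kW rack budget)")
print(f"energy: ~${monthly_energy_cost(draw):,.0f}/month at $0.12/kWh")
```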
Cost comparison
For a typical 8xA100 setup running 24/7:
- Cloud (AWS p4d.24xlarge on-demand): ~$24K/month
- Cloud (reserved 1-year): ~$16K/month
- Bare metal (colocation + hardware amortized over 3 years): ~$6K–$8K/month
The bare metal option is 2–3x cheaper at high utilization. But it requires upfront capital ($200K–$400K for hardware) and ongoing management. That's where we come in — we handle the infrastructure so your ML team can focus on models.
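The monthly figures above can be sanity-checked; the $32.77/hr p4d.24xlarge rate and the colo line item are assumptions (check current pricing for your region), and the bare metal band shown covers the lower end of the capex range:

```python
# Sanity check on the monthly figures. Rates are assumptions, not quotes.
HOURS_PER_MONTH = 730

on_demand = 32.77 * HOURS_PER_MONTH   # assumed p4d.24xlarge on-demand rate
bm_low = 200_000 / 36 + 750           # capex over 36 months + assumed colo
bm_high = 250_000 / 36 + 750
print(f"cloud on-demand: ~${on_demand:,.0f}/month")
print(f"bare metal:      ~${bm_low:,.0f}-${bm_high:,.0f}/month")
```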