A B2B SaaS company came to us spending $45K/month on AWS with a committed-use discount expiring in 3 months. Their ML team had standardized on Vertex AI and BigQuery for analytics, but their production workloads were still on EKS. They wanted to consolidate everything on GCP — without any customer-visible downtime.
Five weeks later, they were fully running on GCP. Here's how we did it.
Phase 1: Architecture mapping (Week 1)
Before touching anything, we mapped every service, dependency, and data flow:
- 12 microservices on EKS across 3 node groups
- RDS PostgreSQL (primary + read replica) — 800GB
- ElastiCache Redis — session store and job queue
- S3 — 2TB of user-uploaded assets
- CloudFront — CDN for static assets
- Route 53 — DNS with health checks
- SQS + Lambda — event processing pipeline
Every component got a GCP equivalent assigned:
- EKS → GKE
- RDS → Cloud SQL
- ElastiCache → Memorystore
- S3 → Cloud Storage
- CloudFront → Cloud CDN
- Route 53 → Cloud DNS
- SQS + Lambda → Pub/Sub + Cloud Run
Phase 2: Parallel infrastructure (Week 2)
We stood up the entire GCP environment with Terraform, mirroring the AWS topology service for service. Key decisions:
- GKE Autopilot for the Kubernetes cluster (less node management overhead)
- Cloud SQL with high availability (regional) and automated backups
- VPN tunnel between AWS VPC and GCP VPC for the migration period
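The core of that Terraform looked roughly like this. A sketch only — names, project, region, and machine tier are placeholders, not the client's actual configuration:

```hcl
# Autopilot cluster: Google manages the nodes, so no node-group config at all.
resource "google_container_cluster" "primary" {
  name             = "prod-autopilot"
  location         = "us-central1"
  enable_autopilot = true
}

# Regional Cloud SQL instance: standby in a second zone, automatic failover.
resource "google_sql_database_instance" "primary" {
  name             = "prod-postgres"
  database_version = "POSTGRES_14"
  region           = "us-central1"

  settings {
    tier              = "db-custom-8-32768"
    availability_type = "REGIONAL" # high availability
    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
    }
  }
}
```

Autopilot was the biggest simplification: the three EKS node groups had no equivalent to manage at all.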
Phase 3: Data migration (Weeks 3–4)
The hardest part. We ran both systems in parallel:
Database
We used pglogical for continuous logical replication from RDS to Cloud SQL. Initial sync took 6 hours for the 800GB database. After that, changes replicated in near-real-time (sub-second lag). We monitored replication lag continuously and set alerts for anything above 5 seconds.
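The pglogical setup boils down to a handful of SQL calls. Sketched here with placeholder DSNs and names — the real setup also handled sequences and DDL separately:

```sql
-- On RDS (the provider):
CREATE EXTENSION pglogical;
SELECT pglogical.create_node(
  node_name := 'rds_provider',
  dsn := 'host=rds-endpoint dbname=app user=replicator'
);
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

-- On Cloud SQL (the subscriber), after restoring the schema:
SELECT pglogical.create_subscription(
  subscription_name := 'rds_to_cloudsql',
  provider_dsn := 'host=rds-endpoint dbname=app user=replicator'
);

-- Lag check on the provider (the number we alerted on above 5 seconds):
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;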
Object storage
We used rclone with 64 parallel transfers to sync S3 to Cloud Storage. Initial sync: 4 hours. Then a continuous sync job running every 15 minutes to catch new uploads. We configured the application to dual-write to both S3 and GCS during the migration window.
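The dual-write shim is the piece most worth sketching. A minimal version in Python, assuming storage was already behind an interface — the class and method names here are hypothetical, not the client's actual code:

```python
class DualWriteStorage:
    """Write to both backends, read from the primary.

    Sketch only: `primary` and `secondary` are any objects exposing
    put(key, data) and get(key). In production these wrapped the S3
    and GCS SDK clients behind the existing storage interface.
    """

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def put(self, key, data):
        self.primary.put(key, data)    # source of truth during migration
        self.secondary.put(key, data)  # keeps GCS in sync for cutover

    def get(self, key):
        return self.primary.get(key)   # reads stay on the current primary


class InMemoryBackend:
    """Stand-in backend for the sketch."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def get(self, key):
        return self.objects[key]
```

At cutover, flipping `primary` and `secondary` moves reads to GCS without a deploy of new logic; once AWS is decommissioned, the shim collapses back to a single backend.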
Redis
Redis data is ephemeral (sessions and cache). We didn't migrate it — we let it rebuild naturally after cutover. The application handled cache misses gracefully.
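"Handled cache misses gracefully" just means the reads were cache-aside: a miss falls through to the database and repopulates the cache. A sketch, with hypothetical names — `cache` is anything with Redis-style get/set, `load_from_db` is the existing loader:

```python
def get_session(cache, load_from_db, session_id):
    """Cache-aside read: serve from cache, rebuild from the DB on a miss."""
    cached = cache.get(session_id)
    if cached is not None:
        return cached
    session = load_from_db(session_id)  # miss: fall back to source of truth
    if session is not None:
        cache.set(session_id, session)  # warm the new (empty) Memorystore
    return session
```

Because every read path already looked like this, an empty Redis after cutover cost only one slow request per key, not an outage.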
Phase 4: Application deployment (Week 4)
We deployed all 12 services to GKE using the same Helm charts (Kubernetes is Kubernetes). The only changes were:
- Database connection strings → Cloud SQL proxy
- S3 SDK calls → GCS SDK (we had already abstracted storage behind an interface)
- SQS consumers → Pub/Sub consumers
- IAM roles → GCP Workload Identity
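The Workload Identity change was mostly annotation plumbing. A sketch with placeholder names and project — each Kubernetes service account maps to a GCP service account:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api-server
  namespace: prod
  annotations:
    iam.gke.io/gcp-service-account: api-server@my-project.iam.gserviceaccount.com
```

The GCP side grants `roles/iam.workloadIdentityUser` on that service account to the member `serviceAccount:my-project.svc.id.goog[prod/api-server]`. After that, pods get GCP credentials automatically — no more long-lived AWS access keys in secrets.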
Phase 5: DNS cutover (Week 5)
The zero-downtime moment. Our strategy:
- Lower DNS TTL to 60 seconds (done 48 hours before cutover)
- Verify GKE is serving traffic correctly via a staging domain
- Verify database replication lag is under 1 second
- Switch DNS from AWS ALB to GCP load balancer
- Monitor error rates for 30 minutes
- If clean → stop replication and decommission AWS read path
- If errors → revert DNS (60-second TTL means recovery in ~1 minute)
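The go/no-go decision in that checklist can be sketched as a simple gate. The thresholds below are illustrative, and in practice the inputs came from the monitoring stack rather than hardcoded constants:

```python
def go_no_go(error_rate, replication_lag_s,
             max_error_rate=0.001, max_lag_s=1.0):
    """Cutover gate: return 'proceed', or 'revert: <reason>'.

    Sketch only -- error_rate is the fraction of 5xx responses in the
    monitoring window, replication_lag_s the pglogical lag in seconds.
    """
    if replication_lag_s > max_lag_s:
        return "revert: replication lag too high"
    if error_rate > max_error_rate:
        return "revert: elevated error rate"
    return "proceed"
```

The point of writing it down, even this crudely, is that the revert path is decided before the cutover, not debated during it.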
The cutover was clean. Zero errors, zero customer complaints. We kept the AWS infrastructure running in read-only mode for 1 week as a safety net, then decommissioned it.
Results
- Downtime: Zero (DNS-based cutover with 60s TTL)
- Migration duration: 5 weeks end-to-end
- Cost savings: $45K → $31K/month (31% reduction, mostly from GKE Autopilot and committed-use discounts)
- Performance: 15% improvement in API latency (co-located with Vertex AI and BigQuery)
"We were terrified of the migration. DevOps Team made it feel routine. Not a single customer noticed."