Migrating from AWS to GCP: A Zero-Downtime Playbook

A B2B SaaS company came to us spending $45K/month on AWS, with reserved-capacity discounts expiring in 3 months. Their ML team had standardized on Vertex AI and BigQuery for analytics, but their production workloads were still on EKS. They wanted to consolidate everything on GCP — without any customer-visible downtime.

Five weeks later, they were fully running on GCP; a week after that, the last AWS resource was decommissioned. Here's how we did it.

Phase 1: Architecture mapping (Week 1)

Before touching anything, we mapped every service, dependency, and data flow:

  • 12 microservices on EKS across 3 node groups
  • RDS PostgreSQL (primary + read replica) — 800GB
  • ElastiCache Redis — session store and job queue
  • S3 — 2TB of user-uploaded assets
  • CloudFront — CDN for static assets
  • Route 53 — DNS with health checks
  • SQS + Lambda — event processing pipeline

Every component got a GCP equivalent assigned: EKS→GKE, RDS→Cloud SQL, ElastiCache→Memorystore, S3→Cloud Storage, CloudFront→Cloud CDN, Route 53→Cloud DNS, SQS+Lambda→Pub/Sub+Cloud Run.

Phase 2: Parallel infrastructure (Week 2)

We stood up the entire GCP environment with Terraform, mirroring the AWS setup service-for-service. Key decisions:

  • GKE Autopilot for the Kubernetes cluster (less node management overhead)
  • Cloud SQL with high availability (regional) and automated backups
  • VPN tunnel between AWS VPC and GCP VPC for the migration period
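The core of that Terraform setup can be sketched roughly as follows. This is an illustrative fragment, not the client's actual configuration — project, names, region, and machine tier are placeholders:

```hcl
# Sketch only: names, region, and tier are illustrative.
resource "google_container_cluster" "primary" {
  name             = "prod-cluster"
  location         = "us-central1"
  enable_autopilot = true # Autopilot: GKE manages nodes, we manage workloads
}

resource "google_sql_database_instance" "primary" {
  name             = "prod-postgres"
  database_version = "POSTGRES_14"
  region           = "us-central1"

  settings {
    tier              = "db-custom-8-32768"
    availability_type = "REGIONAL" # HA: synchronous standby in a second zone
    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
    }
  }
}
```

With Autopilot there are no node pools to size — you pay per pod request — which is where much of the later cost savings came from.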

Phase 3: Data migration (Weeks 3–4)

The hardest part. We ran both systems in parallel:

Database

We used pglogical for continuous logical replication from RDS to Cloud SQL. Initial sync took 6 hours for the 800GB database. After that, changes replicated in near-real-time (sub-second lag). We monitored replication lag continuously and set alerts for anything above 5 seconds.
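The pglogical setup looks roughly like this. A sketch with illustrative hostnames and credentials — and note that on RDS the `rds.logical_replication` parameter must be enabled in the parameter group first:

```sql
-- On the RDS source (the "provider" node). DSNs are illustrative.
CREATE EXTENSION IF NOT EXISTS pglogical;

SELECT pglogical.create_node(
  node_name := 'rds_provider',
  dsn       := 'host=rds-primary.internal dbname=app user=replicator'
);

-- Replicate every table in the public schema.
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

-- On the Cloud SQL target (the "subscriber" node):
SELECT pglogical.create_node(
  node_name := 'cloudsql_subscriber',
  dsn       := 'host=10.0.0.5 dbname=app user=replicator'
);

-- Kicks off the initial copy, then streams changes continuously.
SELECT pglogical.create_subscription(
  subscription_name := 'rds_to_cloudsql',
  provider_dsn      := 'host=rds-primary.internal dbname=app user=replicator'
);
```

`pglogical.show_subscription_status()` on the subscriber is what we polled for the lag alerting.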

Object storage

We used rclone with 64 parallel transfers to sync S3 to Cloud Storage. Initial sync: 4 hours. Then a continuous sync job running every 15 minutes to catch new uploads. We configured the application to dual-write to both S3 and GCS during the migration window.
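The sync command was essentially a one-liner, re-run on a 15-minute schedule. A sketch — `s3:` and `gcs:` are rclone remotes configured ahead of time with `rclone config`, and the bucket names are illustrative:

```shell
# Bulk sync S3 -> GCS; idempotent, so the same command serves as the
# initial copy and the recurring catch-up job.
rclone sync s3:acme-user-assets gcs:acme-user-assets \
  --transfers 64 \
  --checkers 128 \
  --fast-list \
  --checksum \
  --log-level INFO
```

`--checksum` compares hashes rather than modification times, which matters when the same object exists on both sides with different timestamps.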

Redis

Redis data is ephemeral (sessions and cache). We didn't migrate it — we let it rebuild naturally after cutover. The application handled cache misses gracefully.

Phase 4: Application deployment (Week 4)

We deployed all 12 services to GKE using the same Helm charts (Kubernetes is Kubernetes). The only changes were:

  • Database connection strings → Cloud SQL proxy
  • S3 SDK calls → GCS SDK (we had already abstracted storage behind an interface)
  • SQS consumers → Pub/Sub consumers
  • IAM roles → GCP Workload Identity
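The Workload Identity change, for example, reduces to one annotation per service. A sketch with illustrative project, namespace, and account names:

```yaml
# Maps a Kubernetes ServiceAccount to a GCP service account, so pods get
# GCP credentials without long-lived keys. Names are illustrative.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: orders-api
  namespace: prod
  annotations:
    iam.gke.io/gcp-service-account: orders-api@my-project.iam.gserviceaccount.com
```

The matching IAM side is a `gcloud iam service-accounts add-iam-policy-binding` granting `roles/iam.workloadIdentityUser` to the member `serviceAccount:my-project.svc.id.goog[prod/orders-api]`.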

Phase 5: DNS cutover (Week 5)

The zero-downtime moment. Our strategy:

  1. Lower DNS TTL to 60 seconds (done 48 hours before cutover)
  2. Verify GKE is serving traffic correctly via a staging domain
  3. Verify database replication lag is under 1 second
  4. Switch DNS from AWS ALB to GCP load balancer
  5. Monitor error rates for 30 minutes
  6. If clean → stop replication and decommission AWS read path
  7. If errors → revert DNS (60-second TTL means recovery in ~1 minute)
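Steps 1 and 4 are each a single record update. A sketch assuming the zone was already authoritative in Cloud DNS at cutover time (with Route 53 the equivalent is a `change-resource-record-sets` call); zone name and addresses are illustrative:

```shell
# 48h before cutover: drop the TTL while still pointing at AWS.
gcloud dns record-sets update app.example.com. \
  --zone=prod-zone --type=A --ttl=60 \
  --rrdatas=203.0.113.10   # AWS load balancer

# Cutover: repoint at the GCP load balancer. Rollback is the same
# command with the old address.
gcloud dns record-sets update app.example.com. \
  --zone=prod-zone --type=A --ttl=60 \
  --rrdatas=198.51.100.20  # GCP load balancer
```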

The cutover was clean. Zero errors, zero customer complaints. We kept the AWS infrastructure running in read-only mode for 1 week as a safety net, then decommissioned it.

Results

  • Downtime: Zero (DNS-based cutover with 60s TTL)
  • Migration duration: 5 weeks end-to-end
  • Cost savings: $45K → $31K/month (31% reduction, mostly from GKE Autopilot and committed-use discounts)
  • Performance: 15% improvement in API latency (co-located with Vertex AI and BigQuery)

"We were terrified of the migration. DevOps Team made it feel routine. Not a single customer noticed."

Planning a cloud migration?

We'll map your current architecture and give you a realistic timeline and cost estimate — for free.

Book Free Assessment