"We use Terraform" has become something companies say about their infrastructure the way they say "we have CI/CD" when they have a Jenkins job that someone built in 2019 and nobody touches anymore. Having Terraform files in a repository is not the same as managing your infrastructure as code.
The distinction matters in production. Terraform written badly creates a false sense of security — you think your infrastructure is reproducible and auditable because there's a .tf file somewhere, but the state file hasn't been updated since someone applied it manually from their laptop six months ago. That's worse than no IaC, because at least without IaC you know to be careful.
This is a practical guide to structuring Terraform for teams — the state management, module design, and CI/CD integration patterns that make infrastructure changes safe and reviewable.
State Management: The Foundation Everything Else Depends On
Terraform state is not a file you commit to git. It contains plaintext resource IDs, IP addresses, and sometimes sensitive output values. State access also needs locking: if two engineers run terraform apply simultaneously against the same state, the concurrent writes can corrupt the state file and leave it out of sync with your real infrastructure.
The standard AWS setup: state in S3 with DynamoDB locking.
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/eks/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}
State file structure: one state file per logical unit. "Logical unit" means a group of resources that are always changed together and share a lifecycle. Your VPC is one state file. Your EKS cluster is a separate state file. Your application's RDS instance is a third. The reason: a single state file with 500 resources means every terraform plan refreshes all 500 resources — slow, and a blast radius that's larger than it needs to be.
The critical rule: never share state between unrelated resources. The moment you put your production database and your staging application in the same state file, a botched staging apply can lock out production changes. State files are cheap — use one per logical unit.
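Splitting state this way raises the question of how one unit references another's resources. Across state files, module outputs can't be referenced directly; the standard bridge is the terraform_remote_state data source. A minimal sketch, assuming the bucket from above and an illustrative private_subnet_ids output on the VPC unit:

# In the EKS unit: read the VPC unit's outputs from its remote state.
data "terraform_remote_state" "vpc" {
  backend = "s3"
  config = {
    bucket = "mycompany-terraform-state"
    key    = "production/vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

# Then reference an output the VPC unit exposes (name is illustrative):
# data.terraform_remote_state.vpc.outputs.private_subnet_ids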
For GCP, use the gcs backend: a Cloud Storage bucket with object versioning enabled. The GCS backend provides native state locking, so no separate lock table is needed. For Azure, use the azurerm backend with a Blob Storage container, which locks natively via blob leases.
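A minimal GCS backend sketch (bucket and prefix names are illustrative):

terraform {
  backend "gcs" {
    bucket = "mycompany-terraform-state"
    prefix = "production/eks"
  }
}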
Module Structure That Scales
The shallow-versus-deep module debate: should you build small, single-purpose modules or deep, opinionated modules that bundle a full service? The answer depends on how much abstraction your team actually needs.
For most platform teams (5-20 engineers managing shared infrastructure), shallow modules with clear interfaces work better than deep opinionated ones. Deep modules hide complexity until the abstraction breaks — and then you need to understand all of it at once. Shallow modules are easy to reason about and compose.
A practical repo structure that works for growing teams:
infrastructure/
  modules/
    vpc/
      main.tf       # subnets, route tables, NAT gateways
      variables.tf  # input variables with types and validation
      outputs.tf    # values other modules consume
    eks/
      main.tf       # cluster, node groups, IRSA, add-ons
      variables.tf
      outputs.tf
    rds/
      main.tf       # instance, parameter group, subnet group, backups
      variables.tf
      outputs.tf
    iam/
      main.tf       # roles and policies
      variables.tf
      outputs.tf
  environments/
    staging/
      main.tf       # calls modules with staging-specific values
      variables.tf
      backend.tf
      terraform.tfvars
    production/
      main.tf
      variables.tf
      backend.tf
      terraform.tfvars
The output discipline rule: every module's outputs.tf should expose every value other modules or environments might need. VPC module outputs subnet IDs. EKS module outputs cluster endpoint and OIDC issuer URL. RDS module outputs endpoint and secret ARN. Don't use data sources to look up things you just created — pass them through outputs.
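A minimal sketch of the pattern; the variable and output names are illustrative, not a fixed interface:

# modules/vpc/outputs.tf
output "private_subnet_ids" {
  description = "IDs of the private subnets, consumed by the EKS module"
  value       = aws_subnet.private[*].id
}

# environments/staging/main.tf: wire one module's outputs into the next,
# rather than looking up freshly created resources with data sources.
module "vpc" {
  source     = "../../modules/vpc"
  cidr_block = "10.10.0.0/16"
}

module "eks" {
  source     = "../../modules/eks"
  subnet_ids = module.vpc.private_subnet_ids
}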
Variable Discipline
Sloppy variable definitions are infrastructure debt. The worst offender:
# This is terrible:
variable "anything" {}

# This is fine:
variable "instance_type" {
  type        = string
  description = "EC2 instance type for the EKS node group"
  default     = "t3.medium"

  validation {
    condition     = contains(["t3.medium", "t3.large", "m5.large", "m5.xlarge"], var.instance_type)
    error_message = "instance_type must be one of the approved instance types."
  }
}
Type constraints prevent the class of bugs where a number gets passed as a string and Terraform silently coerces it in ways that don't match what you intended. Validation blocks catch configuration errors at terraform plan time rather than at apply time — faster feedback, less blast radius.
Sensitive variables: mark them sensitive = true. Terraform will redact their values from plan output and state file display. They're still stored in the state file — another reason to encrypt your S3 state bucket and restrict access to it.
The terraform.tfvars file is for environment-specific values. Don't commit sensitive values here — reference them from environment variables (TF_VAR_) or a secrets manager. If terraform.tfvars contains a database password, it's going to end up in git eventually.
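A minimal sketch combining both rules, with a hypothetical db_password variable:

variable "db_password" {
  type        = string
  description = "Master password for the RDS instance"
  sensitive   = true # redacted from plan output, but still present in state
}

# Supply at runtime instead of committing it to terraform.tfvars:
#   export TF_VAR_db_password="$(your-secrets-manager-lookup)"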
Terraform in CI/CD
terraform plan as a PR check, plan output as a PR comment, terraform apply on merge to main. This workflow is the minimum viable CI/CD for Terraform and you can implement it with GitHub Actions in an afternoon.
# .github/workflows/terraform.yml (simplified)
on:
  pull_request:
    paths: ['infrastructure/**']
  push:
    branches: [main]
    paths: ['infrastructure/**']

permissions:
  id-token: write      # required for OIDC federation to AWS
  contents: read
  pull-requests: write # lets the plan job comment on the PR

jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::ACCOUNT:role/terraform-plan
          aws-region: us-east-1
      - run: terraform init && terraform plan -out=tfplan
      - uses: actions/github-script@v7
        with:
          script: |
            // Post plan output as PR comment
  apply:
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::ACCOUNT:role/terraform-apply
          aws-region: us-east-1
      - run: terraform init && terraform apply -auto-approve
The OIDC approach for AWS credentials is important: the GitHub Actions job assumes an AWS IAM role via OIDC federation. No long-lived access keys stored as GitHub secrets. The plan role has read-only permissions; the apply role has the permissions to create and modify resources.
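The trust relationship on the AWS side can be expressed in Terraform itself. A hedged sketch, assuming a hypothetical mycompany/infrastructure repository; the thumbprint below is a placeholder to be taken from GitHub's current documentation:

# The GitHub OIDC identity provider (one per AWS account).
resource "aws_iam_openid_connect_provider" "github" {
  url            = "https://token.actions.githubusercontent.com"
  client_id_list = ["sts.amazonaws.com"]
  # Placeholder: use the current thumbprint from GitHub's docs.
  thumbprint_list = ["0000000000000000000000000000000000000000"]
}

# Trust policy: only workflows from this repo may assume the plan role.
data "aws_iam_policy_document" "github_oidc_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:mycompany/infrastructure:*"]
    }
  }
}

resource "aws_iam_role" "terraform_plan" {
  name               = "terraform-plan"
  assume_role_policy = data.aws_iam_policy_document.github_oidc_trust.json
}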
For larger teams that outgrow a hand-rolled GitHub Actions workflow: Atlantis (open source, self-hosted) and Spacelift (a commercial platform) both provide purpose-built Terraform CI/CD with plan/apply workflows, PR comments, and policy enforcement. Spacelift adds policy-as-code (OPA) for enforcing organizational standards across all Terraform runs, which is useful when multiple teams manage different parts of the infrastructure.
Drift Detection and Prevention
Drift happens when someone modifies infrastructure outside of Terraform — manually in the AWS console, via AWS CLI in response to an incident, or via another automation tool that doesn't update Terraform state. The result: your state file claims one thing, the actual infrastructure is another.
The simple drift detection: a scheduled terraform plan in CI that runs every 24 hours and creates an alert if it detects changes. Not a full apply — just a plan. The plan output shows you what has drifted without changing anything.
# Scheduled workflow: daily drift detection
on:
  schedule:
    - cron: '0 9 * * 1-5' # weekdays at 9am UTC

jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3 # wrapper exposes terraform's exit code as a step output
      - run: terraform init && terraform plan -detailed-exitcode
        continue-on-error: true # exit code 2 means drift, not a broken workflow
        id: plan
      - name: Alert on drift
        if: steps.plan.outputs.exitcode == '2'
        uses: actions/github-script@v7
        with:
          script: |
            // Create issue or Slack alert
When you detect drift, you have two choices: update the Terraform configuration to match what actually exists (using terraform import if the resource was created outside Terraform entirely), or terraform apply to restore the Terraform-defined state. The right choice depends on whether the manual change was intentional, which is why your post-incident process should always ask "was this change backported into Terraform?"
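If the manual change created a resource Terraform doesn't know about at all, Terraform 1.5+ can adopt it declaratively with an import block; the resource address and ID here are illustrative:

import {
  to = aws_security_group.incident_hotfix
  id = "sg-0123456789abcdef0"
}

# Then run `terraform plan -generate-config-out=generated.tf` to draft
# matching configuration, review it, and apply.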
The terraform refresh command (deprecated in favor of terraform apply -refresh-only) updates state to match current infrastructure without changing the infrastructure itself. Use it sparingly: it rewrites your state file to match reality even if reality is wrong. Better to understand the drift before reconciling it.
Testing Your Infrastructure
Three levels of testing that are worth the time investment:
Static analysis before every PR: tflint for Terraform-specific linting (unused variables, deprecated syntax, provider-specific rules), tfsec or checkov for security policy scanning. Checkov has rules for common misconfigurations: S3 buckets with public access, security groups open to 0.0.0.0/0, RDS without encryption. These run in seconds and catch the bulk of common mistakes before anything is deployed.
# Add to your CI pipeline
- name: Run tflint
  run: tflint --recursive
- name: Run checkov
  run: checkov -d . --framework terraform --compact
Module unit tests with the built-in test framework: Terraform 1.6+ ships a native test framework. Write .tftest.hcl files that exercise your module (plan-only, or provisioned in a test account) and assert on the outputs. Not every module needs this; start with the ones that are most complex or most frequently modified.
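A plan-only sketch of what such a test can look like, assuming illustrative variable and output names on the VPC module (command = apply would provision for real):

# modules/vpc/tests/subnets.tftest.hcl
run "one_private_subnet_per_az" {
  command = plan # plan-only: asserts without provisioning real resources

  variables {
    cidr_block         = "10.0.0.0/16"
    availability_zones = ["us-east-1a", "us-east-1b"]
  }

  assert {
    condition     = length(output.private_subnet_ids) == 2
    error_message = "Expected one private subnet per availability zone."
  }
}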
Integration tests with Terratest: a Go-based test framework that provisions real infrastructure, runs assertions against it, and destroys it. Expensive to run (real AWS resources cost money and take time), but it catches the class of bugs that static analysis misses: IAM permission errors, availability zone constraints, resource limit issues. Run these in a dedicated test account on a schedule, not on every PR.
Want this applied to your infrastructure?
We audit existing Terraform codebases and implement the state management, module structure, and CI/CD patterns that make infrastructure changes safe to review and apply. If you're starting from scratch or cleaning up years of accumulated IaC debt, we've done both.
Book Free Audit