CI/CD Pipeline Best Practices for Engineering Teams

There's a difference between a CI/CD pipeline and a CI/CD pipeline you'd trust at 11pm when you need to push a hotfix and your manager is watching. The first kind runs jobs, produces green checks, and deploys things. The second kind does all that plus gives you fast feedback, a deployment you can observe in real time, and a rollback you can execute in under 5 minutes if the metrics go wrong.

Most pipelines we inherit are the first kind. They work, but only if nothing goes wrong. The moment something does go wrong, the pipeline becomes an obstacle instead of a tool: 20-minute CI runs that you have to wait on before you can push a fix, a rollback that involves manually editing Kubernetes deployments, secrets scattered in environment variables that require someone with console access to rotate.

These are the patterns we apply to every pipeline we build or fix.

Fast Feedback Above Everything

The number one pipeline antipattern: treating CI time as a constant. We've taken pipelines from 40 minutes to 8 minutes without removing any meaningful test coverage. The techniques that work:

Parallelise everything that can be parallelised. Unit tests don't need to wait for lint. Lint doesn't need to wait for dependency install on a different runner. Split your test suite into parallel jobs:

# GitHub Actions parallel test jobs
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run lint

  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test -- --testPathPattern=unit

  integration-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test -- --testPathPattern=integration

Cache dependencies aggressively. Node modules, Python virtualenvs, Go module caches, Docker layer caches — cache everything that doesn't change between commits. A 3-minute dependency install that could be a 10-second cache hit is 3 minutes of developer time multiplied by every push, every PR, every day.

- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      ${{ runner.os }}-node-

Fail fast on cheap checks. Lint and formatting checks are seconds. Unit tests are minutes. Integration tests are tens of minutes. Run them in that order — don't wait 15 minutes for integration tests to tell you there was a missing semicolon.
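
One way to express that ordering in the GitHub Actions layout above, as a sketch rather than the only option: keep lint and unit tests parallel, but gate the expensive integration job on them with needs, so a failed cheap check stops the run before the slow job ever starts.

  # Sketch: only start the slow job once the cheap checks are green
  integration-tests:
    needs: [lint, unit-tests]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test -- --testPathPattern=integration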

The 10-minute ceiling matters psychologically. Pipelines under 10 minutes get treated as a gate — engineers wait for them before moving on. Pipelines over 10 minutes get treated as a background process — engineers context-switch and don't come back promptly. Slow pipelines get bypassed.

Trunk-Based Development vs Long-Lived Branches

Long-lived feature branches are a merge problem factory. The longer a branch lives, the further it diverges from main and the harder the merge becomes. We've seen branches that took three days to merge — longer than the feature took to build — because of accumulated conflicts.

Trunk-based development (commit directly to main, or short-lived branches that merge in hours, not days) eliminates this class of problem. The objection: "we can't merge half-finished features to main." The answer: feature flags.

Feature flags let you deploy code to production before it's user-facing. The new feature is behind a flag, disabled by default. It goes through the same CI/CD pipeline, runs in production, but users don't see it. When you're ready to release, flip the flag — no deployment required. When something goes wrong, flip it back — rollback in seconds, no git revert, no deployment.

The tooling: LaunchDarkly for teams that want a managed service with targeting rules and analytics. Unleash for teams that want open-source and self-hosted. For simple use cases, feature flags in environment variables or a config service work fine without additional infrastructure.
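
As a minimal sketch of the environment-variable approach (the flag name and Deployment fields here are illustrative, not from any particular codebase): the flag is plain configuration on the Deployment, so releasing the feature is a config change and a pod restart rather than a new build through the pipeline.

# Illustrative Deployment excerpt; FEATURE_NEW_CHECKOUT is a hypothetical flag name
spec:
  template:
    spec:
      containers:
      - name: myapp
        image: myapp:1.4.2
        env:
        - name: FEATURE_NEW_CHECKOUT
          value: "false"   # flip to "true" and restart the pods to release the feature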

Deployment Strategies

Rolling deployments are the Kubernetes default and are appropriate for most services. New pods come up as old ones are drained, with maxUnavailable and maxSurge controlling how far the rollout can dip below or overshoot the desired replica count. Simple, built-in, and zero-downtime for stateless services.
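
What tuning those knobs looks like on a Deployment, as a sketch with illustrative values:

# Illustrative rolling-update settings
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never go below the desired replica count
      maxSurge: 1         # bring up at most one extra pod at a time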

Blue-green deployments maintain two identical environments — only one receives live traffic at a time. The cutover is instant (a DNS or load balancer change). The rollback is equally instant. Appropriate when you can afford the cost of running double the production infrastructure and when you need a clean rollback path for stateful changes.
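
The plainest Kubernetes expression of this, as a sketch (myapp and the blue/green labels are illustrative): two identical Deployments labelled blue and green, and a Service whose selector decides which one receives traffic. Cutover and rollback are both a one-line selector change.

# Illustrative: the Service selector is the blue/green switch
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    slot: blue    # point at "green" to cut over, back to "blue" to roll back
  ports:
  - port: 80
    targetPort: 8080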

Canary deployments route a percentage of traffic to the new version. Start at 5%, watch the error rate and latency, scale up to 25%, 50%, 100% if metrics stay healthy. This is the right strategy for high-traffic services where you want to validate real production traffic before full rollout — but it requires metrics that can distinguish canary traffic from baseline traffic.

Argo Rollouts implements both blue-green and canary natively in Kubernetes with Prometheus metric analysis:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
      - setWeight: 5
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: success-rate
      - setWeight: 50
      - pause: {duration: 5m}
      - setWeight: 100
---
# The success condition lives in the referenced AnalysisTemplate, not on the Rollout itself
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
  - name: success-rate
    successCondition: result[0] >= 0.99   # 99% success rate required
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090   # adjust to your Prometheus
        # illustrative query; adapt the metric names to your service
        query: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          / sum(rate(http_requests_total[5m]))

If the success-rate analysis fails, Argo Rollouts automatically rolls back to the previous version. No human intervention required.

Rollback in Under 5 Minutes

Rollback speed is an architectural decision, not an emergency procedure. If rollback takes 30 minutes, you will hesitate to use it — which means you'll be running a broken deployment longer than you should while you try to fix-forward. If rollback takes 5 minutes, you use it immediately and fix the problem in staging.

GitOps makes this trivial: revert the merge commit that caused the problem, push to main, ArgoCD reconciles the cluster back to the previous state. From commit to production rollback: under 5 minutes, no manual kubectl commands, fully auditable.

git revert HEAD --no-edit
git push origin main
# ArgoCD detects the change and rolls back the deployment automatically

The prerequisite: your deployments are driven from git manifests, not from ad-hoc kubectl set image commands. If the cluster state isn't in git, you can't roll back by reverting git. This is the most important operational reason to adopt GitOps.
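
The ArgoCD side of that setup looks roughly like this, assuming an Application with automated sync (the repo URL, path, and names are placeholders); automated sync with selfHeal is what turns a git revert into a cluster rollback without anyone touching kubectl:

# Illustrative ArgoCD Application; repoURL and path are placeholders
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/myapp-manifests
    targetRevision: main
    path: deploy/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from git
      selfHeal: true   # revert manual drift back to what git declares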

For database migrations: forward-only where possible, using expand/contract patterns. Add the new column (expand), deploy an application version that can handle both the old and new schema, migrate the data, then remove the old column (contract). Rolling back an applied database migration is almost always more dangerous than rolling it forward — design your schema changes so rollback means deploying the previous app version, not reversing the migration.

Secret Management in Pipelines

Long-lived credentials in GitHub secrets or GitLab CI variables are a compliance finding waiting to happen. When an engineer who had access to the repo leaves the company, did anyone rotate the AWS access key stored in that secret? Probably not.

OIDC roles eliminate this problem. GitHub Actions can assume an AWS IAM role via OIDC federation — the role is scoped to a specific repository and branch, the credential is valid for the duration of the job only, and there's nothing to rotate because nothing is stored.

# GitHub Actions OIDC with AWS
# Requires `permissions: id-token: write` on the job so the OIDC token can be issued
- uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789:role/github-actions-deploy
    role-session-name: deploy-${{ github.sha }}
    aws-region: us-east-1
    # No access-key-id or secret-access-key — uses OIDC

The corresponding IAM role has a trust policy that only allows assumption from your specific GitHub repository and the main branch. A fork of your repo can't assume the role. A PR branch can't assume the production deploy role.
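
Roughly what that trust policy looks like (a sketch; the account ID, provider ARN, and repo path are placeholders to adapt):

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
        "token.actions.githubusercontent.com:sub": "repo:myorg/myapp:ref:refs/heads/main"
      }
    }
  }]
}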

For secrets that need to be available to the application at runtime (not the pipeline): External Secrets Operator pulling from AWS Secrets Manager, Vault, or GCP Secret Manager. Secrets are fetched at pod startup. If you need to rotate a database password, update it in Secrets Manager and bounce the pods — no pipeline changes, no code changes.
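
A sketch of the External Secrets side with AWS Secrets Manager (the store name and secret paths are illustrative): the ExternalSecret maps an entry in Secrets Manager to an ordinary Kubernetes Secret that the pods mount or read as environment variables.

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: myapp-db
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager   # a ClusterSecretStore configured separately
    kind: ClusterSecretStore
  target:
    name: myapp-db              # the Kubernetes Secret that gets created
  data:
  - secretKey: DATABASE_PASSWORD
    remoteRef:
      key: prod/myapp/db        # the entry in AWS Secrets Manager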

The Pipeline Security Layer

Security scanning in CI catches the problems that would otherwise show up in a penetration test or a security audit — usually at a more expensive and more stressful moment. The three checks worth adding to every pipeline:

SAST (Static Application Security Testing): Semgrep is the most practical — open-source, fast, rule sets for every major language, and the community rules are genuinely useful (they catch real vulnerabilities, not just style issues). Runs in seconds to minutes depending on codebase size. Add as a required check on every PR.
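
A minimal pipeline step, as a sketch (the rule set choice is illustrative; teams usually tune it):

- name: Semgrep SAST
  run: |
    pip install semgrep
    semgrep scan --config auto --error   # --error makes findings fail the check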

Dependency scanning: Dependabot for GitHub repositories — it automatically opens PRs for dependency updates with security advisories. Running npm audit or pip-audit as a pipeline check fails the build if a dependency has a high-severity CVE. Both take seconds and add meaningful coverage with almost no configuration.
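
Enabling Dependabot is a short config file (a sketch for an npm project; the schedule is a matter of taste):

# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "npm"
    directory: "/"
    schedule:
      interval: "weekly"

The audit-as-gate version is a single pipeline step: npm audit --audit-level=high exits non-zero when an advisory at or above that severity matches, which fails the job.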

Container image scanning: Trivy before every production deploy. Point it at the built image and fail the pipeline if it finds HIGH or CRITICAL CVEs that have available fixes:

- name: Scan container image
  run: |
    trivy image \
      --severity HIGH,CRITICAL \
      --ignore-unfixed \
      --exit-code 1 \
      myapp:${{ github.sha }}

The --ignore-unfixed flag is important: it skips CVEs that have no available fix yet. Without it, you'll be failing pipelines for vulnerabilities that the OS vendor hasn't patched — noise that trains engineers to ignore the scanner.

Want this applied to your pipeline?

We build and rebuild CI/CD pipelines — GitHub Actions, GitLab CI, ArgoCD. If your current pipeline is slow, fragile, or missing security controls, we can audit it and deliver a prioritized improvement plan. We've taken pipelines from 45 minutes to under 10 and replaced manual deployment steps with GitOps-driven rollouts.

Book Free Audit

Related: DevOps Services · Kubernetes Consulting
