A production-style GitOps + DevSecOps reference architecture for AWS that demonstrates how a small/medium team can ship infrastructure-as-code changes safely through automated security gates, ephemeral dev environments, and human-gated promotion to production.
What problem does this solve? "Someone wants to add a new service to our AWS platform. How do we make sure their change doesn't break production, doesn't introduce vulnerabilities, follows our policies, and gets reviewed by the right people, all automatically and with a clear audit trail?"
This repo is the answer.
- What's inside
- Architecture
- The GitOps flow
- Tech stack
- Repository layout
- Quick start
- How a developer uses this
- Security controls
- OPA policies enforced
- Cost
- Customisation
- Roadmap
- Troubleshooting
- License
A multi-region, multi-cluster AWS Kubernetes platform with:
- 3 EKS clusters across 3 regions: clear separation of concerns
- Self-hosted GitHub Actions runners inside the security cluster (no third-party CI access to AWS)
- GitOps pipeline driven by GitHub PRs: no manual `terragrunt apply` once set up
- Security gates at every step: secret scanning, IaC scanning, container scanning, policy enforcement
- Ephemeral dev environments: created on merge to `dev`, destroyed on merge to `main`
- Centralised vulnerability management: every scan feeds into DefectDojo
- IRSA everywhere: pods get AWS credentials through OIDC, no static keys
- Cost-aware NAT with fck-nat instead of $32/month NAT Gateways

    ┌──────────────────────────────────────────────────────────────────────────┐
    │                                                                          │
    │ Paris (eu-west-3)               Ireland (eu-west-1)      Frankfurt       │
    │ ─────────────────               ───────────────────      (eu-central-1)  │
    │ security-eks (PERMANENT)        dev-eks (EPHEMERAL)      prod-eks        │
    │                                                          (PERMANENT)     │
    │ ┌─────────────────────┐         ┌─────────────────┐                      │
    │ │ SonarQube           │         │ Test workloads  │      ┌──────────┐    │
    │ │ DefectDojo          │  scan   │ created by PR   │      │ Real     │    │
    │ │ Dependency Track    │ ──────► │ destroyed on    │ ───► │ services │    │
    │ │ Vault               │         │ merge to main   │      │          │    │
    │ │ Harbor              │         └─────────────────┘      └──────────┘    │
    │ │ Prometheus+Grafana  │                                                  │
    │ │ Atlantis            │                                                  │
    │ │ GitHub Runner (ARC) │                                                  │
    │ └─────────────────────┘                                                  │
    │                                                                          │
    └──────────────────────────────────────────────────────────────────────────┘
                 ▲                             ▲                    ▲
                 │                             │                    │
                 └───── runs CI workflows ─────┴─── deployed by ────┘
                                                      GitOps

| Region | Purpose | Why this region |
|---|---|---|
| `eu-west-3` (Paris) | Security tooling, permanent | Central scanning, separate blast radius |
| `eu-west-1` (Ireland) | Ephemeral dev, short-lived | Lots of capacity, fast spin-up |
| `eu-central-1` (Frankfurt) | Production workloads, permanent | Low latency for EU users |

            ┌──────────────┐
            │  Developer   │
            │ feature/xyz  │
            └──────┬───────┘
                   │ git push
                   ▼
    ┌────────────────────────────────────────┐
    │ PR feature/xyz → dev                   │
    ├────────────────────────────────────────┤
    │ • Security Scan workflow runs          │
    │   - TruffleHog (secrets)               │
    │   - KICS (IaC)                         │
    │   - Checkov (IaC)                      │
    │   - findings → DefectDojo              │
    │ • PR Checks workflow runs              │
    │   - terragrunt fmt/validate/plan       │
    │   - OPA/Conftest policies              │
    │   - SonarQube SAST                     │
    │   - Trivy scan                         │
    │   - findings → DefectDojo              │
    └──────────────┬─────────────────────────┘
                   │ reviewer approves
                   │ merge
                   ▼
    ┌────────────────────────────────────────┐
    │ push → dev                             │
    ├────────────────────────────────────────┤
    │ Deploy Dev workflow                    │
    │ 1. assume AWS role via OIDC            │
    │ 2. terragrunt apply environments/dev   │
    │ 3. smoke tests                         │
    │ → Cluster dev (Ireland) is up          │
    └──────────────┬─────────────────────────┘
                   │ Manual testing in dev
                   │ (kubectl, integration tests, ...)
                   ▼
    ┌────────────────────────────────────────┐
    │ PR dev → main                          │
    ├────────────────────────────────────────┤
    │ • Same scans rerun against final state │
    │ • Reviewer approves                    │
    │ • Merge                                │
    └──────────────┬─────────────────────────┘
                   │ pull_request closed + merged
                   ▼
    ┌────────────────────────────────────────┐
    │ Deploy Prod workflow                   │
    ├────────────────────────────────────────┤
    │ Job 1: destroy-dev                     │
    │   terragrunt destroy environments/dev  │
    │   (dev was just for validation)        │
    │                                        │
    │ Job 2: deploy-prod                     │
    │   environment: production              │
    │   (manual approval gate)               │
    │   terragrunt apply environments/prod   │
    │ → Prod (Frankfurt) updated             │
    └────────────────────────────────────────┘

Dev is ephemeral for two reasons:
- Cost: running dev 24/7 means paying for a third full cluster around the clock, even when nobody is testing. Spin it up only when needed.
- State hygiene: every PR gets a fresh cluster. No "works on my dev" because the dev was tweaked by hand three releases ago.
The trade-off: if multiple PRs are in flight, they queue (only one dev at a time in this repo). For multi-team scale, see the Roadmap section about preview environments per PR.
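
The lifecycle in the flow above boils down to a mapping from merge target to Terragrunt actions. Here is a minimal dry-run sketch of that mapping. It reuses the `terragrunt run --all ... --working-dir` form from the quick start; `run` and `on_merge` are hypothetical helpers, and the real workflows in `.github/workflows/` may be structured differently:

```shell
#!/bin/sh
# Dry-run sketch: prints the Terragrunt commands each merge would trigger.
# run and on_merge are hypothetical helpers, not part of this repo.

run() {
  # Print instead of executing unless DRY_RUN=0 is set explicitly.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

on_merge() {
  case "$1" in
    dev)
      # Merge to dev: create (or update) the ephemeral dev cluster.
      run terragrunt run --all apply --working-dir environments/dev
      ;;
    main)
      # Merge to main: tear down dev, then roll out prod. In the real
      # pipeline the prod step sits behind a manual approval gate.
      run terragrunt run --all destroy --working-dir environments/dev
      run terragrunt run --all apply --working-dir environments/prod
      ;;
    *)
      echo "no deployment for branch: $1" >&2
      return 1
      ;;
  esac
}

on_merge dev
on_merge main
```

Because there is a single `environments/dev`, two merges to `dev` in quick succession would target the same state, which is why PRs queue rather than run in parallel.
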
| Layer | Tool | Why |
|---|---|---|
| IaC | Terraform 1.9 + Terragrunt 1.0 | Module reuse, remote state, dependencies |
| State | S3 + DynamoDB lock | Standard AWS pattern |
| Orchestration | EKS 1.31 | Managed Kubernetes |
| NAT | fck-nat | $3/mo instead of $32/mo for NAT Gateway |
| Ingress | AWS Load Balancer Controller | Native ALB/NLB integration |
| Storage | EBS CSI driver | Persistent volumes for stateful tools |
| Cluster auth | EKS Access Entries (API mode) | Replaces deprecated aws-auth ConfigMap |
| Workload auth | IRSA (OIDC) | No static keys in pods |
| SAST | SonarQube | Code quality + security |
| IaC scan | KICS + Checkov | Two scanners cover different rule sets |
| Secret scan | TruffleHog | Verified-only mode, low false positives |
| Container scan | Trivy | Filesystem + image scanning |
| Policy | OPA + Conftest | Reusable policies tested in CI |
| Vuln mgmt | DefectDojo | Aggregates findings from all scanners |
| SCA | Dependency Track | Component analysis |
| Secrets | Vault | Dynamic credentials, encryption-as-a-service |
| Registry | Harbor | Private container registry with Trivy built-in |
| Observability | kube-prometheus-stack | Prometheus + Grafana + Alertmanager |
| PR automation | Atlantis | Terraform apply via PR comments |
| Runners | Actions Runner Controller | Self-hosted runners as K8s pods |

    .
    ├── bootstrap/                       # One-time setup: S3 backends, DynamoDB, GitHub OIDC
    │   ├── main.tf
    │   ├── variables.tf
    │   ├── terraform.tfvars.example     # Copy to terraform.tfvars and set github_repo
    │   └── outputs.tf
    ├── environments/
    │   ├── root.hcl                     # Shared terragrunt config (backend, provider)
    │   ├── dev/                         # Ireland, ephemeral
    │   │   ├── networking/terragrunt.hcl
    │   │   ├── eks/terragrunt.hcl
    │   │   └── eks-addons/terragrunt.hcl
    │   ├── prod/                        # Frankfurt, permanent
    │   │   ├── networking/
    │   │   ├── eks/
    │   │   ├── eks-addons/
    │   │   └── prod-services/
    │   └── security/                    # Paris, permanent
    │       ├── networking/
    │       ├── eks/
    │       ├── eks-addons/
    │       └── security-tools/
    ├── modules/
    │   ├── networking/                  # VPC, subnets, IGW, fck-nat, flow logs
    │   ├── eks/                         # Cluster, node group, OIDC provider, access entries
    │   ├── eks-addons/                  # EBS CSI, CoreDNS, kube-proxy, LBC + IRSA
    │   ├── security-tools/              # IRSA roles + Harbor S3 + KMS
    │   └── prod-services/               # Generic IRSA role for prod apps
    ├── kubernetes/
    │   ├── namespaces/                  # security-tools, dev-services, prod-services
    │   ├── helm/                        # values.yaml for each tool
    │   └── manifests/                   # github-runner, ingresses, network policies
    ├── policies/                        # OPA/Rego policies (run by Conftest)
    │   ├── *.rego
    │   └── tests/
    ├── scripts/
    │   └── smoke-tests.sh               # Post-deploy validation
    ├── .github/
    │   ├── workflows/
    │   │   ├── security-scan.yaml       # Runs on PR open
    │   │   ├── pr-checks.yaml           # Runs on PR open (terraform validate/plan)
    │   │   ├── deploy-dev.yaml          # Runs on push to dev (after PR merge)
    │   │   └── deploy-prod.yaml         # Runs on PR merge to main
    │   └── CODEOWNERS
    ├── README.md
    ├── SETUP.md                         # Step-by-step setup guide
    ├── Makefile
    └── LICENSE

Full instructions with copy-paste commands, troubleshooting, and screenshots are in SETUP.md. What follows is a high-level summary.
- Fork this repo to your GitHub account.
- Configure AWS (a paid account is required; the strict Free Plan won't allow `t3.large`).
- Bootstrap:

      cd bootstrap
      cp terraform.tfvars.example terraform.tfvars   # set github_repo = "you/your-fork"
      terraform init && terraform apply

- Deploy the permanent clusters (security + prod, ~25 min each):

      terragrunt run --all apply --working-dir environments/security
      terragrunt run --all apply --working-dir environments/prod

- Install security tools (Helm): see SETUP.md §6.
- Configure GitHub: secrets, environments, branch protection.
- Try the flow: branch off, open a PR to `dev`, and watch the checks run.

Day-to-day, a developer only ever does this:

    # 1. Branch off main
    git checkout -b feature/add-service-x
    # 2. Make changes (e.g. add a new IAM role in modules/prod-services/main.tf)
    $EDITOR modules/prod-services/main.tf
    # 3. Push and open PR to dev
    git push origin feature/add-service-x
    gh pr create --base dev --head feature/add-service-x
    # 4. Wait for green checks (scans + plan + policies)
    # 5. Approve & merge: the dev cluster is created automatically with the change
    # 6. Test manually in dev (kubectl, curl, etc.)
    # 7. If happy, open PR dev -> main, approve, merge
    # 8. Prod gets updated, dev gets destroyed automatically

That's it. No `terraform apply` from anyone's laptop. Ever.

- S3 state buckets: KMS-encrypted, versioning, public access blocked
- DynamoDB locks: server-side encryption
- EKS secrets: optional KMS envelope encryption (opt-in to avoid recreating existing clusters)
- EBS volumes: encrypted by default via launch template
- EKS API endpoint: private + public with configurable CIDR allowlist (default open for PoC; restrict per env)
- All workloads: traffic stays in VPC unless explicitly routed out via fck-nat
- IRSA for every pod that needs AWS access β no instance-profile shortcuts, no static keys
- OIDC trust scoped to a specific GitHub repo (`repo:owner/repo:*`) for GitHub Actions
- EKS Access Entries in API mode (no aws-auth ConfigMap drift)
- CODEOWNERS + branch protection force human review for every change
- VPC flow logs to CloudWatch (7-day retention by default)
- EKS control plane logs (api, audit, authenticator, controllerManager, scheduler)
- Prometheus scraping all clusters
- DefectDojo as single pane of glass for findings across scanners
- Self-hosted runners in private subnets inside the security cluster
- No third-party CI service holds AWS credentials
- GitHub OIDC → AWS STS → assume role (short-lived credentials)
- PRs to `main` require an approving review from `@hallllow29` (or the configured CODEOWNER)
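
The `repo:owner/repo:*` scope is a wildcard match against the `sub` claim of the GitHub OIDC token. To illustrate which subjects such a pattern admits, here is a sketch using shell glob matching. `sub_allowed` is a hypothetical helper and `owner/repo` a placeholder; the actual check is performed by AWS IAM when it evaluates the role's trust policy (typically a `StringLike` condition), not by a script:

```shell
#!/bin/sh
# Sketch: which GitHub OIDC subjects does a repo-wide pattern admit?
# sub_allowed is a hypothetical helper; owner/repo is a placeholder.

sub_allowed() {
  # $1 = "owner/repo", $2 = sub claim from the token
  case "$2" in
    "repo:$1:"*) return 0 ;;
    *)           return 1 ;;
  esac
}

for sub in \
  "repo:owner/repo:ref:refs/heads/main" \
  "repo:owner/repo:pull_request" \
  "repo:owner/repo:environment:production" \
  "repo:owner/repo-fork:ref:refs/heads/main" \
  "repo:attacker/repo:ref:refs/heads/main"
do
  if sub_allowed "owner/repo" "$sub"; then
    echo "allow  $sub"
  else
    echo "deny   $sub"
  fi
done
```

Because the pattern ends in `:*`, any branch, pull request, or environment in the repo can assume the role. Narrowing the subject for the prod role, for example to `repo:owner/repo:environment:production`, is a common way to tighten this.
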
Run by Conftest in pr-checks.yaml:
| Policy | What it blocks |
|---|---|
| `no_public_s3` | S3 buckets without PublicAccessBlock |
| `enforce_imdsv2` | EC2 instances allowing IMDSv1 |
| `require_encryption` | EBS volumes without encryption |
| `eks_private_endpoint` | EKS clusters without endpoint_private_access |
| `no_privileged_containers` | Privileged pods |
| `require_tags` | Resources without Name and Environment tags |
| `no_wide_ingress` | Security groups with SSH open to 0.0.0.0/0 |
| `eks_secrets_encryption` | EKS clusters without KMS encryption |

Each policy has a corresponding test in policies/tests/.
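
For local runs, similar checks can be reproduced with the Conftest CLI against a JSON rendering of a Terraform plan. This is a minimal sketch, assuming a plan file named `tfplan`; `run` and `policy_check` are hypothetical helpers, and the actual steps in `pr-checks.yaml` may differ:

```shell
#!/bin/sh
# Sketch of a local policy check. By default it only prints the commands
# (DRY_RUN=1); set DRY_RUN=0 to execute them. File names are assumptions.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

policy_check() {
  plan_file="${1:-tfplan}"
  policy_dir="${2:-policies}"

  # 1. Unit-test the Rego policies themselves.
  run conftest verify --policy "$policy_dir"

  # 2. Render the binary plan as JSON so Conftest can read it.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ terraform show -json $plan_file > $plan_file.json"
  else
    terraform show -json "$plan_file" > "$plan_file.json"
  fi

  # 3. Evaluate the plan against every policy in the directory.
  run conftest test "$plan_file.json" --policy "$policy_dir"
}

policy_check tfplan policies
```

Policies aimed at Kubernetes objects, such as `no_privileged_containers`, would presumably be evaluated against manifests rather than the plan JSON.
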
Running everything 24/7 (all 3 clusters up):
| Resource | $/month |
|---|---|
| 3× EKS control plane ($0.10/h × 730h) | ~$219 |
| 6× t3.large SPOT nodes | ~$130 |
| ALBs (1 per exposed workload) | $20-60 |
| EBS volumes (PVCs for stateful tools) | ~$20 |
| 3× fck-nat (t4g.nano) | ~$5 |
| S3 + KMS + DynamoDB + CloudWatch logs | ~$10 |
| Total | ~$400-500 |
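
The arithmetic behind the table can be reproduced in a few lines of shell. This is a back-of-the-envelope sketch that uses only the figures listed in the table (with $0.10/h × 730 h × 3 clusters for the control plane); the function names are made up for illustration:

```shell
#!/bin/sh
# Rough monthly cost model using the figures from the table (USD).
# All inputs are estimates; spot and data-transfer prices vary.

# EKS control plane: clusters x hourly rate x hours per month.
eks_control_plane() {
  awk -v n="$1" -v rate="$2" -v hours="$3" \
    'BEGIN { printf "%.0f\n", n * rate * hours }'
}

# Sum of all table rows for a given ALB spend (the only range in the table).
monthly_total() {
  alb="$1"
  eks=$(eks_control_plane 3 0.10 730)
  # eks + spot nodes + ALBs + EBS + fck-nat + S3/KMS/DynamoDB/logs
  echo $(( eks + 130 + alb + 20 + 5 + 10 ))
}

echo "EKS control plane: \$$(eks_control_plane 3 0.10 730)"
echo "Total (low ALB):   \$$(monthly_total 20)"
echo "Total (high ALB):  \$$(monthly_total 60)"
```

With these inputs the control-plane row comes to $219 and the total lands between $404 and $444, depending on ALB usage.
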
Cost-saving notes:
- Dev is ephemeral, so it only costs money during testing windows. Realistic monthly total: ~$350.
- All node groups use SPOT instances (~70% cheaper than on-demand).
- Replace fck-nat with NAT Gateway only if you need 99.99% NAT uptime.
- For learning/portfolio: spin up, capture screenshots, destroy. ~$15 one-off.
See docs/COST.md for a detailed optimisation matrix (Fargate, Karpenter, instance
type choices, etc.).
To add a new service:
- Add the Terraform IAM role and dependencies in `modules/prod-services/`.
- Add Helm chart values in `kubernetes/helm/<your-app>/`.
- Add Kubernetes manifests in `kubernetes/manifests/prod-services/`.
- Open a PR to `dev`: scans run, dev gets the change, you test, then open a PR to `main`.

To change node size or count, override the inputs in the environment's `eks/terragrunt.hcl`:

    inputs = {
      environment    = "prod"
      instance_types = ["t3.xlarge"] # default is t3.large
      desired_size   = 4
      max_size       = 8
      min_size       = 2
    }

To restrict the public EKS API endpoint to a known address:

    inputs = {
      endpoint_public_access_cidrs = ["<YOUR_OFFICE_IP>/32"]
    }

To enable KMS envelope encryption for EKS secrets:

    inputs = {
      enable_secrets_encryption = true
    }

- Preview environments per PR: `dev-pr-<number>` instead of a single shared dev
- Karpenter for node autoscaling (replace fixed ASG)
- ArgoCD as the source of truth for Kubernetes manifests
- External DNS for Route53 automation
- cert-manager with Let's Encrypt for public TLS
- Slack alert routing in Alertmanager
- Custom runner image with `awscli`, `kubectl`, `helm`, `jq` pre-installed
- Terratest module tests
- Renovate / Dependabot for Helm chart and module updates
- Cost dashboard with AWS Cost Explorer integration
- Disaster recovery runbook in `docs/DR.md`

See docs/TROUBLESHOOTING.md for solutions to common issues:
- `terragrunt run --all apply` failing with state-checksum mismatch
- IAM roles surviving after partial destroy ("EntityAlreadyExists")
- EKS Access Entry rejecting assumed-role ARNs
- Self-hosted runner stuck in `Pending`
- Helm release `cannot re-use a name that is still in use`
- DefectDojo returning HTML on `/api/v2/import-scan/` (ALLOWED_HOSTS)
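
The DefectDojo item is easy to miss in CI: if the upload step does not check the response, an HTML error page can go unnoticed. Here is a small guard, sketched under the assumption that a successful import returns a JSON body. `is_json_body` and `check_import_response` are hypothetical helpers, not part of this repo, and the sample bodies are made up:

```shell
#!/bin/sh
# Sketch: tell a JSON API response apart from an HTML error page.
# is_json_body is a hypothetical helper; it only inspects the first
# non-whitespace character and does not validate the JSON.

is_json_body() {
  first=$(printf '%s' "$1" | tr -d ' \t\r\n' | cut -c1)
  case "$first" in
    "{"|"[") return 0 ;;
    *)       return 1 ;;
  esac
}

check_import_response() {
  if is_json_body "$1"; then
    echo "ok: JSON response"
  else
    echo "error: non-JSON response (check ALLOWED_HOSTS and the URL)" >&2
    return 1
  fi
}

# Made-up bodies for illustration only.
check_import_response '{"test": 1, "engagement": 2}'
check_import_response '<!doctype html><html><body>Bad Request</body></html>' || true
```

In a real workflow the body would come from the upload call itself, and a non-zero return would fail the job instead of being ignored.
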
PRs welcome. The flow this repo demonstrates is also the flow used to develop it:
- Branch off `main`
- PR to `dev`: scans must pass, reviewer approves
- Merge: dev cluster validates the change
- PR `dev` → `main`: final review
- Merge: prod updated, dev torn down
See CONTRIBUTING.md for code style and commit message conventions.
MIT: use, modify, distribute. See LICENSE.

- fck-nat by Andrew Guenther: the $32/month NAT Gateway killer
- DefectDojo: vulnerability management done right
- Atlantis: for showing that Terraform deserves PR-driven workflows
- SonarSource, Checkmarx KICS, Bridgecrew Checkov, Aqua Trivy: for free, open-source scanners