CrawlWall: Caddy AI Crawler Access Control

Important

CrawlWall is alpha and experimental. It is useful for local testing, demos, and careful shadow-mode trials, but it is not yet a battle-tested production security boundary. Review the policy, verifier, and ledger behavior before enforcing blocks on real traffic.

A self-hosted Caddy module for AI crawler blocking, bot verification, rate limiting, metered access, and signed crawl receipts.

CrawlWall sits in front of your application and turns robots.txt-style crawler policy into enforceable HTTP-edge rules using YAML and CEL. It identifies crawlers, verifies their identity, evaluates policy, records what happened, and can sign receipts for metered access.

The short version is:

  • robots.txt is advisory
  • CrawlWall is enforcement
  • YAML is the config container
  • CEL is the policy language
  • Caddy is the runtime

Why this exists

Sites increasingly need something more precise than:

  • "please do not crawl this"
  • "this bot says it is Google"
  • "this path should maybe cost money"

That is awkward to express with robots.txt, awkward to audit in application code, and annoying to keep consistent across services.

CrawlWall moves that logic into the HTTP edge and gives it a stable shape:

Concern                       CrawlWall answer
Is this crawler known?        Match on User-Agent
Is it really that crawler?    Verify by reverse DNS or IP ranges
What should happen?           Evaluate CEL rules in priority order
Need proof later?             Write a ledger event with a stable event ID
Need metering?                allow_metered + signed receipts

The point is not to be clever. The point is to be explicit, inspectable, and replaceable.

Mental model

Think of CrawlWall as four subsystems glued together inside a Caddy handler:

  1. Bot identification: map a request to a known bot definition or unknown.
  2. Verification: decide whether the claimed crawler identity is trustworthy.
  3. Policy evaluation: run CEL rules against the request context.
  4. Audit trail: write the event and optionally sign a receipt.

That means the project is not "a YAML parser" and not "a crawler blocklist." It is a policy runtime.
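
As a rough Go sketch, those subsystems could be expressed as small interfaces. The names and shapes below are illustrative assumptions, not CrawlWall's internal API:

package sketch

import "net/http"

// Bot and Decision are minimal stand-ins for whatever the real module uses.
type Bot struct {
	ID       string
	Class    string
	Claimed  bool
	Verified bool
}

type Decision struct {
	Action string // allow, block, rate_limit, allow_metered
	RuleID string
}

// Identifier maps a request to a known bot definition or an "unknown" bot.
type Identifier interface {
	Identify(r *http.Request) Bot
}

// Verifier decides whether the claimed crawler identity is trustworthy.
type Verifier interface {
	Verify(r *http.Request, b Bot) (bool, error)
}

// PolicyEngine runs compiled CEL rules against the request context.
type PolicyEngine interface {
	Evaluate(input map[string]any) (Decision, error)
}

// The fourth subsystem, the audit trail, is the ledger writer described
// under Project layout below.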

Architecture

flowchart LR
    A["HTTP request"] --> B["Bot identifier"]
    B --> C["Verifier"]
    C --> D["Policy engine (CEL)"]
    D --> E["Decision"]
    E --> F["Allow / Block / Rate limit / Allow metered"]
    E --> H["Signed receipt (optional)"]
    H --> G["Ledger writer"]
    F --> I["Upstream app"]

The startup path matters as much as the request path.

At startup CrawlWall:

  1. loads crawlwall.yaml
  2. validates the config
  3. compiles CEL expressions
  4. opens the ledger backend
  5. prepares verifiers
  6. loads the receipt signer

If a CEL expression is broken, startup should fail. That is the right pain location.
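
CEL in Go is usually driven through cel-go; below is a minimal sketch of that fail-fast compile step, assuming cel-go and illustrative variable declarations (CrawlWall's real environment setup may differ):

package main

import (
	"fmt"
	"log"

	"github.com/google/cel-go/cel"
)

func main() {
	// Declare the policy inputs as dynamic variables for compilation.
	env, err := cel.NewEnv(
		cel.Variable("bot", cel.DynType),
		cel.Variable("request", cel.DynType),
		cel.Variable("sets", cel.DynType),
	)
	if err != nil {
		log.Fatal(err)
	}

	expr := `bot.verified && bot.class == "ai_training"`
	ast, issues := env.Compile(expr)
	if issues != nil && issues.Err() != nil {
		// A broken rule aborts startup instead of surfacing per request.
		log.Fatalf("invalid CEL rule: %v", issues.Err())
	}
	if _, err := env.Program(ast); err != nil {
		log.Fatal(err)
	}
	fmt.Println("rule compiled")
}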

How a request is handled

Step  What happens
1     Read User-Agent and identify the claimed crawler
2     Verify the request source according to that crawler's verifier
3     Build the policy input: bot, request, site, sets, labels
4     Evaluate rules by ascending priority
5     Enforce the first matching action
6     If requested, sign a receipt over the stable event ID
7     Write one ledger record containing the decision and receipt metadata

The policy input is intentionally small and boring. It is easier to extend a plain model than to untangle a magical one.
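
For illustration, the evaluation context could look roughly like the map below. Only the fields used in this README's example rules (bot.*, request.path, site.*, sets.*) are grounded; anything else here is an assumption, and the policy guide is the authoritative list:

package sketch

// policyInput sketches the shape of the CEL context; see the caveat above.
var policyInput = map[string]any{
	"bot": map[string]any{
		"id":       "gptbot",
		"class":    "ai_training",
		"claimed":  true,
		"verified": true,
	},
	"request": map[string]any{
		"path": "/archive/a",
		"ip":   "20.125.66.81", // assumed field name
	},
	"site": map[string]any{
		"id":   "local-dev",
		"mode": "enforce",
	},
	"sets": map[string]any{
		"protected_paths": []string{"/archive", "/datasets", "/reports"},
	},
	"labels": map[string]string{},
}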

Getting started

Do not start from a blank file.

This repo ships with two starter policies and a fixture file:

File                           Use it when
examples/minimal.yaml          You want a readable starter with no receipt signing
examples/full.yaml             You want the full V1 shape with metering and signed receipts
examples/policy-fixtures.yaml  You want regression tests for policy behavior

There is also a scaffold command:

go run ./cmd/crawlwall init --profile minimal
go run ./cmd/crawlwall init --profile full

That writes:

  • crawlwall.yaml
  • Caddyfile
  • .gitignore
  • crawlwall.key and crawlwall.pub unless you disable key generation

If you want the scaffold without keys yet:

go run ./cmd/crawlwall init --profile minimal --generate-keys=false

Requirements

  • Go matching the version in go.mod
  • xcaddy to build a Caddy binary with the CrawlWall module
  • Caddy for config validation and runtime
  • A SQLite ledger path when ledger.enabled is true

Build

Build a custom Caddy binary with xcaddy.

From this local checkout:

go mod tidy
xcaddy build --with github.com/jolovicdev/crawlwall=.

From a published module version:

xcaddy build --with github.com/jolovicdev/crawlwall@latest

Check that the module is present:

caddy list-modules | grep crawlwall

Validate the config:

go run ./cmd/crawlwall policy check --config ./crawlwall.yaml
caddy validate --config ./Caddyfile --adapter caddyfile

Run it:

caddy run --config ./Caddyfile --adapter caddyfile

Try a few requests:

curl http://localhost:8080/
curl http://localhost:8080/archive/a
curl -A "GPTBot/1.1" http://localhost:8080/archive/a

Note

The docs use plain executable names on purpose. Use whatever binary name your environment produces.

Keys and receipts

Receipt signing uses Ed25519.

The private key is sensitive and should never be committed. The repo's default .gitignore excludes it.

You have two normal ways to create keys:

  1. let crawlwall init generate them
  2. generate them yourself with openssl
openssl genpkey -algorithm Ed25519 -out crawlwall.key
openssl pkey -in crawlwall.key -pubout -out crawlwall.pub

Receipt config looks like this:

receipts:
  enabled: true
  signer:
    type: ed25519
    key_file: ./crawlwall.key

Receipts are for proving what decision was made for a request. In V1 they are used for metered access and audit, not settlement.
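
The exact receipt encoding and the signed byte string are defined by CrawlWall (and checked by the receipts verify command below); the sketch here only shows the underlying Ed25519 primitive with Go's standard library, assuming the signature covers the stable event ID:

package main

import (
	"crypto/ed25519"
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"log"
	"os"
)

func main() {
	// Load the PEM public key written by `openssl pkey -pubout` above.
	raw, err := os.ReadFile("crawlwall.pub")
	if err != nil {
		log.Fatal(err)
	}
	block, _ := pem.Decode(raw)
	if block == nil {
		log.Fatal("no PEM block in crawlwall.pub")
	}
	parsed, err := x509.ParsePKIXPublicKey(block.Bytes)
	if err != nil {
		log.Fatal(err)
	}
	pub, ok := parsed.(ed25519.PublicKey)
	if !ok {
		log.Fatal("crawlwall.pub is not an Ed25519 key")
	}

	// Hypothetical payload and signature: the real bytes come from the
	// ledger export / receipt format, not from this sketch.
	eventID := []byte("example-event-id")
	var signature []byte // filled in from an exported receipt

	fmt.Println("valid:", ed25519.Verify(pub, eventID, signature))
}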

Policy shape

The top-level config model is stable even if the individual rules change:

Section   Purpose
site      Site identity and mode
runtime   Failure behavior and default action
ledger    Event recording settings
receipts  Receipt signer configuration
bots      Known crawler definitions and verifier settings
sets      Reusable policy data
rules     CEL expressions plus actions

site.mode controls enforcement:

Mode     Effect
shadow   Log decisions without enforcing blocks or rate limits
observe  Alias for shadow, kept for older configs
enforce  Enforce policy decisions

Use shadow before blocking crawlers on a production site. It lets you inspect the ledger first, which is less exciting than debugging a self-inflicted 403 storm.
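
Conceptually, shadow mode still computes the decision and writes the ledger row; only the enforcement step is skipped. A minimal sketch of that idea (not the module's actual handler):

package sketch

import "net/http"

// decision is a stand-in for whatever the policy engine returns.
type decision struct {
	Action string // "allow", "block", ...
	Status int
	Reason string
}

// enforce records the decision either way; in shadow (or observe) mode it
// never acts on blocks or rate limits.
func enforce(w http.ResponseWriter, r *http.Request, mode string,
	d decision, logEvent func(decision), next http.Handler) {

	logEvent(d) // the ledger row is written regardless of mode

	if mode == "shadow" || mode == "observe" || d.Action == "allow" {
		next.ServeHTTP(w, r)
		return
	}
	if d.Action == "block" {
		http.Error(w, d.Reason, d.Status)
		return
	}
	// rate_limit and allow_metered handling omitted from this sketch.
	next.ServeHTTP(w, r)
}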

Writing policy rules

Start with the policy guide. It explains the available CEL inputs, rule priority, shadow mode, common recipes, verifier cache status, and fixture tests.

Example Caddyfile

{
  order crawlwall before reverse_proxy
}

:8080 {
  crawlwall {
    policy ./crawlwall.yaml
    ledger sqlite://./crawlwall.db
    fail_mode block
  }

  reverse_proxy localhost:3000
}

Example rule

- id: meter_training_on_protected_paths
  priority: 200
  when: >
    bot.verified &&
    bot.class == "ai_training" &&
    sets.protected_paths.exists(p, request.path.startsWith(p))
  action:
    type: allow_metered
    price:
      amount: 0.002
      currency: USD
      unit: request
  audit:
    receipt: true
    tags: ["ai_training", "metered"]

Example config

Full V1 policy example:

version: crawlwall.io/v1

site:
  id: local-dev
  host: localhost
  mode: enforce

runtime:
  fail_mode: block
  default_action:
    type: allow

ledger:
  enabled: true

receipts:
  enabled: true
  signer:
    type: ed25519
    key_file: ./crawlwall.key

bots:
  - id: googlebot
    name: Googlebot
    class: search
    match:
      user_agents:
        - "Googlebot"
    verify:
      type: reverse_dns
      allowed_suffixes:
        - ".googlebot.com"
        - ".google.com"

  - id: gptbot
    name: GPTBot
    class: ai_training
    match:
      user_agents:
        - "GPTBot"
    verify:
      type: ip_ranges
      sources:
        - "https://openai.com/gptbot.json"
      refresh: 1h
      stale_action: fail_closed
      max_stale: 0s

  - id: unknown
    name: Unknown
    class: unknown
    match:
      default: true
    verify:
      type: none

sets:
  protected_paths:
    - "/archive"
    - "/datasets"
    - "/reports"

  known_ai_training:
    - "gptbot"
    - "claudebot"

rules:
  - id: block_spoofed_known_bots
    priority: 10
    when: >
      bot.claimed && !bot.verified
    action:
      type: block
      status: 403
      reason: spoofed_bot
    audit:
      receipt: true
      tags: ["spoofed", "security"]

  - id: allow_verified_search
    priority: 100
    when: >
      bot.verified && bot.class == "search"
    action:
      type: allow
    audit:
      receipt: false
      tags: ["search"]

  - id: meter_training_on_protected_paths
    priority: 200
    when: >
      bot.verified &&
      bot.class == "ai_training" &&
      sets.protected_paths.exists(p, request.path.startsWith(p))
    action:
      type: allow_metered
      price:
        amount: 0.002
        currency: USD
        unit: request
    audit:
      receipt: true
      tags: ["ai_training", "metered"]

  - id: rate_limit_ai_training_elsewhere
    priority: 300
    when: >
      bot.verified && bot.class == "ai_training"
    action:
      type: rate_limit
      limit:
        key: "bot.id"
        rpm: 120
    audit:
      receipt: true
      tags: ["ai_training"]

  - id: block_unknown_protected_paths
    priority: 900
    when: >
      bot.class == "unknown" &&
      sets.protected_paths.exists(p, request.path.startsWith(p))
    action:
      type: block
      status: 403
      reason: unknown_crawler_protected_path
    audit:
      receipt: true
      tags: ["unknown", "blocked"]

Verifiers

V1 ships with three verifier types:

Verifier     What it means
none         No verification step; useful for the unknown catch-all bot
reverse_dns  Verify by PTR lookup and forward-confirm the result
ip_ranges    Verify by matching the request IP against fetched CIDR ranges

reverse_dns

This is the standard pattern used for bots like Googlebot:

  1. resolve remote IP to PTR names
  2. require a configured suffix match
  3. resolve the PTR hostname back to A/AAAA
  4. require the original IP to be present

Completed reverse-DNS decisions are cached per IP for five minutes to avoid doing PTR and forward lookups on every request from a claimed crawler.
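
A standalone sketch of that forward-confirmed lookup using Go's standard library follows; the module's verifier also handles caching, timeouts, and error policy, so treat this as the shape of the check, not its implementation:

package main

import (
	"fmt"
	"net"
	"strings"
)

// verifyReverseDNS: IP -> PTR names -> suffix check -> forward A/AAAA ->
// the original IP must appear in the forward answers.
func verifyReverseDNS(ip string, allowedSuffixes []string) bool {
	want := net.ParseIP(ip)
	if want == nil {
		return false
	}
	names, err := net.LookupAddr(ip)
	if err != nil {
		return false
	}
	for _, name := range names {
		host := strings.TrimSuffix(name, ".")
		if !hasAllowedSuffix(host, allowedSuffixes) {
			continue
		}
		addrs, err := net.LookupIP(host)
		if err != nil {
			continue
		}
		for _, addr := range addrs {
			if addr.Equal(want) {
				return true
			}
		}
	}
	return false
}

func hasAllowedSuffix(host string, suffixes []string) bool {
	for _, s := range suffixes {
		if strings.HasSuffix(host, s) {
			return true
		}
	}
	return false
}

func main() {
	// Example only; the result depends on live DNS at the time you run it.
	ok := verifyReverseDNS("66.249.66.1", []string{".googlebot.com", ".google.com"})
	fmt.Println("verified:", ok)
}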

ip_ranges

This is the simpler model for bots that publish source ranges:

  1. fetch remote JSON
  2. extract CIDRs
  3. cache them in memory
  4. refresh on the configured interval
  5. match the request IP against the cache
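
Fetching and refreshing are the interesting parts operationally, but the match itself is plain CIDR containment. A sketch of that step with net/netip, using placeholder ranges rather than OpenAI's actual published CIDRs:

package main

import (
	"fmt"
	"net/netip"
)

// containsIP checks a request IP against a cached list of CIDR ranges.
// Fetching the provider JSON and the refresh schedule are omitted here.
func containsIP(cidrs []string, ip string) (bool, error) {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return false, err
	}
	for _, c := range cidrs {
		prefix, err := netip.ParsePrefix(c)
		if err != nil {
			return false, fmt.Errorf("bad CIDR %q: %w", c, err)
		}
		if prefix.Contains(addr) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	ranges := []string{"20.125.66.0/24", "52.230.152.0/24"} // placeholders
	ok, err := containsIP(ranges, "20.125.66.81")
	fmt.Println(ok, err)
}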

Important

A GPTBot request only verifies as true if the actual source IP falls inside OpenAI's published GPTBot ranges at evaluation time.

There is one unavoidable freshness tradeoff: CrawlWall can only know about an IP range rotation after it refreshes the provider document. A shorter refresh reduces that window but makes more network calls.

When a refresh is due and the provider document cannot be fetched, stale_action controls whether the expired cache is still trusted:

Field         Default      Meaning
refresh       12h          How often to refetch the range document
stale_action  fail_closed  Refuse expired ranges after refresh failure
max_stale     0s           Extra stale-cache time for use_stale

Security-first config:

verify:
  type: ip_ranges
  sources:
    - "https://openai.com/gptbot.json"
  refresh: 1h
  stale_action: fail_closed
  max_stale: 0s

Availability-first config:

verify:
  type: ip_ranges
  sources:
    - "https://openai.com/gptbot.json"
  refresh: 1h
  stale_action: use_stale
  max_stale: 24h

Use fail_closed when spoof resistance matters more than crawler availability. Use use_stale only when temporarily blocking a legitimate crawler is worse than trusting a bounded stale range cache.

Client IP and trusted proxies

CrawlWall verifies crawlers against Caddy's trusted-proxy-aware client IP. If Caddy receives traffic directly, that is the socket remote address. If Caddy is behind a CDN, load balancer, or reverse proxy, configure Caddy's server-level trusted_proxies and client_ip_headers options so forwarded client IP headers are trusted only from known proxy ranges.

Example:

{
  servers {
    trusted_proxies static private_ranges
    client_ip_headers X-Forwarded-For CF-Connecting-IP
  }

  order crawlwall before reverse_proxy
}

Do not trust arbitrary X-Forwarded-For headers from the public internet. That turns crawler verification into wishful thinking with a header parser.

Actions

V1 supports four actions:

Action         Effect
allow          Let the request pass
block          Return an error response immediately
rate_limit     Allow within a configured rate, then return 429
allow_metered  Allow the request and record pricing metadata

allow_metered is intentionally narrow. It does not try to settle payment, issue invoices, or do 402 handshakes. It records the metering event and signs a receipt so that payment can be built later without changing the core decision engine.
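
The rate_limit action counts requests per key (bot.id in the earlier example rule) against an rpm budget. A simplified fixed-window sketch of that idea; CrawlWall's in-memory limiter may use a different algorithm:

package sketch

import (
	"sync"
	"time"
)

// windowLimiter is a fixed-window counter keyed by a string such as a bot ID.
type windowLimiter struct {
	mu     sync.Mutex
	rpm    int
	window time.Time
	counts map[string]int
}

func newWindowLimiter(rpm int) *windowLimiter {
	return &windowLimiter{rpm: rpm, counts: map[string]int{}}
}

// Allow returns false once a key exceeds rpm requests in the current minute;
// a false result maps to the 429 described above.
func (l *windowLimiter) Allow(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	minute := time.Now().Truncate(time.Minute)
	if minute.After(l.window) {
		l.window = minute
		l.counts = map[string]int{} // new minute, reset every counter
	}
	l.counts[key]++
	return l.counts[key] <= l.rpm
}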

CLI

The CLI exists to make policy iteration less miserable:

go run ./cmd/crawlwall init --profile minimal
go run ./cmd/crawlwall policy check --config ./crawlwall.yaml
go run ./cmd/crawlwall policy eval \
  --config ./crawlwall.yaml --ua "GPTBot/1.1" \
  --path "/archive/a" --ip 20.125.66.81
go run ./cmd/crawlwall policy test \
  --config ./crawlwall.yaml --fixtures ./examples/policy-fixtures.yaml
go run ./cmd/crawlwall verifiers status --config ./crawlwall.yaml
go run ./cmd/crawlwall ledger report --db ./crawlwall.db --since 24h
go run ./cmd/crawlwall ledger export --db ./crawlwall.db --format jsonl
go run ./cmd/crawlwall ledger vacuum --db ./crawlwall.db --older-than 30d
go run ./cmd/crawlwall receipts verify \
  --file ./ledger-export.jsonl --public-key ./crawlwall.pub

Useful split:

  • init: create a starting point
  • policy check: validate and compile
  • policy eval: answer "what would happen to this request?"
  • policy test: run fixture-based policy regression tests
  • verifiers status: show IP range verifier cache health
  • ledger report: summarize observed traffic
  • ledger export: dump the event log
  • ledger vacuum: delete old events and compact the SQLite file
  • receipts verify: validate signed receipt output

Project layout

cmd/crawlwall/         CLI
docs/                  usage guides
examples/              starter policies
internal/bot/          user-agent matching and bot registry
internal/config/       YAML load and validation
internal/ledger/       ledger interface and SQLite backend
internal/policy/       CEL environment, compile, evaluate
internal/ratelimit/    in-memory limiter
internal/receipt/      canonical receipts and Ed25519 signing
internal/scaffold/     starter templates for init
internal/verify/       reverse DNS and IP range verifiers

The main interface worth caring about is the ledger boundary. Request handling depends only on an EventWriter contract: one fully-formed event in, storage error out. Reporting and export are separate interfaces, so a Postgres, webhook, or queue-backed writer does not have to pretend it is SQLite.
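
In sketch form, that boundary is a single-method interface plus a placeholder event type; the real field set and signatures live in internal/ledger, so take the names below as illustrative:

package sketch

import "context"

// Event stands in for one fully-formed ledger event.
type Event struct {
	ID      string
	BotID   string
	Action  string
	Receipt []byte
}

// EventWriter is the contract request handling depends on:
// one fully-formed event in, storage error out.
type EventWriter interface {
	Write(ctx context.Context, e Event) error
}

// A Postgres, webhook, or queue-backed ledger only has to satisfy
// EventWriter; reporting and export stay behind separate interfaces.
type webhookWriter struct{ url string }

var _ EventWriter = (*webhookWriter)(nil)

func (w *webhookWriter) Write(ctx context.Context, e Event) error {
	// POST e to w.url; serialization, retries, and errors omitted.
	return nil
}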

Scope

Included in V1:

  • Caddy handler
  • CEL policy engine
  • reverse DNS verification
  • IP range verification
  • verifier cache status checks
  • shadow mode for dry-run policy rollout
  • SQLite ledger
  • ledger retention cleanup
  • signed receipts
  • local reporting and export
  • policy fixture tests

Deliberately not included in V1:

  • payment processing
  • dashboards
  • distributed quotas
  • extra webserver integrations
  • policy languages beyond CEL

Status

The current implementation has been exercised with:

  • go test ./...
  • custom Caddy builds through xcaddy
  • Caddy config validation
  • live requests through Caddy to a local upstream
  • verifier cache status checks
  • policy fixture tests
  • ledger export
  • receipt verification
  • integration tests for blocked, shadowed, metered, and rate-limited flows

That means the current claim is modest but honest:

CrawlWall is a self-hosted Caddy crawler access-control layer. It identifies bots, verifies identity, evaluates CEL policy rules, enforces allow/block/rate-limit/metered decisions, stores a crawler ledger, and exports signed crawl receipts.

Help

Use GitHub Issues for bugs, security-relevant behavior questions, and integration reports. Include the Caddyfile, CrawlWall policy, request path, user agent, and observed ledger row when possible.

License

MIT. See LICENSE.

Inspiration

Cloudflare's pay-per-crawl docs were useful inspiration for separating metering from payment, but CrawlWall stays much smaller and self-hosted.
