> [!IMPORTANT]
> CrawlWall is alpha and experimental. It is useful for local testing, demos, and careful shadow-mode trials, but it is not yet a battle-tested production security boundary. Review the policy, verifier, and ledger behavior before enforcing blocks on real traffic.
A self-hosted Caddy module for AI crawler blocking, bot verification, rate limiting, metered access, and signed crawl receipts.
CrawlWall sits in front of your application and turns robots.txt-style crawler policy into enforceable HTTP-edge rules using YAML and CEL. It identifies crawlers, verifies their identity, evaluates policy, records what happened, and can sign receipts for metered access.
The short version is:
- `robots.txt` is advisory; CrawlWall is enforcement
- YAML is the config container
- CEL is the policy language
- Caddy is the runtime
- Why this exists
- Mental model
- Architecture
- How a request is handled
- Getting started
- Requirements
- Keys and receipts
- Policy shape
- Writing policy rules
- Verifiers
- Client IP and trusted proxies
- Actions
- CLI
- Project layout
- Scope
- Status
- Help
- License
Sites increasingly need something more precise than:
- "please do not crawl this"
- "this bot says it is Google"
- "this path should maybe cost money"
That is awkward to express with robots.txt, awkward to audit in application
code, and annoying to keep consistent across services.
CrawlWall moves that logic into the HTTP edge and gives it a stable shape:
| Concern | CrawlWall answer |
|---|---|
| Is this crawler known? | Match on User-Agent |
| Is it really that crawler? | Verify by reverse DNS or IP ranges |
| What should happen? | Evaluate CEL rules in priority order |
| Need proof later? | Write a ledger event with a stable event ID |
| Need metering? | allow_metered + signed receipts |
The point is not to be clever. The point is to be explicit, inspectable, and replaceable.
Think of CrawlWall as four subsystems glued together inside a Caddy handler:
- Bot identification: map a request to a known bot definition or unknown.
- Verification: decide whether the claimed crawler identity is trustworthy.
- Policy evaluation: run CEL rules against the request context.
- Audit trail: write the event and optionally sign a receipt.
That means the project is not "a YAML parser" and not "a crawler blocklist." It is a policy runtime.
```mermaid
flowchart LR
  A["HTTP request"] --> B["Bot identifier"]
  B --> C["Verifier"]
  C --> D["Policy engine (CEL)"]
  D --> E["Decision"]
  E --> F["Allow / Block / Rate limit / Allow metered"]
  E --> H["Signed receipt (optional)"]
  H --> G["Ledger writer"]
  F --> I["Upstream app"]
```
The startup path matters as much as the request path.
At startup CrawlWall:
- loads `crawlwall.yaml`
- validates the config
- compiles CEL expressions
- opens the ledger backend
- prepares verifiers
- loads the receipt signer
If a CEL expression is broken, startup should fail. That is the right pain location.
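For example, a rule with a syntax error in its `when` expression (a hypothetical snippet, shown only to illustrate the failure mode) should be reported by `policy check` and refused at startup, instead of surfacing as a rule that silently never matches at request time:

```yaml
rules:
  - id: block_spoofed_known_bots
    priority: 10
    # unbalanced parenthesis: CEL compilation fails here, so the config
    # is rejected up front rather than partially enforced
    when: >
      bot.claimed && (!bot.verified
    action:
      type: block
      status: 403
      reason: spoofed_bot
```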
| Step | What happens |
|---|---|
| 1 | Read User-Agent and identify the claimed crawler |
| 2 | Verify the request source according to that crawler's verifier |
| 3 | Build the policy input: bot, request, site, sets, labels |
| 4 | Evaluate rules by ascending priority |
| 5 | Enforce the first matching action |
| 6 | If requested, sign a receipt over the stable event ID |
| 7 | Write one ledger record containing the decision and receipt metadata |
The policy input is intentionally small and boring. It is easier to extend a plain model than to untangle a magical one.
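For orientation, this is roughly the shape the example rules in this README reference. Treat it as an illustration, not a schema: fields beyond the ones the example rules actually use are assumptions, and the policy guide is the authoritative list.

```yaml
# Sketch of the CEL policy input (illustrative, not a schema)
bot:
  id: gptbot            # matched bot definition, or "unknown"
  class: ai_training    # e.g. search, ai_training, unknown
  claimed: true         # the User-Agent claimed a known crawler
  verified: true        # the verifier confirmed that claim
request:
  path: /archive/a      # used with startsWith() in the example rules
site:
  id: local-dev         # from the site section of crawlwall.yaml
sets:
  protected_paths: ["/archive", "/datasets", "/reports"]
labels: {}              # free-form labels, when configured
```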
Do not start from a blank file.
This repo ships with two starter policies and a policy fixture file:
| File | Use it when |
|---|---|
| `examples/minimal.yaml` | You want a readable starter with no receipt signing |
| `examples/full.yaml` | You want the full V1 shape with metering and signed receipts |
| `examples/policy-fixtures.yaml` | You want regression tests for policy behavior |
There is also a scaffold command:
```sh
go run ./cmd/crawlwall init --profile minimal
go run ./cmd/crawlwall init --profile full
```

That writes:

- `crawlwall.yaml`
- `Caddyfile`
- `.gitignore`
- `crawlwall.key` and `crawlwall.pub`, unless you disable key generation
If you want the scaffold without keys yet:
```sh
go run ./cmd/crawlwall init --profile minimal --generate-keys=false
```

To build and run CrawlWall you need:

- Go matching the version in `go.mod`
- `xcaddy` to build a Caddy binary with the CrawlWall module
- Caddy for config validation and runtime
- a SQLite ledger path when `ledger.enabled` is `true`
Build a custom Caddy binary with xcaddy.
From this local checkout:
```sh
go mod tidy
xcaddy build --with github.com/jolovicdev/crawlwall=.
```

From a published module version:

```sh
xcaddy build --with github.com/jolovicdev/crawlwall@latest
```

Check that the module is present:

```sh
caddy list-modules | grep crawlwall
```

Validate the config:

```sh
go run ./cmd/crawlwall policy check --config ./crawlwall.yaml
caddy validate --config ./Caddyfile --adapter caddyfile
```

Run it:

```sh
caddy run --config ./Caddyfile --adapter caddyfile
```

Try a few requests:

```sh
curl http://localhost:8080/
curl http://localhost:8080/archive/a
curl -A "GPTBot/1.1" http://localhost:8080/archive/a
```

> [!NOTE]
> The docs use plain executable names on purpose. Use whatever binary name your environment produces.
Receipt signing uses Ed25519.
The private key is sensitive and should never be committed. This repo ignores
it by default in .gitignore.
You have two normal ways to create keys:
- let `crawlwall init` generate them
- generate them yourself with `openssl`

```sh
openssl genpkey -algorithm Ed25519 -out crawlwall.key
openssl pkey -in crawlwall.key -pubout -out crawlwall.pub
```

Receipt config looks like this:
```yaml
receipts:
  enabled: true
  signer:
    type: ed25519
    key_file: ./crawlwall.key
```

Receipts are for proving what decision was made for a request. In V1 they are used for metered access and audit, not settlement.
The top-level config model is stable even if the individual rules change:
| Section | Purpose |
|---|---|
| `site` | Site identity and mode |
| `runtime` | Failure behavior and default action |
| `ledger` | Event recording settings |
| `receipts` | Receipt signer configuration |
| `bots` | Known crawler definitions and verifier settings |
| `sets` | Reusable policy data |
| `rules` | CEL expressions plus actions |
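Put together, a stripped-down config touches most of these sections. The sketch below only rearranges pieces of the full example further down (see also `examples/minimal.yaml`) and leaves receipts and metering out:

```yaml
version: crawlwall.io/v1
site:
  id: local-dev
  host: localhost
  mode: enforce
runtime:
  fail_mode: block
  default_action:
    type: allow
ledger:
  enabled: true
bots:
  - id: unknown
    name: Unknown
    class: unknown
    match:
      default: true
    verify:
      type: none
sets:
  protected_paths:
    - "/archive"
rules:
  - id: block_unknown_protected_paths
    priority: 900
    when: >
      bot.class == "unknown" &&
      sets.protected_paths.exists(p, request.path.startsWith(p))
    action:
      type: block
      status: 403
      reason: unknown_crawler_protected_path
```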
`site.mode` controls enforcement:

| Mode | Effect |
|---|---|
| `shadow` | Log decisions without enforcing blocks or rate limits |
| `observe` | Alias for `shadow`, kept for older configs |
| `enforce` | Enforce policy decisions |
Use shadow before blocking crawlers on a production site. It lets you inspect
the ledger first, which is less exciting than debugging a self-inflicted 403
storm.
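A typical rollout keeps the rules identical and only changes the mode once the shadow-mode ledger output looks right:

```yaml
site:
  id: local-dev
  host: localhost
  mode: shadow     # decisions are written to the ledger but not enforced
  # mode: enforce  # flip this once the shadow results look correct
```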
Start with the policy guide. It explains the available CEL inputs, rule priority, shadow mode, common recipes, verifier cache status, and fixture tests.
The Caddyfile side looks like this:

```
{
    order crawlwall before reverse_proxy
}

:8080 {
    crawlwall {
        policy ./crawlwall.yaml
        ledger sqlite://./crawlwall.db
        fail_mode block
    }
    reverse_proxy localhost:3000
}
```

An individual rule pairs a CEL `when` expression with an action:

```yaml
- id: meter_training_on_protected_paths
  priority: 200
  when: >
    bot.verified &&
    bot.class == "ai_training" &&
    sets.protected_paths.exists(p, request.path.startsWith(p))
  action:
    type: allow_metered
    price:
      amount: 0.002
      currency: USD
      unit: request
  audit:
    receipt: true
    tags: ["ai_training", "metered"]
```

Full V1 policy example:
```yaml
version: crawlwall.io/v1

site:
  id: local-dev
  host: localhost
  mode: enforce

runtime:
  fail_mode: block
  default_action:
    type: allow

ledger:
  enabled: true

receipts:
  enabled: true
  signer:
    type: ed25519
    key_file: ./crawlwall.key

bots:
  - id: googlebot
    name: Googlebot
    class: search
    match:
      user_agents:
        - "Googlebot"
    verify:
      type: reverse_dns
      allowed_suffixes:
        - ".googlebot.com"
        - ".google.com"

  - id: gptbot
    name: GPTBot
    class: ai_training
    match:
      user_agents:
        - "GPTBot"
    verify:
      type: ip_ranges
      sources:
        - "https://openai.com/gptbot.json"
      refresh: 1h
      stale_action: fail_closed
      max_stale: 0s

  - id: unknown
    name: Unknown
    class: unknown
    match:
      default: true
    verify:
      type: none

sets:
  protected_paths:
    - "/archive"
    - "/datasets"
    - "/reports"
  known_ai_training:
    - "gptbot"
    - "claudebot"

rules:
  - id: block_spoofed_known_bots
    priority: 10
    when: >
      bot.claimed && !bot.verified
    action:
      type: block
      status: 403
      reason: spoofed_bot
    audit:
      receipt: true
      tags: ["spoofed", "security"]

  - id: allow_verified_search
    priority: 100
    when: >
      bot.verified && bot.class == "search"
    action:
      type: allow
    audit:
      receipt: false
      tags: ["search"]

  - id: meter_training_on_protected_paths
    priority: 200
    when: >
      bot.verified &&
      bot.class == "ai_training" &&
      sets.protected_paths.exists(p, request.path.startsWith(p))
    action:
      type: allow_metered
      price:
        amount: 0.002
        currency: USD
        unit: request
    audit:
      receipt: true
      tags: ["ai_training", "metered"]

  - id: rate_limit_ai_training_elsewhere
    priority: 300
    when: >
      bot.verified && bot.class == "ai_training"
    action:
      type: rate_limit
      limit:
        key: "bot.id"
        rpm: 120
    audit:
      receipt: true
      tags: ["ai_training"]

  - id: block_unknown_protected_paths
    priority: 900
    when: >
      bot.class == "unknown" &&
      sets.protected_paths.exists(p, request.path.startsWith(p))
    action:
      type: block
      status: 403
      reason: unknown_crawler_protected_path
    audit:
      receipt: true
      tags: ["unknown", "blocked"]
```

V1 ships with three verifier types:
| Verifier | What it means |
|---|---|
| `none` | No verification step; useful for the unknown catch-all bot |
| `reverse_dns` | Verify by PTR lookup and forward-confirm the result |
| `ip_ranges` | Verify by matching the request IP against fetched CIDR ranges |
This is the standard pattern used for bots like Googlebot:
- resolve remote IP to PTR names
- require a configured suffix match
- resolve the PTR hostname back to A/AAAA
- require the original IP to be present
Completed reverse-DNS decisions are cached per IP for five minutes to avoid doing PTR and forward lookups on every request from a claimed crawler.
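As a concrete illustration for a crawler configured like the `googlebot` entry in the full example, the per-request check works roughly as follows. The IP and PTR name below are hypothetical, used only to walk through the steps:

```yaml
verify:
  type: reverse_dns
  allowed_suffixes:
    - ".googlebot.com"
    - ".google.com"
# For a request claiming Googlebot from 192.0.2.10 (example address):
#   1. PTR lookup of 192.0.2.10 returns e.g. crawl-192-0-2-10.googlebot.com
#   2. that name ends with ".googlebot.com", so the suffix check passes
#   3. forward lookup of crawl-192-0-2-10.googlebot.com returns A/AAAA records
#   4. 192.0.2.10 must appear in that answer, otherwise bot.verified stays false
# The verified/unverified result is then cached for that IP for five minutes.
```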
This is the simpler model for bots that publish source ranges:
- fetch remote JSON
- extract CIDRs
- cache them in memory
- refresh on the configured interval
- match the request IP against the cache
> [!IMPORTANT]
> A GPTBot request only verifies as true if the actual source IP falls inside OpenAI's published GPTBot ranges at evaluation time.
There is one unavoidable freshness tradeoff: CrawlWall can only know about an
IP range rotation after it refreshes the provider document. A shorter
refresh reduces that window but makes more network calls.
When a refresh is due and the provider document cannot be fetched,
stale_action controls whether the expired cache is still trusted:
| Field | Default | Meaning |
|---|---|---|
| `refresh` | `12h` | How often to refetch the range document |
| `stale_action` | `fail_closed` | Refuse expired ranges after refresh failure |
| `max_stale` | `0s` | Extra stale-cache time for `use_stale` |
Security-first config:
```yaml
verify:
  type: ip_ranges
  sources:
    - "https://openai.com/gptbot.json"
  refresh: 1h
  stale_action: fail_closed
  max_stale: 0s
```

Availability-first config:

```yaml
verify:
  type: ip_ranges
  sources:
    - "https://openai.com/gptbot.json"
  refresh: 1h
  stale_action: use_stale
  max_stale: 24h
```

Use `fail_closed` when spoof resistance matters more than crawler availability.
Use `use_stale` only when temporarily blocking a legitimate crawler is worse
than trusting a bounded stale range cache.
CrawlWall verifies crawlers against Caddy's trusted-proxy-aware client IP. If
Caddy receives traffic directly, that is the socket remote address. If Caddy is
behind a CDN, load balancer, or reverse proxy, configure Caddy's server-level
trusted_proxies and client_ip_headers options so forwarded client IP headers
are trusted only from known proxy ranges.
Example:
```
{
    servers {
        trusted_proxies static private_ranges
        client_ip_headers X-Forwarded-For CF-Connecting-IP
    }
    order crawlwall before reverse_proxy
}
```

Do not trust arbitrary X-Forwarded-For headers from the public internet. That
turns crawler verification into wishful thinking with a header parser.
V1 supports four actions:
| Action | Effect |
|---|---|
| `allow` | Let the request pass |
| `block` | Return an error response immediately |
| `rate_limit` | Allow within a configured rate, then return 429 |
| `allow_metered` | Allow the request and record pricing metadata |
allow_metered is intentionally narrow. It does not try to settle payment,
issue invoices, or do 402 handshakes. It records the metering event and signs a
receipt so that payment can be built later without changing the core decision
engine.
The CLI exists to make policy iteration less miserable:
```sh
go run ./cmd/crawlwall init --profile minimal
go run ./cmd/crawlwall policy check --config ./crawlwall.yaml
go run ./cmd/crawlwall policy eval \
  --config ./crawlwall.yaml --ua "GPTBot/1.1" \
  --path "/archive/a" --ip 20.125.66.81
go run ./cmd/crawlwall policy test \
  --config ./crawlwall.yaml --fixtures ./examples/policy-fixtures.yaml
go run ./cmd/crawlwall verifiers status --config ./crawlwall.yaml
go run ./cmd/crawlwall ledger report --db ./crawlwall.db --since 24h
go run ./cmd/crawlwall ledger export --db ./crawlwall.db --format jsonl
go run ./cmd/crawlwall ledger vacuum --db ./crawlwall.db --older-than 30d
go run ./cmd/crawlwall receipts verify \
  --file ./ledger-export.jsonl --public-key ./crawlwall.pub
```

Useful split:
- `init`: create a starting point
- `policy check`: validate and compile
- `policy eval`: answer "what would happen to this request?"
- `policy test`: run fixture-based policy regression tests
- `verifiers status`: show IP range verifier cache health
- `ledger report`: summarize observed traffic
- `ledger export`: dump the event log
- `ledger vacuum`: delete old events and compact the SQLite file
- `receipts verify`: validate signed receipt output
```
cmd/crawlwall/       CLI
docs/                usage guides
examples/            starter policies
internal/bot/        user-agent matching and bot registry
internal/config/     YAML load and validation
internal/ledger/     ledger interface and SQLite backend
internal/policy/     CEL environment, compile, evaluate
internal/ratelimit/  in-memory limiter
internal/receipt/    canonical receipts and Ed25519 signing
internal/scaffold/   starter templates for init
internal/verify/     reverse DNS and IP range verifiers
```
The main interface worth caring about is the ledger boundary. Request handling
depends only on an EventWriter contract: one fully-formed event in, storage
error out. Reporting and export are separate interfaces, so a Postgres,
webhook, or queue-backed writer does not have to pretend it is SQLite.
Included in V1:
- Caddy handler
- CEL policy engine
- reverse DNS verification
- IP range verification
- verifier cache status checks
- shadow mode for dry-run policy rollout
- SQLite ledger
- ledger retention cleanup
- signed receipts
- local reporting and export
- policy fixture tests
Deliberately not included in V1:
- payment processing
- dashboards
- distributed quotas
- extra webserver integrations
- policy languages beyond CEL
The current implementation has been exercised with:
- `go test ./...`
- custom Caddy builds through `xcaddy`
- Caddy config validation
- live requests through Caddy to a local upstream
- verifier cache status checks
- policy fixture tests
- ledger export
- receipt verification
- integration tests for blocked, shadowed, metered, and rate-limited flows
That means the current claim is modest but honest:
> CrawlWall is a self-hosted Caddy crawler access-control layer. It identifies bots, verifies identity, evaluates CEL policy rules, enforces allow/block/rate-limit/metered decisions, stores a crawler ledger, and exports signed crawl receipts.
Use GitHub Issues for bugs, security-relevant behavior questions, and integration reports. Include the Caddyfile, CrawlWall policy, request path, user agent, and observed ledger row when possible.
MIT. See LICENSE.
Cloudflare's pay-per-crawl docs were useful inspiration for separating metering from payment, but CrawlWall stays much smaller and self-hosted.