env0 inherits from the mock environments in ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. env0 provides local, stateful mock services for productivity-agent evaluation and training: Gmail, Calendar, Drive, Docs, Slack, and more. The services expose REST APIs, web UIs, OpenAPI docs, MCP servers, deterministic seeds, and evaluation control endpoints for reset, snapshots, diffs, and action logs.
The repo is intentionally focused on environment runtime work: mock service development, local tooling, seed contracts, API parity, example task fixtures, and the shared Docker base image. Benchmark dashboards, canonical task authoring, and scoring policy live in downstream benchmark packages.
Prerequisites:
- Python 3.12+
- uv
- Docker, only for Docker/base-image smoke checks
Start all configured services and the devhub:
scripts/dev.sh
Open the devhub at http://127.0.0.1:9060. It links to each service UI,
OpenAPI docs, admin endpoints, and dev dashboard.
Useful development entry points:
scripts/dev.sh task gdrive-archive-stale-drafts # start only the services declared by one task
scripts/smoke_dev.sh # launcher/control/devhub smoke test
python3 devhub/app.py --render-once # render devhub once without starting services
Stop the local stack with Ctrl-C. Runtime databases are written under
.data/dev/; remove that directory for a clean local slate.
Service metadata is defined in config.toml. Control scripts,
the devhub, Docker generation, and service CLIs read from that file.
| Service | Port | Environment variable | API surface | Golden fixtures |
|---|---|---|---|---|
| mock-gmail | 9001 | MOCK_GMAIL_URL | Gmail API v1 | 35 |
| mock-gcal | 9002 | MOCK_GCAL_URL | Calendar API v3 | 31 |
| mock-gdrive | 9003 | MOCK_GDRIVE_URL | Drive API v3 | 42 |
| mock-gdoc | 9004 | MOCK_GDOC_URL | Docs API v1 plus comments | 6 |
| mock-slack | 9005 | MOCK_SLACK_URL | Slack Web API | 57 |
The fixture counts above are tracked in the current release gate documented in
docs/parity-audit/AUDIT_RESULTS.md.
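As an illustration of how a client might use the table above, here is a minimal sketch that resolves a service's base URL. It is not env0's own code: the `service_url` helper is hypothetical, and the fallback to `http://127.0.0.1:<port>` when the environment variable is unset is an assumption based on the devhub address.

```python
import os

# Service name -> (environment variable, port), copied from the table above.
SERVICES = {
    "mock-gmail": ("MOCK_GMAIL_URL", 9001),
    "mock-gcal": ("MOCK_GCAL_URL", 9002),
    "mock-gdrive": ("MOCK_GDRIVE_URL", 9003),
    "mock-gdoc": ("MOCK_GDOC_URL", 9004),
    "mock-slack": ("MOCK_SLACK_URL", 9005),
}


def service_url(name: str, environ=None) -> str:
    """Return the base URL for a mock service (hypothetical helper)."""
    env = os.environ if environ is None else environ
    env_var, port = SERVICES[name]
    # Assumption: services listen on 127.0.0.1 when no override is set.
    return env.get(env_var, f"http://127.0.0.1:{port}").rstrip("/")
```

Passing an explicit `environ` mapping keeps the helper easy to test; in normal use it reads `os.environ`.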
Every service exposes the same operational shape:
- / - product-style web UI over the live local state
- /docs - OpenAPI reference for the replicated API
- /health - liveness probe
- /_admin/state - full state dump for evaluators
- /_admin/diff - changes since the initial seed snapshot
- /_admin/action_log - ordered API actions taken by the agent
- /_admin/snapshot/{name} and /_admin/restore/{name} - named snapshots
- /dev/* - development dashboards, API explorers, and DB viewers
- /mcp - MCP tools for agent clients, when enabled by the service CLI
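The operational endpoints listed here can be addressed uniformly across services. The following is a minimal sketch under stated assumptions: `admin_urls` and `is_healthy` are hypothetical helpers, the snapshot name `baseline` is illustrative, and the liveness probe is assumed to answer a plain GET with HTTP 200. The HTTP methods for the snapshot and restore endpoints are not specified here, so the sketch only builds their URLs.

```python
import urllib.request


def admin_urls(base_url: str, snapshot: str = "baseline") -> dict[str, str]:
    """Build the operational URLs for one service (hypothetical helper)."""
    base = base_url.rstrip("/")
    return {
        "health": f"{base}/health",
        "state": f"{base}/_admin/state",
        "diff": f"{base}/_admin/diff",
        "action_log": f"{base}/_admin/action_log",
        "snapshot": f"{base}/_admin/snapshot/{snapshot}",
        "restore": f"{base}/_admin/restore/{snapshot}",
    }


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if /health answers HTTP 200 (assumes a plain GET)."""
    url = admin_urls(base_url)["health"]
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors and urllib's URLError/HTTPError.
        return False
```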
example_tasks/ contains runnable env0 fixtures. Each task
uses BenchFlow's native task.md package layout: one frontmatter-plus-prompt
document, optional seed data, an oracle, a verifier, and a thin Dockerfile.
example_tasks/gdrive-archive-stale-drafts/
|-- task.md
|-- environment/Dockerfile
|-- data/needles.py
|-- oracle/solve.sh
`-- verifier/evaluate.py
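A quick structural check of a task package might look like the sketch below. It assumes every task uses the same file names as the example layout above, which may not hold for all tasks; `missing_task_files` is a hypothetical helper, not part of env0 or BenchFlow.

```python
from pathlib import Path

# File names taken from the example layout above.
# Seed data is optional, so data/ is not required here.
REQUIRED_FILES = (
    "task.md",
    "environment/Dockerfile",
    "oracle/solve.sh",
    "verifier/evaluate.py",
)


def missing_task_files(task_dir) -> list[str]:
    """Return the required files that are absent from a task directory."""
    root = Path(task_dir)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
```

An empty list means the directory has the expected shape; it says nothing about whether the contents are valid.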
env0's local launcher reads service selection from the benchflow.env0
extension namespace in task.md:
benchflow:
  env0:
    services:
      - mock-gdrive
The public launcher UX stays task-name based:
scripts/dev.sh task gdrive-archive-stale-drafts
Evaluators should score the final service state, the diff from the initial snapshot, and the action log. They should not depend on agent transcript text.
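To make the evaluator guidance concrete, here is a minimal sketch of a verifier that scores only service-side evidence. The payload shapes are hypothetical: it assumes the diff lists changed records under an `updated` key with an `id` field, and that each action-log entry records an HTTP `method`. The real `/_admin/diff` and `/_admin/action_log` responses may differ, so treat the field names as placeholders.

```python
def evaluate(diff: dict, action_log: list[dict], expected_ids: set[str]) -> dict:
    """Score a run from the diff and action log only (hypothetical shapes)."""
    # Assumption: changed records appear under "updated", each with an "id".
    updated_ids = {item["id"] for item in diff.get("updated", [])}
    # Assumption: each action-log entry records the HTTP method of the call.
    destructive = [a for a in action_log if a.get("method") == "DELETE"]
    checks = {
        "expected_changes_present": expected_ids <= updated_ids,
        "no_unexpected_changes": updated_ids <= expected_ids,
        "no_destructive_calls": not destructive,
    }
    return {"passed": all(checks.values()), "checks": checks}
```

Nothing in the function reads agent output, so two agents that reach the same final state through the same API calls receive the same score regardless of what they wrote in their transcripts.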
tasks/ contains additional BenchFlow-format task packages kept as a
public reference set. They are not the source of truth for benchmark policy.
The shared base image is generated from this repo and tagged as:
ghcr.io/benchflow-ai/env0:<VERSION>
VERSION is the base-image semver source of truth. Thin task images
should inherit from the base image and keep hidden task payload under
/var/lib/task.
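Deriving the image tag from VERSION can be sketched as follows. This assumes the VERSION file holds a single version string, possibly with a trailing newline; the helper names are hypothetical and this is not the logic in docker/build-base.sh.

```python
from pathlib import Path

IMAGE_REPO = "ghcr.io/benchflow-ai/env0"


def base_image_tag(version_text: str) -> str:
    """Compose the base-image reference from the contents of VERSION."""
    version = version_text.strip()
    if not version:
        raise ValueError("VERSION is empty")
    return f"{IMAGE_REPO}:{version}"


def read_base_image_tag(repo_root=".") -> str:
    """Read VERSION from the repo root and return the full image reference."""
    text = (Path(repo_root) / "VERSION").read_text(encoding="utf-8")
    return base_image_tag(text)
```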
Docker validation commands:
docker/build-base.sh
PULL_BASE=0 scripts/smoke_docker_examples.sh
docker/build-base.sh --push
Run the push command only when GHCR package permissions are configured.
- Use config.toml as the single source of truth for service metadata.
- Use current mock-* service names and MOCK_*_URL environment variables.
- Expose evaluator services through task.md frontmatter benchflow.environment.manifest.
- Keep benchflow.env0.services for repo-local dev launcher task seeding only.
- Do not infer services from Dockerfile text.
- Keep raw --task-data and task-data-path plumbing internal to env CLIs, control scripts, and Dockerfiles.
- Do not copy env source code into task images.
- Keep hidden task data unreadable by the normal agent user.
- Update docs when changing seed, Docker, launcher, or devhub contracts.
Run the checks that match the change:
scripts/smoke_dev.sh
python3 devhub/app.py --render-once
cd packages/environments/mock-gdrive && uv run --extra dev pytest tests -q
cd packages/environments/mock-gdrive && uv run --extra dev pytest tests/test_conformance.py -q
PULL_BASE=0 scripts/smoke_docker_examples.sh
BENCHFLOW_REWARD_LENIENT=1 bench eval run \
--tasks-dir example_tasks --agent oracle --sandbox docker \
--context-root . --jobs-dir .local/bf-jobs-public-examples
Use the per-service pytest command for the service you changed. Docker checks are
required before and after Dockerfile or base-image changes. The bench eval run
command is the end-to-end task check: it builds task images, starts the public
mock services through tasks/_manifests/env-0.toml, runs each oracle, and scores
each verifier.
env0/
|-- packages/environments/ # mock-gmail, mock-gcal, mock-gdoc, mock-gdrive, mock-slack
|-- devhub/ # local dev dashboard on port 9060
|-- docker/ # base-image generation and gws wrapper
|-- docs/ # guides, parity audit, validated workflows
|-- example_tasks/ # runnable env0 task fixtures
|-- tasks/ # public BenchFlow-format reference tasks
|-- scripts/ # dev.sh, env0_control.py, smoke tests
|-- config.toml # service and port metadata
`-- VERSION # base-image version
- Docs index
- Local dev and devhub
- Adding a new environment
- API validation playbook
- Parity audit
- Validated workflows
- Run public tasks with BenchFlow
- Good first contributions
- Contributing
- Security policy
- benchflow - evaluation framework, task standard, and agent runners.
- ClawsBench - public benchmark built on env0 environments.
env0 is licensed under the GNU Affero General Public License v3.0 only
(AGPL-3.0-only). See LICENSE.
