env0

Open-source mock workspace runtime for agent evaluation and local development.

env0 inherits from the mock environments in ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. env0 provides local, stateful mock services for productivity-agent evaluation and training: Gmail, Calendar, Drive, Docs, Slack, and more. The services expose REST APIs, web UIs, OpenAPI docs, MCP servers, deterministic seeds, and evaluation control endpoints for reset, snapshots, diffs, and action logs.

The repo is intentionally focused on environment runtime work: mock service development, local tooling, seed contracts, API parity, example task fixtures, and the shared Docker base image. Benchmark dashboards, canonical task authoring, and scoring policy live in downstream benchmark packages.

Quick Start

Prerequisites:

Python 3.12+
uv
Docker (needed only for the Docker and base-image smoke checks)

Start all configured services and the devhub:

scripts/dev.sh

Open the devhub at http://127.0.0.1:9060. It links to each service UI, OpenAPI docs, admin endpoints, and dev dashboard.

Useful development entry points:

scripts/dev.sh task gdrive-archive-stale-drafts  # start only the services declared by one task
scripts/smoke_dev.sh                             # launcher/control/devhub smoke test
python3 devhub/app.py --render-once              # render devhub once without starting services

Stop the local stack with Ctrl-C. Runtime databases are written under .data/dev/; remove that directory for a clean local slate.

Example Services

Service metadata is defined in config.toml. Control scripts, the devhub, Docker generation, and service CLIs read from that file.

Service	Port	Environment variable	API surface	Golden fixtures
`mock-gmail`	9001	`MOCK_GMAIL_URL`	Gmail API v1	35
`mock-gcal`	9002	`MOCK_GCAL_URL`	Calendar API v3	31
`mock-gdrive`	9003	`MOCK_GDRIVE_URL`	Drive API v3	42
`mock-gdoc`	9004	`MOCK_GDOC_URL`	Docs API v1 plus comments	6
`mock-slack`	9005	`MOCK_SLACK_URL`	Slack Web API	57

The fixture counts above are tracked in the current release gate documented in docs/parity-audit/AUDIT_RESULTS.md.

Every service exposes the same operational shape:

/ - product-style web UI over the live local state
/docs - OpenAPI reference for the replicated API
/health - liveness probe
/_admin/state - full state dump for evaluators
/_admin/diff - changes since the initial seed snapshot
/_admin/action_log - ordered API actions taken by the agent
/_admin/snapshot/{name} and /_admin/restore/{name} - named snapshots
/dev/* - development dashboards, API explorers, and DB viewers
/mcp - MCP tools for agent clients, when enabled by the service CLI

Tasks

example_tasks/ contains runnable env0 fixtures. Each task uses BenchFlow's native task.md package layout: one frontmatter-plus-prompt document, optional seed data, an oracle, a verifier, and a thin Dockerfile.

example_tasks/gdrive-archive-stale-drafts/
|-- task.md
|-- environment/Dockerfile
|-- data/needles.py
|-- oracle/solve.sh
`-- verifier/evaluate.py

env0's local launcher reads service selection from the benchflow.env0 extension namespace in task.md:

benchflow:
  env0:
    services:
      - mock-gdrive
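
To show how this namespace can be read, here is a standard-library-only sketch that pulls the service list out of a task.md string. It is not the launcher's actual parser. It assumes the frontmatter is delimited by `---` lines, and it handles only the nested-mapping-plus-list shape shown above. A real tool would more likely use a full YAML parser.

```python
def frontmatter_lines(task_md: str) -> list[str]:
    """Return the lines between the opening and closing '---' markers."""
    lines = task_md.splitlines()
    if lines and lines[0].strip() == "---":
        body = lines[1:]
        for i, line in enumerate(body):
            if line.strip() == "---":
                return body[:i]
        return body
    return lines


def env0_services(task_md: str) -> list[str]:
    """Collect the list items under benchflow -> env0 -> services."""
    path: list[tuple[int, str]] = []  # (indent, key) for each open mapping
    services: list[str] = []
    for raw in frontmatter_lines(task_md):
        stripped = raw.strip()
        if not stripped or stripped.startswith("#"):
            continue
        indent = len(raw) - len(raw.lstrip())
        if stripped.startswith("- "):
            if [key for _, key in path] == ["benchflow", "env0", "services"]:
                services.append(stripped[2:].strip())
            continue
        key, sep, value = stripped.partition(":")
        if sep:
            # Close mappings at the same or deeper indentation before opening a new one.
            while path and path[-1][0] >= indent:
                path.pop()
            if not value.strip():
                path.append((indent, key.strip()))
    return services
```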

The public launcher UX stays task-name based:

scripts/dev.sh task gdrive-archive-stale-drafts

Evaluators should score the final service state, the diff from the initial snapshot, and the action log. They should not depend on agent transcript text.
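
As an illustration of that rule, the sketch below scores a run from three decoded payloads and never looks at transcript text. The payload shapes are hypothetical. The field names `files`, `archived`, `deleted`, `method`, and `path` are invented for this example and are not the services' documented schema.

```python
def score(state: dict, diff: dict, action_log: list[dict]) -> float:
    """Score a run from service evidence only; no transcript text is consulted."""
    # Hypothetical payload shapes, for illustration only:
    #   state["files"]   -> list of {"name": str, "archived": bool}
    #   diff["deleted"]  -> list of ids removed since the seed snapshot
    #   action_log items -> {"method": str, "path": str}
    drafts = [f for f in state.get("files", []) if f["name"].startswith("draft-")]
    archived = bool(drafts) and all(f.get("archived") for f in drafts)
    nothing_deleted = not diff.get("deleted")
    made_a_write = any(a.get("method") != "GET" for a in action_log)
    checks = [archived, nothing_deleted, made_a_write]
    # Each check contributes equally to the final score.
    return sum(checks) / len(checks)
```

Taking the payloads as arguments lets the same function run against live `/_admin` responses or stored fixtures.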

tasks/ contains additional BenchFlow-format task packages kept as a public reference set. They are not the source of truth for benchmark policy.

Docker Base Image

The shared base image is generated from this repo and tagged as:

ghcr.io/benchflow-ai/env0:<VERSION>

VERSION is the base-image semver source of truth. Thin task images should inherit from the base image and keep hidden task payload under /var/lib/task.
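
The tag can be derived from the contents of VERSION. The sketch below assumes the file holds a single bare MAJOR.MINOR.PATCH string. It rejects anything else, including pre-release suffixes that full semver would allow. `base_image_ref` is a hypothetical helper, not part of the repo's scripts.

```python
IMAGE = "ghcr.io/benchflow-ai/env0"


def base_image_ref(version_text: str) -> str:
    """Return the base-image reference for the given VERSION file contents."""
    version = version_text.strip()
    parts = version.split(".")
    # Assumption: VERSION is a bare MAJOR.MINOR.PATCH string with numeric parts.
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"unexpected VERSION contents: {version!r}")
    return f"{IMAGE}:{version}"
```

From the repo root this could be called as `base_image_ref(Path("VERSION").read_text())`, with `Path` imported from `pathlib`.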

Docker validation commands:

docker/build-base.sh
PULL_BASE=0 scripts/smoke_docker_examples.sh
docker/build-base.sh --push

Run the push command only when GHCR package permissions are configured.

Runtime Contracts

Use config.toml as the single source of truth for service metadata.
Use current mock-* service names and MOCK_*_URL environment variables.
Expose evaluator services through task.md frontmatter benchflow.environment.manifest.
Keep benchflow.env0.services for repo-local dev launcher task seeding only.
Do not infer services from Dockerfile text.
Keep raw --task-data and task-data-path plumbing internal to env CLIs, control scripts, and Dockerfiles.
Do not copy env source code into task images.
Keep hidden task data unreadable by the normal agent user.
Update docs when changing seed, Docker, launcher, or devhub contracts.

Validation

Run the checks that match the change:

scripts/smoke_dev.sh
python3 devhub/app.py --render-once
cd packages/environments/mock-gdrive && uv run --extra dev pytest tests -q
cd packages/environments/mock-gdrive && uv run --extra dev pytest tests/test_conformance.py -q
PULL_BASE=0 scripts/smoke_docker_examples.sh
BENCHFLOW_REWARD_LENIENT=1 bench eval run \
  --tasks-dir example_tasks --agent oracle --sandbox docker \
  --context-root . --jobs-dir .local/bf-jobs-public-examples

Use the per-service pytest command for the service you changed. Docker checks are required before and after Dockerfile or base-image changes. The bench eval run command is the end-to-end task check: it builds task images, starts the public mock services through tasks/_manifests/env-0.toml, runs each oracle, and scores each verifier.

Repo Layout

env0/
|-- packages/environments/   # mock-gmail, mock-gcal, mock-gdoc, mock-gdrive, mock-slack
|-- devhub/                  # local dev dashboard on port 9060
|-- docker/                  # base-image generation and gws wrapper
|-- docs/                    # guides, parity audit, validated workflows
|-- example_tasks/           # runnable env0 task fixtures
|-- tasks/                   # public BenchFlow-format reference tasks
|-- scripts/                 # dev.sh, env0_control.py, smoke tests
|-- config.toml              # service and port metadata
`-- VERSION                  # base-image version

Documentation

Related Repos

benchflow - evaluation framework, task standard, and agent runners.
ClawsBench - public benchmark built on env0 environments.

License

env0 is licensed under the GNU Affero General Public License v3.0 only (AGPL-3.0-only). See LICENSE.
