env0

Open-source mock workspace runtime for agent evaluation and local development.

env0 is derived from the mock environments in ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. It provides local, stateful mock services for productivity-agent evaluation and training: Gmail, Calendar, Drive, Docs, Slack, and more. The services expose REST APIs, web UIs, OpenAPI docs, MCP servers, deterministic seeds, and evaluation control endpoints for reset, snapshots, diffs, and action logs.

The repo is intentionally focused on environment runtime work: mock service development, local tooling, seed contracts, API parity, example task fixtures, and the shared Docker base image. Benchmark dashboards, canonical task authoring, and scoring policy live in downstream benchmark packages.

Quick Start

Prerequisites:

  • Python 3.12+
  • uv
  • Docker, only for Docker/base-image smoke checks

Start all configured services and the devhub:

scripts/dev.sh

Open the devhub at http://127.0.0.1:9060. It links to each service UI, OpenAPI docs, admin endpoints, and dev dashboard.

Useful development entry points:

scripts/dev.sh task gdrive-archive-stale-drafts  # start only the services declared by one task
scripts/smoke_dev.sh                             # launcher/control/devhub smoke test
python3 devhub/app.py --render-once              # render devhub once without starting services

Stop the local stack with Ctrl-C. Runtime databases are written under .data/dev/; remove that directory for a clean local slate.

Example Services

Service metadata is defined in config.toml. Control scripts, the devhub, Docker generation, and service CLIs read from that file.

Service      Port  Environment variable  API surface                Golden fixtures
mock-gmail   9001  MOCK_GMAIL_URL        Gmail API v1               35
mock-gcal    9002  MOCK_GCAL_URL         Calendar API v3            31
mock-gdrive  9003  MOCK_GDRIVE_URL       Drive API v3               42
mock-gdoc    9004  MOCK_GDOC_URL         Docs API v1 plus comments  6
mock-slack   9005  MOCK_SLACK_URL        Slack Web API              57

The fixture counts above are tracked in the current release gate documented in docs/parity-audit/AUDIT_RESULTS.md.
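
A client can resolve each service's base URL from the environment variables in the table. The following is a minimal sketch, not code from the repo: the helper names are hypothetical, and the fallback to 127.0.0.1 with the listed port is an assumption based on the devhub address.

```python
import os

# Ports copied from the service table; the 127.0.0.1 fallback host is an assumption.
DEFAULT_PORTS = {
    "mock-gmail": 9001,
    "mock-gcal": 9002,
    "mock-gdrive": 9003,
    "mock-gdoc": 9004,
    "mock-slack": 9005,
}


def env_var_name(service: str) -> str:
    """Map 'mock-gdrive' to 'MOCK_GDRIVE_URL', following the table's naming pattern."""
    return service.upper().replace("-", "_") + "_URL"


def service_url(service: str, environ=None) -> str:
    """Return the base URL for a service, preferring its environment variable."""
    environ = os.environ if environ is None else environ
    fallback = f"http://127.0.0.1:{DEFAULT_PORTS[service]}"
    return environ.get(env_var_name(service), fallback).rstrip("/")


if __name__ == "__main__":
    for name in DEFAULT_PORTS:
        print(name, service_url(name))
```

Passing an explicit mapping instead of reading os.environ directly keeps the helper easy to test without touching the process environment.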

Every service exposes the same operational shape:

  • / - product-style web UI over the live local state
  • /docs - OpenAPI reference for the replicated API
  • /health - liveness probe
  • /_admin/state - full state dump for evaluators
  • /_admin/diff - changes since the initial seed snapshot
  • /_admin/action_log - ordered API actions taken by the agent
  • /_admin/snapshot/{name} and /_admin/restore/{name} - named snapshots
  • /dev/* - development dashboards, API explorers, and DB viewers
  • /mcp - MCP tools for agent clients, when enabled by the service CLI
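
Because every service shares these paths, an evaluator can address them uniformly. The sketch below builds the admin URLs and fetches the read-only endpoints with the standard library. It assumes those endpoints answer plain GET requests with JSON bodies, which this README does not state, and all function names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote


def admin_url(base_url: str, endpoint: str) -> str:
    """Join a service base URL with an /_admin/ endpoint name."""
    return f"{base_url.rstrip('/')}/_admin/{endpoint}"


def snapshot_url(base_url: str, name: str) -> str:
    """Build the /_admin/snapshot/{name} URL, percent-encoding the snapshot name."""
    return admin_url(base_url, f"snapshot/{quote(name, safe='')}")


def get_json(url: str, timeout: float = 5.0):
    """GET a URL and decode the body as JSON (assumed response format)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def collect_evidence(base_url: str) -> dict:
    """Fetch the three evaluator-facing views of one service."""
    return {
        name: get_json(admin_url(base_url, name))
        for name in ("state", "diff", "action_log")
    }


if __name__ == "__main__":
    # Requires a running mock-gdrive on its listed port 9003.
    print(collect_evidence("http://127.0.0.1:9003"))
```

The HTTP method for creating or restoring a snapshot is not documented here, so the sketch only constructs that URL; check each service's /docs page before calling it.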

Tasks

example_tasks/ contains runnable env0 fixtures. Each task uses BenchFlow's native task.md package layout: one frontmatter-plus-prompt document, optional seed data, an oracle, a verifier, and a thin Dockerfile.

example_tasks/gdrive-archive-stale-drafts/
|-- task.md
|-- environment/Dockerfile
|-- data/needles.py
|-- oracle/solve.sh
`-- verifier/evaluate.py

env0's local launcher reads service selection from the benchflow.env0 extension namespace in task.md:

benchflow:
  env0:
    services:
      - mock-gdrive

The public launcher UX stays task-name based:

scripts/dev.sh task gdrive-archive-stale-drafts

Evaluators should score the final service state, the diff from the initial snapshot, and the action log. They should not depend on agent transcript text.
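
One way to structure such a verifier is to run named checks over the three evidence sources and nothing else. The sketch below is illustrative only: the Evidence container, the payload shapes in the usage example, and the fraction-of-checks reward are all assumptions, not BenchFlow's scoring policy or the actual verifier interface.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Evidence:
    """The three inputs a verifier scores; deliberately no transcript field."""
    state: Any       # parsed /_admin/state payload
    diff: Any        # parsed /_admin/diff payload
    action_log: Any  # parsed /_admin/action_log payload


Check = Callable[[Evidence], bool]


def score(evidence: Evidence, checks: dict[str, Check]) -> dict[str, Any]:
    """Run each named check and report the fraction that passed."""
    results = {name: bool(check(evidence)) for name, check in checks.items()}
    reward = sum(results.values()) / len(results) if results else 0.0
    return {"reward": reward, "checks": results}


if __name__ == "__main__":
    # Hypothetical payload shapes, for illustration only.
    evidence = Evidence(
        state={"files": [{"name": "draft.txt", "archived": True}]},
        diff={"updated": ["draft.txt"]},
        action_log=[{"method": "PATCH", "path": "/files/1"}],
    )
    checks = {
        "draft_archived": lambda e: all(f["archived"] for f in e.state["files"]),
        "no_deletes": lambda e: all(a["method"] != "DELETE" for a in e.action_log),
    }
    print(score(evidence, checks))
```

Keeping each check a pure function of the evidence makes verifier results reproducible from a saved state dump, diff, and action log.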

tasks/ contains additional BenchFlow-format task packages kept as a public reference set. They are not the source of truth for benchmark policy.

Docker Base Image

The shared base image is generated from this repo and tagged as:

ghcr.io/benchflow-ai/env0:<VERSION>

The VERSION file is the source of truth for the base image's semantic version. Thin task images should inherit from the base image and keep hidden task payloads under /var/lib/task.
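
As a sketch of how tooling could derive the tag, the helper below reads VERSION and composes the image reference. It assumes the file holds a bare MAJOR.MINOR.PATCH string with an optional pre-release suffix and no leading "v". The function names are hypothetical and this is not the logic in docker/build-base.sh.

```python
import re
from pathlib import Path

IMAGE_REPO = "ghcr.io/benchflow-ai/env0"

# MAJOR.MINOR.PATCH with an optional pre-release suffix. Build metadata ("+...")
# is excluded because "+" is not a valid character in a Docker tag.
SEMVER = re.compile(r"^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?$")


def base_image_tag(version_text: str) -> str:
    """Validate a version string and return the full base-image reference."""
    version = version_text.strip()
    if not SEMVER.match(version):
        raise ValueError(f"VERSION is not a semver string: {version!r}")
    return f"{IMAGE_REPO}:{version}"


def base_image_tag_from_file(path: str = "VERSION") -> str:
    """Read the VERSION file (assumed to sit at the repo root) and build the tag."""
    return base_image_tag(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    print(base_image_tag_from_file())
```

Failing fast on a malformed version avoids building or pushing an image under an unintended tag.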

Docker validation commands:

docker/build-base.sh
PULL_BASE=0 scripts/smoke_docker_examples.sh
docker/build-base.sh --push

Run the push command only when GHCR package permissions are configured.

Runtime Contracts

  • Use config.toml as the single source of truth for service metadata.
  • Use current mock-* service names and MOCK_*_URL environment variables.
  • Expose evaluator services through task.md frontmatter benchflow.environment.manifest.
  • Keep benchflow.env0.services for repo-local dev launcher task seeding only.
  • Do not infer services from Dockerfile text.
  • Keep raw --task-data and task-data-path plumbing internal to env CLIs, control scripts, and Dockerfiles.
  • Do not copy env source code into task images.
  • Keep hidden task data unreadable by the normal agent user.
  • Update docs when changing seed, Docker, launcher, or devhub contracts.

Validation

Run the checks that match the change:

scripts/smoke_dev.sh
python3 devhub/app.py --render-once
cd packages/environments/mock-gdrive && uv run --extra dev pytest tests -q
cd packages/environments/mock-gdrive && uv run --extra dev pytest tests/test_conformance.py -q
PULL_BASE=0 scripts/smoke_docker_examples.sh
BENCHFLOW_REWARD_LENIENT=1 bench eval run \
  --tasks-dir example_tasks --agent oracle --sandbox docker \
  --context-root . --jobs-dir .local/bf-jobs-public-examples

Use the per-service pytest command for the service you changed. Docker checks are required before and after Dockerfile or base-image changes. The bench eval run command is the end-to-end task check: it builds task images, starts the public mock services through tasks/_manifests/env-0.toml, runs each oracle, and scores the result with each task's verifier.

Repo Layout

env0/
|-- packages/environments/   # mock-gmail, mock-gcal, mock-gdoc, mock-gdrive, mock-slack
|-- devhub/                  # local dev dashboard on port 9060
|-- docker/                  # base-image generation and gws wrapper
|-- docs/                    # guides, parity audit, validated workflows
|-- example_tasks/           # runnable env0 task fixtures
|-- tasks/                   # public BenchFlow-format reference tasks
|-- scripts/                 # dev.sh, env0_control.py, smoke tests
|-- config.toml              # service and port metadata
`-- VERSION                  # base-image version

Documentation

Related Repos

  • benchflow - evaluation framework, task standard, and agent runners.
  • ClawsBench - public benchmark built on env0 environments.

License

env0 is licensed under the GNU Affero General Public License v3.0 only (AGPL-3.0-only). See LICENSE.
