Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
93 changes: 93 additions & 0 deletions docs/design/agent-workflows/open-issues.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,93 @@
# Open issues

Deferred TODOs and open questions for the agent-workflows project. Each entry carries enough
context and provenance to act on cold. See the `defer-todo` skill for the format.

## Open issues

### Supply secret values to tools during a standalone run

**Status:** open
**Added:** 2026-06-19
**Commit:** 6a812efb95 (branch `gitbutler/workspace`)
**Project:** [agent-workflows/sdk-local-tools](./sdk-local-tools/)
**Source:** sdk-local-tools design review session (answering reviewer comments on plan.md, Decision 3)

**The problem.** An agent's `code` tool declares the secrets it needs by name, for example
`secrets: ["GITHUB_TOKEN"]`. An MCP server declares an env-var-to-name map. The config stores
only the name, never the value. At run time something must turn the name into a value and
inject it as an env var, into the sandbox subprocess for a code tool or into the server
process for MCP. On the Agenta server path that consumer is meant to read the `custom_secret`
entries from the project vault by name, but it is not built yet. Custom secrets are
storage-only this iteration by design, so today a declared secret resolves to nothing and the
tool runs without it. For a standalone run the goal is different: do not depend on the Agenta
vault at all. The trouble is the SDK has no secret resolution of any kind today. Resolution
lives only in the service. So a standalone agent has no way to supply `GITHUB_TOKEN` to its
code tool.

**Why it is deferred.** The first slice is offline and narrow (built-in plus code tools with
env secrets). Env alone closes the gap for that slice, so the wider question of where secret
values come from does not block it. The vault path also depends on work another effort owns
(see below), so it cannot land here yet.

**What to decide or do.** Pick the local source of secret values for a standalone run. Three
options, and they are not exclusive.

1. Env, the offline default. Read the declared name straight from the process environment.
`GITHUB_TOKEN` comes from `os.environ`. Simple, offline, and enough for the first slice.
2. A pluggable `SecretResolver` interface. The user implements it. Env is the built-in
default, but they can back it with a `.env` file, a secret manager, or their own vault. A
small interface for a lot of flexibility.
3. The Agenta vault over HTTP. It reads `custom_secret` entries by name. This needs the
future runtime consumer endpoint to be built, and it is the connected-standalone path
only.

The lean: ship option 1 as the default and option 2 as the interface it plugs into, both in
the first slice. Treat option 3 as later work tied to the named-secrets effort building the
consumer. See [./sdk-local-tools/plan.md](./sdk-local-tools/plan.md) (Decision 3 and Phase
4), [./sdk-local-tools/research.md](./sdk-local-tools/research.md) (stage 3), and
[../vault-named-secrets/](../vault-named-secrets/).

### Batch the two vault round-trips on the agent invoke path

**Status:** open
**Added:** 2026-06-19
**Commit:** 6a812efb95 (branch `gitbutler/workspace`)
**Project:** [agent-workflows/sdk-local-tools](./sdk-local-tools/)
**Source:** xhigh code review of the sdk-local-tools first slice

**The problem.** The service resolves a code tool's named secrets inside `resolve_tools` and an
MCP server's named secrets inside `resolve_mcp_servers`. Each one builds its own
`_VaultSecretResolver` and makes its own `POST /secrets/resolve`. When a config has both a code
tool and an enabled MCP server that each declare secrets, a cold invoke makes two sequential
vault round-trips where one batched call over the union of names would do. The two functions
are also awaited one after another in `app.py`, not concurrently.

**Why it is deferred.** MCP is flag-gated off this release, so the second round-trip does not
happen on the default path yet. The cost only appears once MCP turns on. Fixing it now would be
optimizing a path no one runs.

**What to decide or do.** When MCP comes off the flag (sdk-local-tools Phase 5), resolve the
union of code-tool and MCP secret names in one vault call, and consider running the independent
resolve steps in `app.py` concurrently.

### Give the resolved-tool shape one source of truth

**Status:** open
**Added:** 2026-06-19
**Commit:** 6a812efb95 (branch `gitbutler/workspace`)
**Project:** [agent-workflows/sdk-local-tools](./sdk-local-tools/)
**Source:** xhigh code review of the sdk-local-tools first slice

**The problem.** `ResolvedTools` (the SDK resolver's return type) carries the same four fields,
`builtin_tools` / `custom_tools` / `tool_callback` / `mcp_servers`, that `SessionConfig` already
declares. The two shapes can drift: when a new per-tool wire field lands, a maintainer has to
add it in both places, and missing one silently drops the field from either the standalone path
or the service path.

**Why it is deferred.** The duplication is small and the two types serve different layers today
(a resolver result versus the full session bundle). It is a maintainability touch-point, not a
bug.

**What to decide or do.** Decide whether the resolver should return the `SessionConfig` tool
fields directly (or a shared sub-model both reuse), so the wire tool shape has one definition.
40 changes: 40 additions & 0 deletions docs/design/agent-workflows/sdk-local-tools/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
# SDK local tools

This folder plans one feature: letting a standalone Agenta Python SDK user run an agent's
**tools** during a fully local run, with no Agenta backend service and no rivet sidecar.

It complements the sibling effort in
[`../scratch/sdk-local-backend/status.md`](../scratch/sdk-local-backend/status.md). That one
moves the agent runtime into the SDK and builds `LocalBackend` (the engine that runs a
harness on the user's own machine). This one builds the tool layer on top of that engine:
resolving the agent's tool references into runnable specs, and supplying the secrets those
tools need, all without calling Agenta.

This workspace covers the original feature plan, the organization proposal, its implementation,
and the completed review remediation. `status.md` is the source of truth.

## Read in this order

1. **[context.md](context.md)**: why this matters, the standalone promise it keeps, the
goal, and the non-goals. Start here.
2. **[research.md](research.md)**: the verified current-state map. Where every piece lives
today (with `file:line`), what works, and what is missing, broken down per tool kind.
3. **[organization-proposal.md](organization-proposal.md)**: the recommended package,
naming, exception, validation, and migration structure.
4. **[status.md](status.md)**: the source of truth for progress and the concrete first
steps for the next agent.
5. **[codebase-conventions.md](codebase-conventions.md)**: repository patterns learned during
the organization review, including strong precedents and legacy patterns not to copy.
6. **[plan.md](plan.md)**: the earlier phased behavior plan. Its product decisions remain
useful context; its proposed file and symbol names are not final.

## The one-sentence goal

A standalone SDK user pulls an agent config from Agenta, builds a `LocalBackend`, and runs
the agent locally **with its tools**: built-in, code, and (later) gateway, client, and MCP.

## Prerequisite

This work assumes `LocalBackend` exists. It does not yet; it is a stub that raises (see
research.md). Read `../scratch/sdk-local-backend/status.md` for that effort's state. The
phases here are sequenced so the first tool slice lands right after `LocalBackend` does.
201 changes: 201 additions & 0 deletions docs/design/agent-workflows/sdk-local-tools/codebase-conventions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,201 @@
# Codebase conventions learned

This is a fresh repository review. It records observed patterns and distinguishes the
patterns worth following from legacy or inconsistent code that should not become precedent.

## Organization

- Growing backend domains are directories, not a sequence of suffixed files.
`api/oss/src/core/tools/` uses `dtos.py`, `interfaces.py`, `service.py`, `registry.py`,
`exceptions.py`, and provider subpackages.
- The SDK also nests substantial subsystems. `sdks/python/agenta/sdk/engines/running/`
contains interfaces, registries, handlers, runners, errors, templates, and sandbox logic.
- `sdks/python/agenta/sdk/evaluations/runtime/` is a particularly relevant shared-runtime
package. API core imports those SDK model classes directly from
`api/oss/src/core/evaluations/runtime/types.py` and defines only API-specific additions.
- Small composition areas remain flat. `services/oss/src/agent/` has one file per concern:
`app.py`, `client.py`, `config.py`, `schemas.py`, `secrets.py`, `tools.py`, `tracing.py`.
- Tests are organized by runner, test boundary, then domain. Unit tests do not require a
running backend; HTTP adapter behavior belongs in integration tests.
- `__init__.py` is generally used as a curated import surface, while implementations remain
in role-specific modules.

## File naming

- Common role names in maintained API core domains are `dtos.py`, `interfaces.py`,
`service.py`, `types.py`, `utils.py`, and `exceptions.py`.
- Semantic names are preferable when the role is clearer than a generic bucket:
`streaming.py`, `registry.py`, `delivery.py`, `context.py`, `mappings.py`.
- Abbreviations such as `defs` are uncommon as file-role names and do not identify whether
values are configuration, DTOs, schemas, or runtime specifications.
- `utils.py` exists widely but often becomes a mixed bucket. New code should use a semantic
filename when one is available.
- A domain-qualified generic filename can be clear (`tools/registry.py`); the same generic
filename at a broad level (`agents/tool_resolution.py`) tends to accumulate unrelated
concerns.

## Class naming

- Pydantic classes are nouns that describe their lifecycle or API role:
`ToolConnectionCreate`, `ToolExecutionRequest`, `WorkflowRevisionData`,
`HarnessAgentConfig`.
- Abstract contracts use both plain domain names in the SDK (`Backend`, `Harness`,
`Session`) and `*Interface` in API core (`ToolsDAOInterface`,
`ToolsGatewayInterface`). Consistency within one subsystem matters more than copying one
suffix globally.
- Implementations identify their mechanism or provider:
`InProcessPiBackend`, `RivetBackend`, `ToolsGatewayRegistry`.
- Domain exceptions generally end in `Error` and often inherit from a domain base, as in
`ToolsError`, `EmbedError`, and `UnsupportedHarnessError`.
- The codebase usually capitalizes well-known acronyms in class names (`API`, `LLM`), though
there are mixed conventions for newer acronyms. A subsystem should choose one spelling
deliberately and keep it consistent.

## Function and method naming

- Functions use verb-first snake_case names: `resolve_json_pointer`,
`create_connection`, `list_integrations`, `make_oauth_state`.
- Async methods generally use the same domain verb as sync behavior. Some older SDK modules
use `a*` prefixes such as `acreate`; this is legacy style and should not be copied into the
new agent runtime.
- Boolean helpers read as predicates (`supports`, `is_active`, `is_valid`) or explicit
questions (`_mcp_enabled`).
- Mapper functions are clearest when named for the conversion (`decode_oauth_state`,
`_to_harness_config`). Generic verbs such as `attach` should name the data being attached.
- Private helpers use a leading underscore. A service importing another module's private or
implementation helper usually indicates the package boundary is wrong.

## Function structure

- New API code prefers keyword-only parameters using `*`.
- Long signatures and calls use visual grouping with `#` separators in API core.
- Services orchestrate dependencies; pure conversion belongs in focused helpers or mapping
modules.
- Adapters own provider-specific behavior. Core orchestration depends on interfaces rather
than concrete HTTP or database implementations.
- Constructors receive dependencies explicitly. `ToolsService` receives its DAO and adapter
registry; agent `Environment` receives a backend.
- Return contracts are typed models when they cross layers. Positional tuples are used in
places but are less extensible and less self-documenting.
- Mutating a result model after construction is not a dominant pattern for domain results;
constructing a complete return value is clearer.

## Variables

- Domain-specific names are preferred at boundaries: `project_id`, `provider_key`,
`integration_key`, `connection_create`.
- Short names such as `p`, `out`, `defs`, `entry`, and `values` appear in implementation
code but are weaker choices when the value crosses several stages.
- Names should preserve lifecycle: `tool_config`, `tool_spec`, `secret_names`,
`secret_values`, and `resolved_tools`.
- Wire-specific camelCase is isolated to serialization or explicit wire dictionaries.
Python domain fields remain snake_case.

## Models and DTOs

- API core keeps domain contracts in `dtos.py` or `types.py`; FastAPI request and response
envelopes live in API-layer `models.py`.
- Database entities do not leak through service or router contracts.
- Discriminated Pydantic unions are already used for tool references and workflow data.
- Request, response, config, and resolved/runtime values are usually distinguishable by
suffixes such as `Create`, `Request`, `Response`, `Config`, and `Data`.
- Reusing `Dict[str, Any]` for a stable cross-language contract weakens validation and makes
drift harder to detect. Stable wire objects should have typed models and golden
serialization tests.
- Canonical executable configuration should reject unexpected fields. Permissive legacy
handling belongs in a named compatibility adapter.
- Mutable Pydantic defaults such as `items: List[...] = []` exist in older code. New models
should use `Field(default_factory=list)` and `Field(default_factory=dict)`.

## Docstrings and comments

- The explicit API guidance says comments should explain non-obvious reasons, invariants,
workarounds, or concurrency/security hazards, not narrate the code.
- Good public docstrings state the contract and important behavior in one paragraph.
- Module docstrings are useful for dependency boundaries and protocol ownership.
- Detailed architecture history, phased decisions, and product rationale belong in
`docs/design/`, not in every source module.
- Section separators are common in large modules, but needing many sections can also be a
signal that the file should become a package.
- Comments around secret isolation, callback reachability, and compatibility aliases are
justified because they preserve security or backwards compatibility.

## Exceptions

- New API architecture defines domain exceptions in core and translates them at the API
boundary.
- Domain bases allow broad and narrow handling (`ToolsError` and its subclasses).
- Useful exceptions carry structured fields such as provider, connection, status, or
conflicting identifiers.
- `HTTPException` should not leak into core services or SDK runtime logic.
- Broad catches and best-effort behavior exist for cleanup and optional telemetry. They are
appropriate only when failure is intentionally non-fatal.
- Broad catches around required configuration, tool resolution, or declared secrets can
hide correctness failures and need an explicit policy.
- Raising generic `RuntimeError` while a typed domain error exists is inconsistent and loses
machine-readable context.

## Dependency direction

- API routing depends on core services; core services depend on interfaces; adapters
implement those interfaces.
- The agent service already depends on SDK runtime contracts. The SDK must not import the
service.
- The TypeScript runner consumes resolved runtime specs; it should not understand loose
playground compatibility shapes.
- Compatibility parsing should happen once, before resolution.
- HTTP gateway resolution and vault access are adapters, not properties of the tool models.
- When the SDK owns neutral runtime models, API code can import the same classes rather than
maintaining a parallel union.

## Testing

- SDK unit tests should cover parsing, pure mapping, policy, and serialization without a
backend.
- Service integration tests should cover HTTP status handling, authentication headers,
malformed responses, and timeouts.
- Golden files are already used for the Python-to-TypeScript runner contract and are the
right mechanism for tool-spec drift.
- Tests should be named for behavior, not internal helper names, when possible.
- Security behavior needs direct tests: secret scoping, missing-secret handling, redaction,
and no ambient environment leakage.
- Correlation and identity behavior also needs tests: duplicate names, built-in/custom
collisions, and provider responses returned in a different order.

## Strong patterns to copy

- `api/oss/src/core/tools/`: explicit domain roles and typed exceptions.
- `sdks/python/agenta/sdk/agents/interfaces.py` plus `agents/adapters/`: ports separated from
implementations.
- `sdks/python/agenta/sdk/engines/running/`: nested SDK subsystem with role-specific files.
- `sdks/python/agenta/sdk/evaluations/runtime/`: SDK-owned neutral runtime contracts reused
directly by API core.
- `api/oss/src/dbs/postgres/*/mappings.py`: conversions isolated from persistence and
services.
- `agents/utils/wire.py` plus TypeScript protocol golden tests: one explicit cross-language
boundary.

## Patterns to avoid treating as precedent

- Very large multipurpose modules such as `agents/dtos.py` and API
`core/tools/service.py`.
- Legacy SDK async naming with `a*` prefixes.
- `except:` or broad `except Exception` followed by `print`.
- Generic `utils.py` files that contain unrelated domain behavior.
- Silent coercion or dropping as the sole parsing contract.
- Mutable collection defaults in Pydantic models.
- Comments that restate the following code.

## Practical standard for this feature

For local tools, the repository evidence supports:

- one `agents/tools/` domain package.
- lifecycle-specific names (`Config`, `Spec`, `ResolvedToolSet`).
- explicit interfaces for injected secret and gateway dependencies.
- typed domain errors.
- strict default behavior with separately named compatibility behavior.
- pure private mapping helpers.
- one runner serialization boundary.
- tests split by parsing, resolution, secrets, gateway HTTP, and wire contract.
- a sibling MCP package only when MCP is an actual supported subsystem.
Loading
Loading