Tech debt: PR stage pollutes the dev Foundry project with candidate prompt-agent versions #214

@placerda

Context

For Foundry Prompt Agent flows, the PR gate workflow
(agentops-pr-prompt-agent.yml) stages an ephemeral candidate
inside the dev Foundry project so that Cloud Evals can score
exactly the prompt the PR is proposing.

The staging step (agentops.pipeline.prompt_deploy stage) calls
client.agents.create_version(agent_name="travel-agent", body={...})
against the dev project's endpoint, creating a new numbered version
of the agent (e.g. travel-agent:3, :4, :5, …).

This is intentional and the template comments call it out:

# Each PR run creates or reuses a candidate version in the dev
# Foundry project. AgentOps deduplicates only when the prompt is
# byte-identical to the current seed version's instructions; PR
# candidates can therefore accumulate over time and may need to be
# cleaned up out-of-band.
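
The deduplication rule in that comment can be sketched as below. This is a minimal illustration, not the code in prompt_deploy.py: the function name needs_new_version and its arguments are hypothetical, and it assumes "byte-identical" means comparing the UTF-8 encoded instructions with no normalization.

```python
def needs_new_version(candidate_instructions: str, seed_instructions: str) -> bool:
    """Return True when staging would have to create a new candidate version."""
    # Assumption: raw byte comparison, no whitespace or line-ending normalization.
    return candidate_instructions.encode("utf-8") != seed_instructions.encode("utf-8")


seed = "You are a travel assistant.\n"
print(needs_new_version(seed, seed))                           # False
print(needs_new_version("You are a travel assistant.", seed))  # True
```

Under that assumption even a trailing-newline change produces a new version, which is why candidates accumulate so easily.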

Why this is ugly

  1. Pollution. Every PR with a prompt change creates a new numbered
    version in dev that persists until someone deletes it by hand
    (Foundry portal or SDK).
  2. Auditability. Opening the dev project's Agents → Versions view
    shows a mix of "deployed versions of record" and "abandoned PR
    candidates". You cannot tell them apart without cross-referencing
    foundry-agent.json artifacts.
  3. Risk for naive consumers. Any downstream app that resolves
    travel-agent by "latest published version" (instead of pinning
    via foundry-agent.json) can accidentally pick up an un-merged
    candidate.
  4. Conceptual smell. "Creating something in the shared dev
    project before the PR is approved" goes against the mental model
    most teams have for environment isolation.

Why we do it anyway (today)

The Foundry Prompt Agent API has no notion of draft or ephemeral
versions, only persistent, numbered ones. The Cloud Evals API needs an
addressable agent reference in name:version form. The dev project is the
only place where staging gives a high-fidelity preview of how the
prompt will actually run (same model deployment, same content
safety, same network rules, same RBAC).

Mitigations explored and discarded so far:

| Option | Why discarded today |
| --- | --- |
| Stage in the author's sandbox project | Each developer has their own; CI cannot pick one; sandbox env can diverge from dev → eval result unrepresentative. |
| Use a dedicated *-pr-staging Foundry project | More cost / RBAC / drift between staging and dev → eval result less faithful unless staging is kept identical to dev anyway. |
| Skip staging, evaluate the seed version already in dev | Eval no longer tests what the PR changed, which defeats the gate. |
| Local eval against model:gpt-4o-mini | Bypasses Foundry Agents runtime (content safety, tools, instructions resolution). |

Possible directions

In rough order of cost-vs-benefit:

  1. Scheduled cleanup workflow. Ship a generated
    agentops-cleanup-candidates.yml (cron, weekly) that lists
    travel-agent:* versions in dev, cross-references the PR they
    came from (via the version's metadata / git-sha tag we already
    write), and deletes candidates whose PR is closed or merged and
    that are older than N days. Keeps the current architecture; just
    stops the accumulation.
  2. Tag candidates explicitly in Foundry. When stage creates a
    version, add a metadata tag like agentops:candidate=true plus
    agentops:pr=#<number> so portal viewers can filter, and
    downstream consumers can refuse to resolve to a candidate.
  3. Dedicated PR-staging Foundry project. Add a new environment
    tier (pr-staging) between sandbox and dev. Generator gains a
    --stage-env option. Higher operational cost and risk of drift,
    but conceptually clean.
  4. Foundry product ask. Ask the Foundry team for a first-class
    "draft / preview version" concept on Prompt Agents that does not
    consume the version number sequence.
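
The selection logic behind directions 1 and 2 could look roughly like the sketch below. Everything here is an assumption for illustration: the VersionRecord shape, the select_stale_candidates helper, the pr_is_open callback, and the sample PR numbers are hypothetical, and the metadata keys are the ones proposed in direction 2. Listing and deleting versions in Foundry is deliberately left out, since this only decides what would be deleted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable


@dataclass(frozen=True)
class VersionRecord:
    """Hypothetical view of one agent version and its metadata tags."""
    ref: str                   # e.g. "travel-agent:4"
    metadata: dict[str, str]   # tags written by the stage step (assumed keys)
    created_at: datetime


def select_stale_candidates(
    versions: list[VersionRecord],
    pr_is_open: Callable[[str], bool],
    now: datetime,
    max_age_days: int,
) -> list[str]:
    """Return refs of candidates whose PR is closed or merged and that are too old."""
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for v in versions:
        if v.metadata.get("agentops:candidate") != "true":
            continue  # version of record: never touched
        pr = v.metadata.get("agentops:pr")
        if pr is None or pr_is_open(pr):
            continue  # unknown origin or PR still open: keep it
        if v.created_at < cutoff:
            stale.append(v.ref)
    return stale


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
candidate = {"agentops:candidate": "true"}
versions = [
    VersionRecord("travel-agent:3", {}, now - timedelta(days=60)),
    VersionRecord("travel-agent:4", {**candidate, "agentops:pr": "#101"}, now - timedelta(days=30)),
    VersionRecord("travel-agent:5", {**candidate, "agentops:pr": "#102"}, now - timedelta(days=2)),
    VersionRecord("travel-agent:6", {**candidate, "agentops:pr": "#103"}, now - timedelta(days=30)),
]
open_prs = {"#103"}
result = select_stale_candidates(versions, lambda pr: pr in open_prs, now, max_age_days=14)
print(result)  # ['travel-agent:4']
```

Keeping the decision a pure function makes it easy to unit-test and to run in a dry-run mode before the workflow is allowed to delete anything.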

Acceptance criteria (for the next slice of work, whatever direction we pick)

  • tutorial-prompt-agent-quickstart.md no longer needs the
    "candidates can accumulate" caveat; the chosen mechanism handles it.
  • A user who has run 10 PRs in a row sees at most 1 candidate version
    in dev's portal at any given time (or none, if we go to a separate
    staging project).
  • Any consumer that resolves travel-agent to a candidate version by
    mistake gets a clear signal (tag, refusal, or "not deployed of
    record" status).
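
One way a consumer could get the "clear signal" in the last criterion is a small guard that refuses candidate versions. This is only a sketch under assumptions: CandidateVersionError and ensure_version_of_record are hypothetical names, and the agentops:candidate / agentops:pr metadata keys exist only if the tagging direction is adopted.

```python
class CandidateVersionError(RuntimeError):
    """Raised when a consumer would resolve to an un-merged PR candidate."""


def ensure_version_of_record(ref: str, metadata: dict[str, str]) -> str:
    """Return ref unchanged, or refuse if its metadata marks it as a PR candidate."""
    if metadata.get("agentops:candidate") == "true":
        pr = metadata.get("agentops:pr", "unknown PR")
        raise CandidateVersionError(
            f"{ref} is a PR candidate ({pr}), not a deployed version of record"
        )
    return ref


print(ensure_version_of_record("travel-agent:3", {}))  # travel-agent:3

try:
    ensure_version_of_record(
        "travel-agent:4", {"agentops:candidate": "true", "agentops:pr": "#101"}
    )
except CandidateVersionError as exc:
    message = str(exc)
    print(message)
```

Consumers that pin via foundry-agent.json would never hit the error; the guard mainly protects callers that resolve by "latest".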

References

  • Template: src/agentops/templates/workflows/agentops-pr-prompt-agent.yml
    (lines 13-19 spell out the known limitation).
  • Implementation: src/agentops/pipeline/prompt_deploy.py:312-333
    (_create_agent_version → client.agents.create_version).
  • Tutorial section that depends on this behavior:
    docs/tutorial-prompt-agent-quickstart.md step 13.
  • Original discussion that surfaced this as tech debt: review
    conversation on PR #213 ("docs: restore accurate claim that
    workflow skill dispatches both PR and deploy-dev workflows"), where
    the PO walked through the mental model with the workflow runner
    output from a live recording.

Labels: enhancement, tech-debt
