# Tech debt: PR stage pollutes the dev Foundry project with candidate prompt-agent versions

## Context

For Foundry Prompt Agent flows, the PR gate workflow
(`agentops-pr-prompt-agent.yml`) stages an ephemeral candidate
**inside the dev Foundry project** so that Cloud Evals can score
exactly the prompt the PR is proposing.

The staging step (`agentops.pipeline.prompt_deploy stage`) calls
`client.agents.create_version(agent_name="travel-agent", body={...})`
against the dev project's endpoint, creating a new numbered version
of the agent (e.g. `travel-agent:3`, `:4`, `:5`, …).

This is intentional and the template comments call it out:

```yaml
# Each PR run creates or reuses a candidate version in the dev
# Foundry project. AgentOps deduplicates only when the prompt is
# byte-identical to the current seed version's instructions; PR
# candidates can therefore accumulate over time and may need to be
# cleaned up out-of-band.
```
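The deduplication rule in that comment can be sketched as below. This is a minimal illustration, not the code in `prompt_deploy.py`: the helper name `needs_new_version` and the choice to compare UTF-8 encodings are assumptions made for the example.

```python
from typing import Optional


def needs_new_version(candidate_prompt: str, seed_instructions: Optional[str]) -> bool:
    """Return True when staging would have to create a new version.

    Hypothetical helper: mirrors the "byte-identical" rule from the
    template comment by comparing the encoded bytes directly, with no
    whitespace or newline normalisation.
    """
    if seed_instructions is None:
        # No seed version exists yet, so there is nothing to reuse.
        return True
    return candidate_prompt.encode("utf-8") != seed_instructions.encode("utf-8")
```

Under this rule a change as small as a trailing newline yields a new numbered version, which is one reason candidates accumulate quickly.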

## Why this is ugly

1. **Pollution.** Every PR with a prompt change creates a new numbered
   version in dev that persists until someone deletes it by hand
   (Foundry portal or SDK).
2. **Auditability.** Opening the dev project's Agents → Versions view
   shows a mix of "deployed versions of record" and "abandoned PR
   candidates". You cannot tell them apart without cross-referencing
   `foundry-agent.json` artifacts.
3. **Risk for naive consumers.** Any downstream app that resolves
   `travel-agent` by "latest published version" (instead of pinning
   via `foundry-agent.json`) can accidentally pick up an un-merged
   candidate.
4. **Conceptual smell.** "Creating something in the shared dev
   project before the PR is approved" goes against the mental model
   most teams have for environment isolation.
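The risk in item 3 can be illustrated with a small sketch that contrasts the two resolution strategies. The `version` key used here is a hypothetical field name; the real `foundry-agent.json` schema is not shown in this issue, and the version numbers are made up for the example.

```python
import json
from typing import Iterable


def resolve_latest(versions: Iterable[int]) -> int:
    """Naive resolution: pick the highest version number in the project."""
    return max(versions)


def resolve_pinned(artifact_json: str) -> int:
    """Pinned resolution: trust the version recorded at deploy time.

    The "version" key is a hypothetical field name; check the real
    foundry-agent.json schema before relying on it.
    """
    return int(json.loads(artifact_json)["version"])


# Assume version 3 is the deployed version of record and 4, 5 are PR candidates.
versions_in_dev = [1, 2, 3, 4, 5]
artifact = json.dumps({"name": "travel-agent", "version": 3})

print(resolve_latest(versions_in_dev))  # resolves to an un-merged candidate
print(resolve_pinned(artifact))         # resolves to the version of record
```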

## Why we do it anyway (today)

The Foundry Prompt Agent API has no notion of draft/ephemeral
versions, only persistent, numbered ones. The Cloud Evals API needs an
addressable `agent: name:version` reference. The dev project is the
only place where staging gives a high-fidelity preview of how the
prompt will actually run (same model deployment, same content
safety, same network rules, same RBAC).

Mitigations explored and discarded so far:

| Option | Why discarded today |
|---|---|
| Stage in the author's sandbox project | Each developer has their own; CI cannot pick one; sandbox env can diverge from dev → eval result unrepresentative. |
| Use a dedicated `*-pr-staging` Foundry project | More cost / RBAC / drift between staging and dev → eval result less faithful unless staging is kept identical to dev anyway. |
| Skip staging, evaluate the seed version already in dev | Eval no longer tests what the PR changed — defeats the gate. |
| Local eval against `model:gpt-4o-mini` | Bypasses Foundry Agents runtime (content safety, tools, instructions resolution). |

## Possible directions

In rough order of cost-vs-benefit:

1. **Scheduled cleanup workflow.** Ship a generated
   `agentops-cleanup-candidates.yml` (cron, weekly) that lists
   `travel-agent:*` versions in dev, cross-references the PR they
   came from (via the version's metadata / git-sha tag we already
   write), and deletes candidates whose PR is closed or merged and
   that are older than N days. Keeps the current architecture; just
   stops the accumulation.
2. **Tag candidates explicitly in Foundry.** When `stage` creates a
   version, add a metadata tag like `agentops:candidate=true` plus
   `agentops:pr=#<number>` so portal viewers can filter, and
   downstream consumers can refuse to resolve to a candidate.
3. **Dedicated PR-staging Foundry project.** Add a new environment
   tier (`pr-staging`) between sandbox and dev. Generator gains a
   `--stage-env` option. Higher operational cost and risk of drift,
   but conceptually clean.
4. **Foundry product ask.** Ask the Foundry team for a first-class
   "draft / preview version" concept on Prompt Agents that does not
   consume the version number sequence.
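Directions 1 and 2 combine naturally: once candidates carry explicit tags, the cleanup job reduces to a filter. The sketch below covers only the selection logic and makes several assumptions: `AgentVersion` is a local stand-in rather than an SDK type, the `agentops:sha` key name is hypothetical, and the PR open/closed lookup is passed in as a plain dict where a real workflow would query the Git host.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Set


def candidate_tags(pr_number: int, git_sha: str) -> Dict[str, str]:
    """Metadata for a staged candidate (the agentops:sha key is hypothetical)."""
    return {
        "agentops:candidate": "true",
        "agentops:pr": f"#{pr_number}",
        "agentops:sha": git_sha,
    }


@dataclass
class AgentVersion:
    """Hypothetical local view of one agent version; not an SDK type."""
    version: int
    created_at: datetime
    metadata: Dict[str, str]


def select_for_deletion(
    versions: List[AgentVersion],
    pr_is_open: Dict[str, bool],
    now: datetime,
    max_age_days: int,
    protected: Set[int],
) -> List[int]:
    """Return version numbers that the cleanup job may delete."""
    cutoff = now - timedelta(days=max_age_days)
    doomed = []
    for v in versions:
        if v.version in protected:
            continue  # e.g. the version pinned in foundry-agent.json
        if v.metadata.get("agentops:candidate") != "true":
            continue  # untagged versions are never touched
        pr = v.metadata.get("agentops:pr")
        if pr is None or pr_is_open.get(pr, True):
            continue  # unknown or still-open PR: keep, to be safe
        if v.created_at < cutoff:
            doomed.append(v.version)
    return doomed
```

The function only selects; the deletion call itself is left out because this issue does not show which SDK method would perform it. Treating an unknown PR as open errs on the side of keeping a version rather than deleting it.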

## Acceptance criteria (for the next slice of work, whatever direction we pick)

- `tutorial-prompt-agent-quickstart.md` no longer needs the
  "candidates can accumulate" caveat; the chosen mechanism handles it.
- A user who has run 10 PRs in a row sees at most 1 candidate version
  in dev's portal at any given time (or none, if we go to a separate
  staging project).
- Any consumer that resolves `travel-agent` to a candidate version by
  mistake gets a clear signal (tag, refusal, or a "not a deployed
  version of record" status).
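The second and third criteria could be checked with something like the sketch below, assuming the tagging direction is chosen and candidates carry an `agentops:candidate=true` metadata entry. The function and exception names are hypothetical, and the metadata is modelled as plain dicts rather than SDK objects.

```python
from typing import Mapping


class CandidateVersionError(RuntimeError):
    """Raised when a consumer resolves to an un-merged PR candidate."""


def count_candidates(metadata_by_version: Mapping[int, Mapping[str, str]]) -> int:
    """Count the versions tagged as PR candidates."""
    return sum(
        1
        for metadata in metadata_by_version.values()
        if metadata.get("agentops:candidate") == "true"
    )


def assert_not_candidate(agent_name: str, version: int, metadata: Mapping[str, str]) -> None:
    """Refuse to use a version that carries the candidate tag."""
    if metadata.get("agentops:candidate") == "true":
        pr = metadata.get("agentops:pr", "unknown PR")
        raise CandidateVersionError(
            f"{agent_name}:{version} is a PR candidate ({pr}), "
            "not a deployed version of record"
        )
```

A test for the "at most 1 candidate" criterion would then assert `count_candidates(...) <= 1` after a run of consecutive PRs.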

## References

- Template: `src/agentops/templates/workflows/agentops-pr-prompt-agent.yml`
  (lines 13-19 spell out the known limitation).
- Implementation: `src/agentops/pipeline/prompt_deploy.py:312-333`
  (`_create_agent_version` → `client.agents.create_version`).
- Tutorial section that depends on this behavior:
  `docs/tutorial-prompt-agent-quickstart.md` step 13.
- Original discussion that surfaced this as tech debt: PR #213 review
  conversation (PO walked through the mental model with the
  workflow runner output from a live recording).

