Add together-rl skill (GRPO with sandboxed code-execution rewards)#22

Open
necoline wants to merge 1 commit into main from add-together-rl-skill

Conversation

necoline commented Jun 1, 2026

Summary

Adds a new together-rl skill: a GRPO reinforcement-learning post-training loop driven by the Together RL training API (client.beta.rl — sessions + training.sample/forward_backward/optim_step), with rewards computed by executing model output in an isolated together-sandbox (write the candidate solution, run its test suite, exit 0 → reward 1.0).
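The reward rule above (write the candidate, run its tests, exit 0 → 1.0) can be sketched locally. This is a minimal illustration only: it uses a plain `subprocess` as a stand-in for together-sandbox, so it provides no isolation, and the name `execution_reward` and the file names are hypothetical rather than taken from the skill's script.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def execution_reward(solution_src: str, test_src: str, timeout_s: float = 10.0) -> float:
    """Return 1.0 if the test script exits 0 against the candidate, else 0.0.

    ASSUMPTION: a local subprocess stands in for the sandbox. It is NOT
    isolated and should not be used on untrusted model output.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Write the candidate and its tests side by side so the test can import it.
        Path(workdir, "solution.py").write_text(solution_src)
        Path(workdir, "test_solution.py").write_text(test_src)
        try:
            proc = subprocess.run(
                [sys.executable, "test_solution.py"],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # a hung candidate scores the same as a failing one
    return 1.0 if proc.returncode == 0 else 0.0


if __name__ == "__main__":
    tests = "from solution import add\nassert add(2, 3) == 5\n"
    print(execution_reward("def add(a, b):\n    return a + b\n", tests))
    print(execution_reward("def add(a, b):\n    return a - b\n", tests))
```

The binary pass/fail mapping mirrors the description above; partial credit per passing test would be a different design and is not something this PR claims.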

This is the integration the rl-cookbook/grpo_train.py reference loop omits — that demo scores GSM8K with a local \boxed{} string match. Real coding/agentic RL has to run the model's output to score it, which must happen in a sandbox.
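For contrast, a local `\boxed{}` string-match scorer of the kind described above might look roughly like the following. This is a hedged sketch, not the cookbook's actual code: the function name `boxed_match_reward`, the regex, and the "last boxed answer wins" rule are all assumptions.

```python
import re

# Matches \boxed{...} with no nested braces (an assumption that suits simple numeric answers).
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def boxed_match_reward(completion: str, expected: str) -> float:
    """Return 1.0 if the last \\boxed{} answer equals the expected string, else 0.0."""
    answers = _BOXED.findall(completion)
    if not answers:
        return 0.0  # no boxed answer at all
    return 1.0 if answers[-1].strip() == expected.strip() else 0.0
```

Nothing here executes model output, which is why a scorer like this can run in-process while code-execution rewards cannot.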

Files added

| File | Purpose |
| --- | --- |
| skills/together-rl/SKILL.md | Routing, GRPO workflow, sync-RL/async-sandbox bridging rules, beta status |
| skills/together-rl/references/grpo-loop.md | Session lifecycle, the three operations + polling, GRPO sample schema |
| skills/together-rl/references/sandbox-rewards.md | Reward integration; hands off to the together-sandbox skill |
| skills/together-rl/scripts/grpo_sandbox_reward.py | End-to-end loop with code-execution rewards |
| skills/together-rl/agents/openai.yaml | UI metadata |
| quality/trigger-evals/together-rl.json | 3 positive / 3 negative trigger cases |
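For readers new to GRPO, the group-relative step that turns per-candidate rewards into advantages can be sketched as below. This is the commonly described normalization (reward minus group mean, divided by group standard deviation). Whether the Together RL API expects advantages in exactly this form, or computes them server-side, is not stated in this PR, so the function and its `eps` parameter are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group of sampled candidates.

    ASSUMPTION: population std with a small eps guard; some implementations
    skip the std division or use the sample std instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # When every candidate scores the same, all advantages are 0 and the
    # group contributes no gradient signal.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With binary execution rewards, a group where every candidate passes (or every candidate fails) yields all-zero advantages, so prompts of mixed difficulty tend to carry the learning signal.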

AGENTS.md / README.md / cursor plugin regenerated via publish.sh.

Design notes

  • Doesn't re-document the API. The RL half points to the RL API docs + the /rl/training-sessions OpenAPI surface; the reward half hands off to the together-sandbox skill rather than restating the sandbox API.
  • RL SDK is gated. client.beta.rl is in beta and not yet in the public together package (public together exposes beta.clusters/beta.jig only) and needs a service-specific base_url. The script targets the public surface and is marked beta; full end-to-end RL is deferred until that release.
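Because the RL surface is gated, a script can check for it up front and fail with a clear message instead of an `AttributeError` mid-loop. The sketch below is an assumption about how such a guard could look. `has_rl_surface` is a hypothetical helper, and the demo uses `SimpleNamespace` stand-ins rather than a real `together` client.

```python
from types import SimpleNamespace


def has_rl_surface(client: object) -> bool:
    """Return True if the client object exposes a `beta.rl` attribute."""
    beta = getattr(client, "beta", None)
    return beta is not None and hasattr(beta, "rl")


if __name__ == "__main__":
    # Stand-ins only: these mimic the attribute layout described above,
    # not the behavior of any real SDK object.
    public_like = SimpleNamespace(beta=SimpleNamespace(clusters=object(), jig=object()))
    gated_like = SimpleNamespace(beta=SimpleNamespace(rl=object()))
    print(has_rl_surface(public_like))
    print(has_rl_surface(gated_like))
```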

Test plan

  • quick_validate.py skills/together-rl passes
  • quality_check.py passes
  • publish.sh --check clean (AGENTS.md/README.md/cursor up to date)
  • Sandbox reward path verified live: compute_rewards over a known-good and known-bad candidate returned [1.0, 0.0] via parallel fan-out
  • Full RL end-to-end — blocked on client.beta.rl shipping in public together
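The parallel fan-out and the sync-RL/async-sandbox bridge can be illustrated with a small sketch. It assumes the per-candidate scorer is an async callable and that the training loop is synchronous. The signature of `compute_rewards` below is a guess for illustration (only the name appears in this PR), and `score_one` and `max_parallel` are hypothetical parameters.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

Scorer = Callable[[str], Awaitable[float]]


async def _score_all(candidates: Sequence[str], score_one: Scorer, max_parallel: int) -> list[float]:
    gate = asyncio.Semaphore(max_parallel)  # cap concurrent sandbox runs

    async def bounded(candidate: str) -> float:
        async with gate:
            return await score_one(candidate)

    # gather preserves input order, so rewards line up with candidates.
    return list(await asyncio.gather(*(bounded(c) for c in candidates)))


def compute_rewards(candidates: Sequence[str], score_one: Scorer, max_parallel: int = 8) -> list[float]:
    """Synchronous entry point: runs the async fan-out to completion.

    ASSUMPTION: called from plain synchronous code; asyncio.run raises
    if an event loop is already running in this thread.
    """
    return asyncio.run(_score_all(candidates, score_one, max_parallel))
```

A fake scorer is enough to exercise the ordering guarantee, for example an `async def` that returns 1.0 for one known string and 0.0 otherwise.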

Dependency

Hands off to the together-sandbox skill (#16) for the reward-execution step; merge that first.

🤖 Generated with Claude Code

…wards

New `together-rl` skill that drives a GRPO post-training loop on the Together RL
training API (client.beta.rl: sessions + training.sample/forward_backward/optim_step)
and computes rewards by EXECUTING each sampled candidate in an isolated
together-sandbox (write solution, run pytest, pass/fail = reward) — the integration
the rl-cookbook's grpo_train.py omits (it scores GSM8K with a local string match).

- SKILL.md: routing, GRPO workflow, sync-RL/async-sandbox bridging rules, beta status
- references/grpo-loop.md: session lifecycle, the 3 operations + polling, GRPO sample schema
- references/sandbox-rewards.md: reward integration; hands off to the together-sandbox skill
- scripts/grpo_sandbox_reward.py: end-to-end loop with code-execution rewards
- quality/trigger-evals/together-rl.json: 3 positive / 3 negative cases
- regenerated AGENTS.md / README.md / cursor plugin

Validation: quick_validate + quality_check + publish.sh --check all pass. The
sandbox reward path was verified live (good candidate -> 1.0, bad -> 0.0). The RL
half targets the public `together` SDK surface; client.beta.rl is in beta and not
yet public, so end-to-end RL is deferred until that release.

Depends on the together-sandbox skill (PR #16) for the reward-execution hand-off.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@necoline necoline changed the base branch from main to add-together-sandbox-skill June 1, 2026 19:09
@necoline necoline changed the base branch from add-together-sandbox-skill to main June 1, 2026 19:10