Add together-rl skill (GRPO with sandboxed code-execution rewards) by necoline · Pull Request #22 · togethercomputer/skills

necoline · 2026-06-01T18:48:00Z

Summary

Adds a new together-rl skill: a GRPO reinforcement-learning post-training loop driven by the Together RL training API (client.beta.rl: sessions plus training.sample / forward_backward / optim_step). Rewards are computed by executing model output in an isolated together-sandbox: write the candidate solution, run its test suite, and assign reward 1.0 when the run exits with code 0.
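
For readers new to GRPO, here is a minimal sketch of how per-candidate rewards are typically turned into group-relative advantages. It assumes the standard formulation (subtract the group mean, divide by the group standard deviation). The PR does not show which normalization the skill's script or the RL API applies, so `group_relative_advantages` and `eps` are illustrative names, not part of the skill.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group of sampled completions.

    Assumption: the common GRPO form (r - mean) / (std + eps). The exact
    normalization used by the skill is not stated in this PR.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four candidates for one prompt: two passed their tests, two failed.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

With binary pass/fail rewards, a group in which every candidate passes (or every candidate fails) yields all-zero advantages, so only mixed groups contribute a learning signal.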

This is the integration that the rl-cookbook/grpo_train.py reference loop omits: that demo scores GSM8K with a local \boxed{} string match. Real coding or agentic RL has to run the model's output to score it, and that execution must happen in a sandbox.
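
To make the contrast concrete, the sketch below puts the two scoring styles side by side. Neither function is taken from the cookbook or from this skill: `boxed_match_reward` is an assumed shape for a \boxed{} string match, and `run_tests` is a hypothetical hook standing in for "write the candidate into a sandbox, run its test suite, return the exit code".

```python
import re
from typing import Callable


def boxed_match_reward(completion: str, expected: str) -> float:
    """String-match scoring: compare the last boxed answer to the label."""
    answers = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return 1.0 if answers and answers[-1].strip() == expected.strip() else 0.0


def execution_reward(candidate: str, run_tests: Callable[[str], int]) -> float:
    """Execution scoring: reward 1.0 only when the test run exits with code 0.

    run_tests is a hypothetical hook; in the skill this step is delegated to
    the together-sandbox skill, whose API is not reproduced here.
    """
    return 1.0 if run_tests(candidate) == 0 else 0.0
```

The string match never runs anything, which is why it can be done locally; the execution reward runs untrusted model output, which is why the hook has to be backed by an isolated environment.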

Files added

| File | Purpose |
| --- | --- |
| `skills/together-rl/SKILL.md` | Routing, GRPO workflow, sync-RL/async-sandbox bridging rules, beta status |
| `skills/together-rl/references/grpo-loop.md` | Session lifecycle, the three operations + polling, GRPO sample schema |
| `skills/together-rl/references/sandbox-rewards.md` | Reward integration; hands off to the `together-sandbox` skill |
| `skills/together-rl/scripts/grpo_sandbox_reward.py` | End-to-end loop with code-execution rewards |
| `skills/together-rl/agents/openai.yaml` | UI metadata |
| `quality/trigger-evals/together-rl.json` | 3 positive / 3 negative trigger cases |
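
grpo-loop.md is described as covering the three operations plus polling. The polling contract itself is not shown in this PR, so the following is only a generic sketch of the pattern: `get_status`, the status strings, and the timing parameters are all hypothetical and do not describe the client.beta.rl surface.

```python
import time
from typing import Callable

# Hypothetical terminal states; the real API's status values are not shown in this PR.
TERMINAL_STATES = {"completed", "failed"}


def poll_until_done(
    get_status: Callable[[], str],
    interval_s: float = 1.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status until it reports a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"operation still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.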

AGENTS.md / README.md / cursor plugin regenerated via publish.sh.

Design notes

- Doesn't re-document the API. The RL half points to the RL API docs and the /rl/training-sessions OpenAPI surface; the reward half hands off to the together-sandbox skill rather than restating the sandbox API.
- RL SDK is gated. client.beta.rl is in beta, is not yet in the public together package (which currently exposes only beta.clusters and beta.jig), and needs a service-specific base_url. The script targets the public surface and is marked beta; full end-to-end RL is deferred until client.beta.rl ships in the public package.

Test plan

- quick_validate.py skills/together-rl passes
- quality_check.py passes
- publish.sh --check clean (AGENTS.md/README.md/cursor up to date)
- Sandbox reward path verified live: compute_rewards over a known-good and a known-bad candidate returned [1.0, 0.0] via parallel fan-out
- Full RL end-to-end: blocked on client.beta.rl shipping in public together
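
The parallel fan-out and the sync-RL/async-sandbox bridge can be sketched as follows. Only the name compute_rewards comes from this PR; its signature here is an assumption, and `fake_run_tests` is a hypothetical stand-in for the sandbox call so the example runs without any service.

```python
import asyncio
from typing import Awaitable, Callable


async def _score(candidate: str, run_tests: Callable[[str], Awaitable[int]]) -> float:
    """Score one candidate: exit code 0 maps to 1.0, anything else to 0.0."""
    exit_code = await run_tests(candidate)
    return 1.0 if exit_code == 0 else 0.0


def compute_rewards(
    candidates: list[str],
    run_tests: Callable[[str], Awaitable[int]],
) -> list[float]:
    """Synchronous entry point that scores all candidates concurrently.

    asyncio.gather preserves input order, so rewards[i] belongs to candidates[i].
    """
    async def _fan_out() -> list[float]:
        return list(await asyncio.gather(*(_score(c, run_tests) for c in candidates)))

    return asyncio.run(_fan_out())


async def fake_run_tests(candidate: str) -> int:
    """Hypothetical stand-in for a sandboxed test run; returns an exit code."""
    await asyncio.sleep(0)
    return 0 if "return a + b" in candidate else 1


rewards = compute_rewards(
    ["def add(a, b): return a + b", "def add(a, b): return a - b"],
    fake_run_tests,
)
```

asyncio.run starts a fresh event loop, so this bridge only works when the calling training loop is plain synchronous code; it raises if called from inside an already running loop.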

Dependency

Hands off to the together-sandbox skill (#16) for the reward-execution step; merge that first.

🤖 Generated with Claude Code

…wards

New `together-rl` skill that drives a GRPO post-training loop on the Together RL training API (client.beta.rl: sessions + training.sample/forward_backward/optim_step) and computes rewards by EXECUTING each sampled candidate in an isolated together-sandbox (write solution, run pytest, pass/fail = reward). This is the integration the rl-cookbook's grpo_train.py omits (it scores GSM8K with a local string match).

- SKILL.md: routing, GRPO workflow, sync-RL/async-sandbox bridging rules, beta status
- references/grpo-loop.md: session lifecycle, the 3 operations + polling, GRPO sample schema
- references/sandbox-rewards.md: reward integration; hands off to the together-sandbox skill
- scripts/grpo_sandbox_reward.py: end-to-end loop with code-execution rewards
- quality/trigger-evals/together-rl.json: 3 positive / 3 negative cases
- regenerated AGENTS.md / README.md / cursor plugin

Validation: quick_validate + quality_check + publish.sh --check all pass. The sandbox reward path was verified live (good candidate -> 1.0, bad -> 0.0). The RL half targets the public `together` SDK surface; client.beta.rl is in beta and not yet public, so end-to-end RL is deferred until that release.

Depends on the together-sandbox skill (PR #16) for the reward-execution hand-off.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

necoline changed the base branch from main to add-together-sandbox-skill June 1, 2026 19:09

necoline changed the base branch from add-together-sandbox-skill to main June 1, 2026 19:10

necoline wants to merge 1 commit into main from add-together-rl-skill
