Add together-rl skill (GRPO with sandboxed code-execution rewards) #22
Open
necoline wants to merge 1 commit into
…wards

New `together-rl` skill that drives a GRPO post-training loop on the Together RL training API (client.beta.rl: sessions + training.sample/forward_backward/optim_step) and computes rewards by EXECUTING each sampled candidate in an isolated together-sandbox (write solution, run pytest, pass/fail = reward). This is the integration the rl-cookbook's grpo_train.py omits (it scores GSM8K with a local string match).

- SKILL.md: routing, GRPO workflow, sync-RL/async-sandbox bridging rules, beta status
- references/grpo-loop.md: session lifecycle, the 3 operations + polling, GRPO sample schema
- references/sandbox-rewards.md: reward integration; hands off to the together-sandbox skill
- scripts/grpo_sandbox_reward.py: end-to-end loop with code-execution rewards
- quality/trigger-evals/together-rl.json: 3 positive / 3 negative cases
- regenerated AGENTS.md / README.md / cursor plugin

Validation: quick_validate + quality_check + publish.sh --check all pass. The sandbox reward path was verified live (good candidate -> 1.0, bad -> 0.0).

The RL half targets the public `together` SDK surface; client.beta.rl is in beta and not yet public, so end-to-end RL is deferred until that release.

Depends on the together-sandbox skill (PR #16) for the reward-execution hand-off.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
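The "write solution, run tests, pass/fail = reward" rule described in the commit message can be illustrated with a minimal local sketch. It is not the skill's implementation: a temporary directory and a local subprocess stand in for the isolated `together-sandbox` (so it offers no isolation and should only be run on trusted code), a plain assert-based test file stands in for pytest, and the name `execution_reward` and its parameters are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def execution_reward(solution_src: str, test_src: str, timeout_s: float = 10.0) -> float:
    """Return 1.0 if the test file exits with status 0, otherwise 0.0.

    Assumption: a local subprocess replaces the remote sandbox purely for
    illustration. Untrusted model output must not be executed this way.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Write the candidate and its tests side by side so the test file
        # can import the candidate as a module.
        Path(workdir, "solution.py").write_text(solution_src)
        Path(workdir, "test_solution.py").write_text(test_src)
        try:
            result = subprocess.run(
                [sys.executable, "test_solution.py"],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # A candidate that hangs is scored as a failure.
            return 0.0
    return 1.0 if result.returncode == 0 else 0.0


TESTS = "from solution import add\nassert add(2, 3) == 5\n"
GOOD = "def add(a, b):\n    return a + b\n"
BAD = "def add(a, b):\n    return a - b\n"

rewards = [execution_reward(candidate, TESTS) for candidate in (GOOD, BAD)]
print(rewards)
```

Mapping the exit code to a binary reward keeps the scorer independent of the test framework: anything that exits non-zero on failure, pytest included, fits the same shape.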
Summary
Adds a new `together-rl` skill: a GRPO reinforcement-learning post-training loop driven by the Together RL training API (`client.beta.rl`: sessions + `training.sample` / `forward_backward` / `optim_step`), with rewards computed by executing model output in an isolated `together-sandbox` (write the candidate solution, run its test suite, `exit 0` → reward `1.0`).

This is the integration the `rl-cookbook/grpo_train.py` reference loop omits: that demo scores GSM8K with a local `\boxed{}` string match. Real coding/agentic RL has to run the model's output to score it, which must happen in a sandbox.

Files added
- `skills/together-rl/SKILL.md`
- `skills/together-rl/references/grpo-loop.md`
- `skills/together-rl/references/sandbox-rewards.md` (hands off to the `together-sandbox` skill)
- `skills/together-rl/scripts/grpo_sandbox_reward.py`
- `skills/together-rl/agents/openai.yaml`
- `quality/trigger-evals/together-rl.json`

`AGENTS.md` / `README.md` / cursor plugin regenerated via `publish.sh`.

Design notes
- `/rl/training-sessions` OpenAPI surface; the reward half hands off to the `together-sandbox` skill rather than restating the sandbox API.
- `client.beta.rl` is in beta and not yet in the public `together` package (public `together` exposes `beta.clusters` / `beta.jig` only) and needs a service-specific `base_url`. The script targets the public surface and is marked beta; full end-to-end RL is deferred until that release.

Test plan
- `quick_validate.py skills/together-rl` passes
- `quality_check.py` passes
- `publish.sh --check` clean (AGENTS.md / README.md / cursor up to date)
- `compute_rewards` over a known-good and known-bad candidate returned `[1.0, 0.0]` via parallel fan-out
- `client.beta.rl` shipping in public `together`

Dependency
Hands off to the `together-sandbox` skill (#16) for the reward-execution step; merge that first.

🤖 Generated with Claude Code
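The parallel fan-out mentioned in the test plan, and the sync-RL/async-sandbox bridging the skill documents, might look roughly like the sketch below. Only the name `compute_rewards` comes from this PR. Its signature, the `score_one` callback, the concurrency cap, and `fake_sandbox_score` (a stand-in for a real sandbox call) are assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

# Hypothetical scorer type: takes one candidate, returns its reward.
Scorer = Callable[[str], Awaitable[float]]


async def _score_all(
    candidates: Sequence[str], score_one: Scorer, max_concurrency: int
) -> List[float]:
    gate = asyncio.Semaphore(max_concurrency)  # cap simultaneous sandbox runs

    async def guarded(candidate: str) -> float:
        async with gate:
            try:
                return await score_one(candidate)
            except Exception:
                # Assumption: a scorer failure counts as a failed candidate.
                return 0.0

    # gather returns results in input order, so rewards line up with candidates.
    return list(await asyncio.gather(*(guarded(c) for c in candidates)))


def compute_rewards(
    candidates: Sequence[str], score_one: Scorer, max_concurrency: int = 8
) -> List[float]:
    """Synchronous entry point for a synchronous training loop."""
    return asyncio.run(_score_all(candidates, score_one, max_concurrency))


async def fake_sandbox_score(candidate: str) -> float:
    # Stand-in for an async sandbox execution; not a real API call.
    await asyncio.sleep(0.01)
    return 1.0 if "a + b" in candidate else 0.0


rewards = compute_rewards(
    ["def add(a, b): return a + b", "def add(a, b): return a - b"],
    fake_sandbox_score,
)
print(rewards)
```

`asyncio.run` raises a `RuntimeError` when called from a thread that already has a running event loop, so this bridge suits a plain synchronous training script; inside an async application the coroutine would be awaited directly.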