feat(customizer): add RL backend in customizer plugin by anubhutivyas · Pull Request #479 · NVIDIA-NeMo/nemo-platform

anubhutivyas · 2026-06-25T23:16:51Z

Linear ticket - https://linear.app/nvidia/issue/AALGO-292/add-dpo-support-for-customizer-plugin

Summary by CodeRabbit

New Features
- Added support for a new RL workflow with dedicated submit, training, file transfer, and model-creation commands.
- Introduced support for multiple training formats, dataset validation, distributed execution, and progress reporting.
- Added new container build targets and images for RL training and task runtime steps.
Bug Fixes
- Improved handling of non-root runtime permissions and package cleanup in containers.
- Added clearer error reporting for training, dataset, and deployment failures.

Signed-off-by: anubhutiv <anubhutiv@nvidia.com>

coderabbitai · 2026-06-25T23:24:18Z

📝 Walkthrough

Walkthrough

Adds NeMo-RL build targets, plugin submission wiring, compiler/service schemas, dataset and prompt preparation, training runtime support, and file I/O/model-entity task entrypoints.

Changes

NeMo-RL support stack

Layer / File(s)	Summary
Build and packaging `docker-bake.hcl`, `docker/`, `docker/rl/`, `pyproject.toml`, `plugins/nemo-rl/README.md`, `plugins/nemo-rl/pyproject.toml`, `services/rl/README.md`, `services/rl/pyproject.toml`	Adds build targets, Dockerfiles, workspace slices, and package manifests for the RL images and plugin packages.
Submitter contract and plugin wiring `plugins/nemo-rl/src/nemo_rl_plugin/*`, `packages/nmp_customization_common/src/nmp/customization_common/contributor/jobs.py`	Adds submit JSON models, CLI overrides, contributor wiring, SDK exports, and the distributed runtime guard.
Compiler and service schemas `services/rl/src/nmp/rl/app/`, `services/rl/src/nmp/rl/schemas.py`, `services/rl/src/nmp/rl/compile.py`, `services/rl/src/nmp/rl/tasks/training/schemas.py`, `services/rl/tests/test_.py`	Adds RL service constants, public schemas, the compiler implementation, and compiler/schema tests.
Training support and error plumbing `services/rl/src/nmp/rl/tasks/training/{protocol.py,distributed.py,integrations.py,utils.py,errors/*,no_override_requirements.txt}`	Adds distributed coordination, backend contracts, image/config helpers, and structured error conversion.
Dataset and prompt compilation `services/rl/src/nmp/rl/tasks/training/{datasets/,chat_templates.py,templates/,backends/nemo_rl/dpo_config.py}`, `services/rl/tests/test_dpo_config.py`	Adds dataset schemas, loaders, validation, chat templates, and DPO config generation.
Training runtime `services/rl/src/nmp/rl/tasks/training/{__main__.py,runner.py,backends/nemo_rl/*}`	Adds the training runner, NeMo-RL backend, Ray bootstrap, drivers, logger, and checkpoint processing.
File I/O task `services/rl/src/nmp/rl/tasks/file_io/*`	Adds file I/O callbacks, runner, and module entrypoints for Fileset transfers.
Model entity task `services/rl/src/nmp/rl/tasks/model_entity/*`	Adds model-entity task entrypoints and deployment orchestration.

Sequence Diagram(s)

sequenceDiagram
  participant TrainingRunner
  participant DistributedContext
  participant NemoRLBackend
  participant RayClusterBootstrap
  TrainingRunner->>DistributedContext: from_env()
  TrainingRunner->>NemoRLBackend: compile_config()
  TrainingRunner->>NemoRLBackend: execute_training()
  NemoRLBackend->>RayClusterBootstrap: run_with_driver()
  TrainingRunner->>NemoRLBackend: find_best_checkpoint()
  TrainingRunner->>NemoRLBackend: process_checkpoint()

Possibly related PRs

NVIDIA-NeMo/nemo-platform#242: Touches the same packages/nmp_customization_common/src/nmp/customization_common/contributor/jobs.py guard module.

Suggested labels

feat

Suggested reviewers

soluwalana
gabwow

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 71.57% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	Title accurately summarizes the main change: adding an RL backend to the customizer plugin.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch aalgo-292-dpo-support/anubhutiv

_{Comment @coderabbitai help to get the list of available commands.}

github-actions · 2026-06-25T23:29:23Z

Suite	Lines Covered	Line Rate	Branch Rate
Unit Tests	21323/27928	76.3%	61.4%
Integration Tests	12350/26697	46.3%	19.7%

coderabbitai

Actionable comments posted: 3

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

docker/Dockerfile.nmp-rl-base (1)
12-45: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Don’t publish nmp-rl-base as root.

docker-bake.hcl tags this stage directly, so this is a shipped image, not just an internal builder. Add a non-root USER here and let child images switch back to root only for the install/chown steps that need it.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docker/Dockerfile.nmp-rl-base` around lines 12 - 45, The nmp-rl-base stage is
being published directly, so it should not remain running as root. In the
Dockerfile.nmp-rl-base stage, add a non-root USER before the image is finalized,
and only switch back to root in downstream child images for the install or chown
steps that require it; keep the existing VIRTUAL_ENV, chmod, and vllm removal
logic in nmp-rl-base, but ensure the final shipped stage uses a non-root runtime
user.
Source: Linters/SAST tools

🟠 Major comments (36)

services/rl/src/nmp/rl/tasks/model_entity/run.py-154-163 (1)
154-163: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Let transient fileset lookup failures reach the retry wrapper.

This except Exception converts API timeouts/500s from filesets.retrieve() into ModelEntityCreationError, so @retry on create_model_entity() never retries them. Only translate permanent lookup failures here.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/model_entity/run.py` around lines 154 - 163, The
broad exception handling around fileset validation in create_model_entity() is
wrapping transient SDK/API failures into ModelEntityCreationError, preventing
the retry decorator from retrying them. Update the try/except around
self.sdk.files.filesets.retrieve() so only permanent “not found” or
access-denied cases are translated to ModelEntityCreationError, and let
timeouts/5xx/other transient exceptions propagate out of the fileset validation
block.
services/rl/src/nmp/rl/tasks/file_io/run.py-141-181 (1)
141-181: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Recreate progress state on each retry.

DownloadStats/UploadStats and CompositeCallback are built once, then reused across tenacity retries. After a partial attempt, BaseSingleFileCallback.close() has already incremented counts/bytes, so the retry double-counts progress and can leave final totals above the real transfer. Move stats/callback construction inside the retried path or reset them per attempt.

Also applies to: 218-251
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/file_io/run.py` around lines 141 - 181, The
download progress state is reused across retries, so a partial attempt can leave
DownloadStats and the callback chain carrying stale counts into the next try.
Update the retried flow in run.py around the download path that builds
CompositeCallback, TqdmPerFileDownloadCallback, and FileDownloadProgressCallback
so these objects are created fresh per attempt or explicitly reset before each
retry. Apply the same per-attempt reset pattern to the matching upload path
referenced by the comment so progress totals stay accurate after tenacity
retries.
services/rl/src/nmp/rl/tasks/model_entity/run.py-424-428 (1)
424-428: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Don’t log the raw model-entity config.

config.model_dump_json() will dump inline deployment_config.additional_envs verbatim. That can leak deployment secrets into task logs.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/model_entity/run.py` around lines 424 - 428, The
startup logging in run() is dumping the full model-entity config, which can
expose deployment secrets via deployment_config.additional_envs. Update the
logger.info call that prints config.model_dump_json(indent=2) so it logs a
redacted or summarized config instead, and ensure any sensitive nested fields
are excluded or masked before logging.
services/rl/src/nmp/rl/tasks/model_entity/run.py-148-169 (1)
148-169: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Use the configured workspace for full-model outputs.

ModelEntityTaskConfig carries a target workspace, but this path hardcodes self.job_ctx.workspace and passes that into models.create()/models.update(). Cross-workspace submissions will create or update the wrong model entity.

Also applies to: 226-260
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/model_entity/run.py` around lines 148 - 169, The
full-model creation/update path is using the job context workspace instead of
the configured target workspace, so cross-workspace runs write to the wrong
place. Update the model entity flow in model_entity/run.py to use
ModelEntityTaskConfig.workspace for the full-model output path, including the
call chain through _create_or_update_full_entity and any
models.create/models.update logic, while keeping the fileset workspace fallback
separate.
services/rl/src/nmp/rl/tasks/model_entity/run.py-236-242 (1)
236-242: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Honor config.trust_remote_code.

The full-entity request always copies base_me.trust_remote_code, so the task config flag never affects full or merged model entities. That can persist the wrong trust policy.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/model_entity/run.py` around lines 236 - 242, The
full-entity request body is always using base_me.trust_remote_code, so the
task-level trust policy from config is ignored. Update the request construction
in the model entity run flow to use config.trust_remote_code for full/merged
model entity requests, and make sure the request_body assembly in this section
consistently honors the config flag instead of copying the base model value.
services/rl/src/nmp/rl/schemas.py-26-39 (1)
26-39: 🗄️ Data Integrity & Integration | 🟠 Major

Forbid extra fields in the nested training schemas. ParallelismParams and _TrainingBase still accept unknown keys, so typos like training.learningrate or parallelism.num_node are silently dropped and the job falls back to defaults. Add model_config = ConfigDict(extra="forbid") to both models.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/schemas.py` around lines 26 - 39, The nested training
schemas currently accept unknown keys, so typos in submitted config fields are
silently ignored and defaults are used instead. Update both ParallelismParams
and _TrainingBase in schemas.py to forbid extra fields by adding model_config =
ConfigDict(extra="forbid"), ensuring invalid keys under training and parallelism
are rejected during validation.
services/rl/src/nmp/rl/entities/values.py-19-21 (1)
19-21: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Remove unsupported PPO from the enum.

TrainingStepConfig.training.training_type accepts this value, but the backend only compiles DPO and GRPO and raises on anything else. That turns validation into a runtime failure in the training container.
Suggested fix
 class TrainingType(str, Enum):
     """RL training algorithm. DPO is wired; the rest are headroom."""
 
     DPO = "dpo"
     GRPO = "grpo"
-    PPO = "ppo"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/entities/values.py` around lines 19 - 21, The
TrainingType enum in values.py still exposes PPO even though the backend only
supports DPO and GRPO, so remove the unsupported PPO member from the enum.
Update the TrainingType definition and any related validation or mappings that
reference PPO so TrainingStepConfig.training.training_type only accepts the
supported values and cannot reach a runtime failure in the training container.
services/rl/src/nmp/rl/app/jobs/compiler.py-263-263 (1)
263-263: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Don't log the full submitted spec at INFO.

Line 263 serializes the entire request, and the same integrations object is later mined for secret envs. That creates a normal-path secret leak into job logs.
Suggested fix
-    logger.info("Compiling NeMo-RL DPO job to PlatformJobSpec: %s", job_spec.model_dump_json(indent=2))
+    logger.info(
+        "Compiling NeMo-RL DPO job to PlatformJobSpec",
+        extra={"model": job_spec.model, "output_model": job_spec.output.name},
+    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/app/jobs/compiler.py` at line 263, The INFO log in
compiler.py is serializing the full submitted job spec, which can leak secret
values from the later-processed integrations payload. Update the logging in the
job compilation flow around the compiler entry point to avoid dumping
job_spec.model_dump_json(...) at INFO; instead log only non-sensitive high-level
context (for example job identifiers or summary fields) and keep any detailed
spec output out of normal-path logs.
services/rl/src/nmp/rl/tasks/training/errors/parser.py-76-79 (1)
76-79: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Parse non-*Error exception names too.

This regex never extracts TimeoutExpired, so subprocess timeouts cannot hit the TrainingTimeoutError rule in error_rules.yaml and fall back to a generic runtime failure. Match generic exception class names and strip any module prefix.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/errors/parser.py` around lines 76 - 79,
The exception parser in _EXCEPTION_RE only matches names ending in Error or
Exception, so it misses generic exception types like TimeoutExpired. Update the
regex in parser.py to capture any exception class name, and make sure the
parsing logic strips any module prefix before applying the training error rules
so TrainingTimeoutError can match subprocess timeout failures.
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/grpo_config.py-44-47 (1)
44-47: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Do not advertise GRPO from a stub.

NemoRLBackend.compile_config() routes GRPO jobs here, but this path still prints the full config/job context and then raises NotImplementedError. Every GRPO submission fails here. Either implement it or reject GRPO before backend selection.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/grpo_config.py` around
lines 44 - 47, GRPO is still routed into a stub in
NemoRLBackend.compile_config() via the GRPO config path, which prints the full
training_config/job_ctx and then fails. Remove the stub behavior in the GRPO
config compiler and either implement the actual GRPO config compilation in the
relevant grpo_config handling or make backend selection reject GRPO before it
reaches this path; also eliminate the debug prints from the stub path.
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/backend.py-126-135 (1)
126-135: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Restore the process env after each run.

These writes mutate global state, and MLFLOW_URI is only set when present, never cleared. A reused worker can send a later job to the previous job’s tracking backend or run with stale settings. Build a per-run env or save/restore these keys in finally.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/backend.py` around
lines 126 - 135, The env setup in the Nemo RL backend mutates global process
state and leaves stale values behind between runs; update the run setup around
the existing environment writes in backend.py to save and restore the affected
keys, or construct a per-run env passed into the execution path. Make sure the
logic around BASE_LOG_DIR, GPUS_PER_NODE, and especially MLFLOW_URI in the
backend entrypoint is restored in a finally block so a reused worker cannot
inherit prior run settings.
services/rl/src/nmp/rl/tasks/training/errors/error_rules.yaml-415-418 (1)
415-418: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Narrow the generic OOM matcher.

contains: "out of memory" will also match host/Ray/system OOMs, so those failures get classified as CudaError and users get GPU-specific remediation. Keep the CUDA-specific rules and drop or remap this catch-all.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/errors/error_rules.yaml` around lines
415 - 418, The generic OOM matcher in error_rules.yaml is too broad because the
contains: "out of memory" rule can catch host, Ray, or system OOMs and
misclassify them as CudaError. Narrow this rule by removing it or changing it to
a CUDA-specific matcher, and keep the existing CUDA-focused rules so only true
GPU OOMs map to CudaError; the relevant symbols are the OOM keyword rule under
the training error rules and the CudaError mapping.
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/backend.py-157-165 (1)
157-165: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Put the original signal handlers back.

This installs process-wide SIGINT/SIGTERM handlers and never restores them. After run_with_driver() returns, later signals in the same process still hit this stale closure and can terminate the wrong driver or raise SystemExit unexpectedly.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/backend.py` around
lines 157 - 165, The cleanup signal handler in bootstrap.run_with_driver setup
is replacing the process-wide SIGINT/SIGTERM handlers and never restoring them,
leaving a stale closure behind after the driver exits. Update the cleanup logic
around the cleanup function and the signal.signal registrations to save the
original handlers before installing them, then restore those original handlers
after run_with_driver(str(driver_path), driver_args) completes (preferably in a
finally block), so later signals do not call the old closure.
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/no_override_requirements.txt-1-8 (1)
1-8: 🗄️ Data Integrity & Integration | 🟠 Major

These markers don't pin base-image versions. sys_platform == 'never' drops the requirement entirely, so it doesn't protect ray, starlette, cryptography, mlflow, or prometheus-client from being replaced if another dependency pulls them in.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/no_override_requirements.txt`
around lines 1 - 8, The no_override_requirements markers are not preserving
base-image packages because sys_platform == 'never' removes the requirement
instead of constraining it. Update the entries in no_override_requirements.txt
so each preserved package (prometheus-client, starlette, cryptography, mlflow,
ray) is explicitly pinned or otherwise constrained in a way that prevents pip
from upgrading/replacing the base-image version, using the same package names
already listed in the file.
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/ray_bootstrap.py-246-248 (1)
246-248: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Namespace or clear ENDED per run.

A stale log_dir / "ENDED" from a previous retry makes new head/worker startup return immediately. Use a task/attempt-scoped marker path, or clear it with startup coordination before any rank checks it.

Also applies to: 357-360, 392-395
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/ray_bootstrap.py`
around lines 246 - 248, The ENDED marker in ended_file is currently shared
across retries because it always returns log_dir/"ENDED", so a stale file can
short-circuit new runs. Update the marker handling in ray_bootstrap.py to make
the path task/attempt-scoped (for example via the existing run/task identifiers
used by the bootstrap flow) or ensure the marker is cleared during startup
coordination before any rank checks it; apply the same fix to the related
startup/check paths referenced by ended_file and the retry coordination logic.
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/grpo_driver.py-55-104 (1)
55-104: 🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

Do not ship an executable driver that only raises NotImplementedError.

If this module is reachable via backend selection or python -m, GRPO jobs fail immediately. Gate it out of runtime wiring or implement the driver before exposing it.

I can help generate the implementation plan or open an issue for the missing GRPO driver.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/grpo_driver.py` around
lines 55 - 104, The GRPO driver module is currently a stub and will fail at
runtime because `load_config`, `get_environment`, `run_grpo_training`, and
`main` all raise `NotImplementedError`. Either fully implement these entry
points in `grpo_driver.py` or remove/gate the module from any backend selection
and `python -m` wiring so it cannot be executed until complete. Make sure the
runtime path that imports or dispatches to `main` only exposes a working driver,
and use the existing function names to locate the missing implementation.
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/dpo_driver.py-55-64 (1)
55-64: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Do not dump full config or job context to stdout.

Hydra overrides, resolved configs, and NMPJobContext can contain internal URLs, identifiers, or credentials. Log redacted fields only, preferably at debug level.

Also applies to: 93-94
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/dpo_driver.py` around
lines 55 - 64, The startup logging in dpo_driver should not print raw Hydra
overrides or the fully resolved config to stdout. Update the logic around
parse_hydra_overrides, OmegaConf.to_container, and the final pprint output so
only redacted or minimal fields are logged, ideally through a debug logger
instead of print. Also review the related job-context logging referenced in the
same flow and ensure NMPJobContext values are sanitized before being emitted.
services/rl/src/nmp/rl/tasks/training/distributed.py-91-99 (1)
91-99: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Serialize stale-barrier cleanup before workers continue.

Workers can create or wait on current-run markers before rank 0 executes shutil.rmtree, so coordinator cleanup can delete live markers and deadlock barriers. Add a startup-ready marker, or use a per-attempt barrier directory instead of deleting a shared path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/distributed.py` around lines 91 - 99,
The stale-barrier cleanup in the distributed startup path can race with workers
creating or waiting on current-run markers, so `shutil.rmtree` may delete live
state. Update the initialization flow in `distributed.py` around `barrier_dir`
handling in the `world_size > 1` block so cleanup is serialized before any
worker proceeds, either by adding a startup-ready marker handshake or by
switching `barrier_dir` to a per-attempt path instead of deleting a shared
directory.
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/ray_bootstrap.py-450-450 (1)
450-450: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Put deadlines on Ray CLI subprocesses.

These calls can block indefinitely, making retry, worker-wait, and cleanup timeouts ineffective. Add timeout= and handle subprocess.TimeoutExpired; make cleanup unable to keep the process alive after its timeout.

Also applies to: 485-485, 529-536, 676-680, 697-703, 707-710
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/ray_bootstrap.py` at
line 450, The Ray CLI subprocess calls in ray_bootstrap.py can block forever and
bypass the existing retry/cleanup timeouts. Update the subprocess.run
invocations in the relevant bootstrap/cleanup paths (including the call sites
around result handling and the Ray cleanup helpers) to pass an explicit timeout,
and add subprocess.TimeoutExpired handling so the code can abort and continue
cleanup safely. Make sure the cleanup logic cannot leave the child process
running after its timeout by terminating/killing it and waiting for it to exit,
using the existing bootstrap and cleanup functions as the entry points to patch.
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/preference_datasets/helpsteer3.py-69-71 (1)
69-71: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Do not silently drop split overrides.

train_split / val_split are public args, but the HuggingFace fallback ignores them. Pass them through if the base dataset supports them, or reject them when train_data_path is absent.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/preference_datasets/helpsteer3.py`
around lines 69 - 71, The HuggingFace fallback in HelpSteer3Dataset currently
ignores the public train_split and val_split args when train_data_path is not
provided. Update the fallback path in the dataset initializer to either pass
these split overrides through to the base dataset constructor if it supports
them, or explicitly validate and reject unsupported overrides before calling
super().__init__().
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/preference_datasets/tulu3.py-71-73 (1)
71-73: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Do not silently drop split overrides.

train_split / val_split are public args, but the HuggingFace fallback ignores them. Pass them through if the base dataset supports them, or reject them when train_data_path is absent.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/preference_datasets/tulu3.py`
around lines 71 - 73, The HuggingFace fallback in the Tulu3 preference dataset
constructor is ignoring the public train_split and val_split arguments by
calling the base initializer without them. Update the fallback path around the
super().__init__ call in the Tulu3 dataset class to either forward train_split
and val_split to the base dataset if supported, or explicitly reject those
overrides when train_data_path is not provided instead of silently dropping
them.
services/rl/src/nmp/rl/tasks/training/templates/llama-3.1-instruct.jinja-20-24 (1)
20-24: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Serialize tool metadata as JSON instead of concatenating it.

tname or tdesc containing quotes/newlines produces invalid JSON in the prompt.
Proposed fix
-        {%- set tparams = t.function.parameters | tojson %}
         {{- "Use the function '" + tname + "' to '" + tdesc + "':\n" }}
-        {{- '{"name": "' + tname + '", "description": "' + tdesc + '", "parameters": ' + tparams + '}\n\n' }}
+        {{- ({"name": tname, "description": tdesc, "parameters": t.function.parameters} | tojson) + "\n\n" }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/templates/llama-3.1-instruct.jinja`
around lines 20 - 24, The tool metadata rendering in the llama-3.1-instruct
template is building JSON by string concatenation, which breaks when tname or
tdesc contains quotes/newlines. Update the template logic around the tname,
tdesc, and tparams block to serialize the tool payload as proper JSON instead of
manually assembling strings, and keep the existing function metadata fields
intact in the rendered prompt.
services/rl/src/nmp/rl/tasks/training/templates/llama-3.1-instruct.jinja-10-37 (1)
10-37: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Guard undefined tool variables.

In Jinja, an undefined value is not none, so this branch can emit the tool preamble even when tools / tool_choice were never provided.
Proposed fix
-{%- if tools is not none and tool_choice is not none %}
+{%- if tools is defined and tools is not none and tool_choice is defined and tool_choice is not none %}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/templates/llama-3.1-instruct.jinja`
around lines 10 - 37, The tool-preamble branch in llama-3.1-instruct.jinja only
checks for none, so it can still render when tools or tool_choice are undefined.
Update the conditional guarding the system/user preamble to explicitly handle
undefined values before entering the tools loop, using the existing tools and
tool_choice symbols, so the template only emits function-calling instructions
when both inputs are actually provided.
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/dpo_config.py-357-372 (1)
357-372: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Detect dataset class per split before wiring NeMo-RL loaders.

Validation only proves each row matches the broad DPO union. Here the loader class is inferred from prepared.train_file once and then reused for validation too, so mixed-format train shards or a validation file in another supported DPO format compile cleanly and then hit the wrong NeMo-RL loader at runtime. Detect the schema for each prepared file and reject mismatches, or emit split-specific dataset_name values if NeMo-RL supports that.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/dpo_config.py` around
lines 357 - 372, The dataset loader is currently inferred once from
prepared.train_file and reused for validation, which can assign the wrong
NeMo-RL class to a split. Update the DPO config logic around
detect_dpo_schema_name and _dataset_spec to determine dataset_name separately
for each prepared file (train and validation), and verify the split schemas are
compatible before building the loader config. If NeMo-RL allows it, pass
split-specific dataset_name values; otherwise reject mismatched formats early
with a clear error.
services/rl/src/nmp/rl/tasks/training/datasets/preparation.py-366-373 (1)
366-373: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Stream merges instead of loading each shard at once.

inp.read() pulls the whole dataset shard into memory before writing it back out. Large training files can OOM the worker during config preparation. Copy incrementally and only track whether the last line ended with \n.
Suggested fix
     with open(output_file, "w", encoding="utf-8") as out:
         for f in files:
             with open(f, "r", encoding="utf-8") as inp:
-                content = inp.read()
-                out.write(content)
-                # Ensure newline between files
-                if content and not content.endswith("\n"):
+                last_line: str | None = None
+                for last_line in inp:
+                    out.write(last_line)
+                if last_line and not last_line.endswith("\n"):
                     out.write("\n")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/datasets/preparation.py` around lines
366 - 373, The merge logic in preparation.py currently uses the dataset shard
read path in the write loop, which loads each file fully into memory before
appending it to the output. Update the file-copying flow in the shard merge
section to stream data incrementally instead of calling inp.read(), and use the
existing output handling to track whether the last chunk ended with a newline so
the separator behavior stays correct. Keep the change localized to the merge
loop that iterates over files and writes to output_file.
services/rl/src/nmp/rl/tasks/training/datasets/preparation.py-406-408 (1)
406-408: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Guard tiny datasets before auto-splitting.

A 1-row file produces val_size == 1, an empty train split, and compile_dpo_config() later fails on the generated empty train file. Clamp validation to at most len(lines) - 1 and fail early when the file is too small to split.
Suggested fix
     with open(train_file, "r", encoding="utf-8") as f:
         lines = [line for line in f if line.strip()]
+
+    if len(lines) < 2:
+        raise DatasetFormatError("Need at least 2 samples to create a validation split")
 
     # Shuffle for reproducibility (important for multi-node!)
     # Uses global seed if not explicitly provided
     random.seed(seed)
     random.shuffle(lines)
 
-    val_size = max(1, int(len(lines) * val_ratio))
+    val_size = min(max(1, int(len(lines) * val_ratio)), len(lines) - 1)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/datasets/preparation.py` around lines
406 - 408, Guard the auto-split logic in preparation.py so tiny datasets do not
produce an empty train file. In the code that sets val_size and builds
val_lines/train_lines, clamp the validation count to at most len(lines) - 1 and
add an early failure path when len(lines) is too small to split safely. Use the
existing split logic around val_size, val_lines, and train_lines to keep the
train output non-empty before compile_dpo_config() consumes it.
services/rl/src/nmp/rl/tasks/training/datasets/schemas.py-292-305 (1)
292-305: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Reject empty tool_calls.

tool_calls=[] passes this validator, but services/rl/src/nmp/rl/tasks/training/templates/nemotron-3.1.jinja Lines 28-41 treat any non-None value as a tool-call block and never close it when the list is empty. That yields malformed prompts.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/datasets/schemas.py` around lines 292 -
305, The SFTChatMessage validator in check_has_content_or_thinking_or_tool_calls
currently accepts tool_calls=[] because it only checks for None; update it to
reject empty tool call lists so only non-empty tool_calls are allowed. Keep the
existing content/thinking rules intact, and make sure the validation logic in
SFTChatMessage treats tool_calls the same way the nemotron-3.1.jinja template
does when deciding whether a tool-call block should exist.
services/rl/src/nmp/rl/tasks/training/datasets/schemas.py-366-369 (1)
366-369: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Require at least one chat message.

messages=[] validates here, but services/rl/src/nmp/rl/tasks/training/templates/nemotron-3.1.jinja Lines 1-6 and services/rl/src/nmp/rl/tasks/training/templates/nemotron-3.3.jinja Lines 2-7 index messages[0] immediately. Empty rows pass validation and then crash during template rendering.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/datasets/schemas.py` around lines 366 -
369, Require at least one chat message in the schema used by the Nemotron
training dataset. Update the messages field in the relevant model/class in
schemas.py so empty lists are rejected during validation, matching the
assumptions in the nemotron-3.1.jinja and nemotron-3.3.jinja templates that
access messages[0]. Keep the fix localized to the schema definition for the
conversation object so invalid rows fail early instead of crashing during
template rendering.
services/rl/src/nmp/rl/tasks/training/datasets/schemas.py-268-271 (1)
268-271: 🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

thinking is accepted without any renderer.

This schema allows messages with thinking and no content, but nemotron-3.1.jinja Lines 42-46, nemotron-3.3.jinja Lines 12-17, and phi-4.jinja Lines 6-10 all render message['content'] only. Valid rows will render incorrectly or fail.

Also applies to: 292-305
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/datasets/schemas.py` around lines 268 -
271, The message schema currently allows `thinking` without any rendering
support, so update the dataset/message handling to either reject rows where
`thinking` is present but `content` is missing or explicitly map `thinking` into
the rendered prompt path. Make the fix in the message schema/validation flow and
in the Jinja renderers (`nemotron-3.1.jinja`, `nemotron-3.3.jinja`,
`phi-4.jinja`) so they use the same source field consistently rather than only
`message['content']`; use the `thinking` and message schema definitions in
`schemas.py` to locate the affected logic.
services/rl/src/nmp/rl/tasks/training/datasets/schemas.py-351-351 (1)
351-351: 🎯 Functional Correctness | 🟠 Major

Fix the chat schema name/tag mismatch. get_sft_dataset_discriminator() returns SFTChatDatasetItemSchema for messages, but the class and union tag are SFTPChatDatasetItemSchema, so chat rows won’t resolve to that branch. Update the class/tag names to match.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/datasets/schemas.py` at line 351, The
chat dataset discriminator and schema class name are mismatched, so
`get_sft_dataset_discriminator()` cannot resolve `messages` rows to the intended
branch. Update the `SFTPChatDatasetItemSchema` class and its union/tag usage in
`schemas.py` so the name and discriminator key match exactly with
`SFTChatDatasetItemSchema`, and make sure any references in the dataset schema
union use the corrected identifier consistently.
services/rl/src/nmp/rl/tasks/training/templates/nemotron-3.3.jinja-17-20 (1)
17-20: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Use real newlines here.

Line 17 and Line 20 emit \\n\\n, while Line 8 uses \n\n. That writes literal backslashes into the prompt and changes the expected chat format.
Suggested fix
-    {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n' + content | trim + '<|eot_id|>' }}
+    {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + content | trim + '<|eot_id|>' }}
@@
-    {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}
+    {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/templates/nemotron-3.3.jinja` around
lines 17 - 20, In the Nemotron 3.3 Jinja template, the message rendering block
is emitting literal backslash characters because the assistant/user header
template uses escaped newline text instead of actual newlines. Update the prompt
construction around the message loop and the add_generation_prompt branch so it
matches the existing newline style used earlier in the template, and ensure the
rendered chat format from the template stays consistent across the
header/content separators.
services/rl/src/nmp/rl/tasks/training/utils.py-69-77 (1)
69-77: 🩺 Stability & Availability | 🟠 Major

Guard against zero GPUs. torch.cuda.device_count() can return 0 when CUDA is unavailable, so this branch can emit --nproc_per_node 0. Treat <= 0 as missing and fall back to 1.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/utils.py` around lines 69 - 77, The GPU
count fallback in the training utils can currently pass through a zero value
from torch.cuda.device_count(), which leads to an invalid nproc_per_node
setting. Update the logic in the GPU detection block around gpus_per_node so
that any value less than or equal to 0 is treated as unavailable, and fall back
to "1" just like the exception path. Keep the fix localized to the existing
os.environ.get("GPUS_PER_NODE") and torch.cuda.device_count() handling.
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/nemo_rl_logger.py-42-49 (1)
42-49: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Validate steps_per_epoch and log_interval in __init__.

Lines 100 and 105 divide/modulo by constructor inputs with no guard. A zero or negative value crashes the first metrics update. Reject invalid values before training starts.
Suggested fix
     def __init__(
         self,
         steps_per_epoch: int,
         job_ctx: NMPJobContext | None = None,
         log_interval: int = 10,
         max_steps: int | None = None,
         num_epochs: int | None = None,
     ):
@@
+        if steps_per_epoch <= 0:
+            raise ValueError("steps_per_epoch must be > 0")
+        if log_interval <= 0:
+            raise ValueError("log_interval must be > 0")
+
         self._job_ctx = job_ctx or NMPJobContext.from_env()
Also applies to: 97-105
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/nemo_rl_logger.py`
around lines 42 - 49, Validate the constructor inputs in NemoRLLogger.__init__:
reject zero or negative values for steps_per_epoch and log_interval before
storing them, since methods like _log_metrics and the step/update logic later
perform division and modulo using these fields. Add an early guard in __init__
that raises a clear error for invalid values so training fails fast instead of
crashing on the first metrics update.
services/rl/src/nmp/rl/tasks/training/templates/llama-3.2-instruct.jinja-19-25 (1)
19-25: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Serialize the tool schema with one tojson call.

Line 24 hand-builds JSON with raw tname and tdesc. Any quote or newline in those fields makes the tool definition invalid and corrupts the prompt format. Serialize a dict instead of concatenating JSON fragments.
Suggested fix
     {%- for t in tools %}
         {%- set tname = t.function.name %}
         {%- set tdesc = t.function.description %}
-        {%- set tparams = t.function.parameters | tojson %}
         {{- "Use the function '" + tname + "' to '" + tdesc + "':\n" }}
-        {{- '{"name": "' + tname + '", "description": "' + tdesc + '", "parameters": ' + tparams + '}\n\n' }}
+        {{- {"name": tname, "description": tdesc, "parameters": t.function.parameters} | tojson + '\n\n' }}
     {%- endfor %}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/templates/llama-3.2-instruct.jinja`
around lines 19 - 25, The tool schema rendering in the llama-3.2-instruct.jinja
template is manually concatenating JSON fragments with raw tname and tdesc,
which can break on quotes or newlines; update the loop over tools to serialize a
single dictionary with one tojson call instead of building the JSON string by
hand, while keeping the existing t.function.name, t.function.description, and
t.function.parameters values in the output.
services/rl/src/nmp/rl/tasks/training/templates/llama-3.3-instruct.jinja-19-25 (1)
19-25: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Serialize the tool schema with one tojson call.

Line 24 hand-builds JSON with raw tname and tdesc. Any quote or newline in those fields makes the tool definition invalid and corrupts the prompt format. Serialize a dict instead of concatenating JSON fragments.
Suggested fix
     {%- for t in tools %}
         {%- set tname = t.function.name %}
         {%- set tdesc = t.function.description %}
-        {%- set tparams = t.function.parameters | tojson %}
         {{- "Use the function '" + tname + "' to '" + tdesc + "':\n" }}
-        {{- '{"name": "' + tname + '", "description": "' + tdesc + '", "parameters": ' + tparams + '}\n\n' }}
+        {{- {"name": tname, "description": tdesc, "parameters": t.function.parameters} | tojson + '\n\n' }}
     {%- endfor %}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/templates/llama-3.3-instruct.jinja`
around lines 19 - 25, The tool schema serialization in the
llama-3.3-instruct.jinja template is being hand-built with raw tname and tdesc,
which can break JSON when those values contain quotes or newlines. Update the
loop over tools to construct a single dictionary-like payload for each tool and
serialize that whole object with one tojson call, using the existing
t.function.name, t.function.description, and t.function.parameters fields. Keep
the surrounding prompt text generation in the same template, but remove the
concatenated JSON fragment so the tool definition stays valid and robust.
services/rl/src/nmp/rl/tasks/training/errors/converter.py-89-102 (1)
89-102: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Keep create_error_details() non-throwing.

Line 90 calls get_error_converter() outside the fallback guard. If rule loading fails here, the original training exception escapes the error path and TrainingRunner.run() falls back to its stale default result instead of structured error details. Wrap converter acquisition in the same try and use the InternalError fallback from that branch too.
Suggested fix
 def create_error_details(exception: Exception) -> ErrorDetails:
@@
-    converter = get_error_converter()
     try:
+        converter = get_error_converter()
         converter.raise_converted_or_default(exception)
     except CustomizerTrainingError as converted:
         return converted.to_error_details()
     except Exception as e:  # noqa: BLE001 - intentional last-resort guard to guarantee dict return
         # Unexpected exception type - wrap in InternalError to ensure we always return a dict
         logger.warning(f"Unexpected exception type from converter: {type(e).__name__}: {e}")
         exc = InternalError(
             message=f"An internal error occurred. ({type(exception).__name__}: {exception})",
-            detail=str(exception),
+            detail=f"{type(exception).__name__}: {exception}",
         )
         return exc.to_error_details()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/errors/converter.py` around lines 89 -
102, Keep create_error_details() non-throwing by moving get_error_converter()
inside the same try block as converter.raise_converted_or_default() in
converter.py. If converter acquisition fails, handle it in the existing except
Exception branch and return InternalError.to_error_details() so
TrainingRunner.run() always gets structured error details instead of leaking the
original exception. Use the existing create_error_details(),
get_error_converter(), and InternalError flow as the fix point.

🟡 Minor comments (2)

plugins/nemo-rl/src/nemo_rl_plugin/schema.py-36-41 (1)
36-41: 🗄️ Data Integrity & Integration | 🟡 Minor

Cap OutputRequest.name at 255 chars
OutputResponse already limits both name and fileset to 255, so an oversized explicit output name should be rejected here instead of during transform.
Suggested fix
 class OutputRequest(BaseModel):
     """Submitter-facing output preferences. ``name`` is auto-derived if omitted."""
 
     model_config = ConfigDict(extra="forbid")
 
-    name: str | None = None
+    name: str | None = Field(default=None, max_length=255)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-rl/src/nemo_rl_plugin/schema.py` around lines 36 - 41,
OutputRequest.name is not bounded, so an explicit oversized output name can slip
through until later transform logic. Add a 255-character max constraint to the
name field in OutputRequest (the model in schema.py) so validation fails
immediately, matching the limits already enforced by OutputResponse for name and
fileset. Keep the change localized to the OutputRequest model definition.
services/rl/src/nmp/rl/tasks/training/runner.py-81-81 (1)
81-81: 📐 Maintainability & Code Quality | 🟡 Minor

Use Self for __enter__. In services/rl/src/nmp/rl/tasks/training/runner.py:81, replace the string return type with Self and import it from typing.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/rl/src/nmp/rl/tasks/training/runner.py` at line 81, The __enter__
method in TrainingRunner is still using a string annotation instead of the
modern Self type. Update the TrainingRunner.__enter__ signature to return Self,
and add Self to the typing import list in the same module so the annotation is
resolved correctly.
Source: Coding guidelines

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 118f7c09-f7be-40f2-bcb5-727036cd8217

📥 Commits

Reviewing files that changed from the base of the PR and between bb73df7 and bae7dad.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (75)

docker-bake.hcl
docker/Dockerfile.nmp-rl-base
docker/Dockerfile.nmp-rl-tasks
docker/Dockerfile.nmp-rl-training
docker/rl/Dockerfile.platform-workspace
docker/rl/pyproject.workspace.toml
packages/nmp_customization_common/src/nmp/customization_common/contributor/jobs.py
plugins/nemo-rl/README.md
plugins/nemo-rl/pyproject.toml
plugins/nemo-rl/src/nemo_rl_plugin/__init__.py
plugins/nemo-rl/src/nemo_rl_plugin/cli/__init__.py
plugins/nemo-rl/src/nemo_rl_plugin/cli/inputs.py
plugins/nemo-rl/src/nemo_rl_plugin/cli/main.py
plugins/nemo-rl/src/nemo_rl_plugin/config.py
plugins/nemo-rl/src/nemo_rl_plugin/contributor.py
plugins/nemo-rl/src/nemo_rl_plugin/jobs/jobs.py
plugins/nemo-rl/src/nemo_rl_plugin/schema.py
plugins/nemo-rl/src/nemo_rl_plugin/sdk/__init__.py
plugins/nemo-rl/src/nemo_rl_plugin/sdk/resources.py
plugins/nemo-rl/src/nemo_rl_plugin/transform.py
pyproject.toml
services/rl/README.md
services/rl/pyproject.toml
services/rl/src/nmp/rl/app/constants.py
services/rl/src/nmp/rl/app/jobs/compiler.py
services/rl/src/nmp/rl/app/jobs/training/schemas.py
services/rl/src/nmp/rl/compile.py
services/rl/src/nmp/rl/config.py
services/rl/src/nmp/rl/entities/values.py
services/rl/src/nmp/rl/images.py
services/rl/src/nmp/rl/schemas.py
services/rl/src/nmp/rl/tasks/file_io/__init__.py
services/rl/src/nmp/rl/tasks/file_io/__main__.py
services/rl/src/nmp/rl/tasks/file_io/callbacks.py
services/rl/src/nmp/rl/tasks/file_io/run.py
services/rl/src/nmp/rl/tasks/model_entity/__init__.py
services/rl/src/nmp/rl/tasks/model_entity/__main__.py
services/rl/src/nmp/rl/tasks/model_entity/run.py
services/rl/src/nmp/rl/tasks/training/__main__.py
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/backend.py
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/callbacks.py
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/checkpoints.py
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/dpo_config.py
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/dpo_driver.py
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/grpo_config.py
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/grpo_driver.py
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/nemo_rl_logger.py
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/no_override_requirements.txt
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/preference_datasets/__init__.py
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/preference_datasets/helpsteer3.py
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/preference_datasets/tulu3.py
services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/ray_bootstrap.py
services/rl/src/nmp/rl/tasks/training/chat_templates.py
services/rl/src/nmp/rl/tasks/training/datasets/preparation.py
services/rl/src/nmp/rl/tasks/training/datasets/schemas.py
services/rl/src/nmp/rl/tasks/training/datasets/validation.py
services/rl/src/nmp/rl/tasks/training/distributed.py
services/rl/src/nmp/rl/tasks/training/errors/converter.py
services/rl/src/nmp/rl/tasks/training/errors/error_rules.yaml
services/rl/src/nmp/rl/tasks/training/errors/exceptions.py
services/rl/src/nmp/rl/tasks/training/errors/parser.py
services/rl/src/nmp/rl/tasks/training/integrations.py
services/rl/src/nmp/rl/tasks/training/protocol.py
services/rl/src/nmp/rl/tasks/training/runner.py
services/rl/src/nmp/rl/tasks/training/templates/llama-3.1-instruct.jinja
services/rl/src/nmp/rl/tasks/training/templates/llama-3.2-instruct.jinja
services/rl/src/nmp/rl/tasks/training/templates/llama-3.3-instruct.jinja
services/rl/src/nmp/rl/tasks/training/templates/nemotron-3.1.jinja
services/rl/src/nmp/rl/tasks/training/templates/nemotron-3.3.jinja
services/rl/src/nmp/rl/tasks/training/templates/nemotron-super-3.3.jinja
services/rl/src/nmp/rl/tasks/training/templates/phi-4.jinja
services/rl/src/nmp/rl/tasks/training/utils.py
services/rl/tests/test_compiler.py
services/rl/tests/test_dpo_config.py
services/rl/tests/test_schemas.py

coderabbitai · 2026-06-25T23:41:57Z

+    try:
+        from transformers import AutoTokenizer
+
+        tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+        template = getattr(tokenizer, "chat_template", None)
+        if template:
+            logger.debug(f"Found chat template in tokenizer for {model_path}")
+        return template
+    except Exception as e:
+        logger.warning(f"Could not load tokenizer to get chat template: {e}")


🔒 Security & Privacy | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Locate the target file and relevant call sites git ls-files 'services/rl/src/nmp/rl/tasks/training/chat_templates.py' \ 'services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/dpo_config.py' echo '--- chat_templates.py (selected lines) ---' nl -ba services/rl/src/nmp/rl/tasks/training/chat_templates.py | sed -n '1,220p' echo '--- dpo_config.py (selected lines) ---' nl -ba services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/dpo_config.py | sed -n '1,220p'

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 371

🏁 Script executed:

#!/bin/bash set -euo pipefail # Locate the target file and relevant call sites git ls-files 'services/rl/src/nmp/rl/tasks/training/chat_templates.py' \ 'services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/dpo_config.py' echo '--- chat_templates.py (selected lines) ---' nl -ba services/rl/src/nmp/rl/tasks/training/chat_templates.py | sed -n '1,220p' echo '--- dpo_config.py (selected lines) ---' nl -ba services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/dpo_config.py | sed -n '1,220p'

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 371

🏁 Script executed:

#!/bin/bash set -euo pipefail echo '--- chat_templates.py ---' sed -n '1,220p' services/rl/src/nmp/rl/tasks/training/chat_templates.py | cat -n echo '--- dpo_config.py ---' sed -n '1,220p' services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/dpo_config.py | cat -n

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 20073

🏁 Script executed:

#!/bin/bash set -euo pipefail echo '--- chat_templates.py ---' sed -n '1,220p' services/rl/src/nmp/rl/tasks/training/chat_templates.py | cat -n echo '--- dpo_config.py ---' sed -n '1,220p' services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/dpo_config.py | cat -n

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 20073

🏁 Script executed:

#!/bin/bash set -euo pipefail echo '--- chat_templates.py ---' sed -n '1,220p' services/rl/src/nmp/rl/tasks/training/chat_templates.py | cat -n echo '--- dpo_config.py ---' sed -n '1,220p' services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/dpo_config.py | cat -n

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 20073

🏁 Script executed:

#!/bin/bash set -euo pipefail echo '--- chat_templates.py ---' sed -n '1,220p' services/rl/src/nmp/rl/tasks/training/chat_templates.py | cat -n echo '--- dpo_config.py ---' sed -n '1,220p' services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/dpo_config.py | cat -n

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 20073

🏁 Script executed:

#!/bin/bash set -euo pipefail echo '--- chat_templates.py ---' sed -n '1,220p' services/rl/src/nmp/rl/tasks/training/chat_templates.py | cat -n echo '--- dpo_config.py ---' sed -n '1,220p' services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/dpo_config.py | cat -n

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 20073

🏁 Script executed:

#!/bin/bash set -euo pipefail echo '--- chat_templates.py matches ---' rg -n -C 4 'from_pretrained|chat_template|model_path|trust_remote_code' services/rl/src/nmp/rl/tasks/training/chat_templates.py echo '--- dpo_config.py matches ---' rg -n -C 4 'resolve_chat_template|model\.path|customizer_config|chat_template' services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/dpo_config.py

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 14263

🏁 Script executed:

#!/bin/bash set -euo pipefail echo '--- chat_templates.py matches ---' rg -n -C 4 'from_pretrained|chat_template|model_path|trust_remote_code' services/rl/src/nmp/rl/tasks/training/chat_templates.py echo '--- dpo_config.py matches ---' rg -n -C 4 'resolve_chat_template|model\.path|customizer_config|chat_template' services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/dpo_config.py

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 14263

🏁 Script executed:

#!/bin/bash set -euo pipefail echo '--- chat_templates.py matches ---' rg -n -C 4 'from_pretrained|chat_template|model_path|trust_remote_code' services/rl/src/nmp/rl/tasks/training/chat_templates.py echo '--- dpo_config.py matches ---' rg -n -C 4 'resolve_chat_template|model\.path|customizer_config|chat_template' services/rl/src/nmp/rl/tasks/training/backends/nemo_rl/dpo_config.py

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 14263

Remove trust_remote_code=True from tokenizer fallback. This can execute arbitrary code from model_path while only resolving a chat template; keep it off unless the source is strictly allowlisted.

🧰 Tools

🪛 ast-grep (0.44.0)

[warning] 109-109: Loading a HuggingFace model, tokenizer, dataset or pipeline with trust_remote_code=True executes arbitrary code from the remote repository. Only enable it for fully trusted sources; prefer artifacts that do not require it.
Context: trust_remote_code=True
Note: [CWE-94] Improper Control of Generation of Code ('Code Injection').

(transformers-trust-remote-code-python)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/rl/src/nmp/rl/tasks/training/chat_templates.py` around lines 107 - 116, The tokenizer fallback in chat template resolution is using trust_remote_code=True, which should be removed from AutoTokenizer.from_pretrained in the template lookup path. Update the fallback in the chat template helper to load the tokenizer without remote code execution and keep it limited to extracting the chat_template attribute; if special handling is needed, gate it behind a strict allowlist in the same function.

Source: Linters/SAST tools

coderabbitai · 2026-06-25T23:41:57Z

+        finally:
+            # === Phase 5: Write result (coordinator only) ===
+            self._write_result(result)
+            return result


🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win

Don't return from finally.

Line 153 suppresses any active exception, including BaseException and failures inside the error-conversion path. That can turn real job failures into the default "No result" response. Keep finally for _write_result() only and return after it.

Suggested fix

- finally: - # === Phase 5: Write result (coordinator only) === - self._write_result(result) - return result + finally: + # === Phase 5: Write result (coordinator only) === + self._write_result(result) + + return result

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

finally:

# === Phase 5: Write result (coordinator only) ===

self._write_result(result)

return result

finally:

# === Phase 5: Write result (coordinator only) ===

self._write_result(result)

return result

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/rl/src/nmp/rl/tasks/training/runner.py` around lines 150 - 153, The `finally` block in `training.runner` should not return, because `Runner`/`_write_result()` is currently suppressing active exceptions and can hide real failures as a default result. Keep the `finally` section in `runner.py` limited to calling `_write_result(result)`, then move the `return result` out of `finally` so normal completion still returns the result while exceptions propagate correctly.

coderabbitai · 2026-06-25T23:41:57Z

+        if self._dist_ctx.is_coordinator:
+            if isinstance(self._backend, SupportsPreprocessing):
+                self._progress.report_running("conversions")
+                self._backend.run_preprocessing(self._config)
+                logger.info("Pre-training conversions complete")
+            # Always release workers, even if no conversions are needed
+            self._dist_ctx.signal(BARRIER_PREPROCESSING_COMPLETE)
+        else:
+            self._dist_ctx.wait_for_coordinator(BARRIER_PREPROCESSING_COMPLETE)


🩺 Stability & Availability | 🔴 Critical | 🏗️ Heavy lift

Coordinator-only failures strand worker ranks.

These phases only publish success markers. If preprocessing, config compilation, or the YAML write fails on the coordinator, every worker stays in wait_for_coordinator() until timeout. The barrier protocol needs a failure marker/shared error payload so workers can exit immediately and surface the same error.

Also applies to: 218-235

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/rl/src/nmp/rl/tasks/training/runner.py` around lines 197 - 205, The coordinator-only phases in runner.py only signal success, so failures in run_preprocessing, config compilation, or the YAML write can leave worker ranks stuck in wait_for_coordinator(). Update the barrier flow around BARRIER_PREPROCESSING_COMPLETE and the later coordinator steps to publish a failure marker or shared error payload when an exception occurs. Make the coordinator catch and propagate the same error state, and have workers check that state after wait_for_coordinator() so they can exit immediately instead of timing out.

anubhutivyas requested review from a team as code owners June 25, 2026 23:16

github-actions Bot added the feat label Jun 25, 2026

feat(customizer): add RL backend in customizer plugin

bae7dad

Signed-off-by: anubhutiv <anubhutiv@nvidia.com>

anubhutivyas force-pushed the aalgo-292-dpo-support/anubhutiv branch from cc4e23d to bae7dad Compare June 25, 2026 23:17

coderabbitai Bot reviewed Jun 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(customizer): add RL backend in customizer plugin#479

feat(customizer): add RL backend in customizer plugin#479
anubhutivyas wants to merge 1 commit into
mainfrom
aalgo-292-dpo-support/anubhutiv

anubhutivyas commented Jun 25, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 25, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Uh oh!

coderabbitai Bot Jun 25, 2026

Uh oh!

coderabbitai Bot Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

anubhutivyas commented Jun 25, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

anubhutivyas commented Jun 25, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 25, 2026 •

edited

Loading