feat(deployments): implement DockerDeploymentBackend (AIRCORE-756) by tylersbray · Pull Request #399 · NVIDIA-NeMo/nemo-platform

tylersbray · 2026-06-23T01:52:24Z

Summary

Implements DockerDeploymentBackend for nemo-deployments-plugin (AIRCORE-756).

Rebased onto main after #469 merged (prerequisites on Deployment, volume DELETING, orphan grace, shutdown cancellation). Branch history was reset to cherry-pick only the Docker backend commits — no duplicated reconciler code from the old #315 stack.

v1 scope: single-container deployments; rejects init_containers and multi-container specs
Entity contract: reconciler passes config_name only; backend loads full DeploymentConfig from the entity store via NemoEntitiesClient
Lifecycle: docker pull/run, inspect, stop/rm, volume create/delete, port allocation from entity DockerDeploymentConfig
Probes: exec and http_get readiness supported; grpc deferred
GPU: plugin-local DockerGPUPool (consolidation to nemo_platform_plugin tracked separately in AIRCORE-759)
Idempotency: workspace, name, managed-by, and deployment-config label matching

Test plan

pytest plugins/nemo-deployments/tests/unit — 138 passed
pytest plugins/nemo-deployments/tests/integration — 4 passed (Docker daemon required)
test_puller_server_prerequisite_chain — uses deployment-scoped prerequisites (puller → server)
CodeRabbit review fixes from d1270d0 included (port allocation, probe host resolution, label constants, test hardening)

Summary by CodeRabbit

Release Notes

New Features
- Added Docker deployment backend registered by default, with GPU allocation pooling, automatic host port selection, container readiness probes, managed log retrieval (with truncation), and Docker volume create/read/delete support.
Bug Fixes
- Improved deployment idempotency and “deployment lost” handling.
- Updated status_in filtering to use the stored nested data.status field.
Dependencies
- Added httpx>=0.27; introduced Docker extra pinned to docker>=7.0.
Tests
- Expanded Docker unit/integration coverage, including new GPU/labels/ports/probe/status tests and adjusted integration import paths.

tylersbray · 2026-06-23T02:13:43Z

@coderabbitai full review

coderabbitai · 2026-06-23T02:13:50Z

✅ Action performed

Full review finished.

coderabbitai · 2026-06-23T02:21:06Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds a Docker deployment backend with container, volume, GPU, port, probe, and status handling. Registers it as the default docker backend and adds unit and integration coverage.

Changes

Docker Deployment Backend

Layer / File(s)	Summary
Config and identity wiring `plugins/nemo-deployments/pyproject.toml`, `plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/config.py`, `plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/labels.py`, `plugins/nemo-deployments/src/nemo_deployments_plugin/api/v2/deployments.py`, `plugins/nemo-deployments/src/nemo_deployments_plugin/controller.py`, `plugins/nemo-deployments/tests/unit/reconciler/test_entity_client.py`, `plugins/nemo-deployments/tests/unit/test_api_deployments.py`, `plugins/nemo-deployments/tests/integration/conftest.py`, `plugins/nemo-deployments/tests/integration/docker_availability.py`, `pytest.ini`	Defines `DockerExecutorConfig`, adds Docker/httpx dependency updates, introduces Docker naming and identity label helpers, and updates deployment and volume status filters and matching tests to use `data.status`.
GPU pool `plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/gpu.py`, `plugins/nemo-deployments/tests/unit/backends/docker/test_gpu.py`	Introduces the thread-safe Docker GPU pool, GPU detection, allocation recovery, and shared pool singleton.
Container spec helpers `plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/containers.py`, `plugins/nemo-deployments/tests/unit/backends/docker/test_restart_policy.py`, `plugins/nemo-deployments/tests/unit/backends/docker/test_status_mapping.py`	Adds Docker config parsing, restart policy mapping, env and volume merge helpers, port binding construction, and GPU device-request generation.
Ports, probes, status, and volumes `plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/ports.py`, `plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/probes.py`, `plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.py`, `plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/volumes.py`, `plugins/nemo-deployments/tests/unit/backends/docker/test_ports.py`, `plugins/nemo-deployments/tests/unit/backends/docker_helpers.py`, `plugins/nemo-deployments/tests/unit/backends/docker/test_labels.py`	Adds host-port allocation, readiness probe checks, Docker status mapping, and Docker volume lifecycle helpers.
Backend lifecycle `plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py`, `plugins/nemo-deployments/tests/unit/backends/docker/conftest.py`, `plugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.py`, `plugins/nemo-deployments/tests/unit/backends/docker/test_idempotency.py`, `plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py`, `plugins/nemo-deployments/tests/integration/test_reconcile_docker.py`	Implements DockerDeploymentBackend initialization, deployment creation and status handling, deletion, listing, logs, volume delegation, and end-to-end lifecycle/reconciliation tests.
Registry wiring `plugins/nemo-deployments/src/nemo_deployments_plugin/backends/registry.py`	Registers DockerDeploymentBackend in `BACKEND_CLASSES` as the default docker backend.

Possibly related PRs

NVIDIA-NeMo/nemo-platform#280: Adds the Docker backend registration and backend contract wiring this PR extends.
NVIDIA-NeMo/nemo-platform#315: Uses the same data.status filter path updated here in controller reconciliation listings.

Suggested reviewers

benmccown
mckornfield

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 7.03% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly states the main change: implementing DockerDeploymentBackend for deployments.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch 756-docker-deployment-backend/tbray

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 12

🧹 Nitpick comments (7)

plugins/nemo-deployments/tests/unit/backends/docker_helpers.py (1)

13-26: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use a concrete restart-policy type and drop the ignore.

restart_policy: str is too broad and forces # type: ignore[arg-type]. Type this as the concrete policy union/Literal used by DeploymentConfig and remove the ignore.

As per coding guidelines, "**/*.py: Always prefer concrete type hints over string-based ones in Python code."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-deployments/tests/unit/backends/docker_helpers.py` around lines
13 - 26, The sample_config function has a restart_policy parameter typed as str
which is too broad and requires a type ignore comment. Identify the concrete
Literal type or union type that DeploymentConfig expects for its restart_policy
field, update the function signature to use that concrete type instead of str
for the restart_policy parameter, and remove the # type: ignore[arg-type]
comment on the restart_policy argument passed to DeploymentConfig.
Source: Coding guidelines

plugins/nemo-deployments/tests/integration/conftest.py (1)

10-18: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add type hints to module-level variables.

Per coding guidelines, prefer concrete type hints. Add annotations:

DOCKER_AVAILABLE: bool = True

and

skip_without_docker: pytest.MarkDecorator = pytest.mark.skipif(...)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-deployments/tests/integration/conftest.py` around lines 10 - 18,
Add explicit type hints to the module-level variables DOCKER_AVAILABLE and
skip_without_docker. For DOCKER_AVAILABLE, add a bool type annotation since it
is assigned boolean values (True or False) in the try-except block. For
skip_without_docker, add a pytest.MarkDecorator type annotation since it is
assigned the result of pytest.mark.skipif(). These type hints should be added
directly at the point of variable assignment to comply with coding guidelines.

Source: Coding guidelines

plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py (2)

162-164: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Narrow exception handling in cleanup.

Catching broad Exception hides potential issues. Use NotFound from docker.errors (already imported) for consistency with other cleanup blocks.

♻️ Proposed fix

         try:
             docker_backend._client.volumes.remove(docker_volume_name("itest", "data"))
-        except Exception:
+        except NotFound:
             pass

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py`
around lines 162 - 164, The exception handling in the volume cleanup block is
catching a broad Exception which can mask real issues. Replace the generic
Exception catch with docker.errors.NotFound, which is already imported and used
consistently in other cleanup blocks throughout the test. This will ensure only
the expected "volume not found" error is silently ignored while other unexpected
errors will properly surface for debugging.

126-128: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Accessing private attribute in test cleanup.

Test cleanup directly accesses docker_backend._client, which couples tests to implementation details. Consider adding a public cleanup method to the backend or using the Docker SDK directly via docker.from_env().

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py`
around lines 126 - 128, The test cleanup code directly accesses the private
`_client` attribute from docker_backend, creating coupling to implementation
details. Either add a public cleanup method to the docker_backend class that
encapsulates the container removal logic
(containers.get(c_name).remove(force=True)) and call that method from the test,
or alternatively, use docker.from_env() directly in the test cleanup code to
access the Docker client without relying on private implementation details.
Remove the direct access to the underscore-prefixed _client attribute.

plugins/nemo-deployments/tests/integration/test_reconcile_docker.py (2)

160-174: 🚀 Performance & Scalability | 🔵 Trivial | 💤 Low value

Optimize loop by constructing dictionaries once.

The by_name and by_config dictionaries are reconstructed on every iteration. Move construction outside the loop for clarity and minor efficiency gain.

♻️ Proposed refactor

+        by_name = {("itest", "puller"): puller_dep, ("itest", "server"): server_dep}
+        by_config = {
+            ("itest", "puller-cfg"): puller_dep,
+            ("itest", "server-cfg"): server_dep,
+        }
         for _ in range(40):
-            by_name = {("itest", "puller"): puller_dep, ("itest", "server"): server_dep}
-            by_config = {
-                ("itest", "puller-cfg"): puller_dep,
-                ("itest", "server-cfg"): server_dep,
-            }
             await deployment_reconciler.reconcile_one(

Apply the same pattern to the server reconciliation loop at lines 177-192.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-deployments/tests/integration/test_reconcile_docker.py` around
lines 160 - 174, The by_name and by_config dictionaries are being reconstructed
on every iteration of the loop in the puller deployment reconciliation block.
Move the construction of both dictionaries outside the for loop (before the `for
_ in range(40):` line) since they contain the same values on each iteration.
Apply this same optimization pattern to the server reconciliation loop that
follows, moving the dictionary construction for that loop outside as well to
avoid redundant object creation.

128-138: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add type hints to get_side_effect.

Per coding guidelines, prefer concrete type hints. This function should annotate parameters and return type:

async def get_side_effect(entity_type: type, name: str, workspace: str | None = None) -> Volume | Deployment | DeploymentConfig:

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-deployments/tests/integration/test_reconcile_docker.py` around
lines 128 - 138, The get_side_effect function is missing type hints on its
parameters and return type. Add type annotations to the function signature:
annotate the entity_type parameter as type, the name parameter as str, the
workspace parameter as str or None with a default value of None, and the return
type as a union of Volume, Deployment, and DeploymentConfig.

Source: Coding guidelines

plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.py (1)

28-33: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Restart policy parameter has no effect.

Both branches in map_exited_status return "FAILED" regardless of restart_policy. Consider removing the parameter if not needed, or document the future intent.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.py`
around lines 28 - 33, In the map_exited_status function, the restart_policy
parameter is not being utilized effectively as both the if condition checking
for restart_policy == "Always" and the final return statement return the same
"FAILED" status. Either remove the restart_policy parameter entirely if it is
not needed for this function's logic, or modify the conditional logic to return
different DeploymentStatus values based on different restart_policy values so
that the parameter actually influences the function's behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py`:
- Around line 148-157: The loop iterating through container_spec.ports does not
track previously assigned host ports within the current iteration, allowing
find_available_port to potentially assign the same host port multiple times.
Modify the code to collect already-assigned ports from the host_ports dictionary
in each loop iteration and pass them to find_available_port so it can exclude
those ports when searching for available options, ensuring no duplicate port
assignments occur within a single multi-port deployment.
- Line 292: Replace hardcoded label key strings with imported constants to
maintain consistency with canonical definitions. First, add imports for
BACKOFF_LIMIT_LABEL and MANAGED_BY_KEY from labels.py at the top of the file.
Then replace the hardcoded string "nmp.nvidia.com/backoff-limit" in the
backoff_limit variable assignment with the BACKOFF_LIMIT_LABEL constant, and
replace the hardcoded "managed-by" strings used in the container_labels.get()
call and labels.get() comparison with the MANAGED_BY_KEY constant.

In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/ports.py`:
- Around line 17-18: Remove the TYPE_CHECKING conditional guard around the
docker import. Convert the conditional import of docker (currently under if
TYPE_CHECKING:) to a regular import statement at the module level, so that
docker is imported unconditionally rather than only for type checking purposes.

In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/probes.py`:
- Around line 90-99: The probe target resolution incorrectly hardcodes 127.0.0.1
instead of using the actual host from the host_url parameter, which breaks
functionality for remapped host ports and remote Docker daemons. Replace all
instances of 127.0.0.1 in the base URL construction within the port resolution
logic (in the branches checking isinstance(port, int), the named_ports lookup,
and the host_ports loop) with the appropriate host extracted from the host_url
parameter. Apply the same fix to the additional probe target resolution code
mentioned in lines 111-127 to ensure consistency across all probe endpoint
constructions.
- Around line 11-19: The docker import statement is unnecessarily wrapped in a
TYPE_CHECKING conditional block. Since the file already uses from __future__
import annotations which stringifies all type hints at runtime, there is no
performance cost to importing DockerContainer directly. Remove the if
TYPE_CHECKING: conditional block that wraps the docker import statement and move
the from docker.models.containers import Container as DockerContainer import to
the top-level imports section with the other import statements like httpx and
the nemo_deployments_plugin imports.

In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/volumes.py`:
- Around line 15-16: The docker import is currently guarded by the TYPE_CHECKING
conditional block. Remove the TYPE_CHECKING guard and import statement around
the docker module, and instead add docker as a regular import statement at the
top of the file with the other imports. This ensures docker is imported as a
concrete type hint rather than a string-based one, following the coding
guidelines.

In
`@plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py`:
- Around line 26-32: The try-except block checking Docker availability and
setting _DOCKER_AVAILABLE is duplicated from the parent conftest.py file. Remove
lines 26-32 containing the duplicate Docker availability check (the try block
with docker.from_env().ping() and the _DOCKER_AVAILABLE assignment), and instead
use the existing DOCKER_AVAILABLE fixture or skip_without_docker marker that is
already defined in the parent conftest.py. Pytest automatically discovers and
uses parent conftest fixtures and markers, so this duplicate check should be
completely removed from this file.

In `@plugins/nemo-deployments/tests/integration/test_reconcile_docker.py`:
- Around line 33-39: Remove the duplicated Docker availability check in the
test_reconcile_docker.py file. Delete the entire try-except block that attempts
to import docker and calls docker.from_env().ping() to set DOCKER_AVAILABLE.
Instead, rely on the Docker availability check already defined in the parent
conftest.py file, which Pytest discovers automatically through its marker
system. This eliminates code duplication and ensures consistency across all test
files in the project.

In `@plugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.py`:
- Line 11: The import statement at line 11 using `from backends.docker_helpers
import container_attrs, sample_config` is using an absolute import path that
does not resolve correctly for the test package layout. Change this to use a
relative import that properly navigates from the test file's location
(plugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.py) to
the helpers module, or adjust the import to use the correct absolute path that
matches your package structure so that pytest can resolve the module correctly.

In `@plugins/nemo-deployments/tests/unit/backends/docker/test_idempotency.py`:
- Line 11: The import statement at the top of test_idempotency.py uses an
incorrect module path. The line `from backends.docker_helpers import
container_attrs, sample_config` needs to be updated to use the correct relative
or absolute import path that matches your repository's actual package structure.
Determine the correct path by checking where the docker_helpers module actually
exists in your repo (likely in the backends directory relative to the tests
directory), and adjust the import to use either a relative import (like from
..docker_helpers or similar) or the correct absolute path from the repo root.

In `@plugins/nemo-deployments/tests/unit/backends/docker/test_ports.py`:
- Around line 21-24: The assertion in test_is_port_free_local is a tautology
that always evaluates to true since it checks if something is true OR not true,
which provides no meaningful test coverage. Replace this assertion with a
deterministic test by either occupying a port and verifying is_port_free returns
False for that port, or using a known free port and verifying is_port_free
returns True. Set up a specific port state beforehand (such as binding to a port
or using a port known to be available) and then assert that is_port_free returns
the expected boolean result for that deterministic condition.
- Around line 26-36: The test_find_available_port_skips_used test depends on
real host port availability which makes it flaky and environment-dependent. Mock
the is_port_free function that is called within find_available_port to return
deterministic values (such as returning False for port 9000 to simulate it being
in use, and True for ports 9001 and 9002 to simulate them being free). Then
update the assertion to check for the exact expected port (9001) instead of
using an assertion with multiple possible values, ensuring the test behavior is
reproducible and not dependent on the actual system port state.

---

Nitpick comments:
In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.py`:
- Around line 28-33: In the map_exited_status function, the restart_policy
parameter is not being utilized effectively as both the if condition checking
for restart_policy == "Always" and the final return statement return the same
"FAILED" status. Either remove the restart_policy parameter entirely if it is
not needed for this function's logic, or modify the conditional logic to return
different DeploymentStatus values based on different restart_policy values so
that the parameter actually influences the function's behavior.

In
`@plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py`:
- Around line 162-164: The exception handling in the volume cleanup block is
catching a broad Exception which can mask real issues. Replace the generic
Exception catch with docker.errors.NotFound, which is already imported and used
consistently in other cleanup blocks throughout the test. This will ensure only
the expected "volume not found" error is silently ignored while other unexpected
errors will properly surface for debugging.
- Around line 126-128: The test cleanup code directly accesses the private
`_client` attribute from docker_backend, creating coupling to implementation
details. Either add a public cleanup method to the docker_backend class that
encapsulates the container removal logic
(containers.get(c_name).remove(force=True)) and call that method from the test,
or alternatively, use docker.from_env() directly in the test cleanup code to
access the Docker client without relying on private implementation details.
Remove the direct access to the underscore-prefixed _client attribute.

In `@plugins/nemo-deployments/tests/integration/conftest.py`:
- Around line 10-18: Add explicit type hints to the module-level variables
DOCKER_AVAILABLE and skip_without_docker. For DOCKER_AVAILABLE, add a bool type
annotation since it is assigned boolean values (True or False) in the try-except
block. For skip_without_docker, add a pytest.MarkDecorator type annotation since
it is assigned the result of pytest.mark.skipif(). These type hints should be
added directly at the point of variable assignment to comply with coding
guidelines.

In `@plugins/nemo-deployments/tests/integration/test_reconcile_docker.py`:
- Around line 160-174: The by_name and by_config dictionaries are being
reconstructed on every iteration of the loop in the puller deployment
reconciliation block. Move the construction of both dictionaries outside the for
loop (before the `for _ in range(40):` line) since they contain the same values
on each iteration. Apply this same optimization pattern to the server
reconciliation loop that follows, moving the dictionary construction for that
loop outside as well to avoid redundant object creation.
- Around line 128-138: The get_side_effect function is missing type hints on its
parameters and return type. Add type annotations to the function signature:
annotate the entity_type parameter as type, the name parameter as str, the
workspace parameter as str or None with a default value of None, and the return
type as a union of Volume, Deployment, and DeploymentConfig.

In `@plugins/nemo-deployments/tests/unit/backends/docker_helpers.py`:
- Around line 13-26: The sample_config function has a restart_policy parameter
typed as str which is too broad and requires a type ignore comment. Identify the
concrete Literal type or union type that DeploymentConfig expects for its
restart_policy field, update the function signature to use that concrete type
instead of str for the restart_policy parameter, and remove the # type:
ignore[arg-type] comment on the restart_policy argument passed to
DeploymentConfig.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5d5a85f9-3645-431f-b8df-57e8281266da

📥 Commits

Reviewing files that changed from the base of the PR and between 0b92da1 and f612b6a.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (22)

plugins/nemo-deployments/pyproject.toml
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/config.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/containers.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/gpu.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/labels.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/ports.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/probes.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/volumes.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/registry.py
plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py
plugins/nemo-deployments/tests/integration/conftest.py
plugins/nemo-deployments/tests/integration/test_reconcile_docker.py
plugins/nemo-deployments/tests/unit/backends/docker/conftest.py
plugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.py
plugins/nemo-deployments/tests/unit/backends/docker/test_idempotency.py
plugins/nemo-deployments/tests/unit/backends/docker/test_labels.py
plugins/nemo-deployments/tests/unit/backends/docker/test_ports.py
plugins/nemo-deployments/tests/unit/backends/docker/test_restart_policy.py
plugins/nemo-deployments/tests/unit/backends/docker/test_status_mapping.py
plugins/nemo-deployments/tests/unit/backends/docker_helpers.py

github-actions · 2026-06-25T21:39:42Z

Suite	Lines Covered	Line Rate	Branch Rate
Unit Tests	21322/27924	76.4%	61.4%
Integration Tests	12354/26693	46.3%	19.8%

Add Docker backend for nemo-deployments-plugin with single-container v1 lifecycle, entity-store config fetch, port allocation, exec/http probes, plugin-local GPU pool, and unit plus integration test coverage. Stacks on #315 (AIRCORE-758 reconciler prerequisite DAG). Signed-off-by: Tyler Bray <tbray@nvidia.com>

…E-756) - Exclude already-assigned host ports during multi-port allocation - Use label constants; fix HTTP/TCP probe host/port resolution - Remove TYPE_CHECKING-only docker imports - Deduplicate Docker availability check via docker_availability module - Harden port unit tests; add type hints and cleanup improvements Signed-off-by: Tyler Bray <tbray@nvidia.com>

…te model Move server prerequisite from DeploymentConfig to Deployment, reference puller by deployment name, and drop deployments_by_config from reconcile_one. Signed-off-by: Tyler Bray <tbray@nvidia.com>

Entity store persists deployment and volume status under the data JSON column; filtering on top-level status caused list queries to fail and left DELETING deployments stuck. Align controller and status_in list API with data.status and update unit test expectations. Signed-off-by: Tyler Bray <tbray@nvidia.com>

tylersbray · 2026-06-26T16:38:34Z

Smoke testing (local, post-rebase)

Ran manual smoke against a local platform on localhost:8080 with the Docker executor configured in platform YAML (deployments.default_executor / executors[].backend: docker). GPU not required.

Tier 1 — automated

uv sync --extra docker
uv run pytest plugins/nemo-deployments/tests -q

142 unit tests passed (138 before the data.status filter fix)
4 Docker integration tests passed against a real daemon:
- test_volume_lifecycle
- test_never_deployment_succeeds
- test_lost_detection_for_always (with drift_recovery.action: ignore)
- test_puller_server_prerequisite_chain (deployment-scoped prerequisites per feat(deployments): move prerequisites to Deployment and harden reconcile lifecycle #469)

Tier 2 — platform REST E2E (manual)

Workspace smoke, scenarios exercised via the deployments v2 API:

Scenario	Config	Expected outcome	Result
A	`alpine:3.20`, `restart_policy: Never`, `echo smoke`	`PENDING` → `SUCCEEDED`, `exit_code: 0`	pass
B	`nginx:alpine`, host ports 9050–9060, `restart_policy: Always`, `drift_recovery: ignore`	`READY`, curl endpoint, `docker rm` container → `LOST`	pass
C	Volume `weights` + puller (`OnFailure`) → server (`Always`) with `prerequisites: [{deployment_name: puller, condition: succeeded}]`	puller `SUCCEEDED`, server `READY`/`STARTING`	pass

Issues found and fixed during smoke

data.status filter bug (3901ed80) — controller and status_in list API filtered on status instead of data.status, causing entity list queries to 500 and leaving DELETING deployments stuck. Fixed in this branch.
Async delete ordering — deployment DELETE is async (DELETING → entity removed by reconciler); config DELETE must wait for deployment to disappear.
Platform config location — executor config belongs in platform service YAML (NMP_CONFIG_FILE_PATH / packages/nmp_platform_runner/.../local.yaml), not ~/.config/nmp/config.yaml (CLI kubeconfig).
Full platform start — nemo services run --controller-group all alone starts controllers without API services; use nemo services run (default) or --service-group all --controller-group all.

Not run

GPU / nvidia.com/gpu limits (optional Tier 3; needs Container Toolkit + GPU host)

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/gpu.py`:
- Around line 93-101: The shared GPU pool initialization in
get_shared_gpu_pool() only seeds reserved IDs from detect_gpu_device_ids(), so
restart recovery never accounts for GPUs already in use by existing managed
containers. Update the pool bootstrap path to rebuild reserved allocations from
currently running deployment-managed Docker containers using their labels before
serving requests, then merge that state into the DockerGPUPool created in
get_shared_gpu_pool().

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1cb70618-5c61-4293-923f-c49f9b523d4b

📥 Commits

Reviewing files that changed from the base of the PR and between dd8fd6c and 3901ed8.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (27)

plugins/nemo-deployments/pyproject.toml
plugins/nemo-deployments/src/nemo_deployments_plugin/api/v2/deployments.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/config.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/containers.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/gpu.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/labels.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/ports.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/probes.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/volumes.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/registry.py
plugins/nemo-deployments/src/nemo_deployments_plugin/controller.py
plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py
plugins/nemo-deployments/tests/integration/conftest.py
plugins/nemo-deployments/tests/integration/docker_availability.py
plugins/nemo-deployments/tests/integration/test_reconcile_docker.py
plugins/nemo-deployments/tests/unit/backends/docker/conftest.py
plugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.py
plugins/nemo-deployments/tests/unit/backends/docker/test_idempotency.py
plugins/nemo-deployments/tests/unit/backends/docker/test_labels.py
plugins/nemo-deployments/tests/unit/backends/docker/test_ports.py
plugins/nemo-deployments/tests/unit/backends/docker/test_restart_policy.py
plugins/nemo-deployments/tests/unit/backends/docker/test_status_mapping.py
plugins/nemo-deployments/tests/unit/backends/docker_helpers.py
plugins/nemo-deployments/tests/unit/reconciler/test_entity_client.py
plugins/nemo-deployments/tests/unit/test_api_deployments.py

✅ Files skipped from review due to trivial changes (2)

plugins/nemo-deployments/tests/unit/backends/docker/test_status_mapping.py
plugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.py

🚧 Files skipped from review as they are similar to previous changes (20)

plugins/nemo-deployments/tests/unit/backends/docker/test_idempotency.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/config.py
plugins/nemo-deployments/tests/unit/backends/docker_helpers.py
plugins/nemo-deployments/tests/unit/backends/docker/test_restart_policy.py
plugins/nemo-deployments/tests/integration/docker_availability.py
plugins/nemo-deployments/tests/unit/backends/docker/test_labels.py
plugins/nemo-deployments/tests/integration/conftest.py
plugins/nemo-deployments/tests/unit/backends/docker/conftest.py
plugins/nemo-deployments/tests/unit/backends/docker/test_ports.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/ports.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/volumes.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/registry.py
plugins/nemo-deployments/pyproject.toml
plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/probes.py
plugins/nemo-deployments/tests/integration/test_reconcile_docker.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/containers.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/labels.py
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py

CI collection failed with ModuleNotFoundError for docker_availability because integration test helpers live outside the repo root import path. Mirror the existing unit-test helper entry for plugins/nemo-deployments. Signed-off-by: Tyler Bray <tbray@nvidia.com>

On platform restart, seed the shared DockerGPUPool from running deployment-managed containers so GPUs already in use are not handed out again. Scan managed container labels and HostConfig DeviceRequests during pool bootstrap and add unit coverage for parsing and recovery. Signed-off-by: Tyler Bray <tbray@nvidia.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/gpu.py`:
- Around line 212-213: The shared GPU pool initialization in DockerGPU backend
caches an unrecovered pool state, so later get_shared_gpu_pool() calls can reuse
empty accounting after a transient recovery failure. Update the initialization
flow around DockerGPUPool and _recover_pool_allocations() so the pool is only
assigned/cached after recovery succeeds, or add explicit recovery state so a
failed recovery leaves _pool uninitialized and triggers a retry on the next
call.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a882bdd9-3cc0-42b3-84a7-d0a2ee8a5f74

📥 Commits

Reviewing files that changed from the base of the PR and between aed4970 and dcd1bfa.

📒 Files selected for processing (2)

plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/gpu.py
plugins/nemo-deployments/tests/unit/backends/docker/test_gpu.py

Docker integration tests share the itest workspace on one daemon. pytest-xdist loadscope spread them across workers, causing container remove/create races (409 removal in progress). Pin the package to a single xdist group, harden cleanup, and extend server reconcile polling. Signed-off-by: Tyler Bray <tbray@nvidia.com>

Only cache the shared GPU pool once container discovery succeeds so a transient Docker error does not leave all GPUs marked free on reuse. Signed-off-by: Tyler Bray <tbray@nvidia.com>

Use 127.0.0.1 instead of 0.0.0.0 when probing host port availability, matching the nemo-agents pattern. Also colocate docker test helpers under backends/docker/ for clearer pytest imports. Signed-off-by: Tyler Bray <tbray@nvidia.com>

tylersbray requested review from a team as code owners June 23, 2026 01:52

github-actions Bot added the feat label Jun 23, 2026

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

tylersbray force-pushed the 758-deployments-reconciler-prerequisite-dag/tbray branch from 0b92da1 to d036a92 Compare June 24, 2026 21:51

Base automatically changed from 758-deployments-reconciler-prerequisite-dag/tbray to main June 25, 2026 17:50

tylersbray force-pushed the 756-docker-deployment-backend/tbray branch from d1270d0 to dd8fd6c Compare June 25, 2026 21:26

github-advanced-security AI found potential problems Jun 25, 2026

View reviewed changes

Comment thread plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/ports.py Fixed

tylersbray added 4 commits June 26, 2026 08:30

fix(deployments): align Docker integration tests with #469 prerequisi…

af422b3

…te model Move server prerequisite from DeploymentConfig to Deployment, reference puller by deployment name, and drop deployments_by_config from reconcile_one. Signed-off-by: Tyler Bray <tbray@nvidia.com>

tylersbray force-pushed the 756-docker-deployment-backend/tbray branch from dd8fd6c to 3901ed8 Compare June 26, 2026 16:38

coderabbitai Bot reviewed Jun 26, 2026

View reviewed changes

Comment thread plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/gpu.py Outdated

tylersbray added 2 commits June 26, 2026 12:07

coderabbitai Bot reviewed Jun 26, 2026

View reviewed changes

Comment thread plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/gpu.py Outdated

tylersbray added 2 commits June 26, 2026 12:39

fix(deployments): retry GPU pool init after recovery failure

3b89239

Only cache the shared GPU pool once container discovery succeeds so a transient Docker error does not leave all GPUs marked free on reuse. Signed-off-by: Tyler Bray <tbray@nvidia.com>

tylersbray requested a review from benmccown June 26, 2026 20:32

Uh oh!

Conversation

tylersbray commented Jun 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Release Notes

Uh oh!

tylersbray commented Jun 23, 2026

Uh oh!

coderabbitai Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tylersbray commented Jun 26, 2026

Smoke testing (local, post-rebase)

Tier 1 — automated

Tier 2 — platform REST E2E (manual)

Issues found and fixed during smoke

Not run

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tylersbray commented Jun 23, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 23, 2026 •

edited

Loading

coderabbitai Bot commented Jun 23, 2026 •

edited

Loading

github-actions Bot commented Jun 25, 2026 •

edited

Loading