feat(deployments): implement DockerDeploymentBackend (AIRCORE-756)#399
feat(deployments): implement DockerDeploymentBackend (AIRCORE-756)#399tylersbray wants to merge 9 commits into
Conversation
|
@coderabbitai full review |
✅ Action performedFull review finished. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a Docker deployment backend with container, volume, GPU, port, probe, and status handling. Registers it as the default ChangesDocker Deployment Backend
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 12
🧹 Nitpick comments (7)
plugins/nemo-deployments/tests/unit/backends/docker_helpers.py (1)
13-26: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse a concrete restart-policy type and drop the ignore.
restart_policy: stris too broad and forces# type: ignore[arg-type]. Type this as the concrete policy union/Literal used byDeploymentConfigand remove the ignore.As per coding guidelines, "
**/*.py: Always prefer concrete type hints over string-based ones in Python code."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/nemo-deployments/tests/unit/backends/docker_helpers.py` around lines 13 - 26, The sample_config function has a restart_policy parameter typed as str which is too broad and requires a type ignore comment. Identify the concrete Literal type or union type that DeploymentConfig expects for its restart_policy field, update the function signature to use that concrete type instead of str for the restart_policy parameter, and remove the # type: ignore[arg-type] comment on the restart_policy argument passed to DeploymentConfig.Source: Coding guidelines
plugins/nemo-deployments/tests/integration/conftest.py (1)
10-18: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueAdd type hints to module-level variables.
Per coding guidelines, prefer concrete type hints. Add annotations:
DOCKER_AVAILABLE: bool = Trueand
skip_without_docker: pytest.MarkDecorator = pytest.mark.skipif(...)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/nemo-deployments/tests/integration/conftest.py` around lines 10 - 18, Add explicit type hints to the module-level variables DOCKER_AVAILABLE and skip_without_docker. For DOCKER_AVAILABLE, add a bool type annotation since it is assigned boolean values (True or False) in the try-except block. For skip_without_docker, add a pytest.MarkDecorator type annotation since it is assigned the result of pytest.mark.skipif(). These type hints should be added directly at the point of variable assignment to comply with coding guidelines.Source: Coding guidelines
plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py (2)
162-164: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winNarrow exception handling in cleanup.
Catching broad
Exceptionhides potential issues. UseNotFoundfromdocker.errors(already imported) for consistency with other cleanup blocks.♻️ Proposed fix
try: docker_backend._client.volumes.remove(docker_volume_name("itest", "data")) - except Exception: + except NotFound: pass🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py` around lines 162 - 164, The exception handling in the volume cleanup block is catching a broad Exception which can mask real issues. Replace the generic Exception catch with docker.errors.NotFound, which is already imported and used consistently in other cleanup blocks throughout the test. This will ensure only the expected "volume not found" error is silently ignored while other unexpected errors will properly surface for debugging.
126-128: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueAccessing private attribute in test cleanup.
Test cleanup directly accesses
docker_backend._client, which couples tests to implementation details. Consider adding a public cleanup method to the backend or using the Docker SDK directly viadocker.from_env().🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py` around lines 126 - 128, The test cleanup code directly accesses the private `_client` attribute from docker_backend, creating coupling to implementation details. Either add a public cleanup method to the docker_backend class that encapsulates the container removal logic (containers.get(c_name).remove(force=True)) and call that method from the test, or alternatively, use docker.from_env() directly in the test cleanup code to access the Docker client without relying on private implementation details. Remove the direct access to the underscore-prefixed _client attribute.plugins/nemo-deployments/tests/integration/test_reconcile_docker.py (2)
160-174: 🚀 Performance & Scalability | 🔵 Trivial | 💤 Low valueOptimize loop by constructing dictionaries once.
The
by_nameandby_configdictionaries are reconstructed on every iteration. Move construction outside the loop for clarity and minor efficiency gain.♻️ Proposed refactor
+ by_name = {("itest", "puller"): puller_dep, ("itest", "server"): server_dep} + by_config = { + ("itest", "puller-cfg"): puller_dep, + ("itest", "server-cfg"): server_dep, + } for _ in range(40): - by_name = {("itest", "puller"): puller_dep, ("itest", "server"): server_dep} - by_config = { - ("itest", "puller-cfg"): puller_dep, - ("itest", "server-cfg"): server_dep, - } await deployment_reconciler.reconcile_one(Apply the same pattern to the server reconciliation loop at lines 177-192.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/nemo-deployments/tests/integration/test_reconcile_docker.py` around lines 160 - 174, The by_name and by_config dictionaries are being reconstructed on every iteration of the loop in the puller deployment reconciliation block. Move the construction of both dictionaries outside the for loop (before the `for _ in range(40):` line) since they contain the same values on each iteration. Apply this same optimization pattern to the server reconciliation loop that follows, moving the dictionary construction for that loop outside as well to avoid redundant object creation.
128-138: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd type hints to get_side_effect.
Per coding guidelines, prefer concrete type hints. This function should annotate parameters and return type:
async def get_side_effect(entity_type: type, name: str, workspace: str | None = None) -> Volume | Deployment | DeploymentConfig:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/nemo-deployments/tests/integration/test_reconcile_docker.py` around lines 128 - 138, The get_side_effect function is missing type hints on its parameters and return type. Add type annotations to the function signature: annotate the entity_type parameter as type, the name parameter as str, the workspace parameter as str or None with a default value of None, and the return type as a union of Volume, Deployment, and DeploymentConfig.Source: Coding guidelines
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.py (1)
28-33: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueRestart policy parameter has no effect.
Both branches in
map_exited_statusreturn"FAILED"regardless ofrestart_policy. Consider removing the parameter if not needed, or document the future intent.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.py` around lines 28 - 33, In the map_exited_status function, the restart_policy parameter is not being utilized effectively as both the if condition checking for restart_policy == "Always" and the final return statement return the same "FAILED" status. Either remove the restart_policy parameter entirely if it is not needed for this function's logic, or modify the conditional logic to return different DeploymentStatus values based on different restart_policy values so that the parameter actually influences the function's behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py`:
- Around line 148-157: The loop iterating through container_spec.ports does not
track previously assigned host ports within the current iteration, allowing
find_available_port to potentially assign the same host port multiple times.
Modify the code to collect already-assigned ports from the host_ports dictionary
in each loop iteration and pass them to find_available_port so it can exclude
those ports when searching for available options, ensuring no duplicate port
assignments occur within a single multi-port deployment.
- Line 292: Replace hardcoded label key strings with imported constants to
maintain consistency with canonical definitions. First, add imports for
BACKOFF_LIMIT_LABEL and MANAGED_BY_KEY from labels.py at the top of the file.
Then replace the hardcoded string "nmp.nvidia.com/backoff-limit" in the
backoff_limit variable assignment with the BACKOFF_LIMIT_LABEL constant, and
replace the hardcoded "managed-by" strings used in the container_labels.get()
call and labels.get() comparison with the MANAGED_BY_KEY constant.
In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/ports.py`:
- Around line 17-18: Remove the TYPE_CHECKING conditional guard around the
docker import. Convert the conditional import of docker (currently under if
TYPE_CHECKING:) to a regular import statement at the module level, so that
docker is imported unconditionally rather than only for type checking purposes.
In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/probes.py`:
- Around line 90-99: The probe target resolution incorrectly hardcodes 127.0.0.1
instead of using the actual host from the host_url parameter, which breaks
functionality for remapped host ports and remote Docker daemons. Replace all
instances of 127.0.0.1 in the base URL construction within the port resolution
logic (in the branches checking isinstance(port, int), the named_ports lookup,
and the host_ports loop) with the appropriate host extracted from the host_url
parameter. Apply the same fix to the additional probe target resolution code
mentioned in lines 111-127 to ensure consistency across all probe endpoint
constructions.
- Around line 11-19: The docker import statement is unnecessarily wrapped in a
TYPE_CHECKING conditional block. Since the file already uses from __future__
import annotations which stringifies all type hints at runtime, there is no
performance cost to importing DockerContainer directly. Remove the if
TYPE_CHECKING: conditional block that wraps the docker import statement and move
the from docker.models.containers import Container as DockerContainer import to
the top-level imports section with the other import statements like httpx and
the nemo_deployments_plugin imports.
In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/volumes.py`:
- Around line 15-16: The docker import is currently guarded by the TYPE_CHECKING
conditional block. Remove the TYPE_CHECKING guard and import statement around
the docker module, and instead add docker as a regular import statement at the
top of the file with the other imports. This ensures docker is imported as a
concrete type hint rather than a string-based one, following the coding
guidelines.
In
`@plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py`:
- Around line 26-32: The try-except block checking Docker availability and
setting _DOCKER_AVAILABLE is duplicated from the parent conftest.py file. Remove
lines 26-32 containing the duplicate Docker availability check (the try block
with docker.from_env().ping() and the _DOCKER_AVAILABLE assignment), and instead
use the existing DOCKER_AVAILABLE fixture or skip_without_docker marker that is
already defined in the parent conftest.py. Pytest automatically discovers and
uses parent conftest fixtures and markers, so this duplicate check should be
completely removed from this file.
In `@plugins/nemo-deployments/tests/integration/test_reconcile_docker.py`:
- Around line 33-39: Remove the duplicated Docker availability check in the
test_reconcile_docker.py file. Delete the entire try-except block that attempts
to import docker and calls docker.from_env().ping() to set DOCKER_AVAILABLE.
Instead, rely on the Docker availability check already defined in the parent
conftest.py file, which Pytest discovers automatically through its marker
system. This eliminates code duplication and ensures consistency across all test
files in the project.
In `@plugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.py`:
- Line 11: The import statement at line 11 using `from backends.docker_helpers
import container_attrs, sample_config` is using an absolute import path that
does not resolve correctly for the test package layout. Change this to use a
relative import that properly navigates from the test file's location
(plugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.py) to
the helpers module, or adjust the import to use the correct absolute path that
matches your package structure so that pytest can resolve the module correctly.
In `@plugins/nemo-deployments/tests/unit/backends/docker/test_idempotency.py`:
- Line 11: The import statement at the top of test_idempotency.py uses an
incorrect module path. The line `from backends.docker_helpers import
container_attrs, sample_config` needs to be updated to use the correct relative
or absolute import path that matches your repository's actual package structure.
Determine the correct path by checking where the docker_helpers module actually
exists in your repo (likely in the backends directory relative to the tests
directory), and adjust the import to use either a relative import (like from
..docker_helpers or similar) or the correct absolute path from the repo root.
In `@plugins/nemo-deployments/tests/unit/backends/docker/test_ports.py`:
- Around line 21-24: The assertion in test_is_port_free_local is a tautology
that always evaluates to true since it checks if something is true OR not true,
which provides no meaningful test coverage. Replace this assertion with a
deterministic test by either occupying a port and verifying is_port_free returns
False for that port, or using a known free port and verifying is_port_free
returns True. Set up a specific port state beforehand (such as binding to a port
or using a port known to be available) and then assert that is_port_free returns
the expected boolean result for that deterministic condition.
- Around line 26-36: The test_find_available_port_skips_used test depends on
real host port availability which makes it flaky and environment-dependent. Mock
the is_port_free function that is called within find_available_port to return
deterministic values (such as returning False for port 9000 to simulate it being
in use, and True for ports 9001 and 9002 to simulate them being free). Then
update the assertion to check for the exact expected port (9001) instead of
using an assertion with multiple possible values, ensuring the test behavior is
reproducible and not dependent on the actual system port state.
---
Nitpick comments:
In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.py`:
- Around line 28-33: In the map_exited_status function, the restart_policy
parameter is not being utilized effectively as both the if condition checking
for restart_policy == "Always" and the final return statement return the same
"FAILED" status. Either remove the restart_policy parameter entirely if it is
not needed for this function's logic, or modify the conditional logic to return
different DeploymentStatus values based on different restart_policy values so
that the parameter actually influences the function's behavior.
In
`@plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py`:
- Around line 162-164: The exception handling in the volume cleanup block is
catching a broad Exception which can mask real issues. Replace the generic
Exception catch with docker.errors.NotFound, which is already imported and used
consistently in other cleanup blocks throughout the test. This will ensure only
the expected "volume not found" error is silently ignored while other unexpected
errors will properly surface for debugging.
- Around line 126-128: The test cleanup code directly accesses the private
`_client` attribute from docker_backend, creating coupling to implementation
details. Either add a public cleanup method to the docker_backend class that
encapsulates the container removal logic
(containers.get(c_name).remove(force=True)) and call that method from the test,
or alternatively, use docker.from_env() directly in the test cleanup code to
access the Docker client without relying on private implementation details.
Remove the direct access to the underscore-prefixed _client attribute.
In `@plugins/nemo-deployments/tests/integration/conftest.py`:
- Around line 10-18: Add explicit type hints to the module-level variables
DOCKER_AVAILABLE and skip_without_docker. For DOCKER_AVAILABLE, add a bool type
annotation since it is assigned boolean values (True or False) in the try-except
block. For skip_without_docker, add a pytest.MarkDecorator type annotation since
it is assigned the result of pytest.mark.skipif(). These type hints should be
added directly at the point of variable assignment to comply with coding
guidelines.
In `@plugins/nemo-deployments/tests/integration/test_reconcile_docker.py`:
- Around line 160-174: The by_name and by_config dictionaries are being
reconstructed on every iteration of the loop in the puller deployment
reconciliation block. Move the construction of both dictionaries outside the for
loop (before the `for _ in range(40):` line) since they contain the same values
on each iteration. Apply this same optimization pattern to the server
reconciliation loop that follows, moving the dictionary construction for that
loop outside as well to avoid redundant object creation.
- Around line 128-138: The get_side_effect function is missing type hints on its
parameters and return type. Add type annotations to the function signature:
annotate the entity_type parameter as type, the name parameter as str, the
workspace parameter as str or None with a default value of None, and the return
type as a union of Volume, Deployment, and DeploymentConfig.
In `@plugins/nemo-deployments/tests/unit/backends/docker_helpers.py`:
- Around line 13-26: The sample_config function has a restart_policy parameter
typed as str which is too broad and requires a type ignore comment. Identify the
concrete Literal type or union type that DeploymentConfig expects for its
restart_policy field, update the function signature to use that concrete type
instead of str for the restart_policy parameter, and remove the # type:
ignore[arg-type] comment on the restart_policy argument passed to
DeploymentConfig.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5d5a85f9-3645-431f-b8df-57e8281266da
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (22)
plugins/nemo-deployments/pyproject.tomlplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/config.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/containers.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/gpu.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/labels.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/ports.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/probes.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/volumes.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/registry.pyplugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.pyplugins/nemo-deployments/tests/integration/conftest.pyplugins/nemo-deployments/tests/integration/test_reconcile_docker.pyplugins/nemo-deployments/tests/unit/backends/docker/conftest.pyplugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.pyplugins/nemo-deployments/tests/unit/backends/docker/test_idempotency.pyplugins/nemo-deployments/tests/unit/backends/docker/test_labels.pyplugins/nemo-deployments/tests/unit/backends/docker/test_ports.pyplugins/nemo-deployments/tests/unit/backends/docker/test_restart_policy.pyplugins/nemo-deployments/tests/unit/backends/docker/test_status_mapping.pyplugins/nemo-deployments/tests/unit/backends/docker_helpers.py
0b92da1 to
d036a92
Compare
d1270d0 to
dd8fd6c
Compare
|
Add Docker backend for nemo-deployments-plugin with single-container v1 lifecycle, entity-store config fetch, port allocation, exec/http probes, plugin-local GPU pool, and unit plus integration test coverage. Stacks on #315 (AIRCORE-758 reconciler prerequisite DAG). Signed-off-by: Tyler Bray <tbray@nvidia.com>
…E-756) - Exclude already-assigned host ports during multi-port allocation - Use label constants; fix HTTP/TCP probe host/port resolution - Remove TYPE_CHECKING-only docker imports - Deduplicate Docker availability check via docker_availability module - Harden port unit tests; add type hints and cleanup improvements Signed-off-by: Tyler Bray <tbray@nvidia.com>
…te model Move server prerequisite from DeploymentConfig to Deployment, reference puller by deployment name, and drop deployments_by_config from reconcile_one. Signed-off-by: Tyler Bray <tbray@nvidia.com>
Entity store persists deployment and volume status under the data JSON column; filtering on top-level status caused list queries to fail and left DELETING deployments stuck. Align controller and status_in list API with data.status and update unit test expectations. Signed-off-by: Tyler Bray <tbray@nvidia.com>
dd8fd6c to
3901ed8
Compare
Smoke testing (local, post-rebase)Ran manual smoke against a local platform on Tier 1 — automateduv sync --extra docker
uv run pytest plugins/nemo-deployments/tests -q
Tier 2 — platform REST E2E (manual)Workspace
Issues found and fixed during smoke
Not run
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/gpu.py`:
- Around line 93-101: The shared GPU pool initialization in
get_shared_gpu_pool() only seeds reserved IDs from detect_gpu_device_ids(), so
restart recovery never accounts for GPUs already in use by existing managed
containers. Update the pool bootstrap path to rebuild reserved allocations from
currently running deployment-managed Docker containers using their labels before
serving requests, then merge that state into the DockerGPUPool created in
get_shared_gpu_pool().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1cb70618-5c61-4293-923f-c49f9b523d4b
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (27)
plugins/nemo-deployments/pyproject.tomlplugins/nemo-deployments/src/nemo_deployments_plugin/api/v2/deployments.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/config.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/containers.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/gpu.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/labels.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/ports.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/probes.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/volumes.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/registry.pyplugins/nemo-deployments/src/nemo_deployments_plugin/controller.pyplugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.pyplugins/nemo-deployments/tests/integration/conftest.pyplugins/nemo-deployments/tests/integration/docker_availability.pyplugins/nemo-deployments/tests/integration/test_reconcile_docker.pyplugins/nemo-deployments/tests/unit/backends/docker/conftest.pyplugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.pyplugins/nemo-deployments/tests/unit/backends/docker/test_idempotency.pyplugins/nemo-deployments/tests/unit/backends/docker/test_labels.pyplugins/nemo-deployments/tests/unit/backends/docker/test_ports.pyplugins/nemo-deployments/tests/unit/backends/docker/test_restart_policy.pyplugins/nemo-deployments/tests/unit/backends/docker/test_status_mapping.pyplugins/nemo-deployments/tests/unit/backends/docker_helpers.pyplugins/nemo-deployments/tests/unit/reconciler/test_entity_client.pyplugins/nemo-deployments/tests/unit/test_api_deployments.py
✅ Files skipped from review due to trivial changes (2)
- plugins/nemo-deployments/tests/unit/backends/docker/test_status_mapping.py
- plugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.py
🚧 Files skipped from review as they are similar to previous changes (20)
- plugins/nemo-deployments/tests/unit/backends/docker/test_idempotency.py
- plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/config.py
- plugins/nemo-deployments/tests/unit/backends/docker_helpers.py
- plugins/nemo-deployments/tests/unit/backends/docker/test_restart_policy.py
- plugins/nemo-deployments/tests/integration/docker_availability.py
- plugins/nemo-deployments/tests/unit/backends/docker/test_labels.py
- plugins/nemo-deployments/tests/integration/conftest.py
- plugins/nemo-deployments/tests/unit/backends/docker/conftest.py
- plugins/nemo-deployments/tests/unit/backends/docker/test_ports.py
- plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.py
- plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/ports.py
- plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/volumes.py
- plugins/nemo-deployments/src/nemo_deployments_plugin/backends/registry.py
- plugins/nemo-deployments/pyproject.toml
- plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py
- plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/probes.py
- plugins/nemo-deployments/tests/integration/test_reconcile_docker.py
- plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/containers.py
- plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/labels.py
- plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py
CI collection failed with ModuleNotFoundError for docker_availability because integration test helpers live outside the repo root import path. Mirror the existing unit-test helper entry for plugins/nemo-deployments. Signed-off-by: Tyler Bray <tbray@nvidia.com>
On platform restart, seed the shared DockerGPUPool from running deployment-managed containers so GPUs already in use are not handed out again. Scan managed container labels and HostConfig DeviceRequests during pool bootstrap and add unit coverage for parsing and recovery. Signed-off-by: Tyler Bray <tbray@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/gpu.py`:
- Around line 212-213: The shared GPU pool initialization in DockerGPU backend
caches an unrecovered pool state, so later get_shared_gpu_pool() calls can reuse
empty accounting after a transient recovery failure. Update the initialization
flow around DockerGPUPool and _recover_pool_allocations() so the pool is only
assigned/cached after recovery succeeds, or add explicit recovery state so a
failed recovery leaves _pool uninitialized and triggers a retry on the next
call.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a882bdd9-3cc0-42b3-84a7-d0a2ee8a5f74
📒 Files selected for processing (2)
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/gpu.pyplugins/nemo-deployments/tests/unit/backends/docker/test_gpu.py
Docker integration tests share the itest workspace on one daemon. pytest-xdist loadscope spread them across workers, causing container remove/create races (409 removal in progress). Pin the package to a single xdist group, harden cleanup, and extend server reconcile polling. Signed-off-by: Tyler Bray <tbray@nvidia.com>
Only cache the shared GPU pool once container discovery succeeds so a transient Docker error does not leave all GPUs marked free on reuse. Signed-off-by: Tyler Bray <tbray@nvidia.com>
Use 127.0.0.1 instead of 0.0.0.0 when probing host port availability, matching the nemo-agents pattern. Also colocate docker test helpers under backends/docker/ for clearer pytest imports. Signed-off-by: Tyler Bray <tbray@nvidia.com>
Summary
Implements
DockerDeploymentBackendfornemo-deployments-plugin(AIRCORE-756).Rebased onto
mainafter #469 merged (prerequisites onDeployment, volumeDELETING, orphan grace, shutdown cancellation). Branch history was reset to cherry-pick only the Docker backend commits — no duplicated reconciler code from the old #315 stack.init_containersand multi-container specsconfig_nameonly; backend loads fullDeploymentConfigfrom the entity store viaNemoEntitiesClientDockerDeploymentConfigDockerGPUPool(consolidation tonemo_platform_plugintracked separately in AIRCORE-759)managed-by, anddeployment-configlabel matchingTest plan
pytest plugins/nemo-deployments/tests/unit— 138 passedpytest plugins/nemo-deployments/tests/integration— 4 passed (Docker daemon required)test_puller_server_prerequisite_chain— uses deployment-scoped prerequisites (puller→server)d1270d0included (port allocation, probe host resolution, label constants, test hardening)Summary by CodeRabbit
Release Notes
status_infiltering to use the stored nesteddata.statusfield.httpx>=0.27; introduced Docker extra pinned todocker>=7.0.