Add IMDSv2 migration command for ROSA Classic clusters#909
Conversation
WalkthroughThis PR introduces ChangesIMDSv2 Migration Command
Sequence DiagramsequenceDiagram
participant User as User/CLI
participant Cmd as osdctl imdsv2
participant OCM as OCM API
participant K8s as Kubernetes API
participant Hive as Hive API
User->>Cmd: cluster imdsv2 --cluster-id=... --nodes=...
Cmd->>OCM: Fetch cluster and validate AWS Classic
Cmd->>K8s: Initialize standard and admin clients
Cmd->>Hive: Initialize Hive clients (if infra/workers)
Cmd->>K8s: Pre-flight: Check operators, CPMS, nodes, etcd
rect rgba(100, 200, 150, 0.5)
Note over Cmd,Hive: Infra Migration (if infra/all)
Cmd->>Hive: Get infra MachinePool
Cmd->>Hive: Patch for IMDSv2
Cmd->>K8s: Wait for operators healthy
end
rect rgba(200, 150, 100, 0.5)
Note over Cmd,K8s: CPMS Update (if master/all)
Cmd->>K8s: Get ControlPlaneMachineSet
Cmd->>K8s: Patch for IMDSv2
Cmd->>K8s: Monitor rollout completion
end
rect rgba(150, 150, 200, 0.5)
Note over Cmd,Hive: Worker Migration (if workers/all)
Cmd->>Hive: List worker MachinePools
Cmd->>Hive: Patch for IMDSv2
end
Cmd->>K8s: Final validation: poll nodes Ready
Cmd->>K8s: Check Machine provider specs
Cmd->>User: Success/warnings
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 13 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (13 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
cmd/cluster/imdsv2.go (1)
348-395: ⚖️ Poor tradeoffConsider using the Kubernetes client instead of shelling out to
oc.Shelling out to
ocintroduces a dependency on the CLI being installed and configured. The kubeconfig context used byocmay differ from the client initialized via backplane ininit(). Consider using theo.clientto fetchClusterOperatorresources directly via theconfig.openshift.io/v1API group.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/cluster/imdsv2.go` around lines 348 - 395, The checkClusterOperators function currently shells out to `oc` to list ClusterOperators; replace that with direct API calls using the existing o.client so the same kubeconfig/backplane client is used. Use the OpenShift config v1 ClusterOperator type (e.g., configv1.ClusterOperatorList) or a dynamic client list against "config.openshift.io/v1" to fetch all ClusterOperator resources with o.client.List(ctx, &list) (or equivalent), then iterate list.Items and inspect each item.Status.Conditions exactly as done now to build unhealthyOps, returning errors on client.List or unmarshalling failures as appropriate; keep the same final behavior/message and error formatting but remove the exec.CommandContext call and JSON unmarshal logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/cluster/imdsv2.go`:
- Around line 728-754: The function monitorCPMSRollout creates a timeout context
from context.Background(), breaking cancellation propagation; change the context
creation to derive from the passed ctx (use context.WithTimeout(ctx,
imdsv2RolloutPollTimeout) and keep cancel deferred) so parent cancellations
(SIGINT or upstream cancellation) interrupt the polling loop; update the context
creation in monitorCPMSRollout where pollCtx and cancel are defined (and
continue to pass pollCtx into wait.PollUntilContextTimeout as before).
---
Nitpick comments:
In `@cmd/cluster/imdsv2.go`:
- Around line 348-395: The checkClusterOperators function currently shells out
to `oc` to list ClusterOperators; replace that with direct API calls using the
existing o.client so the same kubeconfig/backplane client is used. Use the
OpenShift config v1 ClusterOperator type (e.g., configv1.ClusterOperatorList) or
a dynamic client list against "config.openshift.io/v1" to fetch all
ClusterOperator resources with o.client.List(ctx, &list) (or equivalent), then
iterate list.Items and inspect each item.Status.Conditions exactly as done now
to build unhealthyOps, returning errors on client.List or unmarshalling failures
as appropriate; keep the same final behavior/message and error formatting but
remove the exec.CommandContext call and JSON unmarshal logic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 993b8d49-1d85-43e7-9f2d-d04b1a1555f0
📒 Files selected for processing (2)
cmd/cluster/cmd.gocmd/cluster/imdsv2.go
|
/retest |
|
@coderabbitai review |
✅ Action performedReview finished.
|
clcollins
left a comment
There was a problem hiding this comment.
A couple of questions.
Also, Claude noticed ""preFlightChecks, monitorCPMSRollout, init, and Hive client setup are near-identical copies of changevolumetype.go" and suggested they should be consolidated/adjusted to work so there's less redundant code.
| } | ||
|
|
||
| // replaceInfraMachines replaces infra machines one at a time. | ||
| func (o *imdsv2Options) replaceInfraMachines(ctx context.Context) error { |
There was a problem hiding this comment.
You may want to try to use
osdctl/pkg/infra/machinepool.go
Line 95 in 0f7bc39
| } | ||
|
|
||
| // Delete the machine (MachineSet will create replacement with IMDSv2) | ||
| if err := o.clientAdmin.Delete(ctx, machine); err != nil { |
There was a problem hiding this comment.
There are confirmation functions elsewhere in this PR, but there's not one included here. Is that intended or desired?
| } | ||
| } | ||
|
|
||
| // Step 3: Clean up temporary override annotations from MachinePools |
There was a problem hiding this comment.
What happens if this process fails partway through, before it gets here? The annotations would continue to live on the machine pools:
| fmt.Printf("Hive namespace: %s\n", hiveNamespace) | ||
|
|
||
| // Retrieve all MachinePools for this cluster | ||
| mpList := &hivev1.MachinePoolList{} |
There was a problem hiding this comment.
Is there any chance this would match machinepools you don't want to patch? Are ALL machinepools in that namespace supposed to be patched, every time? (Truly asking - i'm not 100% sure)
| fmt.Println("Worker nodes must be replaced by the customer.") | ||
| } | ||
|
|
||
| printer.PrintlnGreen("\n✓ IMDSv2 migration completed successfully!") |
There was a problem hiding this comment.
Nit: Everything before this is an if statement, so there's a chance (however small - errors, or changes in the future) where you might get here with no actual operation happening, and still get the "completed successfully!" message. Is there something you can check on to ensure that it happened? Some state on the nodes, etc?
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: givaldolins The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
cmd/cluster/imdsv2.go (1)
628-634: 💤 Low value
time.Sleepignores context cancellation.The retry loop uses
time.Sleep(retryDelay)which doesn't respect the passedctx. If the user sends SIGINT during the 30-second sleep, the command won't respond until the sleep completes.Consider using a context-aware wait:
Suggested improvement
- if attempt < maxRetries { - fmt.Printf(" ⏳ Waiting for %d nodes to become Ready (attempt %d/%d): %s\n", - len(notReadyNodes), attempt, maxRetries, strings.Join(notReadyNodes, ", ")) - time.Sleep(retryDelay) - continue + if attempt < maxRetries { + fmt.Printf(" ⏳ Waiting for %d nodes to become Ready (attempt %d/%d): %s\n", + len(notReadyNodes), attempt, maxRetries, strings.Join(notReadyNodes, ", ")) + select { + case <-ctx.Done(): + return ctx.Err() + case <-time.After(retryDelay): + } + continue🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/cluster/imdsv2.go` around lines 628 - 634, The `time.Sleep(retryDelay)` call in the retry loop blocks indefinitely and ignores context cancellation, preventing responsive handling of SIGINT signals. Replace the `time.Sleep(retryDelay)` statement with a context-aware wait mechanism using a select statement that monitors both `ctx.Done()` for cancellation and a timer for the delay. If the context is cancelled first, break out of the retry loop; otherwise, continue to the next retry attempt after the timer fires.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/cluster/imdsv2.go`:
- Around line 399-406: The validWorkerNames map and filtering logic in the
MachinePool iteration only matches the default "worker" pool name, silently
skipping custom worker pools like "worker-2" or "gpu-workers" that ROSA
supports. Replace the inclusion-based filter (validWorkerNames map checking
mp.Spec.Name == "worker") with an exclusion-based approach that skips only
"infra" and "master" pool types, or add logging to indicate which pools are
being skipped so operators understand which pools require separate handling.
This ensures the function processes all worker pools as its comment claims
rather than just the single default pool.
---
Nitpick comments:
In `@cmd/cluster/imdsv2.go`:
- Around line 628-634: The `time.Sleep(retryDelay)` call in the retry loop
blocks indefinitely and ignores context cancellation, preventing responsive
handling of SIGINT signals. Replace the `time.Sleep(retryDelay)` statement with
a context-aware wait mechanism using a select statement that monitors both
`ctx.Done()` for cancellation and a timer for the delay. If the context is
cancelled first, break out of the retry loop; otherwise, continue to the next
retry attempt after the timer fires.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a8204653-bfa2-4998-9304-1c072b527ed8
📒 Files selected for processing (2)
cmd/cluster/helpers.gocmd/cluster/imdsv2.go
This command automates the SOP for migrating ROSA Classic cluster nodes to enforce IMDSv2 (Instance Metadata Service v2) by: - Replacing infra nodes using the MachinePool dance pattern - Updating ControlPlaneMachineSet for automatic master node rollout - Patching worker MachinePools (customer performs machine replacement) - Validating all nodes/machines are using IMDSv2 Key features: - Pre-flight checks verify cluster health before making changes - Confirmation prompts before all destructive operations - Tracks actual changes and shows accurate success/skip messages - Supports selective migration: --nodes all|master|infra|workers - Context-aware with proper cancellation handling - Comprehensive test coverage (11 test cases) Implementation follows all review comments from @clcollins: 1. Code consolidation via helpers.go (shared with changevolumetype) 2. Uses RunMachinePoolDance for atomic infra replacement 3. Confirmation prompts before destructive operations 4. No annotation cleanup risk (eliminated pattern entirely) 5. MachinePool name validation for safety 6. Accurate change tracking and reporting Additional improvements from CodeRabbit AI review: - Exclusion-based worker pool filtering to support custom pools (worker-2, gpu-workers, etc.) - Context-aware sleep for responsive cancellation handling Files added: - cmd/cluster/imdsv2.go (501 lines) - main command implementation - cmd/cluster/helpers.go (235 lines) - shared helper functions - cmd/cluster/imdsv2_test.go (580 lines) - comprehensive tests - docs/osdctl_cluster_imdsv2.md - generated documentation All tests passing, builds successfully. Signed-off-by: Givaldo Lins <glins@redhat.com>
|
@givaldolins: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/cluster/imdsv2_test.go`:
- Around line 224-333: The tests `TestCheckIMDSv2Configuration` and
`TestMachinePoolNameValidation` are tautological because they duplicate logic
locally instead of testing actual production functions. To fix this, identify
the real production functions or helpers in `imdsv2.go` that implement the
IMDSv2 configuration checking and machine pool name validation logic, then
rewrite these tests to call those actual functions with the test inputs and
assert on their outputs, rather than reimplementing the same logic inline. If no
suitable public functions exist, extract testable helper functions from
`imdsv2.go` and test those instead. This ensures the tests will catch actual
regressions in the production code.
- Line 397: The Name field assignment is using string(rune(i)) which converts
the integer loop counter to its corresponding Unicode character rather than its
numeric digit string representation, resulting in non-printable control
characters. Replace the rune conversion with proper numeric string formatting
such as using strconv.Itoa(i) or fmt.Sprintf with the %d verb to convert the
integer i to its string representation, so that the master node names are
generated correctly as "master-0", "master-1", etc.
- Around line 88-90: Replace all instances where `AddToScheme()` and `Build()`
errors are ignored using blank identifier assignments (like `_ =
corev1.AddToScheme(scheme)`) with proper error checking using
`require.NoError(t, err)` instead. This applies to all occurrences in
imdsv2_test.go, particularly at the locations mentioned (lines 88-90, 136-138,
183-185, 337, 366-368, and 527-553). Capture the error returned from each
AddToScheme() and Build() call and validate it immediately to ensure setup
failures are caught at the point of failure rather than causing misleading test
symptoms later.
In `@cmd/cluster/imdsv2.go`:
- Around line 514-518: Add a nil check for ProviderSpec.Value before
dereferencing Value.Raw in the code block that parses the AWS provider spec.
Before the json.Unmarshal call that accesses
cpms.Spec.Template.OpenShiftMachineV1Beta1Machine.Spec.ProviderSpec.Value.Raw,
verify that Value is not nil and return an appropriate error if it is missing.
This defensive check should follow the same pattern used in the validateIMDSv2
function to prevent potential panics from malformed or missing provider
specifications.
In `@docs/osdctl_cluster_imdsv2.md`:
- Around line 7-40: The command documentation advertises workers as a supported
value in the --nodes flag option, but the synopsis and examples only document
migration steps for infra and master nodes, creating a discrepancy. Either add a
documented worker node migration step and example to match the advertised
--nodes workers capability, or remove workers from the supported values in the
--nodes option. Since this documentation is auto-generated, make these changes
in the source code generator or template that produces this markdown file, not
directly in the emitted markdown.
In `@docs/README.md`:
- Around line 1767-1798: The documentation for the imdsv2 command advertises
support for the --nodes workers flag but does not describe the worker migration
workflow, leaving this documented CLI option undocumented. Instead of editing
the markdown file directly under docs/, locate and update the source generator
or template that produces this documentation (likely a source file or template
outside the docs/ directory). Add documentation describing the worker migration
path using the --nodes workers flag to ensure all advertised command options are
properly documented and the CLI contract is consistent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b71875da-6e93-49c0-bc85-cbd7a3620dc8
📒 Files selected for processing (7)
cmd/cluster/cmd.gocmd/cluster/helpers.gocmd/cluster/imdsv2.gocmd/cluster/imdsv2_test.godocs/README.mddocs/osdctl_cluster.mddocs/osdctl_cluster_imdsv2.md
🚧 Files skipped from review as they are similar to previous changes (3)
- cmd/cluster/cmd.go
- docs/osdctl_cluster.md
- cmd/cluster/helpers.go
Hi @clcollins, |
Implements automated IMDSv2 migration workflow with:
Supports selective migration via --nodes flag (all, master, infra, workers).
This MR automates SOP
Summary by CodeRabbit
New Features
osdctl cluster imdsv2to migrate ROSA Classic clusters to require EC2 IMDSv2 authentication.--nodes all|master|infra|workers, with pre-flight gating, interactive confirmations, and post-migration validation of node and provider-spec configuration.Documentation
--nodes/flag details.Tests