GCP-841: remove ClusterResourceSet feature gate from CAPG manager args #8795

Open
cristianoveiga wants to merge 1 commit into openshift:main from cristianoveiga:fix/gcp-841-remove-clusterresourceset-feature-gate

Conversation

@cristianoveiga

@cristianoveiga cristianoveiga commented Jun 22, 2026

Contributor

Summary

  • Removes ClusterResourceSet=false from the --feature-gates arg passed to the CAPG manager
  • The ClusterResourceSet feature gate was promoted to GA in CAPI 1.10 and removed in CAPI 1.12 (kubernetes-sigs/cluster-api#12950)
  • OCP 4.22+ ships CAPG built against CAPI 1.12.8, so passing the removed gate crashes the capi-provider pod at startup with: unrecognized feature gate: ClusterResourceSet
  • MachinePool=false is retained; that gate is still valid in CAPI 1.12 (Beta, default-on)

Fixes: https://redhat.atlassian.net/browse/GCP-841

Test plan

  • Existing unit tests pass (go test ./hypershift-operator/controllers/hostedcluster/internal/platform/gcp/)
  • periodic-ci-openshift-hypershift-release-4.23-periodics-e2e-v2-gke no longer fails due to capi-provider crash
  • capi-provider pod starts successfully on 4.22.x and 4.23.x without a CAPG image override

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Refactor
    • Simplified GCP controller feature gate configuration by removing version-dependent logic, now using a fixed set of feature gates instead of conditionally adjusting based on payload version.

ClusterResourceSet was promoted to GA in CAPI 1.10 and removed entirely
in CAPI 1.12. OCP 4.22+ ships CAPG built against CAPI 1.12.8, causing
the capi-provider pod to crash at startup with:

  invalid argument "MachinePool=false,ClusterResourceSet=false" for
  "--feature-gates" flag: unrecognized feature gate: ClusterResourceSet

Fixes: GCP-841

Signed-off-by: Cristiano Veiga <cveiga@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
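
Why a removed gate is fatal rather than silently ignored can be illustrated with a simplified model of a feature-gate registry. This is a stdlib-only sketch, not the feature-gate implementation that CAPI or CAPG actually uses: the `parseFeatureGates` function and the `known` set are hypothetical, and the gate set is reduced to the two names that appear in the error message above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFeatureGates validates a comma-separated "Name=bool" list against the
// set of gates the binary knows about. Unknown names are rejected, which is
// the behavior that turns a stale flag value into a startup failure.
func parseFeatureGates(value string, known map[string]bool) (map[string]bool, error) {
	result := make(map[string]bool)
	for _, pair := range strings.Split(value, ",") {
		name, raw, ok := strings.Cut(pair, "=")
		if !ok {
			return nil, fmt.Errorf("missing bool value for %s", name)
		}
		if !known[name] {
			return nil, fmt.Errorf("unrecognized feature gate: %s", name)
		}
		enabled, err := strconv.ParseBool(raw)
		if err != nil {
			return nil, fmt.Errorf("invalid value %q for %s", raw, name)
		}
		result[name] = enabled
	}
	return result, nil
}

func main() {
	// Assumed gate set for a binary built after ClusterResourceSet was removed.
	known := map[string]bool{"MachinePool": true}

	// The old flag value is rejected because of the second entry.
	_, err := parseFeatureGates("MachinePool=false,ClusterResourceSet=false", known)
	fmt.Println(err)

	// The new flag value parses cleanly.
	gates, err := parseFeatureGates("MachinePool=false", known)
	fmt.Println(gates, err)
}
```

In this model the whole flag value is rejected as soon as one name is unknown, which matches the observed symptom: the manager exits during flag parsing, before any controller starts.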
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Jun 22, 2026
@openshift-ci

openshift-ci Bot commented Jun 22, 2026

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Contributor
📝 Walkthrough

Walkthrough

In CAPIProviderDeploymentSpec within the GCP platform controller, the featureGates variable is now initialized with a single static entry (MachinePool=false). The previous conditional logic that parsed payloadVersion and appended ClusterResourceSet=false when the major version was 4 and the minor version was greater than 16 has been removed entirely.
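
The change described in the walkthrough can be sketched as follows. This is an illustrative reconstruction, not the code from gcp.go: the function names `legacyFeatureGates`, `fixedFeatureGates`, and `featureGatesArg`, and the plain integer major/minor parameters, are assumptions made for the example. Only the gate names and the "major version 4, minor version greater than 16" condition come from the walkthrough.

```go
package main

import (
	"fmt"
	"strings"
)

// legacyFeatureGates models the removed logic (hypothetical name): it always
// disables MachinePool and, for payloads newer than 4.16, also passes
// ClusterResourceSet=false.
func legacyFeatureGates(major, minor uint64) []string {
	gates := []string{"MachinePool=false"}
	if major == 4 && minor > 16 {
		gates = append(gates, "ClusterResourceSet=false")
	}
	return gates
}

// fixedFeatureGates models the new behavior (hypothetical name): a single
// static entry, independent of the payload version.
func fixedFeatureGates() []string {
	return []string{"MachinePool=false"}
}

// featureGatesArg joins the gates into a single --feature-gates argument.
func featureGatesArg(gates []string) string {
	return "--feature-gates=" + strings.Join(gates, ",")
}

func main() {
	// A 4.23 payload under the old logic still receives the removed gate.
	fmt.Println(featureGatesArg(legacyFeatureGates(4, 23)))
	// The new logic yields the same argument for every payload version.
	fmt.Println(featureGatesArg(fixedFeatureGates()))
}
```

With the old logic, any 4.17 or later payload receives the value `MachinePool=false,ClusterResourceSet=false`, which is the exact string quoted in the crash message in the commit description.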

🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR does not contain Ginkgo test definitions. Modified file (gcp.go) is non-test code; codebase uses standard Go testing, not Ginkgo. |
| Test Structure And Quality | ✅ Passed | PR modifies only non-test code (gcp.go) and contains no Ginkgo tests. Custom check for Ginkgo test quality is not applicable to this pull request. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | This PR only modifies feature gate configuration strings for CAPI 1.12 compatibility; it introduces no scheduling constraints, affinity rules, topology assumptions, or replica changes whatsoever. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | This PR does not add any Ginkgo e2e tests. It modifies only the GCP platform controller configuration to remove an obsolete feature gate, making this check not applicable. |
| No-Weak-Crypto | ✅ Passed | PR modifies GCP feature gate configuration, not cryptographic code. No MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB, custom crypto, or insecure secret comparisons detected in changes. |
| Container-Privileges | ✅ Passed | PR contains no container privilege escalations: AllowPrivilegeEscalation=false, RunAsNonRoot=true, all capabilities dropped. Changes are only to feature gates, not security configuration. |
| No-Sensitive-Data-In-Logs | ✅ Passed | The PR removes a feature gate flag from CAPG controller configuration. No logging statements are added, modified, or exposed. No sensitive data (credentials, tokens, PII) is logged in this change. |
| Title check | ✅ Passed | The title accurately and specifically describes the main change: removing the ClusterResourceSet feature gate from CAPG manager arguments, which aligns with the core objective of the PR. |

@openshift-ci openshift-ci Bot added the area/hypershift-operator label (indicates the PR includes changes for the hypershift operator and API - outside an OCP release) Jun 22, 2026
@openshift-ci

openshift-ci Bot commented Jun 22, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cristianoveiga
Once this PR has been reviewed and has the lgtm label, please assign csrwng for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the area/platform/gcp label (PR/issue for GCP (GCPPlatform) platform) and removed the do-not-merge/needs-area label Jun 22, 2026
@cristianoveiga cristianoveiga changed the title from "fix(gcp): remove ClusterResourceSet feature gate from CAPG manager args" to "GCP-841: remove ClusterResourceSet feature gate from CAPG manager args" Jun 22, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) Jun 22, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 22, 2026


@cristianoveiga: This pull request references GCP-841 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@codecov

codecov Bot commented Jun 22, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 42.09%. Comparing base (8019810) to head (f11fc38).
⚠️ Report is 106 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8795      +/-   ##
==========================================
- Coverage   42.09%   42.09%   -0.01%     
==========================================
  Files         766      766              
  Lines       95047    95043       -4     
==========================================
- Hits        40012    40008       -4     
  Misses      52221    52221              
  Partials     2814     2814              

| Files with missing lines | Coverage Δ |
|---|---|
| ...rollers/hostedcluster/internal/platform/gcp/gcp.go | 83.67% <ø> (-0.20%) ⬇️ |

| Flag | Coverage Δ |
|---|---|
| cmd-support | 35.42% <ø> (ø) |
| cpo-hostedcontrolplane | 44.48% <ø> (ø) |
| cpo-other | 44.25% <ø> (ø) |
| hypershift-operator | 51.91% <ø> (-0.01%) ⬇️ |
| other | 31.56% <ø> (ø) |

@cristianoveiga cristianoveiga marked this pull request as ready for review June 22, 2026 15:47
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Jun 22, 2026
@openshift-ci openshift-ci Bot requested review from clebs and jimdaga June 22, 2026 15:47
@cristianoveiga

Contributor Author

/test e2e-v2-gke

@clebs

clebs commented Jun 25, 2026

Member

@cristianoveiga HyperShift is still on CAPI 1.11. Since you are removing a feature gate that still exists in that version, we need to make sure this is fine.

@cristianoveiga

Contributor Author

> @cristianoveiga HyperShift is still on CAPI 1.11. Since you are removing a feature gate that still exists in that version, we need to make sure this is fine.

Hi @clebs,

The deployed CAPG binary comes from the OCP payload image, built separately from HyperShift's own vendor. My understanding is that these versions are not required to match.

The OpenShift CAPG fork upgraded to CAPI 1.12.8 in openshift/cluster-api-provider-gcp@e049bbd, and the new payloads (GCP HCP minimum will be 4.23) ship that binary.

ClusterResourceSet doesn't exist in any supported CAPG binary, so the fix is safe.

@clebs

clebs commented Jun 26, 2026

Member

@cristianoveiga I see, if older CAPG versions that are still on CAPI 1.11 do not have that either, it should work fine.

/lgtm

@openshift-ci openshift-ci Bot added the lgtm label (indicates that a PR is ready to be merged) Jun 26, 2026
@openshift-merge-bot

Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@hypershift-jira-solve-ci


AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2070428924499726336 | Cost: $2.93488025 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@cristianoveiga

Contributor Author

/retest-required

@hypershift-jira-solve-ci


AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2070515044726083584 | Cost: $2.9783685 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@openshift-ci

openshift-ci Bot commented Jun 26, 2026

Contributor

@cristianoveiga: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-aks | f11fc38 | link | true | /test e2e-aks |
| ci/prow/e2e-aws | f11fc38 | link | true | /test e2e-aws |
| ci/prow/e2e-aws-4-22 | f11fc38 | link | true | /test e2e-aws-4-22 |
| ci/prow/e2e-v2-aws | f11fc38 | link | true | /test e2e-v2-aws |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci


Here is the complete analysis report:

Test Failure Analysis Complete

Job Information

Job 1: e2e-aws

Job 2: e2e-aks

Test Failure Analysis

Error

e2e-aws (2 root failures):
1. TestCreateCluster/Main/EnsureGlobalPullSecret: DaemonSet global-pull-secret-syncer stuck at 2/3 pods ready → context deadline exceeded
2. TestCreateClusterRequestServingIsolation/ValidateHostedCluster/EnsureNoCrashingPods: packageserver pods had restartCount > 0 (1-2 restarts)

e2e-aks (1 root failure):
1. TestUpgradeControlPlane/ValidateHostedCluster/EnsureOAPIMountsTrustBundle: cannot exec into a container in a completed pod; current phase is Failed → 300s timeout

Summary

Both job failures are pre-existing flaky tests unrelated to PR #8795. The PR only modifies hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp.go, removing a conditional ClusterResourceSet=false feature gate from GCP CAPG manager args. None of the failing tests involve GCP — they run on AWS and AKS (Azure) platforms respectively. The e2e-aws failures are caused by a global-pull-secret-syncer DaemonSet that couldn't schedule its third pod and OLM packageserver pod restarts — both transient infrastructure issues. The e2e-aks failure is caused by an openshift-apiserver pod in Failed phase during the control plane upgrade test, preventing the EnsureOAPIMountsTrustBundle check from exec'ing into the container. These are environmental/timing flakes with no connection to the GCP code change.

Root Cause

e2e-aws — Failure 1: TestCreateCluster/Main/EnsureGlobalPullSecret

The test patches the management-cluster pull secret and waits for it to propagate to the guest cluster. As part of validation, it waits for the global-pull-secret-syncer DaemonSet to become fully ready (3/3 pods). One of the three pods never became ready — the DaemonSet was stuck at 2/3 ready for the entire 20-minute polling window until the context deadline expired. This is a transient scheduling or node issue on the guest cluster where one node couldn't run the syncer pod. A cascading failure then occurred in the next subtest (Check_if_the_config.json_is_correct_in_all_of_the_nodes) because the previous subtest left a kubelet-config-verifier DaemonSet behind, causing a 409 Conflict ("already exists").

e2e-aws — Failure 2: TestCreateClusterRequestServingIsolation/ValidateHostedCluster/EnsureNoCrashingPods

The EnsureNoCrashingPods validation checks that no pods in the guest cluster have restartCount > 0. Two packageserver pods (OLM component) had restart counts of 1 and 2 respectively. OLM packageserver restarts are a known transient issue during cluster initialization and are not caused by this PR's GCP change.
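
A check of this kind amounts to a scan over container statuses. The sketch below is an assumption-laden illustration, not the e2e suite's implementation: `pod`, `containerStatus`, and `crashingContainers` are hypothetical, pared-down stand-ins for the Kubernetes API types, and the pod names are taken from the log excerpt in this report.

```go
package main

import "fmt"

// containerStatus and pod are hypothetical stand-ins for the Kubernetes API
// types; only the fields needed for the check are modeled.
type containerStatus struct {
	Name         string
	RestartCount int32
}

type pod struct {
	Name       string
	Containers []containerStatus
}

// crashingContainers returns one message per container that has restarted
// at least once.
func crashingContainers(pods []pod) []string {
	var failures []string
	for _, p := range pods {
		for _, c := range p.Containers {
			if c.RestartCount > 0 {
				failures = append(failures, fmt.Sprintf(
					"Container %s in pod %s has a restartCount > 0 (%d)",
					c.Name, p.Name, c.RestartCount))
			}
		}
	}
	return failures
}

func main() {
	pods := []pod{
		{Name: "packageserver-9ccbbfb8c-c2d75", Containers: []containerStatus{{Name: "packageserver", RestartCount: 2}}},
		{Name: "packageserver-9ccbbfb8c-lqpcg", Containers: []containerStatus{{Name: "packageserver", RestartCount: 1}}},
	}
	for _, msg := range crashingContainers(pods) {
		fmt.Println(msg)
	}
}
```

Because the threshold is zero restarts, a single transient restart during cluster initialization is enough to fail the check, regardless of which platform or code path the PR touches.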

e2e-aks — Failure 1: TestUpgradeControlPlane/ValidateHostedCluster/EnsureOAPIMountsTrustBundle

During the control plane upgrade test, the openshift-apiserver pod entered a Failed phase. The EnsureOAPIMountsTrustBundle test tried to exec into this pod to verify the ca-bundle.crt file was mounted, but received "cannot exec into a container in a completed pod; current phase is Failed". This retried for 300 seconds (5 minutes) before timing out. The pod failure is a transient issue during the upgrade rollout — likely the old pod was terminated while the new one was starting, and the test tried to exec into the wrong (terminated) pod. Notably, the cluster itself ultimately rolled out successfully (Successfully waited for HostedCluster to rollout in 4m6s), confirming this was a timing issue with the pod lifecycle during upgrade.

PR Relationship: PR #8795 modifies only gcp.go to remove a conditional ClusterResourceSet=false feature gate from CAPG manager args. This code path is exclusively exercised on GCP platform. None of the failing tests run on GCP — they run on AWS and Azure (AKS). The failures are infrastructure flakes.

Root Cause — Detail per Test

e2e-aws: TestCreateCluster/Main/EnsureGlobalPullSecret/When_management-cluster_hostedCluster.Spec.PullSecret_is_updated_in-place...

util.go:2290: DaemonSet global-pull-secret-syncer not ready: 2/3 pods ready
  (repeated 37 times over ~20 minutes)
util.go:2270: Failed to get DaemonSet global-pull-secret-syncer: context deadline exceeded
globalps.go:217: failed to wait for DaemonSet global-pull-secret-syncer to be ready: context deadline exceeded

e2e-aws: TestCreateCluster/Main/EnsureGlobalPullSecret/Check_if_the_config.json_is_correct_in_all_of_the_nodes

daemonsets.apps "kubelet-config-verifier" already exists (409 Conflict — cascading from prior subtest)

e2e-aws: TestCreateClusterRequestServingIsolation/ValidateHostedCluster/EnsureNoCrashingPods

util.go:829: Container packageserver in pod packageserver-9ccbbfb8c-c2d75 has a restartCount > 0 (2)
util.go:829: Container packageserver in pod packageserver-9ccbbfb8c-lqpcg has a restartCount > 0 (1)

e2e-aks: TestUpgradeControlPlane/ValidateHostedCluster/EnsureOAPIMountsTrustBundle

util.go:970: Timed out after 300.001s.
  ca-bundle.crt file should be available in openshift-apiserver pod
  Expected success, but got an error:
    cannot exec into a container in a completed pod; current phase is Failed

Recommendations

  1. Rerun the failing jobs — These are transient infrastructure flakes unrelated to the GCP code change. A /retest should resolve them.

  2. No code changes needed — The PR's modification to remove the ClusterResourceSet=false feature gate from GCP CAPG manager args has no effect on AWS or AKS test paths.

  3. Known flake patterns to track:

    • EnsureNoCrashingPods failing on OLM packageserver restarts is a recurring flake pattern in HyperShift CI
    • global-pull-secret-syncer DaemonSet readiness timeouts may indicate intermittent node scheduling issues
    • EnsureOAPIMountsTrustBundle failing during control plane upgrades suggests a race condition between pod termination and the exec check

Evidence

| Evidence | Detail |
|---|---|
| PR scope | Only modifies hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp.go; removes conditional ClusterResourceSet=false feature gate |
| PR platform | GCP only; no AWS or AKS code paths affected |
| e2e-aws failure 1 | global-pull-secret-syncer DaemonSet stuck at 2/3 ready pods for 20+ minutes → context deadline exceeded |
| e2e-aws failure 1 cascade | kubelet-config-verifier DaemonSet 409 Conflict ("already exists") due to leftover from prior subtest |
| e2e-aws failure 2 | packageserver pods restartCount > 0 (pod packageserver-9ccbbfb8c-c2d75: 2 restarts, pod packageserver-9ccbbfb8c-lqpcg: 1 restart) |
| e2e-aks failure 1 | openshift-apiserver pod in Failed phase → cannot exec into a container in a completed pod → 300s timeout |
| e2e-aks cluster health | HostedCluster rollout succeeded (4m6s), nodes ready (6m24s), conditions valid; failure was transient pod lifecycle timing |
| e2e-aws step | e2e-aws-hypershift-aws-run-e2e-nested failed after 1h9m2s |
| e2e-aks step | e2e-aks-hypershift-azure-run-e2e failed after 1h11m26s |
| e2e-aws test count | 597 tests, 30 skipped, 8 failures (2 root + 6 cascading parent failures) |
| e2e-aks test count | 402 tests, 47 skipped, 3 failures (1 root + 2 cascading parent failures) |

Labels

  • area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
  • area/platform/gcp: PR/issue for GCP (GCPPlatform) platform
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.

3 participants