GCP-841: remove ClusterResourceSet feature gate from CAPG manager args by cristianoveiga · Pull Request #8795 · openshift/hypershift

cristianoveiga · 2026-06-22T13:47:21Z

Summary

- Removes ClusterResourceSet=false from the --feature-gates arg passed to the CAPG manager
- The ClusterResourceSet feature gate was promoted to GA in CAPI 1.10, and the gate itself was removed in CAPI 1.12 (kubernetes-sigs/cluster-api#12950)
- OCP 4.22+ ships CAPG built against CAPI 1.12.8, so passing the gate causes the capi-provider pod to crash at startup with: unrecognized feature gate: ClusterResourceSet
- MachinePool=false is retained; that gate is still valid in CAPI 1.12 (Beta, default-on)

Fixes: https://redhat.atlassian.net/browse/GCP-841

Test plan

- Existing unit tests pass (go test ./hypershift-operator/controllers/hostedcluster/internal/platform/gcp/)
- periodic-ci-openshift-hypershift-release-4.23-periodics-e2e-v2-gke no longer fails due to capi-provider crash
- capi-provider pod starts successfully on 4.22.x and 4.23.x without a CAPG image override

🤖 Generated with Claude Code

Summary by CodeRabbit

Refactor
- Simplified GCP controller feature gate configuration by removing version-dependent logic, now using a fixed set of feature gates instead of conditionally adjusting based on payload version.

ClusterResourceSet was promoted to GA in CAPI 1.10 and removed entirely in CAPI 1.12. OCP 4.22+ ships CAPG built against CAPI 1.12.8, causing the capi-provider pod to crash at startup with:

invalid argument "MachinePool=false,ClusterResourceSet=false" for "--feature-gates" flag: unrecognized feature gate: ClusterResourceSet

Fixes: GCP-841
Signed-off-by: Cristiano Veiga <cveiga@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
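To illustrate why a single removed gate is enough to stop the manager at startup, here is a minimal sketch of a feature-gate parser that rejects names it was not compiled with. It is not the CAPG or CAPI implementation: `knownGates` and `parseFeatureGates` are hypothetical names, and the registry contents are illustrative only.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// knownGates is a hypothetical registry standing in for the gates a binary
// was built with. The single entry is illustrative, not CAPG's real list.
var knownGates = map[string]bool{
	"MachinePool": true, // value is the assumed default
}

// parseFeatureGates parses "Name=bool,Name=bool" and fails on any name that
// is missing from the registry, which is the behaviour the crash log suggests.
func parseFeatureGates(value string, known map[string]bool) (map[string]bool, error) {
	out := make(map[string]bool, len(known))
	for name, def := range known {
		out[name] = def
	}
	for _, pair := range strings.Split(value, ",") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue
		}
		name, raw, ok := strings.Cut(pair, "=")
		if !ok {
			return nil, fmt.Errorf("missing bool value for %s", name)
		}
		if _, exists := known[name]; !exists {
			return nil, fmt.Errorf("unrecognized feature gate: %s", name)
		}
		enabled, err := strconv.ParseBool(raw)
		if err != nil {
			return nil, fmt.Errorf("invalid value of %s=%s: %w", name, raw, err)
		}
		out[name] = enabled
	}
	return out, nil
}

func main() {
	for _, arg := range []string{
		"MachinePool=false,ClusterResourceSet=false", // old manager arg
		"MachinePool=false",                          // arg after this PR
	} {
		gates, err := parseFeatureGates(arg, knownGates)
		if err != nil {
			fmt.Printf("%q rejected: %v\n", arg, err)
			continue
		}
		fmt.Printf("%q accepted: %v\n", arg, gates)
	}
}
```

Under these assumptions, setting a gate to false is still an error once the gate name is gone from the registry, which is why the old arg fails even though it only disables the feature.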

openshift-ci · 2026-06-22T13:47:27Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-merge-bot · 2026-06-22T13:47:30Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

coderabbitai · 2026-06-22T13:47:48Z

📝 Walkthrough

In CAPIProviderDeploymentSpec within the GCP platform controller, the featureGates variable is now initialized with a single static entry (MachinePool=false). The previous conditional logic that parsed payloadVersion and appended ClusterResourceSet=false when the major version was 4 and the minor version was greater than 16 has been removed entirely.
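The change described in the walkthrough can be sketched roughly as follows. This is a reconstruction from the description, not the code in gcp.go: the function names are hypothetical, the version parsing uses only the standard library rather than whatever the real code used, and falling back to the base gates on an unparsable version is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// legacyFeatureGates approximates the removed logic: start with
// MachinePool=false and append ClusterResourceSet=false when the payload
// version has major == 4 and minor > 16.
func legacyFeatureGates(payloadVersion string) []string {
	gates := []string{"MachinePool=false"}
	parts := strings.SplitN(payloadVersion, ".", 3)
	if len(parts) < 2 {
		return gates // assumption: unparsable versions get the base set
	}
	major, errMajor := strconv.Atoi(parts[0])
	minor, errMinor := strconv.Atoi(parts[1])
	if errMajor != nil || errMinor != nil {
		return gates
	}
	if major == 4 && minor > 16 {
		gates = append(gates, "ClusterResourceSet=false")
	}
	return gates
}

// staticFeatureGates approximates the new logic: one fixed entry,
// independent of the payload version.
func staticFeatureGates() []string {
	return []string{"MachinePool=false"}
}

// featureGatesArg renders the gates as a single manager argument.
func featureGatesArg(gates []string) string {
	return "--feature-gates=" + strings.Join(gates, ",")
}

func main() {
	for _, v := range []string{"4.16.0", "4.22.3", "4.23.0"} {
		fmt.Printf("payload %s: before %s, after %s\n",
			v, featureGatesArg(legacyFeatureGates(v)), featureGatesArg(staticFeatureGates()))
	}
}
```

In this sketch the two variants differ only for payloads newer than 4.16, which is the range where the old code added the gate that CAPI 1.12 no longer recognizes.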

🚥 Pre-merge checks | ✅ 11

✅ Passed checks (11 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	PR does not contain Ginkgo test definitions. Modified file (gcp.go) is non-test code; codebase uses standard Go testing, not Ginkgo.
Test Structure And Quality	✅ Passed	PR modifies only non-test code (gcp.go) and contains no Ginkgo tests. Custom check for Ginkgo test quality is not applicable to this pull request.
Topology-Aware Scheduling Compatibility	✅ Passed	This PR only modifies feature gate configuration strings for CAPI 1.12 compatibility; it introduces no scheduling constraints, affinity rules, topology assumptions, or replica changes whatsoever.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	This PR does not add any Ginkgo e2e tests. It modifies only the GCP platform controller configuration to remove an obsolete feature gate, making this check not applicable.
No-Weak-Crypto	✅ Passed	PR modifies GCP feature gate configuration, not cryptographic code. No MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB, custom crypto, or insecure secret comparisons detected in changes.
Container-Privileges	✅ Passed	PR contains no container privilege escalations: AllowPrivilegeEscalation=false, RunAsNonRoot=true, all capabilities dropped. Changes are only to feature gates, not security configuration.
No-Sensitive-Data-In-Logs	✅ Passed	The PR removes a feature gate flag from CAPG controller configuration. No logging statements are added, modified, or exposed. No sensitive data (credentials, tokens, PII) is logged in this change.
Title check	✅ Passed	The title accurately and specifically describes the main change: removing the ClusterResourceSet feature gate from CAPG manager arguments, which aligns with the core objective of the PR.


openshift-ci · 2026-06-22T13:47:55Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cristianoveiga
Once this PR has been reviewed and has the lgtm label, please assign csrwng for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2026-06-22T13:50:43Z

@cristianoveiga: This pull request references GCP-841 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.


codecov · 2026-06-22T13:56:18Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 42.09%. Comparing base (8019810) to head (f11fc38).
⚠️ Report is 106 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8795      +/-   ##
==========================================
- Coverage   42.09%   42.09%   -0.01%     
==========================================
  Files         766      766              
  Lines       95047    95043       -4     
==========================================
- Hits        40012    40008       -4     
  Misses      52221    52221              
  Partials     2814     2814

Files with missing lines	Coverage Δ
...rollers/hostedcluster/internal/platform/gcp/gcp.go	`83.67% <ø> (-0.20%)`	⬇️

Flag	Coverage Δ
cmd-support	`35.42% <ø> (ø)`
cpo-hostedcontrolplane	`44.48% <ø> (ø)`
cpo-other	`44.25% <ø> (ø)`
hypershift-operator	`51.91% <ø> (-0.01%)`	⬇️
other	`31.56% <ø> (ø)`


cristianoveiga · 2026-06-22T16:43:54Z

/test e2e-v2-gke

clebs · 2026-06-25T11:07:15Z

@cristianoveiga hypershift is still on CAPI 1.11, since you are removing a feature that is still there on that version we need to make sure it is fine.

cristianoveiga · 2026-06-25T13:03:04Z

> @cristianoveiga hypershift is still on CAPI 1.11, since you are removing a feature that is still there on that version we need to make sure it is fine.

Hi @clebs,

The deployed CAPG binary comes from the OCP payload image, built separately from HyperShift's own vendor. My understanding is that these versions are not required to match.

The OpenShift CAPG fork upgraded to CAPI 1.12.8 in openshift/cluster-api-provider-gcp@e049bbd, and the new payloads (GCP HCP minimum will be 4.23) ship that binary.

ClusterResourceSet doesn't exist in any supported CAPG binary, so the fix is safe.

clebs · 2026-06-26T08:43:59Z

@cristianoveiga I see, if older CAPG versions that are still on CAPI 1.11 do not have that either, it should work fine.

/lgtm

openshift-merge-bot · 2026-06-26T08:45:02Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

hypershift-jira-solve-ci · 2026-06-26T10:57:10Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2070428924499726336 | Cost: $2.93488025 | Failed step: hypershift-azure-run-e2e

View full analysis report

Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

cristianoveiga · 2026-06-26T14:29:59Z

/retest-required

hypershift-jira-solve-ci · 2026-06-26T17:00:22Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2070515044726083584 | Cost: $2.9783685 | Failed step: hypershift-azure-run-e2e

View full analysis report

Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

openshift-ci · 2026-06-26T17:07:48Z

@cristianoveiga: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aks	`f11fc38`	link	true	`/test e2e-aks`
ci/prow/e2e-aws	`f11fc38`	link	true	`/test e2e-aws`
ci/prow/e2e-aws-4-22	`f11fc38`	link	true	`/test e2e-aws-4-22`
ci/prow/e2e-v2-aws	`f11fc38`	link	true	`/test e2e-v2-aws`


hypershift-jira-solve-ci · 2026-06-27T05:40:30Z

Here is the complete analysis report:

Test Failure Analysis Complete

Job Information

Job 1: e2e-aws

Prow Job: pull-ci-openshift-hypershift-main-e2e-aws
Build ID: 2070515047464964096
Target: e2e-aws
Result: FAILURE (597 tests, 30 skipped, 8 failures — 2 distinct root failures + cascading parent failures)
Duration: 2h53m24s
Prow URL: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_hypershift/8795/pull-ci-openshift-hypershift-main-e2e-aws/2070515047464964096

Job 2: e2e-aks

Prow Job: pull-ci-openshift-hypershift-main-e2e-aks
Build ID: 2070515044726083584
Target: e2e-aks
Result: FAILURE (402 tests, 47 skipped, 3 failures — 1 distinct root failure + cascading parent failures)
Duration: 2h36m45s
Prow URL: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_hypershift/8795/pull-ci-openshift-hypershift-main-e2e-aks/2070515044726083584

Test Failure Analysis

Error

e2e-aws (2 root failures):
1. TestCreateCluster/Main/EnsureGlobalPullSecret: DaemonSet global-pull-secret-syncer stuck at 2/3 pods ready → context deadline exceeded
2. TestCreateClusterRequestServingIsolation/ValidateHostedCluster/EnsureNoCrashingPods: packageserver pods had restartCount > 0 (1-2 restarts)

e2e-aks (1 root failure):
1. TestUpgradeControlPlane/ValidateHostedCluster/EnsureOAPIMountsTrustBundle: cannot exec into a container in a completed pod; current phase is Failed → 300s timeout

Summary

Both job failures are pre-existing flaky tests unrelated to PR #8795. The PR only modifies hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp.go, removing a conditional ClusterResourceSet=false feature gate from GCP CAPG manager args. None of the failing tests involve GCP — they run on AWS and AKS (Azure) platforms respectively. The e2e-aws failures are caused by a global-pull-secret-syncer DaemonSet that couldn't schedule its third pod and OLM packageserver pod restarts — both transient infrastructure issues. The e2e-aks failure is caused by an openshift-apiserver pod in Failed phase during the control plane upgrade test, preventing the EnsureOAPIMountsTrustBundle check from exec'ing into the container. These are environmental/timing flakes with no connection to the GCP code change.

Root Cause

e2e-aws — Failure 1: TestCreateCluster/Main/EnsureGlobalPullSecret

The test patches the management-cluster pull secret and waits for it to propagate to the guest cluster. As part of validation, it waits for the global-pull-secret-syncer DaemonSet to become fully ready (3/3 pods). One of the three pods never became ready — the DaemonSet was stuck at 2/3 ready for the entire 20-minute polling window until the context deadline expired. This is a transient scheduling or node issue on the guest cluster where one node couldn't run the syncer pod. A cascading failure then occurred in the next subtest (Check_if_the_config.json_is_correct_in_all_of_the_nodes) because the previous subtest left a kubelet-config-verifier DaemonSet behind, causing a 409 Conflict ("already exists").

e2e-aws — Failure 2: TestCreateClusterRequestServingIsolation/ValidateHostedCluster/EnsureNoCrashingPods

The EnsureNoCrashingPods validation checks that no pods in the guest cluster have restartCount > 0. Two packageserver pods (OLM component) had restart counts of 1 and 2 respectively. OLM packageserver restarts are a known transient issue during cluster initialization and are not caused by this PR's GCP change.
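A check of this kind can be sketched as below. The `pod` and `containerStatus` types are simplified, hypothetical stand-ins for the Kubernetes API types, and `crashingContainers` is not the actual e2e helper; only the pod names and restart counts are taken from the log in this report.

```go
package main

import "fmt"

// containerStatus and pod are hypothetical, trimmed-down stand-ins for the
// Kubernetes API types the real test would inspect.
type containerStatus struct {
	Name         string
	RestartCount int32
}

type pod struct {
	Name       string
	Containers []containerStatus
}

// crashingContainers returns one finding per container that has restarted
// at least once, in the order the pods were given.
func crashingContainers(pods []pod) []string {
	var findings []string
	for _, p := range pods {
		for _, c := range p.Containers {
			if c.RestartCount > 0 {
				findings = append(findings, fmt.Sprintf(
					"Container %s in pod %s has a restartCount > 0 (%d)",
					c.Name, p.Name, c.RestartCount))
			}
		}
	}
	return findings
}

func main() {
	pods := []pod{
		{Name: "packageserver-9ccbbfb8c-c2d75", Containers: []containerStatus{{Name: "packageserver", RestartCount: 2}}},
		{Name: "packageserver-9ccbbfb8c-lqpcg", Containers: []containerStatus{{Name: "packageserver", RestartCount: 1}}},
	}
	for _, finding := range crashingContainers(pods) {
		fmt.Println(finding)
	}
}
```

Because the threshold is any restart at all, a single transient restart during cluster initialization is enough to fail the validation.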

e2e-aks — Failure 1: TestUpgradeControlPlane/ValidateHostedCluster/EnsureOAPIMountsTrustBundle

During the control plane upgrade test, the openshift-apiserver pod entered a Failed phase. The EnsureOAPIMountsTrustBundle test tried to exec into this pod to verify the ca-bundle.crt file was mounted, but received "cannot exec into a container in a completed pod; current phase is Failed". This retried for 300 seconds (5 minutes) before timing out. The pod failure is a transient issue during the upgrade rollout — likely the old pod was terminated while the new one was starting, and the test tried to exec into the wrong (terminated) pod. Notably, the cluster itself ultimately rolled out successfully (Successfully waited for HostedCluster to rollout in 4m6s), confirming this was a timing issue with the pod lifecycle during upgrade.

PR Relationship: PR #8795 modifies only gcp.go to remove a conditional ClusterResourceSet=false feature gate from CAPG manager args. This code path is exclusively exercised on GCP platform. None of the failing tests run on GCP — they run on AWS and Azure (AKS). The failures are infrastructure flakes.

Root Cause — Detail per Test

e2e-aws: `TestCreateCluster/Main/EnsureGlobalPullSecret/When_management-cluster_hostedCluster.Spec.PullSecret_is_updated_in-place...`

util.go:2290: DaemonSet global-pull-secret-syncer not ready: 2/3 pods ready
  (repeated 37 times over ~20 minutes)
util.go:2270: Failed to get DaemonSet global-pull-secret-syncer: context deadline exceeded
globalps.go:217: failed to wait for DaemonSet global-pull-secret-syncer to be ready: context deadline exceeded

e2e-aws: `TestCreateCluster/Main/EnsureGlobalPullSecret/Check_if_the_config.json_is_correct_in_all_of_the_nodes`

daemonsets.apps "kubelet-config-verifier" already exists (409 Conflict — cascading from prior subtest)

e2e-aws: `TestCreateClusterRequestServingIsolation/ValidateHostedCluster/EnsureNoCrashingPods`

util.go:829: Container packageserver in pod packageserver-9ccbbfb8c-c2d75 has a restartCount > 0 (2)
util.go:829: Container packageserver in pod packageserver-9ccbbfb8c-lqpcg has a restartCount > 0 (1)

e2e-aks: `TestUpgradeControlPlane/ValidateHostedCluster/EnsureOAPIMountsTrustBundle`

util.go:970: Timed out after 300.001s.
  ca-bundle.crt file should be available in openshift-apiserver pod
  Expected success, but got an error:
    cannot exec into a container in a completed pod; current phase is Failed

Recommendations

1. Rerun the failing jobs: these are transient infrastructure flakes unrelated to the GCP code change. A /retest should resolve them.
2. No code changes needed: the PR's removal of the ClusterResourceSet=false feature gate from GCP CAPG manager args has no effect on AWS or AKS test paths.
3. Known flake patterns to track:
- EnsureNoCrashingPods failing on OLM packageserver restarts is a recurring flake pattern in HyperShift CI
- global-pull-secret-syncer DaemonSet readiness timeouts may indicate intermittent node scheduling issues
- EnsureOAPIMountsTrustBundle failing during control plane upgrades suggests a race condition between pod termination and the exec check

Evidence

Evidence	Detail
PR scope	Only modifies `hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp.go` — removes conditional `ClusterResourceSet=false` feature gate
PR platform	GCP only — no AWS or AKS code paths affected
e2e-aws failure 1	`global-pull-secret-syncer` DaemonSet stuck at 2/3 ready pods for 20+ minutes → `context deadline exceeded`
e2e-aws failure 1 cascade	`kubelet-config-verifier` DaemonSet 409 Conflict ("already exists") due to leftover from prior subtest
e2e-aws failure 2	`packageserver` pods `restartCount > 0` (pod `packageserver-9ccbbfb8c-c2d75`: 2 restarts, pod `packageserver-9ccbbfb8c-lqpcg`: 1 restart)
e2e-aks failure 1	`openshift-apiserver` pod in `Failed` phase → `cannot exec into a container in a completed pod` → 300s timeout
e2e-aks cluster health	HostedCluster rollout succeeded (`4m6s`), nodes ready (`6m24s`), conditions valid — failure was transient pod lifecycle timing
e2e-aws step	`e2e-aws-hypershift-aws-run-e2e-nested` failed after 1h9m2s
e2e-aks step	`e2e-aks-hypershift-azure-run-e2e` failed after 1h11m26s
e2e-aws test count	597 tests, 30 skipped, 8 failures (2 root + 6 cascading parent failures)
e2e-aks test count	402 tests, 47 skipped, 3 failures (1 root + 2 cascading parent failures)

openshift-ci Bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Jun 22, 2026

openshift-ci Bot added the do-not-merge/needs-area label on Jun 22, 2026

openshift-ci Bot added the area/hypershift-operator label (indicates the PR includes changes for the hypershift operator and API, outside an OCP release) on Jun 22, 2026

openshift-ci Bot added the area/platform/gcp label (PR/issue for GCP (GCPPlatform) platform) and removed the do-not-merge/needs-area label on Jun 22, 2026

cristianoveiga changed the title from "fix(gcp): remove ClusterResourceSet feature gate from CAPG manager args" to "GCP-841: remove ClusterResourceSet feature gate from CAPG manager args" on Jun 22, 2026

openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Jun 22, 2026

cristianoveiga marked this pull request as ready for review on June 22, 2026, 15:47

openshift-ci Bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Jun 22, 2026

openshift-ci Bot requested review from clebs and jimdaga on June 22, 2026, 15:47

openshift-ci Bot assigned clebs on Jun 26, 2026

openshift-ci Bot added the lgtm label (indicates that a PR is ready to be merged) on Jun 26, 2026
