add srelib poc #922

Draft
petrkotas wants to merge 1 commit into openshift:master from petrkotas:add-srelib-poc
Conversation

@petrkotas

@petrkotas petrkotas commented Jun 16, 2026

Member

Summary by CodeRabbit

  • New Features

    • Integrated SRE library plugin system for cluster operations and queries.
    • Cluster lookups and status checks now handled through the external SRE library plugin.
  • Chores

    • Updated Go toolchain from version 1.25.7 to 1.26.3.
    • Updated multiple dependencies and added new ones to support the plugin architecture.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 16, 2026
@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented Jun 16, 2026


Walkthrough

The PR integrates a HashiCorp go-plugin-backed SRE library client. A new pkg/srelib/client.go wraps the plugin lifecycle and exposes cluster lookup methods. main.go resolves the plugin binary path and initializes the client at startup. pkg/utils/ocm.go redirects cluster lookup helpers to delegate through the new client instead of the OCM SDK. go.mod adds the required dependencies and a local replace directive.

Changes

srelib Plugin Client Integration

| Layer / File(s) | Summary |
| --- | --- |
| Module dependencies and replace directives (go.mod) | Updates Go to 1.26.3, adds direct dependencies for go-hclog, go-plugin, and srelib, adds indirect entries for protobuf, yamux, and oklog/run, updates the observatorium/api replace version, and adds a local replace for srelib => ../srelib. |
| srelib plugin Client wrapper (pkg/srelib/client.go) | Defines Client holding an inner v1.Client and *plugin.Client killer; NewClient launches the plugin process, performs RPC handshake/dispense, and kills the process on failure; pass-through GetClusters/GetClusterAnyStatus delegate to inner client; Close kills the plugin process. |
| Startup initialization and cluster lookup delegation (main.go, pkg/utils/ocm.go) | main.go adds resolveSrelibPlugin() to resolve the plugin binary path from env or executable dir, initializes the client before running the Cobra command, defers Close, and calls SetSrelibClient; ocm.go adds activeSrelibClient and SetSrelibClient, and replaces OCM SDK cluster query logic in GetClusterAnyStatus and GetClusters with srelib delegation. |
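The "env or executable dir" lookup described for resolveSrelibPlugin() could look roughly like the sketch below. The PR's actual code is not shown in this conversation, so the environment variable name `SRELIB_PLUGIN_PATH`, the binary name `srelib-plugin`, and the function signature are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Hypothetical names: the review does not show which environment variable or
// binary name the PR actually uses.
const (
	pluginPathEnv    = "SRELIB_PLUGIN_PATH"
	pluginBinaryName = "srelib-plugin"
)

// resolvePluginPath returns the environment override when it is set, and
// otherwise a path to the plugin binary inside exeDir.
func resolvePluginPath(getenv func(string) string, exeDir string) string {
	if p := getenv(pluginPathEnv); p != "" {
		return p
	}
	return filepath.Join(exeDir, pluginBinaryName)
}

func main() {
	exe, err := os.Executable()
	if err != nil {
		fmt.Fprintf(os.Stderr, "cannot locate executable: %v\n", err)
		os.Exit(1)
	}
	fmt.Println(resolvePluginPath(os.Getenv, filepath.Dir(exe)))
}
```

Passing `getenv` as a parameter is a choice made for this sketch so the lookup can be exercised without touching the process environment.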

Sequence Diagram(s)

sequenceDiagram
  participant main
  participant resolveSrelibPlugin
  participant srelibpkg as pkg/srelib.NewClient
  participant goPlugin as go-plugin process
  participant ocmUtils as pkg/utils (GetClusters)

  main->>resolveSrelibPlugin: resolve plugin path
  resolveSrelibPlugin-->>main: pluginPath
  main->>srelibpkg: NewClient(pluginPath)
  srelibpkg->>goPlugin: start process + RPC handshake
  goPlugin-->>srelibpkg: v1.Client dispensed
  srelibpkg-->>main: *Client
  main->>ocmUtils: SetSrelibClient(*Client)
  Note over ocmUtils: activeSrelibClient set
  ocmUtils->>goPlugin: GetClusters / GetClusterAnyStatus via inner v1.Client
  goPlugin-->>ocmUtils: []*cmv1.Cluster
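The startup order in the diagram (create the client, defer Close, publish the client, run the command) can be sketched with standard-library stand-ins. Everything here is hypothetical: `clusterClient`, `fakeClient`, `setClient`, and `run` are illustration names, and the real wrapper talks to a go-plugin process rather than an in-memory fake.

```go
package main

import (
	"fmt"
	"os"
)

// clusterClient is a stand-in for the wrapper in pkg/srelib; its method set is
// an assumption made for this sketch.
type clusterClient interface {
	GetClusters(ids []string) ([]string, error)
	Close()
}

// fakeClient replaces the real plugin process so the sketch runs anywhere.
type fakeClient struct{ closed bool }

func (f *fakeClient) GetClusters(ids []string) ([]string, error) { return ids, nil }
func (f *fakeClient) Close()                                     { f.closed = true }

// active mirrors the package-level client that SetSrelibClient stores.
var active clusterClient

func setClient(c clusterClient) { active = c }

// run follows the order in the diagram: create the client, schedule Close,
// publish the client, then execute the command.
func run(newClient func() (clusterClient, error), cmd func() error) error {
	c, err := newClient()
	if err != nil {
		return fmt.Errorf("start plugin: %w", err)
	}
	defer c.Close()
	setClient(c)
	return cmd()
}

func main() {
	err := run(
		func() (clusterClient, error) { return &fakeClient{}, nil },
		func() error {
			clusters, err := active.GetClusters([]string{"cluster-a"})
			if err != nil {
				return err
			}
			fmt.Println(clusters)
			return nil
		},
	)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Returning the error from `run` and exiting only in `main` matters because `os.Exit` does not run deferred functions, so an exit inside `run` would skip `Close` and could leave the plugin process behind.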

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 13 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'add srelib poc' is vague and uses generic terminology ('poc' - proof of concept) without clearly describing the specific changes or purpose of the pull request. | Consider expanding the title to be more descriptive, such as 'Add SRE library plugin client integration' or similar to better convey the main objective and changes. |
✅ Passed checks (13 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | No Ginkgo test definitions (It(), Describe(), Context(), When()) found in PR changes. Modified files contain only implementation code and configuration, with no test code that could have unstable o... |
| Test Structure And Quality | ✅ Passed | No Ginkgo test code was added or modified in this PR. The changes only affect production code (go.mod, main.go, pkg/srelib/client.go, pkg/utils/ocm.go). |
| Microshift Test Compatibility | ✅ Passed | No Ginkgo e2e tests are added in this PR. Changes consist of go.mod dependency updates, main.go plugin initialization, pkg/srelib/client.go plugin wrapper implementation, and pkg/utils/ocm.go refac... |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added in this PR. Changes are limited to go.mod, main.go, pkg/srelib/client.go, and pkg/utils/ocm.go—all production/library code with no test definitions or multi-node a... |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR introduces Go module updates and library/CLI code (plugin client wrapper and utility refactoring) only. No Kubernetes deployment manifests, operator code, or scheduling constraints that would re... |
| Ote Binary Stdout Contract | ✅ Passed | New code in main.go writes only to stderr via fmt.Fprintf(os.Stderr, ...). New pkg/srelib/client.go uses hclog (stderr by default). Utility functions in ocm.go are not process-level code. The check... |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR does not add any Ginkgo e2e tests - check not applicable. Only 4 files modified: go.mod, main.go, pkg/srelib/client.go, pkg/utils/ocm.go (all non-test files). |
| No-Weak-Crypto | ✅ Passed | No weak cryptographic usage found in PR changes: no MD5, SHA1, DES, RC4, 3DES, Blowfish, or ECB; no custom crypto implementations; no non-constant-time secret comparisons. |
| Container-Privileges | ✅ Passed | PR contains only Go source code changes (go.mod, main.go, pkg/srelib/, pkg/utils/); no container or K8s manifests with privileged settings added or modified. |
| No-Sensitive-Data-In-Logs | ✅ Passed | No sensitive data (passwords, tokens, API keys, PII, session IDs, hostnames, or customer data) is exposed in logging statements throughout the PR changes. |


@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petrkotas

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 16, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go.mod`:
- Line 280: The local filesystem replace directive for
github.com/petrkotas/srelib that points to ../srelib will cause build failures
in CI systems and clean environments where that relative path does not exist.
Remove this replace directive from go.mod entirely, or if the module is needed
as a dependency, replace it with a resolvable reference such as a published
version, a commit-based pseudo-version, or a git branch reference that can be
accessed reproducibly across all build environments.

In `@pkg/srelib/client.go`:
- Line 42: The type assertion `raw.(v1.Client)` in the return statement is
unguarded and will panic if the plugin dispense does not return a value of the
expected type. Guard the type assertion by using the two-value form (value, ok
:= raw.(v1.Client)) to check if the assertion succeeds. If the assertion fails,
return nil and a descriptive error instead of allowing the panic. Only create
and return the Client struct if the type assertion succeeds.

In `@pkg/utils/ocm.go`:
- Around line 69-77: The GetClusterAnyStatus and GetClusters functions must
guard access to the global activeSrelibClient with nil checks to prevent panics
when called outside the main entrypoint (such as from docgen or tests), and the
GetClusters function must remove the log.Fatalf call that terminates the process
on error, instead returning the error to the caller so they can handle the
failure appropriately according to their error handling expectations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8f0259b6-27c3-4e3b-aafa-6b9b22f08c0c

📥 Commits

Reviewing files that changed from the base of the PR and between b19f141 and 8674fe3.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (4)
  • go.mod
  • main.go
  • pkg/srelib/client.go
  • pkg/utils/ocm.go

Comment thread go.mod

replace github.com/observatorium/api v0.1.2 => github.com/observatorium/api v0.1.3-0.20260326145919-3032029d0fde

replace github.com/petrkotas/srelib => ../srelib


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Replace directives:"
rg -n '^replace\s+' go.mod

echo "srelib entries:"
rg -n 'github.com/petrkotas/srelib' go.mod

if [ -d ../srelib ]; then
  echo "Found ../srelib locally (still non-portable across CI/consumers)."
else
  echo "../srelib is absent; this replace will fail in clean environments."
fi

Repository: openshift/osdctl

Length of output: 425


Remove local replace directive; it blocks builds in any clean environment.

replace github.com/petrkotas/srelib => ../srelib will fail immediately on CI systems and consumer builds since ../srelib is absent from the repository. This is a build blocker, not just a reproducibility issue. Either remove the replace directive or reference a resolvable module version (published release, commit-based pseudo-version, or git branch).

Per supply chain security guidelines, all dependency references must be reproducible and pinnable. Local filesystem paths violate this requirement and must not reach production.
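One way a project might catch this before review is a small check that flags replace directives whose target is a filesystem path. The following is a minimal line-based sketch, not a full go.mod parser; the `localReplaces` helper is hypothetical and relies on the Go modules rule that a local replacement path begins with `./` or `../` or is absolute.

```go
package main

import (
	"bufio"
	"fmt"
	"path/filepath"
	"strings"
)

// localReplaces returns the replace targets in go.mod content that point at a
// filesystem path. It handles single-line and parenthesized replace blocks.
func localReplaces(gomod string) []string {
	var found []string
	inBlock := false
	sc := bufio.NewScanner(strings.NewReader(gomod))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Drop trailing comments before inspecting the directive.
		if i := strings.Index(line, "//"); i >= 0 {
			line = strings.TrimSpace(line[:i])
		}
		switch {
		case line == "replace (":
			inBlock = true
			continue
		case inBlock && line == ")":
			inBlock = false
			continue
		case strings.HasPrefix(line, "replace "):
			line = strings.TrimPrefix(line, "replace ")
		case !inBlock:
			continue
		}
		_, target, ok := strings.Cut(line, "=>")
		if !ok {
			continue
		}
		fields := strings.Fields(target)
		if len(fields) == 0 {
			continue
		}
		p := fields[0]
		if strings.HasPrefix(p, "./") || strings.HasPrefix(p, "../") || filepath.IsAbs(p) {
			found = append(found, p)
		}
	}
	return found
}

func main() {
	gomod := `module example.com/demo

replace github.com/observatorium/api v0.1.2 => github.com/observatorium/api v0.1.3-0.20260326145919-3032029d0fde

replace github.com/petrkotas/srelib => ../srelib
`
	fmt.Println(localReplaces(gomod))
}
```

A production check would more likely parse the file with a real go.mod parser; the string handling here only illustrates the rule being enforced.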


Source: Coding guidelines

Comment thread pkg/srelib/client.go
return nil, fmt.Errorf("srelib: dispense plugin: %w", err)
}

return &Client{inner: raw.(v1.Client), killer: pc}, nil


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "client.go" | grep srelib

Repository: openshift/osdctl

Length of output: 83


🏁 Script executed:

head -60 pkg/srelib/client.go | cat -n

Repository: openshift/osdctl

Length of output: 1697


Unguarded type assertion from plugin dispense can panic.

raw.(v1.Client) will panic if the plugin returns a value that doesn't match the expected type. This can happen due to version mismatch, incorrect plugin configuration, or handshake issues. Guard the assertion and return an error instead:

Proposed fix
-	return &Client{inner: raw.(v1.Client), killer: pc}, nil
+	inner, ok := raw.(v1.Client)
+	if !ok {
+		pc.Kill()
+		return nil, fmt.Errorf("srelib: unexpected plugin client type %T", raw)
+	}
+	return &Client{inner: inner, killer: pc}, nil
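The behaviour of the guarded assertion can be tried in isolation with standard-library stand-ins. In this sketch `clusterAPI` plays the role of v1.Client and `killer` the role of the plugin handle; both types, their methods, and the `wrap` helper are hypothetical and exist only so the example runs without the plugin dependencies.

```go
package main

import (
	"errors"
	"fmt"
)

// clusterAPI stands in for v1.Client; its method is hypothetical.
type clusterAPI interface {
	Ping() error
}

type goodPlugin struct{}

func (goodPlugin) Ping() error { return nil }

// killer models the single method of the plugin handle this sketch needs.
type killer interface{ Kill() }

// recorder remembers whether Kill was called.
type recorder struct{ killed bool }

func (r *recorder) Kill() { r.killed = true }

var errWrongType = errors.New("unexpected plugin client type")

// wrap checks the dispensed value with the two-value assertion and stops the
// plugin process when the value does not implement clusterAPI.
func wrap(raw any, k killer) (clusterAPI, error) {
	inner, ok := raw.(clusterAPI)
	if !ok {
		k.Kill()
		return nil, fmt.Errorf("%w: %T", errWrongType, raw)
	}
	return inner, nil
}

func main() {
	// A string does not implement clusterAPI, so wrap reports an error
	// instead of panicking.
	_, err := wrap("not a client", &recorder{})
	fmt.Println(err)
}
```

Wrapping a sentinel error with `%w` is an addition of this sketch; it lets callers test for the condition with `errors.Is` if they need to.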

Comment thread pkg/utils/ocm.go
Comment on lines 69 to 77
 func GetClusterAnyStatus(conn *sdk.Connection, clusterId string) (*cmv1.Cluster, error) {
-	// identifier in the accounts management service. To find those clusters we need to check
-	// directly in the clusters management service.
-	clustersSearch := fmt.Sprintf(ClusterServiceClusterSearch, clusterId, clusterId, clusterId)
-	clustersListResponse, err := conn.ClustersMgmt().V1().Clusters().List().Search(clustersSearch).Size(1).Send()
-	if err != nil {
-		return nil, fmt.Errorf("can't retrieve clusters for clusterId '%s': %w", clusterId, err)
-	}
-
-	// If there is exactly one cluster matching then return it:
-	clustersTotal := clustersListResponse.Total()
-	if clustersTotal == 1 {
-		return clustersListResponse.Items().Slice()[0], nil
-	}
-
-	return nil, fmt.Errorf("there are %d clusters with identifier or name '%s', expected 1", clustersTotal, clusterId)
+	return activeSrelibClient.GetClusterAnyStatus(clusterId)
 }
 
 func GetClusters(ocmClient *sdk.Connection, clusterIds []string) []*cmv1.Cluster {
-	for i, id := range clusterIds {
-		clusterIds[i] = GenerateQuery(id)
-	}
-
-	clusters, err := ApplyFilters(ocmClient, []string{strings.Join(clusterIds, " or ")})
+	clusters, err := activeSrelibClient.GetClusters(clusterIds)
 	if err != nil {
-		log.Fatalf("error while retrieving cluster(s) from ocm: %[1]s", err)
+		log.Fatalf("error retrieving cluster(s) via srelib: %s", err)
 	}


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n pkg/utils/ocm.go | head -100

Repository: openshift/osdctl

Length of output: 4078


🏁 Script executed:

# Search for activeSrelibClient initialization and usage
rg "activeSrelibClient" -B 2 -A 2

Repository: openshift/osdctl

Length of output: 966


🏁 Script executed:

# Check for SetSrelibClient function definition
rg "SetSrelibClient" -B 3 -A 3

Repository: openshift/osdctl

Length of output: 661


🏁 Script executed:

# Look for all calls to GetClusters to understand usage patterns
rg "GetClusters" -B 2 -A 2

Repository: openshift/osdctl

Length of output: 2532


🏁 Script executed:

# Check for error handling in callers of GetClusters
rg "GetClusters|GetClusterAnyStatus" --type go -A 5 | head -50

Repository: openshift/osdctl

Length of output: 2319


🏁 Script executed:

# Find test files
fd -e _test.go | head -20

Repository: openshift/osdctl

Length of output: 42


🏁 Script executed:

# Search for calls to GetClusters and GetClusterAnyStatus in tests
rg "GetClusters|GetClusterAnyStatus" --type go -B 3 -A 3 | grep -E "(_test\.go|test_|Test)" -A 5 | head -40

Repository: openshift/osdctl

Length of output: 42


🏁 Script executed:

# Check if tests initialize the client
rg "SetSrelibClient" --type go

Repository: openshift/osdctl

Length of output: 191


🏁 Script executed:

# Look for alternative entrypoints or main functions
rg "func main" --type go

Repository: openshift/osdctl

Length of output: 178


🏁 Script executed:

# Check docgen/main.go for potential usage
cat -n utils/docgen/main.go | head -80

Repository: openshift/osdctl

Length of output: 222


🏁 Script executed:

# Check for any other imports or usage of GetClusters/GetClusterAnyStatus in tools
rg "GetClusters|GetClusterAnyStatus" utils/docgen/ 2>/dev/null || echo "No usage in docgen"

Repository: openshift/osdctl

Length of output: 79


🏁 Script executed:

# Check if there are any test directories at all
find . -type d -name "*test*" 2>/dev/null | head -10

Repository: openshift/osdctl

Length of output: 131


🏁 Script executed:

# Look for go.sum or go.mod to understand if there's a testing framework
ls -la | grep -E "go\.(mod|sum)|Makefile|\.github"

Repository: openshift/osdctl

Length of output: 295


🏁 Script executed:

# Check callers of GetClusters and GetClusterAnyStatus to understand usage patterns
rg "GetClusters|GetClusterAnyStatus" --type go -B 2 -A 5 | grep -E "(cmd|pkg)/[^/]+\.(go|ts)" | head -50

Repository: openshift/osdctl

Length of output: 42


🏁 Script executed:

# Verify the exact error handling expectations from callers
rg "GetClusters\(" -A 3 --type go | head -50

Repository: openshift/osdctl

Length of output: 2191


Guard global client access and stop terminating process from utility helpers.

These helpers dereference the package-level activeSrelibClient without a nil check, so they panic if invoked before the client has been set, for example outside the main entrypoint (from docgen or tests). Additionally, log.Fatalf in GetClusters exits the process on recoverable RPC failures, which makes lookup failures unrecoverable and breaks the error-handling contract that callers expect.

Suggested fix
 func GetClusterAnyStatus(conn *sdk.Connection, clusterId string) (*cmv1.Cluster, error) {
-	return activeSrelibClient.GetClusterAnyStatus(clusterId)
+	if activeSrelibClient == nil {
+		return nil, fmt.Errorf("srelib client is not initialized")
+	}
+	return activeSrelibClient.GetClusterAnyStatus(clusterId)
 }
 
-func GetClusters(ocmClient *sdk.Connection, clusterIds []string) []*cmv1.Cluster {
-	clusters, err := activeSrelibClient.GetClusters(clusterIds)
-	if err != nil {
-		log.Fatalf("error retrieving cluster(s) via srelib: %s", err)
-	}
-	return clusters
+func GetClusters(ocmClient *sdk.Connection, clusterIds []string) ([]*cmv1.Cluster, error) {
+	if activeSrelibClient == nil {
+		return nil, fmt.Errorf("srelib client is not initialized")
+	}
+	return activeSrelibClient.GetClusters(clusterIds)
 }
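The effect of the nil guard and the error return can be demonstrated without the plugin. This is a sketch under stated assumptions: `lookup`, `activeClient`, `getClusters`, and the `failing` stub are hypothetical stand-ins for the srelib client, the activeSrelibClient global, and the helper in ocm.go.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// lookup is a stand-in for the srelib client; the signature is assumed.
type lookup interface {
	GetClusters(ids []string) ([]string, error)
}

var errNotInitialized = errors.New("srelib client is not initialized")

// activeClient plays the role of the package-level activeSrelibClient.
var activeClient lookup

// getClusters returns errors to the caller instead of exiting the process,
// and refuses to call through a nil client.
func getClusters(ids []string) ([]string, error) {
	if activeClient == nil {
		return nil, errNotInitialized
	}
	return activeClient.GetClusters(ids)
}

// failing simulates an RPC failure from the plugin.
type failing struct{}

func (failing) GetClusters([]string) ([]string, error) {
	return nil, errors.New("rpc unavailable")
}

func main() {
	// No client has been set yet, so the caller gets an error it can handle.
	if _, err := getClusters([]string{"c1"}); err != nil {
		fmt.Fprintln(os.Stderr, "lookup failed:", err)
	}

	// With a client that fails, the error is still returned rather than fatal.
	activeClient = failing{}
	if _, err := getClusters([]string{"c1"}); err != nil {
		fmt.Fprintln(os.Stderr, "lookup failed:", err)
	}
}
```

One caveat of using an interface-typed global as this sketch does: an interface holding a nil pointer compares as non-nil, so the guard only catches the case where nothing was assigned at all.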

@bergmannf

Contributor

Looks good - I'd love to see this for some more complex flows like the assume role chaining for AWS or constructing backplane clients for different OCM connections.
