Skip to content

Add drift detection and periodic resync for OpenStack resources#759

Open
eshulman2 wants to merge 10 commits into
k-orc:mainfrom
eshulman2:forge/aisos-376
Open

Add drift detection and periodic resync for OpenStack resources#759
eshulman2 wants to merge 10 commits into
k-orc:mainfrom
eshulman2:forge/aisos-376

Conversation

@eshulman2

@eshulman2 eshulman2 commented Apr 19, 2026

Copy link
Copy Markdown
Contributor

Summary

Implements periodic resync and external deletion handling for the OpenStack Resource Controller. When enabled, ORC periodically reconciles resources with OpenStack to detect configuration drift and external deletions, automatically recreating managed resources that were deleted outside of ORC while surfacing terminal errors for imported resources.

Changes

API Changes

  • Added resyncPeriod field to the code generator spec template for per-resource resync configuration
  • Added lastSyncTime field to the code generator status template to track successful reconciliation timestamps
  • Added GetResyncPeriod(), GetLastSyncTime(), and IsImported() methods to APIObjectAdapter interface

CLI & Manager

  • Added --default-resync-period CLI flag with default value of 0 (disabled)
  • Added ResyncConfigurable interface to propagate global default to all controllers

Core Resync Logic

  • Created internal/controllers/generic/resync/ package with:
    • DetermineResyncPeriod(): resolves effective period (per-resource → global → disabled)
    • CalculateJitteredDuration(): wraps wait.Jitter with 20% positive-only jitter
    • ShouldScheduleResync(): guards against scheduling for disabled/terminal/pending-requeue states
  • Modified shouldReconcile() to trigger reconciliation when resync period has elapsed
  • Modified reconcileNormal() to schedule jittered requeue after successful reconciliation

External Deletion Handling

  • Modified GetOrCreateOSResource() to detect 404s when fetching by status.id
  • Returns typed ExternallyDeleted signal via ReconcileStatus for managed non-imported resources
  • Restored safety guard for unexpected nil returns from actuators
  • Imported/unmanaged resources return terminal error
  • Added ClearStatusID() to status writer for clearing ID before recreation

Status Updates

  • Modified UpdateStatus() to set lastSyncTime only on successful reconciliation
  • Fixed Available condition: terminal errors now set False instead of Unknown

Code Generation

  • Updated api.template and adapter.template with new fields and methods
  • Regenerated all resources, CRDs, deep copy, and client code

Documentation

  • Created website/docs/user-guide/drift-detection.md
  • Updated enhancements/drift-detection.md with implementation status

Commit structure

  1. API fields + templates — hand-written type and template changes
  2. Regenerate — all generated code (CRDs, adapters, apply configs, OpenAPI, docs)
  3. Resync scheduling — scheduler, controller integration, CLI flag, all controller wiring
  4. E2E tests for resync — kuttl test suites + integration tests
  5. External deletion — detection, recreation, condition fix, unit + E2E tests
  6. Documentation — user guide and enhancement doc
  7. Design update — update enhancement proposal with typed ExternallyDeleted signal

Design decisions

  • Typed ExternallyDeleted signal: The original proposal used a (nil, nil) return from GetOrCreateOSResource to signal external deletion. This was replaced with a typed ExternallyDeleted flag on ReconcileStatus, which avoids ambiguity with actuator bugs that might also return nil and restores a safety guard for unexpected nil returns from GetOSResourceByID.
  • wait.Jitter reuse: CalculateJitteredDuration wraps k8s.io/apimachinery/pkg/util/wait.Jitter rather than reimplementing the same [base, base*1.2) jitter range.
  • Removed dead CommonStatus/CommonOptions types: These types were declared but never embedded — the code generator inlines the fields directly. They were removed to avoid misleading readers.

Testing

  • Unit tests for resync package, shouldReconcile, status updates, external deletion (all policy/import combinations)
  • 6 kuttl E2E test suites: resync-period, resync-disabled, resync-terminal-error, resync-jitter, external-deletion, external-deletion-import
  • go build, go vet, make lint all pass

@github-actions github-actions Bot added the semver:minor Backwards-compatible change label Apr 19, 2026
@eshulman2 eshulman2 changed the title [AISOS-376] Implementation for AISOS-376 [AISOS-376] Add drift detection and periodic resync for OpenStack resources Apr 20, 2026

@eshulman2 eshulman2 left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are some issues especially with docs

Comment thread internal/controllers/generic/reconciler/controller.go Outdated
Comment thread internal/controllers/generic/reconciler/resource_actions.go Outdated
Comment thread internal/controllers/generic/reconciler/controller.go Outdated
Comment thread api/v1alpha1/controller_options.go Outdated
Comment thread internal/controllers/generic/status/status_test.go Outdated
Comment thread internal/controllers/generic/resync/scheduler_test.go Outdated
Comment thread enhancements/drift-detection.md Outdated
@mandre mandre added the hold Do not merge label Apr 23, 2026
@mandre

mandre commented Apr 23, 2026

Copy link
Copy Markdown
Collaborator

Adding the hold label because this is an experiment.

@eshulman2 eshulman2 changed the title [AISOS-376] Add drift detection and periodic resync for OpenStack resources Add drift detection and periodic resync for OpenStack resources May 25, 2026
@eshulman2 eshulman2 force-pushed the forge/aisos-376 branch 4 times, most recently from 9138a1d to da9fb9b Compare May 31, 2026 08:28
@eshulman2

Copy link
Copy Markdown
Contributor Author

@mandre I revised the PR myself removing any indication of internal references I think it is ready for review

Comment thread .gitignore Outdated
@eshulman2 eshulman2 requested a review from dlaw4608 June 9, 2026 06:42
eshulman2 added 8 commits June 9, 2026 09:53
Add resyncPeriod to the code generator spec template allowing
per-resource configuration of how frequently the controller
re-reconciles even when no changes are detected. Add lastSyncTime to the
status template to track when the last successful resync occurred.

Update the adapter template and interface to expose GetResyncPeriod(),
GetLastSyncTime(), and IsImported() methods.
Run make generate to update all generated code after API field and
template changes.
Add --default-resync-period CLI flag to the manager for setting a global
default resync period. Implement DetermineResyncPeriod to resolve the
effective period using per-resource then global default precedence.

Create resync scheduler package using wait.Jitter for jittered duration
calculation. Modify shouldReconcile to support periodic resync by
checking lastSyncTime against the effective resync period.

Integrate resync requeue scheduling into reconcileNormal and update the
status writer to set lastSyncTime on successful reconciliation. Wire
the DefaultResyncPeriod from manager options through to all controllers.

Add typed ExternallyDeleted signal on ReconcileStatus for safe
detection of externally deleted OpenStack resources.
Add E2E tests covering resync period, resync jitter, resync disabled,
and terminal error behavior during resync. Add integration tests for
handling unmanaged resources during resync to verify that imported
resources without resyncPeriod skip requeue scheduling.
Modify GetOrCreateOSResource to detect when a managed, non-imported
resource has been deleted externally from OpenStack. Signal the caller
via the typed ExternallyDeleted ReconcileStatus so it can clear
status.id and trigger recreation. Restore the safety guard for
unexpected nil returns from actuators.

Fix Available condition to correctly transition from Unknown to False
when a terminal error is present during resync.

Include unit tests for external deletion handling and E2E tests
covering both managed resource recreation and imported resource
terminal error behavior.
Add user-facing documentation covering periodic resync configuration,
drift detection behavior, and external deletion handling. Update the
enhancement proposal to reflect the final two-tier resync period
resolution design.
Replace the (nil, nil) sentinel pattern with a typed ExternallyDeleted
signal on ReconcileStatus. This is a design improvement over the
original proposal: using a dedicated signal prevents ambiguity with
actuator bugs that might also return nil, and restores the safety guard
for unexpected nil returns from GetOSResourceByID.

@dlaw4608 dlaw4608 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @eshulman2 , I ran it locally following the drift-detection.md and it works well, also ran some of the E2E tests, I left a few comments, please let me know what you think?

Comment thread website/docs/user-guide/drift-detection.md Outdated
Comment thread internal/controllers/generic/reconciler/resource_actions.go Outdated
Comment thread internal/controllers/generic/reconciler/resource_actions_test.go Outdated
Comment thread internal/controllers/generic/reconciler/resource_actions_test.go Outdated
Comment thread internal/controllers/generic/resync/scheduler.go Outdated
@eshulman2 eshulman2 requested a review from dlaw4608 June 10, 2026 13:24
- Simplify external-deletion condition: managed+import is rejected by
  CEL validation, so !IsImported() was redundant. Remove the dead check
  and delete the three unit tests that exercised the impossible scenario.
- Fix drift-detection.md: remove impossible managed+import rows from the
  summary table and the invalid YAML example showing managed+import.
- Strip TS-XXX internal labels from comments across all affected files.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Comment thread website/docs/user-guide/drift-detection.md
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@eshulman2 eshulman2 requested a review from dlaw4608 June 11, 2026 12:10

@winiciusallan winiciusallan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @eshulman2, thanks for working on this. This is such a nice feature to have in ORC.

I've left a few comments along all the changes. Let me know what you think.


## Drift Detection Without Resync

Even with `resyncPeriod: 0` (the default, disabled), ORC will still detect external deletion when another event triggers reconciliation — for example, when you make a spec change or the controller restarts. The recreation or terminal error behavior is the same; the difference is only in how quickly ORC detects the deletion.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, are you claiming here that ORC still detects external deletion and sync it accordingly?

Checking here, deleting a recently created port and changing its spec (e.g. updating the description) makes the resource transitions to this state:

❯ k get port
NAME                  ID                                     AVAILABLE   ADDRESSES   MESSAGE
port-create-minimal   65ddd01a-f803-4fb3-879c-716dbd435984   Unknown                 resource has been deleted from OpenStack

I just want to make sure that I am missing nothing.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you add this changes manually or using an agent?

I think we could add these in the scaffold-controller so new controllers can be created with the defaultResyncPeriod config. What do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe if we're wiling to add these tests for the adapters, we should do this consistently across all controllers. This could be done in a separated PR; this PR is quite big already =).

Does it make sense?

4. If `GetOSResourceByID` returns a nil resource with no error (actuator bug), return an explicit error rather than silently misinterpreting it as external deletion

This ensures that managed resources created by ORC are automatically recreated, while imported or unmanaged resources correctly fail with a terminal error when deleted externally.
This ensures that managed resources created by ORC are automatically recreated, while imported or unmanaged resources correctly fail with a terminal error when deleted externally. The typed signal avoids ambiguity with nil returns from actuator bugs.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, what are the bugs? Something that we should fix?

// existing condition-based behaviour is unchanged.
//
// The resync check uses the persisted lastSyncTime so that controller restarts
// respect the time already elapsed, preventing a thundering herd\.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// respect the time already elapsed, preventing a thundering herd\.
// respect the time already elapsed, preventing a thundering herd.


// Schedule a resync requeue when the effective resync period is configured,
// there is no terminal error, and no other requeue is already pending.
// Positive-only jitter of [0%, +20%] is applied to spread load across

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be another effort, but I would like to leave the idea here: Would it be good to allow the operator/user to configure the jitter interval?

@@ -0,0 +1,44 @@
/*
Copyright 2024 The ORC Authors.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have dropped years from the copyright. #629

We can use Copyright The ORC Authors. instead. Ditto for the other new files.

ref: network
assertAll:
# Verify the OpenStack ID is set before we delete the network externally.
- celExpr: "has(network.status.id) && network.status.id != ''"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a feeling that checking network.status.id != '' is enough to verify that the resource ID was set. If not set, the kuttl will catch it in an assertion error.

This is not a big deal, but since there are a few checks of this kind, we can remove them and decrease the number of executed celExpr. Makes sense?


Record the initial `lastSyncTime` for all three networks.

## Step 02

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The described action in Step 2 was done in Step 1 following the filename 01-record-sync-times.yaml

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hold Do not merge semver:minor Backwards-compatible change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants