Add drift detection and periodic resync for OpenStack resources#759
Add drift detection and periodic resync for OpenStack resources#759eshulman2 wants to merge 10 commits into
Conversation
5072ca0 to
3ebabf9
Compare
|
Adding the |
9138a1d to
da9fb9b
Compare
|
@mandre I revised the PR myself removing any indication of internal references I think it is ready for review |
2f59e67 to
03223a3
Compare
Add resyncPeriod to the code generator spec template allowing per-resource configuration of how frequently the controller re-reconciles even when no changes are detected. Add lastSyncTime to the status template to track when the last successful resync occurred. Update the adapter template and interface to expose GetResyncPeriod(), GetLastSyncTime(), and IsImported() methods.
Run make generate to update all generated code after API field and template changes.
Add --default-resync-period CLI flag to the manager for setting a global default resync period. Implement DetermineResyncPeriod to resolve the effective period using per-resource then global default precedence. Create resync scheduler package using wait.Jitter for jittered duration calculation. Modify shouldReconcile to support periodic resync by checking lastSyncTime against the effective resync period. Integrate resync requeue scheduling into reconcileNormal and update the status writer to set lastSyncTime on successful reconciliation. Wire the DefaultResyncPeriod from manager options through to all controllers. Add typed ExternallyDeleted signal on ReconcileStatus for safe detection of externally deleted OpenStack resources.
Add E2E tests covering resync period, resync jitter, resync disabled, and terminal error behavior during resync. Add integration tests for handling unmanaged resources during resync to verify that imported resources without resyncPeriod skip requeue scheduling.
Modify GetOrCreateOSResource to detect when a managed, non-imported resource has been deleted externally from OpenStack. Signal the caller via the typed ExternallyDeleted ReconcileStatus so it can clear status.id and trigger recreation. Restore the safety guard for unexpected nil returns from actuators. Fix Available condition to correctly transition from Unknown to False when a terminal error is present during resync. Include unit tests for external deletion handling and E2E tests covering both managed resource recreation and imported resource terminal error behavior.
Add user-facing documentation covering periodic resync configuration, drift detection behavior, and external deletion handling. Update the enhancement proposal to reflect the final two-tier resync period resolution design.
Replace the (nil, nil) sentinel pattern with a typed ExternallyDeleted signal on ReconcileStatus. This is a design improvement over the original proposal: using a dedicated signal prevents ambiguity with actuator bugs that might also return nil, and restores the safety guard for unexpected nil returns from GetOSResourceByID.
03223a3 to
e29ca26
Compare
dlaw4608
left a comment
There was a problem hiding this comment.
Hey @eshulman2 , I ran it locally following the drift-detection.md and it works well, also ran some of the E2E tests, I left a few comments, please let me know what you think?
- Simplify external-deletion condition: managed+import is rejected by CEL validation, so !IsImported() was redundant. Remove the dead check and delete the three unit tests that exercised the impossible scenario. - Fix drift-detection.md: remove impossible managed+import rows from the summary table and the invalid YAML example showing managed+import. - Strip TS-XXX internal labels from comments across all affected files. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
winiciusallan
left a comment
There was a problem hiding this comment.
Hi @eshulman2, thanks for working on this. This is such a nice feature to have in ORC.
I've left a few comments along all the changes. Let me know what you think.
|
|
||
| ## Drift Detection Without Resync | ||
|
|
||
| Even with `resyncPeriod: 0` (the default, disabled), ORC will still detect external deletion when another event triggers reconciliation — for example, when you make a spec change or the controller restarts. The recreation or terminal error behavior is the same; the difference is only in how quickly ORC detects the deletion. |
There was a problem hiding this comment.
Just to confirm, are you claiming here that ORC still detects external deletion and sync it accordingly?
Checking here, deleting a recently created port and changing its spec (e.g. updating the description) makes the resource transitions to this state:
❯ k get port
NAME ID AVAILABLE ADDRESSES MESSAGE
port-create-minimal 65ddd01a-f803-4fb3-879c-716dbd435984 Unknown resource has been deleted from OpenStackI just want to make sure that I am missing nothing.
There was a problem hiding this comment.
Did you add this changes manually or using an agent?
I think we could add these in the scaffold-controller so new controllers can be created with the defaultResyncPeriod config. What do you think?
There was a problem hiding this comment.
I believe if we're wiling to add these tests for the adapters, we should do this consistently across all controllers. This could be done in a separated PR; this PR is quite big already =).
Does it make sense?
| 4. If `GetOSResourceByID` returns a nil resource with no error (actuator bug), return an explicit error rather than silently misinterpreting it as external deletion | ||
|
|
||
| This ensures that managed resources created by ORC are automatically recreated, while imported or unmanaged resources correctly fail with a terminal error when deleted externally. | ||
| This ensures that managed resources created by ORC are automatically recreated, while imported or unmanaged resources correctly fail with a terminal error when deleted externally. The typed signal avoids ambiguity with nil returns from actuator bugs. |
There was a problem hiding this comment.
Out of curiosity, what are the bugs? Something that we should fix?
| // existing condition-based behaviour is unchanged. | ||
| // | ||
| // The resync check uses the persisted lastSyncTime so that controller restarts | ||
| // respect the time already elapsed, preventing a thundering herd\. |
There was a problem hiding this comment.
| // respect the time already elapsed, preventing a thundering herd\. | |
| // respect the time already elapsed, preventing a thundering herd. |
|
|
||
| // Schedule a resync requeue when the effective resync period is configured, | ||
| // there is no terminal error, and no other requeue is already pending. | ||
| // Positive-only jitter of [0%, +20%] is applied to spread load across |
There was a problem hiding this comment.
This could be another effort, but I would like to leave the idea here: Would it be good to allow the operator/user to configure the jitter interval?
| @@ -0,0 +1,44 @@ | |||
| /* | |||
| Copyright 2024 The ORC Authors. | |||
There was a problem hiding this comment.
We have dropped years from the copyright. #629
We can use Copyright The ORC Authors. instead. Ditto for the other new files.
| ref: network | ||
| assertAll: | ||
| # Verify the OpenStack ID is set before we delete the network externally. | ||
| - celExpr: "has(network.status.id) && network.status.id != ''" |
There was a problem hiding this comment.
I have a feeling that checking network.status.id != '' is enough to verify that the resource ID was set. If not set, the kuttl will catch it in an assertion error.
This is not a big deal, but since there are a few checks of this kind, we can remove them and decrease the number of executed celExpr. Makes sense?
|
|
||
| Record the initial `lastSyncTime` for all three networks. | ||
|
|
||
| ## Step 02 |
There was a problem hiding this comment.
The described action in Step 2 was done in Step 1 following the filename 01-record-sync-times.yaml
Summary
Implements periodic resync and external deletion handling for the OpenStack Resource Controller. When enabled, ORC periodically reconciles resources with OpenStack to detect configuration drift and external deletions, automatically recreating managed resources that were deleted outside of ORC while surfacing terminal errors for imported resources.
Changes
API Changes
resyncPeriodfield to the code generator spec template for per-resource resync configurationlastSyncTimefield to the code generator status template to track successful reconciliation timestampsGetResyncPeriod(),GetLastSyncTime(), andIsImported()methods toAPIObjectAdapterinterfaceCLI & Manager
--default-resync-periodCLI flag with default value of 0 (disabled)ResyncConfigurableinterface to propagate global default to all controllersCore Resync Logic
internal/controllers/generic/resync/package with:DetermineResyncPeriod(): resolves effective period (per-resource → global → disabled)CalculateJitteredDuration(): wrapswait.Jitterwith 20% positive-only jitterShouldScheduleResync(): guards against scheduling for disabled/terminal/pending-requeue statesshouldReconcile()to trigger reconciliation when resync period has elapsedreconcileNormal()to schedule jittered requeue after successful reconciliationExternal Deletion Handling
GetOrCreateOSResource()to detect 404s when fetching bystatus.idExternallyDeletedsignal viaReconcileStatusfor managed non-imported resourcesClearStatusID()to status writer for clearing ID before recreationStatus Updates
UpdateStatus()to setlastSyncTimeonly on successful reconciliationAvailablecondition: terminal errors now setFalseinstead ofUnknownCode Generation
api.templateandadapter.templatewith new fields and methodsDocumentation
website/docs/user-guide/drift-detection.mdenhancements/drift-detection.mdwith implementation statusCommit structure
Design decisions
(nil, nil)return fromGetOrCreateOSResourceto signal external deletion. This was replaced with a typedExternallyDeletedflag onReconcileStatus, which avoids ambiguity with actuator bugs that might also return nil and restores a safety guard for unexpected nil returns fromGetOSResourceByID.wait.Jitterreuse:CalculateJitteredDurationwrapsk8s.io/apimachinery/pkg/util/wait.Jitterrather than reimplementing the same [base, base*1.2) jitter range.CommonStatus/CommonOptionstypes: These types were declared but never embedded — the code generator inlines the fields directly. They were removed to avoid misleading readers.Testing
go build,go vet,make lintall pass