OLS-3221 Add PostgreSQL auto-recovery after DB restart by sriroopar · Pull Request #2964 · openshift/lightspeed-service

sriroopar · 2026-06-24T17:45:21Z

Implement background health-check loop, smarter error classification in the @connection decorator, operation timeouts, and enhanced liveness/readiness probes so OLS can recover automatically when the backing PostgreSQL database is restarted.

Description

Type of change

Related Tickets & Documents

Related Issue #
Closes #

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests.

Testing

Please provide detailed steps to perform tests related to this code change.
How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

New Features
- Improved service health checks for PostgreSQL-backed caches, with /liveness now able to report a 503 status when the database is unavailable.
- Added new timeout and health-check configuration options for PostgreSQL cache behavior.
Bug Fixes
- Strengthened cache connection handling and retry behavior during temporary database or connection failures.
- ready() now reflects the latest background health status more reliably.
Tests
- Expanded coverage for liveness behavior, configuration defaults, and PostgreSQL connection recovery scenarios.

coderabbitai · 2026-06-24T17:45:38Z

📝 Walkthrough

Walkthrough

The PR adds new Postgres cache timeout and liveness-threshold configuration, updates Postgres connection and cache health handling, and changes /liveness to return 503 when the cache failure threshold is reached. Tests and OpenAPI output were updated to match.

Changes

Postgres cache health and liveness

Layer / File(s)	Summary
Configuration defaults and thresholds `ols/constants.py`, `ols/app/models/config.py`, `tests/unit/app/models/test_config.py`	New cache timeout constants and liveness threshold values are added, and `PostgresConfig` and `OLSConfig` expose those defaults and overrides.
Connection retry and statement timeout `ols/utils/postgres.py`, `tests/unit/utils/test_postgres.py`	The Postgres connection decorator retries connection-level errors once, wraps database errors in `CacheError`, and `connect()` sets `statement_timeout`.
Cache health state and readiness `ols/src/cache/postgres_cache.py`, `tests/unit/cache/test_postgres_cache.py`	`PostgresCache` starts a health-check thread, tracks health state and failure counts, `set_topic_summary` follows the new lock and error handling, and `ready()` uses the cached health state.
Cache locking and error handling `ols/src/cache/postgres_cache.py`, `tests/unit/cache/test_postgres_cache.py`	The remaining cache read/write methods use bounded locks and separate operational/interface errors from database errors; tests cover lock timeouts.
Liveness endpoint and OpenAPI contract `ols/app/endpoints/health.py`, `docs/openapi.json`, `tests/unit/app/endpoints/test_health.py`	`/liveness` returns 503 for Postgres cache threshold breaches, and the OpenAPI spec and endpoint tests include the new response.

Sequence Diagram(s)

sequenceDiagram
  participant "Client"
  participant "liveness_probe_get_method"
  participant "config._conversation_cache"
  participant "PostgresCache"
  participant "HTTPException"

  Client->>liveness_probe_get_method: GET /liveness
  liveness_probe_get_method->>config._conversation_cache: inspect cache implementation
  alt config._conversation_cache is PostgresCache
    liveness_probe_get_method->>PostgresCache: read _consecutive_failures under _health_lock
    alt failures >= liveness_db_failure_threshold
      liveness_probe_get_method->>HTTPException: raise 503 "database unreachable"
    else below threshold
      liveness_probe_get_method-->>Client: LivenessResponse(alive=True)
    end
  else other cache
    liveness_probe_get_method-->>Client: LivenessResponse(alive=True)
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly matches the main change: PostgreSQL auto-recovery after a database restart.
Docstring Coverage	✅ Passed	Docstring coverage is 96.72% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

openshift-ci · 2026-06-24T17:46:17Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign blublinsky for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 12

🧹 Nitpick comments (1)

tests/unit/cache/test_postgres_cache.py (1)
836-865: 📐 Maintainability & Code Quality | 🔵 Trivial | 🏗️ Heavy lift

Exercise the health-check behavior instead of assigning its result.

These tests directly set _health_status and _consecutive_failures, so they would still pass if _health_check_loop() stopped updating either field. Extract a single-iteration health-check helper or inject a controllable sleep/event so the tests drive the success and failure paths. As per coding guidelines, tests/**/*.py: Assert specific values and behaviors in tests, not just that code runs without error.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/cache/test_postgres_cache.py` around lines 836 - 865, The
health-check tests are bypassing the real behavior by mutating PostgresCache
internals directly, so they won’t catch regressions in _health_check_loop().
Update the tests to drive the loop logic through a single-iteration helper or a
controllable sleep/event on PostgresCache, then assert the resulting
_health_status and _consecutive_failures values after simulated success/failure.
Use the existing PostgresCache, _health_check_loop, _mark_unhealthy, and ready()
symbols to locate the behavior and keep the assertions focused on observable
state changes rather than direct field assignment.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ols/app/endpoints/health.py`:
- Around line 126-129: The 503 liveness response in the health endpoint is
returning an HTTPException shape instead of the declared LivenessResponse
contract. Update the liveness failure path in the health endpoint so the 503
body matches LivenessResponse (for example, by returning that model directly),
or change the documented 503 response model to the actual error payload; keep
the response schema and the OpenAPI declaration in sync around the liveness
handler.

In `@ols/app/models/config.py`:
- Around line 801-803: The config model currently accepts negative values for
statement_timeout, lock_timeout, and health_check_interval, so tighten
validation in the Pydantic config class that defines these fields. Add explicit
field constraints or validators on statement_timeout and lock_timeout to reject
negative integers, and make health_check_interval strictly positive so the
health-check loop and PostgresCache lock acquisition keep their timing
guarantees. Use the existing config model/field definitions in Config to enforce
these contracts at load time rather than downstream.
- Around line 1166-1167: `OLSConfig.__init__` is currently taking
`liveness_db_failure_threshold` directly from `data.get(...)`, so it can remain
a string or accept invalid values. Parse this field as an int when loading
config, and validate that the value is at least 1; if it is missing,
non-numeric, zero, or negative, raise `checks.InvalidConfigurationError` before
the value is used by `/liveness`.

In `@ols/src/cache/postgres_cache.py`:
- Around line 159-165: Validate the Postgres cache timing settings before they
reach the daemon: `health_check_interval` must not allow negative values or
zero-hot-loop behavior, and `lock_timeout` should reject negative values unless
indefinite blocking is explicitly intended. Add Pydantic validation in
`PostgresConfig` (using appropriate constrained types such as
non-negative/positive ints) and keep the runtime consumers in `PostgresCache`
safe by relying on those validated values in the initialization and health-check
logic. Update any related uses in `PostgresCache` and its config wiring so the
daemon never calls `time.sleep()` or lock acquisition with invalid values.
- Around line 189-219: The health-check path in _connect_health and
_health_check_loop can block indefinitely because neither the psycopg2.connect
call nor the SELECT 1 probe has a timeout. Update the PostgresCache health
connection setup to pass an explicit connect timeout in connect_kwargs and add a
statement timeout before executing the cursor probe, so ready() does not wait on
a hung network or server and _health_status can recover promptly.
- Around line 194-195: The broad exception handling in PostgresCache needs to be
narrowed to satisfy Ruff BLE001. Update the catch blocks in _connect_health,
_health_check_loop, _safe_rollback, and _safe_set_autocommit to handle only the
specific psycopg2 or attribute-related exceptions they can actually raise, and
keep the existing logger/debug behavior in those paths. If any catch-all is
intentionally required to keep the daemon alive, add an explicit # noqa: BLE001
on that except clause rather than leaving a bare except Exception.

In `@ols/utils/postgres.py`:
- Around line 106-108: The new cursor in the statement-timeout setup can leak if
`cursor.execute` fails during schema initialization. Update the `Postgres`
initialization flow around the `cursor = self.connection.cursor()` and `SET
statement_timeout` call to use a `try/finally` so `cursor.close()` always runs,
even on exceptions; keep the fix localized to this cursor-handling block.
- Around line 29-45: The wrapper in postgres connection handling is calling
connectable.connect() before the try/except, so a failed initial connect skips
the same recovery path as query failures. Move the initial connect() call into
the connection-error handling flow inside wrapper() so both the first connect
and the retry go through the same psycopg2.OperationalError/InterfaceError
handling, including _mark_unhealthy() and CacheError wrapping in the reconnect
path.
- Around line 33-45: The `@connection` retry logic in the postgres decorator is
causing duplicate writes for methods like PostgresCache.insert_or_append because
the operation may have already committed before reconnect and retry. Update the
retry behavior in the connection wrapper to avoid automatically retrying
write/mutating methods after OperationalError/InterfaceError, and keep the retry
path only for safe read operations or explicitly idempotent calls identified by
the wrapped function name.

In `@tests/unit/cache/test_postgres_cache.py`:
- Around line 20-23: The new pytest fixture and several test functions in
Postgres cache tests are missing type annotations. Update _suppress_health_loop
to declare Generator[None, None, None] as its return type, and add -> None to
the affected test functions in test_postgres_cache so the new autouse fixture
and tests are fully typed and consistent with the rest of the module.

In `@tests/unit/utils/test_postgres.py`:
- Around line 108-147: Add explicit -> None return annotations to the new test
methods in test_postgres.py, including the helper failing_connect nested in
test_connection_error_reconnect_failure_raises_cache_error, to keep the test
signatures consistent. Also tighten
test_database_error_wraps_in_cache_error_no_retry by asserting the Connectable
instance does not attempt reconnect/retry when do_work_database_error is called;
use the existing Connectable, connect, and do_work_database_error symbols to
place the assertion near the current CacheError check.
- Around line 124-129: The test
`test_database_error_wraps_in_cache_error_no_retry` only verifies that
`do_work_database_error()` raises `CacheError`, but it does not prove the
reconnect path was skipped. After the initial `c.connect()` call, patch or mock
`Connectable.connect` and assert it is not called when
`do_work_database_error()` raises, so this test checks the “no retry” behavior
as well as the `CacheError` wrapping.

---

Nitpick comments:
In `@tests/unit/cache/test_postgres_cache.py`:
- Around line 836-865: The health-check tests are bypassing the real behavior by
mutating PostgresCache internals directly, so they won’t catch regressions in
_health_check_loop(). Update the tests to drive the loop logic through a
single-iteration helper or a controllable sleep/event on PostgresCache, then
assert the resulting _health_status and _consecutive_failures values after
simulated success/failure. Use the existing PostgresCache, _health_check_loop,
_mark_unhealthy, and ready() symbols to locate the behavior and keep the
assertions focused on observable state changes rather than direct field
assignment.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3804766b-967a-49bb-b307-520f135d2133

📥 Commits

Reviewing files that changed from the base of the PR and between b51ca02 and 9868f61.

📒 Files selected for processing (10)

ols/app/endpoints/health.py
ols/app/models/config.py
ols/app/models/models.py
ols/constants.py
ols/src/cache/postgres_cache.py
ols/utils/postgres.py
tests/unit/app/endpoints/test_health.py
tests/unit/app/models/test_config.py
tests/unit/cache/test_postgres_cache.py
tests/unit/utils/test_postgres.py

coderabbitai · 2026-06-24T17:54:44Z

+    503: {
+        "description": "Service is not alive",
+        "model": LivenessResponse,
+    },


🗄️ Data Integrity & Integration | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Map relevant files first. git ls-files | rg '^ols/app/endpoints/health\.py$|^.*health.*test.*\.py$|^.*test.*health.*\.py$|^.*liveness.*test.*\.py$|^.*health.*\.py$' # Show outline for the target file. ast-grep outline ols/app/endpoints/health.py --view expanded # Inspect the relevant line range with numbers. sed -n '1,240p' ols/app/endpoints/health.py | cat -n # Search for liveness/503 response declarations and tests. rg -n "liveness|503|LivenessResponse|HTTPException|JSONResponse|get_liveness_responses" ols -g '!**/__pycache__/**'

Repository: openshift/lightspeed-service

Length of output: 12991

🏁 Script executed:

#!/bin/bash set -euo pipefail # Show tests related to the health endpoint if present. rg -n "liveness_probe_get_method|LivenessResponse|database unreachable|service is not alive|HTTP_503_SERVICE_UNAVAILABLE" . -g '*test*.py' -g '*tests*.py' # Print the nearby definitions in the health endpoint. sed -n '1,220p' ols/app/endpoints/health.py | cat -n

Repository: openshift/lightspeed-service

Length of output: 7441

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the exact health endpoint implementation and adjacent helpers. sed -n '1,220p' ols/app/endpoints/health.py | cat -n

Repository: openshift/lightspeed-service

Length of output: 6023

Align the 503 liveness response with LivenessResponse.

raise HTTPException(detail={"alive": False, "reason": ...}) returns {"detail": ...}, which does not match the declared 503 LivenessResponse schema or the generated OpenAPI contract. Return a LivenessResponse payload with the 503 response body, or change the documented 503 model to match the actual error shape.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ols/app/endpoints/health.py` around lines 126 - 129, The 503 liveness response in the health endpoint is returning an HTTPException shape instead of the declared LivenessResponse contract. Update the liveness failure path in the health endpoint so the 503 body matches LivenessResponse (for example, by returning that model directly), or change the documented 503 response model to the actual error payload; keep the response schema and the OpenAPI declaration in sync around the liveness handler.

Source: Coding guidelines

coderabbitai · 2026-06-24T17:54:45Z

+    statement_timeout: int = constants.POSTGRES_STATEMENT_TIMEOUT
+    lock_timeout: int = constants.POSTGRES_LOCK_TIMEOUT
+    health_check_interval: int = constants.CACHE_HEALTH_CHECK_INTERVAL


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Reject invalid timeout and interval values.

These fields currently allow negative integers. That breaks the new recovery path: a negative health_check_interval will terminate the health-check thread on time.sleep(), and a negative lock_timeout breaks the bounded-wait guarantee in PostgresCache._acquire_lock(). Validate these at config load and keep health_check_interval strictly positive.

Suggested constraint tightening

- statement_timeout: int = constants.POSTGRES_STATEMENT_TIMEOUT - lock_timeout: int = constants.POSTGRES_LOCK_TIMEOUT - health_check_interval: int = constants.CACHE_HEALTH_CHECK_INTERVAL + statement_timeout: NonNegativeInt = constants.POSTGRES_STATEMENT_TIMEOUT + lock_timeout: NonNegativeInt = constants.POSTGRES_LOCK_TIMEOUT + health_check_interval: PositiveInt = constants.CACHE_HEALTH_CHECK_INTERVAL

As per coding guidelines, "Use Pydantic models for configuration and data validation" and "Define function signatures and data shapes (interface contracts) before implementing; prefer explicit contracts over implicit ones".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

statement_timeout: int = constants.POSTGRES_STATEMENT_TIMEOUT

lock_timeout: int = constants.POSTGRES_LOCK_TIMEOUT

health_check_interval: int = constants.CACHE_HEALTH_CHECK_INTERVAL

statement_timeout: NonNegativeInt = constants.POSTGRES_STATEMENT_TIMEOUT

lock_timeout: NonNegativeInt = constants.POSTGRES_LOCK_TIMEOUT

health_check_interval: PositiveInt = constants.CACHE_HEALTH_CHECK_INTERVAL

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ols/app/models/config.py` around lines 801 - 803, The config model currently accepts negative values for statement_timeout, lock_timeout, and health_check_interval, so tighten validation in the Pydantic config class that defines these fields. Add explicit field constraints or validators on statement_timeout and lock_timeout to reject negative integers, and make health_check_interval strictly positive so the health-check loop and PostgresCache lock acquisition keep their timing guarantees. Use the existing config model/field definitions in Config to enforce these contracts at load time rather than downstream.

Source: Coding guidelines

coderabbitai · 2026-06-24T17:54:45Z

+    liveness_db_failure_threshold: int = constants.LIVENESS_DB_FAILURE_THRESHOLD
+


🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the relevant config model and surrounding initialization logic. FILE="ols/app/models/config.py" wc -l "$FILE" printf '\n--- outline ---\n' ast-grep outline "$FILE" --view expanded | sed -n '1,220p' printf '\n--- relevant line ranges ---\n' sed -n '1120,1285p' "$FILE"

Repository: openshift/lightspeed-service

Length of output: 17658

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect supporting definitions used by the config model. rg -n "liveness_db_failure_threshold|LIVENESS_DB_FAILURE_THRESHOLD|InvalidConfigurationError|class OLSConfig|def __init__" ols/app/models/config.py ols/utils/checks.py ols/app/models -S

Repository: openshift/lightspeed-service

Length of output: 13448

🏁 Script executed:

#!/bin/bash set -euo pipefail # Read the constructor and any validation helpers around the cited fields. FILE="ols/app/models/config.py" # Show the class definition and init method ranges if present. ast-grep outline "$FILE" --match OLSConfig --view expanded | sed -n '1,260p' printf '\n--- constructor / field assignments ---\n' rg -n "def __init__|liveness_db_failure_threshold|data.get\\(\"liveness_db_failure_threshold\"|InvalidConfigurationError" "$FILE" -n -S -A 4 -B 4

Repository: openshift/lightspeed-service

Length of output: 26243

🏁 Script executed:

#!/bin/bash set -euo pipefail # Check whether assignment validation is enabled anywhere for this model hierarchy. rg -n "validate_assignment|model_config|ConfigDict|class BaseModel" ols/app/models/config.py ols/app/models -S printf '\n--- OLSConfig / BaseModel relevant slice ---\n' sed -n '1,120p' ols/app/models/config.py

Repository: openshift/lightspeed-service

Length of output: 5473

🏁 Script executed:

#!/bin/bash set -euo pipefail # Find where the liveness threshold is consumed. rg -n "liveness_db_failure_threshold|/liveness|failures >= threshold|failure_threshold" ols -S

Repository: openshift/lightspeed-service

Length of output: 829

🏁 Script executed:

#!/bin/bash set -euo pipefail # Read the liveness check implementation. rg -n "/liveness|liveness" ols -S -A 6 -B 6 | sed -n '1,240p'

Repository: openshift/lightspeed-service

Length of output: 7132

🏁 Script executed:

#!/bin/bash set -euo pipefail python3 - <<'PY' from pathlib import Path text = Path("ols/app/models/config.py").read_text() for needle in ["validate_assignment", "model_config", "ConfigDict", "BaseModel"]: print(f"{needle}: {needle in text}") PY printf '\n--- liveness references ---\n' rg -n "liveness_db_failure_threshold|/liveness|failures >?=|failure" ols -S -A 3 -B 3

Repository: openshift/lightspeed-service

Length of output: 10511

🏁 Script executed:

#!/bin/bash set -euo pipefail # Check whether Pydantic assignment validation is enabled for OLSConfig or its base classes. python3 - <<'PY' from pathlib import Path text = Path("ols/app/models/config.py").read_text() for needle in ("validate_assignment", "ConfigDict", "model_config"): print(f"{needle}:", needle in text) PY printf '\n--- liveness usage ---\n' rg -n "liveness_db_failure_threshold|/liveness|failures >=|failures >=" ols -S -A 4 -B 4

Repository: openshift/lightspeed-service

Length of output: 3476

🏁 Script executed:

#!/bin/bash set -euo pipefail # Check whether Pydantic assignment validation is enabled for OLSConfig or its base classes. python3 - <<'PY' from pathlib import Path text = Path("ols/app/models/config.py").read_text() for needle in ("validate_assignment", "ConfigDict", "model_config"): print(f"{needle}:", needle in text) PY printf '\n--- liveness usage ---\n' rg -n "liveness_db_failure_threshold|/liveness|failures >=|failures >=" ols -S -A 4 -B 4

Repository: openshift/lightspeed-service

Length of output: 3476

Validate liveness_db_failure_threshold in OLSConfig.__init__. data.get(...) bypasses Pydantic here, so a YAML string like "3" reaches /liveness as a string and breaks failures >= threshold; 0 or negative values also force 503 on Postgres-backed instances. Parse it as an int and reject values < 1 with checks.InvalidConfigurationError.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ols/app/models/config.py` around lines 1166 - 1167, `OLSConfig.__init__` is currently taking `liveness_db_failure_threshold` directly from `data.get(...)`, so it can remain a string or accept invalid values. Parse this field as an int when loading config, and validate that the value is at least 1; if it is missing, non-numeric, zero, or negative, raise `checks.InvalidConfigurationError` before the value is used by `/liveness`.

Source: Coding guidelines

coderabbitai · 2026-06-24T17:54:45Z

+        self._lock_timeout = config.lock_timeout
        self.capacity = config.max_entries
+
+        self._health_status = True
+        self._consecutive_failures = 0
+        self._health_lock = threading.Lock()
+        self._health_check_interval = config.health_check_interval


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Validate timeout and interval values before the daemon uses them.

health_check_interval reaches time.sleep() before the try, so a negative value kills the health thread and 0 creates a hot polling loop. lock_timeout should also reject negative values unless indefinite blocking is intentional. The related PostgresConfig contract currently exposes these as plain int; validate them there with PositiveInt/non-negative types. As per coding guidelines, ols/**/*.py: Use Pydantic models for configuration and data validation.

Also applies to: 211-215, 254-257

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ols/src/cache/postgres_cache.py` around lines 159 - 165, Validate the Postgres cache timing settings before they reach the daemon: `health_check_interval` must not allow negative values or zero-hot-loop behavior, and `lock_timeout` should reject negative values unless indefinite blocking is explicitly intended. Add Pydantic validation in `PostgresConfig` (using appropriate constrained types such as non-negative/positive ints) and keep the runtime consumers in `PostgresCache` safe by relying on those validated values in the initialization and health-check logic. Update any related uses in `PostgresCache` and its config wiring so the daemon never calls `time.sleep()` or lock acquisition with invalid values.

Source: Coding guidelines

coderabbitai · 2026-06-24T17:54:45Z

+    def _connect_health(self) -> None:
+        """Establish a dedicated lightweight connection for health checks."""
+        if self._health_connection is not None:
+            try:
+                self._health_connection.close()
+            except Exception as close_err:
+                logger.debug("Failed to close old health connection: %s", close_err)
+        config = self.connection_config
+        connect_kwargs: dict[str, Any] = {
+            "host": config.host,
+            "port": config.port,
+            "user": config.user,
+            "password": config.password,
+            "dbname": config.dbname,
+            "sslmode": config.ssl_mode,
+            "sslrootcert": config.ca_cert_path,
+            "gssencmode": config.gss_encmode,
+            **libpq_tls_params(config.tls_security_profile),
+        }
+        self._health_connection = psycopg2.connect(**connect_kwargs)
+        self._health_connection.autocommit = True
+
+    def _health_check_loop(self) -> None:
+        """Background loop that periodically checks DB connectivity."""
+        while True:
+            time.sleep(self._health_check_interval)
+            try:
+                if self._health_connection is None or self._health_connection.closed:
+                    self._connect_health()
+                with self._health_connection.cursor() as cursor:
+                    cursor.execute("SELECT 1")


🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🌐 Web query:

psycopg2 2.9.9 libpq connect_timeout parameter and PostgreSQL SET statement_timeout session behavior

💡 Result:

In the context of psycopg2 (version 2.9.9), the connect_timeout parameter and the PostgreSQL SET statement_timeout command serve distinct purposes at different stages of the database interaction lifecycle [1]. connect_timeout Parameter The connect_timeout parameter is a client-side setting handled by libpq, the underlying C library used by psycopg2 [2][3]. It defines the maximum number of seconds to wait while attempting to establish the initial connection to the PostgreSQL server [1]. If the connection cannot be established within this duration (e.g., due to network issues or the server being unreachable), the attempt is aborted, and psycopg2 raises an OperationalError [4][1]. This parameter does not affect queries or operations once the connection is successfully established [1]. SET statement_timeout Behavior The SET statement_timeout command is a server-side configuration that is applied within the context of a specific database session [1][5]. When executed via psycopg2 (e.g., cursor.execute("SET statement_timeout = 5000")), it instructs the PostgreSQL server to automatically abort any subsequent SQL statement in that session that exceeds the specified duration (measured in milliseconds) [6][5][7]. Key Differences 1. Scope: connect_timeout is a connection-phase constraint enforced by the client driver (libpq) [1]; statement_timeout is a query-execution constraint enforced by the database server [6][7]. 2. Purpose: connect_timeout prevents the application from hanging during network attempts to reach the database [1]. statement_timeout prevents runaway or long-running queries from consuming excessive server resources [5][7]. 3. Interaction: The two settings are independent. Setting a short connect_timeout does not help if a query itself takes too long to execute once connected, and setting a statement_timeout does not help if the application cannot connect to the server in the first place [1]. For robust applications, it is common to use both: connect_timeout to handle initial connectivity failures, and statement_timeout to safeguard against unexpected query latency [1]. Note that because statement_timeout is session-scoped, it persists for the lifetime of that specific psycopg2 connection until explicitly changed or the session ends [5].

Citations:

1: https://www.geeksforgeeks.org/python/how-to-set-timeouts-in-psycopg2-using-python/

2: https://www.psycopg.org/docs/install.html

3: https://pypi.org/project/psycopg2/

4: connect_timeout parameter not enough in a failover event psycopg/psycopg2#985

5: https://github.com/jbranchaud/til/blob/master/postgres/set-a-statement-timeout-threshold-for-a-session.md

6: https://www.postgresql.org/docs/current/runtime-config-client.html

7: https://postgresqlco.nf/doc/en/param/statement_timeout/

🏁 Script executed:

# Inspect the cache implementation and related configuration to check the health path, timeouts, and validation. set -e git ls-files 'ols/src/cache/postgres_cache.py' 'ols/src/**/*.py' | sed -n '1,120p' echo '--- postgres_cache outline ---' ast-grep outline ols/src/cache/postgres_cache.py --view expanded echo '--- config candidates ---' rg -n "health_check_interval|lock_timeout|statement_timeout|connect_timeout|ssl_mode|tls_security_profile" ols/src -g '*.py'

Repository: openshift/lightspeed-service

Length of output: 4309

🏁 Script executed:

# Read the relevant sections of postgres_cache.py and the config model to verify how the health loop behaves. set -e sed -n '1,320p' ols/src/cache/postgres_cache.py echo '--- config file candidates ---' fd -a '.*(config|settings).*\.py$' ols/src | sed -n '1,120p'

Repository: openshift/lightspeed-service

Length of output: 12237

🏁 Script executed:

# Inspect the actual health loop and the PostgresConfig definition, focusing on interval/timeout validation. set -e python3 - <<'PY' from pathlib import Path for path in [ Path('ols/src/cache/postgres_cache.py'), Path('ols/src/config.py'), ]: if path.exists(): print(f'FILE: {path}') for i, line in enumerate(path.read_text().splitlines(), 1): if 'health_check_interval' in line or 'lock_timeout' in line or 'statement_timeout' in line or 'connect_timeout' in line or '_health_check_loop' in line or '_connect_health' in line: print(f'{i}: {line}') print() PY

Repository: openshift/lightspeed-service

Length of output: 707

Bound the health-check connect and probe. ready() now depends on this loop, but _connect_health() has no connect_timeout and the SELECT 1 probe has no statement_timeout. A slow network or hung server can block the health thread and leave _health_status stale; set explicit timeouts for both steps.

🧰 Tools

🪛 Ruff (0.15.18)

[warning] 194-194: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ols/src/cache/postgres_cache.py` around lines 189 - 219, The health-check path in _connect_health and _health_check_loop can block indefinitely because neither the psycopg2.connect call nor the SELECT 1 probe has a timeout. Update the PostgresCache health connection setup to pass an explicit connect timeout in connect_kwargs and add a statement timeout before executing the cursor probe, so ready() does not wait on a hung network or server and _health_status can recover promptly.

coderabbitai · 2026-06-24T17:54:45Z

+        except (psycopg2.OperationalError, psycopg2.InterfaceError) as e:
+            logger.warning(
+                "Connection error in %s, attempting reconnect: %s", f.__name__, e
+            )
+            if hasattr(connectable, "_mark_unhealthy"):
+                connectable._mark_unhealthy()
+            try:
+                connectable.connect()
+            except Exception as reconnect_err:
+                raise CacheError(
+                    f"reconnect failed in {f.__name__}", reconnect_err
+                ) from reconnect_err
+            return f(connectable, *args, **kwargs)


🗄️ Data Integrity & Integration | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Locate the relevant file and nearby symbols. git ls-files | rg '^ols/utils/postgres\.py$|^ols/.+\.py$' >/dev/null ast-grep outline ols/utils/postgres.py --view expanded || true echo '--- file excerpt ---' wc -l ols/utils/postgres.py sed -n '1,220p' ols/utils/postgres.py echo '--- usages ---' rg -n "insert_or_append|_mark_unhealthy|connect\(|OperationalError|InterfaceError|CacheError" ols -g '*.py'

Repository: openshift/lightspeed-service

Length of output: 8349

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the cache implementation to understand commit/append semantics. fd -a 'postgres.py' ols fd -a 'cache.py' ols fd -a '*cache*' ols | sed -n '1,120p' echo '--- Postgres cache excerpt ---' sed -n '1,260p' ols/utils/postgres.py echo '--- cache-related excerpts ---' rg -n "class .*Cache|def insert_or_append|append|commit|upsert|ON CONFLICT|executemany|execute" ols -g '*.py'

Repository: openshift/lightspeed-service

Length of output: 639

🏁 Script executed:

#!/bin/bash set -euo pipefail echo '--- postgres_cache outline ---' ast-grep outline ols/src/cache/postgres_cache.py --view expanded || true echo '--- insert_or_append section ---' sed -n '280,380p' ols/src/cache/postgres_cache.py echo '--- decorators/usages in postgres_cache ---' rg -n "^\\s*`@connection`|def (get|insert_or_append|delete|list|set_topic_summary)|commit\\(|ON CONFLICT|DO UPDATE|DO NOTHING|append" ols/src/cache/postgres_cache.py

Repository: openshift/lightspeed-service

Length of output: 6822

🏁 Script executed:

#!/bin/bash set -euo pipefail echo '--- cache interface excerpt ---' sed -n '1,220p' ols/src/cache/cache.py echo '--- in-memory implementation excerpt ---' sed -n '1,180p' ols/src/cache/in_memory_cache.py

Repository: openshift/lightspeed-service

Length of output: 10989

🏁 Script executed:

#!/bin/bash set -euo pipefail python3 - <<'PY' from pathlib import Path path = Path('ols/src/cache/postgres_cache.py') text = path.read_text() for needle in ['`@connection`', 'def insert_or_append', 'def get', 'def delete', 'def list', 'def set_topic_summary']: idx = text.find(needle) if idx != -1: start = text.rfind('\n', 0, idx-120) end = text.find('\n\n', idx) print(f'### {needle}') print(text[max(0, idx-120):min(len(text), idx+1200)]) print() PY

Repository: openshift/lightspeed-service

Length of output: 8205

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n -A80 -B20 "^\\s*`@connection`|^\\s*def insert_or_append|^\\s*def get|^\\s*def delete|^\\s*def list|^\\s*def set_topic_summary" ols/src/cache/postgres_cache.py

Repository: openshift/lightspeed-service

Length of output: 11834

🏁 Script executed:

#!/bin/bash set -euo pipefail sed -n '280,370p' ols/src/cache/postgres_cache.py

Repository: openshift/lightspeed-service

Length of output: 3988

🏁 Script executed:

#!/bin/bash set -euo pipefail cat -n ols/src/cache/postgres_cache.py | sed -n '280,370p'

Repository: openshift/lightspeed-service

Length of output: 4625

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "`@connection`" ols -g '*.py'

Repository: openshift/lightspeed-service

Length of output: 724

🏁 Script executed:

#!/bin/bash set -euo pipefail cat -n ols/src/cache/postgres_cache.py | sed -n '1,260p'

Repository: openshift/lightspeed-service

Length of output: 11615

🏁 Script executed:

#!/bin/bash set -euo pipefail cat -n ols/src/cache/postgres_cache.py | sed -n '285,360p'

Repository: openshift/lightspeed-service

Length of output: 3885

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "class .*\$PostgresBase\$" ols -g '*.py'

Repository: openshift/lightspeed-service

Length of output: 245

🏁 Script executed:

#!/bin/bash set -euo pipefail echo '--- PostgresBase subclasses ---' rg -n "class .*\$PostgresBase\$" ols -g '*.py' echo '--- connection decorator uses ---' rg -n "`@connection`" ols -g '*.py' || true

Repository: openshift/lightspeed-service

Length of output: 869

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "`@connection`" ols -g '*.py' || true

Repository: openshift/lightspeed-service

Length of output: 724

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "`@connection`" ols -g '*.py' || true

Repository: openshift/lightspeed-service

Length of output: 724

Limit @connection retries on write methods. PostgresCache.insert_or_append is retried after OperationalError/InterfaceError, but it appends to the stored history and increments message_count before commit(). If the connection drops after PostgreSQL has already committed, the retry can duplicate the entry and metadata update.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ols/utils/postgres.py` around lines 33 - 45, The `@connection` retry logic in the postgres decorator is causing duplicate writes for methods like PostgresCache.insert_or_append because the operation may have already committed before reconnect and retry. Update the retry behavior in the connection wrapper to avoid automatically retrying write/mutating methods after OperationalError/InterfaceError, and keep the retry path only for safe read operations or explicitly idempotent calls identified by the wrapped function name.

coderabbitai · 2026-06-24T17:54:45Z

+        cursor = self.connection.cursor()
+        cursor.execute("SET statement_timeout = %s", (str(config.statement_timeout),))
+        cursor.close()


🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Close the statement-timeout cursor on failure too.

If SET statement_timeout raises, cursor.close() is skipped after schema initialization. Use try/finally for the new cursor.

Suggested cleanup

self.connection.autocommit = True cursor = self.connection.cursor() - cursor.execute("SET statement_timeout = %s", (str(config.statement_timeout),)) - cursor.close() + try: + cursor.execute("SET statement_timeout = %s", (str(config.statement_timeout),)) + finally: + cursor.close()

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

cursor = self.connection.cursor()

cursor.execute("SET statement_timeout = %s", (str(config.statement_timeout),))

cursor.close()

cursor = self.connection.cursor()

try:

cursor.execute("SET statement_timeout = %s", (str(config.statement_timeout),))

finally:

cursor.close()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ols/utils/postgres.py` around lines 106 - 108, The new cursor in the statement-timeout setup can leak if `cursor.execute` fails during schema initialization. Update the `Postgres` initialization flow around the `cursor = self.connection.cursor()` and `SET statement_timeout` call to use a `try/finally` so `cursor.close()` always runs, even on exceptions; keep the fix localized to this cursor-handling block.

coderabbitai · 2026-06-24T17:54:45Z

+def _suppress_health_loop():
+    """Prevent the background health-check thread from making real DB calls."""
+    with patch.object(PostgresCache, "_health_check_loop"):
+        yield


📐 Maintainability & Code Quality | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Locate the file and inspect the relevant sections with line numbers. git ls-files tests/unit/cache/test_postgres_cache.py wc -l tests/unit/cache/test_postgres_cache.py sed -n '1,120p' tests/unit/cache/test_postgres_cache.py printf '\n--- mid ---\n' sed -n '760,890p' tests/unit/cache/test_postgres_cache.py

Repository: openshift/lightspeed-service

Length of output: 7621

🏁 Script executed:

#!/bin/bash set -euo pipefail # Extract function signatures and comments in the target file. python3 - <<'PY' from pathlib import Path import re path = Path('tests/unit/cache/test_postgres_cache.py') text = path.read_text() for i, line in enumerate(text.splitlines(), start=1): if re.match(r'^\s*(def|async def)\s+', line) or 'TODO' in line or line.strip().startswith('#'): print(f'{i}: {line}') PY

Repository: openshift/lightspeed-service

Length of output: 7721

Add return annotations to the new fixture and tests. tests/unit/cache/test_postgres_cache.py:20, 791-865 has an untyped autouse fixture and several untyped test functions; add Generator[None, None, None] for _suppress_health_loop and -> None for the new tests.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/cache/test_postgres_cache.py` around lines 20 - 23, The new pytest fixture and several test functions in Postgres cache tests are missing type annotations. Update _suppress_health_loop to declare Generator[None, None, None] as its return type, and add -> None to the affected test functions in test_postgres_cache so the new autouse fixture and tests are fully typed and consistent with the rest of the module.

Source: Coding guidelines

coderabbitai · 2026-06-24T17:54:45Z

+    def test_operational_error_triggers_reconnect_and_retry(self):
+        """OperationalError causes reconnect + retry, succeeding on second attempt."""
+        c = self.Connectable()
+        c.connect()
+        result = c.do_work_operational_error()
+        assert result == "recovered"
+        assert c._unhealthy_marked is True
+
+    def test_interface_error_triggers_reconnect_and_retry(self):
+        """InterfaceError causes reconnect + retry, succeeding on second attempt."""
+        c = self.Connectable()
+        c.connect()
+        result = c.do_work_interface_error()
+        assert result == "recovered"
+        assert c._unhealthy_marked is True
+
+    def test_database_error_wraps_in_cache_error_no_retry(self):
+        """DatabaseError is wrapped in CacheError without reconnect attempt."""
+        c = self.Connectable()
+        c.connect()
+        with pytest.raises(CacheError, match="SQL syntax error"):
+            c.do_work_database_error()
+
+    def test_connection_error_reconnect_failure_raises_cache_error(self):
+        """When reconnect fails after OperationalError, CacheError is raised."""
+        c = self.Connectable()
+        c.connect()
+        # Make connect() fail on reconnect attempt
+        original_connect = c.connect
+        call_count = [0]
+
+        def failing_connect():
+            call_count[0] += 1
+            if call_count[0] > 0:
+                raise psycopg2.OperationalError("cannot connect")
+            original_connect()
+
+        c.connect = failing_connect
+        with pytest.raises(CacheError, match="reconnect failed"):
+            c.do_work_operational_error()


📐 Maintainability & Code Quality | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

sed -n '1,240p' tests/unit/utils/test_postgres.py

Repository: openshift/lightspeed-service

Length of output: 9034

🏁 Script executed:

sed -n '1,240p' tests/unit/utils/test_postgres.py

Repository: openshift/lightspeed-service

Length of output: 9034

Add the missing -> None annotations and tighten the no-retry assertion

The new test_* methods and failing_connect still need explicit return types, and test_database_error_wraps_in_cache_error_no_retry should also assert that reconnect/retry is not attempted.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/utils/test_postgres.py` around lines 108 - 147, Add explicit -> None return annotations to the new test methods in test_postgres.py, including the helper failing_connect nested in test_connection_error_reconnect_failure_raises_cache_error, to keep the test signatures consistent. Also tighten test_database_error_wraps_in_cache_error_no_retry by asserting the Connectable instance does not attempt reconnect/retry when do_work_database_error is called; use the existing Connectable, connect, and do_work_database_error symbols to place the assertion near the current CacheError check.

Source: Coding guidelines

coderabbitai · 2026-06-24T17:54:45Z

+    def test_database_error_wraps_in_cache_error_no_retry(self):
+        """DatabaseError is wrapped in CacheError without reconnect attempt."""
+        c = self.Connectable()
+        c.connect()
+        with pytest.raises(CacheError, match="SQL syntax error"):
+            c.do_work_database_error()


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Assert the “no retry” part of this case.

This test only checks CacheError wrapping. If the decorator attempted a reconnect/retry and still raised CacheError, the test could pass. Patch connect after the initial connection and assert it is not called. As per coding guidelines, tests/**/*.py: Assert specific values and behaviors in tests, not just that code runs without error.

Suggested assertion

c = self.Connectable() c.connect() - with pytest.raises(CacheError, match="SQL syntax error"): - c.do_work_database_error() + with patch.object(c, "connect") as mock_connect: + with pytest.raises(CacheError, match="SQL syntax error"): + c.do_work_database_error() + mock_connect.assert_not_called() + assert c._unhealthy_marked is False

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def test_database_error_wraps_in_cache_error_no_retry(self):

"""DatabaseError is wrapped in CacheError without reconnect attempt."""

c = self.Connectable()

c.connect()

with pytest.raises(CacheError, match="SQL syntax error"):

c.do_work_database_error()

def test_database_error_wraps_in_cache_error_no_retry(self):

"""DatabaseError is wrapped in CacheError without reconnect attempt."""

c = self.Connectable()

c.connect()

with patch.object(c, "connect") as mock_connect:

with pytest.raises(CacheError, match="SQL syntax error"):

c.do_work_database_error()

mock_connect.assert_not_called()

assert c._unhealthy_marked is False

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/utils/test_postgres.py` around lines 124 - 129, The test `test_database_error_wraps_in_cache_error_no_retry` only verifies that `do_work_database_error()` raises `CacheError`, but it does not prove the reconnect path was skipped. After the initial `c.connect()` call, patch or mock `Connectable.connect` and assert it is not called when `do_work_database_error()` raises, so this test checks the “no retry” behavior as well as the `CacheError` wrapping.

Source: Coding guidelines

@connection

Implement background health-check loop, smarter error classification in the @connection decorator, operation timeouts, and enhanced liveness/readiness probes so OLS can recover automatically when the backing PostgreSQL database is restarted. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ols/src/cache/postgres_cache.py`:
- Around line 350-353: The DatabaseError handling in PostgresCache should use
the same rollback protection as the operational-error path, because calling
connection.rollback() directly can hide the original psycopg2.DatabaseError and
prevent the CacheError wrapping from happening. Update the exception blocks in
PostgresCache.insert_or_append and the other matching DatabaseError path to call
_safe_rollback() instead of self.connection.rollback(), while keeping the
existing logger.error and CacheError("PostgresCache.insert_or_append", e)
behavior intact.
- Around line 184-188: The `_mark_unhealthy()` path only updates
`_health_status`, so repeated request-path failures from the `@connection` retry
flow do not advance `_consecutive_failures` toward the liveness threshold.
Update `_mark_unhealthy()` in `PostgresCache` to also increment the failure
counter under `_health_lock`, keeping the unhealthy flag and failure tracking in
sync so `is_healthy`/liveness checks can trip promptly after repeated
cache-operation DB failures.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c84be3c2-a66c-475a-91d3-5d332a00fb83

📥 Commits

Reviewing files that changed from the base of the PR and between 9868f61 and 6e37719.

📒 Files selected for processing (10)

docs/openapi.json
ols/app/endpoints/health.py
ols/app/models/config.py
ols/constants.py
ols/src/cache/postgres_cache.py
ols/utils/postgres.py
tests/unit/app/endpoints/test_health.py
tests/unit/app/models/test_config.py
tests/unit/cache/test_postgres_cache.py
tests/unit/utils/test_postgres.py

🚧 Files skipped from review as they are similar to previous changes (8)

ols/app/endpoints/health.py
ols/constants.py
tests/unit/cache/test_postgres_cache.py
tests/unit/app/endpoints/test_health.py
ols/app/models/config.py
tests/unit/utils/test_postgres.py
ols/utils/postgres.py
tests/unit/app/models/test_config.py

coderabbitai · 2026-06-24T20:38:06Z

+    def _mark_unhealthy(self) -> None:
+        """Mark health status as unhealthy (dual-feed from cache operations)."""
+        with self._health_lock:
+            self._health_status = False
+


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Count operation-driven unhealthy events toward the failure threshold.

_mark_unhealthy() is the cache-operation feed from the @connection retry path, but it only flips _health_status; liveness threshold checks based on _consecutive_failures can remain below threshold until the background probe also fails. Increment the counter here so repeated request-path DB failures can trip liveness promptly.

Suggested fix

def _mark_unhealthy(self) -> None: """Mark health status as unhealthy (dual-feed from cache operations).""" with self._health_lock: + self._consecutive_failures += 1 self._health_status = False

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def _mark_unhealthy(self) -> None:

"""Mark health status as unhealthy (dual-feed from cache operations)."""

with self._health_lock:

self._health_status = False

def _mark_unhealthy(self) -> None:

"""Mark health status as unhealthy (dual-feed from cache operations)."""

with self._health_lock:

self._consecutive_failures += 1

self._health_status = False

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ols/src/cache/postgres_cache.py` around lines 184 - 188, The `_mark_unhealthy()` path only updates `_health_status`, so repeated request-path failures from the `@connection` retry flow do not advance `_consecutive_failures` toward the liveness threshold. Update `_mark_unhealthy()` in `PostgresCache` to also increment the failure counter under `_health_lock`, keeping the unhealthy flag and failure tracking in sync so `is_healthy`/liveness checks can trip promptly after repeated cache-operation DB failures.

coderabbitai · 2026-06-24T20:38:06Z

                except psycopg2.DatabaseError as e:
                    self.connection.rollback()
                    logger.error("PostgresCache.insert_or_append: %s", e)
                    raise CacheError("PostgresCache.insert_or_append", e) from e


🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Use the safe rollback helper in DatabaseError paths too.

These paths intend to wrap psycopg2.DatabaseError as CacheError, but a failed direct rollback() would mask the original error and skip the wrapping/log context. Reuse _safe_rollback() as in the operational-error path.

Suggested fix

except psycopg2.DatabaseError as e: - self.connection.rollback() + self._safe_rollback() logger.error("PostgresCache.insert_or_append: %s", e) raise CacheError("PostgresCache.insert_or_append", e) from e

except psycopg2.DatabaseError as e: - self.connection.rollback() + self._safe_rollback() logger.error("PostgresCache.delete: %s", e) raise CacheError("PostgresCache.delete", e) from e

Also applies to: 389-392

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ols/src/cache/postgres_cache.py` around lines 350 - 353, The DatabaseError handling in PostgresCache should use the same rollback protection as the operational-error path, because calling connection.rollback() directly can hide the original psycopg2.DatabaseError and prevent the CacheError wrapping from happening. Update the exception blocks in PostgresCache.insert_or_append and the other matching DatabaseError path to call _safe_rollback() instead of self.connection.rollback(), while keeping the existing logger.error and CacheError("PostgresCache.insert_or_append", e) behavior intact.

openshift-ci · 2026-06-24T22:26:29Z

@sriroopar: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/security	`6e37719`	link	false	`/test security`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot requested review from bparees and tisnik June 24, 2026 17:46

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

sriroopar force-pushed the postgres-auto-recovery branch 2 times, most recently from eea3052 to 15f8455 Compare June 24, 2026 20:16

sriroopar force-pushed the postgres-auto-recovery branch from 15f8455 to 6e37719 Compare June 24, 2026 20:32

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

		liveness_db_failure_threshold: int = constants.LIVENESS_DB_FAILURE_THRESHOLD

Uh oh!

Conversation

sriroopar commented Jun 24, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

Related Tickets & Documents

Checklist before requesting a review

Testing

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Uh oh!

openshift-ci Bot commented Jun 24, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

openshift-ci Bot commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

sriroopar commented Jun 24, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 24, 2026 •

edited

Loading