Skip to content

OLS-3221 Add PostgreSQL auto-recovery after DB restart#2964

Open
sriroopar wants to merge 1 commit into
openshift:mainfrom
sriroopar:postgres-auto-recovery
Open

OLS-3221 Add PostgreSQL auto-recovery after DB restart#2964
sriroopar wants to merge 1 commit into
openshift:mainfrom
sriroopar:postgres-auto-recovery

Conversation

@sriroopar

@sriroopar sriroopar commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Implement background health-check loop, smarter error classification in the @connection decorator, operation timeouts, and enhanced liveness/readiness probes so OLS can recover automatically when the backing PostgreSQL database is restarted.

Description

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • Improved service health checks for PostgreSQL-backed caches, with /liveness now able to report a 503 status when the database is unavailable.
    • Added new timeout and health-check configuration options for PostgreSQL cache behavior.
  • Bug Fixes

    • Strengthened cache connection handling and retry behavior during temporary database or connection failures.
    • ready() now reflects the latest background health status more reliably.
  • Tests

    • Expanded coverage for liveness behavior, configuration defaults, and PostgreSQL connection recovery scenarios.

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

The PR adds new Postgres cache timeout and liveness-threshold configuration, updates Postgres connection and cache health handling, and changes /liveness to return 503 when the cache failure threshold is reached. Tests and OpenAPI output were updated to match.

Changes

Postgres cache health and liveness

Layer / File(s) Summary
Configuration defaults and thresholds
ols/constants.py, ols/app/models/config.py, tests/unit/app/models/test_config.py
New cache timeout constants and liveness threshold values are added, and PostgresConfig and OLSConfig expose those defaults and overrides.
Connection retry and statement timeout
ols/utils/postgres.py, tests/unit/utils/test_postgres.py
The Postgres connection decorator retries connection-level errors once, wraps database errors in CacheError, and connect() sets statement_timeout.
Cache health state and readiness
ols/src/cache/postgres_cache.py, tests/unit/cache/test_postgres_cache.py
PostgresCache starts a health-check thread, tracks health state and failure counts, set_topic_summary follows the new lock and error handling, and ready() uses the cached health state.
Cache locking and error handling
ols/src/cache/postgres_cache.py, tests/unit/cache/test_postgres_cache.py
The remaining cache read/write methods use bounded locks and separate operational/interface errors from database errors; tests cover lock timeouts.
Liveness endpoint and OpenAPI contract
ols/app/endpoints/health.py, docs/openapi.json, tests/unit/app/endpoints/test_health.py
/liveness returns 503 for Postgres cache threshold breaches, and the OpenAPI spec and endpoint tests include the new response.

Sequence Diagram(s)

sequenceDiagram
  participant "Client"
  participant "liveness_probe_get_method"
  participant "config._conversation_cache"
  participant "PostgresCache"
  participant "HTTPException"

  Client->>liveness_probe_get_method: GET /liveness
  liveness_probe_get_method->>config._conversation_cache: inspect cache implementation
  alt config._conversation_cache is PostgresCache
    liveness_probe_get_method->>PostgresCache: read _consecutive_failures under _health_lock
    alt failures >= liveness_db_failure_threshold
      liveness_probe_get_method->>HTTPException: raise 503 "database unreachable"
    else below threshold
      liveness_probe_get_method-->>Client: LivenessResponse(alive=True)
    end
  else other cache
    liveness_probe_get_method-->>Client: LivenessResponse(alive=True)
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main change: PostgreSQL auto-recovery after a database restart.
Docstring Coverage ✅ Passed Docstring coverage is 96.72% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@openshift-ci openshift-ci Bot requested review from bparees and tisnik June 24, 2026 17:46
@openshift-ci

openshift-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign blublinsky for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

🧹 Nitpick comments (1)
tests/unit/cache/test_postgres_cache.py (1)

836-865: 📐 Maintainability & Code Quality | 🔵 Trivial | 🏗️ Heavy lift

Exercise the health-check behavior instead of assigning its result.

These tests directly set _health_status and _consecutive_failures, so they would still pass if _health_check_loop() stopped updating either field. Extract a single-iteration health-check helper or inject a controllable sleep/event so the tests drive the success and failure paths. As per coding guidelines, tests/**/*.py: Assert specific values and behaviors in tests, not just that code runs without error.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/cache/test_postgres_cache.py` around lines 836 - 865, The
health-check tests are bypassing the real behavior by mutating PostgresCache
internals directly, so they won’t catch regressions in _health_check_loop().
Update the tests to drive the loop logic through a single-iteration helper or a
controllable sleep/event on PostgresCache, then assert the resulting
_health_status and _consecutive_failures values after simulated success/failure.
Use the existing PostgresCache, _health_check_loop, _mark_unhealthy, and ready()
symbols to locate the behavior and keep the assertions focused on observable
state changes rather than direct field assignment.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ols/app/endpoints/health.py`:
- Around line 126-129: The 503 liveness response in the health endpoint is
returning an HTTPException shape instead of the declared LivenessResponse
contract. Update the liveness failure path in the health endpoint so the 503
body matches LivenessResponse (for example, by returning that model directly),
or change the documented 503 response model to the actual error payload; keep
the response schema and the OpenAPI declaration in sync around the liveness
handler.

In `@ols/app/models/config.py`:
- Around line 801-803: The config model currently accepts negative values for
statement_timeout, lock_timeout, and health_check_interval, so tighten
validation in the Pydantic config class that defines these fields. Add explicit
field constraints or validators on statement_timeout and lock_timeout to reject
negative integers, and make health_check_interval strictly positive so the
health-check loop and PostgresCache lock acquisition keep their timing
guarantees. Use the existing config model/field definitions in Config to enforce
these contracts at load time rather than downstream.
- Around line 1166-1167: `OLSConfig.__init__` is currently taking
`liveness_db_failure_threshold` directly from `data.get(...)`, so it can remain
a string or accept invalid values. Parse this field as an int when loading
config, and validate that the value is at least 1; if it is missing,
non-numeric, zero, or negative, raise `checks.InvalidConfigurationError` before
the value is used by `/liveness`.

In `@ols/src/cache/postgres_cache.py`:
- Around line 159-165: Validate the Postgres cache timing settings before they
reach the daemon: `health_check_interval` must not allow negative values or
zero-hot-loop behavior, and `lock_timeout` should reject negative values unless
indefinite blocking is explicitly intended. Add Pydantic validation in
`PostgresConfig` (using appropriate constrained types such as
non-negative/positive ints) and keep the runtime consumers in `PostgresCache`
safe by relying on those validated values in the initialization and health-check
logic. Update any related uses in `PostgresCache` and its config wiring so the
daemon never calls `time.sleep()` or lock acquisition with invalid values.
- Around line 189-219: The health-check path in _connect_health and
_health_check_loop can block indefinitely because neither the psycopg2.connect
call nor the SELECT 1 probe has a timeout. Update the PostgresCache health
connection setup to pass an explicit connect timeout in connect_kwargs and add a
statement timeout before executing the cursor probe, so ready() does not wait on
a hung network or server and _health_status can recover promptly.
- Around line 194-195: The broad exception handling in PostgresCache needs to be
narrowed to satisfy Ruff BLE001. Update the catch blocks in _connect_health,
_health_check_loop, _safe_rollback, and _safe_set_autocommit to handle only the
specific psycopg2 or attribute-related exceptions they can actually raise, and
keep the existing logger/debug behavior in those paths. If any catch-all is
intentionally required to keep the daemon alive, add an explicit # noqa: BLE001
on that except clause rather than leaving a bare except Exception.

In `@ols/utils/postgres.py`:
- Around line 106-108: The new cursor in the statement-timeout setup can leak if
`cursor.execute` fails during schema initialization. Update the `Postgres`
initialization flow around the `cursor = self.connection.cursor()` and `SET
statement_timeout` call to use a `try/finally` so `cursor.close()` always runs,
even on exceptions; keep the fix localized to this cursor-handling block.
- Around line 29-45: The wrapper in postgres connection handling is calling
connectable.connect() before the try/except, so a failed initial connect skips
the same recovery path as query failures. Move the initial connect() call into
the connection-error handling flow inside wrapper() so both the first connect
and the retry go through the same psycopg2.OperationalError/InterfaceError
handling, including _mark_unhealthy() and CacheError wrapping in the reconnect
path.
- Around line 33-45: The `@connection` retry logic in the postgres decorator is
causing duplicate writes for methods like PostgresCache.insert_or_append because
the operation may have already committed before reconnect and retry. Update the
retry behavior in the connection wrapper to avoid automatically retrying
write/mutating methods after OperationalError/InterfaceError, and keep the retry
path only for safe read operations or explicitly idempotent calls identified by
the wrapped function name.

In `@tests/unit/cache/test_postgres_cache.py`:
- Around line 20-23: The new pytest fixture and several test functions in
Postgres cache tests are missing type annotations. Update _suppress_health_loop
to declare Generator[None, None, None] as its return type, and add -> None to
the affected test functions in test_postgres_cache so the new autouse fixture
and tests are fully typed and consistent with the rest of the module.

In `@tests/unit/utils/test_postgres.py`:
- Around line 108-147: Add explicit -> None return annotations to the new test
methods in test_postgres.py, including the helper failing_connect nested in
test_connection_error_reconnect_failure_raises_cache_error, to keep the test
signatures consistent. Also tighten
test_database_error_wraps_in_cache_error_no_retry by asserting the Connectable
instance does not attempt reconnect/retry when do_work_database_error is called;
use the existing Connectable, connect, and do_work_database_error symbols to
place the assertion near the current CacheError check.
- Around line 124-129: The test
`test_database_error_wraps_in_cache_error_no_retry` only verifies that
`do_work_database_error()` raises `CacheError`, but it does not prove the
reconnect path was skipped. After the initial `c.connect()` call, patch or mock
`Connectable.connect` and assert it is not called when
`do_work_database_error()` raises, so this test checks the “no retry” behavior
as well as the `CacheError` wrapping.

---

Nitpick comments:
In `@tests/unit/cache/test_postgres_cache.py`:
- Around line 836-865: The health-check tests are bypassing the real behavior by
mutating PostgresCache internals directly, so they won’t catch regressions in
_health_check_loop(). Update the tests to drive the loop logic through a
single-iteration helper or a controllable sleep/event on PostgresCache, then
assert the resulting _health_status and _consecutive_failures values after
simulated success/failure. Use the existing PostgresCache, _health_check_loop,
_mark_unhealthy, and ready() symbols to locate the behavior and keep the
assertions focused on observable state changes rather than direct field
assignment.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3804766b-967a-49bb-b307-520f135d2133

📥 Commits

Reviewing files that changed from the base of the PR and between b51ca02 and 9868f61.

📒 Files selected for processing (10)
  • ols/app/endpoints/health.py
  • ols/app/models/config.py
  • ols/app/models/models.py
  • ols/constants.py
  • ols/src/cache/postgres_cache.py
  • ols/utils/postgres.py
  • tests/unit/app/endpoints/test_health.py
  • tests/unit/app/models/test_config.py
  • tests/unit/cache/test_postgres_cache.py
  • tests/unit/utils/test_postgres.py

Comment on lines +126 to +129
503: {
"description": "Service is not alive",
"model": LivenessResponse,
},

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Map relevant files first.
git ls-files | rg '^ols/app/endpoints/health\.py$|^.*health.*test.*\.py$|^.*test.*health.*\.py$|^.*liveness.*test.*\.py$|^.*health.*\.py$'

# Show outline for the target file.
ast-grep outline ols/app/endpoints/health.py --view expanded

# Inspect the relevant line range with numbers.
sed -n '1,240p' ols/app/endpoints/health.py | cat -n

# Search for liveness/503 response declarations and tests.
rg -n "liveness|503|LivenessResponse|HTTPException|JSONResponse|get_liveness_responses" ols -g '!**/__pycache__/**'

Repository: openshift/lightspeed-service

Length of output: 12991


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show tests related to the health endpoint if present.
rg -n "liveness_probe_get_method|LivenessResponse|database unreachable|service is not alive|HTTP_503_SERVICE_UNAVAILABLE" . -g '*test*.py' -g '*tests*.py'

# Print the nearby definitions in the health endpoint.
sed -n '1,220p' ols/app/endpoints/health.py | cat -n

Repository: openshift/lightspeed-service

Length of output: 7441


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the exact health endpoint implementation and adjacent helpers.
sed -n '1,220p' ols/app/endpoints/health.py | cat -n

Repository: openshift/lightspeed-service

Length of output: 6023


Align the 503 liveness response with LivenessResponse.

raise HTTPException(detail={"alive": False, "reason": ...}) returns {"detail": ...}, which does not match the declared 503 LivenessResponse schema or the generated OpenAPI contract. Return a LivenessResponse payload with the 503 response body, or change the documented 503 model to match the actual error shape.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ols/app/endpoints/health.py` around lines 126 - 129, The 503 liveness
response in the health endpoint is returning an HTTPException shape instead of
the declared LivenessResponse contract. Update the liveness failure path in the
health endpoint so the 503 body matches LivenessResponse (for example, by
returning that model directly), or change the documented 503 response model to
the actual error payload; keep the response schema and the OpenAPI declaration
in sync around the liveness handler.

Source: Coding guidelines

Comment thread ols/app/models/config.py
Comment on lines +801 to +803
statement_timeout: int = constants.POSTGRES_STATEMENT_TIMEOUT
lock_timeout: int = constants.POSTGRES_LOCK_TIMEOUT
health_check_interval: int = constants.CACHE_HEALTH_CHECK_INTERVAL

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Reject invalid timeout and interval values.

These fields currently allow negative integers. That breaks the new recovery path: a negative health_check_interval will terminate the health-check thread on time.sleep(), and a negative lock_timeout breaks the bounded-wait guarantee in PostgresCache._acquire_lock(). Validate these at config load and keep health_check_interval strictly positive.

Suggested constraint tightening
-    statement_timeout: int = constants.POSTGRES_STATEMENT_TIMEOUT
-    lock_timeout: int = constants.POSTGRES_LOCK_TIMEOUT
-    health_check_interval: int = constants.CACHE_HEALTH_CHECK_INTERVAL
+    statement_timeout: NonNegativeInt = constants.POSTGRES_STATEMENT_TIMEOUT
+    lock_timeout: NonNegativeInt = constants.POSTGRES_LOCK_TIMEOUT
+    health_check_interval: PositiveInt = constants.CACHE_HEALTH_CHECK_INTERVAL

As per coding guidelines, "Use Pydantic models for configuration and data validation" and "Define function signatures and data shapes (interface contracts) before implementing; prefer explicit contracts over implicit ones".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
statement_timeout: int = constants.POSTGRES_STATEMENT_TIMEOUT
lock_timeout: int = constants.POSTGRES_LOCK_TIMEOUT
health_check_interval: int = constants.CACHE_HEALTH_CHECK_INTERVAL
statement_timeout: NonNegativeInt = constants.POSTGRES_STATEMENT_TIMEOUT
lock_timeout: NonNegativeInt = constants.POSTGRES_LOCK_TIMEOUT
health_check_interval: PositiveInt = constants.CACHE_HEALTH_CHECK_INTERVAL
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ols/app/models/config.py` around lines 801 - 803, The config model currently
accepts negative values for statement_timeout, lock_timeout, and
health_check_interval, so tighten validation in the Pydantic config class that
defines these fields. Add explicit field constraints or validators on
statement_timeout and lock_timeout to reject negative integers, and make
health_check_interval strictly positive so the health-check loop and
PostgresCache lock acquisition keep their timing guarantees. Use the existing
config model/field definitions in Config to enforce these contracts at load time
rather than downstream.

Source: Coding guidelines

Comment thread ols/app/models/config.py
Comment on lines +1166 to +1167
liveness_db_failure_threshold: int = constants.LIVENESS_DB_FAILURE_THRESHOLD

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the relevant config model and surrounding initialization logic.
FILE="ols/app/models/config.py"

wc -l "$FILE"
printf '\n--- outline ---\n'
ast-grep outline "$FILE" --view expanded | sed -n '1,220p'

printf '\n--- relevant line ranges ---\n'
sed -n '1120,1285p' "$FILE"

Repository: openshift/lightspeed-service

Length of output: 17658


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect supporting definitions used by the config model.
rg -n "liveness_db_failure_threshold|LIVENESS_DB_FAILURE_THRESHOLD|InvalidConfigurationError|class OLSConfig|def __init__" ols/app/models/config.py ols/utils/checks.py ols/app/models -S

Repository: openshift/lightspeed-service

Length of output: 13448


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Read the constructor and any validation helpers around the cited fields.
FILE="ols/app/models/config.py"

# Show the class definition and init method ranges if present.
ast-grep outline "$FILE" --match OLSConfig --view expanded | sed -n '1,260p'

printf '\n--- constructor / field assignments ---\n'
rg -n "def __init__|liveness_db_failure_threshold|data.get\\(\"liveness_db_failure_threshold\"|InvalidConfigurationError" "$FILE" -n -S -A 4 -B 4

Repository: openshift/lightspeed-service

Length of output: 26243


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check whether assignment validation is enabled anywhere for this model hierarchy.
rg -n "validate_assignment|model_config|ConfigDict|class BaseModel" ols/app/models/config.py ols/app/models -S

printf '\n--- OLSConfig / BaseModel relevant slice ---\n'
sed -n '1,120p' ols/app/models/config.py

Repository: openshift/lightspeed-service

Length of output: 5473


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find where the liveness threshold is consumed.
rg -n "liveness_db_failure_threshold|/liveness|failures >= threshold|failure_threshold" ols -S

Repository: openshift/lightspeed-service

Length of output: 829


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Read the liveness check implementation.
rg -n "/liveness|liveness" ols -S -A 6 -B 6 | sed -n '1,240p'

Repository: openshift/lightspeed-service

Length of output: 7132


🏁 Script executed:

#!/bin/bash
set -euo pipefail

python3 - <<'PY'
from pathlib import Path
text = Path("ols/app/models/config.py").read_text()
for needle in ["validate_assignment", "model_config", "ConfigDict", "BaseModel"]:
    print(f"{needle}: {needle in text}")
PY

printf '\n--- liveness references ---\n'
rg -n "liveness_db_failure_threshold|/liveness|failures >?=|failure" ols -S -A 3 -B 3

Repository: openshift/lightspeed-service

Length of output: 10511


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check whether Pydantic assignment validation is enabled for OLSConfig or its base classes.
python3 - <<'PY'
from pathlib import Path

text = Path("ols/app/models/config.py").read_text()
for needle in ("validate_assignment", "ConfigDict", "model_config"):
    print(f"{needle}:", needle in text)
PY

printf '\n--- liveness usage ---\n'
rg -n "liveness_db_failure_threshold|/liveness|failures >=|failures >=" ols -S -A 4 -B 4

Repository: openshift/lightspeed-service

Length of output: 3476


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check whether Pydantic assignment validation is enabled for OLSConfig or its base classes.
python3 - <<'PY'
from pathlib import Path

text = Path("ols/app/models/config.py").read_text()
for needle in ("validate_assignment", "ConfigDict", "model_config"):
    print(f"{needle}:", needle in text)
PY

printf '\n--- liveness usage ---\n'
rg -n "liveness_db_failure_threshold|/liveness|failures >=|failures >=" ols -S -A 4 -B 4

Repository: openshift/lightspeed-service

Length of output: 3476


Validate liveness_db_failure_threshold in OLSConfig.__init__. data.get(...) bypasses Pydantic here, so a YAML string like "3" reaches /liveness as a string and breaks failures >= threshold; 0 or negative values also force 503 on Postgres-backed instances. Parse it as an int and reject values < 1 with checks.InvalidConfigurationError.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ols/app/models/config.py` around lines 1166 - 1167, `OLSConfig.__init__` is
currently taking `liveness_db_failure_threshold` directly from `data.get(...)`,
so it can remain a string or accept invalid values. Parse this field as an int
when loading config, and validate that the value is at least 1; if it is
missing, non-numeric, zero, or negative, raise
`checks.InvalidConfigurationError` before the value is used by `/liveness`.

Source: Coding guidelines

Comment on lines +159 to +165
self._lock_timeout = config.lock_timeout
self.capacity = config.max_entries

self._health_status = True
self._consecutive_failures = 0
self._health_lock = threading.Lock()
self._health_check_interval = config.health_check_interval

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Validate timeout and interval values before the daemon uses them.

health_check_interval reaches time.sleep() before the try, so a negative value kills the health thread and 0 creates a hot polling loop. lock_timeout should also reject negative values unless indefinite blocking is intentional. The related PostgresConfig contract currently exposes these as plain int; validate them there with PositiveInt/non-negative types. As per coding guidelines, ols/**/*.py: Use Pydantic models for configuration and data validation.

Also applies to: 211-215, 254-257

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ols/src/cache/postgres_cache.py` around lines 159 - 165, Validate the
Postgres cache timing settings before they reach the daemon:
`health_check_interval` must not allow negative values or zero-hot-loop
behavior, and `lock_timeout` should reject negative values unless indefinite
blocking is explicitly intended. Add Pydantic validation in `PostgresConfig`
(using appropriate constrained types such as non-negative/positive ints) and
keep the runtime consumers in `PostgresCache` safe by relying on those validated
values in the initialization and health-check logic. Update any related uses in
`PostgresCache` and its config wiring so the daemon never calls `time.sleep()`
or lock acquisition with invalid values.

Source: Coding guidelines

Comment on lines +189 to +219
def _connect_health(self) -> None:
"""Establish a dedicated lightweight connection for health checks."""
if self._health_connection is not None:
try:
self._health_connection.close()
except Exception as close_err:
logger.debug("Failed to close old health connection: %s", close_err)
config = self.connection_config
connect_kwargs: dict[str, Any] = {
"host": config.host,
"port": config.port,
"user": config.user,
"password": config.password,
"dbname": config.dbname,
"sslmode": config.ssl_mode,
"sslrootcert": config.ca_cert_path,
"gssencmode": config.gss_encmode,
**libpq_tls_params(config.tls_security_profile),
}
self._health_connection = psycopg2.connect(**connect_kwargs)
self._health_connection.autocommit = True

def _health_check_loop(self) -> None:
"""Background loop that periodically checks DB connectivity."""
while True:
time.sleep(self._health_check_interval)
try:
if self._health_connection is None or self._health_connection.closed:
self._connect_health()
with self._health_connection.cursor() as cursor:
cursor.execute("SELECT 1")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🌐 Web query:

psycopg2 2.9.9 libpq connect_timeout parameter and PostgreSQL SET statement_timeout session behavior

💡 Result:

In the context of psycopg2 (version 2.9.9), the connect_timeout parameter and the PostgreSQL SET statement_timeout command serve distinct purposes at different stages of the database interaction lifecycle [1]. connect_timeout Parameter The connect_timeout parameter is a client-side setting handled by libpq, the underlying C library used by psycopg2 [2][3]. It defines the maximum number of seconds to wait while attempting to establish the initial connection to the PostgreSQL server [1]. If the connection cannot be established within this duration (e.g., due to network issues or the server being unreachable), the attempt is aborted, and psycopg2 raises an OperationalError [4][1]. This parameter does not affect queries or operations once the connection is successfully established [1]. SET statement_timeout Behavior The SET statement_timeout command is a server-side configuration that is applied within the context of a specific database session [1][5]. When executed via psycopg2 (e.g., cursor.execute("SET statement_timeout = 5000")), it instructs the PostgreSQL server to automatically abort any subsequent SQL statement in that session that exceeds the specified duration (measured in milliseconds) [6][5][7]. Key Differences 1. Scope: connect_timeout is a connection-phase constraint enforced by the client driver (libpq) [1]; statement_timeout is a query-execution constraint enforced by the database server [6][7]. 2. Purpose: connect_timeout prevents the application from hanging during network attempts to reach the database [1]. statement_timeout prevents runaway or long-running queries from consuming excessive server resources [5][7]. 3. Interaction: The two settings are independent. Setting a short connect_timeout does not help if a query itself takes too long to execute once connected, and setting a statement_timeout does not help if the application cannot connect to the server in the first place [1]. For robust applications, it is common to use both: connect_timeout to handle initial connectivity failures, and statement_timeout to safeguard against unexpected query latency [1]. Note that because statement_timeout is session-scoped, it persists for the lifetime of that specific psycopg2 connection until explicitly changed or the session ends [5].

Citations:


🏁 Script executed:

# Inspect the cache implementation and related configuration to check the health path, timeouts, and validation.
set -e
git ls-files 'ols/src/cache/postgres_cache.py' 'ols/src/**/*.py' | sed -n '1,120p'
echo '--- postgres_cache outline ---'
ast-grep outline ols/src/cache/postgres_cache.py --view expanded
echo '--- config candidates ---'
rg -n "health_check_interval|lock_timeout|statement_timeout|connect_timeout|ssl_mode|tls_security_profile" ols/src -g '*.py'

Repository: openshift/lightspeed-service

Length of output: 4309


🏁 Script executed:

# Read the relevant sections of postgres_cache.py and the config model to verify how the health loop behaves.
set -e
sed -n '1,320p' ols/src/cache/postgres_cache.py
echo '--- config file candidates ---'
fd -a '.*(config|settings).*\.py$' ols/src | sed -n '1,120p'

Repository: openshift/lightspeed-service

Length of output: 12237


🏁 Script executed:

# Inspect the actual health loop and the PostgresConfig definition, focusing on interval/timeout validation.
set -e
python3 - <<'PY'
from pathlib import Path
for path in [
    Path('ols/src/cache/postgres_cache.py'),
    Path('ols/src/config.py'),
]:
    if path.exists():
        print(f'FILE: {path}')
        for i, line in enumerate(path.read_text().splitlines(), 1):
            if 'health_check_interval' in line or 'lock_timeout' in line or 'statement_timeout' in line or 'connect_timeout' in line or '_health_check_loop' in line or '_connect_health' in line:
                print(f'{i}: {line}')
        print()
PY

Repository: openshift/lightspeed-service

Length of output: 707


Bound the health-check connect and probe. ready() now depends on this loop, but _connect_health() has no connect_timeout and the SELECT 1 probe has no statement_timeout. A slow network or hung server can block the health thread and leave _health_status stale; set explicit timeouts for both steps.

🧰 Tools
🪛 Ruff (0.15.18)

[warning] 194-194: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ols/src/cache/postgres_cache.py` around lines 189 - 219, The health-check
path in _connect_health and _health_check_loop can block indefinitely because
neither the psycopg2.connect call nor the SELECT 1 probe has a timeout. Update
the PostgresCache health connection setup to pass an explicit connect timeout in
connect_kwargs and add a statement timeout before executing the cursor probe, so
ready() does not wait on a hung network or server and _health_status can recover
promptly.

Comment thread ols/utils/postgres.py
Comment on lines +33 to +45
except (psycopg2.OperationalError, psycopg2.InterfaceError) as e:
logger.warning(
"Connection error in %s, attempting reconnect: %s", f.__name__, e
)
if hasattr(connectable, "_mark_unhealthy"):
connectable._mark_unhealthy()
try:
connectable.connect()
except Exception as reconnect_err:
raise CacheError(
f"reconnect failed in {f.__name__}", reconnect_err
) from reconnect_err
return f(connectable, *args, **kwargs)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the relevant file and nearby symbols.
git ls-files | rg '^ols/utils/postgres\.py$|^ols/.+\.py$' >/dev/null
ast-grep outline ols/utils/postgres.py --view expanded || true

echo '--- file excerpt ---'
wc -l ols/utils/postgres.py
sed -n '1,220p' ols/utils/postgres.py

echo '--- usages ---'
rg -n "insert_or_append|_mark_unhealthy|connect\(|OperationalError|InterfaceError|CacheError" ols -g '*.py'

Repository: openshift/lightspeed-service

Length of output: 8349


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the cache implementation to understand commit/append semantics.
fd -a 'postgres.py' ols
fd -a 'cache.py' ols
fd -a '*cache*' ols | sed -n '1,120p'

echo '--- Postgres cache excerpt ---'
sed -n '1,260p' ols/utils/postgres.py

echo '--- cache-related excerpts ---'
rg -n "class .*Cache|def insert_or_append|append|commit|upsert|ON CONFLICT|executemany|execute" ols -g '*.py'

Repository: openshift/lightspeed-service

Length of output: 639


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '--- postgres_cache outline ---'
ast-grep outline ols/src/cache/postgres_cache.py --view expanded || true

echo '--- insert_or_append section ---'
sed -n '280,380p' ols/src/cache/postgres_cache.py

echo '--- decorators/usages in postgres_cache ---'
rg -n "^\\s*`@connection`|def (get|insert_or_append|delete|list|set_topic_summary)|commit\\(|ON CONFLICT|DO UPDATE|DO NOTHING|append" ols/src/cache/postgres_cache.py

Repository: openshift/lightspeed-service

Length of output: 6822


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '--- cache interface excerpt ---'
sed -n '1,220p' ols/src/cache/cache.py

echo '--- in-memory implementation excerpt ---'
sed -n '1,180p' ols/src/cache/in_memory_cache.py

Repository: openshift/lightspeed-service

Length of output: 10989


🏁 Script executed:

#!/bin/bash
set -euo pipefail

python3 - <<'PY'
from pathlib import Path
path = Path('ols/src/cache/postgres_cache.py')
text = path.read_text()
for needle in ['`@connection`', 'def insert_or_append', 'def get', 'def delete', 'def list', 'def set_topic_summary']:
    idx = text.find(needle)
    if idx != -1:
        start = text.rfind('\n', 0, idx-120)
        end = text.find('\n\n', idx)
        print(f'### {needle}')
        print(text[max(0, idx-120):min(len(text), idx+1200)])
        print()
PY

Repository: openshift/lightspeed-service

Length of output: 8205


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n -A80 -B20 "^\\s*`@connection`|^\\s*def insert_or_append|^\\s*def get|^\\s*def delete|^\\s*def list|^\\s*def set_topic_summary" ols/src/cache/postgres_cache.py

Repository: openshift/lightspeed-service

Length of output: 11834


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '280,370p' ols/src/cache/postgres_cache.py

Repository: openshift/lightspeed-service

Length of output: 3988


🏁 Script executed:

#!/bin/bash
set -euo pipefail

cat -n ols/src/cache/postgres_cache.py | sed -n '280,370p'

Repository: openshift/lightspeed-service

Length of output: 4625


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "`@connection`" ols -g '*.py'

Repository: openshift/lightspeed-service

Length of output: 724


🏁 Script executed:

#!/bin/bash
set -euo pipefail

cat -n ols/src/cache/postgres_cache.py | sed -n '1,260p'

Repository: openshift/lightspeed-service

Length of output: 11615


🏁 Script executed:

#!/bin/bash
set -euo pipefail

cat -n ols/src/cache/postgres_cache.py | sed -n '285,360p'

Repository: openshift/lightspeed-service

Length of output: 3885


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "class .*\\(PostgresBase\\)" ols -g '*.py'

Repository: openshift/lightspeed-service

Length of output: 245


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '--- PostgresBase subclasses ---'
rg -n "class .*\\(PostgresBase\\)" ols -g '*.py'

echo '--- connection decorator uses ---'
rg -n "`@connection`" ols -g '*.py' || true

Repository: openshift/lightspeed-service

Length of output: 869


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "`@connection`" ols -g '*.py' || true

Repository: openshift/lightspeed-service

Length of output: 724


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "`@connection`" ols -g '*.py' || true

Repository: openshift/lightspeed-service

Length of output: 724


Limit @connection retries on write methods. PostgresCache.insert_or_append is retried after OperationalError/InterfaceError, but it appends to the stored history and increments message_count before commit(). If the connection drops after PostgreSQL has already committed, the retry can duplicate the entry and metadata update.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ols/utils/postgres.py` around lines 33 - 45, The `@connection` retry logic in
the postgres decorator is causing duplicate writes for methods like
PostgresCache.insert_or_append because the operation may have already committed
before reconnect and retry. Update the retry behavior in the connection wrapper
to avoid automatically retrying write/mutating methods after
OperationalError/InterfaceError, and keep the retry path only for safe read
operations or explicitly idempotent calls identified by the wrapped function
name.

Comment thread ols/utils/postgres.py
Comment on lines +106 to +108
cursor = self.connection.cursor()
cursor.execute("SET statement_timeout = %s", (str(config.statement_timeout),))
cursor.close()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Close the statement-timeout cursor on failure too.

If SET statement_timeout raises, cursor.close() is skipped after schema initialization. Use try/finally for the new cursor.

Suggested cleanup
         self.connection.autocommit = True
         cursor = self.connection.cursor()
-        cursor.execute("SET statement_timeout = %s", (str(config.statement_timeout),))
-        cursor.close()
+        try:
+            cursor.execute("SET statement_timeout = %s", (str(config.statement_timeout),))
+        finally:
+            cursor.close()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
cursor = self.connection.cursor()
cursor.execute("SET statement_timeout = %s", (str(config.statement_timeout),))
cursor.close()
cursor = self.connection.cursor()
try:
cursor.execute("SET statement_timeout = %s", (str(config.statement_timeout),))
finally:
cursor.close()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ols/utils/postgres.py` around lines 106 - 108, The new cursor in the
statement-timeout setup can leak if `cursor.execute` fails during schema
initialization. Update the `Postgres` initialization flow around the `cursor =
self.connection.cursor()` and `SET statement_timeout` call to use a
`try/finally` so `cursor.close()` always runs, even on exceptions; keep the fix
localized to this cursor-handling block.

Comment on lines +20 to +23
def _suppress_health_loop():
"""Prevent the background health-check thread from making real DB calls."""
with patch.object(PostgresCache, "_health_check_loop"):
yield

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the file and inspect the relevant sections with line numbers.
git ls-files tests/unit/cache/test_postgres_cache.py
wc -l tests/unit/cache/test_postgres_cache.py
sed -n '1,120p' tests/unit/cache/test_postgres_cache.py
printf '\n--- mid ---\n'
sed -n '760,890p' tests/unit/cache/test_postgres_cache.py

Repository: openshift/lightspeed-service

Length of output: 7621


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Extract function signatures and comments in the target file.
python3 - <<'PY'
from pathlib import Path
import re

path = Path('tests/unit/cache/test_postgres_cache.py')
text = path.read_text()

for i, line in enumerate(text.splitlines(), start=1):
    if re.match(r'^\s*(def|async def)\s+', line) or 'TODO' in line or line.strip().startswith('#'):
        print(f'{i}: {line}')
PY

Repository: openshift/lightspeed-service

Length of output: 7721


Add return annotations to the new fixture and tests. tests/unit/cache/test_postgres_cache.py:20, 791-865 has an untyped autouse fixture and several untyped test functions; add Generator[None, None, None] for _suppress_health_loop and -> None for the new tests.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/cache/test_postgres_cache.py` around lines 20 - 23, The new pytest
fixture and several test functions in Postgres cache tests are missing type
annotations. Update _suppress_health_loop to declare Generator[None, None, None]
as its return type, and add -> None to the affected test functions in
test_postgres_cache so the new autouse fixture and tests are fully typed and
consistent with the rest of the module.

Source: Coding guidelines

Comment on lines +108 to +147
def test_operational_error_triggers_reconnect_and_retry(self):
"""OperationalError causes reconnect + retry, succeeding on second attempt."""
c = self.Connectable()
c.connect()
result = c.do_work_operational_error()
assert result == "recovered"
assert c._unhealthy_marked is True

def test_interface_error_triggers_reconnect_and_retry(self):
"""InterfaceError causes reconnect + retry, succeeding on second attempt."""
c = self.Connectable()
c.connect()
result = c.do_work_interface_error()
assert result == "recovered"
assert c._unhealthy_marked is True

def test_database_error_wraps_in_cache_error_no_retry(self):
"""DatabaseError is wrapped in CacheError without reconnect attempt."""
c = self.Connectable()
c.connect()
with pytest.raises(CacheError, match="SQL syntax error"):
c.do_work_database_error()

def test_connection_error_reconnect_failure_raises_cache_error(self):
"""When reconnect fails after OperationalError, CacheError is raised."""
c = self.Connectable()
c.connect()
# Make connect() fail on reconnect attempt
original_connect = c.connect
call_count = [0]

def failing_connect():
call_count[0] += 1
if call_count[0] > 0:
raise psycopg2.OperationalError("cannot connect")
original_connect()

c.connect = failing_connect
with pytest.raises(CacheError, match="reconnect failed"):
c.do_work_operational_error()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

sed -n '1,240p' tests/unit/utils/test_postgres.py

Repository: openshift/lightspeed-service

Length of output: 9034


🏁 Script executed:

sed -n '1,240p' tests/unit/utils/test_postgres.py

Repository: openshift/lightspeed-service

Length of output: 9034


Add the missing -> None annotations and tighten the no-retry assertion

The new test_* methods and failing_connect still need explicit return types, and test_database_error_wraps_in_cache_error_no_retry should also assert that reconnect/retry is not attempted.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/utils/test_postgres.py` around lines 108 - 147, Add explicit ->
None return annotations to the new test methods in test_postgres.py, including
the helper failing_connect nested in
test_connection_error_reconnect_failure_raises_cache_error, to keep the test
signatures consistent. Also tighten
test_database_error_wraps_in_cache_error_no_retry by asserting the Connectable
instance does not attempt reconnect/retry when do_work_database_error is called;
use the existing Connectable, connect, and do_work_database_error symbols to
place the assertion near the current CacheError check.

Source: Coding guidelines

Comment on lines +124 to +129
def test_database_error_wraps_in_cache_error_no_retry(self):
"""DatabaseError is wrapped in CacheError without reconnect attempt."""
c = self.Connectable()
c.connect()
with pytest.raises(CacheError, match="SQL syntax error"):
c.do_work_database_error()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Assert the “no retry” part of this case.

This test only checks CacheError wrapping. If the decorator attempted a reconnect/retry and still raised CacheError, the test could pass. Patch connect after the initial connection and assert it is not called. As per coding guidelines, tests/**/*.py: Assert specific values and behaviors in tests, not just that code runs without error.

Suggested assertion
         c = self.Connectable()
         c.connect()
-        with pytest.raises(CacheError, match="SQL syntax error"):
-            c.do_work_database_error()
+        with patch.object(c, "connect") as mock_connect:
+            with pytest.raises(CacheError, match="SQL syntax error"):
+                c.do_work_database_error()
+            mock_connect.assert_not_called()
+        assert c._unhealthy_marked is False
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def test_database_error_wraps_in_cache_error_no_retry(self):
"""DatabaseError is wrapped in CacheError without reconnect attempt."""
c = self.Connectable()
c.connect()
with pytest.raises(CacheError, match="SQL syntax error"):
c.do_work_database_error()
def test_database_error_wraps_in_cache_error_no_retry(self):
"""DatabaseError is wrapped in CacheError without reconnect attempt."""
c = self.Connectable()
c.connect()
with patch.object(c, "connect") as mock_connect:
with pytest.raises(CacheError, match="SQL syntax error"):
c.do_work_database_error()
mock_connect.assert_not_called()
assert c._unhealthy_marked is False
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/utils/test_postgres.py` around lines 124 - 129, The test
`test_database_error_wraps_in_cache_error_no_retry` only verifies that
`do_work_database_error()` raises `CacheError`, but it does not prove the
reconnect path was skipped. After the initial `c.connect()` call, patch or mock
`Connectable.connect` and assert it is not called when
`do_work_database_error()` raises, so this test checks the “no retry” behavior
as well as the `CacheError` wrapping.

Source: Coding guidelines

@sriroopar sriroopar force-pushed the postgres-auto-recovery branch 2 times, most recently from eea3052 to 15f8455 Compare June 24, 2026 20:16
Implement background health-check loop, smarter error classification
in the @connection decorator, operation timeouts, and enhanced
liveness/readiness probes so OLS can recover automatically when the
backing PostgreSQL database is restarted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sriroopar sriroopar force-pushed the postgres-auto-recovery branch from 15f8455 to 6e37719 Compare June 24, 2026 20:32

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ols/src/cache/postgres_cache.py`:
- Around line 350-353: The DatabaseError handling in PostgresCache should use
the same rollback protection as the operational-error path, because calling
connection.rollback() directly can hide the original psycopg2.DatabaseError and
prevent the CacheError wrapping from happening. Update the exception blocks in
PostgresCache.insert_or_append and the other matching DatabaseError path to call
_safe_rollback() instead of self.connection.rollback(), while keeping the
existing logger.error and CacheError("PostgresCache.insert_or_append", e)
behavior intact.
- Around line 184-188: The `_mark_unhealthy()` path only updates
`_health_status`, so repeated request-path failures from the `@connection` retry
flow do not advance `_consecutive_failures` toward the liveness threshold.
Update `_mark_unhealthy()` in `PostgresCache` to also increment the failure
counter under `_health_lock`, keeping the unhealthy flag and failure tracking in
sync so `is_healthy`/liveness checks can trip promptly after repeated
cache-operation DB failures.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c84be3c2-a66c-475a-91d3-5d332a00fb83

📥 Commits

Reviewing files that changed from the base of the PR and between 9868f61 and 6e37719.

📒 Files selected for processing (10)
  • docs/openapi.json
  • ols/app/endpoints/health.py
  • ols/app/models/config.py
  • ols/constants.py
  • ols/src/cache/postgres_cache.py
  • ols/utils/postgres.py
  • tests/unit/app/endpoints/test_health.py
  • tests/unit/app/models/test_config.py
  • tests/unit/cache/test_postgres_cache.py
  • tests/unit/utils/test_postgres.py
🚧 Files skipped from review as they are similar to previous changes (8)
  • ols/app/endpoints/health.py
  • ols/constants.py
  • tests/unit/cache/test_postgres_cache.py
  • tests/unit/app/endpoints/test_health.py
  • ols/app/models/config.py
  • tests/unit/utils/test_postgres.py
  • ols/utils/postgres.py
  • tests/unit/app/models/test_config.py

Comment on lines +184 to +188
def _mark_unhealthy(self) -> None:
"""Mark health status as unhealthy (dual-feed from cache operations)."""
with self._health_lock:
self._health_status = False

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Count operation-driven unhealthy events toward the failure threshold.

_mark_unhealthy() is the cache-operation feed from the @connection retry path, but it only flips _health_status; liveness threshold checks based on _consecutive_failures can remain below threshold until the background probe also fails. Increment the counter here so repeated request-path DB failures can trip liveness promptly.

Suggested fix
 def _mark_unhealthy(self) -> None:
     """Mark health status as unhealthy (dual-feed from cache operations)."""
     with self._health_lock:
+        self._consecutive_failures += 1
         self._health_status = False
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _mark_unhealthy(self) -> None:
"""Mark health status as unhealthy (dual-feed from cache operations)."""
with self._health_lock:
self._health_status = False
def _mark_unhealthy(self) -> None:
"""Mark health status as unhealthy (dual-feed from cache operations)."""
with self._health_lock:
self._consecutive_failures += 1
self._health_status = False
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ols/src/cache/postgres_cache.py` around lines 184 - 188, The
`_mark_unhealthy()` path only updates `_health_status`, so repeated request-path
failures from the `@connection` retry flow do not advance
`_consecutive_failures` toward the liveness threshold. Update
`_mark_unhealthy()` in `PostgresCache` to also increment the failure counter
under `_health_lock`, keeping the unhealthy flag and failure tracking in sync so
`is_healthy`/liveness checks can trip promptly after repeated cache-operation DB
failures.

Comment on lines 350 to 353
except psycopg2.DatabaseError as e:
self.connection.rollback()
logger.error("PostgresCache.insert_or_append: %s", e)
raise CacheError("PostgresCache.insert_or_append", e) from e

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Use the safe rollback helper in DatabaseError paths too.

These paths intend to wrap psycopg2.DatabaseError as CacheError, but a failed direct rollback() would mask the original error and skip the wrapping/log context. Reuse _safe_rollback() as in the operational-error path.

Suggested fix
 except psycopg2.DatabaseError as e:
-    self.connection.rollback()
+    self._safe_rollback()
     logger.error("PostgresCache.insert_or_append: %s", e)
     raise CacheError("PostgresCache.insert_or_append", e) from e
 except psycopg2.DatabaseError as e:
-    self.connection.rollback()
+    self._safe_rollback()
     logger.error("PostgresCache.delete: %s", e)
     raise CacheError("PostgresCache.delete", e) from e

Also applies to: 389-392

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ols/src/cache/postgres_cache.py` around lines 350 - 353, The DatabaseError
handling in PostgresCache should use the same rollback protection as the
operational-error path, because calling connection.rollback() directly can hide
the original psycopg2.DatabaseError and prevent the CacheError wrapping from
happening. Update the exception blocks in PostgresCache.insert_or_append and the
other matching DatabaseError path to call _safe_rollback() instead of
self.connection.rollback(), while keeping the existing logger.error and
CacheError("PostgresCache.insert_or_append", e) behavior intact.

@openshift-ci

openshift-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

@sriroopar: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/security 6e37719 link false /test security

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant