OLS-3221 Add PostgreSQL auto-recovery after DB restart#2964
Conversation
📝 WalkthroughWalkthroughThe PR adds new Postgres cache timeout and liveness-threshold configuration, updates Postgres connection and cache health handling, and changes ChangesPostgres cache health and liveness
Sequence Diagram(s)sequenceDiagram
participant "Client"
participant "liveness_probe_get_method"
participant "config._conversation_cache"
participant "PostgresCache"
participant "HTTPException"
Client->>liveness_probe_get_method: GET /liveness
liveness_probe_get_method->>config._conversation_cache: inspect cache implementation
alt config._conversation_cache is PostgresCache
liveness_probe_get_method->>PostgresCache: read _consecutive_failures under _health_lock
alt failures >= liveness_db_failure_threshold
liveness_probe_get_method->>HTTPException: raise 503 "database unreachable"
else below threshold
liveness_probe_get_method-->>Client: LivenessResponse(alive=True)
end
else other cache
liveness_probe_get_method-->>Client: LivenessResponse(alive=True)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 12
🧹 Nitpick comments (1)
tests/unit/cache/test_postgres_cache.py (1)
836-865: 📐 Maintainability & Code Quality | 🔵 Trivial | 🏗️ Heavy liftExercise the health-check behavior instead of assigning its result.
These tests directly set
_health_statusand_consecutive_failures, so they would still pass if_health_check_loop()stopped updating either field. Extract a single-iteration health-check helper or inject a controllable sleep/event so the tests drive the success and failure paths. As per coding guidelines,tests/**/*.py: Assert specific values and behaviors in tests, not just that code runs without error.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/cache/test_postgres_cache.py` around lines 836 - 865, The health-check tests are bypassing the real behavior by mutating PostgresCache internals directly, so they won’t catch regressions in _health_check_loop(). Update the tests to drive the loop logic through a single-iteration helper or a controllable sleep/event on PostgresCache, then assert the resulting _health_status and _consecutive_failures values after simulated success/failure. Use the existing PostgresCache, _health_check_loop, _mark_unhealthy, and ready() symbols to locate the behavior and keep the assertions focused on observable state changes rather than direct field assignment.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ols/app/endpoints/health.py`:
- Around line 126-129: The 503 liveness response in the health endpoint is
returning an HTTPException shape instead of the declared LivenessResponse
contract. Update the liveness failure path in the health endpoint so the 503
body matches LivenessResponse (for example, by returning that model directly),
or change the documented 503 response model to the actual error payload; keep
the response schema and the OpenAPI declaration in sync around the liveness
handler.
In `@ols/app/models/config.py`:
- Around line 801-803: The config model currently accepts negative values for
statement_timeout, lock_timeout, and health_check_interval, so tighten
validation in the Pydantic config class that defines these fields. Add explicit
field constraints or validators on statement_timeout and lock_timeout to reject
negative integers, and make health_check_interval strictly positive so the
health-check loop and PostgresCache lock acquisition keep their timing
guarantees. Use the existing config model/field definitions in Config to enforce
these contracts at load time rather than downstream.
- Around line 1166-1167: `OLSConfig.__init__` is currently taking
`liveness_db_failure_threshold` directly from `data.get(...)`, so it can remain
a string or accept invalid values. Parse this field as an int when loading
config, and validate that the value is at least 1; if it is missing,
non-numeric, zero, or negative, raise `checks.InvalidConfigurationError` before
the value is used by `/liveness`.
In `@ols/src/cache/postgres_cache.py`:
- Around line 159-165: Validate the Postgres cache timing settings before they
reach the daemon: `health_check_interval` must not allow negative values or
zero-hot-loop behavior, and `lock_timeout` should reject negative values unless
indefinite blocking is explicitly intended. Add Pydantic validation in
`PostgresConfig` (using appropriate constrained types such as
non-negative/positive ints) and keep the runtime consumers in `PostgresCache`
safe by relying on those validated values in the initialization and health-check
logic. Update any related uses in `PostgresCache` and its config wiring so the
daemon never calls `time.sleep()` or lock acquisition with invalid values.
- Around line 189-219: The health-check path in _connect_health and
_health_check_loop can block indefinitely because neither the psycopg2.connect
call nor the SELECT 1 probe has a timeout. Update the PostgresCache health
connection setup to pass an explicit connect timeout in connect_kwargs and add a
statement timeout before executing the cursor probe, so ready() does not wait on
a hung network or server and _health_status can recover promptly.
- Around line 194-195: The broad exception handling in PostgresCache needs to be
narrowed to satisfy Ruff BLE001. Update the catch blocks in _connect_health,
_health_check_loop, _safe_rollback, and _safe_set_autocommit to handle only the
specific psycopg2 or attribute-related exceptions they can actually raise, and
keep the existing logger/debug behavior in those paths. If any catch-all is
intentionally required to keep the daemon alive, add an explicit # noqa: BLE001
on that except clause rather than leaving a bare except Exception.
In `@ols/utils/postgres.py`:
- Around line 106-108: The new cursor in the statement-timeout setup can leak if
`cursor.execute` fails during schema initialization. Update the `Postgres`
initialization flow around the `cursor = self.connection.cursor()` and `SET
statement_timeout` call to use a `try/finally` so `cursor.close()` always runs,
even on exceptions; keep the fix localized to this cursor-handling block.
- Around line 29-45: The wrapper in postgres connection handling is calling
connectable.connect() before the try/except, so a failed initial connect skips
the same recovery path as query failures. Move the initial connect() call into
the connection-error handling flow inside wrapper() so both the first connect
and the retry go through the same psycopg2.OperationalError/InterfaceError
handling, including _mark_unhealthy() and CacheError wrapping in the reconnect
path.
- Around line 33-45: The `@connection` retry logic in the postgres decorator is
causing duplicate writes for methods like PostgresCache.insert_or_append because
the operation may have already committed before reconnect and retry. Update the
retry behavior in the connection wrapper to avoid automatically retrying
write/mutating methods after OperationalError/InterfaceError, and keep the retry
path only for safe read operations or explicitly idempotent calls identified by
the wrapped function name.
In `@tests/unit/cache/test_postgres_cache.py`:
- Around line 20-23: The new pytest fixture and several test functions in
Postgres cache tests are missing type annotations. Update _suppress_health_loop
to declare Generator[None, None, None] as its return type, and add -> None to
the affected test functions in test_postgres_cache so the new autouse fixture
and tests are fully typed and consistent with the rest of the module.
In `@tests/unit/utils/test_postgres.py`:
- Around line 108-147: Add explicit -> None return annotations to the new test
methods in test_postgres.py, including the helper failing_connect nested in
test_connection_error_reconnect_failure_raises_cache_error, to keep the test
signatures consistent. Also tighten
test_database_error_wraps_in_cache_error_no_retry by asserting the Connectable
instance does not attempt reconnect/retry when do_work_database_error is called;
use the existing Connectable, connect, and do_work_database_error symbols to
place the assertion near the current CacheError check.
- Around line 124-129: The test
`test_database_error_wraps_in_cache_error_no_retry` only verifies that
`do_work_database_error()` raises `CacheError`, but it does not prove the
reconnect path was skipped. After the initial `c.connect()` call, patch or mock
`Connectable.connect` and assert it is not called when
`do_work_database_error()` raises, so this test checks the “no retry” behavior
as well as the `CacheError` wrapping.
---
Nitpick comments:
In `@tests/unit/cache/test_postgres_cache.py`:
- Around line 836-865: The health-check tests are bypassing the real behavior by
mutating PostgresCache internals directly, so they won’t catch regressions in
_health_check_loop(). Update the tests to drive the loop logic through a
single-iteration helper or a controllable sleep/event on PostgresCache, then
assert the resulting _health_status and _consecutive_failures values after
simulated success/failure. Use the existing PostgresCache, _health_check_loop,
_mark_unhealthy, and ready() symbols to locate the behavior and keep the
assertions focused on observable state changes rather than direct field
assignment.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3804766b-967a-49bb-b307-520f135d2133
📒 Files selected for processing (10)
ols/app/endpoints/health.pyols/app/models/config.pyols/app/models/models.pyols/constants.pyols/src/cache/postgres_cache.pyols/utils/postgres.pytests/unit/app/endpoints/test_health.pytests/unit/app/models/test_config.pytests/unit/cache/test_postgres_cache.pytests/unit/utils/test_postgres.py
| 503: { | ||
| "description": "Service is not alive", | ||
| "model": LivenessResponse, | ||
| }, |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Map relevant files first.
git ls-files | rg '^ols/app/endpoints/health\.py$|^.*health.*test.*\.py$|^.*test.*health.*\.py$|^.*liveness.*test.*\.py$|^.*health.*\.py$'
# Show outline for the target file.
ast-grep outline ols/app/endpoints/health.py --view expanded
# Inspect the relevant line range with numbers.
sed -n '1,240p' ols/app/endpoints/health.py | cat -n
# Search for liveness/503 response declarations and tests.
rg -n "liveness|503|LivenessResponse|HTTPException|JSONResponse|get_liveness_responses" ols -g '!**/__pycache__/**'Repository: openshift/lightspeed-service
Length of output: 12991
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show tests related to the health endpoint if present.
rg -n "liveness_probe_get_method|LivenessResponse|database unreachable|service is not alive|HTTP_503_SERVICE_UNAVAILABLE" . -g '*test*.py' -g '*tests*.py'
# Print the nearby definitions in the health endpoint.
sed -n '1,220p' ols/app/endpoints/health.py | cat -nRepository: openshift/lightspeed-service
Length of output: 7441
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the exact health endpoint implementation and adjacent helpers.
sed -n '1,220p' ols/app/endpoints/health.py | cat -nRepository: openshift/lightspeed-service
Length of output: 6023
Align the 503 liveness response with LivenessResponse.
raise HTTPException(detail={"alive": False, "reason": ...}) returns {"detail": ...}, which does not match the declared 503 LivenessResponse schema or the generated OpenAPI contract. Return a LivenessResponse payload with the 503 response body, or change the documented 503 model to match the actual error shape.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ols/app/endpoints/health.py` around lines 126 - 129, The 503 liveness
response in the health endpoint is returning an HTTPException shape instead of
the declared LivenessResponse contract. Update the liveness failure path in the
health endpoint so the 503 body matches LivenessResponse (for example, by
returning that model directly), or change the documented 503 response model to
the actual error payload; keep the response schema and the OpenAPI declaration
in sync around the liveness handler.
Source: Coding guidelines
| statement_timeout: int = constants.POSTGRES_STATEMENT_TIMEOUT | ||
| lock_timeout: int = constants.POSTGRES_LOCK_TIMEOUT | ||
| health_check_interval: int = constants.CACHE_HEALTH_CHECK_INTERVAL |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Reject invalid timeout and interval values.
These fields currently allow negative integers. That breaks the new recovery path: a negative health_check_interval will terminate the health-check thread on time.sleep(), and a negative lock_timeout breaks the bounded-wait guarantee in PostgresCache._acquire_lock(). Validate these at config load and keep health_check_interval strictly positive.
Suggested constraint tightening
- statement_timeout: int = constants.POSTGRES_STATEMENT_TIMEOUT
- lock_timeout: int = constants.POSTGRES_LOCK_TIMEOUT
- health_check_interval: int = constants.CACHE_HEALTH_CHECK_INTERVAL
+ statement_timeout: NonNegativeInt = constants.POSTGRES_STATEMENT_TIMEOUT
+ lock_timeout: NonNegativeInt = constants.POSTGRES_LOCK_TIMEOUT
+ health_check_interval: PositiveInt = constants.CACHE_HEALTH_CHECK_INTERVALAs per coding guidelines, "Use Pydantic models for configuration and data validation" and "Define function signatures and data shapes (interface contracts) before implementing; prefer explicit contracts over implicit ones".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| statement_timeout: int = constants.POSTGRES_STATEMENT_TIMEOUT | |
| lock_timeout: int = constants.POSTGRES_LOCK_TIMEOUT | |
| health_check_interval: int = constants.CACHE_HEALTH_CHECK_INTERVAL | |
| statement_timeout: NonNegativeInt = constants.POSTGRES_STATEMENT_TIMEOUT | |
| lock_timeout: NonNegativeInt = constants.POSTGRES_LOCK_TIMEOUT | |
| health_check_interval: PositiveInt = constants.CACHE_HEALTH_CHECK_INTERVAL |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ols/app/models/config.py` around lines 801 - 803, The config model currently
accepts negative values for statement_timeout, lock_timeout, and
health_check_interval, so tighten validation in the Pydantic config class that
defines these fields. Add explicit field constraints or validators on
statement_timeout and lock_timeout to reject negative integers, and make
health_check_interval strictly positive so the health-check loop and
PostgresCache lock acquisition keep their timing guarantees. Use the existing
config model/field definitions in Config to enforce these contracts at load time
rather than downstream.
Source: Coding guidelines
| liveness_db_failure_threshold: int = constants.LIVENESS_DB_FAILURE_THRESHOLD | ||
|
|
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the relevant config model and surrounding initialization logic.
FILE="ols/app/models/config.py"
wc -l "$FILE"
printf '\n--- outline ---\n'
ast-grep outline "$FILE" --view expanded | sed -n '1,220p'
printf '\n--- relevant line ranges ---\n'
sed -n '1120,1285p' "$FILE"Repository: openshift/lightspeed-service
Length of output: 17658
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect supporting definitions used by the config model.
rg -n "liveness_db_failure_threshold|LIVENESS_DB_FAILURE_THRESHOLD|InvalidConfigurationError|class OLSConfig|def __init__" ols/app/models/config.py ols/utils/checks.py ols/app/models -SRepository: openshift/lightspeed-service
Length of output: 13448
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Read the constructor and any validation helpers around the cited fields.
FILE="ols/app/models/config.py"
# Show the class definition and init method ranges if present.
ast-grep outline "$FILE" --match OLSConfig --view expanded | sed -n '1,260p'
printf '\n--- constructor / field assignments ---\n'
rg -n "def __init__|liveness_db_failure_threshold|data.get\\(\"liveness_db_failure_threshold\"|InvalidConfigurationError" "$FILE" -n -S -A 4 -B 4Repository: openshift/lightspeed-service
Length of output: 26243
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check whether assignment validation is enabled anywhere for this model hierarchy.
rg -n "validate_assignment|model_config|ConfigDict|class BaseModel" ols/app/models/config.py ols/app/models -S
printf '\n--- OLSConfig / BaseModel relevant slice ---\n'
sed -n '1,120p' ols/app/models/config.pyRepository: openshift/lightspeed-service
Length of output: 5473
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find where the liveness threshold is consumed.
rg -n "liveness_db_failure_threshold|/liveness|failures >= threshold|failure_threshold" ols -SRepository: openshift/lightspeed-service
Length of output: 829
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Read the liveness check implementation.
rg -n "/liveness|liveness" ols -S -A 6 -B 6 | sed -n '1,240p'Repository: openshift/lightspeed-service
Length of output: 7132
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python3 - <<'PY'
from pathlib import Path
text = Path("ols/app/models/config.py").read_text()
for needle in ["validate_assignment", "model_config", "ConfigDict", "BaseModel"]:
print(f"{needle}: {needle in text}")
PY
printf '\n--- liveness references ---\n'
rg -n "liveness_db_failure_threshold|/liveness|failures >?=|failure" ols -S -A 3 -B 3Repository: openshift/lightspeed-service
Length of output: 10511
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check whether Pydantic assignment validation is enabled for OLSConfig or its base classes.
python3 - <<'PY'
from pathlib import Path
text = Path("ols/app/models/config.py").read_text()
for needle in ("validate_assignment", "ConfigDict", "model_config"):
print(f"{needle}:", needle in text)
PY
printf '\n--- liveness usage ---\n'
rg -n "liveness_db_failure_threshold|/liveness|failures >=|failures >=" ols -S -A 4 -B 4Repository: openshift/lightspeed-service
Length of output: 3476
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check whether Pydantic assignment validation is enabled for OLSConfig or its base classes.
python3 - <<'PY'
from pathlib import Path
text = Path("ols/app/models/config.py").read_text()
for needle in ("validate_assignment", "ConfigDict", "model_config"):
print(f"{needle}:", needle in text)
PY
printf '\n--- liveness usage ---\n'
rg -n "liveness_db_failure_threshold|/liveness|failures >=|failures >=" ols -S -A 4 -B 4Repository: openshift/lightspeed-service
Length of output: 3476
Validate liveness_db_failure_threshold in OLSConfig.__init__. data.get(...) bypasses Pydantic here, so a YAML string like "3" reaches /liveness as a string and breaks failures >= threshold; 0 or negative values also force 503 on Postgres-backed instances. Parse it as an int and reject values < 1 with checks.InvalidConfigurationError.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ols/app/models/config.py` around lines 1166 - 1167, `OLSConfig.__init__` is
currently taking `liveness_db_failure_threshold` directly from `data.get(...)`,
so it can remain a string or accept invalid values. Parse this field as an int
when loading config, and validate that the value is at least 1; if it is
missing, non-numeric, zero, or negative, raise
`checks.InvalidConfigurationError` before the value is used by `/liveness`.
Source: Coding guidelines
| self._lock_timeout = config.lock_timeout | ||
| self.capacity = config.max_entries | ||
|
|
||
| self._health_status = True | ||
| self._consecutive_failures = 0 | ||
| self._health_lock = threading.Lock() | ||
| self._health_check_interval = config.health_check_interval |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Validate timeout and interval values before the daemon uses them.
health_check_interval reaches time.sleep() before the try, so a negative value kills the health thread and 0 creates a hot polling loop. lock_timeout should also reject negative values unless indefinite blocking is intentional. The related PostgresConfig contract currently exposes these as plain int; validate them there with PositiveInt/non-negative types. As per coding guidelines, ols/**/*.py: Use Pydantic models for configuration and data validation.
Also applies to: 211-215, 254-257
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ols/src/cache/postgres_cache.py` around lines 159 - 165, Validate the
Postgres cache timing settings before they reach the daemon:
`health_check_interval` must not allow negative values or zero-hot-loop
behavior, and `lock_timeout` should reject negative values unless indefinite
blocking is explicitly intended. Add Pydantic validation in `PostgresConfig`
(using appropriate constrained types such as non-negative/positive ints) and
keep the runtime consumers in `PostgresCache` safe by relying on those validated
values in the initialization and health-check logic. Update any related uses in
`PostgresCache` and its config wiring so the daemon never calls `time.sleep()`
or lock acquisition with invalid values.
Source: Coding guidelines
| def _connect_health(self) -> None: | ||
| """Establish a dedicated lightweight connection for health checks.""" | ||
| if self._health_connection is not None: | ||
| try: | ||
| self._health_connection.close() | ||
| except Exception as close_err: | ||
| logger.debug("Failed to close old health connection: %s", close_err) | ||
| config = self.connection_config | ||
| connect_kwargs: dict[str, Any] = { | ||
| "host": config.host, | ||
| "port": config.port, | ||
| "user": config.user, | ||
| "password": config.password, | ||
| "dbname": config.dbname, | ||
| "sslmode": config.ssl_mode, | ||
| "sslrootcert": config.ca_cert_path, | ||
| "gssencmode": config.gss_encmode, | ||
| **libpq_tls_params(config.tls_security_profile), | ||
| } | ||
| self._health_connection = psycopg2.connect(**connect_kwargs) | ||
| self._health_connection.autocommit = True | ||
|
|
||
| def _health_check_loop(self) -> None: | ||
| """Background loop that periodically checks DB connectivity.""" | ||
| while True: | ||
| time.sleep(self._health_check_interval) | ||
| try: | ||
| if self._health_connection is None or self._health_connection.closed: | ||
| self._connect_health() | ||
| with self._health_connection.cursor() as cursor: | ||
| cursor.execute("SELECT 1") |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🌐 Web query:
psycopg2 2.9.9 libpq connect_timeout parameter and PostgreSQL SET statement_timeout session behavior
💡 Result:
In the context of psycopg2 (version 2.9.9), the connect_timeout parameter and the PostgreSQL SET statement_timeout command serve distinct purposes at different stages of the database interaction lifecycle [1]. connect_timeout Parameter The connect_timeout parameter is a client-side setting handled by libpq, the underlying C library used by psycopg2 [2][3]. It defines the maximum number of seconds to wait while attempting to establish the initial connection to the PostgreSQL server [1]. If the connection cannot be established within this duration (e.g., due to network issues or the server being unreachable), the attempt is aborted, and psycopg2 raises an OperationalError [4][1]. This parameter does not affect queries or operations once the connection is successfully established [1]. SET statement_timeout Behavior The SET statement_timeout command is a server-side configuration that is applied within the context of a specific database session [1][5]. When executed via psycopg2 (e.g., cursor.execute("SET statement_timeout = 5000")), it instructs the PostgreSQL server to automatically abort any subsequent SQL statement in that session that exceeds the specified duration (measured in milliseconds) [6][5][7]. Key Differences 1. Scope: connect_timeout is a connection-phase constraint enforced by the client driver (libpq) [1]; statement_timeout is a query-execution constraint enforced by the database server [6][7]. 2. Purpose: connect_timeout prevents the application from hanging during network attempts to reach the database [1]. statement_timeout prevents runaway or long-running queries from consuming excessive server resources [5][7]. 3. Interaction: The two settings are independent. Setting a short connect_timeout does not help if a query itself takes too long to execute once connected, and setting a statement_timeout does not help if the application cannot connect to the server in the first place [1]. For robust applications, it is common to use both: connect_timeout to handle initial connectivity failures, and statement_timeout to safeguard against unexpected query latency [1]. Note that because statement_timeout is session-scoped, it persists for the lifetime of that specific psycopg2 connection until explicitly changed or the session ends [5].
Citations:
- 1: https://www.geeksforgeeks.org/python/how-to-set-timeouts-in-psycopg2-using-python/
- 2: https://www.psycopg.org/docs/install.html
- 3: https://pypi.org/project/psycopg2/
- 4: connect_timeout parameter not enough in a failover event psycopg/psycopg2#985
- 5: https://github.com/jbranchaud/til/blob/master/postgres/set-a-statement-timeout-threshold-for-a-session.md
- 6: https://www.postgresql.org/docs/current/runtime-config-client.html
- 7: https://postgresqlco.nf/doc/en/param/statement_timeout/
🏁 Script executed:
# Inspect the cache implementation and related configuration to check the health path, timeouts, and validation.
set -e
git ls-files 'ols/src/cache/postgres_cache.py' 'ols/src/**/*.py' | sed -n '1,120p'
echo '--- postgres_cache outline ---'
ast-grep outline ols/src/cache/postgres_cache.py --view expanded
echo '--- config candidates ---'
rg -n "health_check_interval|lock_timeout|statement_timeout|connect_timeout|ssl_mode|tls_security_profile" ols/src -g '*.py'Repository: openshift/lightspeed-service
Length of output: 4309
🏁 Script executed:
# Read the relevant sections of postgres_cache.py and the config model to verify how the health loop behaves.
set -e
sed -n '1,320p' ols/src/cache/postgres_cache.py
echo '--- config file candidates ---'
fd -a '.*(config|settings).*\.py$' ols/src | sed -n '1,120p'Repository: openshift/lightspeed-service
Length of output: 12237
🏁 Script executed:
# Inspect the actual health loop and the PostgresConfig definition, focusing on interval/timeout validation.
set -e
python3 - <<'PY'
from pathlib import Path
for path in [
Path('ols/src/cache/postgres_cache.py'),
Path('ols/src/config.py'),
]:
if path.exists():
print(f'FILE: {path}')
for i, line in enumerate(path.read_text().splitlines(), 1):
if 'health_check_interval' in line or 'lock_timeout' in line or 'statement_timeout' in line or 'connect_timeout' in line or '_health_check_loop' in line or '_connect_health' in line:
print(f'{i}: {line}')
print()
PYRepository: openshift/lightspeed-service
Length of output: 707
Bound the health-check connect and probe. ready() now depends on this loop, but _connect_health() has no connect_timeout and the SELECT 1 probe has no statement_timeout. A slow network or hung server can block the health thread and leave _health_status stale; set explicit timeouts for both steps.
🧰 Tools
🪛 Ruff (0.15.18)
[warning] 194-194: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ols/src/cache/postgres_cache.py` around lines 189 - 219, The health-check
path in _connect_health and _health_check_loop can block indefinitely because
neither the psycopg2.connect call nor the SELECT 1 probe has a timeout. Update
the PostgresCache health connection setup to pass an explicit connect timeout in
connect_kwargs and add a statement timeout before executing the cursor probe, so
ready() does not wait on a hung network or server and _health_status can recover
promptly.
| except (psycopg2.OperationalError, psycopg2.InterfaceError) as e: | ||
| logger.warning( | ||
| "Connection error in %s, attempting reconnect: %s", f.__name__, e | ||
| ) | ||
| if hasattr(connectable, "_mark_unhealthy"): | ||
| connectable._mark_unhealthy() | ||
| try: | ||
| connectable.connect() | ||
| except Exception as reconnect_err: | ||
| raise CacheError( | ||
| f"reconnect failed in {f.__name__}", reconnect_err | ||
| ) from reconnect_err | ||
| return f(connectable, *args, **kwargs) |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate the relevant file and nearby symbols.
git ls-files | rg '^ols/utils/postgres\.py$|^ols/.+\.py$' >/dev/null
ast-grep outline ols/utils/postgres.py --view expanded || true
echo '--- file excerpt ---'
wc -l ols/utils/postgres.py
sed -n '1,220p' ols/utils/postgres.py
echo '--- usages ---'
rg -n "insert_or_append|_mark_unhealthy|connect\(|OperationalError|InterfaceError|CacheError" ols -g '*.py'Repository: openshift/lightspeed-service
Length of output: 8349
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the cache implementation to understand commit/append semantics.
fd -a 'postgres.py' ols
fd -a 'cache.py' ols
fd -a '*cache*' ols | sed -n '1,120p'
echo '--- Postgres cache excerpt ---'
sed -n '1,260p' ols/utils/postgres.py
echo '--- cache-related excerpts ---'
rg -n "class .*Cache|def insert_or_append|append|commit|upsert|ON CONFLICT|executemany|execute" ols -g '*.py'Repository: openshift/lightspeed-service
Length of output: 639
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '--- postgres_cache outline ---'
ast-grep outline ols/src/cache/postgres_cache.py --view expanded || true
echo '--- insert_or_append section ---'
sed -n '280,380p' ols/src/cache/postgres_cache.py
echo '--- decorators/usages in postgres_cache ---'
rg -n "^\\s*`@connection`|def (get|insert_or_append|delete|list|set_topic_summary)|commit\\(|ON CONFLICT|DO UPDATE|DO NOTHING|append" ols/src/cache/postgres_cache.pyRepository: openshift/lightspeed-service
Length of output: 6822
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '--- cache interface excerpt ---'
sed -n '1,220p' ols/src/cache/cache.py
echo '--- in-memory implementation excerpt ---'
sed -n '1,180p' ols/src/cache/in_memory_cache.pyRepository: openshift/lightspeed-service
Length of output: 10989
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python3 - <<'PY'
from pathlib import Path
path = Path('ols/src/cache/postgres_cache.py')
text = path.read_text()
for needle in ['`@connection`', 'def insert_or_append', 'def get', 'def delete', 'def list', 'def set_topic_summary']:
idx = text.find(needle)
if idx != -1:
start = text.rfind('\n', 0, idx-120)
end = text.find('\n\n', idx)
print(f'### {needle}')
print(text[max(0, idx-120):min(len(text), idx+1200)])
print()
PYRepository: openshift/lightspeed-service
Length of output: 8205
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n -A80 -B20 "^\\s*`@connection`|^\\s*def insert_or_append|^\\s*def get|^\\s*def delete|^\\s*def list|^\\s*def set_topic_summary" ols/src/cache/postgres_cache.pyRepository: openshift/lightspeed-service
Length of output: 11834
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '280,370p' ols/src/cache/postgres_cache.pyRepository: openshift/lightspeed-service
Length of output: 3988
🏁 Script executed:
#!/bin/bash
set -euo pipefail
cat -n ols/src/cache/postgres_cache.py | sed -n '280,370p'Repository: openshift/lightspeed-service
Length of output: 4625
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "`@connection`" ols -g '*.py'Repository: openshift/lightspeed-service
Length of output: 724
🏁 Script executed:
#!/bin/bash
set -euo pipefail
cat -n ols/src/cache/postgres_cache.py | sed -n '1,260p'Repository: openshift/lightspeed-service
Length of output: 11615
🏁 Script executed:
#!/bin/bash
set -euo pipefail
cat -n ols/src/cache/postgres_cache.py | sed -n '285,360p'Repository: openshift/lightspeed-service
Length of output: 3885
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "class .*\\(PostgresBase\\)" ols -g '*.py'Repository: openshift/lightspeed-service
Length of output: 245
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '--- PostgresBase subclasses ---'
rg -n "class .*\\(PostgresBase\\)" ols -g '*.py'
echo '--- connection decorator uses ---'
rg -n "`@connection`" ols -g '*.py' || trueRepository: openshift/lightspeed-service
Length of output: 869
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "`@connection`" ols -g '*.py' || trueRepository: openshift/lightspeed-service
Length of output: 724
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "`@connection`" ols -g '*.py' || trueRepository: openshift/lightspeed-service
Length of output: 724
Limit @connection retries on write methods. PostgresCache.insert_or_append is retried after OperationalError/InterfaceError, but it appends to the stored history and increments message_count before commit(). If the connection drops after PostgreSQL has already committed, the retry can duplicate the entry and metadata update.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ols/utils/postgres.py` around lines 33 - 45, The `@connection` retry logic in
the postgres decorator is causing duplicate writes for methods like
PostgresCache.insert_or_append because the operation may have already committed
before reconnect and retry. Update the retry behavior in the connection wrapper
to avoid automatically retrying write/mutating methods after
OperationalError/InterfaceError, and keep the retry path only for safe read
operations or explicitly idempotent calls identified by the wrapped function
name.
| cursor = self.connection.cursor() | ||
| cursor.execute("SET statement_timeout = %s", (str(config.statement_timeout),)) | ||
| cursor.close() |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Close the statement-timeout cursor on failure too.
If SET statement_timeout raises, cursor.close() is skipped after schema initialization. Use try/finally for the new cursor.
Suggested cleanup
self.connection.autocommit = True
cursor = self.connection.cursor()
- cursor.execute("SET statement_timeout = %s", (str(config.statement_timeout),))
- cursor.close()
+ try:
+ cursor.execute("SET statement_timeout = %s", (str(config.statement_timeout),))
+ finally:
+ cursor.close()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| cursor = self.connection.cursor() | |
| cursor.execute("SET statement_timeout = %s", (str(config.statement_timeout),)) | |
| cursor.close() | |
| cursor = self.connection.cursor() | |
| try: | |
| cursor.execute("SET statement_timeout = %s", (str(config.statement_timeout),)) | |
| finally: | |
| cursor.close() |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ols/utils/postgres.py` around lines 106 - 108, The new cursor in the
statement-timeout setup can leak if `cursor.execute` fails during schema
initialization. Update the `Postgres` initialization flow around the `cursor =
self.connection.cursor()` and `SET statement_timeout` call to use a
`try/finally` so `cursor.close()` always runs, even on exceptions; keep the fix
localized to this cursor-handling block.
| def _suppress_health_loop(): | ||
| """Prevent the background health-check thread from making real DB calls.""" | ||
| with patch.object(PostgresCache, "_health_check_loop"): | ||
| yield |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate the file and inspect the relevant sections with line numbers.
git ls-files tests/unit/cache/test_postgres_cache.py
wc -l tests/unit/cache/test_postgres_cache.py
sed -n '1,120p' tests/unit/cache/test_postgres_cache.py
printf '\n--- mid ---\n'
sed -n '760,890p' tests/unit/cache/test_postgres_cache.pyRepository: openshift/lightspeed-service
Length of output: 7621
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Extract function signatures and comments in the target file.
python3 - <<'PY'
from pathlib import Path
import re
path = Path('tests/unit/cache/test_postgres_cache.py')
text = path.read_text()
for i, line in enumerate(text.splitlines(), start=1):
if re.match(r'^\s*(def|async def)\s+', line) or 'TODO' in line or line.strip().startswith('#'):
print(f'{i}: {line}')
PYRepository: openshift/lightspeed-service
Length of output: 7721
Add return annotations to the new fixture and tests. tests/unit/cache/test_postgres_cache.py:20, 791-865 has an untyped autouse fixture and several untyped test functions; add Generator[None, None, None] for _suppress_health_loop and -> None for the new tests.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/cache/test_postgres_cache.py` around lines 20 - 23, The new pytest
fixture and several test functions in Postgres cache tests are missing type
annotations. Update _suppress_health_loop to declare Generator[None, None, None]
as its return type, and add -> None to the affected test functions in
test_postgres_cache so the new autouse fixture and tests are fully typed and
consistent with the rest of the module.
Source: Coding guidelines
| def test_operational_error_triggers_reconnect_and_retry(self): | ||
| """OperationalError causes reconnect + retry, succeeding on second attempt.""" | ||
| c = self.Connectable() | ||
| c.connect() | ||
| result = c.do_work_operational_error() | ||
| assert result == "recovered" | ||
| assert c._unhealthy_marked is True | ||
|
|
||
| def test_interface_error_triggers_reconnect_and_retry(self): | ||
| """InterfaceError causes reconnect + retry, succeeding on second attempt.""" | ||
| c = self.Connectable() | ||
| c.connect() | ||
| result = c.do_work_interface_error() | ||
| assert result == "recovered" | ||
| assert c._unhealthy_marked is True | ||
|
|
||
| def test_database_error_wraps_in_cache_error_no_retry(self): | ||
| """DatabaseError is wrapped in CacheError without reconnect attempt.""" | ||
| c = self.Connectable() | ||
| c.connect() | ||
| with pytest.raises(CacheError, match="SQL syntax error"): | ||
| c.do_work_database_error() | ||
|
|
||
| def test_connection_error_reconnect_failure_raises_cache_error(self): | ||
| """When reconnect fails after OperationalError, CacheError is raised.""" | ||
| c = self.Connectable() | ||
| c.connect() | ||
| # Make connect() fail on reconnect attempt | ||
| original_connect = c.connect | ||
| call_count = [0] | ||
|
|
||
| def failing_connect(): | ||
| call_count[0] += 1 | ||
| if call_count[0] > 0: | ||
| raise psycopg2.OperationalError("cannot connect") | ||
| original_connect() | ||
|
|
||
| c.connect = failing_connect | ||
| with pytest.raises(CacheError, match="reconnect failed"): | ||
| c.do_work_operational_error() |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor
🧩 Analysis chain
🏁 Script executed:
sed -n '1,240p' tests/unit/utils/test_postgres.pyRepository: openshift/lightspeed-service
Length of output: 9034
🏁 Script executed:
sed -n '1,240p' tests/unit/utils/test_postgres.pyRepository: openshift/lightspeed-service
Length of output: 9034
Add the missing -> None annotations and tighten the no-retry assertion
The new test_* methods and failing_connect still need explicit return types, and test_database_error_wraps_in_cache_error_no_retry should also assert that reconnect/retry is not attempted.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/utils/test_postgres.py` around lines 108 - 147, Add explicit ->
None return annotations to the new test methods in test_postgres.py, including
the helper failing_connect nested in
test_connection_error_reconnect_failure_raises_cache_error, to keep the test
signatures consistent. Also tighten
test_database_error_wraps_in_cache_error_no_retry by asserting the Connectable
instance does not attempt reconnect/retry when do_work_database_error is called;
use the existing Connectable, connect, and do_work_database_error symbols to
place the assertion near the current CacheError check.
Source: Coding guidelines
| def test_database_error_wraps_in_cache_error_no_retry(self): | ||
| """DatabaseError is wrapped in CacheError without reconnect attempt.""" | ||
| c = self.Connectable() | ||
| c.connect() | ||
| with pytest.raises(CacheError, match="SQL syntax error"): | ||
| c.do_work_database_error() |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Assert the “no retry” part of this case.
This test only checks CacheError wrapping. If the decorator attempted a reconnect/retry and still raised CacheError, the test could pass. Patch connect after the initial connection and assert it is not called. As per coding guidelines, tests/**/*.py: Assert specific values and behaviors in tests, not just that code runs without error.
Suggested assertion
c = self.Connectable()
c.connect()
- with pytest.raises(CacheError, match="SQL syntax error"):
- c.do_work_database_error()
+ with patch.object(c, "connect") as mock_connect:
+ with pytest.raises(CacheError, match="SQL syntax error"):
+ c.do_work_database_error()
+ mock_connect.assert_not_called()
+ assert c._unhealthy_marked is False📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def test_database_error_wraps_in_cache_error_no_retry(self): | |
| """DatabaseError is wrapped in CacheError without reconnect attempt.""" | |
| c = self.Connectable() | |
| c.connect() | |
| with pytest.raises(CacheError, match="SQL syntax error"): | |
| c.do_work_database_error() | |
| def test_database_error_wraps_in_cache_error_no_retry(self): | |
| """DatabaseError is wrapped in CacheError without reconnect attempt.""" | |
| c = self.Connectable() | |
| c.connect() | |
| with patch.object(c, "connect") as mock_connect: | |
| with pytest.raises(CacheError, match="SQL syntax error"): | |
| c.do_work_database_error() | |
| mock_connect.assert_not_called() | |
| assert c._unhealthy_marked is False |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/utils/test_postgres.py` around lines 124 - 129, The test
`test_database_error_wraps_in_cache_error_no_retry` only verifies that
`do_work_database_error()` raises `CacheError`, but it does not prove the
reconnect path was skipped. After the initial `c.connect()` call, patch or mock
`Connectable.connect` and assert it is not called when
`do_work_database_error()` raises, so this test checks the “no retry” behavior
as well as the `CacheError` wrapping.
Source: Coding guidelines
eea3052 to
15f8455
Compare
Implement background health-check loop, smarter error classification in the @connection decorator, operation timeouts, and enhanced liveness/readiness probes so OLS can recover automatically when the backing PostgreSQL database is restarted. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
15f8455 to
6e37719
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ols/src/cache/postgres_cache.py`:
- Around line 350-353: The DatabaseError handling in PostgresCache should use
the same rollback protection as the operational-error path, because calling
connection.rollback() directly can hide the original psycopg2.DatabaseError and
prevent the CacheError wrapping from happening. Update the exception blocks in
PostgresCache.insert_or_append and the other matching DatabaseError path to call
_safe_rollback() instead of self.connection.rollback(), while keeping the
existing logger.error and CacheError("PostgresCache.insert_or_append", e)
behavior intact.
- Around line 184-188: The `_mark_unhealthy()` path only updates
`_health_status`, so repeated request-path failures from the `@connection` retry
flow do not advance `_consecutive_failures` toward the liveness threshold.
Update `_mark_unhealthy()` in `PostgresCache` to also increment the failure
counter under `_health_lock`, keeping the unhealthy flag and failure tracking in
sync so `is_healthy`/liveness checks can trip promptly after repeated
cache-operation DB failures.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c84be3c2-a66c-475a-91d3-5d332a00fb83
📒 Files selected for processing (10)
docs/openapi.jsonols/app/endpoints/health.pyols/app/models/config.pyols/constants.pyols/src/cache/postgres_cache.pyols/utils/postgres.pytests/unit/app/endpoints/test_health.pytests/unit/app/models/test_config.pytests/unit/cache/test_postgres_cache.pytests/unit/utils/test_postgres.py
🚧 Files skipped from review as they are similar to previous changes (8)
- ols/app/endpoints/health.py
- ols/constants.py
- tests/unit/cache/test_postgres_cache.py
- tests/unit/app/endpoints/test_health.py
- ols/app/models/config.py
- tests/unit/utils/test_postgres.py
- ols/utils/postgres.py
- tests/unit/app/models/test_config.py
| def _mark_unhealthy(self) -> None: | ||
| """Mark health status as unhealthy (dual-feed from cache operations).""" | ||
| with self._health_lock: | ||
| self._health_status = False | ||
|
|
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Count operation-driven unhealthy events toward the failure threshold.
_mark_unhealthy() is the cache-operation feed from the @connection retry path, but it only flips _health_status; liveness threshold checks based on _consecutive_failures can remain below threshold until the background probe also fails. Increment the counter here so repeated request-path DB failures can trip liveness promptly.
Suggested fix
def _mark_unhealthy(self) -> None:
"""Mark health status as unhealthy (dual-feed from cache operations)."""
with self._health_lock:
+ self._consecutive_failures += 1
self._health_status = False📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _mark_unhealthy(self) -> None: | |
| """Mark health status as unhealthy (dual-feed from cache operations).""" | |
| with self._health_lock: | |
| self._health_status = False | |
| def _mark_unhealthy(self) -> None: | |
| """Mark health status as unhealthy (dual-feed from cache operations).""" | |
| with self._health_lock: | |
| self._consecutive_failures += 1 | |
| self._health_status = False |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ols/src/cache/postgres_cache.py` around lines 184 - 188, The
`_mark_unhealthy()` path only updates `_health_status`, so repeated request-path
failures from the `@connection` retry flow do not advance
`_consecutive_failures` toward the liveness threshold. Update
`_mark_unhealthy()` in `PostgresCache` to also increment the failure counter
under `_health_lock`, keeping the unhealthy flag and failure tracking in sync so
`is_healthy`/liveness checks can trip promptly after repeated cache-operation DB
failures.
| except psycopg2.DatabaseError as e: | ||
| self.connection.rollback() | ||
| logger.error("PostgresCache.insert_or_append: %s", e) | ||
| raise CacheError("PostgresCache.insert_or_append", e) from e |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Use the safe rollback helper in DatabaseError paths too.
These paths intend to wrap psycopg2.DatabaseError as CacheError, but a failed direct rollback() would mask the original error and skip the wrapping/log context. Reuse _safe_rollback() as in the operational-error path.
Suggested fix
except psycopg2.DatabaseError as e:
- self.connection.rollback()
+ self._safe_rollback()
logger.error("PostgresCache.insert_or_append: %s", e)
raise CacheError("PostgresCache.insert_or_append", e) from e except psycopg2.DatabaseError as e:
- self.connection.rollback()
+ self._safe_rollback()
logger.error("PostgresCache.delete: %s", e)
raise CacheError("PostgresCache.delete", e) from eAlso applies to: 389-392
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ols/src/cache/postgres_cache.py` around lines 350 - 353, The DatabaseError
handling in PostgresCache should use the same rollback protection as the
operational-error path, because calling connection.rollback() directly can hide
the original psycopg2.DatabaseError and prevent the CacheError wrapping from
happening. Update the exception blocks in PostgresCache.insert_or_append and the
other matching DatabaseError path to call _safe_rollback() instead of
self.connection.rollback(), while keeping the existing logger.error and
CacheError("PostgresCache.insert_or_append", e) behavior intact.
|
@sriroopar: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Implement background health-check loop, smarter error classification in the @connection decorator, operation timeouts, and enhanced liveness/readiness probes so OLS can recover automatically when the backing PostgreSQL database is restarted.
Description
Type of change
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
/livenessnow able to report a 503 status when the database is unavailable.Bug Fixes
ready()now reflects the latest background health status more reliably.Tests