Skip to content

chore(api): section 2 hardening — pino, slow-query, retries, ref-cache#203

Merged
kashkoool merged 5 commits into
mainfrom
chore/api-section2-hardening
May 11, 2026
Merged

chore(api): section 2 hardening — pino, slow-query, retries, ref-cache#203
kashkoool merged 5 commits into
mainfrom
chore/api-section2-hardening

Conversation

@kashkoool

Copy link
Copy Markdown
Owner

Summary

Section 2 of the perf/reliability audit — 5 items batched into one PR. All low-risk: no schema migrations, no API contract changes, no money paths. Section 1 already merged in #201.

Tag What Key file(s)
O4 (minimum-viable) nestjs-pino wired at bootstrap → JSON logs in CloudWatch Logs Insights. 2 CloudWatch-critical emission sites refactored to object-form. app.module.ts, main.ts, client-error.controller.ts, request-logger.middleware.ts
O2 Prisma slow-query middleware via $extends (Prisma 7 removed $use). 500 ms threshold; PII-safe (model + action + duration only). prisma.service.ts
R2 AWS SDK { maxAttempts: 3, retryMode: 'adaptive' } on SES, SNS, S3, SecretsManager. PAY2M getAccessToken inline retry loop (3 attempts, exponential w/ jitter). sms.service.ts, email.service.ts, upload.service.ts, prisma.service.ts, payment.service.ts
C1 New ReferenceDataCacheService (versioned-key INCR invalidation, fail-open). 5 catalog endpoints cached, 10 admin write-paths invalidate. reference-data-cache.service.ts, catalog.controller.ts, admin.service.ts
L2 DB pool runbook (off-repo per RULE #3*.md gitignored). ~/OneDrive/Desktop/DB_POOL_RUNBOOK.md

AWS-side compatibility (verified via AWS MCP before any code change)

CloudWatch metric filter Pattern Survives pino?
JadwalClientErrorCountJadwal/Web ClientErrorCount "event":"CLIENT_ERROR" ✅ — emission refactored to object-form so the top-level field survives
jadwal-redis-throttler-failedJadwal/API RedisThrottlerFailures "Redis throttler failed" substring ✅ — pino preserves substring inside msg

No subscription filters → no third-party log shipping breaks. No AWS infra changes needed.

Items deliberately out of scope (future PRs)

  • R3 Idempotency on payment initiate — money path, needs dedicated plan-mode PR
  • R4 Outbox pattern — needs new Prisma model
  • R1 Circuit breakers — needs measured baselines first
  • O4 full refactor — remaining ~102 string-form logger.X() calls will be converted opportunistically as files are touched
  • O3 Sentry spans — Sentry wired but inactive (no SENTRY_DSN); revisit if turned on

Test plan

  • npx tsc --noEmit clean
  • npx eslint on all 17 changed source files — no new warnings (only pre-existing "unused eslint-disable" entries)
  • 717 / 717 unit tests pass (up from 701 baseline — 16 net new tests)
    • New: client-error.controller.spec.ts pins the CloudWatch-filter contract
    • New: reference-data-cache.service.spec.ts — cache-aside + invalidation + fail-open + key safety + disabled-mode
    • Updated: prisma.service.spec.ts mock provides $extends
    • Updated: 5 admin specs add ReferenceDataCacheService provider
    • Updated: email-service.spec.ts asserts new SES retry config
  • Integration suite (CI gates this)
  • CI: lint + type-check + CodeQL + Semgrep + Trivy + Dependency Audit + Secret Detection

Post-deploy verification (AWS-side)

  • CloudWatch Logs Insights on /ecs/jadwal-api: filter event = "CLIENT_ERROR" — expect hits after a deliberate test client-error post
  • Jadwal/Web ClientErrorCount metric increments within 5 min on a triggered frontend error
  • aws logs describe-metric-filters --log-group-name /ecs/jadwal-api — both filters still present
  • filter event = "SLOW_QUERY" | stats count() by model, action — confirm slow queries surface
  • Jadwal/API RedisThrottlerFailures alarm unchanged

Risk

Low-medium. Logger swap touches every log line but pino's string-arg handling is well-understood. The 2 refactored emission sites are the real risk; the new unit test pins them. AWS SDK retry config is the AWS-recommended setting. p-retry replaced with a ~30-line inline loop (no new npm dep). Redis cache writes are fail-soft. Rollback = git revert — no infra state touched.

Operator note: CloudWatch Logs Insights queries written before this PR that depend on the legacy NestJS string log format may need rewriting against the new JSON shape.

Five Section-2 audit items batched in one PR. All low-risk: no schema
migrations, no API contract changes, no money path. AWS-side
compatibility verified via aws logs describe-metric-filters before
touching anything.

O4 (minimum-viable, CloudWatch-only):
  - Install nestjs-pino + pino-http + pino-pretty (dev).
  - LoggerModule.forRoot wired in app.module.ts with prod/dev transport,
    redact paths for secrets, customLogLevel, and genReqId that respects
    inbound x-request-id.
  - main.ts: bufferLogs: true → app.useLogger(app.get(PinoLogger)).
  - Two CRITICAL emission sites refactored from logger.X(JSON.stringify)
    to object-form: client-error.controller.ts (CloudWatch metric filter
    JadwalClientErrorCount depends on `"event":"CLIENT_ERROR"` at top
    level — string form would escape inside pino's msg field and silently
    flatline the metric) and request-logger.middleware.ts.
  - Remaining ~102 string-form logger calls keep working unchanged
    (pino wraps the string in `{msg:"..."}`); convert opportunistically
    as files are touched.

O2:
  - Prisma slow-query observability via $extends (Prisma 7 removed $use).
    500 ms threshold; emits `event: 'SLOW_QUERY'` with model + action +
    durationMs only. PII-safe by construction — never logs args.

R2 (retry on external calls):
  - { maxAttempts: 3, retryMode: 'adaptive' } on all 4 AWS SDK v3
    clients: SES, SNS, S3, SecretsManager. Adaptive retry only fires on
    retryable error types (network errors pre-request, throttling) — no
    risk of double-publish on application errors.
  - PAY2M getAccessToken wrapped in inline retry loop (3 attempts,
    exponential backoff with jitter; 4xx aborts immediately, 5xx + 429
    retry). Inline rather than p-retry dep because p-retry v5+ is
    ESM-only and project tsconfig is module:commonjs.

C1 (Redis reference-data cache):
  - New ReferenceDataCacheService modeled on AvailabilityCacheService:
    versioned-key INCR invalidation, fail-open on Redis errors, 256 KiB
    payload cap, configurable TTL (default 3600s, bounded [60, 86400]).
  - Registered in RedisModule (Global).
  - Wired into 5 catalog endpoints: countries, categories, all-cities,
    cities-by-country, platform-info.
  - 10 invalidation hooks in admin.service: country/category/city CRUD
    + updatePlatformSettings. Fire-and-forget — cache failure must
    NEVER fail the admin write.

L2:
  - DB pool runbook written off-repo at
    C:\Users\loaek\OneDrive\Desktop\DB_POOL_RUNBOOK.md (per RULE #3 —
    *.md gitignored).

Tests:
  - New client-error.controller.spec.ts pins the CloudWatch
    metric-filter compat contract (logger.error MUST receive object,
    not string).
  - New reference-data-cache.service.spec.ts covers cache-aside,
    versioned invalidation, fail-open, key safety.
  - prisma.service.spec.ts mock now provides $extends (returns self).
  - admin.service + admin-*  specs add ReferenceDataCacheService
    provider.
  - email-service.spec.ts SES constructor assertion updated to expect
    maxAttempts + retryMode.
  - 717 unit tests pass (up from 701 — 16 net new).

Type-check clean. Lint warnings unchanged (only pre-existing "unused
eslint-disable directive" entries in upload/email/payment services).

Items deliberately out of scope (future PRs): R1 circuit breakers,
R3 idempotency on payment initiate, R4 outbox pattern, O4 full
refactor of remaining 102 string-form logger calls, O3 Sentry spans
(Sentry inactive in prod).
@coderabbitai

coderabbitai Bot commented May 11, 2026

Copy link
Copy Markdown

Warning

Rate limit exceeded

@kashkoool has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 25 minutes and 42 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6b5dd6f7-e50f-4005-9657-226ccca6263d

📥 Commits

Reviewing files that changed from the base of the PR and between 124dcc3 and 9f4fc4b.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json, !package-lock.json, !**/package-lock.json
📒 Files selected for processing (27)
  • apps/api/.env.example
  • apps/api/package.json
  • apps/api/src/admin/admin.service.ts
  • apps/api/src/app.module.ts
  • apps/api/src/catalog/catalog.controller.ts
  • apps/api/src/common/controllers/client-error.controller.ts
  • apps/api/src/common/middleware/request-logger.middleware.ts
  • apps/api/src/common/services/upload.service.ts
  • apps/api/src/email/email.service.ts
  • apps/api/src/main.ts
  • apps/api/src/payment/payment.service.ts
  • apps/api/src/prisma/prisma.service.ts
  • apps/api/src/redis/redis.module.ts
  • apps/api/src/redis/reference-data-cache.service.ts
  • apps/api/src/sms/sms.service.ts
  • apps/api/src/vendor/vendor.controller.ts
  • apps/api/test/mocks/bookings-deps.mock.ts
  • apps/api/test/unit/admin-dashboard-charts.spec.ts
  • apps/api/test/unit/admin-mark-payout-unpaid.spec.ts
  • apps/api/test/unit/admin-payouts-inflight.spec.ts
  • apps/api/test/unit/admin-revert-payout-request.spec.ts
  • apps/api/test/unit/admin.service.spec.ts
  • apps/api/test/unit/client-error.controller.spec.ts
  • apps/api/test/unit/email-service.spec.ts
  • apps/api/test/unit/prisma.service.spec.ts
  • apps/api/test/unit/reference-data-cache.service.spec.ts
  • infra/ecs/api-task.json
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch chore/api-section2-hardening

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

kashkoool added 4 commits May 11, 2026 15:49
…tivity images

Two small follow-ups during the section-2 audit:

1. ECS task def — add `REFERENCE_CACHE_ENABLED` + `REFERENCE_CACHE_TTL_SEC`
   to the secrets array. Both SSM parameters are already created
   (REFERENCE_CACHE_ENABLED=true, REFERENCE_CACHE_TTL_SEC=3600). Without
   this, prod runs on the env-fallback defaults (functionally identical
   today) but the operator can't flip them without a code deploy.

2. POST /vendor/upload-image — admin gets 403 because VendorController is
   class-level @roles('VENDOR'). Admin moderation legitimately needs to
   upload images on behalf of vendors when editing activities. Method-
   level @roles('VENDOR', 'ADMIN') overrides the class-level via
   RolesGuard.getAllAndOverride(). No privilege-escalation risk: the
   endpoint only returns the S3 URL; setting it on an activity still
   requires the admin-gated /admin/activities/:id PATCH.

   Frontend was already calling this endpoint from the admin panel; the
   gate was the only thing blocking it.
…ddleware)

Found during the post-Section-2 conflict audit. `pino-http` defaults
`autoLogging` to `true` — it auto-emits one log line per HTTP response
on top of our existing `RequestLoggerMiddleware` which also emits a
structured HTTP_REQUEST log. Result: 2× log volume + CloudWatch
ingest cost per request.

Disable pino-http's auto-emit, keep our middleware as the sole HTTP
request logger (it has project-specific behavior — skip-path-prefixes,
masked IP in prod, userId resolution — that the default pino-http
emit doesn't carry).

Single-line config flip + clarifying comment. 717 unit tests still
green; no other code paths affected.
Recommendation B from the post-Section-2 AWS-wiring audit. The
/jadwal/prod/LOG_LEVEL SSM param has existed for a while and is
already injected into the container via the api-task.json secrets
array (line 46), but pino was hardcoded to NODE_ENV-based defaults
instead of reading it.

Operator can now flip the pino level for a debugging window
(`debug` / `trace`) without a code deploy — change the SSM value +
force-new-deployment on the ECS service. Current SSM value is
already 'info' (production-appropriate).

Defensive: pino's accepted levels are
{fatal, error, warn, info, debug, trace, silent} — different from
NestJS's legacy {log, error, warn, debug, verbose}. An invalid /
typo'd / NestJS-style SSM value silently falls back to the
NODE_ENV default rather than crashing pino at init.
Recommendation D from the AWS-wiring audit. The AvailabilityCacheService
has read AVAILABILITY_CACHE_ENABLED + AVAILABILITY_CACHE_TTL_SEC for
months but the env vars weren't surfaced to the container — operator
could only tune the cache by editing code. Created the two SSM params
(`/jadwal/prod/AVAILABILITY_CACHE_ENABLED=true`, `..._TTL_SEC=180`)
matching the existing code defaults, then added them to the api-task
secrets array so the running task reads them.

Functionally identical to today (values match code fallbacks). Adds
the operational lever to flip / retune without a redeploy — same
pattern as REFERENCE_CACHE_* added in this PR.
@kashkoool kashkoool merged commit d4e2228 into main May 11, 2026
22 checks passed
@kashkoool kashkoool deleted the chore/api-section2-hardening branch May 11, 2026 13:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant