chore(api): section 2 hardening — pino, slow-query, retries, ref-cache by kashkoool · Pull Request #203 · kashkoool/Jadwal

kashkoool · 2026-05-11T12:41:40Z

Summary

Section 2 of the perf/reliability audit — 5 items batched into one PR. All low-risk: no schema migrations, no API contract changes, no money paths. Section 1 already merged in #201.

Tag	What	Key file(s)
O4 (minimum-viable)	`nestjs-pino` wired at bootstrap → JSON logs in CloudWatch Logs Insights. 2 CloudWatch-critical emission sites refactored to object-form.	`app.module.ts`, `main.ts`, `client-error.controller.ts`, `request-logger.middleware.ts`
O2	Prisma slow-query middleware via `$extends` (Prisma 7 removed `$use`). 500 ms threshold; PII-safe (model + action + duration only).	`prisma.service.ts`
R2	AWS SDK `{ maxAttempts: 3, retryMode: 'adaptive' }` on SES, SNS, S3, SecretsManager. PAY2M `getAccessToken` inline retry loop (3 attempts, exponential w/ jitter).	`sms.service.ts`, `email.service.ts`, `upload.service.ts`, `prisma.service.ts`, `payment.service.ts`
C1	New `ReferenceDataCacheService` (versioned-key INCR invalidation, fail-open). 5 catalog endpoints cached, 10 admin write-paths invalidate.	`reference-data-cache.service.ts`, `catalog.controller.ts`, `admin.service.ts`
L2	DB pool runbook (off-repo per RULE #3 — `*.md` gitignored).	`~/OneDrive/Desktop/DB_POOL_RUNBOOK.md`

AWS-side compatibility (verified via AWS MCP before any code change)

CloudWatch metric filter	Pattern	Survives pino?
`JadwalClientErrorCount` → `Jadwal/Web ClientErrorCount`	`"event":"CLIENT_ERROR"`	✅ — emission refactored to object-form so the top-level field survives
`jadwal-redis-throttler-failed` → `Jadwal/API RedisThrottlerFailures`	`"Redis throttler failed"` substring	✅ — pino preserves substring inside `msg`

No subscription filters → no third-party log shipping breaks. No AWS infra changes needed.

Items deliberately out of scope (future PRs)

R3 Idempotency on payment initiate — money path, needs dedicated plan-mode PR
R4 Outbox pattern — needs new Prisma model
R1 Circuit breakers — needs measured baselines first
O4 full refactor — remaining ~102 string-form logger.X() calls will be converted opportunistically as files are touched
O3 Sentry spans — Sentry wired but inactive (no SENTRY_DSN); revisit if turned on

Test plan

npx tsc --noEmit clean
npx eslint on all 17 changed source files — no new warnings (only pre-existing "unused eslint-disable" entries)
717 / 717 unit tests pass (up from 701 baseline — 16 net new tests)
- New: client-error.controller.spec.ts pins the CloudWatch-filter contract
- New: reference-data-cache.service.spec.ts — cache-aside + invalidation + fail-open + key safety + disabled-mode
- Updated: prisma.service.spec.ts mock provides $extends
- Updated: 5 admin specs add ReferenceDataCacheService provider
- Updated: email-service.spec.ts asserts new SES retry config
Integration suite (CI gates this)
CI: lint + type-check + CodeQL + Semgrep + Trivy + Dependency Audit + Secret Detection

Post-deploy verification (AWS-side)

CloudWatch Logs Insights on /ecs/jadwal-api: filter event = "CLIENT_ERROR" — expect hits after a deliberate test client-error post
Jadwal/Web ClientErrorCount metric increments within 5 min on a triggered frontend error
aws logs describe-metric-filters --log-group-name /ecs/jadwal-api — both filters still present
filter event = "SLOW_QUERY" | stats count() by model, action — confirm slow queries surface
Jadwal/API RedisThrottlerFailures alarm unchanged

Risk

Low-medium. Logger swap touches every log line but pino's string-arg handling is well-understood. The 2 refactored emission sites are the real risk; the new unit test pins them. AWS SDK retry config is the AWS-recommended setting. p-retry replaced with a ~30-line inline loop (no new npm dep). Redis cache writes are fail-soft. Rollback = git revert — no infra state touched.

Operator note: CloudWatch Logs Insights queries written before this PR that depend on the legacy NestJS string log format may need rewriting against the new JSON shape.

Five Section-2 audit items batched in one PR. All low-risk: no schema migrations, no API contract changes, no money path. AWS-side compatibility verified via aws logs describe-metric-filters before touching anything. O4 (minimum-viable, CloudWatch-only): - Install nestjs-pino + pino-http + pino-pretty (dev). - LoggerModule.forRoot wired in app.module.ts with prod/dev transport, redact paths for secrets, customLogLevel, and genReqId that respects inbound x-request-id. - main.ts: bufferLogs: true → app.useLogger(app.get(PinoLogger)). - Two CRITICAL emission sites refactored from logger.X(JSON.stringify) to object-form: client-error.controller.ts (CloudWatch metric filter JadwalClientErrorCount depends on `"event":"CLIENT_ERROR"` at top level — string form would escape inside pino's msg field and silently flatline the metric) and request-logger.middleware.ts. - Remaining ~102 string-form logger calls keep working unchanged (pino wraps the string in `{msg:"..."}`); convert opportunistically as files are touched. O2: - Prisma slow-query observability via $extends (Prisma 7 removed $use). 500 ms threshold; emits `event: 'SLOW_QUERY'` with model + action + durationMs only. PII-safe by construction — never logs args. R2 (retry on external calls): - { maxAttempts: 3, retryMode: 'adaptive' } on all 4 AWS SDK v3 clients: SES, SNS, S3, SecretsManager. Adaptive retry only fires on retryable error types (network errors pre-request, throttling) — no risk of double-publish on application errors. - PAY2M getAccessToken wrapped in inline retry loop (3 attempts, exponential backoff with jitter; 4xx aborts immediately, 5xx + 429 retry). Inline rather than p-retry dep because p-retry v5+ is ESM-only and project tsconfig is module:commonjs. C1 (Redis reference-data cache): - New ReferenceDataCacheService modeled on AvailabilityCacheService: versioned-key INCR invalidation, fail-open on Redis errors, 256 KiB payload cap, configurable TTL (default 3600s, bounded [60, 86400]). - Registered in RedisModule (Global). - Wired into 5 catalog endpoints: countries, categories, all-cities, cities-by-country, platform-info. - 10 invalidation hooks in admin.service: country/category/city CRUD + updatePlatformSettings. Fire-and-forget — cache failure must NEVER fail the admin write. L2: - DB pool runbook written off-repo at C:\Users\loaek\OneDrive\Desktop\DB_POOL_RUNBOOK.md (per RULE #3 — *.md gitignored). Tests: - New client-error.controller.spec.ts pins the CloudWatch metric-filter compat contract (logger.error MUST receive object, not string). - New reference-data-cache.service.spec.ts covers cache-aside, versioned invalidation, fail-open, key safety. - prisma.service.spec.ts mock now provides $extends (returns self). - admin.service + admin-* specs add ReferenceDataCacheService provider. - email-service.spec.ts SES constructor assertion updated to expect maxAttempts + retryMode. - 717 unit tests pass (up from 701 — 16 net new). Type-check clean. Lint warnings unchanged (only pre-existing "unused eslint-disable directive" entries in upload/email/payment services). Items deliberately out of scope (future PRs): R1 circuit breakers, R3 idempotency on payment initiate, R4 outbox pattern, O4 full refactor of remaining 102 string-form logger calls, O3 Sentry spans (Sentry inactive in prod).

coderabbitai · 2026-05-11T12:48:59Z

Warning

Rate limit exceeded

@kashkoool has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 25 minutes and 42 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6b5dd6f7-e50f-4005-9657-226ccca6263d

📥 Commits

Reviewing files that changed from the base of the PR and between 124dcc3 and 9f4fc4b.

⛔ Files ignored due to path filters (1)

package-lock.json is excluded by !**/package-lock.json, !package-lock.json, !**/package-lock.json

📒 Files selected for processing (27)

apps/api/.env.example
apps/api/package.json
apps/api/src/admin/admin.service.ts
apps/api/src/app.module.ts
apps/api/src/catalog/catalog.controller.ts
apps/api/src/common/controllers/client-error.controller.ts
apps/api/src/common/middleware/request-logger.middleware.ts
apps/api/src/common/services/upload.service.ts
apps/api/src/email/email.service.ts
apps/api/src/main.ts
apps/api/src/payment/payment.service.ts
apps/api/src/prisma/prisma.service.ts
apps/api/src/redis/redis.module.ts
apps/api/src/redis/reference-data-cache.service.ts
apps/api/src/sms/sms.service.ts
apps/api/src/vendor/vendor.controller.ts
apps/api/test/mocks/bookings-deps.mock.ts
apps/api/test/unit/admin-dashboard-charts.spec.ts
apps/api/test/unit/admin-mark-payout-unpaid.spec.ts
apps/api/test/unit/admin-payouts-inflight.spec.ts
apps/api/test/unit/admin-revert-payout-request.spec.ts
apps/api/test/unit/admin.service.spec.ts
apps/api/test/unit/client-error.controller.spec.ts
apps/api/test/unit/email-service.spec.ts
apps/api/test/unit/prisma.service.spec.ts
apps/api/test/unit/reference-data-cache.service.spec.ts
infra/ecs/api-task.json

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch chore/api-section2-hardening

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

@roles

…tivity images Two small follow-ups during the section-2 audit: 1. ECS task def — add `REFERENCE_CACHE_ENABLED` + `REFERENCE_CACHE_TTL_SEC` to the secrets array. Both SSM parameters are already created (REFERENCE_CACHE_ENABLED=true, REFERENCE_CACHE_TTL_SEC=3600). Without this, prod runs on the env-fallback defaults (functionally identical today) but the operator can't flip them without a code deploy. 2. POST /vendor/upload-image — admin gets 403 because VendorController is class-level @roles('VENDOR'). Admin moderation legitimately needs to upload images on behalf of vendors when editing activities. Method- level @roles('VENDOR', 'ADMIN') overrides the class-level via RolesGuard.getAllAndOverride(). No privilege-escalation risk: the endpoint only returns the S3 URL; setting it on an activity still requires the admin-gated /admin/activities/:id PATCH. Frontend was already calling this endpoint from the admin panel; the gate was the only thing blocking it.

…ddleware) Found during the post-Section-2 conflict audit. `pino-http` defaults `autoLogging` to `true` — it auto-emits one log line per HTTP response on top of our existing `RequestLoggerMiddleware` which also emits a structured HTTP_REQUEST log. Result: 2× log volume + CloudWatch ingest cost per request. Disable pino-http's auto-emit, keep our middleware as the sole HTTP request logger (it has project-specific behavior — skip-path-prefixes, masked IP in prod, userId resolution — that the default pino-http emit doesn't carry). Single-line config flip + clarifying comment. 717 unit tests still green; no other code paths affected.

Recommendation B from the post-Section-2 AWS-wiring audit. The /jadwal/prod/LOG_LEVEL SSM param has existed for a while and is already injected into the container via the api-task.json secrets array (line 46), but pino was hardcoded to NODE_ENV-based defaults instead of reading it. Operator can now flip the pino level for a debugging window (`debug` / `trace`) without a code deploy — change the SSM value + force-new-deployment on the ECS service. Current SSM value is already 'info' (production-appropriate). Defensive: pino's accepted levels are {fatal, error, warn, info, debug, trace, silent} — different from NestJS's legacy {log, error, warn, debug, verbose}. An invalid / typo'd / NestJS-style SSM value silently falls back to the NODE_ENV default rather than crashing pino at init.

Recommendation D from the AWS-wiring audit. The AvailabilityCacheService has read AVAILABILITY_CACHE_ENABLED + AVAILABILITY_CACHE_TTL_SEC for months but the env vars weren't surfaced to the container — operator could only tune the cache by editing code. Created the two SSM params (`/jadwal/prod/AVAILABILITY_CACHE_ENABLED=true`, `..._TTL_SEC=180`) matching the existing code defaults, then added them to the api-task secrets array so the running task reads them. Functionally identical to today (values match code fallbacks). Adds the operational lever to flip / retune without a redeploy — same pattern as REFERENCE_CACHE_* added in this PR.

kashkoool added 4 commits May 11, 2026 15:49

kashkoool merged commit d4e2228 into main May 11, 2026
22 checks passed

kashkoool deleted the chore/api-section2-hardening branch May 11, 2026 13:26

kashkoool mentioned this pull request May 11, 2026

fix(api): pino transport allowlist — only enable pino-pretty in development #204

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(api): section 2 hardening — pino, slow-query, retries, ref-cache#203

chore(api): section 2 hardening — pino, slow-query, retries, ref-cache#203
kashkoool merged 5 commits into
mainfrom
chore/api-section2-hardening

kashkoool commented May 11, 2026

Uh oh!

coderabbitai Bot commented May 11, 2026 •

edited

Loading

Rate limit exceeded

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kashkoool commented May 11, 2026

Summary

AWS-side compatibility (verified via AWS MCP before any code change)

Items deliberately out of scope (future PRs)

Test plan

Post-deploy verification (AWS-side)

Risk

Uh oh!

coderabbitai Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented May 11, 2026 •

edited

Loading