chore(api): section 2 hardening — pino, slow-query, retries, ref-cache#203
Conversation
Five Section-2 audit items batched in one PR. All low-risk: no schema
migrations, no API contract changes, no money path. AWS-side
compatibility verified via aws logs describe-metric-filters before
touching anything.
O4 (minimum-viable, CloudWatch-only):
- Install nestjs-pino + pino-http + pino-pretty (dev).
- LoggerModule.forRoot wired in app.module.ts with prod/dev transport,
redact paths for secrets, customLogLevel, and genReqId that respects
inbound x-request-id.
- main.ts: bufferLogs: true → app.useLogger(app.get(PinoLogger)).
- Two CRITICAL emission sites refactored from logger.X(JSON.stringify)
to object-form: client-error.controller.ts (CloudWatch metric filter
JadwalClientErrorCount depends on `"event":"CLIENT_ERROR"` at top
level — string form would escape inside pino's msg field and silently
flatline the metric) and request-logger.middleware.ts.
- Remaining ~102 string-form logger calls keep working unchanged
(pino wraps the string in `{msg:"..."}`); convert opportunistically
as files are touched.
O2:
- Prisma slow-query observability via $extends (Prisma 7 removed $use).
500 ms threshold; emits `event: 'SLOW_QUERY'` with model + action +
durationMs only. PII-safe by construction — never logs args.
R2 (retry on external calls):
- { maxAttempts: 3, retryMode: 'adaptive' } on all 4 AWS SDK v3
clients: SES, SNS, S3, SecretsManager. Adaptive retry only fires on
retryable error types (network errors pre-request, throttling) — no
risk of double-publish on application errors.
- PAY2M getAccessToken wrapped in inline retry loop (3 attempts,
exponential backoff with jitter; 4xx aborts immediately, 5xx + 429
retry). Inline rather than p-retry dep because p-retry v5+ is
ESM-only and project tsconfig is module:commonjs.
C1 (Redis reference-data cache):
- New ReferenceDataCacheService modeled on AvailabilityCacheService:
versioned-key INCR invalidation, fail-open on Redis errors, 256 KiB
payload cap, configurable TTL (default 3600s, bounded [60, 86400]).
- Registered in RedisModule (Global).
- Wired into 5 catalog endpoints: countries, categories, all-cities,
cities-by-country, platform-info.
- 10 invalidation hooks in admin.service: country/category/city CRUD
+ updatePlatformSettings. Fire-and-forget — cache failure must
NEVER fail the admin write.
L2:
- DB pool runbook written off-repo at
C:\Users\loaek\OneDrive\Desktop\DB_POOL_RUNBOOK.md (per RULE #3 —
*.md gitignored).
Tests:
- New client-error.controller.spec.ts pins the CloudWatch
metric-filter compat contract (logger.error MUST receive object,
not string).
- New reference-data-cache.service.spec.ts covers cache-aside,
versioned invalidation, fail-open, key safety.
- prisma.service.spec.ts mock now provides $extends (returns self).
- admin.service + admin-* specs add ReferenceDataCacheService
provider.
- email-service.spec.ts SES constructor assertion updated to expect
maxAttempts + retryMode.
- 717 unit tests pass (up from 701 — 16 net new).
Type-check clean. Lint warnings unchanged (only pre-existing "unused
eslint-disable directive" entries in upload/email/payment services).
Items deliberately out of scope (future PRs): R1 circuit breakers,
R3 idempotency on payment initiate, R4 outbox pattern, O4 full
refactor of remaining 102 string-form logger calls, O3 Sentry spans
(Sentry inactive in prod).
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (27)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…tivity images Two small follow-ups during the section-2 audit: 1. ECS task def — add `REFERENCE_CACHE_ENABLED` + `REFERENCE_CACHE_TTL_SEC` to the secrets array. Both SSM parameters are already created (REFERENCE_CACHE_ENABLED=true, REFERENCE_CACHE_TTL_SEC=3600). Without this, prod runs on the env-fallback defaults (functionally identical today) but the operator can't flip them without a code deploy. 2. POST /vendor/upload-image — admin gets 403 because VendorController is class-level @roles('VENDOR'). Admin moderation legitimately needs to upload images on behalf of vendors when editing activities. Method- level @roles('VENDOR', 'ADMIN') overrides the class-level via RolesGuard.getAllAndOverride(). No privilege-escalation risk: the endpoint only returns the S3 URL; setting it on an activity still requires the admin-gated /admin/activities/:id PATCH. Frontend was already calling this endpoint from the admin panel; the gate was the only thing blocking it.
…ddleware) Found during the post-Section-2 conflict audit. `pino-http` defaults `autoLogging` to `true` — it auto-emits one log line per HTTP response on top of our existing `RequestLoggerMiddleware` which also emits a structured HTTP_REQUEST log. Result: 2× log volume + CloudWatch ingest cost per request. Disable pino-http's auto-emit, keep our middleware as the sole HTTP request logger (it has project-specific behavior — skip-path-prefixes, masked IP in prod, userId resolution — that the default pino-http emit doesn't carry). Single-line config flip + clarifying comment. 717 unit tests still green; no other code paths affected.
Recommendation B from the post-Section-2 AWS-wiring audit. The
/jadwal/prod/LOG_LEVEL SSM param has existed for a while and is
already injected into the container via the api-task.json secrets
array (line 46), but pino was hardcoded to NODE_ENV-based defaults
instead of reading it.
Operator can now flip the pino level for a debugging window
(`debug` / `trace`) without a code deploy — change the SSM value +
force-new-deployment on the ECS service. Current SSM value is
already 'info' (production-appropriate).
Defensive: pino's accepted levels are
{fatal, error, warn, info, debug, trace, silent} — different from
NestJS's legacy {log, error, warn, debug, verbose}. An invalid /
typo'd / NestJS-style SSM value silently falls back to the
NODE_ENV default rather than crashing pino at init.
Recommendation D from the AWS-wiring audit. The AvailabilityCacheService has read AVAILABILITY_CACHE_ENABLED + AVAILABILITY_CACHE_TTL_SEC for months but the env vars weren't surfaced to the container — operator could only tune the cache by editing code. Created the two SSM params (`/jadwal/prod/AVAILABILITY_CACHE_ENABLED=true`, `..._TTL_SEC=180`) matching the existing code defaults, then added them to the api-task secrets array so the running task reads them. Functionally identical to today (values match code fallbacks). Adds the operational lever to flip / retune without a redeploy — same pattern as REFERENCE_CACHE_* added in this PR.
Summary
Section 2 of the perf/reliability audit — 5 items batched into one PR. All low-risk: no schema migrations, no API contract changes, no money paths. Section 1 already merged in #201.
nestjs-pinowired at bootstrap → JSON logs in CloudWatch Logs Insights. 2 CloudWatch-critical emission sites refactored to object-form.app.module.ts,main.ts,client-error.controller.ts,request-logger.middleware.ts$extends(Prisma 7 removed$use). 500 ms threshold; PII-safe (model + action + duration only).prisma.service.ts{ maxAttempts: 3, retryMode: 'adaptive' }on SES, SNS, S3, SecretsManager. PAY2MgetAccessTokeninline retry loop (3 attempts, exponential w/ jitter).sms.service.ts,email.service.ts,upload.service.ts,prisma.service.ts,payment.service.tsReferenceDataCacheService(versioned-key INCR invalidation, fail-open). 5 catalog endpoints cached, 10 admin write-paths invalidate.reference-data-cache.service.ts,catalog.controller.ts,admin.service.ts*.mdgitignored).~/OneDrive/Desktop/DB_POOL_RUNBOOK.mdAWS-side compatibility (verified via AWS MCP before any code change)
JadwalClientErrorCount→Jadwal/Web ClientErrorCount"event":"CLIENT_ERROR"jadwal-redis-throttler-failed→Jadwal/API RedisThrottlerFailures"Redis throttler failed"substringmsgNo subscription filters → no third-party log shipping breaks. No AWS infra changes needed.
Items deliberately out of scope (future PRs)
logger.X()calls will be converted opportunistically as files are touchedSENTRY_DSN); revisit if turned onTest plan
npx tsc --noEmitcleannpx eslinton all 17 changed source files — no new warnings (only pre-existing "unused eslint-disable" entries)client-error.controller.spec.tspins the CloudWatch-filter contractreference-data-cache.service.spec.ts— cache-aside + invalidation + fail-open + key safety + disabled-modeprisma.service.spec.tsmock provides$extendsReferenceDataCacheServiceprovideremail-service.spec.tsasserts new SES retry configPost-deploy verification (AWS-side)
/ecs/jadwal-api:filter event = "CLIENT_ERROR"— expect hits after a deliberate test client-error postJadwal/Web ClientErrorCountmetric increments within 5 min on a triggered frontend erroraws logs describe-metric-filters --log-group-name /ecs/jadwal-api— both filters still presentfilter event = "SLOW_QUERY" | stats count() by model, action— confirm slow queries surfaceJadwal/API RedisThrottlerFailuresalarm unchangedRisk
Low-medium. Logger swap touches every log line but pino's string-arg handling is well-understood. The 2 refactored emission sites are the real risk; the new unit test pins them. AWS SDK retry config is the AWS-recommended setting. p-retry replaced with a ~30-line inline loop (no new npm dep). Redis cache writes are fail-soft. Rollback =
git revert— no infra state touched.Operator note: CloudWatch Logs Insights queries written before this PR that depend on the legacy NestJS string log format may need rewriting against the new JSON shape.