
feat(slack): add Slack integration with @mention-based task submission#42

Closed
isadeks wants to merge 1 commit into aws-samples:main from isadeks:main

Conversation


@isadeks isadeks commented Apr 20, 2026

Summary

  • Adds full Slack integration: submit tasks via @Shoof mentions or DMs, receive threaded notifications with emoji reaction progress
  • OAuth multi-workspace install flow with bgagent slack setup CLI wizard for zero-friction onboarding
  • Account linking via /bgagent link slash command
  • Cancel button with instant feedback, session message cleanup on completion

What's included

CDK Constructs: SlackIntegration, SlackInstallationTable, SlackUserMappingTable

Lambda Handlers: slack-events, slack-commands, slack-command-processor, slack-interactions, slack-oauth-callback, slack-link, slack-notify

Shared Utilities: slack-verify (HMAC signature verification), slack-blocks (Block Kit renderer)

CLI: bgagent slack setup (interactive wizard), bgagent slack link <code> (account linking)

Docs: SLACK_SETUP_GUIDE.md with screenshots, Developer Guide updated

UX Flow

  1. User mentions @Shoof fix the bug in org/repo#42
  2. 👀 reaction appears instantly
  3. Task created → ⏳ reaction, threaded notification with Cancel button
  4. Agent completes → ✅ reaction, "Task completed" with View PR button, session message deleted

Other changes (non-Slack)

  • Blueprint repo via context/env var: cdk deploy -c blueprintRepo=org/repo or BLUEPRINT_REPO=org/repo — no longer requires editing agent.ts directly. Developer Guide updated with the new method.
  • X-Ray tracing commented out: Requires account-level UpdateTraceSegmentDestination setup that blocks first-time deploys. Disabled to unblock new users; re-enable once the prerequisite is documented more clearly.

Test plan

  • bgagent slack setup — full flow from scratch (deploy, create app, credentials, install, link)
  • @Shoof mention in channel — happy path with reactions and threaded notifications
  • Cancel button mid-task — instant feedback and cleanup
  • Error cases: no repo, repo not onboarded, unlinked user
  • DM to Shoof — private task submission
  • /bgagent help and /bgagent link
  • Multi-workspace: second workspace installs via OAuth URL
  • -c blueprintRepo=org/repo deploys with custom repo onboarded

@isadeks isadeks force-pushed the main branch 3 times, most recently from aef74c1 to 833c770 on April 20, 2026 at 17:03
Adds full Slack integration enabling users to submit coding tasks by
mentioning @shoof in any channel or DM, with real-time emoji reactions
and threaded notifications showing task progress.

Key features:
- @mention task submission with natural language repo extraction
- Emoji reaction progression: 👀 → ⌛ → ✅
- Threaded notifications (created, started, completed/failed)
- Cancel button with instant feedback
- DM support for private task submissions
- OAuth multi-workspace install flow
- `bgagent slack setup` CLI wizard for zero-friction onboarding
- Account linking via /bgagent link

New files:
- CDK constructs: SlackIntegration, SlackInstallationTable, SlackUserMappingTable
- Lambda handlers: events, commands, command-processor, interactions,
  oauth-callback, link, notify
- Shared utilities: slack-verify, slack-blocks
- CLI: slack.ts (setup, link commands)
- Docs: SLACK_SETUP_GUIDE.md with screenshots
@isadeks isadeks marked this pull request as ready for review April 20, 2026 17:31
@isadeks isadeks requested a review from a team as a code owner April 20, 2026 17:31

krokoko commented Apr 23, 2026

Thank you @isadeks! Great addition.

From a first pass review:

What's Done Well

  • HMAC verification (slack-verify.ts): timingSafeEqual, replay protection, proper error logging
  • Secret upsert lifecycle (slack-oauth-callback.ts): correctly handles create/update/restore with specific exception matching
  • CDK construct design: clean separation, proper cdk-nag suppressions, least-privilege IAM (with minor exceptions noted)
  • UX design: reaction progression, threaded notifications, DM support, cancel confirmation dialog, message cleanup on completion
  • Blueprint repo configurability: clean env ?? context ?? default pattern
  • Shared utility test quality: slack-verify.test.ts covers valid/invalid signatures, stale timestamps, length mismatches; slack-blocks.test.ts covers all event types
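
For readers unfamiliar with Slack's signing scheme, the verification praised above can be sketched roughly as follows (HMAC-SHA256 over `v0:<timestamp>:<body>`, constant-time comparison, replay window). All names here are illustrative, not the actual slack-verify.ts exports:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Replay window: reject requests whose timestamp is too far from "now".
const REPLAY_WINDOW_S = 60 * 5;

function verifySlackSignature(
  signingSecret: string,
  rawBody: string,
  timestamp: string,
  signature: string, // e.g. "v0=abc123..."
  nowS: number = Math.floor(Date.now() / 1000),
): boolean {
  // Stale timestamp: treat as a possible replay.
  if (Math.abs(nowS - Number(timestamp)) > REPLAY_WINDOW_S) return false;

  const base = `v0:${timestamp}:${rawBody}`;
  const expected =
    "v0=" + createHmac("sha256", signingSecret).update(base).digest("hex");

  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  // timingSafeEqual throws on length mismatch, so check lengths first.
  if (a.length !== b.length) return false;
  return timingSafeEqual(a, b);
}
```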

The inbound path (task creation) is well-architected. The existing createTaskCore() function is a genuinely clean seam — it accepts a TaskCreationContext with a channelSource tag and
an opaque channelMetadata: Record<string, string> bag, and never inspects their contents. The Slack handler correctly follows the same pattern as the API and webhook handlers:
authenticate in its own way, extract a userId, build metadata, delegate to the shared core. Adding 'slack' to the channelSource union is a one-line change that doesn't contaminate shared code.

My architectural concern: this PR hardwires Slack as the only notification consumer, bypassing the opportunity to build a proper event fan-out layer. Today on main, the TaskEventsTable is a pull-only audit log — no streams, no push notifications. When this PR adds stream: dynamodb.StreamViewType.NEW_IMAGE to the events table, that is a meaningful infrastructure change. But it then wires the Slack notify Lambda directly as the sole stream consumer, with Slack-specific filtering baked in. This creates two problems for adding additional channels later:

  1. Each new channel needs its own DynamoDB Stream consumer on the same table. DynamoDB Streams supports a maximum of 2 consumers per table (without Kinesis). By the third adapter, you
    hit a hard AWS limit.
  2. The filtering logic (channel_source === 'slack') is duplicated inside each consumer. Every adapter handler would independently load the task record, check the channel source, and decide whether to act. This scatters the "notification routing" concern across N handlers rather than centralizing it. I think there is an opportunity for a notification dispatcher pattern here (perhaps using EventBridge?). This gives you:
  • Single stream consumer — no DynamoDB Streams consumer limit concern
  • Centralized routing — the dispatcher loads the task once, resolves the channel, and routes
  • Decoupled adapters — each notifier receives a pre-filtered, normalized event; it only cares about rendering and posting
  • Easy addition — a new channel is: one Lambda + one EventBridge rule, no changes to the dispatcher
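
A rough sketch of what that dispatcher could look like. All names and event shapes here are hypothetical; the point is that a single stream consumer normalizes each task event and tags it with its channel, so per-channel EventBridge rules on `detail.channel` do the fan-out instead of N stream consumers:

```typescript
// Normalized task event as the dispatcher would see it after parsing a
// DynamoDB Stream record (shape is illustrative).
type TaskEvent = { taskId: string; status: string; channelSource: string };

// Envelope the dispatcher would hand to EventBridge PutEvents.
type OutboundNotification = {
  source: "bgagent.tasks";         // EventBridge `source` field
  detailType: "TaskStatusChanged"; // EventBridge `detail-type`
  detail: { channel: string; taskId: string; status: string };
};

function routeTaskEvent(e: TaskEvent): OutboundNotification {
  return {
    source: "bgagent.tasks",
    detailType: "TaskStatusChanged",
    detail: { channel: e.channelSource, taskId: e.taskId, status: e.status },
  };
}
// A single stream consumer calls PutEvents with this envelope; adding a new
// channel is then one rule (detail.channel == "linear") plus one Lambda, with
// no change to the dispatcher.
```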

The current slack-notify.ts (333 lines) mixes three concerns that should be separated: Event routing, Message rendering, and Delivery + state management. If you separate the routing concern, each future adapter only needs to implement rendering + delivery. The slack-notify.ts handler currently does all three, which means the next adapter would duplicate the routing and deduplication logic.

Also, the opaque bag pattern (channel_metadata: Record<string, string>) works because each adapter writes and reads its own keys. But it means the task record accumulates state for the active channel's notification lifecycle (message timestamps, thread context). This couples the task record to the notification system. A cleaner approach would store notification state in a separate table or in the channel adapter's own state, not in the task's channel_metadata.

I don't want to block the PR on the notification architecture alone, since this is the first channel we are adding and premature abstraction is a real risk. What I would like to see in the current PR:

  1. Add a // TODO: Extract to notification dispatcher when adding a second channel comment on the DynamoDB Stream wiring in slack-integration.ts, making the architectural intent explicit.
  2. Move the deduplication logic out of slack-notify.ts into a shared utility now. The slack_notified_terminal conditional update pattern is channel-agnostic and will be needed by
    every adapter. Rename it to something generic like channel_notified_terminal.
  3. Don't store notification lifecycle state in channel_metadata — or at least document that this is a conscious tradeoff for V1. The slack_session_msg_ts, slack_created_msg_ts fields are Slack-specific delivery state, not channel context. A separate "notification state" map or table would be cleaner.
  4. Extract the channelSource type into a shared ChannelSource type in types.ts used by both TaskCreationContext and TaskRecord. This is a 3-line change that makes adding channels a type-safe, grep-able operation.
  5. Document the adapter contract — even informally. Future engineers adding a new adapter need to know: "implement inbound auth + call createTaskCore(), implement outbound rendering + delivery, wire a CDK construct." The PR adds 4,100 lines but no ADR or design doc explaining the pattern for the next person.

Some issues captured in the current code; please revisit once the recommendations above are implemented to see whether they are still relevant:

    1. Dead code: handleStatus and handleCancel never called

    File: cdk/src/handlers/slack-command-processor.ts

    The functions handleStatus (~40 lines) and handleCancel (~55 lines) are fully implemented but never wired into the switch(subcommand) in the handler function. Users running /bgagent
    status or /bgagent cancel get the default fallback message instead. Either add the case 'status': and case 'cancel': branches, or remove the dead code.

    2. url_verification challenge bypasses signature verification

    File: cdk/src/handlers/slack-events.ts

    The handler returns the challenge before verifying the Slack signing secret. Slack does sign url_verification requests, so verification should be attempted first. Only skip it if the
    signing secret is not yet populated (initial setup). As-is, an unauthenticated caller can confirm the endpoint is a Slack event handler.

    3. Retry dedup drops ALL retried events unconditionally

    File: cdk/src/handlers/slack-events.ts

    Any request with X-Slack-Retry-Num is immediately acked with 200 — no event-type discrimination. If the first delivery of app_uninstalled or tokens_revoked fails transiently, the
    retry is silently dropped and the revocation is permanently lost (bot token stays active). At minimum, allow security-critical event retries through — they are idempotent.
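
One possible shape for that policy (the event names are the ones mentioned above; the helper name is illustrative):

```typescript
// Events whose retries must be reprocessed: dropping them permanently loses
// a revocation and leaves a bot token live. They are idempotent, so
// reprocessing is safe.
const SECURITY_CRITICAL_EVENTS = new Set(["app_uninstalled", "tokens_revoked"]);

function shouldProcessRetry(
  retryNumHeader: string | undefined, // X-Slack-Retry-Num, absent on first delivery
  eventType: string,
): boolean {
  if (retryNumHeader === undefined) return true;  // first delivery: always process
  return SECURITY_CRITICAL_EVENTS.has(eventType); // retries: only critical events
}
```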

    4. DynamoDB Streams handler never fails the batch

    File: cdk/src/handlers/slack-notify.ts

    Every error is caught per-record and logged as warn, but never thrown. The construct configures retryAttempts: 3 and bisectBatchOnError: true, but these are useless because Lambda
    never sees a failure. Infrastructure errors (DynamoDB throttling, Secrets Manager outage) should be rethrown so Lambda retries the batch; only Slack API errors should be swallowed.
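
A sketch of the suggested split, assuming a hypothetical SlackApiError class standing in for however the handler tags Slack API failures:

```typescript
// Replaying a stream batch cannot fix a Slack API rejection, so those are
// logged and swallowed; infrastructure errors are rethrown so Lambda's
// retryAttempts / bisectBatchOnError actually engage.
class SlackApiError extends Error {}

function handleRecordError(err: unknown, warn: (msg: string) => void): void {
  if (err instanceof SlackApiError) {
    warn(`Slack API error, not retryable via stream replay: ${err.message}`);
    return; // swallow
  }
  throw err; // DynamoDB throttling, Secrets Manager outage, etc.: retry batch
}
```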

    5. OAuth callback missing state parameter (CSRF)

    File: cdk/src/handlers/slack-oauth-callback.ts

    The OAuth flow lacks a state parameter, the standard CSRF countermeasure (RFC 6749 §10.12). Practical risk is mitigated since client_secret is required for code exchange, but this is
    a well-known OAuth best practice. At minimum, document the tradeoff explicitly.

    6. revokeInstallation deletes secret even after DynamoDB failure

    File: cdk/src/handlers/slack-events.ts

    If the DynamoDB update to mark the installation as revoked fails, execution falls through to delete the bot token secret anyway. Result: installation record still shows status:
    'active' but the token is gone — every subsequent Slack API call fails with "secret not found."

    7. Six .catch(() => {}) calls with zero logging

    Files: slack-events.ts, slack-command-processor.ts

    Six Slack API calls (reactions.add/remove, chat.postMessage) use .catch(() => {}). Network failures, revoked tokens, rate limiting — all silently disappear. The slack-notify.ts
    handlers already log properly in equivalent helpers. Replace with .catch((err) => logger.warn(...)).

    8. handleAppMention Lambda invocation failure gives no user feedback

    File: cdk/src/handlers/slack-events.ts

    When the async Lambda invocation fails, the error is logged but the user sees the 👀 reaction and then... nothing. No reaction swap to ❌, no error thread. The "no repo found"
    path in the same function already handles this — apply the same pattern.

    9. Secret cache causes total outage after rotation

    File: cdk/src/handlers/shared/slack-verify.ts

    The 5-minute in-memory cache has no invalidation. After the signing secret is rotated, all warm Lambda instances reject every request for up to 5 minutes. Implement verify-then-retry:
    if verification fails, clear the cache, re-fetch, and verify once more.
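
The verify-then-retry idea can be sketched with injected dependencies (everything here is illustrative, not the slack-verify.ts API):

```typescript
// On signature failure, drop the cached secret, re-fetch exactly once, and
// verify again. A rotated signing secret then costs at most one extra
// Secrets Manager call per warm instance, instead of a 5-minute outage.
async function verifyWithRotationRetry(
  request: { body: string; signature: string },
  cache: { secret?: string },
  fetchSecret: () => Promise<string>,
  verify: (secret: string, req: { body: string; signature: string }) => boolean,
): Promise<boolean> {
  cache.secret ??= await fetchSecret();
  if (verify(cache.secret, request)) return true;

  // Possible rotation: invalidate, re-fetch, verify once more.
  delete cache.secret;
  cache.secret = await fetchSecret();
  return verify(cache.secret, request);
}
```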

    10. checkChannelAccess fails open on all errors

    File: cdk/src/handlers/slack-command-processor.ts

    Returns { ok: true } when bot token is missing, Slack API returns unknown errors, or fetch throws. Tasks are created for channels where notifications can never be delivered, with no
    user warning.
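
If the handler later distinguishes hard failures from transient ones, the decision could look like this (the error strings are real Slack API error codes; the function itself is a hypothetical sketch):

```typescript
// Errors that definitively mean "the bot cannot post here" fail closed;
// transient or unknown Slack errors fail open so a Slack blip does not block
// task submission, with delivery failures surfaced downstream.
const HARD_FAILURES = new Set(["channel_not_found", "not_in_channel", "missing_scope"]);

function channelAccessDecision(
  slackError: string | null, // null = conversations.info succeeded
): { ok: boolean; warn?: string } {
  if (slackError === null) return { ok: true };
  if (HARD_FAILURES.has(slackError)) {
    return { ok: false, warn: `bot cannot post in this channel (${slackError})` };
  }
  // ratelimited, internal_error, network blips: fail open.
  return { ok: true, warn: `transient Slack error ignored (${slackError})` };
}
```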

    11. CLI getStackOutput empty catch hides auth/permission errors

    File: cli/src/commands/slack.ts

    Catches all CloudFormation errors and returns null. An expired session or missing IAM permissions shows "Stack has not been deployed yet" instead of the real error. Only catch
    ValidationError (stack not found); surface other errors.
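
The suggested narrowing, sketched (AWS SDK v3 service errors carry a `name` field; the helper is illustrative and shown synchronously for brevity):

```typescript
// Only CloudFormation's ValidationError ("Stack ... does not exist") means
// "not deployed yet". Anything else (expired session, missing IAM
// permissions) should bubble up to the user instead of being masked.
function stackOutputOrNull<T>(fn: () => T): T | null {
  try {
    return fn();
  } catch (err) {
    if (err instanceof Error && err.name === "ValidationError") return null;
    throw err; // auth/permission errors surface as themselves
  }
}
```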

    12. Coverage
      The shared utilities (slack-verify, slack-blocks) and CDK constructs have solid tests. However, all 7 Lambda handlers have zero unit tests — a significant departure from the existing
      codebase pattern where every handler has a dedicated test file.
    13. MentionPayload extends SlackCommandPayload creates misleading type contract

    File: cdk/src/handlers/slack-command-processor.ts

    Mentions inherit response_url: string (required) but always set it to ''. Five other fields are similarly phantom. The handler branches on source === 'mention' at runtime, but the types don't express this. Refactor into a discriminated union:

    ```typescript
    type CommandProcessorEvent = SlashCommandPayload | MentionPayload;
    // where each variant only has the fields it actually uses
    ```
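
Fleshed out, such a union might look like this (field names beyond `source` and `response_url` are assumptions for illustration):

```typescript
type SlashCommandPayload = {
  source: "slash";
  response_url: string; // only slash commands actually get one
  command: string;
  text: string;
};

type MentionPayload = {
  source: "mention";
  channel: string;
  thread_ts: string;
  text: string;
};

type CommandProcessorEvent = SlashCommandPayload | MentionPayload;

function describe(e: CommandProcessorEvent): string {
  // Narrowing on `source` gives compile-time access to only the real fields;
  // no phantom response_url on mentions.
  return e.source === "slash" ? `slash ${e.command}` : `mention in ${e.channel}`;
}
```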

    14. SlackBlock type requires as unknown as SlackBlock cast

    File: cdk/src/handlers/shared/slack-blocks.ts

    The actions() helper casts through unknown because SlackBlock doesn't model action blocks correctly. Button helpers return Record<string, unknown>. A discriminated union of
    SectionBlock | ActionsBlock with typed button elements would eliminate the cast and catch typos at compile time.
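
A sketch of that union (field names follow Slack's Block Kit JSON; the helper mirrors the `actions()` mentioned above but is otherwise illustrative):

```typescript
type ButtonElement = {
  type: "button";
  text: { type: "plain_text"; text: string };
  action_id: string;
  style?: "primary" | "danger";
};

type SectionBlock = { type: "section"; text: { type: "mrkdwn"; text: string } };
type ActionsBlock = { type: "actions"; elements: ButtonElement[] };
type SlackBlock = SectionBlock | ActionsBlock;

// With ActionsBlock modeled properly, no `as unknown as SlackBlock` cast is
// needed, and a typo in action_id or style fails at compile time.
function actions(...elements: ButtonElement[]): SlackBlock {
  return { type: "actions", elements };
}
```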

    15. channelSource union not used in TaskRecord

    File: cdk/src/handlers/shared/create-task-core.ts / types.ts

    TaskCreationContext.channelSource is 'api' | 'webhook' | 'slack' but TaskRecord.channel_source is plain string. Extract type ChannelSource = 'api' | 'webhook' | 'slack' and use it in
    both places.
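
The proposed extraction, sketched (the surrounding interfaces are abbreviated; only the union and the two fields come from the review):

```typescript
// Shared union: adding a channel becomes a type-safe, grep-able operation.
type ChannelSource = "api" | "webhook" | "slack";

interface TaskCreationContext {
  channelSource: ChannelSource;
  /* ...other fields elided... */
}

interface TaskRecord {
  channel_source: ChannelSource; // was: plain string
  /* ...other fields elided... */
}
```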

    16. Duplicate utilities in slack-blocks.ts and slack-command-processor.ts: truncate() and formatDuration() are identical in both files. Export them from shared.
    17. Dedup ordering in slack-notify.ts: the terminal dedup write runs before checking channel_source === 'slack', writing to non-Slack task records unnecessarily.
    18. Excess IAM grant in slack-integration.ts: slackCommandsFn gets readSlackSecretsPolicy (bot token access) but only needs the signing secret, which is already granted separately.
    19. Inconsistent Number() in slack-blocks.ts: taskTimedOutMessage calls formatDuration(task.duration_s) without Number(), unlike taskCompletedMessage.
    20. Empty safeJsonParse catch in slack-notify.ts: parse errors are swallowed with no logging — users see "Unknown error" instead of the actual failure reason.

Thank you! Also, @scoropeza, since you are working on the related task, could you please take a look to see if this is aligned with your work? Thank you!

scoropeza pushed a commit to scoropeza/sample-autonomous-cloud-coding-agents that referenced this pull request May 4, 2026
…triage

Hands-on testing against the deployed stack revealed two UX gaps in the
default ``bgagent status`` compact snapshot:

  1. A ``pr_review`` task renders indistinguishably from a ``new_task``,
     so a user checking status on a review submission has no default-
     view confirmation the task_type was set as intended.
  2. A FAILED / CANCELLED / TIMED_OUT task shows the terminal status
     but not *why*. Users had to either re-run with ``--wait`` (which
     blocks), switch to ``--output json``, or grep ``bgagent events``
     to recover the error classification that the API already returns.

Both fields are already present on the ``TaskDetail`` payload that
``formatStatusSnapshot`` receives — the previous template just did not
render them. This is pure render-layer plumbing; no new API calls, no
backend changes.

Changes in ``cli/src/format.ts::formatStatusSnapshot``:

  - Renders ``Type: pr_iteration (PR aws-samples#42)`` / ``Type: pr_review (PR aws-samples#7)``
    after ``Repo:`` when ``task_type !== 'new_task'``. Matches the
    ``formatTaskDetail`` treatment (``Type:`` line at format.ts:29-31)
    so the compact and detailed views agree. Omits the ``(PR #N)``
    suffix defensively if ``pr_number`` is null.
  - Renders ``Reason: <category>: <title>`` after ``Cost:`` for
    non-COMPLETED terminal statuses, preferring
    ``error_classification.{category, title}`` from the server's
    classifier and falling back to a trimmed ``error_message`` if no
    structured classification is available. Emits nothing when neither
    is populated, or when the task is still running or COMPLETED —
    so the line never appears as a dangling ``Reason:`` with no value.

Example (before → after):

  Before (FAILED task, user had to chase error in events):
    Task 01KQ… — FAILED (3m 12s total)
      Repo:          scoropeza/agent-plugins
      Turn:          8 / 20
      Last milestone: agent_execution_complete (2s ago)
      Current:       task failed
      Cost:          $0.18 / budget $0.05
      Last event:    2026-05-04T21:25:04Z

  After:
    Task 01KQ… — FAILED (3m 12s total)
      Repo:          scoropeza/agent-plugins
      Turn:          8 / 20
      Last milestone: agent_execution_complete (2s ago)
      Current:       task failed
      Cost:          $0.18 / budget $0.05
      Reason:        compute: Exceeded max budget
      Last event:    2026-05-04T21:25:04Z

Tests: +10 regression tests covering Type line for pr_iteration /
pr_review / new_task / missing-pr-number, Reason line for
classification / fallback-message / neither / CANCELLED / TIMED_OUT /
COMPLETED-never-emits / RUNNING-never-emits. CLI suite: 181 passing
(was 171).

Refs: PR aws-samples#52 hands-on testing feedback

krokoko commented May 6, 2026

Closing since #63 has new content

@krokoko krokoko closed this May 6, 2026
krokoko pushed a commit to tomjwxf/sample-autonomous-cloud-coding-agents that referenced this pull request May 7, 2026
…mples#63)

* feat(slack): add Slack integration with @mention-based task submission

Adds full Slack integration enabling users to submit coding tasks by
mentioning @shoof in any channel or DM, with real-time emoji reactions
and threaded notifications showing task progress.

Key features:
- @mention task submission with natural language repo extraction
- Emoji reaction progression: 👀 → ⌛ → ✅
- Threaded notifications (created, started, completed/failed)
- Cancel button with instant feedback
- DM support for private task submissions
- OAuth multi-workspace install flow
- `bgagent slack setup` CLI wizard for zero-friction onboarding
- Account linking via /bgagent link

New files:
- CDK constructs: SlackIntegration, SlackInstallationTable, SlackUserMappingTable
- Lambda handlers: events, commands, command-processor, interactions,
  oauth-callback, link, notify
- Shared utilities: slack-verify, slack-blocks
- CLI: slack.ts (setup, link commands)
- Docs: SLACK_SETUP_GUIDE.md with screenshots

* fix(docs): sync SLACK_SETUP_GUIDE into starlight site

Add SLACK_SETUP_GUIDE.md to the explicit route map in sync-starlight.mjs,
mirror it to using/Slack-setup-guide.md, and wire it into the astro sidebar.
Without this, links to the Slack setup guide rendered as /architecture/... and
the page itself never appeared on the site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(slack): address Alain's PR aws-samples#42 review feedback (items 1-20)

Full response to the 20-point review on PR aws-samples#42. Grouped by theme below
rather than strictly by item number so related edits stay together.

**Signature + auth (items 2, 3, 9)**

- Item 2: url_verification challenge used to be answered before signature
  verification, exposing the endpoint as a Slack handler without auth. Now
  every request is verified first. During initial setup — before the
  signing secret is populated — url_verification is still answered so the
  Slack App can be wired up, but every other request is rejected.
- Item 3: previously any request with X-Slack-Retry-Num was acked with 200
  immediately. Transient failures on app_uninstalled or tokens_revoked
  silently dropped the revocation and left the bot token live forever.
  Security-critical event retries now reprocess; UX events still short-
  circuit on retry.
- Item 9: slack-verify's 5-minute secret cache had no invalidation. After
  a signing-secret rotation, warm Lambdas rejected every request for up
  to 5 minutes. New verifySlackRequest helper does verify-then-retry with
  forced refresh; invalidateSlackSecretCache exported for explicit eviction.
  All three authed handlers (events, commands, interactions) go through
  verifySlackRequest.

**Error surfacing (items 4, 7, 20)**

- Item 4: slack-notify caught every error per-record and logged warn, which
  made retryAttempts and bisectBatchOnError useless. Now distinguishes
  SlackApiError (swallow — not retryable via stream replay) from infra
  errors (rethrow so Lambda retries the batch).
- Item 7: six .catch(() => {}) swallows in slack-events and
  slack-command-processor replaced with a shared slackFetch helper in
  shared/slack-api.ts that logs at warn. Benign Slack errors
  (already_reacted, no_reaction) are still silenced as expected.
- Item 20: safeJsonParse in slack-notify logs the parse error and the
  truncated payload preview instead of swallowing silently.

**Revocation ordering (item 6)**

- revokeInstallation previously deleted the bot-token secret even if the
  DDB update to mark the installation revoked had failed, leaving DB
  status "active" with no backing token. Now the secret delete runs only
  if the DDB update succeeded; both steps throw on failure so Slack
  retries the event end-to-end.

**UX fallbacks (items 8, 10)**

- Item 8: when the command-processor Lambda invocation fails from
  handleAppMention, swap 👀 → ❌ and reply in-thread, mirroring the
  no-repo-found path. Previously the user was left with a stuck 👀.
- Item 10: checkChannelAccess fails closed on definitively-blocking Slack
  errors (channel_not_found, not_in_channel, missing_scope). Transient
  or unknown errors (ratelimited, internal_error, network blips) fail
  open so a Slack outage doesn't block task submission; slack-notify
  surfaces any real delivery failure downstream.

**Dedup ordering (item 17)**

- Moved the channel_source === 'slack' check before the terminal dedup
  write so we don't poke non-Slack task records' channel_metadata with a
  Slack-specific marker.

**Dead code + type safety (items 1, 13, 14)**

- Item 1: delete unused handleStatus / handleCancel / statusEmoji from
  slack-command-processor. They were fully implemented but never wired
  into the switch(subcommand). Sam's PR aws-samples#52 roadmap delivers `bgagent
  status` via the CLI; the Slack Cancel button already exists.
- Item 13: MentionPayload used to extend SlackCommandPayload and set
  phantom fields (`response_url: ''`, etc.). Refactored into a discriminated
  union `CommandProcessorEvent = SlashCommandEvent | MentionEvent`.
  slack-events.ts constructs MentionEvent directly; handleSubmit takes
  MentionEvent only.
- Item 14: SlackBlock was a single interface requiring
  `as unknown as SlackBlock` to construct actions blocks. Now
  SectionBlock | ActionsBlock discriminated union with typed button
  variants. Casts removed; tests updated with narrowing helpers.

**Shared utilities (items 15, 16, 19)**

- Items 15/16: extract ChannelSource union from shared/types.ts and use
  it on both TaskCreationContext.channelSource and TaskRecord.channel_source.
  Extract truncate() and formatDuration() into shared/slack-format.ts so
  they stop drifting between slack-blocks and slack-command-processor.
- Item 19: formatDuration now coerces via Number() for all callers, fixing
  the inconsistency where taskTimedOutMessage skipped the cast.

**CLI error surfacing (item 11)**

- cli/src/commands/slack.ts:getStackOutput used to swallow every
  CloudFormation error as "not deployed yet". Now only swallows
  ValidationError / does-not-exist; auth and permission errors bubble up
  to the user.

**Excess IAM grant (item 18)**

- slackCommandsFn had readSlackSecretsPolicy attached, but the slash-
  command acknowledger only needs the signing secret (already granted
  separately). Grant removed; comment added explaining the intent.

**OAuth CSRF (item 5)**

- Documented the tradeoff in slack-oauth-callback.ts. Slack's client_secret
  requirement on the code exchange is the primary defense; a full state
  nonce requires per-session signed storage and is tracked as a follow-up
  coordinated with PR aws-samples#52's dispatcher rework.

**Not addressed in this PR (architectural feedback for follow-up)**

Alain's "What I would like to see" items 1-5 — dispatcher pattern,
notification-state extraction from channel_metadata, ADR for the adapter
contract — are scoped to the PR that will rebase onto Sam's PR aws-samples#52
(feat(interactive-agents)) once it lands. Sam's locked plan establishes
the FanOutConsumer router that those items depend on.

* test(slack): add handler-level tests for all 7 Slack Lambda handlers

Addresses review item 12 / general feedback 4: the shared utilities
(slack-verify, slack-blocks) and CDK constructs had solid coverage but
all 7 Lambda handlers had zero unit tests, a departure from the rest of
the codebase.

Added test files covering the most load-bearing paths:

- slack-events.test.ts (11 tests) — url_verification challenge with/
  without signature (item 2), retry semantics (drop non-critical,
  reprocess app_uninstalled + tokens_revoked, item 3), revoke-installation
  ordering (item 6: no secret delete on DDB failure), app_mention
  forwarding, no-repo + Lambda-failure reaction swaps (item 8)
- slack-commands.test.ts (4 tests) — signature rejection, inline help,
  async processor invocation, graceful processor failure
- slack-interactions.test.ts (6 tests) — signature rejection, cancel_task
  happy path + owner rejection + terminal-state warn, unknown action_id,
  unlinked account
- slack-notify.test.ts (8 tests) — non-Slack task short-circuit, dedup
  ordering after source check (item 17), dedup write skip on duplicate,
  SlackApiError swallow vs infra-error rethrow (item 4), non-INSERT /
  non-notifiable filter, malformed metadata logging (item 20)
- slack-link.test.ts (6 tests) — 401 without JWT, 400 without code,
  404 on missing/stale code, happy path write+delete, code normalization
- slack-oauth-callback.test.ts (5 tests) — missing code, unpopulated
  client secret, token exchange failure, create-secret-on-RNF path,
  restore-then-update path for reinstall after uninstall
- slack-command-processor.test.ts (9 tests) — legacy source-less payload
  normalization, slash-submit-rejection UX, mention link-prompt + repo
  validation + private channel guard (item 10 hard failure) + transient
  fail-open (item 10 softening), createTaskCore invocation, link code
  persistence, help text

The regression tests for items 2, 3, 4, 6, 9 all fail the buggy pre-fix
version of the code, verified by running them against the unchanged
source during review.

* fix(slack): soften checkChannelAccess + drop stale status claim in docs

Addresses blind-review feedback S2 + S3.

**S2 — checkChannelAccess transient failures now fail open.**

The prior "fail closed on any unknown Slack error" was too strict: a
30-second Slack ratelimited or internal_error during conversations.info
would block every Slack task submission for that window, even though
the notification path downstream could have recovered. Now:

- Hard failures (channel_not_found, not_in_channel, missing_scope) still
  fail closed — these definitively mean the bot cannot post there, and
  silently creating an undeliverable task is the worse UX.
- Transient or unknown errors (ratelimited, internal_error, fatal_error,
  network blips) fail open. slack-notify surfaces any real delivery
  failure when the event plays through.

Regression test added: `mention submit fails open on transient Slack
errors` asserts createTaskCore IS called when conversations.info returns
`ratelimited`.

**S3 — USER_GUIDE Slack bullet drops "check status".**

handleStatus was deleted in this branch (review item 1) because Sam's
locked plan in PR aws-samples#52 delivers `bgagent status` via the CLI rather than
a Slack slash subcommand. The user-facing docs still claimed Slack could
"check status" via `/bgagent`. Rewritten to describe the shipped
behavior: @mention-based submission plus threaded progress notifications
with reaction-based status. Starlight content regenerated via
`node docs/scripts/sync-starlight.mjs`.

* chore(types): extend ChannelSource to include 'linear'

Trivial enabler for the Linear integration branch. No functional change:
all existing code already uses the broad ChannelSource union; adding
'linear' makes the Linear-origin task records type-check once the
inbound adapter starts writing them.

* feat(linear): Phase 1 — CDK backend handlers + mapping tables

- shared/linear-verify.ts: HMAC-SHA256 hex verify over raw body,
  60s webhookTimestamp replay window, secret caching + rotation retry.
- LinearProjectMappingTable: linear_project_id → repo + label_filter.
- LinearUserMappingTable: {workspace}#{user} confirmed / pending#{code}
  with TTL and PlatformUserIndex GSI.
- linear-webhook.ts: verify + dedup (60s TTL on (issue_id, action))
  + async-invoke processor; replies fast inside Linear's 5s budget.
- linear-webhook-processor.ts: Issue label-add detection, project→repo
  + actor→platform user resolution, createTaskCore with
  channelSource 'linear' + linear_* channel metadata.
- linear-link.ts: mirror of slack-link.ts.

* feat(linear): Phase 2 — LinearIntegration construct + stack wiring

- LinearIntegration construct: provisions both mapping tables, the webhook
  dedup table, 3 Lambdas (webhook / processor / link), 2 Secrets Manager
  placeholders, and API Gateway routes under /v1/linear/*.
  Deliberately no DynamoDB Streams consumer — outbound updates go through
  the Linear MCP from inside the agent container.
- Stack: instantiates LinearIntegration in agent.ts and emits CfnOutputs
  for the two secret ARNs and both table names, consumed by
  `bgagent linear setup` / onboard-project.

* feat(linear): Phase 3 — agent-side Linear MCP + channel plumbing

- Orchestrator invoke payload now carries `channel_source` +
  `channel_metadata`, so the agent knows which inbound channel a task came
  from (and the Linear issue identifier for Linear-origin tasks).
- TaskConfig + build_config accept the new fields; server.py extracts them
  from the /invocations payload and forwards into pipeline.run_task.
- channel_mcp.py: new `configure_channel_mcp(repo_dir, channel_source)`.
  For channel_source=='linear' it merges a `linear-server` entry into
  `.mcp.json` (keyed so tools surface as `mcp__linear-server__*`, matching
  the verified MCP spec). For any other channel it's a no-op — the gate
  that keeps Slack/API tasks from loading Linear tools.
- config.py: `resolve_linear_api_token()` reads LinearApiTokenSecret via
  LINEAR_API_TOKEN_SECRET_ARN and caches into LINEAR_API_TOKEN so the SDK
  child process inherits it for the MCP's ${LINEAR_API_TOKEN} placeholder.
- pipeline.py: resolves the token + writes .mcp.json just before run_agent.
- prompt_builder.py: appends a short Linear-specific addendum naming
  `mcp__linear-server__save_comment` / `update_issue` by tool name.
- Stack: grants runtime read on LinearApiTokenSecret and sets the secret
  ARN in the AgentCore runtime environment via CFN override.
- run.sh: forwards LINEAR_API_TOKEN / LINEAR_API_TOKEN_SECRET_ARN into the
  local container for manual testing.
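A minimal sketch of the channel_mcp.py gate and merge semantics described above; the hosted-server URL and entry shape are assumptions for illustration, not the verified MCP spec values:

```python
import json
from pathlib import Path

# Hypothetical entry shape; the real key/URL come from the verified MCP spec.
LINEAR_SERVER_ENTRY = {
    "type": "http",
    "url": "https://mcp.linear.app/mcp",
    "headers": {"Authorization": "Bearer ${LINEAR_API_TOKEN}"},
}

def configure_channel_mcp(repo_dir: str, channel_source: str) -> None:
    """Merge a `linear-server` entry into .mcp.json for Linear-origin
    tasks only; any other channel is a no-op (the gate that keeps
    Slack/API tasks from loading Linear tools)."""
    if channel_source != "linear" or not repo_dir:
        return
    mcp_path = Path(repo_dir) / ".mcp.json"
    try:
        config = json.loads(mcp_path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        config = {}  # tolerate missing or malformed files
    servers = config.setdefault("mcpServers", {})
    servers["linear-server"] = LINEAR_SERVER_ENTRY  # overwrite stale entry
    mcp_path.write_text(json.dumps(config, indent=2))
```

Because the merge goes through `setdefault`, any other servers already in `.mcp.json` are preserved; only the `linear-server` key is (re)written.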

* feat(linear): Phase 4 — CLI commands + setup guide + docs

- cli/src/commands/linear.ts: new `bgagent linear` command group:
  - `setup` — prompts for the Linear webhook signing secret and personal
    API key, populates both placeholders in Secrets Manager, prints the
    next-step checklist including the webhook URL.
  - `link <code>` — calls POST /v1/linear/link with the Cognito JWT.
  - `onboard-project <linear-project-id> --repo owner/repo [--label bgagent]`
    — writes the Linear project → repo mapping row directly to
    LinearProjectMappingTable via DynamoDBDocumentClient (admin IAM path,
    matching the current onboarding UX for Slack + webhooks).
- api-client + types: LinearLinkResponse type + linearLink() method.
- bgagent.ts: registers the new command group.
- cli/package.json: adds @aws-sdk/client-dynamodb + @aws-sdk/lib-dynamodb
  so the CLI can write project mappings.
- docs/guides/LINEAR_SETUP_GUIDE.md: new end-to-end setup walkthrough
  mirroring SLACK_SETUP_GUIDE, covering API-key generation, webhook
  configuration, onboarding, linking, and troubleshooting.
- docs/scripts/sync-starlight.mjs + docs/astro.config.mjs: routing + sidebar
  entry for the new guide.
- docs/guides/USER_GUIDE.md: Linear now shows up as the 5th channel.
- cdk/test/stacks/agent.test.ts: DDB table count bumped 7 → 10 to cover
  the three new Linear tables.

* test(linear): Phase 5 — handler + construct + agent channel_mcp tests

New test coverage for the Linear integration:

- cdk/test/handlers/linear-webhook.test.ts — HMAC pass/fail, stale
  webhookTimestamp, non-Issue type is 200/no-op, missing data.id is 400,
  dedup-hit short-circuits processor invoke, Lambda invoke failure surfaces
  as 500 so Linear retries, malformed JSON with valid signature is 400.
- cdk/test/handlers/linear-webhook-processor.test.ts — project-not-onboarded
  and status=removed skip, label filter gates on create AND on update
  (updatedFrom.labelIds transition), unlinked Linear actor skips, happy
  path asserts channel_source='linear' + full linear_* channelMetadata,
  custom label_filter honored per project mapping.
- cdk/test/handlers/linear-link.test.ts — mirror of slack-link.test.ts:
  401/400/404 error paths + confirmed-mapping write + code normalization.
- cdk/test/constructs/linear-integration.test.ts — resource-count
  assertions (3 DDB tables, 3 Lambdas, 2 secrets, ZERO DDB Streams
  mappings), env-var wiring for both handlers, dedup-table TTL shape.
- agent/tests/test_channel_mcp.py — channel gate (slack/api/webhook/empty
  are no-ops; only linear writes), .mcp.json shape (server key
  `linear-server`, hosted URL, ${LINEAR_API_TOKEN} placeholder), merge
  semantics (preserves other servers, overwrites stale linear entry,
  tolerates malformed JSON), missing-repo_dir guard.

All 853 CDK tests and 294 agent tests pass locally.

* fix(waf): scope-down CRS for Linear webhook route

Exempt /v1/linear/webhook from AWSManagedRulesCommonRuleSet via a
scopeDownStatement. Linear Issue webhook payloads routinely exceed the
ruleset's default 8 KB body-size limit (observed 8614 bytes), and the
full payload is needed by the webhook Lambda for HMAC verification.

The route has stronger, bespoke protection already:
- HMAC verification against LinearWebhookSecret on the raw body
- webhookTimestamp replay guard (60s window)
- Structured JSON parsing with strict field validation
- RateLimitRule (1000/IP/5min) is priority 3 and still applies

CRS continues to block on every other route (user API, Slack, etc.).

* fix(linear-webhook): dedup on issueId+action+webhookTimestamp, 8h TTL

The prior dedup key was issueId#action, which collapsed distinct
label-toggle deliveries on the same issue into a single dispatch — a
label remove/reapply within 60s would be silently dropped.

Also, the prior TTL (60s) was shorter than Linear's retry horizon
(+1m, +1h, +6h). A dispatched-but-ack-lost delivery would re-dispatch
on the +1h or +6h retry after the dedup row had expired, double-
creating the task.

Fix:
- Compose the dedup key as issueId#action#webhookTimestamp. Linear
  reuses webhookTimestamp across retries of one delivery (correctly
  collapsing them) but assigns a fresh timestamp to each new delivery
  (no collapse). This is the primitive documented in linear.app's
  webhook spec; the payload's webhookId field is the webhook-config
  id, not per-delivery.
- Raise DEDUP_TTL_SECONDS to 8h, comfortably over Linear's 7h retry
  horizon.

Added a regression test that two distinct deliveries for the same
issue both dispatch, and an assertion that the TTL is >= 7h.
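The key composition and TTL can be sketched as follows (the real handler is TypeScript; names here are illustrative):

```python
import time

# Comfortably over Linear's ~7h retry horizon (+1m, +1h, +6h).
DEDUP_TTL_SECONDS = 8 * 3600

def dedup_item(issue_id: str, action: str, webhook_timestamp: int) -> dict:
    """webhookTimestamp is reused across retries of one delivery
    (collapsed by the key) but fresh per new delivery (not collapsed)."""
    return {
        "dedup_key": f"{issue_id}#{action}#{webhook_timestamp}",
        "ttl": int(time.time()) + DEDUP_TTL_SECONDS,
    }
```

The row would be written with a conditional `attribute_not_exists(dedup_key)` put, so a ConditionalCheckFailedException means the delivery was already dispatched.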

* chore(linear-processor): log label/updatedFrom details on trigger reject

When shouldTrigger rejects an update event, add current_labels,
updated_from_keys, updated_from_label_ids, and current_label_ids to
the log line so we can diagnose cases where the filter drops a valid
label-added transition.

No behaviour change; diagnostic-only.

* feat(linear-cli): auto-link, list-projects, UUID validation

Three CLI UX improvements discovered during first-time setup:

1. Auto-link the token owner during `bgagent linear setup`.
   After storing the webhook secret and API key, query Linear's
   `viewer` + `organization` via GraphQL using the token, decode the
   caller's Cognito sub from the cached id_token, and write the
   LinearUserMapping row directly (status='active',
   link_method='auto_setup'). This skips the manual code-exchange
   ceremony for the common case where the person installing ABCA for
   their workspace is the same person who'll trigger tasks.

   Failures are fail-open: if GraphQL errors, the API key is
   malformed, or the CLI isn't logged in, setup prints a warning and
   continues. Secrets are still stored; the user can run the link
   flow manually.

2. `bgagent linear list-projects` — lists all projects visible to the
   stored API token with their full UUIDs. Solves the "Linear project
   URL contains a truncated UUID" footgun.

3. UUID validation on `onboard-project`. An up-front RFC-4122 shape
   check, with a clear error message pointing at `list-projects`.
   Avoids silently writing mapping rows for slug-shaped ids that
   would never match a real Linear payload.

Tests cover the auto-link branch, fail-open cases, and the GraphQL
wire shape.
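The shape check in item 3 amounts to the following (a Python sketch; the real CLI code is TypeScript, and the function name is illustrative):

```python
import re

# RFC-4122 shape: 8-4-4-4-12 hex groups. Version/variant nibbles are left
# loose on purpose; this is a footgun guard, not a strict validator.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)

def validate_project_id(project_id: str) -> None:
    if not UUID_RE.match(project_id):
        raise ValueError(
            f"'{project_id}' is not a full UUID. Linear URLs truncate it; "
            "run `bgagent linear list-projects` to get the full id."
        )
```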

* feat(linear-agent): 👀 → ✅/❌ issue reactions via GraphQL

Mirror the Slack integration's terminal-emoji status UX on Linear
issues: post 👀 when the agent container starts, swap it for ✅ on
success or ❌ on failure (including the pipeline's outer-except
crash path).

Implementation notes:
- Linear's MCP v1 does not expose a reactions tool (verified
  2026-05-06 against the `tools` array in the Claude Agent SDK's
  system init). We use direct GraphQL against
  `https://api.linear.app/graphql` with the same token the MCP
  server uses. Retire this module in favour of a prompt addendum
  once Linear MCP ships `create_reaction`.
- Gated on `channel_source == 'linear'` + `linear_issue_id` in
  channel_metadata. No-op for Slack/API/webhook channels.
- `react_task_started` returns the reaction id; the pipeline threads
  it into `react_task_finished` so we can `reactionDelete` the 👀
  before posting the terminal emoji. Matches Slack's behaviour of
  not leaving a lingering "watching" marker.
- All GraphQL failures are logged and swallowed — reactions are
  advisory UX, never load-bearing.
- Crash-path (pipeline.py outer except): calls
  react_task_finished(success=False, ...) as a best-effort so a
  crashed/failed task doesn't leave a dangling 👀.

Tests cover:
- Channel gate (slack/api/webhook all no-op; linear with no
  issue_id also no-op)
- GraphQL wire shape and variables
- Delete-then-create swap on terminal transition
- Fallback path when started_reaction_id is None
- All error swallowing paths (HTTP error, connection exception,
  GraphQL errors, missing token)

* fix(linear-agent): correct MCP tool name, make updates required

The Linear MCP's tool for updating an issue is
mcp__linear-server__save_issue, not update_issue. Earlier prompt
referenced the wrong name; the agent silently skipped the state
transition on small/fast tasks because the advisory framing
("use its tools to keep the reporter informed") let the model
deprioritize the calls.

Rewrite the addendum as a required three-step contract:
1. On start — save_comment 🤖 Starting
2. On PR open — save_comment with PR URL + save_issue (transition
   to In Review; fall back to In Progress; skip if neither state
   exists rather than loop list_issue_statuses)
3. On completion/failure — save_comment with final status

Added explicit "If neither exists, skip the state transition — the
PR comment alone is enough. Do not invent state names or loop on
list_issue_statuses." so an impoverished workspace state-set
doesn't trigger retry behaviour.

* fix(test): remove race in test_server thread-failure test

test_background_thread_failure_503_and_backup_terminal_write polled
/ping for _background_pipeline_failed = True, but the flag flips
several lines before write_terminal runs in the same except block.
On the first run after a fresh import the poll could win the race
and mock_write.assert_called() would fire before the call happened.

Swap the poll-the-flag pattern for a wait-until-the-background-thread-
has-exited loop, which guarantees both side effects (the flag flip
and the backup write_terminal call) have completed before we assert.

Observed flaky 2026-05-06 in a full-suite run; reliably green after
the fix across 5 consecutive runs.
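The replacement pattern is simply a bounded join on the thread object rather than a poll on one of its side effects; helper name illustrative:

```python
import threading

def wait_for_thread_exit(thread: threading.Thread, timeout: float = 5.0) -> None:
    """Wait for the background thread itself to exit, which guarantees
    every side effect in its except block (flag flip AND the backup
    write_terminal call) has completed before the test asserts."""
    thread.join(timeout)
    if thread.is_alive():
        raise AssertionError("background thread did not exit in time")
```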

* docs(linear): reflect auto-link + list-projects UX

Restructure the setup guide so the webhook URL is available before
the user needs to create the webhook (prior ordering was circular).

- Step 1: generate Linear API key
- Step 2: run `bgagent linear setup` — it prints the webhook URL
  and pauses at the signing-secret prompt
- Step 3: create the webhook in Linear, copy the signing secret
- Step 4: back in the terminal, paste the signing secret + API key;
  auto-link runs for the token owner
- Step 5: `bgagent linear onboard-project <uuid>` — using the
  `list-projects` command to find the full UUID (the URL slug is
  truncated)
- Step 6: manual code-exchange flow, now marked as fallback for
  linking additional Linear users beyond the token owner

Synced to docs/src/content/docs/using/Linear-setup-guide.md via
mise //docs:sync.

* docs(linear): drop stale ROADMAP.md reference

Roadmap doc has no Linear content; referring to it misled readers.
Replace the link with a plain explanation of the admin-assisted path.

* docs(linear): honest v1 multi-user story

Describe the v1 linking model accurately: ABCA ships the consumer
command (`bgagent linear link <code>`) but no Linear-side code
generator — the comment-triggered flow is a planned v1.1 follow-up.
Design around API-token-owner-as-sole-submitter until then.

* chore(lint): apply ESLint + ruff auto-fixes

Fix two `no-duplicate-imports` violations in the Slack tests by merging
`import type` + `import` pairs into single imports. ESLint/ruff also
applied cosmetic reformatting (whitespace alignment in object literals,
line-splitting long inline objects). One `# noqa: S105` added to
`LINEAR_API_TOKEN_ENV` — it's an env var *name*, not a secret value, so
ruff's hardcoded-password rule is a false positive there.

No behavioural changes. Ensures `mise run build` passes the self-
mutation guard in .github/workflows/build.yml.

* fix(linear): address Alain's PR aws-samples#63 review blockers

1) Linear webhook: roll back dedup row on Lambda invoke failure

Previously, the dedup row was written before invoking the processor
Lambda. If InvokeCommand failed (throttle, 5xx, etc.), the handler
returned 500 and relied on Linear's retry schedule (+1m / +1h / +6h) to
redeliver — but all three retries fell inside the 8h dedup TTL, so each
one hit ConditionalCheckFailedException and silently acked as "already
processed." Net: a single transient invoke failure dropped the task
for 8 hours.

Fix: on invoke failure, delete the dedup row before returning 500 so
the next retry can reserve it again. A secondary try/catch logs (but
does not re-raise) if the delete itself fails — the primary invoke
error is the user-visible signal. Added two regression tests.
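The reserve-invoke-release flow can be sketched like this (Python stand-in for the TypeScript handler; table and function names are hypothetical):

```python
import logging

log = logging.getLogger(__name__)

def dispatch(dedup_table, lambda_client, dedup_key: str, payload: bytes) -> int:
    # Reserve the dedup row first; a ConditionalCheckFailedException here
    # (not shown) means the delivery was already dispatched.
    dedup_table.put_item(
        Item={"dedup_key": dedup_key},
        ConditionExpression="attribute_not_exists(dedup_key)",
    )
    try:
        lambda_client.invoke(
            FunctionName="linear-webhook-processor",
            InvocationType="Event",
            Payload=payload,
        )
    except Exception:
        # Release the row so Linear's next retry can reserve it again.
        try:
            dedup_table.delete_item(Key={"dedup_key": dedup_key})
        except Exception:
            log.warning("dedup rollback failed", exc_info=True)
        return 500  # the primary invoke error is the user-visible signal
    return 200
```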

2) CLI ChannelSource sync with CDK

cli/src/types.ts:30 still had 'api' | 'webhook'; CDK had
'api' | 'webhook' | 'slack' | 'linear' since the Slack/Linear inbound
adapters landed. Any CLI switch on TaskDetail.channel_source was
narrowing against a stale union. Widened the CLI type and refreshed
the jsdoc to match the CDK side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: bgagent <bgagent@noreply.github.com>