
feat(taskbroker): propagate TaskError from worker and log failure context on broker #607

Open
s-starostin wants to merge 2 commits into getsentry:main from s-starostin:feat/taskerror-propagation

Conversation

@s-starostin commented Apr 25, 2026

Summary

Propagate task execution exception context from the Python worker client to taskbroker and emit structured broker-side failure logs.

This PR does two things:

  1. extends the Python taskbroker client with error_hook support and propagates TaskError in SetTaskStatusRequest
  2. updates taskbroker server logging to emit structured failure context when TaskError is present

Dependency

Depends on

Downstream PR

Consumed by:

Changes

Python client

  • add an error_hook extension point to TaskbrokerApp
  • call error_hook.on_exception(task_meta, exc) in the worker exception path
  • attach the returned TaskError | None to ProcessingResult
  • send the error in SetTaskStatusRequest (a sketch of this surface follows the list)
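A minimal sketch of what this client surface could look like. Only the on_exception(task_meta, exc) call and the TaskError | None return are taken from this PR's description; the dataclass fields mirror the log shape below, and everything else (names, types) is an assumption, not the actual protobuf message:

    from dataclasses import dataclass
    from typing import Protocol


    @dataclass
    class TaskError:
        # Hypothetical field names; the real proto message may differ.
        exception_type: str
        exception_message: str


    class ErrorHook(Protocol):
        def on_exception(self, task_meta: dict, exc: BaseException) -> TaskError | None:
            """Map a worker-side exception to an optional TaskError envelope.
            Returning None means no error is attached to ProcessingResult."""
            ...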

Broker

  • read request.error in set_task_status
  • log structured failure context after status persistence succeeds
  • log for:
    • Failure always
    • Retry only when an error envelope is present
  • keep old-worker compatibility when error is absent (the decision logic is sketched below)
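The broker itself is Rust (src/grpc/server.rs); purely to illustrate the decision above, here is the same logic restated as a minimal Python sketch. The function name and string statuses are assumptions; the real broker matches on the proto status enum:

    def should_log_failure(status: str, error) -> bool:
        # error is the optional TaskError envelope from the worker (or None).
        # Failure is always logged; Retry only when the worker attached an
        # error; other statuses (e.g. Complete) emit no failure log.
        if status == "Failure":
            return True
        if status == "Retry":
            return error is not None
        return False

Old workers never populate request.error, so they fall through the error is not None check and behave exactly as before.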

Why

This lets taskbroker logs reflect what the worker actually observed when a task failed, instead of forcing operators to dig through worker logs alone.

Logging behavior

Expected broker log shape:

    task reported failure task_id=... taskname=... namespace=... status=Failure attempts=... exception_type="..." exception_message="..."

Compatibility

  • old workers remain supported
  • requests without error still work
  • no schema changes to inflight storage
  • no DLQ format changes

Tests

  • verify Failure with an error logs structured context
  • verify Retry with an error logs structured context
  • verify Retry without an error does not emit the failure log
  • verify Complete does not emit the failure log
  • verify the old-worker path without an error still works (see the pytest-style sketch below)
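For instance, the Retry and Complete cases could look like this against the should_log_failure sketch above (pytest-style; the helper is hypothetical and would live broker-side in Rust in the real tests):

    # Assumes should_log_failure from the broker sketch above is in scope.
    def test_retry_with_error_logs_structured_context():
        assert should_log_failure("Retry", {"exception_type": "ValueError"})

    def test_retry_without_error_is_quiet():
        assert not should_log_failure("Retry", None)

    def test_complete_is_quiet():
        assert not should_log_failure("Complete", None)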

Follow-up

getsentry/sentry will consume the new error_hook surface and provide the Sentry-specific hook implementation.
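For illustration only, the downstream hook could take roughly this shape. Every name here beyond error_hook.on_exception is an assumption (sentry_sdk.capture_exception is the standard SDK call; TaskError reuses the client sketch above):

    import sentry_sdk


    class SentryErrorHook:
        # Hypothetical implementation: report the exception to Sentry first,
        # then hand the broker a TaskError built from the exception.
        def on_exception(self, task_meta, exc):
            sentry_sdk.capture_exception(exc)
            return TaskError(
                exception_type=type(exc).__name__,
                exception_message=str(exc),
            )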

@s-starostin requested a review from a team as a code owner April 25, 2026 17:52
    try:
        with timeout_alarm(inflight.activation.processing_deadline_duration, handle_alarm):
-           _execute_activation(task_func, inflight.activation, app.context_hooks)
+           _execute_activation(task_func, inflight.activation, app.context_hooks, app.error_hook)
Bug: The function _execute_activation is called with an extra argument, app.error_hook, which will cause a TypeError at runtime.
Severity: CRITICAL

Suggested Fix

Remove the fourth argument, app.error_hook, from the call to _execute_activation on line 242. The error hook is already correctly invoked in the surrounding exception handlers.
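In other words, the call site would revert to the three-argument form from before the diff:

    # Per the suggested fix: drop the extra argument; the hook is invoked
    # in the surrounding exception handlers instead.
    _execute_activation(task_func, inflight.activation, app.context_hooks)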

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: clients/python/src/taskbroker_client/worker/workerchild.py#L242

Potential issue: The function `_execute_activation` is called on line 242 with four
positional arguments: `task_func`, `inflight.activation`, `app.context_hooks`, and
`app.error_hook`. However, the function's definition only accepts a maximum of three
arguments. The fourth argument, `app.error_hook`, has no corresponding parameter. Since
the function does not use `*args` or `**kwargs` to accept arbitrary arguments, this
mismatch will cause a `TypeError` every time a task is processed, effectively preventing
any task execution.


Member
This is a valid bug.

Author
Good catch, thanks - this is a valid bug.

I had a mismatch between the _execute_activation(...) call site and the function signature in workerchild.py. I've just pushed the fix.


@cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


Reviewed by Cursor Bugbot for commit ca68733.

Comment thread clients/python/src/taskbroker_client/worker/workerchild.py Outdated
Comment thread src/grpc/server.rs Outdated

        _execute_activation(task_func, inflight.activation, app.context_hooks, app.error_hook)
        next_state = TASK_ACTIVATION_STATUS_COMPLETE
    except ProcessingDeadlineExceeded as err:
        if app.error_hook is not None:
Member
Why does this happen first? Can we make it happen after all the other logging, so we can be sure that a bug in a hook doesn't break our reporting?

Author
Agreed.

I changed this so the existing reporting path runs first, and the hook is now invoked later via _get_task_error_from_hook(...) as best-effort only. A hook bug should no longer interfere with retry / logging / Sentry reporting; it only affects the optional TaskError payload.
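A minimal sketch of that best-effort pattern, assuming only the helper name from the comment above; the body and logging are illustrative:

    import logging

    logger = logging.getLogger(__name__)


    def _get_task_error_from_hook(error_hook, task_meta, exc):
        # Runs after the existing retry / logging / Sentry reporting paths.
        # A broken hook can only cost the optional TaskError payload.
        if error_hook is None:
            return None
        try:
            return error_hook.on_exception(task_meta, exc)
        except Exception:
            logger.warning("error_hook raised; dropping TaskError payload", exc_info=True)
            return None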

