
feat: Add error log analysis feature #7768

Open

Sisyphbaous-DT-Project wants to merge 5 commits into AstrBotDevs:master from Sisyphbaous-DT-Project:feat/error-analysis-diagnosis

Conversation

Contributor

@Sisyphbaous-DT-Project Sisyphbaous-DT-Project commented Apr 24, 2026

Background

AstrBot and its plugin ecosystem update frequently. When novice users hit an error, they usually don't know where to look first, how to attribute the problem, or how to collect the useful information to hand to another person or to an AI for analysis. Many issues that could be located with a single glance at the logs, the traceback, or the plugin metadata end up in repeated back-and-forth because there is no unified diagnosis entry point, which further raises the debugging cost for beginners.

This PR therefore adds an "Error Log Analysis" feature. The core goal is not another plain log viewer, but a beginner-oriented fault-diagnosis entry in the WebUI that ties together logs, error context, and the downstream analysis chain, helping users quickly determine:

  • whether the problem belongs to AstrBot Core, a specific plugin, or a Provider / configuration / network issue
  • roughly how severe the problem is
  • what to do next
  • how to let an LLM analyze the root cause from the current error context and produce actionable fix suggestions
  • how to let the LLM reason over related local source code, plugin file snippets, README / CHANGELOG, etc., instead of guessing from thin air

What this PR implements

1. New error-diagnosis entry point

  • Added an "Error Diagnosis" entry to the Dashboard sidebar
  • Added a standalone page wired into the existing routing, directly reachable at /error-analysis
  • Added navigation copy in Chinese, English, and Russian so the UI works in multilingual environments

2. Backend diagnosis pipeline

  • Added an error-diagnosis route module to host the upcoming log collection, archiving, and analysis capabilities
  • Mounted the diagnosis route in the Dashboard startup flow so it initializes at the right time and avoids event-loop issues
  • Extended the log structure with diagnosis-friendly structured fields, leaving room to locate problems by plugin, module, file, line number, and other dimensions
  • Reserved context-assembly and conversation-extension hooks for later LLM features: automatic root-cause analysis, reading related local source snippets and plugin metadata, and generating fix suggestions

3. Static-bundle and production access compatibility

  • Added a static-route whitelist entry so that visiting /error-analysis in production correctly falls back to the frontend page
  • Updated the frontend build artifacts so the main service also shows the new page when serving data/dist directly
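The fallback behavior can be sketched as a small path-resolution rule; the route names and the `resolve_static` helper are hypothetical, and the real implementation lives in the Dashboard's static-route handling:

```python
# Illustrative whitelist of frontend routes that should fall back to the
# SPA entry point instead of returning 404.
SPA_ROUTES = {"error-analysis", "dashboard", "config"}

def resolve_static(path: str) -> str:
    """Decide which file to serve for a request path under the SPA
    fallback rule: whitelisted pages get index.html, everything else is
    treated as a regular static asset."""
    first = path.lstrip("/").split("/", 1)[0]
    if first in SPA_ROUTES:
        return "data/dist/index.html"
    return f"data/dist/{path.lstrip('/')}"
```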

Scope relative to the spec document

This submission covers the first-phase goal in the spec document: build the "accessible, serviceable, extensible" scaffold first:

  • put the diagnosis entry into the WebUI
  • extend the log structure enough to support later analysis
  • wire up the routing and static assets
  • get the test and build pipelines passing
  • lay the groundwork for later LLM auto-analysis, local source/plugin snippet context, follow-up Q&A, source-context assembly, and finer-grained issue cards

The fuller capabilities mentioned in the spec — automatic severity classification, plugin/core/Provider categorization, LLM analysis with follow-up questions, context assembly from related local files, finer-grained issue cards — are directions for later iterations. This version focuses on laying the foundation to lower the complexity of those follow-ups.

Changes

  • Added the backend error-diagnosis route and its initialization logic
  • Added the frontend error-diagnosis page, route, and sidebar entry
  • Added structured log fields to support later analysis
  • Fixed the error-diagnosis page being unreachable from the static bundle
  • Added backend route tests and verified Dashboard regression scenarios
  • Passed frontend type checking and build verification

Testing

Backend:

  • uv sync
  • uv run ruff format .
  • uv run ruff check .
  • uv run pytest tests/test_error_analysis_route.py tests/test_dashboard.py -k "error_analysis or dashboard_ssl_missing_cert_and_key_falls_back_to_http or test_auth_login"

Frontend:

  • cd dashboard && pnpm install
  • cd dashboard && pnpm run typecheck
  • cd dashboard && pnpm run build

Local verification:

  • Start the main service and open http://localhost:6185
  • Visiting http://localhost:6185/error-analysis directly opens the new page

Test results

  • ruff format passes
  • ruff check passes
  • the related backend pytest suites pass
  • frontend typecheck passes
  • frontend build passes
  • the local Dashboard is reachable, and the /error-analysis route returns the frontend page instead of a 404

Notes

The full error-analysis system described in the spec will eventually include automatic analysis, follow-up Q&A, context enrichment from local source and plugin files, issue cards, severity classification, and more. This PR delivers the most critical MVP infrastructure and leaves room to iterate on those capabilities.

@auto-assign auto-assign Bot requested review from LIghtJUNction and anka-afk April 24, 2026 10:38
@dosubot dosubot Bot added size:XXL This PR changes 1000+ lines, ignoring generated files. area:core The bug / feature is about astrbot's core, backend area:webui The bug / feature is about webui(dashboard) of astrbot. labels Apr 24, 2026
@Sisyphbaous-DT-Project Sisyphbaous-DT-Project force-pushed the feat/error-analysis-diagnosis branch from e8fc081 to 61e5be1 Compare April 24, 2026 10:40

@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 4 issues, and left some high level feedback:

  • The new ErrorAnalysisRoute performs synchronous file I/O (read_text, write_text, read_bytes) inside async methods, which can block the event loop under load; consider offloading these to an executor or using aiofiles-style async I/O wrappers.
  • ErrorAnalysisPage.vue inlines a lot of Chinese UI text directly in the template instead of going through the existing i18n system, which makes localization inconsistent compared to the navigation labels added in the locale JSON files.
## Individual Comments

### Comment 1
<location path="astrbot/dashboard/routes/error_analysis.py" line_range="608-617" />
<code_context>
+                payload = json.loads(self.settings_file.read_text(encoding="utf-8"))
</code_context>
<issue_to_address>
**issue (performance):** Synchronous filesystem access in async context can block the event loop.

This method (and others like `_load_settings`, `_save_settings`, `_load_records`, `_save_records`, `_read_file_excerpt`) performs blocking filesystem calls (`exists`, `read_text`, `read_bytes`, `write_text`, `open`) directly in async handlers/background tasks, which can stall the event loop under load.

Consider offloading these calls to an executor (e.g. `asyncio.to_thread`) or using an async filesystem library so log ingestion and SSE streaming stay responsive as the store grows.
</issue_to_address>
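A minimal sketch of the executor-offload fix suggested here, using `asyncio.to_thread`; the `load_settings` signature is illustrative, not the route's actual API:

```python
import asyncio
import json
from pathlib import Path

async def load_settings(path: Path) -> dict:
    """Read a JSON settings file without blocking the event loop."""
    def _read() -> dict:
        # All blocking filesystem calls stay inside this helper...
        if not path.exists():
            return {}
        return json.loads(path.read_text(encoding="utf-8"))

    # ...which runs in a worker thread, so SSE streaming and log
    # ingestion on the loop remain responsive.
    return await asyncio.to_thread(_read)
```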

### Comment 2
<location path="dashboard/src/views/ErrorAnalysisPage.vue" line_range="462-263" />
<code_context>
+          qaMessages.value.push({ role: 'assistant', content: streamingAnswer.value })
+          question.value = ''
+          await loadRecords()
+        } else if (payload.type === 'error') {
+          showMessage(payload.message || '追问失败', 'error')
+        }
+      }
+    }
</code_context>
<issue_to_address>
**issue (bug_risk):** Error responses from the ask/stream endpoint still result in an empty QA pair being appended.

Because the `error` branch only shows a message and doesn’t change control flow, the `done` handler still appends the QA pair using an empty `streamingAnswer` and calls `loadRecords()`. Consider tracking an `hadError` flag (or early-returning from the loop) so that, once an error event is received, you skip appending the QA messages and avoid further processing for that request.
</issue_to_address>

### Comment 3
<location path="astrbot/dashboard/routes/error_analysis.py" line_range="172-181" />
<code_context>
+            self.log_broker.unregister(queue)
+
+    async def _handle_log(self, raw_log: dict[str, Any]):
+        settings = await self._load_settings()
+        if str(raw_log.get("level") or "").upper() not in settings["levels"]:
+            return
+        if not settings["passive_record"] and not settings["auto_analyze"]:
</code_context>
<issue_to_address>
**suggestion (performance):** Log watcher applies settings on every message, which may be unnecessary overhead.

In `_handle_log`, every message triggers `await self._load_settings()`, which may hit the filesystem (especially when settings are corrupted/rewritten). At high log volumes this becomes unnecessary overhead.

If settings don’t need to update per-message, consider caching them in memory and only refreshing on `update_settings` (or via a periodic refresh) to keep the watcher lightweight while still allowing configuration changes.
</issue_to_address>
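The caching suggestion could look roughly like this; `CachedSettings` and its TTL refresh are illustrative assumptions, not code from the PR:

```python
import asyncio
import time
from typing import Any

class CachedSettings:
    """Keep settings in memory on the hot path; only an expired TTL or an
    explicit invalidate() (e.g. from update_settings) hits the real loader."""

    def __init__(self, loader, ttl: float = 30.0):
        self._loader = loader  # async callable doing the real filesystem read
        self._ttl = ttl
        self._cached: dict[str, Any] | None = None
        self._loaded_at = 0.0
        self._lock = asyncio.Lock()

    async def get(self) -> dict[str, Any]:
        if self._cached is not None and time.monotonic() - self._loaded_at < self._ttl:
            return self._cached
        async with self._lock:
            # Re-check under the lock so concurrent callers load only once.
            if self._cached is None or time.monotonic() - self._loaded_at >= self._ttl:
                self._cached = await self._loader()
                self._loaded_at = time.monotonic()
        return self._cached

    def invalidate(self) -> None:
        """Call after settings are rewritten to force a fresh load."""
        self._cached = None
```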

### Comment 4
<location path="astrbot/dashboard/routes/error_analysis.py" line_range="134" />
<code_context>
+    return None, text
+
+
+class ErrorAnalysisRoute(Route):
+    def __init__(self, context: RouteContext, core_lifecycle: AstrBotCoreLifecycle):
+        super().__init__(context)
</code_context>
<issue_to_address>
**issue (complexity):** Consider breaking ErrorAnalysisRoute into smaller storage, SSE/provider helper, and analysis helper components to avoid a god‑class with long, mixed‑responsibility methods and duplicated logic.

- `ErrorAnalysisRoute` has grown into a god‑class with several long methods (`ask_stream`, `_analyze_record`, `_create_record_from_log`, etc.). This makes it harder to reason about and test. You can reduce complexity without changing behavior by extracting a few small, focused collaborators and helpers.

### 1. Extract storage concerns into small classes

Right now `_load_settings`, `_save_settings`, `_load_records`, `_save_records`, `_update_record`, `_append_qa_message` all live on the route and manage locks + filesystem. You can pull them into two small classes and inject them, moving IO and locking out of the route.

```python
# storage.py
class ErrorAnalysisSettingsStore:
    def __init__(self, base_dir: Path, lock: asyncio.Lock):
        self.base_dir = base_dir
        self.file = base_dir / "settings.json"
        self.lock = lock

    async def load(self) -> dict[str, Any]:
        async with self.lock:
            if not self.file.exists():
                await self.save(DEFAULT_SETTINGS.copy())
                return DEFAULT_SETTINGS.copy()
            try:
                payload = json.loads(self.file.read_text(encoding="utf-8"))
                if not isinstance(payload, dict):
                    raise ValueError
            except Exception:  # noqa: BLE001
                payload = DEFAULT_SETTINGS.copy()
                await self.save(payload)
            return _sanitize_settings(payload)

    async def save(self, payload: dict[str, Any]) -> None:
        self.base_dir.mkdir(parents=True, exist_ok=True)
        self.file.write_text(
            json.dumps(_sanitize_settings(payload), ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
```

```python
class ErrorRecordStore:
    def __init__(self, base_dir: Path, lock: asyncio.Lock):
        self.base_dir = base_dir
        self.file = base_dir / "records.jsonl"
        self.lock = lock

    async def load_all(self) -> list[dict[str, Any]]:
        async with self.lock:
            # body of _load_records unchanged
            ...

    async def save_all(self, records: list[dict[str, Any]]) -> None:
        async with self.lock:
            # body of _save_records unchanged
            ...

    async def get(self, record_id: str) -> dict[str, Any] | None:
        records = await self.load_all()
        return next((r for r in records if r.get("id") == record_id), None)

    async def update(self, record_id: str, updates: dict[str, Any]) -> dict[str, Any] | None:
        records = await self.load_all()
        # body of _update_record, minus settings/max_records logic
        ...
```

Then in `ErrorAnalysisRoute.__init__`:

```python
self.settings_store = ErrorAnalysisSettingsStore(self.base_dir, self.settings_lock)
self.record_store = ErrorRecordStore(self.base_dir, self.records_lock)
```

And callers change from:

```python
settings = await self._load_settings()
record = await self._get_record(record_id)
await self._save_records(records)
```

to:

```python
settings = await self.settings_store.load()
record = await self.record_store.get(record_id)
await self.record_store.save_all(records)
```

This removes most IO + locking logic from the route and keeps functionality identical.

### 2. Extract SSE response + provider resolution helpers

`events`, `ask_stream`, and `_error_stream_response` all re‑create SSE boilerplate; provider resolution logic is duplicated in `_analyze_record` and `ask_stream`. You can centralize both to shorten methods.

```python
async def _make_sse_response(self, stream) -> QuartResponse:
    response = await make_response(
        stream(),
        {
            "Content-Type": "text/event-stream",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Transfer-Encoding": "chunked",
        },
    )
    response.timeout = None  # type: ignore[attr-defined]
    return response

async def _resolve_provider(self, provider_id: str | None, record: dict[str, Any] | None = None) -> Provider | None:
    pid = provider_id or (str(record.get("provider_id") or "") if record else "")
    if not pid:
        settings = await self.settings_store.load()
        pid = str(settings.get("provider_id") or "")
    return self._get_provider(pid)
```

Then `events` becomes:

```python
async def events(self) -> QuartResponse:
    async def stream():
        queue = asyncio.Queue(maxsize=200)
        self.event_queues.append(queue)
        try:
            yield "data: " + json.dumps({'type': 'connected'}) + "\n\n"
            while True:
                try:
                    event = await asyncio.wait_for(queue.get(), timeout=20)
                    yield "data: " + json.dumps(event, ensure_ascii=False) + "\n\n"
                except asyncio.TimeoutError:
                    yield SSE_HEARTBEAT
        finally:
            if queue in self.event_queues:
                self.event_queues.remove(queue)

    return await self._make_sse_response(stream)
```

And `ask_stream` reduces to:

```python
provider = await self._resolve_provider(provider_id, record)
if not provider:
    return await self._error_stream_response("Provider ... not available")

contexts = self._build_qa_context(record, question)

async def stream():
    # existing streaming loop unchanged
    ...

return await self._make_sse_response(stream)
```

This reduces repetition and makes the control flow in `ask_stream` and `events` easier to follow.

### 3. Split `_analyze_record` into smaller helpers

`_analyze_record` currently does: provider lookup, record lookup, settings load, context build, status update, model call, JSON parsing, validation, mapping, and final update. You can keep the behavior but move pieces into private helpers so the main method reads as orchestration only.

```python
async def _prepare_analysis(self, record_id: str, provider_id: str):
    provider = self._get_provider(provider_id)
    if not provider:
        return None, None, None, f"Provider {provider_id} not available"

    record = await self.record_store.get(record_id)
    if not record:
        return None, None, None, f"Record {record_id} not found"

    settings = await self.settings_store.load()
    related_files = self._build_related_files(record, settings)
    prompt = self._build_diagnosis_prompt(record, settings, related_files)
    return provider, record, {"related_files": related_files, "prompt": prompt}, None
```

```python
async def _analyze_record(self, record_id: str, provider_id: str) -> tuple[bool, str]:
    provider, record, prep, error = await self._prepare_analysis(record_id, provider_id)
    if error:
        return False, error

    related_files = prep["related_files"]
    prompt = prep["prompt"]

    updated = await self._update_record(
        record_id,
        {
            "status": "analyzing",
            "updated_at": time.time(),
            "provider_id": provider_id,
            "related_files": related_files,
        },
    )
    if updated:
        await self._emit_event("record_updated", updated)

    try:
        resp = await provider.text_chat(
            contexts=[{"role": "system", "content": SYSTEM_PROMPT_DIAGNOSIS},
                      {"role": "user", "content": prompt}],
            model=provider.get_model() or None,
        )
        return await self._handle_analysis_response(record_id, resp.completion_text or "")
    except Exception as exc:  # noqa: BLE001
        ...
```

```python
async def _handle_analysis_response(self, record_id: str, raw_text: str) -> tuple[bool, str]:
    parsed, parsed_text = parse_json_from_model_output(raw_text)
    if parsed is None:
        updated = await self._update_record(
            record_id,
            {
                "status": "failed",
                "updated_at": time.time(),
                "summary": "Model returned non-JSON output",
                "raw_model_output": redact_sensitive_text(parsed_text),
                "error_message": "Model output is not valid JSON",
            },
        )
        if updated:
            await self._emit_event("record_updated", updated)
        return False, "Model output is not valid JSON"

    # existing mapping from parsed -> analysis + update
    ...
```

This keeps `_analyze_record` short and focused on the high‑level flow, while the details (prep, error handling, response mapping) live in dedicated helpers.

These changes are incremental, keep all behavior, and directly address the complexity concerns (god‑class, long methods, mixed IO/business logic, duplicated SSE/provider logic) without requiring a full redesign.
</issue_to_address>



@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces an "Error Diagnosis" feature that automatically captures ERROR and CRITICAL logs, provides AI-powered analysis of the causes, and allows users to interact with an AI assistant for further troubleshooting. Key changes include a new backend route for log monitoring and analysis, a frontend dashboard page, and enhanced log metadata collection. Feedback focuses on improving the robustness of the log watcher, ensuring atomicity in file operations for settings and records, preventing potential memory issues when reading source files, and implementing concurrency limits for AI analysis tasks.
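The atomicity concern for settings/records writes is commonly addressed with a write-temp-then-rename pattern; this is a hedged sketch under that assumption, not the PR's code:

```python
import json
import os
import tempfile
from pathlib import Path

def atomic_write_json(path: Path, payload: dict) -> None:
    """Write JSON to a temp file in the target directory, then rename it
    into place with os.replace(), so concurrent readers never observe a
    partially written settings or records file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f, ensure_ascii=False, indent=2)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes hit disk first
        os.replace(tmp, path)  # atomic on both POSIX and Windows
    except BaseException:
        os.unlink(tmp)
        raise
```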

Contributor Author

In this round I addressed the review points affecting stability and attribution accuracy first:

  • Added protection around the ErrorAnalysisRoute watcher, record updates, settings writes, file reads, and auto-analysis rate limiting
  • The ask/stream error path no longer appends an empty QA pair
  • Raised the priority of plugin-traceback attribution

Two classes of suggestions I have deliberately not expanded into refactors this round:

  • splitting ErrorAnalysisRoute into multiple collaborators
  • moving all page copy to i18n

Both are structural changes that don't affect the current feature loop. I'm prioritizing a deliverable scope this round rather than turning it into a large refactor; if the feature keeps growing, a dedicated split will make more sense later.

Contributor Author

A small follow-up (latest commit: 196c7ebe):

  1. Fixed the ErrorAnalysisPage.vue encoding issue
  • Cleaned up the mojibake strings on the page, restoring readable text
  • Removed the file's BOM to prevent further encoding anomalies
  2. Adjusted _classify_target attribution priority
  • After the plugin-path check, provider/network/config keywords are now matched before falling back to core
  • This mainly fixes Provider errors being misclassified as Core in core-wrapper log scenarios
  3. Guarded QA appends in ask_stream
  • A QA pair is only written when answer.strip() is non-empty, so a normal finish with empty output no longer records an empty assistant message
  4. Test additions
  • Added a test for provider-signal attribution priority
  • The watcher test now suppresses CancelledError while awaiting cancellation, improving stability
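The CancelledError handling mentioned in the test change can be sketched like this; `stop_watcher` is a hypothetical helper name, not code from the PR:

```python
import asyncio
import contextlib

async def stop_watcher(task: asyncio.Task) -> None:
    """Cancel a background watcher task and swallow only the expected
    CancelledError while awaiting it, instead of try/except noise."""
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
```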

Passed locally:

  • uv run ruff check astrbot/dashboard/routes/error_analysis.py tests/test_error_analysis_route.py
  • $env:PYTHONPATH='.'; uv run pytest tests/test_error_analysis_route.py
  • $env:PYTHONPATH='.'; uv run pytest tests/test_dashboard.py -k "error_analysis or dashboard_ssl_missing_cert_and_key_falls_back_to_http or test_auth_login"
  • cd dashboard && pnpm run typecheck
