feat: add error log analysis feature #7768
Conversation
Force-pushed from e8fc081 to 61e5be1.
Hey - I've found 4 issues, and left some high level feedback:
- The new ErrorAnalysisRoute performs synchronous file I/O (read_text, write_text, read_bytes) inside async methods, which can block the event loop under load; consider offloading these to an executor or using aiofiles-style async I/O wrappers.
- ErrorAnalysisPage.vue inlines a lot of Chinese UI text directly in the template instead of going through the existing i18n system, which makes localization inconsistent compared to the navigation labels added in the locale JSON files.
## Individual Comments
### Comment 1
<location path="astrbot/dashboard/routes/error_analysis.py" line_range="608-617" />
<code_context>
+ payload = json.loads(self.settings_file.read_text(encoding="utf-8"))
</code_context>
<issue_to_address>
**issue (performance):** Synchronous filesystem access in async context can block the event loop.
This method (and others like `_load_settings`, `_save_settings`, `_load_records`, `_save_records`, `_read_file_excerpt`) performs blocking filesystem calls (`exists`, `read_text`, `read_bytes`, `write_text`, `open`) directly in async handlers/background tasks, which can stall the event loop under load.
Consider offloading these calls to an executor (e.g. `asyncio.to_thread`) or using an async filesystem library so log ingestion and SSE streaming stay responsive as the store grows.
</issue_to_address>
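A minimal sketch of the suggested executor offload, using the standard-library `asyncio.to_thread` (the `settings_file` attribute name mirrors the snippet above; the helper names here are illustrative, not the PR's actual API):

```python
import asyncio
import json
from pathlib import Path


async def load_settings(settings_file: Path) -> dict:
    """Read and parse the settings file without blocking the event loop.

    The blocking read_text() call runs in a worker thread via
    asyncio.to_thread, so other coroutines (log ingestion, SSE
    streaming) keep running while the disk I/O is in flight.
    """
    text = await asyncio.to_thread(settings_file.read_text, encoding="utf-8")
    return json.loads(text)


async def save_settings(settings_file: Path, payload: dict) -> None:
    """Serialize in the event loop (cheap), write in a worker thread."""
    data = json.dumps(payload, ensure_ascii=False, indent=2)
    await asyncio.to_thread(settings_file.write_text, data, encoding="utf-8")
```

The same pattern applies to `exists`, `read_bytes`, and `open`; an aiofiles-style library would achieve the same effect with a slightly different API.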
### Comment 2
<location path="dashboard/src/views/ErrorAnalysisPage.vue" line_range="462-263" />
<code_context>
+ qaMessages.value.push({ role: 'assistant', content: streamingAnswer.value })
+ question.value = ''
+ await loadRecords()
+ } else if (payload.type === 'error') {
+ showMessage(payload.message || '追问失败', 'error')
+ }
+ }
+ }
</code_context>
<issue_to_address>
**issue (bug_risk):** Error responses from the ask/stream endpoint still result in an empty QA pair being appended.
Because the `error` branch only shows a message and doesn’t change control flow, the `done` handler still appends the QA pair using an empty `streamingAnswer` and calls `loadRecords()`. Consider tracking an `hadError` flag (or early-returning from the loop) so that, once an error event is received, you skip appending the QA messages and avoid further processing for that request.
</issue_to_address>
### Comment 3
<location path="astrbot/dashboard/routes/error_analysis.py" line_range="172-181" />
<code_context>
+ self.log_broker.unregister(queue)
+
+ async def _handle_log(self, raw_log: dict[str, Any]):
+ settings = await self._load_settings()
+ if str(raw_log.get("level") or "").upper() not in settings["levels"]:
+ return
+ if not settings["passive_record"] and not settings["auto_analyze"]:
</code_context>
<issue_to_address>
**suggestion (performance):** Log watcher applies settings on every message, which may be unnecessary overhead.
In `_handle_log`, every message triggers `await self._load_settings()`, which may hit the filesystem (especially when settings are corrupted/rewritten). At high log volumes this becomes unnecessary overhead.
If settings don’t need to update per-message, consider caching them in memory and only refreshing on `update_settings` (or via a periodic refresh) to keep the watcher lightweight while still allowing configuration changes.
</issue_to_address>
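One hedged way to implement that caching (the class and method names here are hypothetical, not the PR's actual API; a TTL is added as a safety net on top of explicit invalidation):

```python
import asyncio
import time
from typing import Any


class CachedSettings:
    """Keep settings in memory; reload only on explicit invalidation or TTL expiry."""

    def __init__(self, loader, ttl_seconds: float = 30.0):
        self._loader = loader          # async callable that reads settings from disk
        self._ttl = ttl_seconds
        self._cached: dict[str, Any] | None = None
        self._loaded_at = 0.0
        self._lock = asyncio.Lock()

    async def get(self) -> dict[str, Any]:
        # Fast path: serve from memory while the cache is fresh.
        if self._cached is not None and time.monotonic() - self._loaded_at < self._ttl:
            return self._cached
        async with self._lock:
            # Re-check after acquiring the lock to avoid duplicate reloads
            # when many log messages arrive at once.
            if self._cached is None or time.monotonic() - self._loaded_at >= self._ttl:
                self._cached = await self._loader()
                self._loaded_at = time.monotonic()
            return self._cached

    def invalidate(self) -> None:
        """Call from update_settings so the next read hits the disk."""
        self._cached = None
```

With this, `_handle_log` calls `await cache.get()` per message but only pays for filesystem I/O when the cache is stale.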
### Comment 4
<location path="astrbot/dashboard/routes/error_analysis.py" line_range="134" />
<code_context>
+ return None, text
+
+
+class ErrorAnalysisRoute(Route):
+ def __init__(self, context: RouteContext, core_lifecycle: AstrBotCoreLifecycle):
+ super().__init__(context)
</code_context>
<issue_to_address>
**issue (complexity):** Consider breaking ErrorAnalysisRoute into smaller storage, SSE/provider helper, and analysis helper components to avoid a god‑class with long, mixed‑responsibility methods and duplicated logic.
- `ErrorAnalysisRoute` has grown into a god‑class with several long methods (`ask_stream`, `_analyze_record`, `_create_record_from_log`, etc.). This makes it harder to reason about and test. You can reduce complexity without changing behavior by extracting a few small, focused collaborators and helpers.
### 1. Extract storage concerns into small classes
Right now `_load_settings`, `_save_settings`, `_load_records`, `_save_records`, `_update_record`, `_append_qa_message` all live on the route and manage locks + filesystem. You can pull them into two small classes and inject them, moving IO and locking out of the route.
```python
# storage.py
class ErrorAnalysisSettingsStore:
    def __init__(self, base_dir: Path, lock: asyncio.Lock):
        self.base_dir = base_dir
        self.file = base_dir / "settings.json"
        self.lock = lock

    async def load(self) -> dict[str, Any]:
        async with self.lock:
            if not self.file.exists():
                await self.save(DEFAULT_SETTINGS.copy())
                return DEFAULT_SETTINGS.copy()
            try:
                payload = json.loads(self.file.read_text(encoding="utf-8"))
                if not isinstance(payload, dict):
                    raise ValueError
            except Exception:  # noqa: BLE001
                payload = DEFAULT_SETTINGS.copy()
                await self.save(payload)
            return _sanitize_settings(payload)

    async def save(self, payload: dict[str, Any]) -> None:
        self.base_dir.mkdir(parents=True, exist_ok=True)
        self.file.write_text(
            json.dumps(_sanitize_settings(payload), ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
```
```python
class ErrorRecordStore:
    def __init__(self, base_dir: Path, lock: asyncio.Lock):
        self.base_dir = base_dir
        self.file = base_dir / "records.jsonl"
        self.lock = lock

    async def load_all(self) -> list[dict[str, Any]]:
        async with self.lock:
            # body of _load_records unchanged
            ...

    async def save_all(self, records: list[dict[str, Any]]) -> None:
        async with self.lock:
            # body of _save_records unchanged
            ...

    async def get(self, record_id: str) -> dict[str, Any] | None:
        records = await self.load_all()
        return next((r for r in records if r.get("id") == record_id), None)

    async def update(self, record_id: str, updates: dict[str, Any]) -> dict[str, Any] | None:
        records = await self.load_all()
        # body of _update_record, minus settings/max_records logic
        ...
```
Then in `ErrorAnalysisRoute.__init__`:
```python
self.settings_store = ErrorAnalysisSettingsStore(self.base_dir, self.settings_lock)
self.record_store = ErrorRecordStore(self.base_dir, self.records_lock)
```
And callers change from:
```python
settings = await self._load_settings()
record = await self._get_record(record_id)
await self._save_records(records)
```
to:
```python
settings = await self.settings_store.load()
record = await self.record_store.get(record_id)
await self.record_store.save_all(records)
```
This removes most IO + locking logic from the route and keeps functionality identical.
### 2. Extract SSE response + provider resolution helpers
`events`, `ask_stream`, and `_error_stream_response` all re‑create SSE boilerplate; provider resolution logic is duplicated in `_analyze_record` and `ask_stream`. You can centralize both to shorten methods.
```python
async def _make_sse_response(self, stream) -> QuartResponse:
    response = await make_response(
        stream(),
        {
            "Content-Type": "text/event-stream",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Transfer-Encoding": "chunked",
        },
    )
    response.timeout = None  # type: ignore[attr-defined]
    return response

async def _resolve_provider(
    self, provider_id: str | None, record: dict[str, Any] | None = None
) -> Provider | None:
    pid = provider_id or ""
    if not pid and record:
        pid = str(record.get("provider_id") or "")
    if not pid:
        settings = await self.settings_store.load()
        pid = str(settings.get("provider_id") or "")
    return self._get_provider(pid)
```
Then `events` becomes:
```python
async def events(self) -> QuartResponse:
    async def stream():
        queue = asyncio.Queue(maxsize=200)
        self.event_queues.append(queue)
        try:
            yield "data: " + json.dumps({"type": "connected"}) + "\n\n"
            while True:
                try:
                    event = await asyncio.wait_for(queue.get(), timeout=20)
                    yield "data: " + json.dumps(event, ensure_ascii=False) + "\n\n"
                except asyncio.TimeoutError:
                    yield SSE_HEARTBEAT
        finally:
            if queue in self.event_queues:
                self.event_queues.remove(queue)

    return await self._make_sse_response(stream)
```
And `ask_stream` reduces to:
```python
provider = await self._resolve_provider(provider_id, record)
if not provider:
    return await self._error_stream_response("Provider ... not available")
contexts = self._build_qa_context(record, question)

async def stream():
    # existing streaming loop unchanged
    ...

return await self._make_sse_response(stream)
```
This reduces repetition and makes the control flow in `ask_stream` and `events` easier to follow.
### 3. Split `_analyze_record` into smaller helpers
`_analyze_record` currently does: provider lookup, record lookup, settings load, context build, status update, model call, JSON parsing, validation, mapping, and final update. You can keep the behavior but move pieces into private helpers so the main method reads as orchestration only.
```python
async def _prepare_analysis(self, record_id: str, provider_id: str):
    provider = self._get_provider(provider_id)
    if not provider:
        return None, None, None, f"Provider {provider_id} not available"
    record = await self.record_store.get(record_id)
    if not record:
        return None, None, None, f"Record {record_id} not found"
    settings = await self.settings_store.load()
    related_files = self._build_related_files(record, settings)
    prompt = self._build_diagnosis_prompt(record, settings, related_files)
    return provider, record, {"related_files": related_files, "prompt": prompt}, None
```
```python
async def _analyze_record(self, record_id: str, provider_id: str) -> tuple[bool, str]:
    provider, record, prep, error = await self._prepare_analysis(record_id, provider_id)
    if error:
        return False, error
    related_files = prep["related_files"]
    prompt = prep["prompt"]
    updated = await self._update_record(
        record_id,
        {
            "status": "analyzing",
            "updated_at": time.time(),
            "provider_id": provider_id,
            "related_files": related_files,
        },
    )
    if updated:
        await self._emit_event("record_updated", updated)
    try:
        resp = await provider.text_chat(
            contexts=[
                {"role": "system", "content": SYSTEM_PROMPT_DIAGNOSIS},
                {"role": "user", "content": prompt},
            ],
            model=provider.get_model() or None,
        )
        return await self._handle_analysis_response(record_id, resp.completion_text or "")
    except Exception as exc:  # noqa: BLE001
        ...
```
```python
async def _handle_analysis_response(self, record_id: str, raw_text: str) -> tuple[bool, str]:
    parsed, parsed_text = parse_json_from_model_output(raw_text)
    if parsed is None:
        updated = await self._update_record(
            record_id,
            {
                "status": "failed",
                "updated_at": time.time(),
                "summary": "Model returned non-JSON output",
                "raw_model_output": redact_sensitive_text(parsed_text),
                "error_message": "Model output is not valid JSON",
            },
        )
        if updated:
            await self._emit_event("record_updated", updated)
        return False, "Model output is not valid JSON"
    # existing mapping from parsed -> analysis + update
    ...
```
This keeps `_analyze_record` short and focused on the high‑level flow, while the details (prep, error handling, response mapping) live in dedicated helpers.
These changes are incremental, keep all behavior, and directly address the complexity concerns (god‑class, long methods, mixed IO/business logic, duplicated SSE/provider logic) without requiring a full redesign.
</issue_to_address>
Code Review
This pull request introduces an "Error Diagnosis" feature that automatically captures ERROR and CRITICAL logs, provides AI-powered analysis of the causes, and allows users to interact with an AI assistant for further troubleshooting. Key changes include a new backend route for log monitoring and analysis, a frontend dashboard page, and enhanced log metadata collection. Feedback focuses on improving the robustness of the log watcher, ensuring atomicity in file operations for settings and records, preventing potential memory issues when reading source files, and implementing concurrency limits for AI analysis tasks.
In this round I first addressed the review points that affect stability and attribution accuracy:
There are two kinds of suggestions that I have deliberately not expanded into refactors this round:
The reason is that both are structural changes that do not affect the current feature loop. I prioritized keeping this round's scope deliverable rather than turning it into a large refactor; if the feature keeps expanding, it will make more sense to split them out separately.
A small follow-up update (latest commit:
Passed locally:
Background
AstrBot and its plugin ecosystem update very frequently. When novice users hit an error, they usually do not know where to look first or how to tell where the problem belongs, and they struggle to distill the useful information to hand to someone else or to an AI for analysis. Many problems that could be pinpointed with a single glance at the logs, a traceback, or plugin info end up requiring repeated back-and-forth because there is no unified diagnosis entry point, further raising the troubleshooting cost for beginners.
So this PR adds an "error log analysis" feature. The core goal is not to build an ordinary log viewer, but to first provide a beginner-oriented fault-diagnosis entry in the WebUI that ties together the logs, the error context, and the downstream analysis chain, helping users judge problems faster:
Features implemented in this PR
1. New error-diagnosis entry point
- Directly accessible at `/error-analysis`
2. Backend log-diagnosis pipeline
3. Compatibility with static bundles and production access
- Visiting `/error-analysis` correctly falls back to the frontend page
- The new page is also visible when serving from `data/dist`
Scope relative to the spec document
This commit corresponds to the first-stage goal in the spec document: building the basic skeleton that is accessible, able to host the feature, and extensible:
The fuller capabilities mentioned in the document, such as automatic severity classification, categorization by plugin/core/Provider, automatic LLM analysis with follow-up questions, context assembly from related local files, and finer-grained issue cards, are directions for later iterations. This version mainly lays the foundation to reduce the complexity of implementing them later.
Changes
Testing
Backend:
- `uv sync`
- `uv run ruff format .`
- `uv run ruff check .`
- `uv run pytest tests/test_error_analysis_route.py tests/test_dashboard.py -k "error_analysis or dashboard_ssl_missing_cert_and_key_falls_back_to_http or test_auth_login"`
Frontend:
- `cd dashboard && pnpm install`
- `cd dashboard && pnpm run typecheck`
- `cd dashboard && pnpm run build`
Local integration test:
- Open `http://localhost:6185`
- `http://localhost:6185/error-analysis` opens the new page correctly
Test results
- `ruff format` passes
- `ruff check` passes
- The `/error-analysis` route returns the frontend page instead of a 404
Notes
The complete error-analysis system described in the spec document will eventually include automatic analysis, Q&A follow-ups, context enrichment from local source and plugin files, source-code context, issue cards, severity classification, and more. This PR first delivers the most critical MVP infrastructure, leaving room for iterating on those capabilities later.