add detect for a5 board #737

Open
Likai-19 wants to merge 1 commit into hw-native-sys:main from Likai-19:borad_detect_a5

Likai-19 commented Jun 2, 2026

Check whether PTOAS_BOARD_IS_A3=1 is enabled in the A5 board test.

gemini-code-assist (bot) left a comment

Code Review

This pull request updates the remote NPU validation script to detect A5 targets from SOC_VERSION and SIM_SOC_VERSION, skipping the A3 simulator directory fallback. The review feedback suggests optimizing the target detection logic to avoid redundant subshell executions by using Bash's regular expression matching.

Comment on lines +173 to +174
elif [[ "$(printf '%s' "${SOC_VERSION} ${SIM_SOC_VERSION}" | tr '[:upper:]' '[:lower:]')" == *950* \
|| "$(printf '%s' "${SOC_VERSION} ${SIM_SOC_VERSION}" | tr '[:upper:]' '[:lower:]')" == *a5* ]]; then
medium

The command substitution printf '%s' ... | tr '[:upper:]' '[:lower:]' is evaluated twice, once for each pattern. Bash's regular expression operator =~ can perform both checks in a single match, which avoids spawning the duplicate subshell and tr process.

Suggested change
- elif [[ "$(printf '%s' "${SOC_VERSION} ${SIM_SOC_VERSION}" | tr '[:upper:]' '[:lower:]')" == *950* \
-     || "$(printf '%s' "${SOC_VERSION} ${SIM_SOC_VERSION}" | tr '[:upper:]' '[:lower:]')" == *a5* ]]; then
+ elif [[ "$(printf '%s' "${SOC_VERSION} ${SIM_SOC_VERSION}" | tr '[:upper:]' '[:lower:]')" =~ 950|a5 ]]; then
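
The single-match idea can also be tried in isolation. The sketch below is a minimal illustration, not the PR's code: is_a5_target is a hypothetical helper name, and it assumes the two SOC strings are passed as arguments instead of being read from the environment. It lowercases the combined string once, stores it, and then applies one =~ match.

```shell
#!/usr/bin/env bash

# Hypothetical helper (name and arguments are illustrative, not from the PR).
# Returns 0 when either SOC string looks like an A5 target, 1 otherwise.
is_a5_target() {
  local soc="${1:-}" sim_soc="${2:-}"
  local combined

  # Lowercase the combined string exactly once and keep the result.
  combined="$(printf '%s' "${soc} ${sim_soc}" | tr '[:upper:]' '[:lower:]')"

  # One regular expression match covers both the "950" and "a5" cases.
  [[ "${combined}" =~ 950|a5 ]]
}
```

A call such as is_a5_target "${SOC_VERSION:-}" "${SIM_SOC_VERSION:-}" could then stand in for the inline condition. Note that the match is a plain substring test, so any SOC string that happens to contain "a5" or "950" would also be accepted.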

Likai-19 commented Jun 2, 2026

/run a5 rope_kv_cache,post_rmsnorm --pto-level=level3

@reedhecre

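Received /run a5 rope_kv_cache post_rmsnorm --pto-level=level3; the A5 board tester will handle this request.

The page refreshes automatically, so you can check the current stage, queue status, and latest results directly.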

@reedhecre

A5 board test failed

  • Trigger type: manual
  • Source commit: df7b25bf7501
  • Result summary: OK 0 / FAIL 2 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260602_185406_manual_pr737.log
  • Manual command: /run a5 rope_kv_cache post_rmsnorm --pto-level=level3
  • Triggered by: Likai-19
  • Selected test cases: rope_kv_cache,post_rmsnorm
  • PTOAS arguments: --pto-level=level3
  • Trigger comment: add detect for a5 board #737 (comment)
  • Failed stage: board-validation / exit=1

Failed test cases

  • rope_kv_cache (run, exit=1)
  • post_rmsnorm (run, exit=1)

@reedhecre

A5 board test failure details: PR #737

rope_kv_cache

stage=run info=exit=1

[ERROR] aclrtSetDevice(deviceId) failed: 507033 (/tmp/ptoas-board-monitor-a5/runs/20260602_185406_manual_pr737/npu_validation/Qwen3DecodeA5/rope_kv_cache/main.cpp:110)
[ERROR] RecentErrMsg: [PID: 486778] 2026-06-02-18:57:43.048.507 Invalid_Argument(EE1001): The argument is invalid.Reason: rtGetDevMsg execution failed, the context is a null pointer.
        Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
        TraceBack (most recent call last):
        TsdOpen failed. devId=0, tdt error=1[FUNC:PrintfTsdError][FILE:runtime.cc][LINE:2618]
        Check param failed, dev can not be NULL![FUNC:DeviceRetain][FILE:runtime.cc][LINE:3536]
        Check param failed, dev can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:3153]
        Check param failed, ctx can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:3184]
        Check param failed, context can not be null.[FUNC:SetDevice][FILE:api_impl.cc][LINE:3321]
        rtSetDevice execution failed, reason=device retain error[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
        open device 0 failed, runtime result = 507033.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
        ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:6120]
        The argument is invalid.Reason: rtGetDevMsg execution failed, the context is a null pointer.
[2026-06-02 18:57:43] ERROR: testcase failed (exit 1): rope_kv_cache

post_rmsnorm

stage=run info=exit=1

[ERROR] aclrtSetDevice(deviceId) failed: 507033 (/tmp/ptoas-board-monitor-a5/runs/20260602_185406_manual_pr737/npu_validation/Qwen3DecodeA5/post_rmsnorm/main.cpp:80)
[ERROR] RecentErrMsg: [PID: 487440] 2026-06-02-18:57:46.585.903 Invalid_Argument(EE1001): The argument is invalid.Reason: rtGetDevMsg execution failed, the context is a null pointer.
        Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
        TraceBack (most recent call last):
        TsdOpen failed. devId=0, tdt error=1[FUNC:PrintfTsdError][FILE:runtime.cc][LINE:2618]
        Check param failed, dev can not be NULL![FUNC:DeviceRetain][FILE:runtime.cc][LINE:3536]
        Check param failed, dev can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:3153]
        Check param failed, ctx can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:3184]
        Check param failed, context can not be null.[FUNC:SetDevice][FILE:api_impl.cc][LINE:3321]
        rtSetDevice execution failed, reason=device retain error[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
        open device 0 failed, runtime result = 507033.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
        ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:6120]
        The argument is invalid.Reason: rtGetDevMsg execution failed, the context is a null pointer.
[2026-06-02 18:57:47] ERROR: testcase failed (exit 1): post_rmsnorm
[2026-06-02 18:57:47] === SUMMARY ===
[2026-06-02 18:57:47] OK=0 FAIL=2 SKIP=0
[2026-06-02 18:57:47] RESULTS_TSV=/tmp/ptoas-board-monitor-a5/runs/20260602_185406_manual_pr737/remote_npu_validation_results.tsv
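
For triage, the names of the failed test cases can be pulled out of a log in the format shown above. The following is a minimal sketch that assumes the "ERROR: testcase failed (exit N): name" line format stays stable; failed_cases is a hypothetical helper and is not part of the board monitor.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads a board-validation log on stdin and prints
# one failed test case name per line.
# Assumes lines shaped like:
#   [2026-06-02 18:57:43] ERROR: testcase failed (exit 1): rope_kv_cache
failed_cases() {
  # \[[^]]*\] matches the bracketed timestamp; \(.*\) captures the name.
  sed -n 's/^\[[^]]*\] ERROR: testcase failed (exit [0-9][0-9]*): \(.*\)$/\1/p'
}
```

Typical use would be failed_cases < "$LOG_FILE", where LOG_FILE points at a saved copy of the log.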

Likai-19 commented Jun 2, 2026

/run a5 rope_kv_cache,post_rmsnorm

@reedhecre

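Received /run a5 rope_kv_cache post_rmsnorm; the A5 board tester will handle this request.

The page refreshes automatically, so you can check the current stage, queue status, and latest results directly.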

@reedhecre

A5 board test failed

  • Trigger type: manual
  • Source commit: df7b25bf7501
  • Result summary: OK 0 / FAIL 2 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260602_191306_manual_pr737.log
  • Manual command: /run a5 rope_kv_cache post_rmsnorm
  • Triggered by: Likai-19
  • Selected test cases: rope_kv_cache,post_rmsnorm
  • Trigger comment: add detect for a5 board #737 (comment)
  • Failed stage: board-validation / exit=1

Failed test cases

  • rope_kv_cache (run, exit=1)
  • post_rmsnorm (run, exit=1)

@reedhecre

A5 board test failure details: PR #737

rope_kv_cache

stage=run info=exit=1

[ERROR] aclrtSetDevice(deviceId) failed: 507033 (/tmp/ptoas-board-monitor-a5/runs/20260602_191306_manual_pr737/npu_validation/Qwen3DecodeA5/rope_kv_cache/main.cpp:110)
[ERROR] RecentErrMsg: [PID: 518744] 2026-06-02-19:16:39.975.682 Invalid_Argument(EE1001): The argument is invalid.Reason: rtGetDevMsg execution failed, the context is a null pointer.
        Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
        TraceBack (most recent call last):
        TsdOpen failed. devId=0, tdt error=1[FUNC:PrintfTsdError][FILE:runtime.cc][LINE:2618]
        Check param failed, dev can not be NULL![FUNC:DeviceRetain][FILE:runtime.cc][LINE:3536]
        Check param failed, dev can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:3153]
        Check param failed, ctx can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:3184]
        Check param failed, context can not be null.[FUNC:SetDevice][FILE:api_impl.cc][LINE:3321]
        rtSetDevice execution failed, reason=device retain error[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
        open device 0 failed, runtime result = 507033.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
        ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:6120]
        The argument is invalid.Reason: rtGetDevMsg execution failed, the context is a null pointer.
[2026-06-02 19:16:40] ERROR: testcase failed (exit 1): rope_kv_cache

post_rmsnorm

stage=run info=exit=1

[ERROR] aclrtSetDevice(deviceId) failed: 507033 (/tmp/ptoas-board-monitor-a5/runs/20260602_191306_manual_pr737/npu_validation/Qwen3DecodeA5/post_rmsnorm/main.cpp:80)
[ERROR] RecentErrMsg: [PID: 519285] 2026-06-02-19:16:43.498.232 Invalid_Argument(EE1001): The argument is invalid.Reason: rtGetDevMsg execution failed, the context is a null pointer.
        Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
        TraceBack (most recent call last):
        TsdOpen failed. devId=0, tdt error=1[FUNC:PrintfTsdError][FILE:runtime.cc][LINE:2618]
        Check param failed, dev can not be NULL![FUNC:DeviceRetain][FILE:runtime.cc][LINE:3536]
        Check param failed, dev can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:3153]
        Check param failed, ctx can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:3184]
        Check param failed, context can not be null.[FUNC:SetDevice][FILE:api_impl.cc][LINE:3321]
        rtSetDevice execution failed, reason=device retain error[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
        open device 0 failed, runtime result = 507033.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
        ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:6120]
        The argument is invalid.Reason: rtGetDevMsg execution failed, the context is a null pointer.
[2026-06-02 19:16:43] ERROR: testcase failed (exit 1): post_rmsnorm
[2026-06-02 19:16:43] === SUMMARY ===
[2026-06-02 19:16:43] OK=0 FAIL=2 SKIP=0
[2026-06-02 19:16:43] RESULTS_TSV=/tmp/ptoas-board-monitor-a5/runs/20260602_191306_manual_pr737/remote_npu_validation_results.tsv

Likai-19 commented Jun 2, 2026

/run a3 rope_kv_cache,post_rmsnorm

@reedhecre

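Received /run a3 rope_kv_cache post_rmsnorm; the A3 board tester will handle this request.

The page refreshes automatically, so you can check the current stage, queue status, and latest results directly.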

@reedhecre

A3 board test passed

  • Trigger type: manual
  • Source commit: df7b25bf7501
  • Result summary: OK 2 / FAIL 0 / SKIP 0
  • Log: /home/zhongxuan/ptoas-board-monitor/runtime/logs/20260602_192505_manual_pr737.log
  • Result TSV: /home/zhongxuan/ptoas-board-monitor/runtime/logs/20260602_192505_manual_pr737.tsv
  • Manual command: /run a3 rope_kv_cache post_rmsnorm
  • Triggered by: Likai-19
  • Selected test cases: rope_kv_cache,post_rmsnorm
  • Trigger comment: add detect for a5 board #737 (comment)

reedhecre commented Jun 2, 2026

Codex Review

This comment is updated automatically by the review bot.

  • PR: #737 add detect for a5 board
  • Author: Likai-19
  • Base/Head: main / borad_detect_a5
  • Head SHA: 783821b90df0
  • Trigger: new open PR detected
  • Generated At: 2026-06-02T13:15:11Z
  • Status: completed

Summary

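The A5 branch in PR #737 still relies on SOC_VERSION/SIM_SOC_VERSION rather than on real board information. On the default or misconfigured path, an A5 machine is still misidentified as A3 and switched to the wrong validation branch.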

Findings

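  1. P2 The A5 "board detection" still relies on the target SOC string rather than on real board information (test/npu_validation/scripts/run_remote_npu_validation.sh:173)

The new branch skips the Ascend910B* simulator-dir fallback only when SOC_VERSION/SIM_SOC_VERSION contains 950 or a5, but it does not use the _board_chip value obtained earlier to positively identify A5. The default soc_version for workflow_dispatch is still Ascend910 (see .github/workflows/ci.yml), so on a real A5 board that keeps the default value, the script may still set PTOAS_BOARD_IS_A3 to 1 simply because an Ascend910B* simulator directory is installed locally. test/npu_validation/scripts/generate_testcase.py uses this flag to take the A3-specific golden logic, so the "detect for a5 board" promised by the PR title still fails on this path and switches the A5 board to the wrong validation behavior.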
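
One way to read this finding is as an ordering problem: evidence from the physical board should be consulted before the configured SOC strings, and the simulator-directory heuristic should come last. The sketch below illustrates that ordering under stated assumptions. detect_board_family, its four arguments, and the "a5"/"a3"/"unknown" return values are all hypothetical, and the assumption that the board chip string contains "950" or "a5" on A5 hardware is not confirmed anywhere in this PR.

```shell
#!/usr/bin/env bash

# Hypothetical sketch, not the script's actual logic.
# Arguments: board chip string, SOC_VERSION, SIM_SOC_VERSION, and a flag
# ("1" or "0") saying whether an Ascend910B* simulator directory exists.
detect_board_family() {
  local board_chip="${1:-}" soc="${2:-}" sim_soc="${3:-}" has_910b_sim_dir="${4:-0}"
  local chip_lc soc_lc

  chip_lc="$(printf '%s' "${board_chip}" | tr '[:upper:]' '[:lower:]')"
  soc_lc="$(printf '%s' "${soc} ${sim_soc}" | tr '[:upper:]' '[:lower:]')"

  # 1. Positive identification from the physical board wins.
  #    ASSUMPTION: the chip string contains "950" or "a5" on A5 hardware.
  if [[ "${chip_lc}" =~ 950|a5 ]]; then
    echo "a5"
    return 0
  fi

  # 2. Otherwise fall back to the configured SOC strings.
  if [[ "${soc_lc}" =~ 950|a5 ]]; then
    echo "a5"
    return 0
  fi

  # 3. Only now may the simulator-directory heuristic imply A3.
  if [[ "${has_910b_sim_dir}" == "1" ]]; then
    echo "a3"
    return 0
  fi

  echo "unknown"
}
```

With this ordering, a default soc_version of Ascend910 plus an installed Ascend910B* simulator directory would no longer be enough to classify a board as A3 once the chip string identifies it as A5.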
