DAOS-18727 pool: Fix reconf error handling#18442

Open
liw wants to merge 1 commit into master from liw/rsvc-reconf-grpver

@liw
Contributor

@liw liw commented Jun 5, 2026

When pool_svc_reconf_ult adds a PS replica, the replica creation request
may fail with a network error such as -DER_GRPVER (e.g., if the
destination rank has just started). This patch adds a retry loop for
such errors, so that the reconfiguration is not abandoned.
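
The retry idea described above can be sketched in isolation as follows.
This is a minimal illustration, not the code in this patch: the names
SKETCH_DER_GRPVER, create_replica_with_retry, and fake_create are
hypothetical, the error value is a placeholder rather than the real DAOS
constant, and the attempt bound is an assumption (the actual loop may
instead run until the reconfiguration is canceled).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder for the DAOS error code; the real value comes from DAOS headers. */
#define SKETCH_DER_GRPVER 1

/* Hypothetical callback that sends one replica creation request. */
typedef int (*create_replica_fn)(void *arg);

/* Only the group version mismatch is treated as transient in this sketch. */
static bool
is_retryable(int rc)
{
	return rc == -SKETCH_DER_GRPVER;
}

/*
 * Call create() until it returns something other than a retryable error,
 * or until max_attempts calls have been made. Returns the last rc and
 * stores the number of calls made in *attempts_out (if non-NULL).
 */
int
create_replica_with_retry(create_replica_fn create, void *arg, int max_attempts,
			  int *attempts_out)
{
	int rc = 0;
	int attempt;

	assert(max_attempts > 0);

	for (attempt = 1; attempt <= max_attempts; attempt++) {
		rc = create(arg);
		if (!is_retryable(rc))
			break;
		/* A real service would sleep or yield here before retrying. */
	}

	if (attempts_out != NULL)
		*attempts_out = attempt > max_attempts ? max_attempts : attempt;
	return rc;
}

/* Fake request: fails with a group version mismatch *arg more times, then succeeds. */
int
fake_create(void *arg)
{
	int *remaining = arg;

	if (*remaining > 0) {
		(*remaining)--;
		return -SKETCH_DER_GRPVER;
	}
	return 0;
}
```

With fake_create set to fail twice, the helper succeeds on the third
call; any error other than the retryable one is returned immediately, so
genuine failures still surface to the caller.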

In addition, add the CRT_RPC_FLAG_CO_FAILOUT flag to the RSVC_START and
RSVC_STOP CoRPCs. By default, a CoRPC executes the local handler even
upon a group version mismatch, which seems unnecessary and has caused
confusion during past debugging.
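
The behavioral difference described above can be modeled with a toy
dispatcher. This is only a sketch of the semantics as stated in this
description, not CaRT code: SKETCH_FLAG_CO_FAILOUT, SKETCH_DER_GRPVER,
and sketch_corpc_dispatch are hypothetical names with placeholder
values, and the assumption is that the flag makes the CoRPC fail out
before the local handler runs.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder values; the real constants are defined by CaRT and DAOS. */
#define SKETCH_FLAG_CO_FAILOUT (1u << 0)
#define SKETCH_DER_GRPVER      1

struct sketch_corpc_result {
	int  rc;                /* 0 or -SKETCH_DER_GRPVER */
	bool local_handler_ran; /* whether the local handler was executed */
};

/*
 * Toy model of a collective RPC arriving at one rank. A group version
 * mismatch always yields -SKETCH_DER_GRPVER; the flag only decides
 * whether the local handler still runs in that case.
 */
struct sketch_corpc_result
sketch_corpc_dispatch(unsigned int flags, unsigned int sender_ver, unsigned int local_ver)
{
	struct sketch_corpc_result res = {0, false};

	/* This sketch understands a single flag. */
	assert((flags & ~SKETCH_FLAG_CO_FAILOUT) == 0);

	if (sender_ver != local_ver) {
		res.rc = -SKETCH_DER_GRPVER;
		if (flags & SKETCH_FLAG_CO_FAILOUT)
			return res; /* fail out: skip the local handler */
	}

	res.local_handler_ran = true;
	return res;
}
```

In this model, a mismatch without the flag returns an error yet still
reports that the local handler ran, which is the combination the
description calls confusing; with the flag set, the error is returned
and the handler is skipped.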

Test-tag: pr pool_list_consolidation

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

github-actions Bot commented Jun 5, 2026

Ticket title is './recovery/pool_list_consolidation.py:PoolListConsolidationTest.test_lost_majority_ps_replicas - rdb-pool are recovered, three out of four ranks should have rdb-pool'
Status is 'In Progress'
Labels: 'ci_master_daily,daily_test,request_for_2.8'
https://daosio.atlassian.net/browse/DAOS-18727

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18442/1/testReport/

Signed-off-by: Li Wei <liwei@hpe.com>
@liw liw force-pushed the liw/rsvc-reconf-grpver branch from a42dcc1 to 55a1ccb Compare June 8, 2026 01:48
@liw liw marked this pull request as ready for review June 8, 2026 01:50
@liw liw requested review from a team as code owners June 8, 2026 01:50
@liw liw requested review from kccain and liuxuezhao June 8, 2026 01:50
@liw
Contributor Author

liw commented Jun 8, 2026

Triggered 10 pool_list_consolidation repeats with #18457: pass.

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18442/2/testReport/
