DAOS-18727 pool: Fix reconf error handling #18442
Ticket title is './recovery/pool_list_consolidation.py:PoolListConsolidationTest.test_lost_majority_ps_replicas - rdb-pool are recovered, three out of four ranks should have rdb-pool'
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18442/1/testReport/
When pool_svc_reconf_ult adds a PS replica, the replica creation request
may encounter a network error such as -DER_GRPVER (e.g., if the
destination rank has just started). This patch adds a retry loop for
such errors, to avoid giving up the reconfiguration.
In addition, add flag CRT_RPC_FLAG_CO_FAILOUT to RSVC_START and
RSVC_STOP CoRPCs, because by default a CoRPC executes the local handler
even upon a group version mismatch, which seems unnecessary and has
caused confusion during past debugging activities.
Test-tag: pr pool_list_consolidation
Signed-off-by: Li Wei <liwei@hpe.com>
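For readers unfamiliar with the pattern, the retry behavior described in the commit message can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this patch: every `sk_`/`SK_` name is hypothetical, the error values are placeholders rather than the real DAOS codes, and the bounded attempt count is an assumption (the commit message does not say whether the actual loop is bounded).

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder error codes for this sketch; the real values come from DAOS headers. */
enum { SK_OK = 0, SK_DER_GRPVER = -1, SK_DER_NOMEM = -2 };

/* Hypothetical destination rank: rejects the first few requests with a
 * group version mismatch, as a rank that has just started might. */
struct sk_rank {
	int grpver_failures_left; /* requests that will still fail with SK_DER_GRPVER */
	int requests_seen;        /* total requests received so far */
};

/* Hypothetical stand-in for the replica creation request. */
static int sk_create_replica(struct sk_rank *rank)
{
	rank->requests_seen++;
	if (rank->grpver_failures_left > 0) {
		rank->grpver_failures_left--;
		return SK_DER_GRPVER;
	}
	return SK_OK;
}

/* Assumption: only a group version mismatch is treated as retryable here. */
static bool sk_is_retryable(int rc)
{
	return rc == SK_DER_GRPVER;
}

/* Retry the request while it fails with a retryable error, up to max_attempts. */
static int sk_add_replica_with_retry(struct sk_rank *rank, int max_attempts)
{
	int rc = SK_OK;
	int attempt;

	assert(max_attempts > 0);
	for (attempt = 0; attempt < max_attempts; attempt++) {
		rc = sk_create_replica(rank);
		if (!sk_is_retryable(rc))
			break; /* success or a permanent error: stop retrying */
		/* A real service would yield or sleep here before the next attempt. */
	}
	return rc;
}
```

With a rank that rejects the first two requests, `sk_add_replica_with_retry(&rank, 5)` succeeds on the third request instead of abandoning the reconfiguration after the first failure.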
a42dcc1 to 55a1ccb
Triggered 10 pool_list_consolidation repeats with #18457: pass.
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18442/2/testReport/
When pool_svc_reconf_ult adds a PS replica, the replica creation request
may encounter a network error such as -DER_GRPVER (e.g., if the
destination rank has just started). This patch adds a retry loop for
such errors, to avoid giving up the reconfiguration.
In addition, add flag CRT_RPC_FLAG_CO_FAILOUT to RSVC_START and
RSVC_STOP CoRPCs, because by default a CoRPC executes the local handler
even upon a group version mismatch, which seems unnecessary and has
caused confusion during past debugging activities.
Test-tag: pr pool_list_consolidation
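To make the CoRPC point concrete, here is a toy model of the behavior as the description above states it: by default the local handler runs even on a group version mismatch, while a fail-out flag skips it. This is only a sketch based on that description, not the CaRT implementation. All `sk_`/`SK_` names are hypothetical, the flag bit and error value are placeholders, and returning the mismatch error in the default case is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_FLAG_CO_FAILOUT (1U << 0) /* hypothetical bit, not the real CaRT flag value */
#define SK_DER_GRPVER      (-1)      /* placeholder, not the real DAOS error value */

/* Hypothetical node that receives a collective RPC. */
struct sk_node {
	unsigned int group_version; /* group version known to this node */
	int          handler_runs;  /* how many times the local handler executed */
};

/* Toy delivery of a collective RPC to the local node. */
static int sk_corpc_deliver(struct sk_node *node, unsigned int rpc_version,
			    unsigned int flags)
{
	bool mismatch = (rpc_version != node->group_version);

	/* With the fail-out flag, a mismatch returns before the local handler. */
	if (mismatch && (flags & SK_FLAG_CO_FAILOUT))
		return SK_DER_GRPVER;

	/* Default path: the local handler runs regardless of the mismatch. */
	node->handler_runs++;

	/* Assumption: the mismatch is still reported to the caller. */
	return mismatch ? SK_DER_GRPVER : 0;
}
```

In this model, a mismatched request without the flag still bumps `handler_runs`, which is the kind of side effect that can mislead debugging; with the flag set, the same request leaves the node untouched.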
Steps for the author:
After all prior steps are complete: