DAOS-17321 ddb: Add checksum dump function to ddb VOS API#18444
DAOS-17321 ddb: Add checksum dump function to ddb VOS API#18444knard38 wants to merge 8 commits into
Conversation
Add dv_dump_csum() to the ddb VOS API to expose checksum metadata stored in a VOS tree. The function fetches checksum info using VOS_OF_FETCH_CSUM and delivers it via a callback. Two akey types are handled: - Single-value: always fetched at DAOS_EPOCH_MAX; callback receives rel=NULL and the dcs_ci_list. - Array: fetched at the caller-specified epoch; callback receives both the daos_recx_ep_list and the dcs_ci_list. Passing dump_cb=NULL is a valid no-op. A new dt_csum_ctx test fixture is introduced with two objects (one with and one without checksums) and three test cases covering error paths, single-value csum retrieval, and array csum retrieval, including callback error propagation. Features: recovery Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
|
Ticket title is 'Checksum management with ddb' |
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18444/1/execution/node/1311/log |
|
|
||
| /* Single-value has one record per epoch; always fetch the latest. */ | ||
| rc = vos_fetch_begin(coh, *oid, DAOS_EPOCH_MAX, dkey, 1, iod, VOS_OF_FETCH_CSUM, NULL, &ioh, | ||
| NULL); |
There was a problem hiding this comment.
We need to allow to fetch record with specified epoch since we support snapshot, right?
There was a problem hiding this comment.
- Thread epoch through dump_csum_sv for snapshot access
There was a problem hiding this comment.
- Thread epoch through dump_csum_sv for snapshot access
Fixed with commit 338c5e6
| rc = vos_fetch_end(ioh, NULL, rc); | ||
| if (rc != 0) | ||
| D_ERROR("vos_fetch_end for csum dump of " DF_UOID " failed: " DF_RC "\n", | ||
| DP_UOID(*oid), DP_RC(rc)); |
There was a problem hiding this comment.
Is it necessary to repeatedly report error if failed during dump_cb()?
There was a problem hiding this comment.
Is it necessary to repeatedly report error if failed during dump_cb()?
- Avoid double error logging on callback failure in csum dump
…/daos-17321/patch-002
The original implementation of dump_csum_sv() hardcoded DAOS_EPOCH_MAX, making it impossible to retrieve the checksum of a single-value record as it existed at a snapshot epoch. Add an epoch parameter to dump_csum_sv() and pass the caller-provided epoch through from dv_dump_csum(), making SV and array akeys consistent. Update the @param epoch doc comment accordingly. Extend the test fixture (dt_csum_ctx) to store DVT_FAKE_SV_COUNT successive SV writes in dct_sv_ics[], mirroring the dct_recx_ics[] pattern. Add two new sub-cases in dump_csum_sv_tests: - fetching at epoch 1 returns the epoch-1 checksum (dct_sv_ics[0]) - fetching at DAOS_EPOCH_MAX returns the latest checksum (dct_sv_ics[1]) Reviewed-at: #18444 (comment) Features: recovery Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
…m dump When dump_csum_sv() or dump_csum_recx() invoke the caller's callback and it returns an error, the same rc was previously logged twice: once as a callback error (D_ERROR) and again as a vos_fetch_end error (D_ERROR), because vos_fetch_end propagates the err argument it receives. Introduce a separate cb_rc variable to track the callback return code. Downgrade the callback failure log from D_ERROR to D_DEBUG — the error is propagated to the caller who is responsible for handling it. Only log vos_fetch_end at D_ERROR when it returns a genuinely new error (rc != cb_rc). Add a guard at the out label to ensure cb_rc is always returned to the caller even if vos_fetch_end silently returns 0. Reviewed-at: #18444 (comment) Features: recovery Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
| if (rc != 0) | ||
| vos_cont_destroy(poh, csum_ctx->dct_cont_uuid); | ||
| out_pool: | ||
| vos_pool_close(poh); |
There was a problem hiding this comment.
[suggestion] We can keep the pool opened, and close it via ddb_test_csum_teardown.
There was a problem hiding this comment.
- Do not open pool multiple times
…/daos-17321/patch-002
ddb_test_csum_setup() and ddb_test_csum_teardown() each opened and closed their own VOS pool handle, causing a redundant vos_pool_open / vos_pool_close round-trip when called from dv_test_csum_setup / dv_test_csum_teardown, which already hold an open handle in tctx->dvt_poh via dv_test_setup(). Replace the internal pool open/close with direct use of tctx->dvt_poh, and document the precondition in comments on both functions and their declarations. The pool handle lifecycle remains the caller's responsibility, keeping the functions self-contained and reusable from any test file that provides an open pool handle. Features: recovery Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
…/daos-17321/patch-002 Features: recovery
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18444/4/execution/node/1352/log |
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18444/4/execution/node/1342/log |
…/daos-17321/patch-002 Features: recovery
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18444/7/execution/node/1419/log |
|
The only CI failure of the last CI build #7 is a known issue:
|
Description
Add
dv_dump_csum()to the ddb VOS API (ddb_vos.h/ddb_vos.c). The function fetches the stored checksum metadata for a given VOS tree path usingVOS_OF_FETCH_CSUMand delivers it to the caller via a callback. PassingNULLas the callback is a valid no-op.The function dispatches on akey type:
dump_csum_sv): always fetches atDAOS_EPOCH_MAX(single values have one record per epoch). The callback receivesrel=NULLand thedcs_ci_list.dump_csum_recx): fetches at the caller-specified epoch. The callback receives both thedaos_recx_ep_list(one entry per stored extent, as reported byVOS_OF_FETCH_CSUM) and thedcs_ci_list.Validation
A new test fixture (
dt_csum_ctx) is introduced inddb_test_driver.c, with a dedicated container, csummer, and two objects:g_oids[0]: data written without checksumg_oids[1]: data written with checksum (CRC64, 4KB chunks)Three new test cases in
ddb_vos_tests.c:dump_csum_error_testsvos_cont_openfailure propagationdump_csum_sv_testsdump_csum_recx_testsThe
csum_cb_return_rchelper (takes return code viacb_args) verifies that non-zero callback return codes are forwarded to the caller. A key non-obvious property:VOS_OF_FETCH_CSUMrecords the full stored extent (re_recx.rx_nr), not the intersection with the fetch IOD; this is documented in the test.Steps for the author:
After all prior steps are complete: