Skip to content

DAOS-17321 ddb: Add checksum dump function to ddb VOS API#18444

Open
knard38 wants to merge 8 commits into
masterfrom
ckochhof/dev/master/daos-17321/patch-002
Open

DAOS-17321 ddb: Add checksum dump function to ddb VOS API#18444
knard38 wants to merge 8 commits into
masterfrom
ckochhof/dev/master/daos-17321/patch-002

Conversation

@knard38

@knard38 knard38 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Description

Add dv_dump_csum() to the ddb VOS API (ddb_vos.h/ddb_vos.c). The function fetches the stored checksum metadata for a given VOS tree path using VOS_OF_FETCH_CSUM and delivers it to the caller via a callback. Passing NULL as the callback is a valid no-op.

The function dispatches on akey type:

  • Single-value akeys (dump_csum_sv): always fetches at DAOS_EPOCH_MAX (single values have one record per epoch). The callback receives rel=NULL and the dcs_ci_list.
  • Array akeys (dump_csum_recx): fetches at the caller-specified epoch. The callback receives both the daos_recx_ep_list (one entry per stored extent, as reported by VOS_OF_FETCH_CSUM) and the dcs_ci_list.

Validation

A new test fixture (dt_csum_ctx) is introduced in ddb_test_driver.c, with a dedicated container, csummer, and two objects:

  • g_oids[0]: data written without checksum
  • g_oids[1]: data written with checksum (CRC64, 4KB chunks)

Three new test cases in ddb_vos_tests.c:

Test What it covers
dump_csum_error_tests Invalid pool handle → vos_cont_open failure propagation
dump_csum_sv_tests SV path: no-csum, with-csum, null callback, callback error propagation
dump_csum_recx_tests Array path: no-csum, with-csum, null callback, callback error propagation

The csum_cb_return_rc helper (takes return code via cb_args) verifies that non-zero callback return codes are forwarded to the caller. A key non-obvious property: VOS_OF_FETCH_CSUM records the full stored extent (re_recx.rx_nr), not the intersection with the fetch IOD; this is documented in the test.

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Add dv_dump_csum() to the ddb VOS API to expose checksum metadata
stored in a VOS tree. The function fetches checksum info using
VOS_OF_FETCH_CSUM and delivers it via a callback.

Two akey types are handled:
- Single-value: always fetched at DAOS_EPOCH_MAX; callback receives
  rel=NULL and the dcs_ci_list.
- Array: fetched at the caller-specified epoch; callback receives both
  the daos_recx_ep_list and the dcs_ci_list.

Passing dump_cb=NULL is a valid no-op.

A new dt_csum_ctx test fixture is introduced with two objects (one with
and one without checksums) and three test cases covering error paths,
single-value csum retrieval, and array csum retrieval, including
callback error propagation.

Features: recovery
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown

Ticket title is 'Checksum management with ddb'
Status is 'In Review'
https://daosio.atlassian.net/browse/DAOS-17321

@knard38 knard38 marked this pull request as ready for review June 5, 2026 13:07
@knard38 knard38 requested review from Nasf-Fan, NiuYawei and janekmi June 5, 2026 13:07
@daosbuild3

Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18444/1/execution/node/1311/log

Comment thread src/utils/ddb/ddb_vos.c Outdated

/* Single-value has one record per epoch; always fetch the latest. */
rc = vos_fetch_begin(coh, *oid, DAOS_EPOCH_MAX, dkey, 1, iod, VOS_OF_FETCH_CSUM, NULL, &ioh,
NULL);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to allow to fetch record with specified epoch since we support snapshot, right?

@knard38 knard38 Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Thread epoch through dump_csum_sv for snapshot access

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Thread epoch through dump_csum_sv for snapshot access

Fixed with commit 338c5e6

Comment thread src/utils/ddb/ddb_vos.c
rc = vos_fetch_end(ioh, NULL, rc);
if (rc != 0)
D_ERROR("vos_fetch_end for csum dump of " DF_UOID " failed: " DF_RC "\n",
DP_UOID(*oid), DP_RC(rc));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to repeatedly report error if failed during dump_cb()?

@knard38 knard38 Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to repeatedly report error if failed during dump_cb()?

  • Avoid double error logging on callback failure in csum dump

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed with commit 89aef6b

kanard38 added 3 commits June 10, 2026 05:58
The original implementation of dump_csum_sv() hardcoded DAOS_EPOCH_MAX,
making it impossible to retrieve the checksum of a single-value record
as it existed at a snapshot epoch.

Add an epoch parameter to dump_csum_sv() and pass the caller-provided
epoch through from dv_dump_csum(), making SV and array akeys consistent.
Update the @param epoch doc comment accordingly.

Extend the test fixture (dt_csum_ctx) to store DVT_FAKE_SV_COUNT
successive SV writes in dct_sv_ics[], mirroring the dct_recx_ics[]
pattern. Add two new sub-cases in dump_csum_sv_tests:
- fetching at epoch 1 returns the epoch-1 checksum (dct_sv_ics[0])
- fetching at DAOS_EPOCH_MAX returns the latest checksum (dct_sv_ics[1])

Reviewed-at: #18444 (comment)

Features: recovery
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
…m dump

When dump_csum_sv() or dump_csum_recx() invoke the caller's callback
and it returns an error, the same rc was previously logged twice:
once as a callback error (D_ERROR) and again as a vos_fetch_end error
(D_ERROR), because vos_fetch_end propagates the err argument it receives.

Introduce a separate cb_rc variable to track the callback return code.
Downgrade the callback failure log from D_ERROR to D_DEBUG — the error
is propagated to the caller who is responsible for handling it. Only
log vos_fetch_end at D_ERROR when it returns a genuinely new error
(rc != cb_rc). Add a guard at the out label to ensure cb_rc is always
returned to the caller even if vos_fetch_end silently returns 0.

Reviewed-at: #18444 (comment)

Features: recovery
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
@knard38 knard38 requested a review from Nasf-Fan June 10, 2026 06:33
@knard38 knard38 self-assigned this Jun 10, 2026
NiuYawei
NiuYawei previously approved these changes Jun 11, 2026
Nasf-Fan
Nasf-Fan previously approved these changes Jun 12, 2026
Comment thread src/utils/ddb/tests/ddb_test_driver.c Outdated
if (rc != 0)
vos_cont_destroy(poh, csum_ctx->dct_cont_uuid);
out_pool:
vos_pool_close(poh);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[suggestion] We can keep the pool opened, and close it via ddb_test_csum_teardown.

@knard38 knard38 Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Do not open pool multiple times

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed with commit 3a85ff9

kanard38 added 3 commits June 12, 2026 14:55
ddb_test_csum_setup() and ddb_test_csum_teardown() each opened and
closed their own VOS pool handle, causing a redundant vos_pool_open /
vos_pool_close round-trip when called from dv_test_csum_setup /
dv_test_csum_teardown, which already hold an open handle in
tctx->dvt_poh via dv_test_setup().

Replace the internal pool open/close with direct use of tctx->dvt_poh,
and document the precondition in comments on both functions and their
declarations. The pool handle lifecycle remains the caller's
responsibility, keeping the functions self-contained and reusable from
any test file that provides an open pool handle.

Features: recovery
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
@knard38 knard38 dismissed stale reviews from Nasf-Fan and NiuYawei via 22c044d June 15, 2026 06:22
@knard38 knard38 requested review from Nasf-Fan and NiuYawei June 15, 2026 06:24
@daosbuild3

Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18444/4/execution/node/1352/log

@daosbuild3

Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18444/4/execution/node/1342/log

@daosbuild3

Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18444/7/execution/node/1419/log

@knard38

knard38 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

The only CI failure of the last CI build #7 is a known issue:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

5 participants