Flaky CI: v7 paginated-listing test fails ~6% of runs (single-dir seed never spans shards; >=2-page assertion rides an empty-tail artifact) #38

@ehsan6sha

Symptom

test-rust failed on both of today's main pushes (run 27431570266 after #35, run 27437891794 after #37) with the same assertion, while all four PR runs passed:

thread 'test_v7_list_directory_paginated_round_trips' panicked at tests/v7_hamt_tests.rs:405:
expected >=2 pages for 64 entries with max_keys=8, got 1

Local reproduction: 1 failure in 10 runs of the test on current main. The root cause below puts the per-run failure rate at 1/16 (~6%), and the flake has been present since the test landed in e98ad3d (predates #34/#35/#36/#37). The two main-run failures in a row were bad luck: P(≥2 failures in today's 6 runs) ≈ 5% at p = 1/16.
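
The "bad luck" estimate is a binomial upper tail and can be checked in a few lines. This is a minimal sketch, assuming the runs are independent and each fails with the same probability p; the function names are illustrative and not part of the repository.

```rust
/// n choose r, computed in floating point (fine for small n).
fn binomial(n: u32, r: u32) -> f64 {
    (0..r).fold(1.0, |acc, j| acc * (n - j) as f64 / (j + 1) as f64)
}

/// Probability of at least `k` failures in `n` independent runs,
/// each failing with probability `p` (binomial upper tail).
fn prob_at_least(k: u32, n: u32, p: f64) -> f64 {
    // Sum P(exactly i failures) for i < k, then take the complement.
    let mut below = 0.0;
    for i in 0..k {
        below += binomial(n, i) * p.powi(i as i32) * (1.0 - p).powi((n - i) as i32);
    }
    1.0 - below
}
```

Calling `prob_at_least(2, 6, 1.0 / 16.0)` gives roughly 0.049, so two failures among six runs is unlikely but not remarkable; the exact figure shifts with the p assumed.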

Root cause — the test's structural assumption is false under dir-local routing

The test seeds 64 files all under one directory /big/ and asserts the shard-grained paginated walk needs ≥2 pages. But shard routing is dir-local: shard_for_path_v6 hashes the parent directory (crates/fula-crypto/src/private_forest.rs:973), so all 64 files land in one salt-random shard out of 16. The test never spanned multiple shards at all. What it actually measured:

  • Hot shard at index S in 0–14 (15/16 of runs): page 1 returns all 64 entries and stops with cursor = S+1; page 2 walks the remaining shards, finds nothing, and returns an empty tail page with cursor None → page_count = 2 → assertion passes for the wrong reason (the "second page" is a drain artifact, not pagination).
  • Hot shard at index 15 (1/16 of runs): the stop lands on the last shard, next_shard == num_shards → cursor None immediately (sharded_hamt_forest.rs:1964) → page_count = 1 → assertion fails.
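
The two cases can be reproduced with a toy model of the walk. This is a sketch under stated assumptions, not the code in sharded_hamt_forest.rs: it assumes a shard is emitted whole, a page ends after the first shard that brings it to max_keys or more entries, and the cursor becomes None only when that shard was the last one. `page_sizes` is a hypothetical name.

```rust
/// Toy model of the shard-grained walk (an assumption, not the real code).
/// `counts[i]` is the number of matching entries in shard i.
/// Returns the number of entries on each page, in order.
fn page_sizes(counts: &[usize], max_keys: usize) -> Vec<usize> {
    let mut pages = Vec::new();
    let mut cursor = Some(0usize);
    while let Some(start) = cursor {
        let mut taken = 0;
        let mut next = start;
        // Consume whole shards until the page holds at least max_keys entries.
        while next < counts.len() {
            taken += counts[next];
            next += 1;
            if taken >= max_keys {
                break;
            }
        }
        pages.push(taken);
        // No cursor once the walk has consumed the final shard.
        cursor = if next == counts.len() { None } else { Some(next) };
    }
    pages
}
```

With 64 entries in shard 3 of 16 and max_keys = 8 the model yields pages of 64 and 0 entries (two pages, the second an empty tail); with the same entries in shard 15 it yields a single page of 64.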

So the ≥2-page assertion was a weighted-coin artifact check that passed 15/16 of the time, never a multi-shard pagination test.

Fix (test-only; no shipped code changes, no version bump)

Seed the 64 files across 8 subdirectories of /big/ (8 files each). Each subdir routes to its own salt-random shard, so the set genuinely spans ≥2 shards unless all 8 subdirs collide on one shard — P = 16⁻⁷ ≈ 4×10⁻⁹, i.e. never. With max_keys = 8 ≤ per-shard match counts, every populated shard ends a page, so the walk genuinely produces ≥2 non-empty pages. The test now additionally asserts ≥2 non-empty pages, so it can never again pass on the empty-tail artifact (and stays correct if the tail-page wart below is ever fixed).
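
The reseeding and the stricter assertion can be sketched as follows. The directory and file names are hypothetical (the real test may name them differently), and the collision formula assumes each subdirectory is routed independently and uniformly across the shards.

```rust
/// Hypothetical seed layout: `subdirs` directories under /big/,
/// each holding `per_dir` files.
fn seed_paths(subdirs: usize, per_dir: usize) -> Vec<String> {
    let mut paths = Vec::with_capacity(subdirs * per_dir);
    for d in 0..subdirs {
        for f in 0..per_dir {
            paths.push(format!("/big/dir{:02}/file{:02}.txt", d, f));
        }
    }
    paths
}

/// Probability that `subdirs` (>= 1) independently, uniformly routed
/// directories all land on the same one of `num_shards` shards.
fn all_collide_probability(num_shards: u32, subdirs: u32) -> f64 {
    // The first directory picks any shard; each of the others must match it.
    (1.0 / num_shards as f64).powi(subdirs as i32 - 1)
}

/// Count only pages that returned at least one entry, so an empty
/// tail page cannot satisfy a ">= 2 pages" check.
fn non_empty_pages(page_sizes: &[usize]) -> usize {
    page_sizes.iter().filter(|&&n| n > 0).count()
}
```

For 16 shards and 8 subdirectories `all_collide_probability` returns 16⁻⁷, about 3.7×10⁻⁹. A listing that produced pages of 64 and 0 entries counts as one non-empty page and would fail the stricter check, which is the intended behavior.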

Follow-up candidate (not in this fix)

list_recursive_page emits one wasted empty tail page whenever the last data-bearing shard isn't the final shard index. The manifest already carries per-shard entry_count, so the cursor could skip-ahead/return None when nothing populated remains. Harmless today (one extra round trip per listing); left out of this change to keep the CI fix zero-risk.
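
One possible shape for the skip-ahead, as a sketch only: it assumes the per-shard entry_count values are available as a slice indexed by shard, and `next_cursor` is a hypothetical helper, not an existing function in the crate.

```rust
/// Hypothetical cursor computation. `entry_counts[i]` is the manifest's
/// entry count for shard i; `next_shard` is the index right after the
/// shard that ended the current page. Returns the next shard that holds
/// any entries, or None when nothing populated remains.
fn next_cursor(entry_counts: &[u64], next_shard: usize) -> Option<usize> {
    (next_shard..entry_counts.len()).find(|&i| entry_counts[i] > 0)
}
```

With this shape a listing whose last populated shard is index 3 would get None straight after page 1 instead of a cursor of 4, so no empty tail page is fetched. Per-shard counts only say whether a shard is populated, not whether its entries match the listing prefix, so a real implementation might still need to visit some shards that contribute nothing.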

🤖 Generated with Claude Code
