Improve search: multi-term AND + relevance ranking (FTS spike) #95

Merged
rdhyee merged 2 commits into isamplesorg:main from rdhyee:feature/fts-spike
May 1, 2026

rdhyee commented Apr 9, 2026

Summary

Closes #84 — FTS spike complete with immediate search improvements and documented future path.

Shipped now (zero new dependencies):

  • Multi-term search: "pottery Cyprus" requires BOTH words to match (was OR on the full phrase)
  • Wildcard-safe ILIKE: search terms containing %, _, or \ are escaped, so they match literal characters instead of acting as wildcards
  • Relevance ranking: results sorted by score when searching — label match = 3pts, place_name = 2pts. (Description is not scored — the column lives in the wide parquet, while the Explorer's search runs against the lite parquet, which has no description.)
  • When not searching, results remain random for exploration variety
  • Search composes with active facet filters (Material / Sampled Feature / Specimen Type) and the source legend, so results stay inside the user's current narrowing
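
The multi-term AND, wildcard escaping, and scoring listed above could be composed roughly as follows. This is a minimal Python sketch, not the Explorer's actual JavaScript: the helper names `escape_like` and `build_search_sql` are hypothetical, the columns `label` and `place_name` and the 3/2 weights come from this summary, and the additive form of the score is an assumption.

```python
def escape_like(term: str) -> str:
    """Escape LIKE/ILIKE metacharacters and single quotes in a user-supplied term."""
    # Backslash first, so the escapes added for % and _ are not doubled.
    term = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    # Double single quotes so the term is safe inside a SQL string literal.
    return term.replace("'", "''")


# Weights taken from the PR summary: label = 3, place_name = 2.
WEIGHTS = {"label": 3, "place_name": 2}


def build_search_sql(query: str) -> tuple[str, str]:
    """Return (where_clause, score_expression) for a whitespace-separated query."""
    terms = [escape_like(t) for t in query.split()]
    if not terms:
        # No search terms: no filtering and no score.
        return "TRUE", "0"
    per_term_filters = []
    score_parts = []
    for term in terms:
        pattern = f"'%{term}%' ESCAPE '\\'"
        # A term matches a row if ANY searched column contains it.
        any_col = " OR ".join(f"{col} ILIKE {pattern}" for col in WEIGHTS)
        per_term_filters.append(f"({any_col})")
        # Each matching column adds its weight to the score (assumed additive).
        score_parts += [
            f"CASE WHEN {col} ILIKE {pattern} THEN {w} ELSE 0 END"
            for col, w in WEIGHTS.items()
        ]
    # EVERY term must match somewhere, hence AND between the per-term filters.
    return " AND ".join(per_term_filters), " + ".join(score_parts)
```

A caller would presumably AND the returned clause with any active facet filters and order by the score expression only when the query is non-empty, leaving the random ordering in place otherwise.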

FTS spike findings:

  • Built offline DuckDB FTS index with tools/build_fts_index.py (preserved as a non-production spike artifact, clearly marked in its module docstring)
  • Full index (label + description + place_name): 358 MB — too large for auto-download
  • Lite index (label + place_name only): 211 MB — still substantial
  • BM25 scoring works well (Porter stemming, English stopwords)
  • ATTACH over HTTP in DuckDB-WASM is supported, but downloading a 211–358 MB index is impractical
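
For readers unfamiliar with the DuckDB FTS extension, the index build and BM25 query behind these findings look roughly like the statements below. This is a sketch and not the contents of tools/build_fts_index.py: the table name `samples`, the id column `pid`, and both helper names are assumptions. The `create_fts_index` pragma, its `stemmer` and `stopwords` options, and the `fts_main_<table>.match_bm25` macro are part of the extension.

```python
def fts_index_sql(table: str, id_col: str, text_cols: list[str]) -> str:
    """Build the PRAGMA that creates an FTS index with Porter stemming."""
    cols = ", ".join(f"'{c}'" for c in text_cols)
    return (
        f"PRAGMA create_fts_index('{table}', '{id_col}', {cols}, "
        "stemmer = 'porter', stopwords = 'english')"
    )


def bm25_query_sql(table: str, id_col: str, query: str, limit: int = 20) -> str:
    """Build a BM25-ranked lookup; rows that do not match get a NULL score."""
    literal = query.replace("'", "''")  # keep the query safe inside a SQL literal
    return (
        f"SELECT {id_col}, score FROM ("
        f"SELECT *, fts_main_{table}.match_bm25({id_col}, '{literal}') AS score "
        f"FROM {table}) sq "
        f"WHERE score IS NOT NULL ORDER BY score DESC LIMIT {int(limit)}"
    )


# Usage with the duckdb Python package (not run here; file name is hypothetical):
#   import duckdb
#   con = duckdb.connect("fts_spike.duckdb")
#   con.execute("INSTALL fts")
#   con.execute("LOAD fts")
#   con.execute(fts_index_sql("samples", "pid", ["label", "place_name"]))
#   rows = con.execute(bm25_query_sql("samples", "pid", "pottery")).fetchall()
```

Passing only `label` and `place_name` corresponds to the lite index above; adding `description` corresponds to the full one.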

Recommended next steps (not in this PR):

  1. Explore pre-tokenized search parquet (inverted index as parquet, much smaller)
  2. Consider on-demand FTS loading behind an "Enhanced Search" toggle
  3. Evaluate DuckDB text analytics functions (stemming without full index)
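
To make step 1 concrete, here is one way a pre-tokenized inverted index could be laid out as flat (token, pid) rows, which is the shape that would then be written to parquet. It is only an illustration of the idea, with hypothetical function names and sample data, a naive lowercase tokenizer, and no stemming or stopword handling.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_postings(rows: dict[str, str]) -> list[tuple[str, str]]:
    """Flatten {pid: text} into sorted (token, pid) rows, one per distinct pair."""
    pairs = {(tok, pid) for pid, text in rows.items() for tok in tokenize(text)}
    return sorted(pairs)


def search_all_terms(postings: list[tuple[str, str]], query: str) -> set[str]:
    """Return the pids whose text contains EVERY token of the query."""
    index: dict[str, set[str]] = defaultdict(set)
    for tok, pid in postings:
        index[tok].add(pid)
    term_sets = [index.get(t, set()) for t in tokenize(query)]
    return set.intersection(*term_sets) if term_sets else set()


# Hypothetical sample data, for illustration only.
rows = {"s1": "Pottery sherd, Cyprus", "s2": "Basalt core", "s3": "Cyprus basalt"}
postings = build_postings(rows)
print(search_all_terms(postings, "pottery Cyprus"))  # {'s1'}
```

Because the rows are sorted by token, a parquet file built from them would in principle let a reader fetch only the row groups covering the queried tokens; whether that is small and fast enough in practice is what the spike would need to measure.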

Test plan

  • Search "pottery" → results ranked by relevance (label matches first)
  • Search "pottery Cyprus" → only samples matching BOTH words
  • Search "basalt" → geological samples with label matches at top
  • Search "100%" or other wildcard chars → matches the literal characters, not all rows
  • Clear search → results return to random sampling
  • With Material/Source/Specimen filters active → search results stay inside those filters
  • tools/build_fts_index.py runs successfully with local parquet (spike artifact, not part of the deploy)

🤖 Generated with Claude Code

rdhyee added a commit to rdhyee/isamplesorg.github.io that referenced this pull request Apr 28, 2026
Search input was passed into ILIKE patterns with only single-quote
escaping, so a literal "%" or "_" in the query (e.g. "100%", "co_op")
silently turned into wildcards. Escape % _ \ and add ESCAPE '\' in
both whereClause and the relevance-score expression.

Also reframe tools/build_fts_index.py as a spike artifact: the
docstring told readers to upload the index to data.isamples.org, but
per PR isamplesorg#95 findings the 200-358 MB result is too large to ship. Mark
the script NOT in production pipeline and drop the misleading upload
instructions.

Smoke-tested locally with /tmp/explorer_smoke_test.py (multi-term
"pottery cyprus" + wildcard "100%"): 0 JS exceptions, 0 console
errors, 0 failed requests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rdhyee commented Apr 28, 2026

Reviewed and pushed two small follow-ups (134aca2):

1. ILIKE wildcard escaping. Search input was passed into the ILIKE pattern with only single-quote escaping, so literal % or _ in the query (e.g. 100%, co_op) silently became wildcards. Now escape % _ \ and add ESCAPE '\' in both the whereClause block and the relevance-score expression.

2. FTS spike script header. tools/build_fts_index.py told readers to "upload to data.isamples.org" but per the PR's own findings the 200-358 MB result is too large to ship. Reframed as STATUS: spike artifact — NOT in production pipeline, kept the script for future revisits, dropped the misleading upload instructions.
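
The wildcard problem from item 1 is easy to reproduce. The sketch below uses SQLite's `LIKE` (case-insensitive for ASCII by default) from the Python standard library as a stand-in for DuckDB's `ILIKE`; both accept the same `ESCAPE` clause. The table, the sample labels, and the helper names are made up for the demonstration.

```python
import sqlite3

# Hypothetical labels: two contain the literal search text, two only match
# when % or _ is treated as a wildcard.
LABELS = ["Quartz 100% pure", "Batch 1000 grams", "co_op shard", "coXop shard"]


def escape_like(term: str) -> str:
    """Escape \\, % and _ so they match literally under ESCAPE '\\'."""
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def matching_labels(term: str, escape: bool) -> list[str]:
    """Run a substring search for `term`, with or without wildcard escaping."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE samples (label TEXT)")
    con.executemany("INSERT INTO samples VALUES (?)", [(l,) for l in LABELS])
    pattern = "%" + (escape_like(term) if escape else term) + "%"
    rows = con.execute(
        "SELECT label FROM samples WHERE label LIKE ? ESCAPE '\\' ORDER BY label",
        (pattern,),
    ).fetchall()
    con.close()
    return [r[0] for r in rows]


print(matching_labels("100%", escape=False))  # ['Batch 1000 grams', 'Quartz 100% pure']
print(matching_labels("100%", escape=True))   # ['Quartz 100% pure']
```

Without escaping, the trailing `%` in `100%` is just another wildcard, so `1000 grams` matches too; with escaping, only the label containing a literal `100%` is returned.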

Smoke test (/tmp/explorer_smoke_test.py against local Quarto render):

Serving docs on :64856
URL: http://127.0.0.1:64856/tutorials/isamples_explorer.html
JS exceptions:    0
Console errors:   0
Failed requests:  0
RESULT: PASS

Exercised: initial load, multi-term search (pottery cyprus), wildcard-char search (100%). Screenshot confirms the new placeholder and that 100% no longer matches everything.

Other notes from review (not blocking):

  • Score expression has discrete plateaus (0/1/2/3/4/5/6 per term); ties break alphabetically on label. Fine for a spike; could mention in placeholder docs later.
  • description ILIKE over the wide parquet over HTTP range-fetch may add first-search latency; worth a ?perf=1 measurement before declaring search "done", but out of scope here.
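
The plateaus in the first note can be enumerated directly. The sketch assumes the per-term score is a plain sum of the weights of the columns that match; the 3/2/1 weights are the ones in the first commit message, and the 3/2 pair is what the PR summary describes for the lite parquet. The function name is hypothetical.

```python
from itertools import product


def per_term_scores(weights: dict[str, int]) -> list[int]:
    """All scores a single term can earn, given additive per-column weights."""
    cols = list(weights)
    totals = {
        sum(weights[c] for c, hit in zip(cols, hits) if hit)
        for hits in product([False, True], repeat=len(cols))
    }
    return sorted(totals)


print(per_term_scores({"label": 3, "place_name": 2, "description": 1}))
# [0, 1, 2, 3, 4, 5, 6]
print(per_term_scores({"label": 3, "place_name": 2}))
# [0, 2, 3, 5]
```

With so few distinct values per term, ties are common, which is why the secondary sort on label matters for result order.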

LGTM to merge once you've eyeballed the diff.

rdhyee added a commit to rdhyee/isamplesorg.github.io that referenced this pull request Apr 30, 2026
rdhyee force-pushed the feature/fts-spike branch from 134aca2 to 7623ff5 on April 30, 2026 21:46
rdhyee and others added 2 commits May 1, 2026 06:36
Search improvements (immediate):
- Multi-term search: "pottery Cyprus" requires BOTH words to match
- Relevance ranking: label matches weighted 3x, place 2x, description 1x
- Results sorted by relevance score when searching (random for browsing)

FTS spike (future path, documented):
- Added tools/build_fts_index.py to build DuckDB FTS index offline
- Tested: 358 MB full index, 211 MB lite — too large for auto-download
- BM25 scoring works correctly (Porter stemming, stopwords)
- Next step: explore smaller index strategies or on-demand loading

Closes isamplesorg#84 (spike complete — findings documented in PR)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Search input was passed into ILIKE patterns with only single-quote
escaping, so a literal "%" or "_" in the query (e.g. "100%", "co_op")
silently turned into wildcards. Escape % _ \ and add ESCAPE '\' in
both whereClause and the relevance-score expression.

Also reframe tools/build_fts_index.py as a spike artifact: the
docstring told readers to upload the index to data.isamples.org, but
per PR isamplesorg#95 findings the 200-358 MB result is too large to ship. Mark
the script NOT in production pipeline and drop the misleading upload
instructions.

Smoke-tested locally with /tmp/explorer_smoke_test.py (multi-term
"pottery cyprus" + wildcard "100%"): 0 JS exceptions, 0 console
errors, 0 failed requests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rdhyee force-pushed the feature/fts-spike branch from 7623ff5 to 6a31a97 on May 1, 2026 13:37
rdhyee merged commit b2a1103 into isamplesorg:main on May 1, 2026
1 check passed
Development

Successfully merging this pull request may close these issues.

Explore DuckDB FTS extension for full-text search in Explorer