diff --git a/_quarto.yml b/_quarto.yml
index 2ce6bc5..cf3570b 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -44,6 +44,8 @@ website:
    href: models/index.qmd
  - text: "Technical: Narrow vs Wide"
    href: tutorials/narrow_vs_wide_performance.qmd
+ - text: "Technical: Why H3?"
+   href: tutorials/why_h3.qmd
  - text: "Research & Resources"
    menu:
      - text: "Publications & Conferences"
@@ -97,6 +99,8 @@ website:
    text: Vocabularies
  - text: "Technical: Narrow vs Wide"
    href: tutorials/narrow_vs_wide_performance.qmd
+ - text: "Technical: Why H3?"
+   href: tutorials/why_h3.qmd
  - id: research
    title: "Research & Resources"
    style: "docked"
diff --git a/data.qmd b/data.qmd
index 3437457..d73a7be 100644
--- a/data.qmd
+++ b/data.qmd
@@ -149,6 +149,8 @@ con.sql("""
 
 The H3 summary files back a progressive-globe rendering pattern: render
 aggregate circles at low zoom, individual points at high zoom.
+For why we use H3 at all and why specifically resolutions 4 / 6 / 8,
+see [Technical: Why H3?](tutorials/why_h3.qmd).
 Approximate breakpoints:
 
 | Zoom / altitude | Use |
diff --git a/how-to-use.qmd b/how-to-use.qmd
index 3bd2856..e99c06e 100644
--- a/how-to-use.qmd
+++ b/how-to-use.qmd
@@ -96,7 +96,8 @@ and counts instantly, without touching the 278 MB primary file:
 
 Hexagonal H3 cells pre-aggregated at three resolutions for zoom-adaptive globe
 rendering. Each row: `h3_cell, center_lat, center_lng, sample_count,
-dominant_source, source_count`.
+dominant_source, source_count`. For the design rationale (why hexagons,
+why these resolutions), see [Technical: Why H3?](tutorials/why_h3.qmd).
| File | Size | Cells | Typical altitude | |---|---:|---:|---| diff --git a/tutorials/isamples_explorer.qmd b/tutorials/isamples_explorer.qmd index a4f5691..18820f8 100644 --- a/tutorials/isamples_explorer.qmd +++ b/tutorials/isamples_explorer.qmd @@ -3,7 +3,7 @@ title: "iSamples Interactive Explorer" categories: [parquet, spatial, interactive] format: html: - code-fold: true + echo: false toc: true toc-depth: 3 include-in-header: @@ -68,16 +68,12 @@ Search and explore **6.7 million physical samples** from scientific collections This app uses a **two-tier loading strategy**: a 2KB pre-computed summary loads instantly for facet counts, while the full ~280 MB Parquet file is queried on demand. **Cross-filtering** keeps counts accurate — selecting a source updates material/context/specimen counts to reflect only that source's samples. All powered by DuckDB-WASM in your browser — no server required! ::: -## Setup - ```{ojs} -//| code-fold: true // Imports - use dynamic import to avoid CORS issues duckdbModule = import("https://cdn.jsdelivr.net/npm/@duckdb/duckdb-wasm@1.28.0/+esm") ``` ```{ojs} -//| code-fold: true // Version gate. Append ?v=2 to the URL to opt into the lite-backed // rewrite (samples_map_lite.parquet instead of wide.parquet, lazy // description fetch on click, no ORDER BY RANDOM(), lazy Cesium mount). @@ -118,14 +114,12 @@ function getCesiumColor(source) { ``` ```{ojs} -//| code-fold: true // Parse URL params for bookmarkable searches initialParams = { const params = new URLSearchParams(window.location.search); return { q: params.get("q") || "", - sources: params.get("sources")?.split(",").filter(s => s) || [], - view: ["list", "table", "globe"].includes(params.get("view")) ? 
params.get("view") : "globe" + sources: params.get("sources")?.split(",").filter(s => s) || [] }; } ``` @@ -133,7 +127,6 @@ initialParams = { ## Search & Filters ```{ojs} -//| code-fold: false // Search input viewof searchInput = Inputs.text({ placeholder: "Search samples (e.g., pottery, basalt, Cyprus...)", @@ -154,7 +147,6 @@ facetSummariesWarning **Source** ```{ojs} -//| code-fold: true // Source checkboxes with counts - uses pre-computed summaries for instant load viewof sourceCheckboxes = { const counts = facetsByType.source; @@ -178,7 +170,6 @@ viewof sourceCheckboxes = { **Material** ```{ojs} -//| code-fold: true // Material filter - loaded from pre-computed summaries viewof materialCheckboxes = { const counts = facetsByType.material; @@ -199,7 +190,6 @@ viewof materialCheckboxes = { **Sampled Feature** ```{ojs} -//| code-fold: true // Context filter - loaded from pre-computed summaries viewof contextCheckboxes = { const counts = facetsByType.context; @@ -220,7 +210,6 @@ viewof contextCheckboxes = { **Specimen Type** ```{ojs} -//| code-fold: true // Object type filter - loaded from pre-computed summaries viewof objectTypeCheckboxes = { const counts = facetsByType.object_type; @@ -239,14 +228,12 @@ viewof objectTypeCheckboxes = { ``` ```{ojs} -//| code-fold: true html`Clear All Filters` ``` **Max Samples** ```{ojs} -//| code-fold: false viewof maxSamples = Inputs.range([1000, 100000], { value: 25000, step: 1000 @@ -256,10 +243,7 @@ viewof maxSamples = Inputs.range([1000, 100000], {
-### Results - ```{ojs} -//| code-fold: true // Update URL without reloading { const params = new URLSearchParams(); @@ -268,7 +252,6 @@ viewof maxSamples = Inputs.range([1000, 100000], { if (materialCheckboxes?.length) params.set("material", materialCheckboxes.join(",")); if (contextCheckboxes?.length) params.set("context", contextCheckboxes.join(",")); if (objectTypeCheckboxes?.length) params.set("object_type", objectTypeCheckboxes.join(",")); - if (viewMode !== "globe") params.set("view", viewMode); const newUrl = params.toString() ? `?${params.toString()}` : window.location.pathname; if (window.location.search !== `?${params.toString()}`) { @@ -277,87 +260,51 @@ viewof maxSamples = Inputs.range([1000, 100000], { } ``` -```{ojs} -//| code-fold: false -// View mode selector -viewof viewMode = Inputs.radio(["globe", "list", "table"], { - value: initialParams.view, - format: (x) => x === "globe" ? "🌍 Globe" : x === "list" ? "📋 List" : "📊 Table" -}) -``` +
+ +
+ SESAR + OpenContext + GEOME + Smithsonian +
Loading data...
```{ojs} -//| code-fold: true // Show result count -html`
+html`
Showing ${sampleData.length.toLocaleString()} of ${Number(totalCount).toLocaleString()} matching samples
` ``` -```{ojs} -//| code-fold: true -// Render results based on view mode -{ - if (viewMode === "globe") { - // Globe is rendered in its own section below - return html`
See globe view below
`; - } else if (viewMode === "table") { - return Inputs.table(sampleData, { - columns: ['source', 'label', 'latitude', 'longitude'], - header: { - source: 'Source', - label: 'Label', - latitude: 'Lat', - longitude: 'Lon' - }, - format: { - source: (x) => html`${x}`, - latitude: (x) => x?.toFixed(4), - longitude: (x) => x?.toFixed(4) - }, - rows: 20 - }); - } else { - // List view - if (sampleData.length === 0) { - return html`
No results found
`; - } - - const items = sampleData.map(r => { - const color = SOURCE_COLORS[r.source] || SOURCE_COLORS.default; - const desc = r.description || ""; - const description = desc.length > 150 ? desc.slice(0, 150) + "..." : (desc || "No description"); - - return html`
-
- - ${r.source} - -
-
${r.label || "Unlabeled"}
-
${description}
-
- 📍 ${r.latitude?.toFixed(4)}, ${r.longitude?.toFixed(4)} -
-
`; - }); - - return html`
${items}
`; - } -} -``` -
-## Database & Queries +### Results + +```{ojs} +// Full-width results table +Inputs.table(sampleData, { + columns: ['source', 'label', 'latitude', 'longitude'], + header: { + source: 'Source', + label: 'Label', + latitude: 'Lat', + longitude: 'Lon' + }, + format: { + source: (x) => html`${x}`, + latitude: (x) => x?.toFixed(4), + longitude: (x) => x?.toFixed(4) + }, + rows: 20 +}) +``` ```{ojs} -//| code-fold: true // Initialize DuckDB-WASM db = { performance.mark('explorer-db-start'); @@ -415,12 +362,10 @@ async function runQuery(sql) { ``` ```{ojs} -//| code-fold: true mutable facetSummariesError = null ``` ```{ojs} -//| code-fold: true // Tier 1: Load pre-computed facet summaries (2KB, instant) facetSummaries = { mutable facetSummariesError = null; @@ -440,7 +385,6 @@ facetSummaries = { ``` ```{ojs} -//| code-fold: true facetSummariesWarning = { if (!facetSummariesError) return null; return html`
@@ -466,7 +410,6 @@ facetsByType = { ``` ```{ojs} -//| code-fold: true // Load SKOS prefLabels for vocabulary URIs (#148). Tiny lookup (~60KB); // fallback to last URI segment if a URI isn't covered. vocabLabels = { @@ -496,7 +439,6 @@ prettyLabel = function (uri) { ``` ```{ojs} -//| code-fold: true // Cross-filter: build WHERE clause excluding one facet dimension // Queries the sample_facets view (URI strings, correct column names) function buildCrossFilterWhere(excludeFacet) { @@ -549,7 +491,6 @@ function buildCrossFilterWhere(excludeFacet) { ``` ```{ojs} -//| code-fold: true // Detect whether any filter is active (triggers cross-filter queries) hasActiveFilters = { const hasSearch = searchInput?.trim()?.length > 0; @@ -562,7 +503,6 @@ hasActiveFilters = { ``` ```{ojs} -//| code-fold: true // Cross-filtered facet counts: use pre-computed cache for single-filter, // fall back to on-the-fly queries against sample_facets for multi-filter crossFilteredFacets = { @@ -653,7 +593,6 @@ crossFilteredFacets = { ``` ```{ojs} -//| code-fold: true // Update facet count labels in-place when cross-filtered counts change // This avoids re-rendering checkboxes (which would reset user selections) { @@ -687,7 +626,6 @@ crossFilteredFacets = { ``` ```{ojs} -//| code-fold: true // Build WHERE clause from current filters (Tier 2: queries full parquet only when filtering) // Source filter uses the wide parquet's `n` column directly. 
// Material/context/object_type filters use the sample_facets view (URI strings) @@ -754,14 +692,12 @@ whereClause = { ``` ```{ojs} -//| code-fold: true // Source counts now come from pre-computed facet summaries (Tier 1) // No longer scans the full parquet file on every page load sourceCounts = facetsByType.source ``` ```{ojs} -//| code-fold: true // Get total count matching current filters totalCount = { performance.mark('explorer-count-start'); @@ -778,7 +714,6 @@ totalCount = { ``` ```{ojs} -//| code-fold: true // Load sample data sampleData = { const statusDiv = document.getElementById('loading_status'); @@ -843,35 +778,14 @@ sampleData = { } ``` -## Globe View - ```{ojs} -//| code-fold: true mutable clickedPointId = null mutable clickedPointIndex = null ``` -
- -
- SESAR - OpenContext - GEOME - Smithsonian -
- ```{ojs} -//| code-fold: true // Cesium viewer setup viewer = { - // v2: defer Cesium construction until the user actually switches to - // globe view. The cell re-evaluates when viewMode changes (reactive - // dependency below), so toggling into globe will mount on demand. - // v1 mounts eagerly to preserve original behavior. - if (explorerVersion === 'v2' && viewMode !== 'globe') { - return null; - } - // Wait for Cesium to be available await new Promise(resolve => { if (typeof Cesium !== 'undefined') resolve(); @@ -916,7 +830,6 @@ viewer = { ``` ```{ojs} -//| code-fold: true // Render points on globe renderPoints = { if (!viewer || sampleData.length === 0) return null; @@ -971,7 +884,6 @@ renderPoints = { ## Sample Card ```{ojs} -//| code-fold: true // Get selected sample data selectedSample = { if (clickedPointIndex === null || clickedPointIndex < 0) return null; @@ -981,7 +893,6 @@ selectedSample = { ``` ```{ojs} -//| code-fold: true // v2: lazy description fetch — only hit the 278 MB wide parquet when a sample // is actually clicked, rather than pulling description for every row eagerly. lazyDescription = { @@ -1003,7 +914,6 @@ lazyDescription = { ``` ```{ojs} -//| code-fold: true // Render sample card sampleCard = { if (!selectedSample) { @@ -1064,12 +974,10 @@ sampleCard = { Current State & Query ```{ojs} -//| code-fold: true html`
 State:
   search: "${searchInput || ''}"
   sources: ${JSON.stringify(Array.from(sourceCheckboxes || []))}
-  view: "${viewMode}"
   maxSamples: ${maxSamples}
 
 WHERE clause:
diff --git a/tutorials/why_h3.qmd b/tutorials/why_h3.qmd
new file mode 100644
index 0000000..70e2327
--- /dev/null
+++ b/tutorials/why_h3.qmd
@@ -0,0 +1,133 @@
+---
+title: "Technical: Why H3?"
+subtitle: "Why iSamples uses Uber's H3 hexagonal grid for spatial aggregation, and why specifically resolutions 4 / 6 / 8"
+categories: [h3, spatial, design-rationale]
+format:
+  html:
+    code-fold: true
+    toc: true
+    toc-depth: 3
+---
+
+The progressive globe and the Interactive Explorer both render millions of samples by aggregating points into pre-computed [H3](https://h3geo.org/) cells at three resolutions. This page documents *why* H3, why *those* resolutions, and what we considered before adopting it.
+
+::: {.callout-tip}
+### One-paragraph version
+
+H3 is a hierarchical hexagonal grid system, originally built by Uber for ride-sharing analytics and released as open source in 2018. We use it because hexagons have **uniform neighbor distance** (square grids do not), the index is a single 64-bit integer (cheap to store, fast to filter), and DuckDB supports H3 through an extension. We pre-aggregate at resolutions 4 / 6 / 8, chosen so that each tier file fits comfortably in browser memory and roughly matches a continental / regional / neighborhood zoom band on the globe.
+:::
+
+## What H3 is
+
+H3 partitions the Earth's surface into a hierarchy of hexagonal cells at 16 resolutions (0 = ~4.4 M km² per cell, 15 = ~0.9 m²). Each cell has a unique 64-bit integer ID. Parents and children at adjacent resolutions don't nest perfectly (a hexagon cannot be subdivided exactly into seven smaller hexagons), but the approximate parent–child relationship is good enough for binning and rollup.
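
A minimal sketch of what the 64-bit ID encodes and why rollup to a parent is cheap. It assumes the cell-index bit layout documented by the H3 project (4 mode bits, 4 resolution bits, 7 base-cell bits, then fifteen 3-bit digits). The helper names `h3_resolution`, `h3_base_cell`, and `h3_parent` are hypothetical and are not part of any H3 binding; production code should call the official library instead.

```python
# Illustrative only: read fields of an H3 cell index with plain bit operations.
# Assumed layout, high bits to low bits:
#   1 reserved bit | 4 mode bits | 3 reserved bits | 4 resolution bits |
#   7 base-cell bits | 15 three-bit digits (one per resolution 1..15)

RES_SHIFT = 52          # resolution field starts at bit 52
BASE_CELL_SHIFT = 45    # base-cell field starts at bit 45
MAX_RES = 15
UNUSED_DIGIT = 0b111    # digits finer than the cell's resolution are all ones


def h3_resolution(cell: int) -> int:
    """Return the resolution (0..15) stored in the index."""
    return (cell >> RES_SHIFT) & 0xF


def h3_base_cell(cell: int) -> int:
    """Return the base cell number (one of the resolution-0 cells)."""
    return (cell >> BASE_CELL_SHIFT) & 0x7F


def h3_parent(cell: int, parent_res: int) -> int:
    """Return the ancestor of `cell` at the coarser resolution `parent_res`."""
    res = h3_resolution(cell)
    if not 0 <= parent_res <= res:
        raise ValueError(f"parent_res must be in 0..{res}, got {parent_res}")
    # Overwrite the resolution field with the parent's resolution.
    parent = (cell & ~(0xF << RES_SHIFT)) | (parent_res << RES_SHIFT)
    # Mark every digit finer than the parent's resolution as unused.
    for r in range(parent_res + 1, MAX_RES + 1):
        parent |= UNUSED_DIGIT << ((MAX_RES - r) * 3)
    return parent


cell = int("8928308280fffff", 16)        # a resolution-9 cell ID in hex form
print(h3_resolution(cell))               # 9
print(format(h3_parent(cell, 8), "x"))   # 8828308281fffff
```

Because the parent is derived by masking bits, rolling fine cells up to a coarser tier needs no geometry at all, which is what makes a precomputed integer column convenient for grouping.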
+
+| Resource | Why follow it |
+|---|---|
+| [h3geo.org](https://h3geo.org/) | Authoritative documentation, including the resolution table |
+| [H3 GitHub](https://github.com/uber/h3) | C library (canonical) plus bindings for Python, JS, Java, R, Go |
+| [Uber Engineering blog (2018)](https://www.uber.com/blog/h3/) | Original announcement and design rationale |
+| [Sahr (2008), *Discrete Global Grid Systems*](https://discreteglobalgrids.org/) | Theoretical underpinning for hex-based DGGs |
+| [Wikipedia: Discrete global grid](https://en.wikipedia.org/wiki/Discrete_global_grid) | Background — H3 has no standalone Wikipedia article yet (April 2026) |
+
+::: {.callout-note}
+The lack of a standalone Wikipedia article is a real gap and a fair signal of how niche the technology still is outside ride-sharing, mapping, and geo-analytics circles. iSamples chose H3 anyway; this page is part of how we make the choice legible to people who would otherwise have a Wikipedia article to fall back on.
+:::
+
+## Why hexagons over squares (or triangles)
+
+Most map-tile systems and database spatial-index schemes (the Bing Maps quadkey, the geohash string, S2's Hilbert-curve-on-square cells) partition the world into squares or rectangles. The classical critique:
+
+| Property | Square grid | Hex grid |
+|---|---|---|
+| Neighbor count | 8 (4 edge + 4 corner) | 6 (all edge) |
+| Distance to all neighbors | Two distinct values (`d` and `d√2`) | One value |
+| Direction sampling | Anisotropic (axis-aligned) | More uniform |
+| Fits a sphere cleanly | No (poles/dateline distortion) | Better (12 pentagons hide the curvature) |
+
+For aggregation queries — "how many samples in this cell and its neighbors?" — uniform neighbor distance is the property that matters. With squares, you must decide whether a diagonal neighbor "counts the same" as an edge neighbor; with hexagons, you don't.
+
+Triangles have uniform distance across their three edge neighbors, but they also have vertex-only neighbors at other distances, and they alternate orientation (point-up / point-down), which makes neighborhood logic and rendering both more complex.
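
The neighbor-distance rows of the table can be checked numerically. The sketch below works on a flat plane with unit center spacing, so it ignores the distortion any grid picks up on a sphere; the axial-coordinate conversion is a standard construction for hex grids, not something taken from the H3 codebase.

```python
import math


def distinct_neighbor_distances(offsets):
    """Return the sorted set of center-to-center distances, rounded to 6 places."""
    return sorted({round(math.hypot(x, y), 6) for x, y in offsets})


# Square grid with unit spacing: the 8 surrounding cells (4 edge + 4 corner).
square_neighbors = [
    (dx, dy)
    for dx in (-1, 0, 1)
    for dy in (-1, 0, 1)
    if (dx, dy) != (0, 0)
]

# Hex grid with unit center spacing: the 6 edge neighbors in axial coordinates.
AXIAL_NEIGHBORS = [(1, 0), (0, 1), (-1, 1), (-1, 0), (0, -1), (1, -1)]


def axial_to_xy(q, r):
    """Convert axial hex coordinates to planar x, y for unit center spacing."""
    return (q + r / 2, r * math.sqrt(3) / 2)


hex_neighbors = [axial_to_xy(q, r) for q, r in AXIAL_NEIGHBORS]

square_distances = distinct_neighbor_distances(square_neighbors)
hex_distances = distinct_neighbor_distances(hex_neighbors)

print(square_distances)  # [1.0, 1.414214]
print(hex_distances)     # [1.0]
```

The square grid yields two distances (`d` and `d√2`), the hex grid one, which is the property the aggregation argument relies on.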
+
+## Why H3 over S2 or geohash
+
+[S2](http://s2geometry.io/) (Google) and [geohash](https://en.wikipedia.org/wiki/Geohash) are the two most common alternatives. Both partition into squares.
+
+- **Geohash** uses a string-based base-32 encoding. Adjacent cells can have very different prefixes wherever the underlying Z-order curve jumps (for example across the equator, the prime meridian, or the antimeridian), which makes prefix range queries unreliable for neighborhood lookups. Cells also get badly distorted near the poles.
+- **S2** uses a quad-tree projected onto the six faces of a cube, with cells indexed via a Hilbert curve. The neighborhood logic is sound, but cells are still squares with anisotropic neighbor distance, and the index is not as straightforward to use from SQL.
+- **H3** is a hex grid with a 64-bit integer index, with first-class C / Python / JavaScript / SQL bindings. The DuckDB H3 extension (which we use) operates on the integer index directly — `WHERE h3_res6 = 612345...` is a fast equality scan over a sorted column.
+
+We did not benchmark S2 head-to-head against H3 for this project. The hexagon-vs-square argument plus H3's DuckDB integration plus the prior art (Eric Kansa flagged H3 in December 2025; pqg #19 added the H3 indexing CLI in February 2026) was enough.
+
+## Why resolutions 4 / 6 / 8 specifically
+
+H3 has 16 resolutions. We pre-aggregate at three of them — 4, 6, 8 — and serve each as a separate parquet file. The choice is driven by:
+
+1. **Cell size at each resolution**, from the [H3 resolution table](https://h3geo.org/docs/core-library/restable):
+
+   | Resolution | Avg edge length | Avg cell area | Roughly… |
+   |---:|---:|---:|---|
+   | 4 | 22 km | 1,770 km² | Subregion of a small country |
+   | 5 | 8.5 km | 253 km² | County |
+   | 6 | 3.2 km | 36 km² | Town |
+   | 7 | 1.2 km | 5 km² | Neighborhood |
+   | 8 | 460 m | 0.74 km² | A few city blocks |
+
+2. **Globe altitude bands**. The Cesium camera at altitude 1000 km sees roughly continental scale; at 100 km, regional; below ~10 km, neighborhood. Resolutions 4 / 6 / 8 land near the centers of those bands — odd resolutions (5, 7, 9) would also work but offer diminishing returns at the cost of an extra file to ship.
+
+3. **Parquet size budget**. The progressive globe loads the lowest-resolution tier first and reaches for higher resolution as the user zooms in. Each tier has to fit comfortably in browser memory:
+
+   | File | Resolution | Cells | Size on R2 |
+   |---|---:|---:|---:|
+   | [`isamples_202601_h3_summary_res4.parquet`](https://data.isamples.org/isamples_202601_h3_summary_res4.parquet) | 4 | ~38 K | 580 KB |
+   | [`isamples_202601_h3_summary_res6.parquet`](https://data.isamples.org/isamples_202601_h3_summary_res6.parquet) | 6 | ~112 K | 1.6 MB |
+   | [`isamples_202601_h3_summary_res8.parquet`](https://data.isamples.org/isamples_202601_h3_summary_res8.parquet) | 8 | ~176 K | 2.4 MB |
+
+   Adding res-5 and res-7 tiers would roughly triple the on-the-wire payload for a barely-perceptible improvement in cluster smoothness during zoom transitions.
+
+4. **Skip-by-two leaves obvious detail jumps**, which the renderer leans into rather than fights — the user perceives the level change as deliberate progressive disclosure rather than a stutter.
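
The size of the skip-by-two jump can be estimated from the resolution table in item 1. This is a rough sketch: the areas are the rounded values from that table, and `area_ratio` is a hypothetical helper. H3 subdivides each cell into roughly seven children, so one resolution step should shrink average area about 7x and two steps about 49x.

```python
# Average cell areas in km², copied (rounded) from the resolution table above.
AVG_AREA_KM2 = {4: 1770.0, 5: 253.0, 6: 36.0, 7: 5.0, 8: 0.74}


def area_ratio(coarse: int, fine: int) -> float:
    """How many times larger the average coarse cell is than the fine cell."""
    return AVG_AREA_KM2[coarse] / AVG_AREA_KM2[fine]


# One step (4 -> 5) versus the two-step jumps the tier files actually make.
for coarse, fine in [(4, 5), (4, 6), (6, 8)]:
    print(f"res {coarse} -> res {fine}: about {area_ratio(coarse, fine):.0f}x smaller cells")
```

A jump of roughly 49x in cell area per tier is large enough to be visible, which is consistent with treating the transition as deliberate progressive disclosure.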
+
+Past res-8 (zoom ≥ ~10, altitude < ~120 km on the globe), aggregation stops mattering: there are usually at most a few thousand individual samples in view, and we serve them as points from [`samples_map_lite.parquet`](https://data.isamples.org/isamples_202601_samples_map_lite.parquet) instead.
+
+## What this means for queries
+
+The wide parquet carries `h3_res4`, `h3_res6`, and `h3_res8` BIGINT columns (added by `pqg add-h3`; see [pqg PR #19](https://github.com/isamplesorg/pqg/pull/19)). Filtering or grouping on these columns is a sorted-integer scan, much faster than recomputing H3 cells from `(latitude, longitude)` at query time. Recomputing is not an option in the browser anyway: DuckDB-WASM doesn't ship the H3 extension, so without the precomputed columns the alternative would mean shipping every point.
+
+Two query patterns are common:
+
+```python
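+import duckdb
+
+con = duckdb.connect()  # in-memory DuckDB; remote parquet is read over HTTPS via httpfs
+
+# 1. Aggregate cells in a region (use the dedicated tier file, which is much smaller)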
+con.sql("""
+    SELECT h3_cell, sample_count, dominant_source, center_lat, center_lng
+    FROM read_parquet('https://data.isamples.org/isamples_202601_h3_summary_res6.parquet')
+    WHERE center_lat BETWEEN 30 AND 40
+      AND center_lng BETWEEN -125 AND -115
+    ORDER BY sample_count DESC
+""").df()
+```
+
+```python
+# 2. Filter the wide file to one or more H3 cells (use the precomputed column)
+con.sql("""
+    SELECT pid, label, latitude, longitude, n AS source
+    FROM read_parquet('https://data.isamples.org/current/wide.parquet')
+    WHERE h3_res6 = 612345678901234567   -- placeholder ID; substitute a real resolution-6 cell
+      AND otype = 'MaterialSampleRecord'
+    LIMIT 100
+""").df()
+```
+
+For the full schema and aggregate columns, see the [serialization catalog](/SERIALIZATIONS.md) and [data downloads](/data.qmd#h3-tier-breakpoints-for-map-authors).
+
+## What we'd revisit
+
+The current design assumes **dominant-source-per-cell** is good enough for color encoding on the globe (see the source-color legend in the Interactive Explorer). When two sources are nearly equally represented in the same cell, the rendered color hides the second one. Eric Kansa raised this in our December 2025 discussion; we accepted the simplification for the initial release and may revisit it with per-source counts if the closeout demos or the June 2026 keynote surface the issue.
+
+We also do not ship resolutions 0–3 (continental / global) or 9–15 (city-block scale down to sub-meter). The globe never zooms out far enough to need lower resolutions, and the lite parquet covers the high-resolution case better than per-cell aggregates would.
+
+## See also
+
+- [`tutorials/progressive_globe.qmd`](progressive_globe.qmd) — the tier files in action on a Cesium globe
+- [`tutorials/narrow_vs_wide_performance.qmd`](narrow_vs_wide_performance.qmd) — performance comparison across schema shapes, including the H3-augmented wide
+- [`data.qmd §4`](/data.html#h3-tier-breakpoints-for-map-authors) — the zoom-to-resolution breakpoint table
+- [`SERIALIZATIONS.md`](/SERIALIZATIONS.md) — full catalog including the H3 tier files
+- [pqg PR #19](https://github.com/isamplesorg/pqg/pull/19) — the build-time CLI that adds H3 columns and emits the tier files