diff --git a/SERIALIZATIONS.md b/SERIALIZATIONS.md
index 8cfae53..b421149 100644
--- a/SERIALIZATIONS.md
+++ b/SERIALIZATIONS.md
@@ -70,6 +70,11 @@ Source-specific variants (parallel to the substrate, not derived from it):
 oc_isamples_pqg.parquet (GCS, 11.8 M, narrow, OC-only)
 oc_isamples_pqg_wide.parquet (GCS, 2.5 M, wide, OC-only)
   └─► serve as upstream for OpenContext thumbnails folded into 202604 wide
+
+Vocabulary labels (parallel to the substrate, sourced from isamplesorg/vocabularies):
+
+vocab_labels.parquet (58 KB, 537 SKOS concepts)
+  └─► consumed by Search Explorer to render facet URIs as prefLabels
 ```
 
 Arrows indicate derivation, not containment. Every file in the left
@@ -115,6 +120,12 @@ column can be rebuilt from its parent by a script in
 | `isamples_202601_facet_summaries.parquet` | Baseline `(facet_type, facet_value, scheme, count)` | 2 KB | 56 | wide | Every tutorial (instant initial facet counts) | QUERY_SPEC §3.3 tier 1 |
 | `isamples_202601_facet_cross_filter.parquet` | Pre-computed counts for single-filter cross-facet queries | 6 KB | 526 | wide | Search Explorer cross-filter UI | QUERY_SPEC §3.3 tier 2a |
 
+### Tier: vocabulary labels
+
+| File | Role | Size | Rows | Upstream | Consumers | Spec |
+|---|---|---:|---:|---|---|---|
+| `vocab_labels.parquet` | SKOS concept URI → human-readable `pref_label` map (plus `definition`, `alt_labels`, `scheme`); covers material, sample object type, and sampled feature type vocabularies | 58 KB | 537 | `isamplesorg/vocabularies` TTLs (built by `scripts/build_vocab_labels.py`) | Search Explorer (renders facet URIs as prefLabels); any tutorial that surfaces controlled-vocabulary URIs | issue #148 |
+
 ### Tier: alternative export formats (upstream of the aggregated Zenodo export)
 
 The `export_client` can emit each source's records in multiple formats;
diff --git a/data.qmd b/data.qmd
index 22466ef..3437457 100644
--- a/data.qmd
+++ b/data.qmd
@@ -57,6 +57,7 @@ cite `https://data.isamples.org/`.
 | Aggregate map clusters by zoom | [`h3_summary_res{4,6,8}.parquet`](https://data.isamples.org/isamples_202601_h3_summary_res4.parquet) | ≤ 2.4 MB each |
 | Filter by material / context / object-type | [`sample_facets_v2.parquet`](https://data.isamples.org/isamples_202601_sample_facets_v2.parquet) | 63 MB |
 | Walk relationships (graph queries) | [`isamples_202512_narrow.parquet`](https://data.isamples.org/isamples_202512_narrow.parquet) | 820 MB |
+| Translate vocabulary URIs to human-readable labels | [`vocab_labels.parquet`](https://data.isamples.org/vocab_labels.parquet) | 58 KB |
 
 ## 3. Copy-pasteable DuckDB snippets
 
@@ -129,6 +130,21 @@ con.sql("""
 """).df()
 ```
 
+### 3.6 Vocab labels: render facet URIs as human-readable text
+
+```python
+# Join sample facets to vocabulary prefLabels so the UI shows
+# "Ceramic Clay" instead of the raw concept URI.
+con.sql("""
+    SELECT f.pid, f.label, v.pref_label AS material_label
+    FROM read_parquet('https://data.isamples.org/isamples_202601_sample_facets_v2.parquet') f
+    LEFT JOIN read_parquet('https://data.isamples.org/vocab_labels.parquet') v
+      ON f.material = v.uri
+    WHERE f.material IS NOT NULL
+    LIMIT 10
+""").df()
+```
+
 ## 4. H3 tier breakpoints (for map authors)
 
 The H3 summary files back a progressive-globe rendering pattern:
diff --git a/how-to-use.qmd b/how-to-use.qmd
index e9d530a..3bd2856 100644
--- a/how-to-use.qmd
+++ b/how-to-use.qmd
@@ -90,6 +90,7 @@ and counts instantly, without touching the 278 MB primary file:
 | [`isamples_202601_facet_summaries.parquet`](https://data.isamples.org/isamples_202601_facet_summaries.parquet) | 2 KB | `(facet_type, facet_value, count)` for source, material, context, object_type | You want instant initial facet counts with no filters applied |
 | [`isamples_202601_facet_cross_filter.parquet`](https://data.isamples.org/isamples_202601_facet_cross_filter.parquet) | 6 KB | Pre-computed counts for single-facet selections | You want instant cross-filtered counts for a single active filter |
 | [`isamples_202601_sample_facets_v2.parquet`](https://data.isamples.org/isamples_202601_sample_facets_v2.parquet) | 63 MB | `(pid, material, context, object_type)` facet URIs per sample | You need to filter on *combinations* of facets at query time |
+| [`vocab_labels.parquet`](https://data.isamples.org/vocab_labels.parquet) | 58 KB | `(uri, pref_label, definition, alt_labels, scheme)` for 537 SKOS concepts (material, sample object type, sampled feature type) | You need to render facet URIs as human-readable text |
 
 ### Geospatial aggregates (H3) {.unnumbered}
 
@@ -123,6 +124,7 @@ browsers use the parquet versions.
 | `sample_facets_v2.parquet` | ● | ● | |
 | `h3_summary_res4/6/8.parquet` | ● | | |
 | `samples_map_lite.parquet` | ● | | |
+| `vocab_labels.parquet` | ● | ● | |
 
 ### Quick query recipes {.unnumbered}
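
The `LEFT JOIN` added in section 3.6 of `data.qmd` leaves `material_label` as NULL for any facet URI that has no row in `vocab_labels.parquet`. The diff does not say whether the 537 concepts cover every URI in the facets file, so a consumer may want to fall back to the raw URI. The sketch below illustrates that fallback with `COALESCE`. It is not part of this change: it uses Python's built-in `sqlite3` as a stand-in for DuckDB so it runs offline, and the `facets` and `vocab` tables, the `example.org` URIs, and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical in-memory stand-ins for sample_facets_v2.parquet (facets)
# and vocab_labels.parquet (vocab); the real files are not read here.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE facets (pid TEXT, material TEXT);
    CREATE TABLE vocab  (uri TEXT, pref_label TEXT);
    INSERT INTO facets VALUES
        ('sample-1', 'https://example.org/vocab/material/clay'),
        ('sample-2', 'https://example.org/vocab/material/unlisted');
    INSERT INTO vocab VALUES
        ('https://example.org/vocab/material/clay', 'Clay');
""")

# Same join shape as section 3.6, but COALESCE falls back to the raw URI
# when the vocabulary has no matching concept.
rows = con.execute("""
    SELECT f.pid,
           COALESCE(v.pref_label, f.material) AS material_label
    FROM facets f
    LEFT JOIN vocab v ON f.material = v.uri
    WHERE f.material IS NOT NULL
    ORDER BY f.pid
""").fetchall()

for pid, material_label in rows:
    print(pid, material_label)
```

The same `COALESCE(v.pref_label, f.material)` expression should drop into the DuckDB query unchanged, since `COALESCE` is standard SQL; whether the Search Explorer wants a URI fallback or a blank cell is a design choice the diff leaves open.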