From 2c6ac90520d859f812f45da13d251aa11aefbbe3 Mon Sep 17 00:00:00 2001
From: Seungpyo1007 <rush94434@gmail.com>
Date: Tue, 2 Jun 2026 13:55:18 +0900
Subject: [PATCH] docs: reflect org move, weekly-refresh pipeline, and
 submodule autosync
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Update README, DATA_PIPELINE, and DEVELOPMENT to match the current state:

- TechAPI now lives at GetTechAPI/TechAPI.
- The weekly-refresh pipeline (enrich → integrity gate → dump → PR) is the
  real weekly automation; refresh-data is demoted to a dump smoke-test.
- Collection happens in-repo (ingest/enrich).
- Bidirectional submodule autosync (notify-techapi / bump-techapi).
- The workflow list is updated.
- TECHAPI_PR_TOKEN is renamed to TECHAPI_TOKEN.
---
 README.md             | 54 ++++++++++++++++++++++++++-----------------
 docs/DATA_PIPELINE.md | 27 ++++++++++++++--------
 docs/DEVELOPMENT.md   |  6 +++--
 3 files changed, 55 insertions(+), 32 deletions(-)
diff --git a/README.md b/README.md
index f28b8b0..3b32a2f 100644
--- a/README.md
+++ b/README.md
@@ -1,16 +1,16 @@
 # TechEngine
 
-> **Validation, ingestion, and serving engine for the [TechAPI](https://github.com/Seungpyo1007/TechAPI) dataset.**
+> **Validation, ingestion, and serving engine for the [TechAPI](https://github.com/GetTechAPI/TechAPI) dataset.**
 
 [![test](https://github.com/GetTechAPI/TechEngine/actions/workflows/test.yml/badge.svg)](https://github.com/GetTechAPI/TechEngine/actions/workflows/test.yml)
-&nbsp;Code: **MIT** · Data: lives in **[TechAPI](https://github.com/Seungpyo1007/TechAPI)** (CC-BY-SA 4.0)
+&nbsp;Code: **MIT** · Data: lives in **[TechAPI](https://github.com/GetTechAPI/TechAPI)** (CC-BY-SA 4.0)
 
 TechEngine owns everything *around* the data: schema validation, the FastAPI
 read API, the static JSON dump generator, the engine's own landing site, and
 (next up) automated coverage checks and a weekly ingestion crawler.
 
 The dataset and the public-facing playground site live in
-[TechAPI](https://github.com/Seungpyo1007/TechAPI) so each can be versioned,
+[TechAPI](https://github.com/GetTechAPI/TechAPI) so each can be versioned,
 mirrored, and licensed independently. The site shipped in this repo is the
 engine's own landing — what TechEngine is, what it runs, link out to docs.
 
@@ -31,33 +31,42 @@ app/
 tests/                 # unit + integration
 site/                  # Astro engine landing (deploys to Pages)
 docs/                  # SPEC / DATA_PIPELINE / DEVELOPMENT
+TechAPI/               # submodule → GetTechAPI/TechAPI (clickable @ <sha> link)
 .github/workflows/
-  ├ validate-data.yml  # workflow_call: PR-time data validation for TechAPI
-  ├ refresh-data.yml   # cron: regenerate the static dump weekly
-  ├ coverage-report.yml # cron: gap report, sticky issue
-  ├ weekly-ingest.yml  # cron: drafts new SKUs, opens PR against TechAPI
-  ├ deploy-pages.yml   # build & deploy engine site + dump
-  └ test.yml           # lint + type-check + tests
+  ├ validate-data.yml   # workflow_call: PR-time data validation for TechAPI
+  ├ weekly-refresh.yml  # cron: live-scrape → integrity gate → dump → PR to TechAPI
+  ├ weekly-ingest.yml   # cron: draft new SKUs, open PR against TechAPI
+  ├ coverage-report.yml # cron: gap report, sticky issue (TechEngine + TechAPI)
+  ├ refresh-data.yml    # smoke-test: rebuild the dump on engine (app/**) changes
+  ├ notify-techapi.yml  # push→main: ping TechAPI to bump its TechEngine submodule
+  ├ bump-techapi.yml    # dispatch: advance this repo's TechAPI submodule pointer
+  ├ deploy-pages.yml    # build & deploy engine site + dump
+  └ test.yml            # lint + type-check + tests
 ```
 
 ## How the two repos connect
 
-```
-┌────────────────────┐                ┌──────────────────────────┐
-│ TechAPI (data/)    │  workflow_call │ TechEngine (this repo)   │
-│ + bundled self-    │ ─────────────▶ │  validate-data.yml       │
-│   check (PR)       │ ◀───────────── │  (checks out TechAPI)    │
-└────────────────────┘                └──────────────────────────┘
-```
+Both repos live in the **GetTechAPI** org and each includes the other as a git
+**submodule** (a clickable `@ <sha>` pin). Three automations keep them in step:
+
+- **validate-data.yml** (`workflow_call`) — TechAPI's PR-time check calls into
+  TechEngine to validate its data.
+- **weekly-refresh.yml** — live-scrapes benchmarks, runs the full-dataset
+  integrity gate (`app.validate` + `integrity_check.py --strict`), regenerates
+  the static dump, and opens a dated refresh PR against TechAPI.
+- **Submodule autosync** — every push to TechEngine `main` fires
+  `notify-techapi.yml`, which pings TechAPI to bump its TechEngine pointer;
+  conversely `bump-techapi.yml` advances TechEngine's TechAPI pointer when
+  TechAPI changes. Bumps are loop-guarded, so each real change settles after a single bump.
 
-Every Python entry point reads data from a sibling **TechAPI checkout**. The
-location can be overridden via `TECHAPI_DATA_DIR`; the default looks for
-`../TechAPI/data` next to this repo, which matches a local dev layout.
+Every Python entry point reads data from a **TechAPI checkout**. The location
+can be overridden via `TECHAPI_DATA_DIR`; the default looks for `../TechAPI/data`
+next to this repo, which matches a local dev layout.
 
 ## Quickstart
 
 ```bash
-git clone https://github.com/Seungpyo1007/TechAPI.git ../TechAPI   # data source
+git clone https://github.com/GetTechAPI/TechAPI.git ../TechAPI   # data source
 pip install -e ".[dev]"
 python -m app.validate          # check data integrity
 python -m app.seed              # data/ → ./techapi.db (SQLite)
@@ -82,8 +91,11 @@ Spins up Postgres 16, seeds from the mounted TechAPI checkout, serves on `:8000`
   and surface missing SKUs as a sticky weekly issue
   ([#1](https://github.com/GetTechAPI/TechEngine/issues/1))
 - [x] **Weekly ingestion crawler** — scrape canonical sources and open PRs
-  against TechAPI with new SKUs (requires `TECHAPI_PR_TOKEN` secret to push)
+  against TechAPI with new SKUs (requires the `TECHAPI_TOKEN` secret to push)
   ([#2](https://github.com/GetTechAPI/TechEngine/issues/2))
+- [x] **Weekly refresh pipeline** — live benchmark enrichment → full-dataset
+  integrity gate → static dump → dated refresh PR (`weekly-refresh.yml`)
+- [x] **Bidirectional submodule autosync** between TechEngine and TechAPI
 - [ ] More sources (Intel ARK, AMD product pages, TechPowerUp DB)
 
 ## License
diff --git a/docs/DATA_PIPELINE.md b/docs/DATA_PIPELINE.md
index d2707d8..9d0ef3b 100644
--- a/docs/DATA_PIPELINE.md
+++ b/docs/DATA_PIPELINE.md
@@ -44,23 +44,32 @@ dump/v1/socs/…  /v1/gpus/…  /v1/cpus/…  /v1/brands/…
 A static consumer just fetches, e.g.
 `https://<host>/v1/smartphones/galaxy-s25/index.json`.
 
-## 3. Automated refresh (`.github/workflows/refresh-data.yml`)
+## 3. Automated refresh (`.github/workflows/weekly-refresh.yml`)
 
-A scheduled workflow (weekly cron + on `data/**` changes + manual) runs:
+The weekly pipeline (Monday cron + manual dispatch) runs the full cycle against
+a TechAPI checkout:
 
 ```
-validate seed data → generate dump → publish/commit dump if changed
+live-scrape benchmark sources → full-dataset integrity gate
+  (app.validate + integrity_check.py --strict) → regenerate static dump
+  → open a dated refresh PR against TechAPI
 ```
 
-This is the git-scraping pattern: GitHub runs and stores everything for free.
-The hosting target depends on the public/private decision (§5).
+The integrity gate re-checks the **whole** dataset every run (not just new
+rows), so a bad scrape can't slip a contaminated value past it. A lighter
+`refresh-data.yml` rebuilds the dump on engine (`app/**`) changes as a smoke
+test only. This is the git-scraping pattern — GitHub runs and stores everything
+for free — and the dated PR keeps every refresh reviewable before it lands. The
+hosting target depends on the public/private decision (§5).
 
 ## 4. Where the data comes from
 
-This repo contains only **curated, validated** records. Bulk collection and
-normalization happen **outside this repo**, through a separate internal pipeline,
-which publishes curated records here (by PR) after review (SPEC §9.3). This repo
-never needs scraping/browser dependencies.
+This repo serves **curated, validated** records, but collection now happens
+**in-repo**: `app/ingest` drafts new SKUs from upstream catalogs and
+`app/ingest/enrich` backfills benchmark columns from multiple sources (variant-safe;
+it fills only null fields and never overwrites existing values). Both run weekly and open PRs
+against TechAPI for human review before anything lands (SPEC §9.3). The curated
+dataset is a **subset, not exhaustive.**
 
 **Dataset layout (this repo).** Curated data uses singular folder names and is
 organised by brand: `data/brand/<slug>.json`, `data/soc/<manufacturer>/<slug>.json`,
diff --git a/docs/DEVELOPMENT.md b/docs/DEVELOPMENT.md
index d878d0a..a32a4fa 100644
--- a/docs/DEVELOPMENT.md
+++ b/docs/DEVELOPMENT.md
@@ -60,7 +60,9 @@ scripts/
   dump.py            DB → static JSON dump (replays API in-process) → ./dump
 tests/               unit/ + integration/ (conftest seeds a temp SQLite from data/)
 docs/                SPEC.md, DATA_PIPELINE.md, DEVELOPMENT.md
-.github/workflows/   test.yml, validate-data.yml, refresh-data.yml
+.github/workflows/   test.yml, validate-data.yml, weekly-refresh.yml, weekly-ingest.yml,
+                     coverage-report.yml, refresh-data.yml, notify-techapi.yml,
+                     bump-techapi.yml, deploy-pages.yml
 ```
 
 > Note: **data folders are singular** (`data/soc/…`) but **API routes are plural**
@@ -85,7 +87,7 @@ python -m app.dump               # generate ./dump/ static tree (gitignored)
 - **GPU activated** — model existed (§6.5); endpoints + data added.
 - **Data restructured** to singular names + brand subfolders (maintainer request).
 - **Static-dump pivot** — `app/dump.py` exports the API to a static JSON tree,
-  refreshed by GitHub Actions (`refresh-data.yml`).
+  refreshed weekly by GitHub Actions (`weekly-refresh.yml`).
 - **Scoring** is a Phase-0 reference-based approximation; Phase 1 → dataset-wide
   min-max (§8.4). Raw third-party benchmarks (Geekbench/AnTuTu/Cinebench/Time Spy)
   are stored as algorithm inputs but NOT exposed (ADR-006).