From 2c6ac90520d859f812f45da13d251aa11aefbbe3 Mon Sep 17 00:00:00 2001
From: Seungpyo1007
Date: Tue, 2 Jun 2026 13:55:18 +0900
Subject: [PATCH] docs: reflect org move, weekly-refresh pipeline, and submodule autosync
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Update README + DATA_PIPELINE + DEVELOPMENT for the current state: TechAPI now
at GetTechAPI/TechAPI, the weekly-refresh pipeline (enrich → integrity gate →
dump → PR) as the real weekly automation with refresh-data demoted to a dump
smoke-test, in-repo collection (ingest/enrich), the bidirectional submodule
autosync (notify-techapi / bump-techapi), the new workflow list, and the
TECHAPI_PR_TOKEN → TECHAPI_TOKEN rename.
---
 README.md             | 54 ++++++++++++++++++++++++++-----------------
 docs/DATA_PIPELINE.md | 27 ++++++++++++++--------
 docs/DEVELOPMENT.md   |  6 +++--
 3 files changed, 55 insertions(+), 32 deletions(-)

diff --git a/README.md b/README.md
index f28b8b0..3b32a2f 100644
--- a/README.md
+++ b/README.md
@@ -1,16 +1,16 @@
 # TechEngine
 
-> **Validation, ingestion, and serving engine for the [TechAPI](https://github.com/Seungpyo1007/TechAPI) dataset.**
+> **Validation, ingestion, and serving engine for the [TechAPI](https://github.com/GetTechAPI/TechAPI) dataset.**
 
 [![test](https://github.com/GetTechAPI/TechEngine/actions/workflows/test.yml/badge.svg)](https://github.com/GetTechAPI/TechEngine/actions/workflows/test.yml)
 
- Code: **MIT** · Data: lives in **[TechAPI](https://github.com/Seungpyo1007/TechAPI)** (CC-BY-SA 4.0)
+ Code: **MIT** · Data: lives in **[TechAPI](https://github.com/GetTechAPI/TechAPI)** (CC-BY-SA 4.0)
 
 TechEngine owns everything *around* the data: schema validation, the FastAPI
 read API, the static JSON dump generator, the engine's own landing site, and
 (next up) automated coverage checks and a weekly ingestion crawler. The dataset
 and the public-facing playground site live in
-[TechAPI](https://github.com/Seungpyo1007/TechAPI) so each can be versioned,
+[TechAPI](https://github.com/GetTechAPI/TechAPI) so each can be versioned,
 mirrored, and licensed independently. The site shipped in this repo is the
 engine's own landing — what TechEngine is, what it runs, link out to docs.
 
@@ -31,33 +31,42 @@ app/
 tests/ # unit + integration
 site/ # Astro engine landing (deploys to Pages)
 docs/ # SPEC / DATA_PIPELINE / DEVELOPMENT
+TechAPI/ # submodule → GetTechAPI/TechAPI (clickable @ link)
 .github/workflows/
- ├ validate-data.yml # workflow_call: PR-time data validation for TechAPI
- ├ refresh-data.yml # cron: regenerate the static dump weekly
- ├ coverage-report.yml # cron: gap report, sticky issue
- ├ weekly-ingest.yml # cron: drafts new SKUs, opens PR against TechAPI
- ├ deploy-pages.yml # build & deploy engine site + dump
- └ test.yml # lint + type-check + tests
+ ├ validate-data.yml # workflow_call: PR-time data validation for TechAPI
+ ├ weekly-refresh.yml # cron: live-scrape → integrity gate → dump → PR to TechAPI
+ ├ weekly-ingest.yml # cron: draft new SKUs, open PR against TechAPI
+ ├ coverage-report.yml # cron: gap report, sticky issue (TechEngine + TechAPI)
+ ├ refresh-data.yml # smoke-test: rebuild the dump on engine (app/**) changes
+ ├ notify-techapi.yml # push→main: ping TechAPI to bump its TechEngine submodule
+ ├ bump-techapi.yml # dispatch: advance this repo's TechAPI submodule pointer
+ ├ deploy-pages.yml # build & deploy engine site + dump
+ └ test.yml # lint + type-check + tests
 ```
 
 ## How the two repos connect
 
-```
-┌────────────────────┐                ┌──────────────────────────┐
-│ TechAPI (data/)    │ workflow_call  │ TechEngine (this repo)   │
-│ + bundled self-    │ ─────────────▶ │ validate-data.yml        │
-│   check (PR)       │ ◀───────────── │ (checks out TechAPI)     │
-└────────────────────┘                └──────────────────────────┘
-```
+Both repos live in the **GetTechAPI** org and each includes the other as a git
+**submodule** (a clickable `@ ` pin). Three automations keep them in step:
+
+- **validate-data.yml** (`workflow_call`) — TechAPI's PR-time check calls into
+  TechEngine to validate its data.
+- **weekly-refresh.yml** — live-scrapes benchmarks, runs the full-dataset
+  integrity gate (`app.validate` + `integrity_check.py --strict`), regenerates
+  the static dump, and opens a dated refresh PR against TechAPI.
+- **Submodule autosync** — every push to TechEngine `main` fires
+  `notify-techapi.yml`, which pings TechAPI to bump its TechEngine pointer;
+  conversely `bump-techapi.yml` advances TechEngine's TechAPI pointer when
+  TechAPI changes. Bumps are loop-guarded, so each real change converges to one.
 
-Every Python entry point reads data from a sibling **TechAPI checkout**. The
-location can be overridden via `TECHAPI_DATA_DIR`; the default looks for
-`../TechAPI/data` next to this repo, which matches a local dev layout.
+Every Python entry point reads data from a **TechAPI checkout**. The location
+can be overridden via `TECHAPI_DATA_DIR`; the default looks for `../TechAPI/data`
+next to this repo, which matches a local dev layout.
 
 ## Quickstart
 
 ```bash
-git clone https://github.com/Seungpyo1007/TechAPI.git ../TechAPI # data source
+git clone https://github.com/GetTechAPI/TechAPI.git ../TechAPI # data source
 pip install -e ".[dev]"
 python -m app.validate # check data integrity
 python -m app.seed # data/ → ./techapi.db (SQLite)
@@ -82,8 +91,11 @@ Spins up Postgres 16, seeds from the mounted TechAPI checkout, serves on `:8000`
   and surface missing SKUs as a sticky weekly issue
   ([#1](https://github.com/GetTechAPI/TechEngine/issues/1))
 - [x] **Weekly ingestion crawler** — scrape canonical sources and open PRs
-  against TechAPI with new SKUs (requires `TECHAPI_PR_TOKEN` secret to push)
+  against TechAPI with new SKUs (requires the `TECHAPI_TOKEN` secret to push)
   ([#2](https://github.com/GetTechAPI/TechEngine/issues/2))
+- [x] **Weekly refresh pipeline** — live benchmark enrichment → full-dataset
+  integrity gate → static dump → dated refresh PR (`weekly-refresh.yml`)
+- [x] **Bidirectional submodule autosync** between TechEngine and TechAPI
 - [ ] More sources (Intel ARK, AMD product pages, TechPowerUp DB)
 
 ## License
diff --git a/docs/DATA_PIPELINE.md b/docs/DATA_PIPELINE.md
index d2707d8..9d0ef3b 100644
--- a/docs/DATA_PIPELINE.md
+++ b/docs/DATA_PIPELINE.md
@@ -44,23 +44,32 @@ dump/v1/socs/… /v1/gpus/… /v1/cpus/… /v1/brands/…
 A static consumer just fetches, e.g.
 `https:///v1/smartphones/galaxy-s25/index.json`.
 
-## 3. Automated refresh (`.github/workflows/refresh-data.yml`)
+## 3. Automated refresh (`.github/workflows/weekly-refresh.yml`)
 
-A scheduled workflow (weekly cron + on `data/**` changes + manual) runs:
+The weekly pipeline (Monday cron + manual dispatch) runs the full cycle against
+a TechAPI checkout:
 
 ```
-validate seed data → generate dump → publish/commit dump if changed
+live-scrape benchmark sources → full-dataset integrity gate
+  (app.validate + integrity_check.py --strict) → regenerate static dump
+  → open a dated refresh PR against TechAPI
 ```
 
-This is the git-scraping pattern: GitHub runs and stores everything for free.
-The hosting target depends on the public/private decision (§5).
+The integrity gate re-checks the **whole** dataset every run (not just new
+rows), so a bad scrape can't slip a contaminated value past it. A lighter
+`refresh-data.yml` rebuilds the dump on engine (`app/**`) changes as a smoke
+test only. This is the git-scraping pattern — GitHub runs and stores everything
+for free — and the dated PR keeps every refresh reviewable before it lands. The
+hosting target depends on the public/private decision (§5).
 
 ## 4. Where the data comes from
 
-This repo contains only **curated, validated** records. Bulk collection and
-normalization happen **outside this repo**, through a separate internal pipeline,
-which publishes curated records here (by PR) after review (SPEC §9.3). This repo
-never needs scraping/browser dependencies.
+This repo serves **curated, validated** records, but collection now happens
+**in-repo**: `app/ingest` drafts new SKUs from upstream catalogs and
+`app/ingest/enrich` backfills benchmark columns from multiple sources
+(variant-safe, fill-only-nulls, never overwrites). Both run weekly and open PRs
+against TechAPI for human review before anything lands (SPEC §9.3). The curated
+dataset is a **subset, not exhaustive.**
 
 **Dataset layout (this repo).** Curated data uses singular folder names and is
 organised by brand: `data/brand/.json`, `data/soc//.json`,
diff --git a/docs/DEVELOPMENT.md b/docs/DEVELOPMENT.md
index d878d0a..a32a4fa 100644
--- a/docs/DEVELOPMENT.md
+++ b/docs/DEVELOPMENT.md
@@ -60,7 +60,9 @@ scripts/
  dump.py DB → static JSON dump (replays API in-process) → ./dump
 tests/ unit/ + integration/ (conftest seeds a temp SQLite from data/)
 docs/ SPEC.md, DATA_PIPELINE.md, DEVELOPMENT.md
-.github/workflows/ test.yml, validate-data.yml, refresh-data.yml
+.github/workflows/ test.yml, validate-data.yml, weekly-refresh.yml, weekly-ingest.yml,
+                   coverage-report.yml, refresh-data.yml, notify-techapi.yml,
+                   bump-techapi.yml, deploy-pages.yml
 ```
 
 > Note: **data folders are singular** (`data/soc/…`) but **API routes are plural**
@@ -85,7 +87,7 @@ python -m app.dump # generate ./dump/ static tree (gitignored)
 - **GPU activated** — model existed (§6.5); endpoints + data added.
 - **Data restructured** to singular names + brand subfolders (maintainer request).
 - **Static-dump pivot** — `app/dump.py` exports the API to a static JSON tree,
-  refreshed by GitHub Actions (`refresh-data.yml`).
+  refreshed weekly by GitHub Actions (`weekly-refresh.yml`).
 - **Scoring** is a Phase-0 reference-based approximation; Phase 1 → dataset-wide
   min-max (§8.4). Raw third-party benchmarks (Geekbench/AnTuTu/Cinebench/Time Spy)
   are stored as algorithm inputs but NOT exposed (ADR-006).
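
Reviewer note on the `docs/DATA_PIPELINE.md` hunk: the new §4 text describes enrichment as "fill-only-nulls, never overwrites". A minimal sketch of that merge rule, assuming records are plain dicts, might look like the block below. The function name `fill_only_nulls` and the field names are hypothetical and are not taken from `app/ingest/enrich`; treating a missing key the same as a null is also an assumption.

```python
from typing import Any


def fill_only_nulls(record: dict[str, Any], scraped: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `record` where scraped values fill only null or missing columns.

    Hypothetical helper for illustration; the real enrich step may differ.
    """
    merged = dict(record)  # work on a copy so the curated record is never mutated
    for key, value in scraped.items():
        if value is None:
            continue  # the scrape found nothing for this column
        if merged.get(key) is None:  # assumption: missing key counts as null
            merged[key] = value  # backfill; existing curated values are left alone
    return merged


# Hypothetical record and scrape result, for illustration only.
curated = {"slug": "example-soc", "geekbench_single": 2100, "geekbench_multi": None}
scraped = {"geekbench_single": 1999, "geekbench_multi": 6400, "antutu": None}
result = fill_only_nulls(curated, scraped)
```

Here `geekbench_single` keeps its curated value, `geekbench_multi` is backfilled from the scrape, and the null `antutu` is skipped. Returning a copy rather than mutating in place fits the reviewable-PR flow the hunk describes, since the original record stays available for diffing.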