Merged
54 changes: 33 additions & 21 deletions README.md
@@ -1,16 +1,16 @@
# TechEngine

> **Validation, ingestion, and serving engine for the [TechAPI](https://github.com/Seungpyo1007/TechAPI) dataset.**
> **Validation, ingestion, and serving engine for the [TechAPI](https://github.com/GetTechAPI/TechAPI) dataset.**

[![test](https://github.com/GetTechAPI/TechEngine/actions/workflows/test.yml/badge.svg)](https://github.com/GetTechAPI/TechEngine/actions/workflows/test.yml)
 Code: **MIT** · Data: lives in **[TechAPI](https://github.com/Seungpyo1007/TechAPI)** (CC-BY-SA 4.0)
 Code: **MIT** · Data: lives in **[TechAPI](https://github.com/GetTechAPI/TechAPI)** (CC-BY-SA 4.0)

TechEngine owns everything *around* the data: schema validation, the FastAPI
read API, the static JSON dump generator, the engine's own landing site, and
(next up) automated coverage checks and a weekly ingestion crawler.

The dataset and the public-facing playground site live in
[TechAPI](https://github.com/Seungpyo1007/TechAPI) so each can be versioned,
[TechAPI](https://github.com/GetTechAPI/TechAPI) so each can be versioned,
mirrored, and licensed independently. The site shipped in this repo is the
engine's own landing page: what TechEngine is, what it runs, and links out to the docs.

@@ -31,33 +31,42 @@ app/
tests/ # unit + integration
site/ # Astro engine landing (deploys to Pages)
docs/ # SPEC / DATA_PIPELINE / DEVELOPMENT
TechAPI/ # submodule → GetTechAPI/TechAPI (clickable @ <sha> link)
.github/workflows/
├ validate-data.yml # workflow_call: PR-time data validation for TechAPI
├ refresh-data.yml # cron: regenerate the static dump weekly
├ coverage-report.yml # cron: gap report, sticky issue
├ weekly-ingest.yml # cron: drafts new SKUs, opens PR against TechAPI
├ deploy-pages.yml # build & deploy engine site + dump
└ test.yml # lint + type-check + tests
├ validate-data.yml # workflow_call: PR-time data validation for TechAPI
├ weekly-refresh.yml # cron: live-scrape → integrity gate → dump → PR to TechAPI
├ weekly-ingest.yml # cron: draft new SKUs, open PR against TechAPI
├ coverage-report.yml # cron: gap report, sticky issue (TechEngine + TechAPI)
├ refresh-data.yml # smoke-test: rebuild the dump on engine (app/**) changes
├ notify-techapi.yml # push→main: ping TechAPI to bump its TechEngine submodule
├ bump-techapi.yml # dispatch: advance this repo's TechAPI submodule pointer
├ deploy-pages.yml # build & deploy engine site + dump
└ test.yml # lint + type-check + tests
```

## How the two repos connect

```
┌────────────────────┐ ┌──────────────────────────┐
│ TechAPI (data/) │ workflow_call │ TechEngine (this repo) │
│ + bundled self- │ ─────────────▶ │ validate-data.yml │
│ check (PR) │ ◀───────────── │ (checks out TechAPI) │
└────────────────────┘ └──────────────────────────┘
```
Both repos live in the **GetTechAPI** org and each includes the other as a git
**submodule** (a clickable `@ <sha>` pin). Three automations keep them in step:

- **validate-data.yml** (`workflow_call`) — TechAPI's PR-time check calls into
TechEngine to validate its data.
- **weekly-refresh.yml** — live-scrapes benchmarks, runs the full-dataset
integrity gate (`app.validate` + `integrity_check.py --strict`), regenerates
the static dump, and opens a dated refresh PR against TechAPI.
- **Submodule autosync** — every push to TechEngine `main` fires
`notify-techapi.yml`, which pings TechAPI to bump its TechEngine pointer;
conversely `bump-techapi.yml` advances TechEngine's TechAPI pointer when
TechAPI changes. Bumps are loop-guarded, so each real change settles after a
single bump instead of bouncing between the two repos.

Every Python entry point reads data from a sibling **TechAPI checkout**. The
location can be overridden via `TECHAPI_DATA_DIR`; the default looks for
`../TechAPI/data` next to this repo, which matches a local dev layout.
Every Python entry point reads data from a **TechAPI checkout**. The location
can be overridden via `TECHAPI_DATA_DIR`; the default looks for `../TechAPI/data`
next to this repo, which matches a local dev layout.
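
The lookup order described above can be sketched roughly as follows. This is a minimal illustration, not the engine's actual code: the helper name `resolve_data_dir` and its `repo_root` and `env` parameters are hypothetical. Only the `TECHAPI_DATA_DIR` variable and the `../TechAPI/data` default come from this README.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_dir(repo_root: Path, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the TechAPI data directory (hypothetical helper, for illustration).

    An explicit TECHAPI_DATA_DIR wins; otherwise fall back to a sibling
    checkout at ../TechAPI/data relative to the engine repo.
    """
    env = os.environ if env is None else env
    override = env.get("TECHAPI_DATA_DIR")
    if override:
        return Path(override)
    # Assumed default layout: <parent>/TechEngine and <parent>/TechAPI side by side.
    return repo_root.parent / "TechAPI" / "data"
```

Passing `env` explicitly keeps the helper easy to test. In normal use it would simply read `os.environ`.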

## Quickstart

```bash
git clone https://github.com/Seungpyo1007/TechAPI.git ../TechAPI # data source
git clone https://github.com/GetTechAPI/TechAPI.git ../TechAPI # data source
pip install -e ".[dev]"
python -m app.validate # check data integrity
python -m app.seed # data/ → ./techapi.db (SQLite)
@@ -82,8 +91,11 @@ Spins up Postgres 16, seeds from the mounted TechAPI checkout, serves on `:8000`
and surface missing SKUs as a sticky weekly issue
([#1](https://github.com/GetTechAPI/TechEngine/issues/1))
- [x] **Weekly ingestion crawler** — scrape canonical sources and open PRs
against TechAPI with new SKUs (requires `TECHAPI_PR_TOKEN` secret to push)
against TechAPI with new SKUs (requires the `TECHAPI_TOKEN` secret to push)
([#2](https://github.com/GetTechAPI/TechEngine/issues/2))
- [x] **Weekly refresh pipeline** — live benchmark enrichment → full-dataset
integrity gate → static dump → dated refresh PR (`weekly-refresh.yml`)
- [x] **Bidirectional submodule autosync** between TechEngine and TechAPI
- [ ] More sources (Intel ARK, AMD product pages, TechPowerUp DB)

## License
27 changes: 18 additions & 9 deletions docs/DATA_PIPELINE.md
@@ -44,23 +44,32 @@ dump/v1/socs/… /v1/gpus/… /v1/cpus/… /v1/brands/…
A static consumer just fetches, e.g.
`https://<host>/v1/smartphones/galaxy-s25/index.json`.
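
As a rough sketch of what a static consumer does, the URL can be assembled from three parts. The helper name `dump_url` is hypothetical, and the `/v1/<collection>/<slug>/index.json` pattern is inferred from the single example above rather than taken from a published contract.

```python
from urllib.parse import quote


def dump_url(host: str, collection: str, slug: str) -> str:
    """Build the URL of one record in the static dump (hypothetical helper).

    Assumes the layout /v1/<collection>/<slug>/index.json seen in the example.
    """
    # Percent-encode each path segment so an unusual slug cannot break the path.
    return (
        f"https://{host}/v1/"
        f"{quote(collection, safe='')}/{quote(slug, safe='')}/index.json"
    )
```

The resulting string can be passed to any HTTP client, since the dump is plain JSON files served over HTTPS.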

## 3. Automated refresh (`.github/workflows/refresh-data.yml`)
## 3. Automated refresh (`.github/workflows/weekly-refresh.yml`)

A scheduled workflow (weekly cron + on `data/**` changes + manual) runs:
The weekly pipeline (Monday cron + manual dispatch) runs the full cycle against
a TechAPI checkout:

```
validate seed data → generate dump → publish/commit dump if changed
live-scrape benchmark sources → full-dataset integrity gate
(app.validate + integrity_check.py --strict) → regenerate static dump
→ open a dated refresh PR against TechAPI
```

This is the git-scraping pattern: GitHub runs and stores everything for free.
The hosting target depends on the public/private decision (§5).
The integrity gate re-checks the **whole** dataset every run (not just new
rows), so a bad scrape can't slip a contaminated value past it. A lighter
`refresh-data.yml` rebuilds the dump on engine (`app/**`) changes as a smoke
test only. This is the git-scraping pattern — GitHub runs and stores everything
for free — and the dated PR keeps every refresh reviewable before it lands. The
hosting target depends on the public/private decision (§5).
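
The whole-dataset gate can be pictured with a small sketch. It is an illustration under stated assumptions, not the logic of `app.validate` or `integrity_check.py`: the names `integrity_gate`, `GateFailure`, and the `is_valid` callback are hypothetical, and records are modelled as plain dicts with a `slug` key.

```python
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]


class GateFailure(Exception):
    """Raised when at least one record fails the integrity check."""


def integrity_gate(dataset: Iterable[Record], is_valid: Callable[[Record], bool]) -> int:
    """Check every record, old and new alike; return how many were checked.

    Raising instead of returning a flag means the later steps
    (dump regeneration, PR creation) are never reached on failure.
    """
    checked = 0
    failing: List[str] = []
    for record in dataset:
        checked += 1
        if not is_valid(record):
            failing.append(str(record.get("slug", "<unknown>")))
    if failing:
        raise GateFailure("integrity gate failed for: " + ", ".join(failing))
    return checked
```

Because the loop runs over the full dataset, a scrape that corrupts an existing row is caught the same way as a bad new row.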

## 4. Where the data comes from

This repo contains only **curated, validated** records. Bulk collection and
normalization happen **outside this repo**, through a separate internal pipeline,
which publishes curated records here (by PR) after review (SPEC §9.3). This repo
never needs scraping/browser dependencies.
This repo serves **curated, validated** records, but collection now happens
**in-repo**: `app/ingest` drafts new SKUs from upstream catalogs and
`app/ingest/enrich` backfills benchmark columns from multiple sources
(variant-safe, fill-only-nulls, never overwrites). Both run weekly and open PRs
against TechAPI for human review before anything lands (SPEC §9.3). The curated
dataset is a **subset, not exhaustive.**
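
The fill-only-nulls rule can be sketched as below. This is a minimal illustration, not the code in `app/ingest/enrich`: the function name `fill_only_nulls` and the field names in the usage example are hypothetical, and the variant-safety check is not modelled here.

```python
from typing import Any, Dict


def fill_only_nulls(record: Dict[str, Any], scraped: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `record` with empty fields filled from `scraped`.

    A field counts as empty when it is missing or None. Fields that already
    hold a value are never overwritten, whatever the scrape says.
    """
    merged = dict(record)  # copy, so the curated record itself is untouched
    for key, value in scraped.items():
        if value is not None and merged.get(key) is None:
            merged[key] = value
    return merged


# Hypothetical field names, for illustration only.
curated = {"slug": "example-soc", "bench_single": None, "bench_multi": 9000}
scraped = {"bench_single": 2900, "bench_multi": 1}
enriched = fill_only_nulls(curated, scraped)
```

Here `bench_single` is backfilled while the existing `bench_multi` value survives, so a noisy scrape can only add data, never replace curated values.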

**Dataset layout (this repo).** Curated data uses singular folder names and is
organised by brand: `data/brand/<slug>.json`, `data/soc/<manufacturer>/<slug>.json`,
6 changes: 4 additions & 2 deletions docs/DEVELOPMENT.md
@@ -60,7 +62,9 @@ scripts/
dump.py DB → static JSON dump (replays API in-process) → ./dump
tests/ unit/ + integration/ (conftest seeds a temp SQLite from data/)
docs/ SPEC.md, DATA_PIPELINE.md, DEVELOPMENT.md
.github/workflows/ test.yml, validate-data.yml, refresh-data.yml
.github/workflows/ test.yml, validate-data.yml, weekly-refresh.yml, weekly-ingest.yml,
coverage-report.yml, refresh-data.yml, notify-techapi.yml,
bump-techapi.yml, deploy-pages.yml
```

> Note: **data folders are singular** (`data/soc/…`) but **API routes are plural**
@@ -85,7 +87,7 @@ python -m app.dump # generate ./dump/ static tree (gitignored)
- **GPU activated** — model existed (§6.5); endpoints + data added.
- **Data restructured** to singular names + brand subfolders (maintainer request).
- **Static-dump pivot** — `app/dump.py` exports the API to a static JSON tree,
refreshed by GitHub Actions (`refresh-data.yml`).
refreshed weekly by GitHub Actions (`weekly-refresh.yml`).
- **Scoring** is a Phase-0 reference-based approximation; Phase 1 → dataset-wide
min-max (§8.4). Raw third-party benchmarks (Geekbench/AnTuTu/Cinebench/Time Spy)
are stored as algorithm inputs but NOT exposed (ADR-006).