Auto-update data + charts 2026-06-12#85

Open
github-actions[bot] wants to merge 9 commits into main from auto/weekly-2026-06-12

@github-actions

Automated weekly data and chart refresh.

nesanders and others added 9 commits June 8, 2026 22:28
Data pipeline scripts:
- get_MA_lobbying.py — full MA SoS lobbying disclosure scraper (2005–present)
- get_MA_legislature_bills.py — incremental Legislature API bill text fetcher
- score_lobbying_bills.py — Gemini embedding + env relevance scoring
  (fixes H/S chamber collision dedup bug; derive bill_id from chamber prefix)
- summarize_lobbying_bills.py — parallel LLM summary + taxonomy tagging
  (actual cost: $0.627/1k bills; documented in script header)
- cluster_lobbying_bills.py — k-means clustering on bill embeddings
- assemble_db.py — entity normalization, bill_id join key, lobbying tables
- validate_data.py — lobbying schema checks
- generate_semantic_context.py — lobbying table descriptions + join docs
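
The H/S collision fix noted for score_lobbying_bills.py can be pictured with a minimal sketch. It assumes each bill row carries a session, a chamber, and a numeric bill_number, and that bill_id is the chamber prefix followed by the number; the field names, the key format, and the sample rows are illustrative guesses, not the script's actual schema.

```python
from typing import Dict, Iterable, List


def make_bill_id(chamber: str, bill_number: int) -> str:
    """Build a key that keeps House and Senate bills distinct (assumed format)."""
    prefix = chamber.strip().upper()[:1]
    if prefix not in {"H", "S"}:
        raise ValueError(f"unexpected chamber: {chamber!r}")
    return f"{prefix}{bill_number}"


def dedup_bills(rows: Iterable[Dict]) -> List[Dict]:
    """Keep the first row per (session, bill_id) instead of per bill_number."""
    seen = set()
    kept = []
    for row in rows:
        bill_id = make_bill_id(row["chamber"], row["bill_number"])
        key = (row["session"], bill_id)
        if key in seen:
            continue  # true duplicate: same session, chamber, and number
        seen.add(key)
        kept.append({**row, "bill_id": bill_id})
    return kept


# Made-up rows: H100 and S100 share a number but are different bills.
rows = [
    {"session": 194, "chamber": "House", "bill_number": 100, "title": "A"},
    {"session": 194, "chamber": "Senate", "bill_number": 100, "title": "B"},
    {"session": 194, "chamber": "House", "bill_number": 100, "title": "A again"},
]
deduped = dedup_bills(rows)
```

Deduplicating on bill_number alone would have collapsed the first two rows into one; keying on the derived bill_id keeps both and drops only the third.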

CI workflows:
- update-weekly.yml (new): replaces update-data.yml as the weekly job;
  runs full pipeline including lobbying incremental fetch + scoring
- update-data.yml / update-charts.yml: refactored as self-contained;
  no cross-workflow dispatch (GITHUB_TOKEN can't trigger workflow_dispatch)

Infrastructure:
- requirements-ci.txt: add scikit-learn, bump joblib, add gcsfs/pyarrow
- CLAUDE.md: document Gemini cost actuals, lobbying known issues
- .gitignore: add lobbying secret and cache dirs

Data:
- docs/data/: lobbying sample CSVs, legislature sample, cluster labels,
  embedding model artifacts, timestamps, facts_lobbying.yml
- docs/assets/db_semantic_context.txt: lobbying + legislature table context

Dashboard charts (dashboard_charts.py: graceful try/except for lobbying
import until MA_lobbying_viz.py lands in a follow-up PR):
- Non-lobbying dash chart refreshes: MADEP enforcement, MAEEADP CSO/EJ

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
employer_name was renamed to entity_name during entity normalization
refactor in assemble_db.py. Update expected schema in validate_data.py
and regenerate data_stats.yml to include lobbying/legislature row counts.
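
A rough sketch of the kind of expected-schema check this commit touches. The table name, the second column, and the helper name are hypothetical; only the employer_name to entity_name rename comes from the commit itself.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical expected schema; only "entity_name" is taken from the commit.
EXPECTED_COLUMNS: Dict[str, Set[str]] = {
    "MA_lobbying_employers": {"entity_name", "year"},
}


def missing_columns(table: str, actual: Iterable[str]) -> List[str]:
    """Return expected columns that are absent from the actual header."""
    return sorted(EXPECTED_COLUMNS[table] - set(actual))


# A file written before the rename fails the check; one written after passes.
stale = missing_columns("MA_lobbying_employers", ["employer_name", "year"])
fresh = missing_columns("MA_lobbying_employers", ["entity_name", "year"])
```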

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
These were one-time experiment/calibration scripts introduced during
lobbying bill embedding and tagging development. Not part of the
ongoing pipeline.

  cluster_pilot_summaries.py  — clustering pilot experiments
  diagnostics_summarize.py    — summarization diagnostics
  fill_summary_embeddings.py  — one-time summary embedding backfill (done)
  test_bill_embedding_pipeline.py — embedding quality iteration tests
  test_concat_embeddings.py       — concat embedding method tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rent state

README_lobbying.md:
- Add summarize_lobbying_bills.py as section 5 (manual, not in CI)
- Fix cluster_lobbying_bills.py: now runs --incremental in weekly CI
- Fix score script cost figure (was wildly wrong; now $0.00015/bill)
- Fix text prep: 3000 chars not 2000
- Document H/S dedup fix and correct bill_id derivation
- Update corpus size to 33,159 bills / 924 environmental

NOTES_bill_embeddings.md:
- Mark GC bug as FIXED (FIRST_GC_START_YEAR 2005→2003)
- Document H/S collision fix as FIXED (dedup on bill_id not bill_number)
- Update env bill count: 329 → 654 → 924 across three pipeline fixes
- Update corpus size to 33,159 bills (June 2026)
- Note that all 33k bills now have LLM summaries/tags
- Correct score distribution table with post-backfill figures
- Update t-SNE note to 924 env bills (was 329)
- Clean up duplicate section 6 heading
- Remove stale UMAP reference (we use t-SNE)
- Condense "Recommended next steps" → "Remaining known limitations"
  (GC fix and summarization are done; only title-only re-embed,
  threshold recalibration, and lookup quality remain)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ter run

Without this, every CI run starts with no state files (all three lobbying CSVs
are gitignored) and re-scrapes all 22 years of history, which takes hours.

Changes:
- Pull MA_lobbying_summary_links.csv, MA_lobbying_bills.csv, and
  MA_lobbying_employers.csv from GCS at startup if not present locally
- Push MA_lobbying_summary_links.csv back to GCS at end of every run
  (MA_lobbying_bills.csv and MA_lobbying_employers.csv are already
  uploaded by assemble_db.py)
- Add import os

Also uploaded current local links file to GCS so next run is incremental.
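
The restore-if-missing step could look roughly like the sketch below. The download callable is a stand-in for whatever GCS client the script uses, and the function name is an assumption; the three file names are the ones listed above.

```python
import os
import tempfile
from typing import Callable, Iterable, List

STATE_FILES = (
    "MA_lobbying_summary_links.csv",
    "MA_lobbying_bills.csv",
    "MA_lobbying_employers.csv",
)


def restore_missing(
    local_dir: str,
    download: Callable[[str, str], None],
    files: Iterable[str] = STATE_FILES,
) -> List[str]:
    """Fetch each state file only if there is no local copy yet."""
    restored = []
    for name in files:
        local_path = os.path.join(local_dir, name)
        if os.path.exists(local_path):
            continue  # a local copy wins; never overwrite it
        download(name, local_path)
        restored.append(name)
    return restored


# Demo with a fake downloader; a real run would copy from the bucket instead.
def fake_download(name: str, local_path: str) -> None:
    with open(local_path, "w") as fh:
        fh.write("restored\n")


with tempfile.TemporaryDirectory() as tmp:
    # Pretend one state file already exists locally.
    with open(os.path.join(tmp, "MA_lobbying_bills.csv"), "w") as fh:
        fh.write("local\n")
    restored = restore_missing(tmp, fake_download)
    with open(os.path.join(tmp, "MA_lobbying_bills.csv")) as fh:
        kept_local = fh.read()
```

Passing the downloader in as a parameter keeps the skip logic testable without network access.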

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…window

Root cause: even with correct state from GCS, the script was fetching
a Summary.aspx page for every one of ~1,700 registrants per year just
to check for new disc_urls. At 1.0s/request × 1,700 × 2 years = ~57 min
— well over the 30-min CI timeout.

Two fixes:
1. Skip summary-page fetches for registrants already in existing_links
   for past (closed) years. Filing periods close ~6 months after year end
   so their disc_urls won't change. Only truly new registrants in past
   years, and all registrants in the current year, are checked.
   Effect: prior year (2025) drops from ~1,700 requests to ~0-50.

2. Lower REQUEST_DELAY from 1.0s to 0.3s — safe for this low-volume
   SoS server; only affects summary and disclosure page GETs.
   Effect: current-year scan (2026) drops from ~28 min to ~9 min.

Combined expected runtime: ~9 min (vs 57 min before).
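
The estimates above follow from simple arithmetic, reproduced here as a sketch. It counts only the configured per-request delay and ignores server response time; the function name is illustrative.

```python
def scan_minutes(registrants: int, years: int, seconds_per_request: float) -> float:
    """Minutes spent on summary-page requests, counting only the fixed delay."""
    return registrants * years * seconds_per_request / 60


before = scan_minutes(1700, 2, 1.0)            # about 56.7 min, two full years
current_year_old = scan_minutes(1700, 1, 1.0)  # about 28.3 min at 1.0s delay
current_year_new = scan_minutes(1700, 1, 0.3)  # 8.5 min at 0.3s delay
```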

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…window

Previous fix (skip known prior-year registrants) was necessary but not
sufficient: current-year scan of ~1,700 registrants at 0.3s delay +
~0.5-1s server response ≈ 37 min still exceeds 30-min timeout.

Additional fix: also skip known current-year registrants during Jan–Sep, i.e.
outside the H2 filing window. MA lobbying has two semi-annual periods:
  H1: Jan–Jun, disclosures due ~Jul 15
  H2: Jul–Dec, disclosures due ~Jan 15 of following year
Outside the H2 window, a registrant who already has a disc_url for the
current year won't have a new one until October. Only genuinely new
registrants (first-time filers, not yet in existing_links) are checked.

Expected runtime after both fixes:
  Jan–Sep (pre-H2): only new/unknown registrants → ~2–5 min
  Oct–Dec (H2 window): all ~1,700 current-year registrants → ~25 min

Also bump lobbying step timeout-minutes from 30 → 45 as headroom for
the H2 window full scan and slow server responses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… from GCS

Review of the previous incremental fix found two real bugs and one gap:

1. DATA LOSS: the prior skip rule permanently skipped known prior-year
   registrants. H2 disclosures for year Y are filed ~Jan 15 of Y+1, so
   every H2 filing from a registrant who had already filed H1 would have
   been silently missed starting next January.

2. WASTED SCANS: registrants were only recorded in the links CSV after a
   disclosure was fetched. Before July, current-year registrants have no
   disclosures, so all ~1,700 pages were re-scanned every weekly run
   (the observed 40-min runs that found nothing).

3. WRONG ASSUMPTION: a disclosure-count cutoff (max 2 per year) cannot
   work — ~11% of registrant-years have 3–10 disclosure URLs due to
   amendments, which cluster around the filing deadlines.

New model in get_MA_lobbying.py:
- links CSV gains a last_checked column; visited pages with no
  disclosures get a marker row (null disc_url)
- a page is re-checked only inside a filing window
  (deadline − 14d → deadline + 60d; deadlines Jul 15 Y and Jan 15 Y+1)
  plus one closing sweep after each window
- a year is skipped wholesale before Jul 1 (H1 period not closed)
- state syncs to GCS every 200 pages and at run end, data files first
  and the links index LAST, so a timed-out run makes durable progress
  and can never index a disclosure whose data wasn't uploaded
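
One possible reading of the re-check rule, as a minimal sketch. The function names and the exact closing-sweep condition are assumptions; the deadlines, the −14d/+60d window, the Jul 1 cutoff, and the last_checked stamp come from the description above.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple

WINDOW_LEAD = timedelta(days=14)  # window opens 14 days before a deadline
WINDOW_TAIL = timedelta(days=60)  # and closes 60 days after it


def filing_windows(year: int) -> List[Tuple[date, date]]:
    """Windows around the two deadlines for a filing year: Jul 15 Y, Jan 15 Y+1."""
    deadlines = (date(year, 7, 15), date(year + 1, 1, 15))
    return [(d - WINDOW_LEAD, d + WINDOW_TAIL) for d in deadlines]


def should_check(year: int, today: date, last_checked: Optional[date]) -> bool:
    """Decide whether a registrant's summary page for `year` needs a visit."""
    if today < date(year, 7, 1):
        return False  # H1 period not closed: skip the whole year
    if last_checked is None:
        return True  # never visited, so no marker row exists yet
    for start, end in filing_windows(year):
        if start <= today <= end:
            return True  # inside a filing window: re-check every run
        if today > end and last_checked <= end:
            return True  # one closing sweep after the window has ended
    return False


# A page last stamped inside the H1 window gets one sweep afterwards, then rests.
sweep = should_check(2026, date(2026, 10, 1), date(2026, 9, 10))
rest = should_check(2026, date(2026, 10, 8), date(2026, 10, 1))
```

Under this reading, a late amendment filed near the end of a window is still picked up by the closing sweep, while pages stay untouched between windows.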

Expected runtimes: ~1–2 min steady-state; ~40 min weekly during the two
filing windows. Verified the window logic against 8 calendar scenarios
and live-tested marker/stamp writes (3-page smoke test + full sweep).
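
The "data files first, links index last" ordering with periodic syncs can be sketched as follows. The upload callable, the function names, and the choice of which files count as data files are assumptions for illustration.

```python
from typing import Callable, Iterable, List, Sequence


def sync_state(
    upload: Callable[[str], None],
    data_files: Sequence[str],
    index_file: str,
) -> None:
    """Upload data files first; the index goes last so it never points at
    data that failed to upload (an exception stops before the index)."""
    for path in data_files:
        upload(path)
    upload(index_file)


def scrape(
    pages: Iterable,
    process: Callable,
    sync: Callable[[], None],
    every: int = 200,
) -> None:
    """Process pages, syncing every `every` pages and once more at the end."""
    for n, page in enumerate(pages, start=1):
        process(page)
        if n % every == 0:
            sync()  # durable progress even if the run later times out
    sync()


# Demo: record the upload order and count syncs over 450 pages.
uploaded: List[str] = []
sync_calls: List[int] = []


def demo_sync() -> None:
    sync_calls.append(1)
    sync_state(
        uploaded.append,
        ["MA_lobbying_bills.csv", "MA_lobbying_employers.csv"],
        "MA_lobbying_summary_links.csv",
    )


scrape(range(450), process=lambda page: None, sync=demo_sync)
```

With 450 pages the sketch syncs at pages 200 and 400 and once at the end, and every sync finishes with the links index.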

get_MA_legislature_bills.py:
- restore MA_legislature_bills.csv from GCS at startup (CI has no local
  state; without this it re-fetches all 33k bills and times out) —
  verified to_fetch=0 with restored state
- upload to GCS every 500 bills and at run end (self-healing)

Docs: update CLAUDE.md bullets and README_lobbying.md incremental
section to describe the new model.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment thread get_data/assemble_db.py
continue
out = f'../docs/data/{fname}_sample.csv'
df.head(100).to_csv(out, index=has_index)
print(f'Wrote sample: {out}')