Auto-update data + charts 2026-06-12 by github-actions[bot] · Pull Request #85 · nesanders/MAenvironmentaldata

github-actions · 2026-06-12T14:09:17Z

Automated weekly data and chart refresh.

Data pipeline scripts:
- get_MA_lobbying.py: full MA SoS lobbying disclosure scraper (2005–present)
- get_MA_legislature_bills.py: incremental Legislature API bill text fetcher
- score_lobbying_bills.py: Gemini embedding + env relevance scoring (fixes H/S chamber collision dedup bug; derive bill_id from chamber prefix)
- summarize_lobbying_bills.py: parallel LLM summary + taxonomy tagging (actual cost: $0.627/1k bills; documented in script header)
- cluster_lobbying_bills.py: k-means clustering on bill embeddings
- assemble_db.py: entity normalization, bill_id join key, lobbying tables
- validate_data.py: lobbying schema checks
- generate_semantic_context.py: lobbying table descriptions + join docs

CI workflows:
- update-weekly.yml (new): replaces update-data.yml as the weekly job; runs full pipeline including lobbying incremental fetch + scoring
- update-data.yml / update-charts.yml: refactored as self-contained; no cross-workflow dispatch (GITHUB_TOKEN can't trigger workflow_dispatch)

Infrastructure:
- requirements-ci.txt: add scikit-learn, bump joblib, add gcsfs/pyarrow
- CLAUDE.md: document Gemini cost actuals, lobbying known issues
- .gitignore: add lobbying secret and cache dirs

Data:
- docs/data/: lobbying sample CSVs, legislature sample, cluster labels, embedding model artifacts, timestamps, facts_lobbying.yml
- docs/assets/db_semantic_context.txt: lobbying + legislature table context

Dashboard charts (dashboard_charts.py: graceful try/except for lobbying import until MA_lobbying_viz.py lands in a follow-up PR):
- Non-lobbying dash chart refreshes: MADEP enforcement, MAEEADP CSO/EJ

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

employer_name was renamed to entity_name during entity normalization refactor in assemble_db.py. Update expected schema in validate_data.py and regenerate data_stats.yml to include lobbying/legislature row counts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

These were one-time experiment/calibration scripts introduced during lobbying bill embedding and tagging development. Not part of the ongoing pipeline.
- cluster_pilot_summaries.py: clustering pilot experiments
- diagnostics_summarize.py: summarization diagnostics
- fill_summary_embeddings.py: one-time summary embedding backfill (done)
- test_bill_embedding_pipeline.py: embedding quality iteration tests
- test_concat_embeddings.py: concat embedding method tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…rent state

README_lobbying.md:
- Add summarize_lobbying_bills.py as section 5 (manual, not in CI)
- Fix cluster_lobbying_bills.py: now runs --incremental in weekly CI
- Fix score script cost figure (was wildly wrong; now $0.00015/bill)
- Fix text prep: 3000 chars not 2000
- Document H/S dedup fix and correct bill_id derivation
- Update corpus size to 33,159 bills / 924 environmental

NOTES_bill_embeddings.md:
- Mark GC bug as FIXED (FIRST_GC_START_YEAR 2005→2003)
- Document H/S collision fix as FIXED (dedup on bill_id not bill_number)
- Update env bill count: 329 → 654 → 924 across three pipeline fixes
- Update corpus size to 33,159 bills (June 2026)
- Note that all 33k bills now have LLM summaries/tags
- Correct score distribution table with post-backfill figures
- Update t-SNE note to 924 env bills (was 329)
- Clean up duplicate section 6 heading
- Remove stale UMAP reference (we use t-SNE)
- Condense "Recommended next steps" → "Remaining known limitations" (GC fix and summarization are done; only title-only re-embed, threshold recalibration, and lookup quality remain)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
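To illustrate the H/S collision described above, here is a minimal sketch of why deduplicating on bill_number alone can drop a bill, while deduplicating on a chamber-aware bill_id does not. The record fields, the `make_bill_id` helper, and the id format are hypothetical and chosen for illustration; they are not taken from score_lobbying_bills.py.

```python
def make_bill_id(general_court, chamber, bill_number):
    """Build a chamber-aware key (hypothetical format, for illustration only)."""
    return f"{general_court}-{chamber}{bill_number}"


def dedup(records, key):
    """Keep the first record seen for each key value, preserving input order."""
    seen = {}
    for rec in records:
        seen.setdefault(key(rec), rec)
    return list(seen.values())


# Illustrative rows: a House and a Senate bill share number 100,
# and the House bill appears twice (a true duplicate).
bills = [
    {"general_court": 194, "chamber": "H", "bill_number": 100, "title": "House bill"},
    {"general_court": 194, "chamber": "S", "bill_number": 100, "title": "Senate bill"},
    {"general_court": 194, "chamber": "H", "bill_number": 100, "title": "House bill"},
]

# Keying on the number alone collapses H100 and S100 into one row.
by_number = dedup(bills, key=lambda b: (b["general_court"], b["bill_number"]))

# Keying on the chamber-aware id removes only the true duplicate.
by_id = dedup(
    bills,
    key=lambda b: make_bill_id(b["general_court"], b["chamber"], b["bill_number"]),
)

print(len(by_number), len(by_id))
```

With these sample rows, the number-only key keeps one record and silently loses the Senate bill, while the bill_id key keeps both distinct bills.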

…ter run

Without this, every CI run starts with no state files (all three lobbying CSVs are gitignored) and re-scrapes all 22 years of history (~hours).

Changes:
- Pull MA_lobbying_summary_links.csv, MA_lobbying_bills.csv, and MA_lobbying_employers.csv from GCS at startup if not present locally
- Push MA_lobbying_summary_links.csv back to GCS at end of every run (MA_lobbying_bills.csv and MA_lobbying_employers.csv are already uploaded by assemble_db.py)
- Add import os

Also uploaded current local links file to GCS so next run is incremental.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
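The "pull from GCS at startup if not present locally" step could look roughly like the sketch below. It assumes an fsspec-style filesystem object (for example a `gcsfs.GCSFileSystem`) exposing `exists` and `get`; the function name `restore_missing_state` and the `remote_prefix` argument are hypothetical, and the actual code in get_MA_lobbying.py may differ.

```python
import os

# The three state files named in the commit message.
STATE_FILES = [
    "MA_lobbying_summary_links.csv",
    "MA_lobbying_bills.csv",
    "MA_lobbying_employers.csv",
]


def restore_missing_state(fs, remote_prefix, local_dir="."):
    """Download each state file that is missing locally; return the names restored.

    `fs` is any fsspec-style filesystem with exists(path) and get(rpath, lpath).
    `remote_prefix` is a hypothetical bucket/path prefix, e.g. "my-bucket/state".
    """
    restored = []
    for name in STATE_FILES:
        local_path = os.path.join(local_dir, name)
        if os.path.exists(local_path):
            continue  # local copy wins; never overwrite it
        remote_path = f"{remote_prefix}/{name}"
        if fs.exists(remote_path):
            fs.get(remote_path, local_path)
            restored.append(name)
    return restored
```

Passing the filesystem in as an argument keeps the function testable without network access; in CI it would be called once before the scraper decides what to fetch.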

…window

Root cause: even with correct state from GCS, the script was fetching a Summary.aspx page for every one of ~1,700 registrants per year just to check for new disc_urls. At 1.0s/request × 1,700 × 2 years = ~57 min, well over the 30-min CI timeout.

Two fixes:
1. Skip summary-page fetches for registrants already in existing_links for past (closed) years. Filing periods close ~6 months after year end so their disc_urls won't change. Only truly new registrants in past years, and all registrants in the current year, are checked. Effect: prior year (2025) drops from ~1,700 requests to ~0-50.
2. Lower REQUEST_DELAY from 1.0s to 0.3s. Safe for this low-volume SoS server; only affects summary and disclosure page GETs. Effect: current-year scan (2026) drops from ~28 min to ~9 min.

Combined expected runtime: ~9 min (vs 57 min before).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
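The runtime figures above can be reproduced with a short back-of-the-envelope calculation. This sketch counts only the fixed delay between requests and ignores server response time, so real runs will take longer; the helper name `scan_minutes` is illustrative.

```python
REGISTRANTS_PER_YEAR = 1_700  # approximate count quoted in the commit message


def scan_minutes(pages, delay_s):
    """Minutes spent sleeping between requests for a scan of `pages` pages."""
    return pages * delay_s / 60


two_years_old_delay = scan_minutes(REGISTRANTS_PER_YEAR * 2, 1.0)  # both years, 1.0 s
current_year_old_delay = scan_minutes(REGISTRANTS_PER_YEAR, 1.0)   # one year, 1.0 s
current_year_new_delay = scan_minutes(REGISTRANTS_PER_YEAR, 0.3)   # one year, 0.3 s

print(f"two years @ 1.0s: {two_years_old_delay:.1f} min")
print(f"one year  @ 1.0s: {current_year_old_delay:.1f} min")
print(f"one year  @ 0.3s: {current_year_new_delay:.1f} min")
```

The three values round to the ~57, ~28, and ~9 minute figures quoted above.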

…window

Previous fix (skip known prior-year registrants) was necessary but not sufficient: current-year scan of ~1,700 registrants at 0.3s delay + ~0.5-1s server response ≈ 37 min still exceeds 30-min timeout.

Additional fix: also skip known current-year registrants outside the H2 filing window (Jan–Sep). MA lobbying has two semi-annual periods:
- H1: Jan–Jun, disclosures due ~Jul 15
- H2: Jul–Dec, disclosures due ~Jan 15 of following year

Outside the H2 window, a registrant who already has a disc_url for the current year won't have a new one until October. Only genuinely new registrants (first-time filers, not yet in existing_links) are checked.

Expected runtime after both fixes:
- Jan–Sep (pre-H2): only new/unknown registrants → ~2–5 min
- Oct–Dec (H2 window): all ~1,700 current-year registrants → ~25 min

Also bump lobbying step timeout-minutes from 30 → 45 as headroom for the H2 window full scan and slow server responses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… from GCS

Review of the previous incremental fix found two real bugs and one gap:
1. DATA LOSS: the prior skip rule permanently skipped known prior-year registrants. H2 disclosures for year Y are filed ~Jan 15 of Y+1, so every H2 filing from a registrant who had already filed H1 would have been silently missed starting next January.
2. WASTED SCANS: registrants were only recorded in the links CSV after a disclosure was fetched. Before July, current-year registrants have no disclosures, so all ~1,700 pages were re-scanned every weekly run (the observed 40-min runs that found nothing).
3. WRONG ASSUMPTION: a disclosure-count cutoff (max 2 per year) cannot work: ~11% of registrant-years have 3–10 disclosure URLs due to amendments, which cluster around the filing deadlines.

New model in get_MA_lobbying.py:
- links CSV gains a last_checked column; visited pages with no disclosures get a marker row (null disc_url)
- a page is re-checked only inside a filing window (deadline − 14d → deadline + 60d; deadlines Jul 15 Y and Jan 15 Y+1) plus one closing sweep after each window
- a year is skipped wholesale before Jul 1 (H1 period not closed)
- state syncs to GCS every 200 pages and at run end, data files first and the links index LAST, so a timed-out run makes durable progress and can never index a disclosure whose data wasn't uploaded

Expected runtimes: ~1–2 min steady-state; ~40 min weekly during the two filing windows. Verified the window logic against 8 calendar scenarios and live-tested marker/stamp writes (3-page smoke test + full sweep).

get_MA_legislature_bills.py:
- restore MA_legislature_bills.csv from GCS at startup (CI has no local state; without this it re-fetches all 33k bills and times out); verified to_fetch=0 with restored state
- upload to GCS every 500 bills and at run end (self-healing)

Docs: update CLAUDE.md bullets and README_lobbying.md incremental section to describe the new model.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
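One possible reading of the re-check rule above, sketched in Python. The function names, the treatment of never-visited pages, and the way the closing sweep is detected (last check on or before the window's end) are assumptions made for illustration; the logic in get_MA_lobbying.py may differ in detail.

```python
from datetime import date, timedelta

PRE_DAYS = 14   # window opens 14 days before a deadline
POST_DAYS = 60  # and closes 60 days after it


def filing_windows(year):
    """Return (start, end) pairs for the two filing windows of disclosure year `year`."""
    deadlines = [date(year, 7, 15), date(year + 1, 1, 15)]  # H1 and H2 deadlines
    return [
        (d - timedelta(days=PRE_DAYS), d + timedelta(days=POST_DAYS))
        for d in deadlines
    ]


def should_recheck(year, today, last_checked=None):
    """Decide whether a registrant's summary page for `year` should be fetched today."""
    if today < date(year, 7, 1):
        return False  # year skipped wholesale: H1 period not yet closed
    if last_checked is None:
        return True   # assumption: a page never visited is always checked
    for start, end in filing_windows(year):
        if start <= today <= end:
            return True  # inside a filing window
        if today > end and last_checked <= end:
            return True  # one closing sweep after the window ends
    return False
```

For example, `should_recheck(2026, date(2026, 10, 1), date(2026, 9, 1))` returns True (the closing sweep after the H1 window), while a second call a week later with `last_checked=date(2026, 10, 1)` returns False until the H2 window opens.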

+			continue
+		out = f'../docs/data/{fname}_sample.csv'
+		df.head(100).to_csv(out, index=has_index)
+		print(f'Wrote sample: {out}')


nesanders and others added 9 commits June 8, 2026 22:28

Auto-update data + charts 2026-06-12

d9c72be

github-advanced-security AI found potential problems Jun 12, 2026


Comment thread get_data/assemble_db.py


github-actions[bot] wants to merge 9 commits into main from auto/weekly-2026-06-12
