Auto-update data + charts 2026-06-12 #85
Open
github-actions[bot] wants to merge 9 commits into
Data pipeline scripts:
- get_MA_lobbying.py: full MA SoS lobbying disclosure scraper (2005–present)
- get_MA_legislature_bills.py: incremental Legislature API bill text fetcher
- score_lobbying_bills.py: Gemini embedding + env relevance scoring (fixes H/S chamber collision dedup bug; derive bill_id from chamber prefix)
- summarize_lobbying_bills.py: parallel LLM summary + taxonomy tagging (actual cost: $0.627/1k bills; documented in script header)
- cluster_lobbying_bills.py: k-means clustering on bill embeddings
- assemble_db.py: entity normalization, bill_id join key, lobbying tables
- validate_data.py: lobbying schema checks
- generate_semantic_context.py: lobbying table descriptions + join docs

CI workflows:
- update-weekly.yml (new): replaces update-data.yml as the weekly job; runs full pipeline including lobbying incremental fetch + scoring
- update-data.yml / update-charts.yml: refactored as self-contained; no cross-workflow dispatch (GITHUB_TOKEN can't trigger workflow_dispatch)

Infrastructure:
- requirements-ci.txt: add scikit-learn, bump joblib, add gcsfs/pyarrow
- CLAUDE.md: document Gemini cost actuals, lobbying known issues
- .gitignore: add lobbying secret and cache dirs

Data:
- docs/data/: lobbying sample CSVs, legislature sample, cluster labels, embedding model artifacts, timestamps, facts_lobbying.yml
- docs/assets/db_semantic_context.txt: lobbying + legislature table context

Dashboard charts (dashboard_charts.py: graceful try/except for lobbying import until MA_lobbying_viz.py lands in a follow-up PR):
- Non-lobbying dash chart refreshes: MADEP enforcement, MAEEADP CSO/EJ

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
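The "graceful try/except for lobbying import" mentioned for dashboard_charts.py could look roughly like the sketch below. This is an illustration only, not the code in the PR: `load_optional` and `build_charts` are hypothetical names, and the chart labels are placeholders taken from the commit text.

```python
import importlib
from types import ModuleType
from typing import Optional


def load_optional(module_name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# MA_lobbying_viz is the module named in the commit message; it may be absent
# until the follow-up PR lands, in which case this is None.
lobbying_viz = load_optional("MA_lobbying_viz")


def build_charts() -> list:
    """Hypothetical chart list: lobbying charts are added only if available."""
    charts = ["MADEP enforcement", "MAEEADP CSO/EJ"]  # placeholder labels
    if lobbying_viz is not None:
        charts.append("lobbying")
    return charts
```

With this shape the non-lobbying chart refresh keeps working whether or not the lobbying module exists yet.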
employer_name was renamed to entity_name during the entity normalization refactor in assemble_db.py. Update expected schema in validate_data.py and regenerate data_stats.yml to include lobbying/legislature row counts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
These were one-time experiment/calibration scripts introduced during lobbying bill embedding and tagging development. Not part of the ongoing pipeline.

- cluster_pilot_summaries.py: clustering pilot experiments
- diagnostics_summarize.py: summarization diagnostics
- fill_summary_embeddings.py: one-time summary embedding backfill (done)
- test_bill_embedding_pipeline.py: embedding quality iteration tests
- test_concat_embeddings.py: concat embedding method tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rent state

README_lobbying.md:
- Add summarize_lobbying_bills.py as section 5 (manual, not in CI)
- Fix cluster_lobbying_bills.py: now runs --incremental in weekly CI
- Fix score script cost figure (was wildly wrong; now $0.00015/bill)
- Fix text prep: 3000 chars not 2000
- Document H/S dedup fix and correct bill_id derivation
- Update corpus size to 33,159 bills / 924 environmental

NOTES_bill_embeddings.md:
- Mark GC bug as FIXED (FIRST_GC_START_YEAR 2005→2003)
- Document H/S collision fix as FIXED (dedup on bill_id not bill_number)
- Update env bill count: 329 → 654 → 924 across three pipeline fixes
- Update corpus size to 33,159 bills (June 2026)
- Note that all 33k bills now have LLM summaries/tags
- Correct score distribution table with post-backfill figures
- Update t-SNE note to 924 env bills (was 329)
- Clean up duplicate section 6 heading
- Remove stale UMAP reference (we use t-SNE)
- Condense "Recommended next steps" → "Remaining known limitations" (GC fix and summarization are done; only title-only re-embed, threshold recalibration, and lookup quality remain)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
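The H/S collision fix (dedup on bill_id rather than bill_number) can be illustrated with a small sketch. The row layout here is an assumption: each row is taken to carry a `chamber` letter ("H" or "S") and a bare `bill_number`, within a single session; the real column names and id format in score_lobbying_bills.py may differ.

```python
def derive_bill_id(chamber: str, bill_number: str) -> str:
    """Prefix the number with the chamber letter, e.g. ("H", "123") -> "H123"."""
    return f"{chamber.strip().upper()}{bill_number.strip()}"


def dedup_bills(rows: list[dict]) -> list[dict]:
    """Keep the first row per bill_id, so H123 and S123 both survive."""
    seen = set()
    unique = []
    for row in rows:
        bill_id = derive_bill_id(row["chamber"], row["bill_number"])
        if bill_id in seen:
            continue  # true duplicate: same chamber and same number
        seen.add(bill_id)
        unique.append({**row, "bill_id": bill_id})
    return unique


# Hypothetical rows: deduplicating on bill_number alone would drop the Senate bill.
rows = [
    {"chamber": "H", "bill_number": "123"},
    {"chamber": "S", "bill_number": "123"},
    {"chamber": "H", "bill_number": "123"},
]
kept_ids = [row["bill_id"] for row in dedup_bills(rows)]
```

Keying on the chamber-prefixed id is what keeps a House and a Senate bill with the same number from collapsing into one record.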
…ter run

Without this, every CI run starts with no state files (all three lobbying CSVs are gitignored) and re-scrapes all 22 years of history (~hours).

Changes:
- Pull MA_lobbying_summary_links.csv, MA_lobbying_bills.csv, and MA_lobbying_employers.csv from GCS at startup if not present locally
- Push MA_lobbying_summary_links.csv back to GCS at end of every run (MA_lobbying_bills.csv and MA_lobbying_employers.csv are already uploaded by assemble_db.py)
- Add import os

Also uploaded current local links file to GCS so next run is incremental.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
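The "pull from GCS at startup if not present locally" step can be sketched as below. To keep the example runnable without credentials, the download is injected as a callable; `restore_missing` and `fake_fetch` are hypothetical names, and in the real script the fetch would presumably go through gcsfs (added to requirements-ci.txt in this PR).

```python
import os
import tempfile
from typing import Callable

# File names taken from the commit message.
STATE_FILES = [
    "MA_lobbying_summary_links.csv",
    "MA_lobbying_bills.csv",
    "MA_lobbying_employers.csv",
]


def restore_missing(local_dir: str, fetch: Callable[[str, str], None]) -> list[str]:
    """Call fetch(name, dest) for each state file absent from local_dir."""
    restored = []
    for name in STATE_FILES:
        dest = os.path.join(local_dir, name)
        if os.path.exists(dest):
            continue  # local state wins; never overwrite it
        fetch(name, dest)
        restored.append(name)
    return restored


def fake_fetch(name: str, dest: str) -> None:
    """Stand-in for a GCS download; writes a placeholder file."""
    with open(dest, "w") as fh:
        fh.write("placeholder\n")


with tempfile.TemporaryDirectory() as tmp:
    # Simulate a runner that already has one of the three files.
    open(os.path.join(tmp, "MA_lobbying_bills.csv"), "w").close()
    restored = restore_missing(tmp, fake_fetch)
```

Skipping files that already exist keeps local development runs from clobbering newer local state with an older remote copy.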
…window

Root cause: even with correct state from GCS, the script was fetching a Summary.aspx page for every one of ~1,700 registrants per year just to check for new disc_urls. At 1.0s/request × 1,700 × 2 years = ~57 min, well over the 30-min CI timeout.

Two fixes:

1. Skip summary-page fetches for registrants already in existing_links for past (closed) years. Filing periods close ~6 months after year end so their disc_urls won't change. Only truly new registrants in past years, and all registrants in the current year, are checked. Effect: prior year (2025) drops from ~1,700 requests to ~0-50.
2. Lower REQUEST_DELAY from 1.0s to 0.3s. This is safe for this low-volume SoS server and only affects summary and disclosure page GETs. Effect: current-year scan (2026) drops from ~28 min to ~9 min.

Combined expected runtime: ~9 min (vs 57 min before).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
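The runtime figures in this commit follow from simple arithmetic on the request delay alone (server response time is not included). A quick check, using the approximate registrant count from the message; `scan_minutes` is a hypothetical helper:

```python
REGISTRANTS_PER_YEAR = 1_700  # approximate figure from the commit message


def scan_minutes(requests: int, delay_s: float) -> float:
    """Minutes spent sleeping between requests for a scan of this size."""
    return requests * delay_s / 60


two_years_old_delay = scan_minutes(REGISTRANTS_PER_YEAR * 2, 1.0)  # ~57 min
current_year_old_delay = scan_minutes(REGISTRANTS_PER_YEAR, 1.0)   # ~28 min
current_year_new_delay = scan_minutes(REGISTRANTS_PER_YEAR, 0.3)   # ~9 min

print(f"{two_years_old_delay:.1f} {current_year_old_delay:.1f} {current_year_new_delay:.1f}")
```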
…window

Previous fix (skip known prior-year registrants) was necessary but not sufficient: current-year scan of ~1,700 registrants at 0.3s delay + ~0.5-1s server response ≈ 37 min still exceeds 30-min timeout.

Additional fix: also skip known current-year registrants outside the H2 filing window (Jan–Sep).

MA lobbying has two semi-annual periods:
- H1: Jan–Jun, disclosures due ~Jul 15
- H2: Jul–Dec, disclosures due ~Jan 15 of following year

Outside the H2 window, a registrant who already has a disc_url for the current year won't have a new one until October. Only genuinely new registrants (first-time filers, not yet in existing_links) are checked.

Expected runtime after both fixes:
- Jan–Sep (pre-H2): only new/unknown registrants → ~2–5 min
- Oct–Dec (H2 window): all ~1,700 current-year registrants → ~25 min

Also bump lobbying step timeout-minutes from 30 → 45 as headroom for the H2 window full scan and slow server responses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
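The skip rule described in this commit and the previous one can be written as a single predicate. This is a sketch of the rule as stated, not the script's code: `should_check` is a hypothetical name, and `existing_links` is assumed to be a set of `(registrant_id, year)` pairs. A later commit in this PR replaces this rule because it misses H2 filings made in the following January.

```python
from datetime import date

H2_WINDOW_MONTHS = {10, 11, 12}  # Oct–Dec, per the commit message


def should_check(registrant_id: str, year: int, existing_links: set, today: date) -> bool:
    """Decide whether to fetch a registrant's summary page for `year`."""
    known = (registrant_id, year) in existing_links
    if not known:
        return True  # genuinely new registrant: always check
    if year == today.year and today.month in H2_WINDOW_MONTHS:
        return True  # known current-year registrant inside the H2 window
    return False  # known prior-year, or known current-year outside the window


# Hypothetical state: registrant "R1" already has a 2026 disclosure link.
existing = {("R1", 2026), ("R1", 2025)}
june = should_check("R1", 2026, existing, date(2026, 6, 12))
november = should_check("R1", 2026, existing, date(2026, 11, 2))
newcomer = should_check("R2", 2026, existing, date(2026, 6, 12))
```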
… from GCS

Review of the previous incremental fix found two real bugs and one gap:

1. DATA LOSS: the prior skip rule permanently skipped known prior-year registrants. H2 disclosures for year Y are filed ~Jan 15 of Y+1, so every H2 filing from a registrant who had already filed H1 would have been silently missed starting next January.
2. WASTED SCANS: registrants were only recorded in the links CSV after a disclosure was fetched. Before July, current-year registrants have no disclosures, so all ~1,700 pages were re-scanned every weekly run (the observed 40-min runs that found nothing).
3. WRONG ASSUMPTION: a disclosure-count cutoff (max 2 per year) cannot work: ~11% of registrant-years have 3–10 disclosure URLs due to amendments, which cluster around the filing deadlines.

New model in get_MA_lobbying.py:
- links CSV gains a last_checked column; visited pages with no disclosures get a marker row (null disc_url)
- a page is re-checked only inside a filing window (deadline − 14d → deadline + 60d; deadlines Jul 15 Y and Jan 15 Y+1) plus one closing sweep after each window
- a year is skipped wholesale before Jul 1 (H1 period not closed)
- state syncs to GCS every 200 pages and at run end, data files first and the links index LAST, so a timed-out run makes durable progress and can never index a disclosure whose data wasn't uploaded

Expected runtimes: ~1–2 min steady-state; ~40 min weekly during the two filing windows. Verified the window logic against 8 calendar scenarios and live-tested marker/stamp writes (3-page smoke test + full sweep).

get_MA_legislature_bills.py:
- restore MA_legislature_bills.csv from GCS at startup (CI has no local state; without this it re-fetches all 33k bills and times out); verified to_fetch=0 with restored state
- upload to GCS every 500 bills and at run end (self-healing)

Docs: update CLAUDE.md bullets and README_lobbying.md incremental section to describe the new model.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
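The window model above can be sketched as a date predicate. The window bounds and deadlines come from the commit message; everything else is an assumption, including the function names and the reading of "one closing sweep" as "re-check once if the last check happened on or before the window's end".

```python
from datetime import date, timedelta
from typing import Optional

WINDOW_LEAD = timedelta(days=14)  # window opens 14 days before the deadline
WINDOW_TAIL = timedelta(days=60)  # and closes 60 days after it


def filing_windows(year: int) -> list[tuple[date, date]]:
    """Windows around the H1 (Jul 15 Y) and H2 (Jan 15 Y+1) deadlines."""
    deadlines = [date(year, 7, 15), date(year + 1, 1, 15)]
    return [(d - WINDOW_LEAD, d + WINDOW_TAIL) for d in deadlines]


def needs_check(year: int, last_checked: Optional[date], today: date) -> bool:
    """Should a registrant's page for `year` be fetched on this run?"""
    if today < date(year, 7, 1):
        return False  # H1 period not closed: skip the whole year
    if last_checked is None:
        return True  # page never visited
    for start, end in filing_windows(year):
        if start <= today <= end:
            return True  # inside a filing window
        if today > end and last_checked <= end:
            return True  # closing sweep: last visit predates the window's end
    return False


windows_2025 = filing_windows(2025)
```

Under this reading, a page checked on 2025-09-10 is swept once more after the H1 window closes on 2025-09-13, then left alone until the H2 window opens on 2026-01-01.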
continue
out = f'../docs/data/{fname}_sample.csv'
df.head(100).to_csv(out, index=has_index)
print(f'Wrote sample: {out}')
Automated weekly data and chart refresh.