Data Update Pipeline Guide

Overview

The AI Legislation Tracker pipeline pulls AI-related bills from all 50 US states (+ US Congress) via the LegiScan API, processes them through multiple stages (text extraction, categorization, AI-generated summaries/questions/reports), builds searchable vectorstores, and optionally syncs everything to HuggingFace for cloud storage.

Quick Reference

From the Admin Panel (recommended):

Open the app -> navigate to Admin -> log in -> "Update Data" tab -> click "Update Data"

From the command line:

python update_data.py --pull --overwrite-pdf --continue-on-error

After the pipeline completes, sync to the cloud:

python huggingface_upload.py

Step-by-Step: Running from the Admin Panel

Step 1: Start the Streamlit App

Open a terminal in the project root and run:

streamlit run streamlit_app.py

The app opens in your browser (typically at http://localhost:8501).

What you see: The main VAILL AI Governance Bills Tracker dashboard with a map, bill table, filters, and analysis tools.

Step 2: Navigate to the Admin Panel

In the left sidebar, click "Admin" to open the Admin page.

What you see: A login form asking for username and password.

Step 3: Log In

Enter your admin credentials (configured in auth_config.json or HuggingFace).

What you see: After login, the Admin Panel with three tabs: Overview, Update Data, and Manage Users. Your name and username appear in the sidebar with a Logout button.

Step 4: Review Current Data (Overview Tab)

The Overview tab shows:

System Status: Connection status for OpenAI API, LegiScan API, and HuggingFace
Current Data: Counts for Total Bills, Bills with Details, Summaries, Question Sets, Reports, and Cached Bills
Last Pipeline Run: Timestamp of the most recent run
Admin Users: Table of all registered admin accounts

Check that all three API connections show "Connected" before running the pipeline. If any show "Missing", set the corresponding environment variable in your .env file.

Step 5: Run the Pipeline (Update Data Tab)

1. Click the "Update Data" tab
2. Review the Current Data counts (same metrics as Overview)
3. Optional: Check "Skip uploading to HuggingFace after update" if you only want local updates
4. Click the blue "Update Data" button

What you see: A live status panel showing real-time log output as each script runs. The pipeline runs 10 scripts sequentially (detailed below). Expect roughly 30-60 minutes for an incremental update and 4-6 hours when every bill is new, depending on how many bills need processing (see Estimated Runtime).

Step 6: Monitor Progress

The status panel streams output from each script. You'll see messages like:

--- Running data_updating_scripts/get_data.py ---
Fetching bills for AL (2023-2026)...
...
--- Running data_updating_scripts/generate_summaries.py ---
Processing 1/3487: AL_SJR94
...

Important: You can navigate away from the page — the pipeline runs as a background process. To check on it later, look at the terminal where Streamlit is running, or check the log files in data_updating_scripts/logs/.

Step 7: Review Results

When the pipeline finishes, the Update Data tab shows:

New / Updated Bills: How many bills were fetched from LegiScan
Unchanged Bills: Bills that hadn't changed (API calls saved)
Steps Completed: How many scripts passed vs failed
Data Changes: Before/after comparison of all data file counts

If HuggingFace upload was enabled, the pipeline then automatically syncs all JSON files to the cloud.

Step 8: View Updated Data

Hard-refresh the main page (Cmd+Shift+R on Mac, Ctrl+Shift+R on Windows) or restart Streamlit to clear the cache and see the updated bills.

What the Pipeline Does (All 10 Scripts)

The pipeline orchestrator (update_data.py) runs these 10 scripts in order:

Script 1: `get_data.py` — Pull Bills from LegiScan

What it does:

Queries the LegiScan API for "artificial intelligence" bills across all 50 states + US Congress
Searches years 2023 through the current year
For each bill found, fetches full bill details (text, sponsors, status, history)
Uses a local cache (data/bill_cache.json) to skip bills that haven't changed since last pull
Extracts bill text from base64-encoded documents (HTML or PDF)

Input: LegiScan API
Output: data/known_bills.json, data/bill_cache.json
API used: LegiScan (uses LEGISCAN_API_KEY)
Cost: Free tier allows ~30,000 requests/month
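
The cache check described above can be sketched as follows. This is a minimal illustration, not the actual code in get_data.py: it assumes data/bill_cache.json maps each bill ID to the last change hash seen, and that each search result carries `bill_id` and `change_hash` fields. The function names are hypothetical.

```python
import json
from pathlib import Path


def load_cache(path):
    """Return the cached {bill_id: change_hash} mapping, or {} if no cache exists yet."""
    p = Path(path)
    return json.loads(p.read_text(encoding="utf-8")) if p.exists() else {}


def bills_to_fetch(search_results, cache):
    """Keep only results whose change hash differs from the cached one (new or changed bills)."""
    return [
        result for result in search_results
        if cache.get(str(result["bill_id"])) != result["change_hash"]
    ]


# Example: one unchanged bill, one changed bill, one brand-new bill.
cache = {"101": "aaa", "102": "bbb"}
results = [
    {"bill_id": 101, "change_hash": "aaa"},  # unchanged, skipped
    {"bill_id": 102, "change_hash": "ccc"},  # hash changed, fetched again
    {"bill_id": 103, "change_hash": "ddd"},  # not in cache, fetched
]
stale = bills_to_fetch(results, cache)
```

Only the bills returned by `bills_to_fetch` would need a full detail request, which is how the cache saves API calls against the monthly quota.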

Script 2: `fix_pdf_bills.py` — Extract Text from PDF Bills

What it does:

Finds bills where the text is still raw base64-encoded PDF content
Decodes the base64, extracts readable text using PyPDF2
Also handles HTML-encoded bill text via BeautifulSoup
Marks successfully processed bills with text_fixed: true to avoid reprocessing
Creates a backup before modifying data

Input: data/known_bills.json
Output: data/known_bills.json (updated in place), data/known_bills_backup.json
API used: None (local processing only)
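
A rough sketch of the decode-and-extract step, assuming the raw text field holds a base64 string. The helper names are hypothetical and the real fix_pdf_bills.py may detect PDFs differently. PyPDF2 is imported lazily so the detection helpers work without it installed.

```python
import base64
import io

# A PDF file starts with the bytes "%PDF-", which base64-encode to "JVBERi".
PDF_B64_PREFIX = "JVBERi"


def looks_like_base64_pdf(text):
    """Heuristic: does this bill text look like a still-encoded PDF?"""
    return text.lstrip().startswith(PDF_B64_PREFIX)


def decode_pdf_bytes(text):
    """Decode the base64 payload back into raw PDF bytes."""
    return base64.b64decode(text.strip())


def extract_pdf_text(pdf_bytes):
    """Extract readable text from PDF bytes using PyPDF2."""
    from PyPDF2 import PdfReader  # lazy import: only needed for real PDFs

    reader = PdfReader(io.BytesIO(pdf_bytes))
    # extract_text() can return None for pages without a text layer.
    return "\n".join((page.extract_text() or "") for page in reader.pages)
```

A bill would then be updated with the extracted text and flagged with text_fixed: true so later runs skip it.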

Script 3: `known_bills_status.py` — Clean and Merge Bill Data

What it does:

Merges raw bill data from known_bills.json with the existing visualization dataset
Maps numeric status codes to human-readable labels (e.g., 1 -> "Introduced", 4 -> "Signed Into Law")
Preserves existing IAPP categories and other enrichments from previous runs
Removes bills that are no longer in the source data

Input: data/known_bills.json, data/known_bills_visualize.json (existing)
Output: data/known_bills_fixed.json, data/known_bills_visualize.json
API used: None (local processing only)
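
The merge logic can be pictured roughly like this. It is a sketch under assumptions: bills are keyed by ID, the numeric code lives in a `status` field, and the label is written to a hypothetical `status_label` field. Only the two code-to-label pairs quoted in this guide are included; the full mapping lives in known_bills_status.py.

```python
# Only the two mappings mentioned in this guide; other codes fall back to "Unknown" here.
STATUS_LABELS = {1: "Introduced", 4: "Signed Into Law"}


def merge_bill(raw, existing):
    """Combine fresh LegiScan fields with enrichments kept from a previous run."""
    merged = dict(existing or {})  # start from earlier enrichments (e.g. iapp_categories)
    merged.update(raw)             # fresh source fields overwrite stale copies
    merged["status_label"] = STATUS_LABELS.get(raw.get("status"), "Unknown")
    return merged


def merge_all(raw_bills, existing_bills):
    """Iterate over the raw data only, so bills missing from the source are dropped."""
    return {
        bill_id: merge_bill(raw, existing_bills.get(bill_id))
        for bill_id, raw in raw_bills.items()
    }


raw_bills = {"A": {"status": 1, "title": "New title"}}
existing_bills = {
    "A": {"title": "Old title", "iapp_categories": {"Governance": ["Program and documentation"]}},
    "B": {"title": "No longer in the source data"},
}
merged = merge_all(raw_bills, existing_bills)
```

Here bill "A" keeps its IAPP categories while picking up the new title, and bill "B" disappears because it is absent from the raw data.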

Script 4: `migrate_iapp_categories.py` — Categorize Bills (IAPP Framework)

What it does:

Analyzes each bill's text using OpenAI to categorize it under the IAPP AI governance framework
Assigns bills to 4 main categories: Governance, Transparency, Assurance, Individual Rights
Each category has specific subcategories (e.g., "Program and documentation", "Opt out/appeal")
Uses a local cache (data/iapp_categories_cache.json) to avoid reprocessing unchanged bills
Falls back to default categories if the API call fails

Input: data/known_bills_fixed.json, data/known_bills_visualize.json
Output: data/known_bills_visualize.json (updated with iapp_categories field), data/iapp_categories_cache.json
API used: OpenAI Chat API (OPENAI_API_KEY)
Cost: ~$0.02 per bill
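
The cache-then-fallback behavior might look something like the sketch below. The cache key (a hash of the bill text), the shape of the default categories, and the `classify` callable standing in for the OpenAI call are all assumptions made for illustration.

```python
import hashlib

MAIN_CATEGORIES = ["Governance", "Transparency", "Assurance", "Individual Rights"]


def default_categories():
    """Hypothetical fallback: every main category present, no subcategories assigned."""
    return {name: [] for name in MAIN_CATEGORIES}


def text_key(text):
    """Key the cache on the bill text so unchanged bills are never re-sent to the API."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def categorize(text, cache, classify):
    """Return cached categories, else call classify(text); fall back to defaults on failure."""
    key = text_key(text)
    if key in cache:
        return cache[key]
    try:
        result = classify(text)
    except Exception:
        # Assumption: fallbacks are not cached, so the bill is retried on the next run.
        return default_categories()
    cache[key] = result
    return result
```

Passing the classifier in as a function keeps the sketch testable without an API key; in the real script that role is played by the OpenAI Chat API call.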

Script 5: `mark_no_text_bills.py` — Flag Bills Without Text

What it does:

Scans all bills and marks those with missing or very short text (< 50 characters)
Sets iapp_categories to null for these bills so they're excluded from categorization displays
This prevents empty bills from cluttering analysis results

Input: data/known_bills_visualize.json
Output: data/known_bills_visualize.json (updated in place)
API used: None (local processing only)
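
As a small illustration of the rule above, here is one way the 50-character check could be written. The `text` field name and the whitespace stripping are assumptions; the actual script may measure length differently.

```python
MIN_TEXT_CHARS = 50  # threshold quoted in this guide


def mark_no_text(bill):
    """Set iapp_categories to None when the bill text is missing or too short."""
    text = (bill.get("text") or "").strip()  # treat None and whitespace-only as empty
    if len(text) < MIN_TEXT_CHARS:
        bill["iapp_categories"] = None
    return bill


empty_bill = mark_no_text({"text": "", "iapp_categories": {"Governance": []}})
full_bill = mark_no_text({"text": "x" * 200, "iapp_categories": {"Governance": []}})
```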

Script 6: `generate_summaries.py` — Generate Bill Summaries

What it does:

For each bill with text, generates a concise AI summary explaining what the bill does
Skips bills that already have a valid summary in the cache
Saves progress every 10 bills (safe to interrupt and resume)
Summaries are stored separately from bill data to avoid reprocessing

Input: data/known_bills_visualize.json, data/bill_summaries.json (existing cache)
Output: data/bill_summaries.json
API used: OpenAI Chat API
Cost: ~$0.025 per bill
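
The skip-and-checkpoint pattern that makes this script safe to interrupt can be sketched as below. It assumes the cache file is a flat {bill_id: summary} JSON object and uses a hypothetical `summarize` callable in place of the OpenAI call.

```python
import json
from pathlib import Path


def run_with_checkpoints(bills, cache_path, summarize, every=10):
    """Summarize bills missing from the cache, writing the cache to disk every `every` bills."""
    path = Path(cache_path)
    cache = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    processed = 0
    for bill_id, text in bills.items():
        if cache.get(bill_id):
            continue  # already has a non-empty summary from an earlier run
        cache[bill_id] = summarize(text)
        processed += 1
        if processed % every == 0:
            # Checkpoint: an interruption now loses at most `every` - 1 summaries.
            path.write_text(json.dumps(cache), encoding="utf-8")
    path.write_text(json.dumps(cache), encoding="utf-8")  # final save
    return cache
```

Because finished bills are skipped on entry, re-running the script after an interruption only pays for the bills that were not yet saved.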

Script 7: `generate_suggested_questions.py` — Generate Discussion Questions

What it does:

For each bill, generates 5 specific questions that a user might ask about the bill
These questions appear in the UI as clickable suggestions when viewing a bill
Falls back to generic questions if generation fails
Saves progress every 10 bills

Input: data/known_bills_visualize.json, data/bill_suggested_questions.json (existing cache)
Output: data/bill_suggested_questions.json
API used: OpenAI Chat API
Cost: ~$0.02 per bill

Script 8: `generate_reports.py` — Generate Detailed Reports

What it does:

For each bill, generates a detailed Markdown report covering: title, status, sponsors, goals, key provisions, regulatory approaches, enforcement mechanisms, and notable features
Reports are the longest AI-generated content (~1-2 pages per bill)
Saves progress every 10 bills (resumable)

Input: data/known_bills_visualize.json, data/bill_reports.json (existing cache)
Output: data/bill_reports.json
API used: OpenAI Chat API (uses gpt-4o)
Cost: ~$0.045 per bill

Script 9: `build_bills_vectorstore.py` — Build Searchable Vectorstore

What it does:

Converts all bills into vector embeddings for semantic search
Splits long bill text into overlapping chunks (1500 chars, 200 overlap)
Stores embeddings in a ChromaDB vectorstore on disk
Uses a manifest file to skip bills that haven't changed since last build
This vectorstore powers the "Ask a Question" and "Compare Bills" features in the app

Input: data/known_bills_visualize.json
Output: data/bills_vectorstore/ (ChromaDB files), data/bills_vectorstore_manifest.json
API used: OpenAI Embeddings API (text-embedding-3-small)
Cost: ~$0.0001 per bill (very cheap)
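
To make the chunking parameters concrete, here is a simplified fixed-window splitter using the same numbers (1500 characters, 200 overlap). It is only an illustration: the actual script may use a library splitter that also respects paragraph or sentence boundaries.

```python
def chunk_text(text, size=1500, overlap=200):
    """Split text into windows of `size` characters, each sharing `overlap` characters with the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample_text = "".join(chr(97 + i % 26) for i in range(3000))
chunks = chunk_text(sample_text)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which helps semantic search find it.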

Script 10: `eu_vectorstore.py` — Build EU AI Act Vectorstore

What it does:

Extracts text from the EU AI Act PDF (data_updating_scripts/eu-ai-act.pdf)
Splits it into chunks and creates a FAISS vectorstore
This vectorstore powers the "Compare with EU AI Act" feature in the app

Input: data_updating_scripts/eu-ai-act.pdf
Output: data/eu_ai_act_vectorstore/ (FAISS index + metadata)
API used: OpenAI Embeddings API
Cost: ~$0.01 (one-time, small document)
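
The approximate per-bill costs quoted for Scripts 4 and 6-9 can be added up for a rough budget. The sketch below assumes every new bill passes through all five OpenAI-backed scripts, so it is an upper-bound estimate built from this guide's figures rather than a billing calculation.

```python
# Approximate per-bill costs in USD, as quoted in the script descriptions above.
PER_BILL_COST_USD = {
    "migrate_iapp_categories.py": 0.02,
    "generate_summaries.py": 0.025,
    "generate_suggested_questions.py": 0.02,
    "generate_reports.py": 0.045,
    "build_bills_vectorstore.py": 0.0001,
}
EU_VECTORSTORE_ONE_TIME_USD = 0.01


def estimate_cost(new_bills, include_eu_vectorstore=False):
    """Rough OpenAI spend for processing `new_bills` bills through every script."""
    total = new_bills * sum(PER_BILL_COST_USD.values())
    if include_eu_vectorstore:
        total += EU_VECTORSTORE_ONE_TIME_USD
    return round(total, 2)


cost_100 = estimate_cost(100)  # about 11 USD for 100 new bills
```

The per-bill total works out to about $0.11, with report generation the largest share.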

Data Flow Diagram

LegiScan API
    |
    v
[1] get_data.py
    |
    v
data/known_bills.json  +  data/bill_cache.json
    |
    v
[2] fix_pdf_bills.py  (extract text from PDFs)
    |
    v
data/known_bills.json  (text extracted)
    |
    v
[3] known_bills_status.py  (merge + clean statuses)
    |
    v
data/known_bills_fixed.json  +  data/known_bills_visualize.json
    |
    +---> [4] migrate_iapp_categories.py  (IAPP categorization via OpenAI)
    |         |
    |         v
    |     data/known_bills_visualize.json  (with iapp_categories)
    |
    +---> [5] mark_no_text_bills.py  (flag empty bills)
    |         |
    |         v
    |     data/known_bills_visualize.json  (final version)
    |
    +---> [6] generate_summaries.py ---------> data/bill_summaries.json
    |
    +---> [7] generate_suggested_questions.py -> data/bill_suggested_questions.json
    |
    +---> [8] generate_reports.py ------------> data/bill_reports.json
    |
    +---> [9] build_bills_vectorstore.py -----> data/bills_vectorstore/
    |
    +---> [10] eu_vectorstore.py -------------> data/eu_ai_act_vectorstore/
    |
    v
[Optional] huggingface_upload.py  (sync all JSONs to HuggingFace)
    |
    v
HuggingFace Datasets Hub
    |
    v
streamlit_app.py reads data/known_bills_visualize.json
    + data/bill_summaries.json
    + data/bill_suggested_questions.json
    + data/bill_reports.json
    + data/bills_vectorstore/
    + data/eu_ai_act_vectorstore/
    |
    v
Website displays all bills with interactive features

Files Produced by the Pipeline

File	Description	Used By
`data/known_bills.json`	Raw bill data from LegiScan	Pipeline scripts
`data/known_bills_backup.json`	Backup before PDF text extraction	Recovery only
`data/known_bills_fixed.json`	Bills with cleaned statuses	Pipeline scripts
`data/known_bills_visualize.json`	Main data file — bills with IAPP categories	Website + all scripts
`data/bill_cache.json`	LegiScan change hashes (skip unchanged bills)	`get_data.py`
`data/iapp_categories_cache.json`	IAPP categorization cache	`migrate_iapp_categories.py`
`data/bill_summaries.json`	AI-generated bill summaries	Website
`data/bill_suggested_questions.json`	AI-generated discussion questions	Website
`data/bill_reports.json`	AI-generated detailed reports	Website
`data/bills_vectorstore/`	ChromaDB vectorstore for bill search	Website (Q&A, Compare)
`data/bills_vectorstore_manifest.json`	Tracks which bills are in vectorstore	`build_bills_vectorstore.py`
`data/eu_ai_act_vectorstore/`	FAISS vectorstore for EU AI Act	Website (EU comparison)

Running from the Command Line

Full pipeline (pull new data + process everything):

python update_data.py --pull --overwrite-pdf --continue-on-error

Skip LegiScan pull (reprocess existing local data only):

python update_data.py --no-pull --continue-on-error

Skip HuggingFace upload:

python update_data.py --pull --overwrite-pdf --continue-on-error --skip-upload

Test mode (pull only 3 bills from CA to verify pipeline works):

python update_data.py --test --continue-on-error --skip-upload

Run individual scripts:

# Re-run just summaries (e.g., to catch up after a hang)
python data_updating_scripts/generate_summaries.py

# Re-run just questions
python data_updating_scripts/generate_suggested_questions.py

# Re-run just reports
python data_updating_scripts/generate_reports.py

# Rebuild vectorstore
python data_updating_scripts/build_bills_vectorstore.py

# Upload all data to HuggingFace
python huggingface_upload.py

Environment Variables Required

Set these in a .env file in the project root:

Variable	Required For	Description
`LEGISCAN_API_KEY`	Script 1	LegiScan API key for pulling bill data
`OPENAI_API_KEY`	Scripts 4, 6-10	OpenAI API key for categorization, summaries, questions, reports, and embeddings
`HUGGINGFACE_HUB_TOKEN`	HF upload	HuggingFace API token
`HF_REPO_ID`	HF upload	HuggingFace dataset repo (e.g., `username/dataset-name`)
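
Before starting a long run, it can help to confirm that these variables are actually visible to Python. The following is a hypothetical pre-flight check, not part of the pipeline; it reads from the process environment, so load your .env first (for example with python-dotenv) if the variables are not already exported.

```python
import os

# Variable names and what needs them, taken from the table above.
REQUIRED_VARS = {
    "LEGISCAN_API_KEY": "Script 1",
    "OPENAI_API_KEY": "Scripts 4, 6-10",
    "HUGGINGFACE_HUB_TOKEN": "HF upload",
    "HF_REPO_ID": "HF upload",
}


def missing_vars(env=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not (env.get(name) or "").strip()]


if __name__ == "__main__":
    for name in missing_vars():
        print(f"Missing {name} (needed for: {REQUIRED_VARS[name]})")
```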

Troubleshooting

Pipeline hangs on a bill

All OpenAI API calls have a 120-second timeout. If a call hangs, it times out, an error is logged, and the pipeline skips to the next bill. The skipped bill is retried on the next pipeline run.
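
The timeout-and-skip behavior follows a common pattern, sketched here with a hypothetical `call_api` function that accepts a `timeout` argument and raises the built-in TimeoutError. The real scripts use the OpenAI client, which raises its own exception type, so treat this as an outline of the control flow only.

```python
import logging

TIMEOUT_SECONDS = 120  # timeout quoted in this guide


def process_bills(bills, call_api):
    """Process each bill; on timeout, log the error and move on instead of aborting."""
    done, skipped = {}, []
    for bill_id, text in bills.items():
        try:
            done[bill_id] = call_api(text, timeout=TIMEOUT_SECONDS)
        except TimeoutError:
            logging.error("Timed out on %s; it will be retried on the next run", bill_id)
            skipped.append(bill_id)  # nothing is cached, so the next run picks it up
    return done, skipped
```

Because a skipped bill never reaches the cache, the next pipeline run treats it as unprocessed and tries again.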

Bills show up but summaries/questions/reports are missing

The generate scripts cache their results. If a script was interrupted mid-run, re-run it individually to catch up:

python data_updating_scripts/generate_summaries.py

Main page shows old data after pipeline runs

Streamlit caches data in memory. Hard-refresh the page (Cmd+Shift+R on Mac, Ctrl+Shift+R on Windows) or restart Streamlit.

HuggingFace upload fails

Check that HUGGINGFACE_HUB_TOKEN and HF_REPO_ID are set in .env. Test the connection:

python huggingface_upload.py

"OPENAI_API_KEY not set" error

Ensure your .env file contains the key and that python-dotenv is installed. The pipeline loads .env automatically.

Estimated Runtime

Scenario	Duration
Full pipeline, all bills new	4-6 hours
Incremental (100 new bills)	30-60 minutes
Re-run summaries only (catch-up)	~15-30 min per 1,000 bills
Vectorstore rebuild (all bills)	15-30 minutes
HuggingFace upload	2-5 minutes
