Data Update Pipeline Guide
Overview
The AI Legislation Tracker pipeline pulls AI-related bills from all 50 US states (+ US Congress) via the LegiScan API, processes them through multiple stages (text extraction, categorization, AI-generated summaries/questions/reports), builds searchable vectorstores, and optionally syncs everything to HuggingFace for cloud storage.
Quick Reference
From the Admin Panel (recommended):
- Open the app -> navigate to Admin -> log in -> "Update Data" tab -> click "Update Data"
From the command line:
python update_data.py --pull --overwrite-pdf --continue-on-error
After pipeline completes, sync to cloud:
python huggingface_upload.py
Step-by-Step: Running from the Admin Panel
Step 1: Start the Streamlit App
Open a terminal in the project root and run:
streamlit run streamlit_app.py
The app opens in your browser (typically at http://localhost:8501).
What you see: The main VAILL AI Governance Bills Tracker dashboard with a map, bill table, filters, and analysis tools.
Step 2: Navigate to the Admin Panel
In the left sidebar, click "Admin" to open the Admin page.
What you see: A login form asking for username and password.
Step 3: Log In
Enter your admin credentials (configured in auth_config.json or HuggingFace).
What you see: After login, the Admin Panel with three tabs: Overview, Update Data, and Manage Users. Your name and username appear in the sidebar with a Logout button.
Step 4: Review Current Data (Overview Tab)
The Overview tab shows:
- System Status: Connection status for OpenAI API, LegiScan API, and HuggingFace
- Current Data: Counts for Total Bills, Bills with Details, Summaries, Question Sets, Reports, and Cached Bills
- Last Pipeline Run: Timestamp of the most recent run
- Admin Users: Table of all registered admin accounts
Check that all three API connections show "Connected" before running the pipeline. If any show "Missing", set the corresponding environment variable in your .env file.
Step 5: Run the Pipeline (Update Data Tab)
- Click the "Update Data" tab
- Review the Current Data counts (same metrics as Overview)
- Optional: Check "Skip uploading to HuggingFace after update" if you only want local updates
- Click the blue "Update Data" button
What you see: A live status panel showing real-time log output as each script runs. The pipeline runs 10 scripts sequentially (detailed below). This takes 2-6 hours depending on how many new bills need processing.
Step 6: Monitor Progress
The status panel streams output from each script. You'll see messages like:
--- Running data_updating_scripts/get_data.py ---
Fetching bills for AL (2023-2026)...
...
--- Running data_updating_scripts/generate_summaries.py ---
Processing 1/3487: AL_SJR94
...
Important: You can navigate away from the page; the pipeline runs as a background process. To check on it later, look at the terminal where Streamlit is running, or check the log files in data_updating_scripts/logs/.
Step 7: Review Results
When the pipeline finishes, the Update Data tab shows:
- New / Updated Bills: How many bills were fetched from LegiScan
- Unchanged Bills: Bills that hadn't changed (API calls saved)
- Steps Completed: How many scripts passed vs failed
- Data Changes: Before/after comparison of all data file counts
If HuggingFace upload was enabled, it automatically syncs all JSON files to the cloud.
Step 8: View Updated Data
Hard-refresh the main page (Cmd+Shift+R on Mac, Ctrl+Shift+R on Windows) or restart Streamlit to clear the cache and see the updated bills.
What the Pipeline Does (All 10 Scripts)
The pipeline orchestrator (update_data.py) runs these 10 scripts in order:
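The sequential behavior, together with the --continue-on-error flag used throughout this guide, can be sketched as follows. This is a minimal illustration rather than the actual contents of update_data.py; the function name `run_pipeline` and its return shape are assumptions made for the example.

```python
import subprocess
import sys
from typing import List, Tuple


def run_pipeline(commands: List[List[str]],
                 continue_on_error: bool = False) -> List[Tuple[str, bool]]:
    """Run each command in order and return (label, succeeded) for every command that ran.

    Hypothetical helper: the real orchestrator may track results differently.
    """
    results = []
    for cmd in commands:
        # Label the step by everything after the interpreter, e.g. the script path.
        label = " ".join(cmd[1:]) if len(cmd) > 1 else cmd[0]
        print(f"--- Running {label} ---")
        completed = subprocess.run(cmd)
        ok = completed.returncode == 0
        results.append((label, ok))
        if not ok and not continue_on_error:
            break  # stop at the first failure unless asked to continue
    return results
```

Called with a list such as `[[sys.executable, "data_updating_scripts/get_data.py"], ...]`, the returned pairs map naturally onto the "Steps Completed" count (passed vs failed) shown in the Admin Panel.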
Script 1: get_data.py - Pull Bills from LegiScan
What it does:
- Queries the LegiScan API for "artificial intelligence" bills across all 50 states + US Congress
- Searches years 2023 through the current year
- For each bill found, fetches full bill details (text, sponsors, status, history)
- Uses a local cache (data/bill_cache.json) to skip bills that haven't changed since last pull
- Extracts bill text from base64-encoded documents (HTML or PDF)
Input: LegiScan API
Output: data/known_bills.json, data/bill_cache.json
API used: LegiScan (uses LEGISCAN_API_KEY)
Cost: Free tier allows ~30,000 requests/month
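The cache check is what keeps the request count low on incremental runs. A minimal sketch of the idea, assuming the cache maps a bill ID to the last change hash seen and that each search result carries `bill_id` and `change_hash` fields (the real layout of data/bill_cache.json may differ, and `split_by_change` is a hypothetical name):

```python
from typing import Dict, List, Tuple


def split_by_change(search_results: List[dict],
                    cache: Dict[str, str]) -> Tuple[List[dict], int]:
    """Return (bills that need a detail fetch, count of unchanged bills).

    Assumed cache layout: {"<bill_id>": "<change_hash>"}.
    """
    to_fetch = []
    unchanged = 0
    for result in search_results:
        key = str(result["bill_id"])  # JSON object keys are always strings
        if cache.get(key) == result["change_hash"]:
            unchanged += 1  # same hash as last pull: no detail request needed
        else:
            to_fetch.append(result)  # new bill, or its hash changed
    return to_fetch, unchanged
```

The unchanged count corresponds to the "Unchanged Bills" figure reported at the end of a run.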
Script 2: fix_pdf_bills.py - Extract Text from PDF Bills
What it does:
- Finds bills where the text is still raw base64-encoded PDF content
- Decodes the base64, extracts readable text using PyPDF2
- Also handles HTML-encoded bill text via BeautifulSoup
- Marks successfully processed bills with text_fixed: true to avoid reprocessing
- Creates a backup before modifying data
Input: data/known_bills.json
Output: data/known_bills.json (updated in place), data/known_bills_backup.json
API used: None (local processing only)
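The decode step can be illustrated with the standard library alone. The sketch below decodes the base64 payload, detects a PDF by its `%PDF` header, and otherwise strips HTML tags. It deliberately swaps in Python's built-in `html.parser` for BeautifulSoup and leaves PDF text extraction (PyPDF2 in the real script) out, so treat `decode_bill_document` as a hypothetical stand-in.

```python
import base64
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def decode_bill_document(encoded: str):
    """Return ("pdf", raw_bytes) or ("html", plain_text) for a base64 document."""
    raw = base64.b64decode(encoded)
    if raw.startswith(b"%PDF"):
        # PDF files start with this header; hand the bytes to a PDF library next.
        return "pdf", raw
    collector = _TextCollector()
    collector.feed(raw.decode("utf-8", errors="replace"))
    collector.close()
    # Note: script/style contents are not filtered out in this simplified version.
    return "html", " ".join(collector.parts)
```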
Script 3: known_bills_status.py - Clean and Merge Bill Data
What it does:
- Merges raw bill data from known_bills.json with the existing visualization dataset
- Maps numeric status codes to human-readable labels (e.g., 1 -> "Introduced", 4 -> "Signed Into Law")
- Preserves existing IAPP categories and other enrichments from previous runs
- Removes bills that are no longer in the source data
Input: data/known_bills.json, data/known_bills_visualize.json (existing)
Output: data/known_bills_fixed.json, data/known_bills_visualize.json
API used: None (local processing only)
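A rough sketch of the merge logic described above. The field names (`bill_id`, `status`, `status_label`) are assumptions, and the label table contains only the two codes this guide mentions; the real script presumably maps the full set.

```python
from typing import Dict, List

# Only the two mappings named in this guide; anything else falls back to "Unknown".
STATUS_LABELS: Dict[int, str] = {1: "Introduced", 4: "Signed Into Law"}


def merge_bills(raw_bills: List[dict], existing_bills: List[dict]) -> List[dict]:
    """Merge fresh bill data with enrichments kept from the previous run."""
    existing_by_id = {bill["bill_id"]: bill for bill in existing_bills}
    merged = []
    # Iterating over the raw data drops bills that are no longer in the source.
    for bill in raw_bills:
        previous = existing_by_id.get(bill["bill_id"], {})
        row = dict(bill)
        row["status_label"] = STATUS_LABELS.get(bill.get("status"), "Unknown")
        if "iapp_categories" in previous:
            # Preserve categories so the bill is not re-categorized unnecessarily.
            row["iapp_categories"] = previous["iapp_categories"]
        merged.append(row)
    return merged
```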
Script 4: migrate_iapp_categories.py - Categorize Bills (IAPP Framework)
What it does:
- Analyzes each bill's text using OpenAI to categorize it under the IAPP AI governance framework
- Assigns bills to 4 main categories: Governance, Transparency, Assurance, Individual Rights
- Each category has specific subcategories (e.g., "Program and documentation", "Opt out/appeal")
- Uses a local cache (data/iapp_categories_cache.json) to avoid reprocessing unchanged bills
- Falls back to default categories if the API call fails
Input: data/known_bills_fixed.json, data/known_bills_visualize.json
Output: data/known_bills_visualize.json (updated with iapp_categories field), data/iapp_categories_cache.json
API used: OpenAI Chat API (OPENAI_API_KEY)
Cost: ~$0.02 per bill
Script 5: mark_no_text_bills.py - Flag Bills Without Text
What it does:
- Scans all bills and marks those with missing or very short text (< 50 characters)
- Sets iapp_categories to null for these bills so they're excluded from categorization displays
- This prevents empty bills from cluttering analysis results
Input: data/known_bills_visualize.json
Output: data/known_bills_visualize.json (updated in place)
API used: None (local processing only)
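The flagging rule is simple enough to show in a few lines. This sketch assumes the bill text lives in a `text` field and that surrounding whitespace is ignored when measuring length; neither detail is confirmed by the script itself.

```python
from typing import List

MIN_TEXT_LENGTH = 50  # threshold stated in this guide


def mark_no_text_bills(bills: List[dict]) -> int:
    """Set iapp_categories to None on bills with missing or very short text.

    Modifies the bills in place and returns how many were flagged.
    """
    flagged = 0
    for bill in bills:
        text = (bill.get("text") or "").strip()
        if len(text) < MIN_TEXT_LENGTH:
            bill["iapp_categories"] = None  # serialized as null in JSON
            flagged += 1
    return flagged
```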
Script 6: generate_summaries.py - Generate Bill Summaries
What it does:
- For each bill with text, generates a concise AI summary explaining what the bill does
- Skips bills that already have a valid summary in the cache
- Saves progress every 10 bills (safe to interrupt and resume)
- Summaries are stored separately from bill data to avoid reprocessing
Input: data/known_bills_visualize.json, data/bill_summaries.json (existing cache)
Output: data/bill_summaries.json
API used: OpenAI Chat API
Cost: ~$0.025 per bill
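The skip-and-checkpoint pattern that makes this script safe to interrupt can be sketched as below. The OpenAI call is replaced by an injected `summarize` function so the example stays self-contained, and the `key` field (a bill identifier such as AL_SJR94) and function names are assumptions.

```python
import json
import os
from pathlib import Path
from typing import Callable, Dict, List


def _save(cache: Dict[str, str], cache_path: Path) -> None:
    """Write to a temporary file first so an interrupt cannot corrupt the cache."""
    tmp_path = cache_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    os.replace(tmp_path, cache_path)


def generate_with_checkpoints(bills: List[dict], cache_path: Path,
                              summarize: Callable[[dict], str],
                              save_every: int = 10) -> Dict[str, str]:
    """Summarize bills missing from the cache, saving every `save_every` new results."""
    cache: Dict[str, str] = {}
    if cache_path.exists():
        cache = json.loads(cache_path.read_text(encoding="utf-8"))
    unsaved = 0
    for bill in bills:
        if cache.get(bill["key"]):
            continue  # already has a non-empty summary from a previous run
        cache[bill["key"]] = summarize(bill)
        unsaved += 1
        if unsaved >= save_every:
            _save(cache, cache_path)
            unsaved = 0
    if unsaved:
        _save(cache, cache_path)  # flush the final partial batch
    return cache
```

With this shape, an interrupted run loses at most the last few unsaved results, and re-running the script only pays for bills that are still missing.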
Script 7: generate_suggested_questions.py - Generate Discussion Questions
What it does:
- For each bill, generates 5 specific questions that a user might ask about the bill
- These questions appear in the UI as clickable suggestions when viewing a bill
- Falls back to generic questions if generation fails
- Saves progress every 10 bills
Input: data/known_bills_visualize.json, data/bill_suggested_questions.json (existing cache)
Output: data/bill_suggested_questions.json
API used: OpenAI Chat API
Cost: ~$0.02 per bill
Script 8: generate_reports.py - Generate Detailed Reports
What it does:
- For each bill, generates a detailed Markdown report covering: title, status, sponsors, goals, key provisions, regulatory approaches, enforcement mechanisms, and notable features
- Reports are the longest AI-generated content (~1-2 pages per bill)
- Saves progress every 10 bills (resumable)
Input: data/known_bills_visualize.json, data/bill_reports.json (existing cache)
Output: data/bill_reports.json
API used: OpenAI Chat API (uses gpt-4o)
Cost: ~$0.045 per bill
Script 9: build_bills_vectorstore.py - Build Searchable Vectorstore
What it does:
- Converts all bills into vector embeddings for semantic search
- Splits long bill text into overlapping chunks (1500 chars, 200 overlap)
- Stores embeddings in a ChromaDB vectorstore on disk
- Uses a manifest file to skip bills that haven't changed since last build
- This vectorstore powers the "Ask a Question" and "Compare Bills" features in the app
Input: data/known_bills_visualize.json
Output: data/bills_vectorstore/ (ChromaDB files), data/bills_vectorstore_manifest.json
API used: OpenAI Embeddings API (text-embedding-3-small)
Cost: ~$0.0001 per bill (very cheap)
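To make the chunking parameters concrete, here is a plain fixed-width splitter using the 1500/200 values above. The guide does not say which splitter the script uses; library splitters often prefer to break on paragraph or sentence boundaries, so this is only an approximation of the behavior.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> List[str]:
    """Split text into chunks of up to chunk_size characters.

    Each chunk starts `overlap` characters before the previous one ended,
    so a sentence cut at a boundary still appears whole in one of the chunks.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks
```

A 3,000-character bill yields three chunks (1500, 1500, and 400 characters), with the last 200 characters of each chunk repeated at the start of the next.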
Script 10: eu_vectorstore.py - Build EU AI Act Vectorstore
What it does:
- Extracts text from the EU AI Act PDF (data_updating_scripts/eu-ai-act.pdf)
- Splits it into chunks and creates a FAISS vectorstore
- This vectorstore powers the "Compare with EU AI Act" feature in the app
Input: data_updating_scripts/eu-ai-act.pdf
Output: data/eu_ai_act_vectorstore/ (FAISS index + metadata)
API used: OpenAI Embeddings API
Cost: ~$0.01 (one-time, small document)
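The per-bill figures quoted for Scripts 4 and 6-9 can be added up to get a rough budget for a run. The sketch below uses only the approximate costs listed in this guide and assumes every new bill passes through all five paid steps; because of caching, unchanged bills should cost nothing. The function name is hypothetical.

```python
# Approximate per-bill costs in USD, copied from the script descriptions above.
PER_BILL_COST_USD = {
    "migrate_iapp_categories.py": 0.02,
    "generate_summaries.py": 0.025,
    "generate_suggested_questions.py": 0.02,
    "generate_reports.py": 0.045,
    "build_bills_vectorstore.py": 0.0001,
}
EU_VECTORSTORE_ONE_TIME_USD = 0.01


def estimate_openai_cost(new_bills: int, include_eu_vectorstore: bool = False) -> float:
    """Return a rough OpenAI cost estimate in USD for processing `new_bills` bills."""
    total = new_bills * sum(PER_BILL_COST_USD.values())
    if include_eu_vectorstore:
        total += EU_VECTORSTORE_ONE_TIME_USD
    return round(total, 2)
```

Under these assumptions each new bill costs about $0.11 end to end, so an incremental run with 100 new bills comes to roughly $11.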
Data Flow Diagram
LegiScan API
|
v
[1] get_data.py
|
v
data/known_bills.json + data/bill_cache.json
|
v
[2] fix_pdf_bills.py (extract text from PDFs)
|
v
data/known_bills.json (text extracted)
|
v
[3] known_bills_status.py (merge + clean statuses)
|
v
data/known_bills_fixed.json + data/known_bills_visualize.json
|
+---> [4] migrate_iapp_categories.py (IAPP categorization via OpenAI)
|         |
|         v
|     data/known_bills_visualize.json (with iapp_categories)
|
+---> [5] mark_no_text_bills.py (flag empty bills)
|         |
|         v
|     data/known_bills_visualize.json (final version)
|
+---> [6] generate_summaries.py ---------> data/bill_summaries.json
|
+---> [7] generate_suggested_questions.py -> data/bill_suggested_questions.json
|
+---> [8] generate_reports.py ------------> data/bill_reports.json
|
+---> [9] build_bills_vectorstore.py -----> data/bills_vectorstore/
|
+---> [10] eu_vectorstore.py -------------> data/eu_ai_act_vectorstore/
|
v
[Optional] huggingface_upload.py (sync all JSONs to HuggingFace)
|
v
HuggingFace Datasets Hub
|
v
streamlit_app.py reads data/known_bills_visualize.json
                     + data/bill_summaries.json
                     + data/bill_suggested_questions.json
                     + data/bill_reports.json
                     + data/bills_vectorstore/
                     + data/eu_ai_act_vectorstore/
|
v
Website displays all bills with interactive features
Files Produced by the Pipeline
| File | Description | Used By |
|---|---|---|
| data/known_bills.json | Raw bill data from LegiScan | Pipeline scripts |
| data/known_bills_backup.json | Backup before PDF text extraction | Recovery only |
| data/known_bills_fixed.json | Bills with cleaned statuses | Pipeline scripts |
| data/known_bills_visualize.json | Main data file: bills with IAPP categories | Website + all scripts |
| data/bill_cache.json | LegiScan change hashes (skip unchanged bills) | get_data.py |
| data/iapp_categories_cache.json | IAPP categorization cache | migrate_iapp_categories.py |
| data/bill_summaries.json | AI-generated bill summaries | Website |
| data/bill_suggested_questions.json | AI-generated discussion questions | Website |
| data/bill_reports.json | AI-generated detailed reports | Website |
| data/bills_vectorstore/ | ChromaDB vectorstore for bill search | Website (Q&A, Compare) |
| data/bills_vectorstore_manifest.json | Tracks which bills are in vectorstore | build_bills_vectorstore.py |
| data/eu_ai_act_vectorstore/ | FAISS vectorstore for EU AI Act | Website (EU comparison) |
Running from the Command Line
Full pipeline (pull new data + process everything):
python update_data.py --pull --overwrite-pdf --continue-on-error
Skip LegiScan pull (reprocess existing local data only):
python update_data.py --no-pull --continue-on-error
Skip HuggingFace upload:
python update_data.py --pull --overwrite-pdf --continue-on-error --skip-upload
Test mode (pull only 3 bills from CA to verify pipeline works):
python update_data.py --test --continue-on-error --skip-upload
Run individual scripts:
# Re-run just summaries (e.g., to catch up after a hang)
python data_updating_scripts/generate_summaries.py
# Re-run just questions
python data_updating_scripts/generate_suggested_questions.py
# Re-run just reports
python data_updating_scripts/generate_reports.py
# Rebuild vectorstore
python data_updating_scripts/build_bills_vectorstore.py
# Upload all data to HuggingFace
python huggingface_upload.py
Environment Variables Required
Set these in a .env file in the project root:
| Variable | Required For | Description |
|---|---|---|
| LEGISCAN_API_KEY | Script 1 | LegiScan API key for pulling bill data |
| OPENAI_API_KEY | Scripts 4, 6-10 | OpenAI API key for categorization, summaries, questions, reports, and embeddings |
| HUGGINGFACE_HUB_TOKEN | HF upload | HuggingFace API token |
| HF_REPO_ID | HF upload | HuggingFace dataset repo (e.g., username/dataset-name) |
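A quick pre-flight check along these lines can catch a missing key before a multi-hour run starts. It is a sketch, not part of the project: `missing_variables` is a hypothetical helper, and it only inspects variables already present in the process environment (it does not load .env itself).

```python
import os
from typing import List, Mapping, Optional

# Variable names from the table above, mapped to the step that needs them.
REQUIRED_VARIABLES = {
    "LEGISCAN_API_KEY": "Script 1",
    "OPENAI_API_KEY": "Scripts 4, 6-10",
    "HUGGINGFACE_HUB_TOKEN": "HF upload",
    "HF_REPO_ID": "HF upload",
}


def missing_variables(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]
```

For example, `for name in missing_variables(): print(name, "is needed for", REQUIRED_VARIABLES[name])` prints one line per missing variable; when running with --skip-upload, the two HuggingFace entries can be ignored.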
Troubleshooting
Pipeline hangs on a bill
All OpenAI API calls have a 120-second timeout. If a call hangs, it times out, the script logs an error, and processing moves on to the next bill. The skipped bill is retried on the next pipeline run.
Bills show up but summaries/questions/reports are missing
The generate scripts cache their results. If a script was interrupted mid-run, re-run it individually to catch up:
python data_updating_scripts/generate_summaries.py
Main page shows old data after pipeline runs
Streamlit caches data in memory. Hard-refresh the page (Cmd+Shift+R on Mac, Ctrl+Shift+R on Windows) or restart Streamlit.
HuggingFace upload fails
Check that HUGGINGFACE_HUB_TOKEN and HF_REPO_ID are set in .env. Test the connection:
python huggingface_upload.py
"OPENAI_API_KEY not set" error
Ensure your .env file contains the key and that python-dotenv is installed. The pipeline loads .env automatically.
Estimated Runtime
| Scenario | Duration |
|---|---|
| Full pipeline, all bills new | 4-6 hours |
| Incremental (100 new bills) | 30-60 minutes |
| Re-run summaries only (catch-up) | ~15-30 min per 1,000 bills |
| Vectorstore rebuild (all bills) | 15-30 minutes |
| HuggingFace upload | 2-5 minutes |