
Data Update Pipeline Guide

Overview

The AI Legislation Tracker pipeline pulls AI-related bills from all 50 US states (+ US Congress) via the LegiScan API, processes them through multiple stages (text extraction, categorization, AI-generated summaries/questions/reports), builds searchable vectorstores, and optionally syncs everything to HuggingFace for cloud storage.


Quick Reference

From the Admin Panel (recommended):

  1. Open the app -> navigate to Admin -> log in -> "Update Data" tab -> click "Update Data"

From the command line:

python update_data.py --pull --overwrite-pdf --continue-on-error

After pipeline completes, sync to cloud:

python huggingface_upload.py

Step-by-Step: Running from the Admin Panel

Step 1: Start the Streamlit App

Open a terminal in the project root and run:

streamlit run streamlit_app.py

The app opens in your browser (typically at http://localhost:8501).

What you see: The main VAILL AI Governance Bills Tracker dashboard with a map, bill table, filters, and analysis tools.

Step 2: Navigate to the Admin Panel

In the left sidebar, click "Admin" to open the Admin page.

What you see: A login form asking for username and password.

Step 3: Log In

Enter your admin credentials (configured in auth_config.json or HuggingFace).

What you see: After login, the Admin Panel with three tabs: Overview, Update Data, and Manage Users. Your name and username appear in the sidebar with a Logout button.

Step 4: Review Current Data (Overview Tab)

The Overview tab shows:

  • System Status: Connection status for OpenAI API, LegiScan API, and HuggingFace
  • Current Data: Counts for Total Bills, Bills with Details, Summaries, Question Sets, Reports, and Cached Bills
  • Last Pipeline Run: Timestamp of the most recent run
  • Admin Users: Table of all registered admin accounts

Check that all three API connections show "Connected" before running the pipeline. If any show "Missing", set the corresponding environment variable in your .env file.

Step 5: Run the Pipeline (Update Data Tab)

  1. Click the "Update Data" tab
  2. Review the Current Data counts (same metrics as Overview)
  3. Optional: Check "Skip uploading to HuggingFace after update" if you only want local updates
  4. Click the blue "Update Data" button

What you see: A live status panel showing real-time log output as each script runs. The pipeline runs 10 scripts sequentially (detailed below). Expect roughly 30-60 minutes for an incremental run of about 100 new bills, and 4-6 hours when every bill is new.

Step 6: Monitor Progress

The status panel streams output from each script. You'll see messages like:

--- Running data_updating_scripts/get_data.py ---
Fetching bills for AL (2023-2026)...
...
--- Running data_updating_scripts/generate_summaries.py ---
Processing 1/3487: AL_SJR94
...

Important: You can navigate away from the page; the pipeline runs as a background process. To check on it later, look at the terminal where Streamlit is running, or check the log files in data_updating_scripts/logs/.

Step 7: Review Results

When the pipeline finishes, the Update Data tab shows:

  • New / Updated Bills: How many bills were fetched from LegiScan
  • Unchanged Bills: Bills that hadn't changed (API calls saved)
  • Steps Completed: How many scripts passed vs failed
  • Data Changes: Before/after comparison of all data file counts

If HuggingFace upload was enabled, the pipeline then syncs all JSON files to the cloud automatically.

Step 8: View Updated Data

Hard-refresh the main page (Cmd+Shift+R on Mac, Ctrl+Shift+R on Windows) or restart Streamlit to clear the cache and see the updated bills.


What the Pipeline Does (All 10 Scripts)

The pipeline orchestrator (update_data.py) runs these 10 scripts in order:
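As a rough illustration of what "runs these 10 scripts in order" can look like, here is a minimal sketch of a sequential runner with a continue-on-error switch. It is not the actual update_data.py: the function names run_script and run_pipeline, the runner parameter, and the shape of the returned result are assumptions made for the example. Only the script names, the data_updating_scripts/ path, and the log line format come from this guide.

```python
import subprocess
import sys
from typing import Callable, Iterable

# Script order as listed in this guide.
SCRIPTS = [
    "get_data.py",
    "fix_pdf_bills.py",
    "known_bills_status.py",
    "migrate_iapp_categories.py",
    "mark_no_text_bills.py",
    "generate_summaries.py",
    "generate_suggested_questions.py",
    "generate_reports.py",
    "build_bills_vectorstore.py",
    "eu_vectorstore.py",
]


def run_script(path: str) -> int:
    """Run one script in a child Python process and return its exit code."""
    return subprocess.run([sys.executable, path]).returncode


def run_pipeline(
    scripts: Iterable[str] = SCRIPTS,
    continue_on_error: bool = False,
    runner: Callable[[str], int] = run_script,  # hypothetical hook, handy for dry runs
) -> dict:
    """Run scripts in order; stop at the first failure unless continue_on_error is set."""
    results = {"passed": [], "failed": []}
    for name in scripts:
        path = f"data_updating_scripts/{name}"
        print(f"--- Running {path} ---")
        if runner(path) == 0:
            results["passed"].append(name)
        else:
            results["failed"].append(name)
            if not continue_on_error:
                break
    return results
```

Passing a stand-in runner (for example, one that always returns 0) lets you exercise the control flow without touching any API.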

Script 1: get_data.py - Pull Bills from LegiScan

What it does:

  • Queries the LegiScan API for "artificial intelligence" bills across all 50 states + US Congress
  • Searches years 2023 through the current year
  • For each bill found, fetches full bill details (text, sponsors, status, history)
  • Uses a local cache (data/bill_cache.json) to skip bills that haven't changed since last pull
  • Extracts bill text from base64-encoded documents (HTML or PDF)

  • Input: LegiScan API
  • Output: data/known_bills.json, data/bill_cache.json
  • API used: LegiScan (uses LEGISCAN_API_KEY)
  • Cost: Free tier allows ~30,000 requests/month
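The cache check can be pictured roughly as follows. This is a sketch under assumptions: it supposes that data/bill_cache.json maps each bill ID to the change hash seen on the last pull, and that each search result carries bill_id and change_hash fields. The helper names load_cache and bills_to_fetch are made up for the example.

```python
import json
from pathlib import Path


def load_cache(path: Path) -> dict:
    """Return the cached {bill_id: change_hash} mapping, or {} if no cache exists yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def bills_to_fetch(search_results: list[dict], cache: dict) -> list[dict]:
    """Keep only bills that are new or whose change hash differs from the cached one."""
    return [
        bill
        for bill in search_results
        if cache.get(str(bill["bill_id"])) != bill["change_hash"]
    ]
```

Only the bills returned by bills_to_fetch would need a full detail request, which is where the saved API calls come from.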

Script 2: fix_pdf_bills.py - Extract Text from PDF Bills

What it does:

  • Finds bills where the text is still raw base64-encoded PDF content
  • Decodes the base64, extracts readable text using PyPDF2
  • Also handles HTML-encoded bill text via BeautifulSoup
  • Marks successfully processed bills with text_fixed: true to avoid reprocessing
  • Creates a backup before modifying data

  • Input: data/known_bills.json
  • Output: data/known_bills.json (updated in place), data/known_bills_backup.json
  • API used: None (local processing only)
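To make the decode step concrete, the sketch below decodes a base64 document, guesses whether it is a PDF from its leading bytes, and strips tags from HTML. It uses only the standard library as a stand-in: the real script is described as using PyPDF2 and BeautifulSoup, and the names classify_document and html_to_text are hypothetical. PDF text extraction itself is left out.

```python
import base64
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.parts.append(data.strip())


def classify_document(encoded: str) -> tuple[str, bytes]:
    """Decode a base64 document and report whether it looks like a PDF or HTML."""
    raw = base64.b64decode(encoded)
    # PDF files begin with the "%PDF" signature; anything else is treated as HTML here.
    kind = "pdf" if raw.startswith(b"%PDF") else "html"
    return kind, raw


def html_to_text(raw: bytes) -> str:
    """Return the visible text of an HTML document as one space-joined string."""
    parser = _TextCollector()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    return " ".join(parser.parts)
```

A document classified as "pdf" would then be handed to a PDF library for text extraction.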

Script 3: known_bills_status.py - Clean and Merge Bill Data

What it does:

  • Merges raw bill data from known_bills.json with the existing visualization dataset
  • Maps numeric status codes to human-readable labels (e.g., 1 -> "Introduced", 4 -> "Signed Into Law")
  • Preserves existing IAPP categories and other enrichments from previous runs
  • Removes bills that are no longer in the source data

  • Input: data/known_bills.json, data/known_bills_visualize.json (existing)
  • Output: data/known_bills_fixed.json, data/known_bills_visualize.json
  • API used: None (local processing only)
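The merge rules above can be sketched as a small function. This is an illustration, not the script itself: it assumes bills are keyed by ID in a dict, and the field names status and status_label are hypothetical. Only the two status codes named in this guide are mapped; the rest of the table is deliberately left out.

```python
# Only the two codes named in this guide; other codes fall back to "Unknown" here.
STATUS_LABELS = {1: "Introduced", 4: "Signed Into Law"}


def merge_bills(raw: dict, existing: dict) -> dict:
    """Merge fresh bill data over existing records, keyed by bill ID."""
    merged = {}
    for bill_id, bill in raw.items():
        record = dict(existing.get(bill_id, {}))  # start from prior enrichments
        record.update(bill)  # fresh fields from the source win
        record["status_label"] = STATUS_LABELS.get(bill.get("status"), "Unknown")
        merged[bill_id] = record
    # Bills present only in `existing` are not copied, so they drop out.
    return merged
```

Because the loop iterates over the raw data, enrichments such as iapp_categories survive for bills that still exist, while bills no longer in the source disappear.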

Script 4: migrate_iapp_categories.py - Categorize Bills (IAPP Framework)

What it does:

  • Analyzes each bill's text using OpenAI to categorize it under the IAPP AI governance framework
  • Assigns bills to 4 main categories: Governance, Transparency, Assurance, Individual Rights
  • Each category has specific subcategories (e.g., "Program and documentation", "Opt out/appeal")
  • Uses a local cache (data/iapp_categories_cache.json) to avoid reprocessing unchanged bills
  • Falls back to default categories if the API call fails

  • Input: data/known_bills_fixed.json, data/known_bills_visualize.json
  • Output: data/known_bills_visualize.json (updated with iapp_categories field), data/iapp_categories_cache.json
  • API used: OpenAI Chat API (OPENAI_API_KEY)
  • Cost: ~$0.02 per bill

Script 5: mark_no_text_bills.py - Flag Bills Without Text

What it does:

  • Scans all bills and marks those with missing or very short text (< 50 characters)
  • Sets iapp_categories to null for these bills so they're excluded from categorization displays
  • This prevents empty bills from cluttering analysis results

  • Input: data/known_bills_visualize.json
  • Output: data/known_bills_visualize.json (updated in place)
  • API used: None (local processing only)
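The flagging rule is simple enough to show in a few lines. In this sketch the text field name and the whitespace stripping are assumptions; the 50-character threshold and the null iapp_categories value come from the description above.

```python
MIN_TEXT_CHARS = 50  # threshold stated in this guide


def mark_no_text(bills: list[dict]) -> int:
    """Set iapp_categories to None for bills with missing or very short text.

    Returns the number of bills flagged. None becomes null when written as JSON.
    """
    flagged = 0
    for bill in bills:
        text = (bill.get("text") or "").strip()
        if len(text) < MIN_TEXT_CHARS:
            bill["iapp_categories"] = None
            flagged += 1
    return flagged
```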

Script 6: generate_summaries.py - Generate Bill Summaries

What it does:

  • For each bill with text, generates a concise AI summary explaining what the bill does
  • Skips bills that already have a valid summary in the cache
  • Saves progress every 10 bills (safe to interrupt and resume)
  • Summaries are stored separately from bill data to avoid reprocessing

  • Input: data/known_bills_visualize.json, data/bill_summaries.json (existing cache)
  • Output: data/bill_summaries.json
  • API used: OpenAI Chat API
  • Cost: ~$0.025 per bill
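The skip-and-checkpoint behaviour is what makes the script safe to interrupt. One way to picture it is the loop below, a sketch that assumes the cache is a flat {bill_id: summary} JSON object. The summarize argument is a placeholder for whatever call produces the summary, and summarize_all is a hypothetical name.

```python
import json
from pathlib import Path
from typing import Callable

SAVE_EVERY = 10  # checkpoint interval stated in this guide


def summarize_all(bills: dict, cache_path: Path, summarize: Callable[[str], str]) -> dict:
    """Fill in missing summaries, writing the cache to disk every SAVE_EVERY new bills."""
    cache = json.loads(cache_path.read_text(encoding="utf-8")) if cache_path.exists() else {}
    done_since_save = 0
    for bill_id, text in bills.items():
        if cache.get(bill_id):  # already summarized on an earlier run
            continue
        cache[bill_id] = summarize(text)
        done_since_save += 1
        if done_since_save == SAVE_EVERY:
            cache_path.write_text(json.dumps(cache, indent=2), encoding="utf-8")
            done_since_save = 0
    # Final flush so the last partial batch is not lost.
    cache_path.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return cache
```

With this shape, an interrupted run loses at most the last nine summaries, and a re-run only pays for bills that are still missing.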

Script 7: generate_suggested_questions.py - Generate Discussion Questions

What it does:

  • For each bill, generates 5 specific questions that a user might ask about the bill
  • These questions appear in the UI as clickable suggestions when viewing a bill
  • Falls back to generic questions if generation fails
  • Saves progress every 10 bills

  • Input: data/known_bills_visualize.json, data/bill_suggested_questions.json (existing cache)
  • Output: data/bill_suggested_questions.json
  • API used: OpenAI Chat API
  • Cost: ~$0.02 per bill

Script 8: generate_reports.py - Generate Detailed Reports

What it does:

  • For each bill, generates a detailed Markdown report covering: title, status, sponsors, goals, key provisions, regulatory approaches, enforcement mechanisms, and notable features
  • Reports are the longest AI-generated content (~1-2 pages per bill)
  • Saves progress every 10 bills (resumable)

  • Input: data/known_bills_visualize.json, data/bill_reports.json (existing cache)
  • Output: data/bill_reports.json
  • API used: OpenAI Chat API (uses gpt-4o)
  • Cost: ~$0.045 per bill

Script 9: build_bills_vectorstore.py - Build Searchable Vectorstore

What it does:

  • Converts all bills into vector embeddings for semantic search
  • Splits long bill text into overlapping chunks (1500 chars, 200 overlap)
  • Stores embeddings in a ChromaDB vectorstore on disk
  • Uses a manifest file to skip bills that haven't changed since last build
  • This vectorstore powers the "Ask a Question" and "Compare Bills" features in the app

  • Input: data/known_bills_visualize.json
  • Output: data/bills_vectorstore/ (ChromaDB files), data/bills_vectorstore_manifest.json
  • API used: OpenAI Embeddings API (text-embedding-3-small)
  • Cost: ~$0.0001 per bill (very cheap)
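To show what "overlapping chunks (1500 chars, 200 overlap)" means in practice, here is a plain character-window splitter. The guide does not say which splitter the script uses, and a library splitter may prefer to break on paragraph or sentence boundaries, so treat this as an illustration of the sizes only. The name chunk_text is hypothetical.

```python
def chunk_text(text: str, size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most `size` chars; each shares `overlap` chars with the previous."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # this window already reached the end
            break
    return chunks
```

With the default sizes the window advances 1,300 characters at a time, so the last 200 characters of one chunk are repeated at the start of the next.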

Script 10: eu_vectorstore.py - Build EU AI Act Vectorstore

What it does:

  • Extracts text from the EU AI Act PDF (data_updating_scripts/eu-ai-act.pdf)
  • Splits it into chunks and creates a FAISS vectorstore
  • This vectorstore powers the "Compare with EU AI Act" feature in the app

  • Input: data_updating_scripts/eu-ai-act.pdf
  • Output: data/eu_ai_act_vectorstore/ (FAISS index + metadata)
  • API used: OpenAI Embeddings API
  • Cost: ~$0.01 (one-time, small document)
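Adding up the per-bill cost figures quoted for Scripts 4 and 6-9 gives a quick way to budget a run. The sketch below does that arithmetic. The figures are the approximate ones from this guide, the one-time EU AI Act cost is ignored, and estimate_cost is a hypothetical helper rather than part of the pipeline.

```python
# Approximate per-bill OpenAI costs in USD, copied from the script descriptions in this guide.
COST_PER_BILL = {
    "migrate_iapp_categories": 0.02,
    "generate_summaries": 0.025,
    "generate_suggested_questions": 0.02,
    "generate_reports": 0.045,
    "build_bills_vectorstore": 0.0001,
}


def estimate_cost(new_bills: int) -> float:
    """Rough OpenAI spend in USD for processing `new_bills` uncached bills."""
    return round(new_bills * sum(COST_PER_BILL.values()), 2)
```

The per-bill total is about $0.11, so an incremental run of 100 new bills comes to roughly $11. Cached bills cost nothing to re-run.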


Data Flow Diagram

LegiScan API
    |
    v
[1] get_data.py
    |
    v
data/known_bills.json  +  data/bill_cache.json
    |
    v
[2] fix_pdf_bills.py  (extract text from PDFs)
    |
    v
data/known_bills.json  (text extracted)
    |
    v
[3] known_bills_status.py  (merge + clean statuses)
    |
    v
data/known_bills_fixed.json  +  data/known_bills_visualize.json
    |
    +---> [4] migrate_iapp_categories.py  (IAPP categorization via OpenAI)
    |         |
    |         v
    |     data/known_bills_visualize.json  (with iapp_categories)
    |
    +---> [5] mark_no_text_bills.py  (flag empty bills)
    |         |
    |         v
    |     data/known_bills_visualize.json  (final version)
    |
    +---> [6] generate_summaries.py ---------> data/bill_summaries.json
    |
    +---> [7] generate_suggested_questions.py -> data/bill_suggested_questions.json
    |
    +---> [8] generate_reports.py ------------> data/bill_reports.json
    |
    +---> [9] build_bills_vectorstore.py -----> data/bills_vectorstore/
    |
    +---> [10] eu_vectorstore.py -------------> data/eu_ai_act_vectorstore/
    |
    v
[Optional] huggingface_upload.py  (sync all JSONs to HuggingFace)
    |
    v
HuggingFace Datasets Hub
    |
    v
streamlit_app.py reads data/known_bills_visualize.json
    + data/bill_summaries.json
    + data/bill_suggested_questions.json
    + data/bill_reports.json
    + data/bills_vectorstore/
    + data/eu_ai_act_vectorstore/
    |
    v
Website displays all bills with interactive features

Files Produced by the Pipeline

| File | Description | Used By |
|------|-------------|---------|
| data/known_bills.json | Raw bill data from LegiScan | Pipeline scripts |
| data/known_bills_backup.json | Backup before PDF text extraction | Recovery only |
| data/known_bills_fixed.json | Bills with cleaned statuses | Pipeline scripts |
| data/known_bills_visualize.json | Main data file: bills with IAPP categories | Website + all scripts |
| data/bill_cache.json | LegiScan change hashes (skip unchanged bills) | get_data.py |
| data/iapp_categories_cache.json | IAPP categorization cache | migrate_iapp_categories.py |
| data/bill_summaries.json | AI-generated bill summaries | Website |
| data/bill_suggested_questions.json | AI-generated discussion questions | Website |
| data/bill_reports.json | AI-generated detailed reports | Website |
| data/bills_vectorstore/ | ChromaDB vectorstore for bill search | Website (Q&A, Compare) |
| data/bills_vectorstore_manifest.json | Tracks which bills are in vectorstore | build_bills_vectorstore.py |
| data/eu_ai_act_vectorstore/ | FAISS vectorstore for EU AI Act | Website (EU comparison) |

Running from the Command Line

Full pipeline (pull new data + process everything):

python update_data.py --pull --overwrite-pdf --continue-on-error

Skip LegiScan pull (reprocess existing local data only):

python update_data.py --no-pull --continue-on-error

Skip HuggingFace upload:

python update_data.py --pull --overwrite-pdf --continue-on-error --skip-upload

Test mode (pull only 3 bills from CA to verify pipeline works):

python update_data.py --test --continue-on-error --skip-upload

Run individual scripts:

# Re-run just summaries (e.g., to catch up after a hang)
python data_updating_scripts/generate_summaries.py

# Re-run just questions
python data_updating_scripts/generate_suggested_questions.py

# Re-run just reports
python data_updating_scripts/generate_reports.py

# Rebuild vectorstore
python data_updating_scripts/build_bills_vectorstore.py

# Upload all data to HuggingFace
python huggingface_upload.py

Environment Variables Required

Set these in a .env file in the project root:

| Variable | Required For | Description |
|----------|--------------|-------------|
| LEGISCAN_API_KEY | Script 1 | LegiScan API key for pulling bill data |
| OPENAI_API_KEY | Scripts 4, 6-10 | OpenAI API key for summaries, reports, embeddings |
| HUGGINGFACE_HUB_TOKEN | HF upload | HuggingFace API token |
| HF_REPO_ID | HF upload | HuggingFace dataset repo (e.g., username/dataset-name) |
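A quick preflight check can catch a missing variable before a multi-hour run starts. The sketch below is one way to do it. It assumes the .env file has already been loaded into the process environment (for example by python-dotenv), and the helper name missing_env_vars is hypothetical.

```python
import os
from typing import Mapping, Optional

# Variable names and what needs them, as listed in this guide.
REQUIRED_VARS = {
    "LEGISCAN_API_KEY": "Script 1 (LegiScan pull)",
    "OPENAI_API_KEY": "Scripts 4, 6-10 (OpenAI calls)",
    "HUGGINGFACE_HUB_TOKEN": "HuggingFace upload",
    "HF_REPO_ID": "HuggingFace upload",
}


def missing_env_vars(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    for name in missing_env_vars():
        print(f"Missing {name}: needed for {REQUIRED_VARS[name]}")
```

If you plan to skip the HuggingFace upload, the two HuggingFace variables can be ignored.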

Troubleshooting

Pipeline hangs on a bill

All OpenAI API calls have a 120-second timeout. If a call hangs, it times out, logs an error, and the script skips to the next bill. The skipped bill is retried on the next pipeline run.

Bills show up but summaries/questions/reports are missing

The generate scripts cache their results. If a script was interrupted mid-run, re-run it individually to catch up:

python data_updating_scripts/generate_summaries.py

Main page shows old data after pipeline runs

Streamlit caches data in memory. Hard-refresh the page (Cmd+Shift+R on Mac, Ctrl+Shift+R on Windows) or restart Streamlit.

HuggingFace upload fails

Check that HUGGINGFACE_HUB_TOKEN and HF_REPO_ID are set in .env. Test the connection:

python huggingface_upload.py

"OPENAI_API_KEY not set" error

Ensure your .env file contains the key and that python-dotenv is installed. The pipeline loads .env automatically.


Estimated Runtime

| Scenario | Duration |
|----------|----------|
| Full pipeline, all bills new | 4-6 hours |
| Incremental (100 new bills) | 30-60 minutes |
| Re-run summaries only (catch-up) | ~15-30 min per 1,000 bills |
| Vectorstore rebuild (all bills) | 15-30 minutes |
| HuggingFace upload | 2-5 minutes |