# Data Update Pipeline Guide
## Overview
The AI Legislation Tracker pipeline pulls AI-related bills from all 50 US states and the US Congress via the LegiScan API. It then processes them through several stages (text extraction, categorization, AI-generated summaries, questions, and reports), builds searchable vectorstores, and optionally syncs the results to HuggingFace for cloud storage.
---
## Quick Reference
**From the Admin Panel (recommended):**
1. Open the app -> navigate to Admin -> log in -> "Update Data" tab -> click "Update Data"
**From the command line:**
```bash
python update_data.py --pull --overwrite-pdf --continue-on-error
```
**After pipeline completes, sync to cloud:**
```bash
python huggingface_upload.py
```
---
## Step-by-Step: Running from the Admin Panel
### Step 1: Start the Streamlit App
Open a terminal in the project root and run:
```bash
streamlit run streamlit_app.py
```
The app opens in your browser (typically at `http://localhost:8501`).
> **What you see:** The main VAILL AI Governance Bills Tracker dashboard with a map, bill table, filters, and analysis tools.
### Step 2: Navigate to the Admin Panel
In the left sidebar, click **"Admin"** to open the Admin page.
> **What you see:** A login form asking for username and password.
### Step 3: Log In
Enter your admin credentials (configured in `auth_config.json` or HuggingFace).
> **What you see:** After login, the Admin Panel with three tabs: **Overview**, **Update Data**, and **Manage Users**. Your name and username appear in the sidebar with a Logout button.
### Step 4: Review Current Data (Overview Tab)
The **Overview** tab shows:
- **System Status**: Connection status for OpenAI API, LegiScan API, and HuggingFace
- **Current Data**: Counts for Total Bills, Bills with Details, Summaries, Question Sets, Reports, and Cached Bills
- **Last Pipeline Run**: Timestamp of the most recent run
- **Admin Users**: Table of all registered admin accounts
> **Check that all three API connections show "Connected" before running the pipeline.** If any show "Missing", set the corresponding environment variable in your `.env` file.
### Step 5: Run the Pipeline (Update Data Tab)
1. Click the **"Update Data"** tab
2. Review the **Current Data** counts (same metrics as Overview)
3. **Optional**: Check "Skip uploading to HuggingFace after update" if you only want local updates
4. Click the blue **"Update Data"** button
> **What you see:** A live status panel showing real-time log output as each script runs. The pipeline runs 10 scripts sequentially (detailed below). Runtime depends on how many new bills need processing: roughly **30-60 minutes** for a small incremental update, up to **4-6 hours** when every bill is new (see Estimated Runtime).
### Step 6: Monitor Progress
The status panel streams output from each script. You'll see messages like:
```
--- Running data_updating_scripts/get_data.py ---
Fetching bills for AL (2023-2026)...
...
--- Running data_updating_scripts/generate_summaries.py ---
Processing 1/3487: AL_SJR94
...
```
> **Important:** You can navigate away from the page — the pipeline runs as a background process. To check on it later, look at the terminal where Streamlit is running, or check the log files in `data_updating_scripts/logs/`.
### Step 7: Review Results
When the pipeline finishes, the Update Data tab shows:
- **New / Updated Bills**: How many bills were fetched from LegiScan
- **Unchanged Bills**: Bills that hadn't changed (API calls saved)
- **Steps Completed**: How many scripts passed vs failed
- **Data Changes**: Before/after comparison of all data file counts
If the HuggingFace upload was left enabled, the pipeline then syncs all JSON files to the cloud automatically.
### Step 8: View Updated Data
Hard-refresh the main page (`Cmd+Shift+R` on Mac, `Ctrl+Shift+R` on Windows) or restart Streamlit to clear the cache and see the updated bills.
---
## What the Pipeline Does (All 10 Scripts)
The pipeline orchestrator (`update_data.py`) runs these 10 scripts in order:
### Script 1: `get_data.py` — Pull Bills from LegiScan
**What it does:**
- Queries the LegiScan API for "artificial intelligence" bills across all 50 states + US Congress
- Searches years 2023 through the current year
- For each bill found, fetches full bill details (text, sponsors, status, history)
- Uses a local cache (`data/bill_cache.json`) to skip bills that haven't changed since last pull
- Extracts bill text from base64-encoded documents (HTML or PDF)
**Input:** LegiScan API
**Output:** `data/known_bills.json`, `data/bill_cache.json`
**API used:** LegiScan (uses `LEGISCAN_API_KEY`)
**Cost:** Free tier allows ~30,000 requests/month
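
The cache check described above can be sketched as follows. This is an illustration only, not the code in `get_data.py`: the helper names `load_cache` and `needs_refresh`, the cache layout (bill ID mapped to a hash string), and the sample data are all assumptions. It also assumes each LegiScan search result carries a `change_hash` value, as the cache description in this guide suggests.

```python
import json
from pathlib import Path


def load_cache(path):
    """Return the cached {bill_id: change_hash} mapping, or {} if no cache exists."""
    p = Path(path)
    if not p.exists():
        return {}
    return json.loads(p.read_text(encoding="utf-8"))


def needs_refresh(bill_id, change_hash, cache):
    """True when the bill is new or its hash differs from the cached one."""
    return cache.get(str(bill_id)) != change_hash


# Hypothetical data for illustration only.
cache = {"1001": "abc123", "1002": "def456"}
search_results = [
    {"bill_id": 1001, "change_hash": "abc123"},  # unchanged, so skipped
    {"bill_id": 1002, "change_hash": "zzz999"},  # changed, so refetched
    {"bill_id": 1003, "change_hash": "new000"},  # not cached, so fetched
]
to_fetch = [
    r["bill_id"]
    for r in search_results
    if needs_refresh(r["bill_id"], r["change_hash"], cache)
]
print(to_fetch)
```

Only the bills in `to_fetch` would need a full detail request, which is where the API-call savings come from.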
### Script 2: `fix_pdf_bills.py` — Extract Text from PDF Bills
**What it does:**
- Finds bills where the text is still raw base64-encoded PDF content
- Decodes the base64, extracts readable text using PyPDF2
- Also handles HTML-encoded bill text via BeautifulSoup
- Marks successfully processed bills with `text_fixed: true` to avoid reprocessing
- Creates a backup before modifying data
**Input:** `data/known_bills.json`
**Output:** `data/known_bills.json` (updated in place), `data/known_bills_backup.json`
**API used:** None (local processing only)
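
A rough sketch of the detect-and-decode step is shown below. It is not the script's actual implementation: `looks_like_base64_pdf` and `extract_pdf_text` are hypothetical names, the detection heuristic (checking for the `%PDF` magic bytes) is an assumption, and the extraction function assumes a PyPDF2 release that provides `PdfReader`.

```python
import base64
import binascii
import io


def looks_like_base64_pdf(text):
    """Heuristic: decode the first 8 characters and check for the %PDF magic bytes."""
    head = text.strip()[:8]
    try:
        return base64.b64decode(head, validate=True).startswith(b"%PDF")
    except (binascii.Error, ValueError):
        # Not valid base64 (e.g. ordinary prose), so not an encoded PDF.
        return False


def extract_pdf_text(b64_text):
    """Decode a base64 PDF and join the text of every page (requires PyPDF2)."""
    from PyPDF2 import PdfReader  # imported lazily; only needed for real PDFs

    reader = PdfReader(io.BytesIO(base64.b64decode(b64_text)))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# The sample below is a fake payload, used only to exercise the detection check.
sample = base64.b64encode(b"%PDF-1.4 fake body").decode("ascii")
print(looks_like_base64_pdf(sample))
print(looks_like_base64_pdf("A BILL to amend the code"))
```

In a loop over `known_bills.json`, bills that pass the check would be handed to `extract_pdf_text` and then marked with `text_fixed: true`.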
### Script 3: `known_bills_status.py` — Clean and Merge Bill Data
**What it does:**
- Merges raw bill data from `known_bills.json` with the existing visualization dataset
- Maps numeric status codes to human-readable labels (e.g., 1 -> "Introduced", 4 -> "Signed Into Law")
- Preserves existing IAPP categories and other enrichments from previous runs
- Removes bills that are no longer in the source data
**Input:** `data/known_bills.json`, `data/known_bills_visualize.json` (existing)
**Output:** `data/known_bills_fixed.json`, `data/known_bills_visualize.json`
**API used:** None (local processing only)
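
The merge logic might look roughly like this. It is a sketch under several assumptions: the field names `bill_id`, `status`, and `status_label` are hypothetical, the `"Unknown"` fallback is invented for the example, and only the two status codes named in this guide are mapped (the real script presumably covers more).

```python
# Only the two codes named in this guide; "Unknown" is an assumed fallback.
STATUS_LABELS = {1: "Introduced", 4: "Signed Into Law"}


def status_label(code):
    """Translate a numeric status code into a human-readable label."""
    try:
        return STATUS_LABELS.get(int(code), "Unknown")
    except (TypeError, ValueError):
        return "Unknown"


def merge_bills(raw_bills, existing):
    """Combine fresh bill data with enrichments kept from an earlier run."""
    existing_by_id = {b["bill_id"]: b for b in existing}
    merged = []
    for bill in raw_bills:  # bills absent from raw_bills are dropped
        out = dict(bill)
        out["status_label"] = status_label(bill.get("status"))
        previous = existing_by_id.get(bill["bill_id"], {})
        if "iapp_categories" in previous:
            out["iapp_categories"] = previous["iapp_categories"]
        merged.append(out)
    return merged


# Hypothetical records for illustration only.
raw = [{"bill_id": 1, "status": 1}, {"bill_id": 2, "status": 4}]
old = [
    {"bill_id": 1, "iapp_categories": {"Governance": ["Program and documentation"]}},
    {"bill_id": 99, "iapp_categories": {}},  # no longer in the source data
]
merged = merge_bills(raw, old)
print([(b["bill_id"], b["status_label"]) for b in merged])
```

Iterating over the raw data, rather than the existing dataset, is what removes bills that have disappeared from the source.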
### Script 4: `migrate_iapp_categories.py` — Categorize Bills (IAPP Framework)
**What it does:**
- Analyzes each bill's text using OpenAI to categorize it under the IAPP AI governance framework
- Assigns bills to 4 main categories: Governance, Transparency, Assurance, Individual Rights
- Each category has specific subcategories (e.g., "Program and documentation", "Opt out/appeal")
- Uses a local cache (`data/iapp_categories_cache.json`) to avoid reprocessing unchanged bills
- Falls back to default categories if the API call fails
**Input:** `data/known_bills_fixed.json`, `data/known_bills_visualize.json`
**Output:** `data/known_bills_visualize.json` (updated with `iapp_categories` field), `data/iapp_categories_cache.json`
**API used:** OpenAI Chat API (`OPENAI_API_KEY`)
**Cost:** ~$0.02 per bill
### Script 5: `mark_no_text_bills.py` — Flag Bills Without Text
**What it does:**
- Scans all bills and marks those with missing or very short text (< 50 characters)
- Sets `iapp_categories` to `null` for these bills so they're excluded from categorization displays
- This prevents empty bills from cluttering analysis results
**Input:** `data/known_bills_visualize.json`
**Output:** `data/known_bills_visualize.json` (updated in place)
**API used:** None (local processing only)
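
As a minimal sketch of the flagging rule, assuming the bill text lives under a key named `text` (the key name and the helper `mark_no_text_bills` are assumptions; the 50-character threshold and the `iapp_categories` field come from the description above):

```python
MIN_TEXT_LENGTH = 50  # threshold stated in this guide


def mark_no_text_bills(bills, text_key="text"):
    """Set iapp_categories to None for bills with missing or very short text.

    Returns the number of bills flagged. The list is modified in place.
    """
    flagged = 0
    for bill in bills:
        text = (bill.get(text_key) or "").strip()
        if len(text) < MIN_TEXT_LENGTH:
            bill["iapp_categories"] = None  # written as null in the JSON file
            flagged += 1
    return flagged


# Hypothetical records for illustration only.
bills = [
    {"bill_id": 1, "text": "x" * 400, "iapp_categories": {"Transparency": []}},
    {"bill_id": 2, "text": "", "iapp_categories": {"Governance": []}},
    {"bill_id": 3, "iapp_categories": {"Assurance": []}},  # no text at all
]
print(mark_no_text_bills(bills))
```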
### Script 6: `generate_summaries.py` — Generate Bill Summaries
**What it does:**
- For each bill with text, generates a concise AI summary explaining what the bill does
- Skips bills that already have a valid summary in the cache
- Saves progress every 10 bills (safe to interrupt and resume)
- Summaries are stored separately from bill data to avoid reprocessing
**Input:** `data/known_bills_visualize.json`, `data/bill_summaries.json` (existing cache)
**Output:** `data/bill_summaries.json`
**API used:** OpenAI Chat API
**Cost:** ~$0.025 per bill
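
The skip-if-cached and save-every-10 behaviour can be sketched like this. It is not the script's code: `generate_with_checkpoints` is a hypothetical helper, the `bill_id` key is an assumption, and `summarize` is a caller-supplied function standing in for the OpenAI call so the example runs without an API key.

```python
import json
import tempfile
from pathlib import Path


def generate_with_checkpoints(bills, cache_path, summarize, save_every=10):
    """Fill a JSON cache with one result per bill, saving progress periodically."""
    path = Path(cache_path)
    cache = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    done_since_start = 0
    for bill in bills:
        key = str(bill["bill_id"])
        if cache.get(key):  # already generated on an earlier run
            continue
        try:
            cache[key] = summarize(bill)
        except Exception as exc:  # log and move on; retried on the next run
            print(f"Skipping {key}: {exc}")
            continue
        done_since_start += 1
        if done_since_start % save_every == 0:
            path.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    path.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return cache


# Demo with a fake summarizer and a temporary cache file.
demo_bills = [{"bill_id": i, "text": f"Bill {i}"} for i in range(25)]
calls = []


def fake_summarize(bill):
    calls.append(bill["bill_id"])
    return f"Summary of {bill['text']}"


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "bill_summaries.json"
    first = generate_with_checkpoints(demo_bills, cache_file, fake_summarize)
    second = generate_with_checkpoints(demo_bills, cache_file, fake_summarize)
print(len(first), len(calls))
```

The second call makes no new `summarize` calls because every bill is already in the cache, which is why interrupting and re-running the script is safe.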
### Script 7: `generate_suggested_questions.py` — Generate Discussion Questions
**What it does:**
- For each bill, generates 5 specific questions that a user might ask about the bill
- These questions appear in the UI as clickable suggestions when viewing a bill
- Falls back to generic questions if generation fails
- Saves progress every 10 bills
**Input:** `data/known_bills_visualize.json`, `data/bill_suggested_questions.json` (existing cache)
**Output:** `data/bill_suggested_questions.json`
**API used:** OpenAI Chat API
**Cost:** ~$0.02 per bill
### Script 8: `generate_reports.py` — Generate Detailed Reports
**What it does:**
- For each bill, generates a detailed Markdown report covering: title, status, sponsors, goals, key provisions, regulatory approaches, enforcement mechanisms, and notable features
- Reports are the longest AI-generated content (~1-2 pages per bill)
- Saves progress every 10 bills (resumable)
**Input:** `data/known_bills_visualize.json`, `data/bill_reports.json` (existing cache)
**Output:** `data/bill_reports.json`
**API used:** OpenAI Chat API (uses `gpt-4o`)
**Cost:** ~$0.045 per bill
### Script 9: `build_bills_vectorstore.py` — Build Searchable Vectorstore
**What it does:**
- Converts all bills into vector embeddings for semantic search
- Splits long bill text into overlapping chunks (1500 chars, 200 overlap)
- Stores embeddings in a ChromaDB vectorstore on disk
- Uses a manifest file to skip bills that haven't changed since last build
- This vectorstore powers the "Ask a Question" and "Compare Bills" features in the app
**Input:** `data/known_bills_visualize.json`
**Output:** `data/bills_vectorstore/` (ChromaDB files), `data/bills_vectorstore_manifest.json`
**API used:** OpenAI Embeddings API (`text-embedding-3-small`)
**Cost:** ~$0.0001 per bill (very cheap)
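
To make the chunking parameters concrete, here is a plain-Python approximation of overlapping chunks with a 1500-character window and 200-character overlap. The function `chunk_text` is hypothetical and splits at fixed offsets; the actual script may use a library splitter that prefers paragraph or sentence boundaries.

```python
def chunk_text(text, chunk_size=1500, overlap=200):
    """Split text into fixed-size chunks where each chunk repeats the
    last `overlap` characters of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


# A 4,000-character dummy "bill" made of repeating digits.
sample_text = "".join(str(i % 10) for i in range(4000))
chunks = chunk_text(sample_text)
print([len(c) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which tends to help retrieval quality.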
### Script 10: `eu_vectorstore.py` — Build EU AI Act Vectorstore
**What it does:**
- Extracts text from the EU AI Act PDF (`data_updating_scripts/eu-ai-act.pdf`)
- Splits it into chunks and creates a FAISS vectorstore
- This vectorstore powers the "Compare with EU AI Act" feature in the app
**Input:** `data_updating_scripts/eu-ai-act.pdf`
**Output:** `data/eu_ai_act_vectorstore/` (FAISS index + metadata)
**API used:** OpenAI Embeddings API
**Cost:** ~$0.01 (one-time, small document)
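
Adding up the per-bill estimates listed for Scripts 4 and 6-9 gives a rough budget for a run. The sketch below uses only the approximate figures from this guide; `estimate_cost` is a hypothetical helper, and actual spend depends on bill length, model pricing, and how many bills are already cached.

```python
# Approximate per-bill costs (USD) as listed in this guide.
PER_BILL_COST_USD = {
    "migrate_iapp_categories.py": 0.02,
    "generate_summaries.py": 0.025,
    "generate_suggested_questions.py": 0.02,
    "generate_reports.py": 0.045,
    "build_bills_vectorstore.py": 0.0001,
}
EU_VECTORSTORE_COST_USD = 0.01  # one-time cost for Script 10


def estimate_cost(new_bills, include_eu=False):
    """Estimate OpenAI spend for processing `new_bills` uncached bills."""
    total = new_bills * sum(PER_BILL_COST_USD.values())
    if include_eu:
        total += EU_VECTORSTORE_COST_USD
    return round(total, 2)


print(estimate_cost(100))                    # 11.01
print(estimate_cost(3487, include_eu=True))  # 383.93
```

That works out to about $0.11 per new bill, so an incremental run of 100 bills costs roughly $11, while processing all 3,487 bills from the sample log output from scratch would cost in the region of $384.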
---
## Data Flow Diagram
```
LegiScan API
    |
    v
[1] get_data.py
    |
    v
data/known_bills.json + data/bill_cache.json
    |
    v
[2] fix_pdf_bills.py (extract text from PDFs)
    |
    v
data/known_bills.json (text extracted)
    |
    v
[3] known_bills_status.py (merge + clean statuses)
    |
    v
data/known_bills_fixed.json + data/known_bills_visualize.json
    |
    +---> [4] migrate_iapp_categories.py (IAPP categorization via OpenAI)
    |         |
    |         v
    |     data/known_bills_visualize.json (with iapp_categories)
    |
    +---> [5] mark_no_text_bills.py (flag empty bills)
    |         |
    |         v
    |     data/known_bills_visualize.json (final version)
    |
    +---> [6] generate_summaries.py -----------> data/bill_summaries.json
    |
    +---> [7] generate_suggested_questions.py -> data/bill_suggested_questions.json
    |
    +---> [8] generate_reports.py -------------> data/bill_reports.json
    |
    +---> [9] build_bills_vectorstore.py ------> data/bills_vectorstore/
    |
    +---> [10] eu_vectorstore.py --------------> data/eu_ai_act_vectorstore/
    |
    v
[Optional] huggingface_upload.py (sync all JSONs to HuggingFace)
    |
    v
HuggingFace Datasets Hub
    |
    v
streamlit_app.py reads data/known_bills_visualize.json
                     + data/bill_summaries.json
                     + data/bill_suggested_questions.json
                     + data/bill_reports.json
                     + data/bills_vectorstore/
                     + data/eu_ai_act_vectorstore/
    |
    v
Website displays all bills with interactive features
```
---
## Files Produced by the Pipeline
| File | Description | Used By |
|------|-------------|---------|
| `data/known_bills.json` | Raw bill data from LegiScan | Pipeline scripts |
| `data/known_bills_backup.json` | Backup before PDF text extraction | Recovery only |
| `data/known_bills_fixed.json` | Bills with cleaned statuses | Pipeline scripts |
| `data/known_bills_visualize.json` | **Main data file** — bills with IAPP categories | Website + all scripts |
| `data/bill_cache.json` | LegiScan change hashes (skip unchanged bills) | `get_data.py` |
| `data/iapp_categories_cache.json` | IAPP categorization cache | `migrate_iapp_categories.py` |
| `data/bill_summaries.json` | AI-generated bill summaries | Website |
| `data/bill_suggested_questions.json` | AI-generated discussion questions | Website |
| `data/bill_reports.json` | AI-generated detailed reports | Website |
| `data/bills_vectorstore/` | ChromaDB vectorstore for bill search | Website (Q&A, Compare) |
| `data/bills_vectorstore_manifest.json` | Tracks which bills are in vectorstore | `build_bills_vectorstore.py` |
| `data/eu_ai_act_vectorstore/` | FAISS vectorstore for EU AI Act | Website (EU comparison) |
---
## Running from the Command Line
### Full pipeline (pull new data + process everything):
```bash
python update_data.py --pull --overwrite-pdf --continue-on-error
```
### Skip LegiScan pull (reprocess existing local data only):
```bash
python update_data.py --no-pull --continue-on-error
```
### Skip HuggingFace upload:
```bash
python update_data.py --pull --overwrite-pdf --continue-on-error --skip-upload
```
### Test mode (pull only 3 bills from CA to verify pipeline works):
```bash
python update_data.py --test --continue-on-error --skip-upload
```
### Run individual scripts:
```bash
# Re-run just summaries (e.g., to catch up after a hang)
python data_updating_scripts/generate_summaries.py
# Re-run just questions
python data_updating_scripts/generate_suggested_questions.py
# Re-run just reports
python data_updating_scripts/generate_reports.py
# Rebuild vectorstore
python data_updating_scripts/build_bills_vectorstore.py
# Upload all data to HuggingFace
python huggingface_upload.py
```
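
For readers curious what running scripts in order with `--continue-on-error` implies, the behaviour might be sketched as below. This is not the contents of `update_data.py`: `run_steps` is a hypothetical helper, and the demo uses throwaway `python -c` commands in place of the real pipeline scripts so it can run anywhere.

```python
import subprocess
import sys


def run_steps(commands, continue_on_error=False):
    """Run each command in order; return (passed, failed) lists of commands."""
    passed, failed = [], []
    for cmd in commands:
        print(f"--- Running {' '.join(cmd)} ---")
        result = subprocess.run(cmd)
        if result.returncode == 0:
            passed.append(cmd)
        else:
            failed.append(cmd)
            if not continue_on_error:
                break  # stop at the first failing step
    return passed, failed


# Stand-in steps: the second one simulates a failing script.
demo = [
    [sys.executable, "-c", "print('step 1 ok')"],
    [sys.executable, "-c", "import sys; sys.exit(1)"],
    [sys.executable, "-c", "print('step 3 ok')"],
]
passed, failed = run_steps(demo, continue_on_error=True)
print(f"{len(passed)} passed, {len(failed)} failed")
```

With `continue_on_error=False`, the same demo would stop after the second step and the third would never run.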
---
## Environment Variables Required
Set these in a `.env` file in the project root:
| Variable | Required For | Description |
|----------|-------------|-------------|
| `LEGISCAN_API_KEY` | Script 1 | LegiScan API key for pulling bill data |
| `OPENAI_API_KEY` | Scripts 4, 6-10 | OpenAI API key for categorization, summaries, questions, reports, and embeddings |
| `HUGGINGFACE_HUB_TOKEN` | HF upload | HuggingFace API token |
| `HF_REPO_ID` | HF upload | HuggingFace dataset repo (e.g., `username/dataset-name`) |
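
A quick pre-flight check for these variables could look like the sketch below. The helper `missing_env_vars` is hypothetical and is not part of the project. The optional `load_dotenv()` call comes from `python-dotenv` and is skipped if that package is not installed.

```python
import os

try:
    from dotenv import load_dotenv  # python-dotenv; optional for this check

    load_dotenv()  # reads a .env file if one is found
except ImportError:
    pass

# Variable names and their uses, as listed in the table above.
REQUIRED = {
    "LEGISCAN_API_KEY": "Script 1",
    "OPENAI_API_KEY": "Scripts 4, 6-10",
    "HUGGINGFACE_HUB_TOKEN": "HF upload",
    "HF_REPO_ID": "HF upload",
}


def missing_env_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


for name in missing_env_vars():
    print(f"Missing {name} (needed for: {REQUIRED[name]})")
```

Running this from the project root before a long pipeline run gives the same information as the System Status panel on the Overview tab.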
---
## Troubleshooting
### Pipeline hangs on a bill
All OpenAI API calls have a 120-second timeout. If a call hangs, it times out, logs an error, and the script moves on to the next bill. The skipped bill is retried on the next pipeline run.
### Bills show up but summaries/questions/reports are missing
The generate scripts cache their results. If a script was interrupted mid-run, re-run it individually to catch up:
```bash
python data_updating_scripts/generate_summaries.py
```
### Main page shows old data after pipeline runs
Streamlit caches data in memory. Hard-refresh the page (`Cmd+Shift+R` on Mac, `Ctrl+Shift+R` on Windows) or restart Streamlit.
### HuggingFace upload fails
Check that `HUGGINGFACE_HUB_TOKEN` and `HF_REPO_ID` are set in `.env`. Test the connection:
```bash
python huggingface_upload.py
```
### "OPENAI_API_KEY not set" error
Ensure your `.env` file contains the key and that `python-dotenv` is installed. The pipeline loads `.env` automatically.
---
## Estimated Runtime
| Scenario | Duration |
|----------|----------|
| Full pipeline, all bills new | 4-6 hours |
| Incremental (100 new bills) | 30-60 minutes |
| Re-run summaries only (catch-up) | ~15-30 min per 1,000 bills |
| Vectorstore rebuild (all bills) | 15-30 minutes |
| HuggingFace upload | 2-5 minutes |