# Data Update Pipeline Guide

## Overview

The AI Legislation Tracker pipeline pulls AI-related bills from all 50 US states and the US Congress via the LegiScan API. It processes them in stages (text extraction, categorization, and AI-generated summaries, questions, and reports), builds searchable vectorstores, and optionally syncs the results to HuggingFace for cloud storage.

---

## Quick Reference

**From the Admin Panel (recommended):**
1. Open the app -> navigate to Admin -> log in -> "Update Data" tab -> click "Update Data"

**From the command line:**
```bash
python update_data.py --pull --overwrite-pdf --continue-on-error
```

**To sync to the cloud manually (for example, after a run with `--skip-upload`):**
```bash
python huggingface_upload.py
```

---

## Step-by-Step: Running from the Admin Panel

### Step 1: Start the Streamlit App

Open a terminal in the project root and run:

```bash
streamlit run streamlit_app.py
```

The app opens in your browser (typically at `http://localhost:8501`).

> **What you see:** The main VAILL AI Governance Bills Tracker dashboard with a map, bill table, filters, and analysis tools.

### Step 2: Navigate to the Admin Panel

In the left sidebar, click **"Admin"** to open the Admin page.

> **What you see:** A login form asking for username and password.

### Step 3: Log In

Enter your admin credentials (configured in `auth_config.json` or HuggingFace).

> **What you see:** After login, the Admin Panel with three tabs: **Overview**, **Update Data**, and **Manage Users**. Your name and username appear in the sidebar with a Logout button.

### Step 4: Review Current Data (Overview Tab)

The **Overview** tab shows:
- **System Status**: Connection status for OpenAI API, LegiScan API, and HuggingFace
- **Current Data**: Counts for Total Bills, Bills with Details, Summaries, Question Sets, Reports, and Cached Bills
- **Last Pipeline Run**: Timestamp of the most recent run
- **Admin Users**: Table of all registered admin accounts

> **Check that all three API connections show "Connected" before running the pipeline.** If any show "Missing", set the corresponding environment variable in your `.env` file.

### Step 5: Run the Pipeline (Update Data Tab)

1. Click the **"Update Data"** tab
2. Review the **Current Data** counts (same metrics as Overview)
3. **Optional**: Check "Skip uploading to HuggingFace after update" if you only want local updates
4. Click the blue **"Update Data"** button

> **What you see:** A live status panel showing real-time log output as each script runs. The pipeline runs 10 scripts sequentially (detailed below). A run takes anywhere from **30 minutes to 6 hours**, depending on how many new bills need processing (see the Estimated Runtime section).

### Step 6: Monitor Progress

The status panel streams output from each script. You'll see messages like:
```
--- Running data_updating_scripts/get_data.py ---
Fetching bills for AL (2023-2026)...
...
--- Running data_updating_scripts/generate_summaries.py ---
Processing 1/3487: AL_SJR94
...
```

> **Important:** You can navigate away from the page — the pipeline runs as a background process. To check on it later, look at the terminal where Streamlit is running, or check the log files in `data_updating_scripts/logs/`.
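
To check on a background run without opening the app, a small helper could print the tail of the newest log file. This is a minimal sketch: the helper name `tail_latest_log` is hypothetical, and it assumes only that `data_updating_scripts/logs/` contains plain-text log files, since the naming scheme is not documented here.

```python
from pathlib import Path


def tail_latest_log(log_dir="data_updating_scripts/logs", lines=20):
    """Return the last `lines` lines of the most recently modified log file."""
    directory = Path(log_dir)
    if not directory.is_dir():
        return []  # the directory does not exist yet
    log_files = [p for p in directory.iterdir() if p.is_file()]
    if not log_files:
        return []  # no logs written yet
    # Pick the file with the newest modification time.
    latest = max(log_files, key=lambda p: p.stat().st_mtime)
    text = latest.read_text(encoding="utf-8", errors="replace")
    return text.splitlines()[-lines:]


if __name__ == "__main__":
    for line in tail_latest_log():
        print(line)
```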

### Step 7: Review Results

When the pipeline finishes, the Update Data tab shows:
- **New / Updated Bills**: How many bills were fetched from LegiScan
- **Unchanged Bills**: Bills that had not changed since the last pull, so no detail requests were spent on them
- **Steps Completed**: How many scripts passed vs failed
- **Data Changes**: Before/after comparison of all data file counts

If HuggingFace upload was enabled, the pipeline then syncs all JSON files to the cloud automatically.

### Step 8: View Updated Data

Hard-refresh the main page (`Cmd+Shift+R` on Mac, `Ctrl+Shift+R` on Windows) or restart Streamlit to clear the cache and see the updated bills.

---

## What the Pipeline Does (All 10 Scripts)

The pipeline orchestrator (`update_data.py`) runs these 10 scripts in order:

### Script 1: `get_data.py` — Pull Bills from LegiScan

**What it does:**
- Queries the LegiScan API for "artificial intelligence" bills across all 50 states + US Congress
- Searches years 2023 through the current year
- For each bill found, fetches full bill details (text, sponsors, status, history)
- Uses a local cache (`data/bill_cache.json`) to skip bills that haven't changed since last pull
- Extracts bill text from base64-encoded documents (HTML or PDF)

**Input:** LegiScan API
**Output:** `data/known_bills.json`, `data/bill_cache.json`
**API used:** LegiScan (uses `LEGISCAN_API_KEY`)
**Cost:** Free tier allows ~30,000 requests/month
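
The cache check can be sketched as a comparison of change hashes. The layout of `bill_cache.json` is not shown in this guide, so the sketch assumes a flat mapping of bill ID to hash, and it assumes each search result carries `bill_id` and `change_hash` fields. Both function names are hypothetical.

```python
def needs_refresh(cache, bill_id, change_hash):
    """True when the bill is new or its change hash differs from the cached one."""
    # Assumed cache layout: {"<bill_id>": "<change_hash>"}
    return cache.get(str(bill_id)) != change_hash


def select_bills_to_fetch(search_results, cache):
    """Keep only the bills whose full details must be fetched again."""
    to_fetch = []
    for result in search_results:
        if needs_refresh(cache, result["bill_id"], result["change_hash"]):
            to_fetch.append(result["bill_id"])
    return to_fetch
```

Only the IDs returned by `select_bills_to_fetch` would need a detail request, which is how the cache keeps the run within the monthly request allowance.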

### Script 2: `fix_pdf_bills.py` — Extract Text from PDF Bills

**What it does:**
- Finds bills where the text is still raw base64-encoded PDF content
- Decodes the base64, extracts readable text using PyPDF2
- Also handles HTML-encoded bill text via BeautifulSoup
- Marks successfully processed bills with `text_fixed: true` to avoid reprocessing
- Creates a backup before modifying data

**Input:** `data/known_bills.json`
**Output:** `data/known_bills.json` (updated in place), `data/known_bills_backup.json`
**API used:** None (local processing only)
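
The decoding step can be illustrated with the standard library alone. This sketch decodes the base64 payload and uses the `%PDF` file signature to decide which extractor the content should go to. The function name `decode_bill_document` is hypothetical, and the PyPDF2 and BeautifulSoup calls themselves are left out.

```python
import base64


def decode_bill_document(encoded):
    """Decode a base64 document and report whether it is a PDF or HTML/text."""
    raw = base64.b64decode(encoded)
    if raw.startswith(b"%PDF"):
        # PDF files begin with the "%PDF" signature; pass these bytes
        # to a PDF text extractor such as PyPDF2.
        return "pdf", raw
    # Anything else is treated as HTML/text for an HTML parser to clean up.
    return "html", raw.decode("utf-8", errors="replace")
```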

### Script 3: `known_bills_status.py` — Clean and Merge Bill Data

**What it does:**
- Merges raw bill data from `known_bills.json` with the existing visualization dataset
- Maps numeric status codes to human-readable labels (e.g., 1 -> "Introduced", 4 -> "Signed Into Law")
- Preserves existing IAPP categories and other enrichments from previous runs
- Removes bills that are no longer in the source data

**Input:** `data/known_bills.json`, `data/known_bills_visualize.json` (existing)
**Output:** `data/known_bills_fixed.json`, `data/known_bills_visualize.json`
**API used:** None (local processing only)
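
The merge behaviour described above might look roughly like the following. Only the two status codes quoted in this guide are included. The `status` and `status_label` field names and the dict-keyed-by-bill-ID layout are assumptions made for illustration, not the script's actual schema.

```python
STATUS_LABELS = {
    1: "Introduced",
    4: "Signed Into Law",
    # Other codes are omitted: only these two are documented in this guide.
}


def status_label(code):
    """Map a numeric status code to a label, with a visible fallback."""
    return STATUS_LABELS.get(code, f"Unknown ({code})")


def merge_bills(raw_bills, existing_bills):
    """Rebuild the visualization dataset, keeping enrichments from earlier runs."""
    merged = {}
    for bill_id, raw in raw_bills.items():
        bill = dict(raw)  # copy so the raw data is not mutated
        bill["status_label"] = status_label(raw.get("status"))
        previous = existing_bills.get(bill_id, {})
        if "iapp_categories" in previous:
            bill["iapp_categories"] = previous["iapp_categories"]
        merged[bill_id] = bill
    # Bills absent from raw_bills are never copied, so they drop out.
    return merged
```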

### Script 4: `migrate_iapp_categories.py` — Categorize Bills (IAPP Framework)

**What it does:**
- Analyzes each bill's text using OpenAI to categorize it under the IAPP AI governance framework
- Assigns bills to 4 main categories: Governance, Transparency, Assurance, Individual Rights
- Each category has specific subcategories (e.g., "Program and documentation", "Opt out/appeal")
- Uses a local cache (`data/iapp_categories_cache.json`) to avoid reprocessing unchanged bills
- Falls back to default categories if the API call fails

**Input:** `data/known_bills_fixed.json`, `data/known_bills_visualize.json`
**Output:** `data/known_bills_visualize.json` (updated with `iapp_categories` field), `data/iapp_categories_cache.json`
**API used:** OpenAI Chat API (`OPENAI_API_KEY`)
**Cost:** ~$0.02 per bill

### Script 5: `mark_no_text_bills.py` — Flag Bills Without Text

**What it does:**
- Scans all bills and marks those with missing or very short text (< 50 characters)
- Sets `iapp_categories` to `null` for these bills so they're excluded from categorization displays
- This prevents empty bills from cluttering analysis results

**Input:** `data/known_bills_visualize.json`
**Output:** `data/known_bills_visualize.json` (updated in place)
**API used:** None (local processing only)
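
The flagging rule is simple enough to show in a few lines. This sketch assumes the bills are a list of dicts with a `text` field; the `iapp_categories` field and the 50-character threshold come from the description above. Python's `None` is written out as `null` when the data is saved as JSON.

```python
MIN_TEXT_LENGTH = 50  # threshold from the description above


def mark_no_text_bills(bills):
    """Set iapp_categories to None for bills with missing or very short text."""
    flagged = 0
    for bill in bills:
        text = (bill.get("text") or "").strip()
        if len(text) < MIN_TEXT_LENGTH:
            bill["iapp_categories"] = None
            flagged += 1
    return flagged
```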

### Script 6: `generate_summaries.py` — Generate Bill Summaries

**What it does:**
- For each bill with text, generates a concise AI summary explaining what the bill does
- Skips bills that already have a valid summary in the cache
- Saves progress every 10 bills (safe to interrupt and resume)
- Summaries are stored separately from bill data to avoid reprocessing

**Input:** `data/known_bills_visualize.json`, `data/bill_summaries.json` (existing cache)
**Output:** `data/bill_summaries.json`
**API used:** OpenAI Chat API
**Cost:** ~$0.025 per bill
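
The skip-and-checkpoint pattern shared by the generate scripts can be sketched as below. The function name and the `summarize` callable are hypothetical stand-ins for the real OpenAI call, and the cache is assumed to be a flat mapping of bill ID to summary text. A bill whose call fails stays out of the cache, so the next run picks it up again.

```python
import json
from pathlib import Path

SAVE_EVERY = 10  # checkpoint interval from the description above


def generate_with_checkpoints(bills, cache_path, summarize):
    """Summarize uncached bills, writing the cache to disk every SAVE_EVERY bills."""
    cache_file = Path(cache_path)
    cache = {}
    if cache_file.exists():
        cache = json.loads(cache_file.read_text(encoding="utf-8"))
    processed = 0
    for bill_id, text in bills.items():
        if cache.get(bill_id):
            continue  # already has a summary
        try:
            cache[bill_id] = summarize(text)
        except Exception as exc:  # e.g. a timeout from the API client
            print(f"Skipping {bill_id}: {exc}")
            continue
        processed += 1
        if processed % SAVE_EVERY == 0:
            cache_file.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    # Final write covers the bills processed since the last checkpoint.
    cache_file.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return cache
```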

### Script 7: `generate_suggested_questions.py` — Generate Discussion Questions

**What it does:**
- For each bill, generates 5 specific questions that a user might ask about the bill
- These questions appear in the UI as clickable suggestions when viewing a bill
- Falls back to generic questions if generation fails
- Saves progress every 10 bills

**Input:** `data/known_bills_visualize.json`, `data/bill_suggested_questions.json` (existing cache)
**Output:** `data/bill_suggested_questions.json`
**API used:** OpenAI Chat API
**Cost:** ~$0.02 per bill

### Script 8: `generate_reports.py` — Generate Detailed Reports

**What it does:**
- For each bill, generates a detailed Markdown report covering: title, status, sponsors, goals, key provisions, regulatory approaches, enforcement mechanisms, and notable features
- Reports are the longest AI-generated content (~1-2 pages per bill)
- Saves progress every 10 bills (resumable)

**Input:** `data/known_bills_visualize.json`, `data/bill_reports.json` (existing cache)
**Output:** `data/bill_reports.json`
**API used:** OpenAI Chat API (uses `gpt-4o`)
**Cost:** ~$0.045 per bill

### Script 9: `build_bills_vectorstore.py` — Build Searchable Vectorstore

**What it does:**
- Converts all bills into vector embeddings for semantic search
- Splits long bill text into overlapping chunks (1500 chars, 200 overlap)
- Stores embeddings in a ChromaDB vectorstore on disk
- Uses a manifest file to skip bills that haven't changed since last build
- This vectorstore powers the "Ask a Question" and "Compare Bills" features in the app

**Input:** `data/known_bills_visualize.json`
**Output:** `data/bills_vectorstore/` (ChromaDB files), `data/bills_vectorstore_manifest.json`
**API used:** OpenAI Embeddings API (`text-embedding-3-small`)
**Cost:** ~$0.0001 per bill (very cheap)
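
To make the chunking parameters concrete, here is a plain fixed-width splitter using the 1500-character size and 200-character overlap quoted above. It is an illustration only: the function name is hypothetical, and the actual script may use a library splitter that prefers paragraph or sentence boundaries.

```python
def split_into_chunks(text, chunk_size=1500, overlap=200):
    """Split text into fixed-width chunks where neighbours share `overlap` chars."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each chunk starts 1300 chars after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

With these defaults, a 3,000-character bill yields three chunks starting at offsets 0, 1300, and 2600.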

### Script 10: `eu_vectorstore.py` — Build EU AI Act Vectorstore

**What it does:**
- Extracts text from the EU AI Act PDF (`data_updating_scripts/eu-ai-act.pdf`)
- Splits it into chunks and creates a FAISS vectorstore
- This vectorstore powers the "Compare with EU AI Act" feature in the app

**Input:** `data_updating_scripts/eu-ai-act.pdf`
**Output:** `data/eu_ai_act_vectorstore/` (FAISS index + metadata)
**API used:** OpenAI Embeddings API
**Cost:** ~$0.01 (one-time, small document)
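
Adding up the per-bill figures quoted for Scripts 4 and 6 through 9 gives a rough OpenAI budget for a run: about $0.11 per new bill, or roughly $11 per 100 new bills. The sketch below does that arithmetic. It uses only the approximate costs listed in this guide, leaves out the one-time EU AI Act embedding cost, and assumes every new bill passes through all five steps.

```python
# Approximate per-bill costs in USD, as quoted in this guide.
PER_BILL_COST_USD = {
    "migrate_iapp_categories.py": 0.02,
    "generate_summaries.py": 0.025,
    "generate_suggested_questions.py": 0.02,
    "generate_reports.py": 0.045,
    "build_bills_vectorstore.py": 0.0001,
}


def estimate_openai_cost(new_bills):
    """Rough OpenAI spend in USD for processing `new_bills` uncached bills."""
    return round(new_bills * sum(PER_BILL_COST_USD.values()), 2)
```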

---

## Data Flow Diagram

```
LegiScan API
    |
    v
[1] get_data.py
    |
    v
data/known_bills.json  +  data/bill_cache.json
    |
    v
[2] fix_pdf_bills.py  (extract text from PDFs)
    |
    v
data/known_bills.json  (text extracted)
    |
    v
[3] known_bills_status.py  (merge + clean statuses)
    |
    v
data/known_bills_fixed.json  +  data/known_bills_visualize.json
    |
    +---> [4] migrate_iapp_categories.py  (IAPP categorization via OpenAI)
    |         |
    |         v
    |     data/known_bills_visualize.json  (with iapp_categories)
    |
    +---> [5] mark_no_text_bills.py  (flag empty bills)
    |         |
    |         v
    |     data/known_bills_visualize.json  (final version)
    |
    +---> [6] generate_summaries.py ---------> data/bill_summaries.json
    |
    +---> [7] generate_suggested_questions.py -> data/bill_suggested_questions.json
    |
    +---> [8] generate_reports.py ------------> data/bill_reports.json
    |
    +---> [9] build_bills_vectorstore.py -----> data/bills_vectorstore/
    |
    +---> [10] eu_vectorstore.py -------------> data/eu_ai_act_vectorstore/
    |
    v
[Optional] huggingface_upload.py  (sync all JSONs to HuggingFace)
    |
    v
HuggingFace Datasets Hub
    |
    v
streamlit_app.py reads data/known_bills_visualize.json
    + data/bill_summaries.json
    + data/bill_suggested_questions.json
    + data/bill_reports.json
    + data/bills_vectorstore/
    + data/eu_ai_act_vectorstore/
    |
    v
Website displays all bills with interactive features
```

---

## Files Produced by the Pipeline

| File | Description | Used By |
|------|-------------|---------|
| `data/known_bills.json` | Raw bill data from LegiScan | Pipeline scripts |
| `data/known_bills_backup.json` | Backup before PDF text extraction | Recovery only |
| `data/known_bills_fixed.json` | Bills with cleaned statuses | Pipeline scripts |
| `data/known_bills_visualize.json` | **Main data file** — bills with IAPP categories | Website + all scripts |
| `data/bill_cache.json` | LegiScan change hashes (skip unchanged bills) | `get_data.py` |
| `data/iapp_categories_cache.json` | IAPP categorization cache | `migrate_iapp_categories.py` |
| `data/bill_summaries.json` | AI-generated bill summaries | Website |
| `data/bill_suggested_questions.json` | AI-generated discussion questions | Website |
| `data/bill_reports.json` | AI-generated detailed reports | Website |
| `data/bills_vectorstore/` | ChromaDB vectorstore for bill search | Website (Q&A, Compare) |
| `data/bills_vectorstore_manifest.json` | Tracks which bills are in vectorstore | `build_bills_vectorstore.py` |
| `data/eu_ai_act_vectorstore/` | FAISS vectorstore for EU AI Act | Website (EU comparison) |

---

## Running from the Command Line

### Full pipeline (pull new data + process everything):
```bash
python update_data.py --pull --overwrite-pdf --continue-on-error
```

### Skip LegiScan pull (reprocess existing local data only):
```bash
python update_data.py --no-pull --continue-on-error
```

### Skip HuggingFace upload:
```bash
python update_data.py --pull --overwrite-pdf --continue-on-error --skip-upload
```

### Test mode (pull only 3 bills from CA to verify the pipeline works):
```bash
python update_data.py --test --continue-on-error --skip-upload
```

### Run individual scripts:
```bash
# Re-run just summaries (e.g., to catch up after a hang)
python data_updating_scripts/generate_summaries.py

# Re-run just questions
python data_updating_scripts/generate_suggested_questions.py

# Re-run just reports
python data_updating_scripts/generate_reports.py

# Rebuild vectorstore
python data_updating_scripts/build_bills_vectorstore.py

# Upload all data to HuggingFace
python huggingface_upload.py
```
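
To run several of these scripts back to back, a small wrapper can call them in sequence and either stop at the first failure or carry on. This is a sketch of the general pattern, not the implementation of `update_data.py`. The function name `run_scripts` and its `continue_on_error` parameter are hypothetical; the parameter only mirrors the idea behind the `--continue-on-error` flag.

```python
import subprocess
import sys


def run_scripts(scripts, continue_on_error=False):
    """Run each script with the current interpreter; return {script: passed}."""
    results = {}
    for script in scripts:
        print(f"--- Running {script} ---")
        completed = subprocess.run([sys.executable, script])
        results[script] = completed.returncode == 0
        if not results[script] and not continue_on_error:
            break  # stop at the first failing script
    return results
```

For example, `run_scripts(["data_updating_scripts/generate_summaries.py", "data_updating_scripts/generate_reports.py"], continue_on_error=True)` would attempt both scripts and report which ones exited cleanly.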

---

## Environment Variables Required

Set these in a `.env` file in the project root:

| Variable | Required For | Description |
|----------|-------------|-------------|
| `LEGISCAN_API_KEY` | Script 1 | LegiScan API key for pulling bill data |
| `OPENAI_API_KEY` | Scripts 4, 6-10 | OpenAI API key for categorization, summaries, questions, reports, and embeddings |
| `HUGGINGFACE_HUB_TOKEN` | HF upload | HuggingFace API token |
| `HF_REPO_ID` | HF upload | HuggingFace dataset repo (e.g., `username/dataset-name`) |
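
Before a long run, it can help to confirm that all four variables are visible to Python. The sketch below only inspects the environment it is given and does not read `.env` itself; in a real script, `load_dotenv()` from `python-dotenv` would be called first. The helper name `missing_env_vars` is hypothetical.

```python
import os

# Variable names and what needs them, taken from the table above.
REQUIRED_VARS = {
    "LEGISCAN_API_KEY": "Script 1",
    "OPENAI_API_KEY": "Scripts 4, 6-10",
    "HUGGINGFACE_HUB_TOKEN": "HF upload",
    "HF_REPO_ID": "HF upload",
}


def missing_env_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    for name in missing_env_vars():
        print(f"Missing {name} (needed for {REQUIRED_VARS[name]})")
```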

---

## Troubleshooting

### Pipeline hangs on a bill
All OpenAI API calls have a 120-second timeout. If a call hangs, it will time out, log an error, and skip to the next bill. The skipped bill will be retried on the next pipeline run.

### Bills show up but summaries/questions/reports are missing
The generate scripts cache their results. If a script was interrupted mid-run, re-run it individually to catch up:
```bash
python data_updating_scripts/generate_summaries.py
```

### Main page shows old data after pipeline runs
Streamlit caches data in memory. Hard-refresh the page (`Cmd+Shift+R` on Mac, `Ctrl+Shift+R` on Windows) or restart Streamlit.

### HuggingFace upload fails
Check that `HUGGINGFACE_HUB_TOKEN` and `HF_REPO_ID` are set in `.env`, then re-run the upload to confirm the connection works:
```bash
python huggingface_upload.py
```

### "OPENAI_API_KEY not set" error
Ensure your `.env` file contains the key and that `python-dotenv` is installed. The pipeline loads `.env` automatically.

---

## Estimated Runtime

| Scenario | Duration |
|----------|----------|
| Full pipeline, all bills new | 4-6 hours |
| Incremental (100 new bills) | 30-60 minutes |
| Re-run summaries only (catch-up) | ~15-30 minutes per 1,000 bills |
| Vectorstore rebuild (all bills) | 15-30 minutes |
| HuggingFace upload | 2-5 minutes |