# Data Update Pipeline Guide

## Overview

The AI Legislation Tracker pipeline pulls AI-related bills from all 50 US states (+ US Congress) via the LegiScan API, processes them through multiple stages (text extraction, categorization, AI-generated summaries/questions/reports), builds searchable vectorstores, and optionally syncs everything to HuggingFace for cloud storage.

---

## Quick Reference

**From the Admin Panel (recommended):**

1. Open the app -> navigate to Admin -> log in -> "Update Data" tab -> click "Update Data"

**From the command line:**

```bash
python update_data.py --pull --overwrite-pdf --continue-on-error
```

**After pipeline completes, sync to cloud:**

```bash
python huggingface_upload.py
```

---

## Step-by-Step: Running from the Admin Panel

### Step 1: Start the Streamlit App

Open a terminal in the project root and run:

```bash
streamlit run streamlit_app.py
```

The app opens in your browser (typically at `http://localhost:8501`).

> **What you see:** The main VAILL AI Governance Bills Tracker dashboard with a map, bill table, filters, and analysis tools.

### Step 2: Navigate to the Admin Panel

In the left sidebar, click **"Admin"** to open the Admin page.

> **What you see:** A login form asking for username and password.

### Step 3: Log In

Enter your admin credentials (configured in `auth_config.json` or HuggingFace).

> **What you see:** After login, the Admin Panel with three tabs: **Overview**, **Update Data**, and **Manage Users**. Your name and username appear in the sidebar with a Logout button.

### Step 4: Review Current Data (Overview Tab)

The **Overview** tab shows:

- **System Status**: Connection status for OpenAI API, LegiScan API, and HuggingFace
- **Current Data**: Counts for Total Bills, Bills with Details, Summaries, Question Sets, Reports, and Cached Bills
- **Last Pipeline Run**: Timestamp of the most recent run
- **Admin Users**: Table of all registered admin accounts

> **Check that all three API connections show "Connected" before running the pipeline.** If any show "Missing", set the corresponding environment variable in your `.env` file.

### Step 5: Run the Pipeline (Update Data Tab)

1. Click the **"Update Data"** tab
2. Review the **Current Data** counts (same metrics as Overview)
3. **Optional**: Check "Skip uploading to HuggingFace after update" if you only want local updates
4. Click the blue **"Update Data"** button

> **What you see:** A live status panel showing real-time log output as each script runs. The pipeline runs 10 scripts sequentially (detailed below). Expect anywhere from **30-60 minutes** for an incremental update to **4-6 hours** for a full run, depending on how many new bills need processing (see Estimated Runtime).

### Step 6: Monitor Progress

The status panel streams output from each script. You'll see messages like:

```
--- Running data_updating_scripts/get_data.py ---
Fetching bills for AL (2023-2026)...
...
--- Running data_updating_scripts/generate_summaries.py ---
Processing 1/3487: AL_SJR94
...
```

> **Important:** You can navigate away from the page; the pipeline runs as a background process. To check on it later, look at the terminal where Streamlit is running, or check the log files in `data_updating_scripts/logs/`.

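
If you come back to a long run later, a small helper can show the tail of the newest log file instead of opening it by hand. The following is a minimal sketch, not part of the pipeline: the function name `tail_latest_log` is hypothetical, and the `*.log` file pattern is an assumption about how files in `data_updating_scripts/logs/` are named, so adjust it to match what you actually see there.

```python
from collections import deque
from pathlib import Path


def tail_latest_log(log_dir: str, lines: int = 20, pattern: str = "*.log") -> list[str]:
    """Return the last `lines` lines of the most recently modified log file.

    `pattern` is an assumed naming convention; change it if the log
    files in your checkout use a different extension.
    """
    candidates = [p for p in Path(log_dir).glob(pattern) if p.is_file()]
    if not candidates:
        return []  # no logs yet (or the directory does not exist)
    latest = max(candidates, key=lambda p: p.stat().st_mtime)
    with latest.open("r", encoding="utf-8", errors="replace") as handle:
        # deque with maxlen keeps only the final `lines` lines in memory
        return [line.rstrip("\n") for line in deque(handle, maxlen=lines)]


if __name__ == "__main__":
    for line in tail_latest_log("data_updating_scripts/logs"):
        print(line)
```

Run it from the project root while the pipeline is active; an empty result simply means no file matching the pattern was found.
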
### Step 7: Review Results

When the pipeline finishes, the Update Data tab shows:

- **New / Updated Bills**: How many bills were fetched from LegiScan
- **Unchanged Bills**: Bills that hadn't changed (API calls saved)
- **Steps Completed**: How many scripts passed vs failed
- **Data Changes**: Before/after comparison of all data file counts

If HuggingFace upload was enabled, it automatically syncs all JSON files to the cloud.

### Step 8: View Updated Data

Hard-refresh the main page (`Cmd+Shift+R` on Mac, `Ctrl+Shift+R` on Windows) or restart Streamlit to clear the cache and see the updated bills.

---

## What the Pipeline Does (All 10 Scripts)

The pipeline orchestrator (`update_data.py`) runs these 10 scripts in order:

### Script 1: `get_data.py` - Pull Bills from LegiScan

**What it does:**

- Queries the LegiScan API for "artificial intelligence" bills across all 50 states + US Congress
- Searches years 2023 through the current year
- For each bill found, fetches full bill details (text, sponsors, status, history)
- Uses a local cache (`data/bill_cache.json`) to skip bills that haven't changed since last pull
- Extracts bill text from base64-encoded documents (HTML or PDF)

**Input:** LegiScan API

**Output:** `data/known_bills.json`, `data/bill_cache.json`

**API used:** LegiScan (uses `LEGISCAN_API_KEY`)

**Cost:** Free tier allows ~30,000 requests/month

### Script 2: `fix_pdf_bills.py` - Extract Text from PDF Bills

**What it does:**

- Finds bills where the text is still raw base64-encoded PDF content
- Decodes the base64, extracts readable text using PyPDF2
- Also handles HTML-encoded bill text via BeautifulSoup
- Marks successfully processed bills with `text_fixed: true` to avoid reprocessing
- Creates a backup before modifying data

**Input:** `data/known_bills.json`

**Output:** `data/known_bills.json` (updated in place), `data/known_bills_backup.json`

**API used:** None (local processing only)

### Script 3: `known_bills_status.py` - Clean and Merge Bill Data

**What it does:**

- Merges raw bill data from `known_bills.json` with the existing visualization dataset
- Maps numeric status codes to human-readable labels (e.g., 1 -> "Introduced", 4 -> "Signed Into Law")
- Preserves existing IAPP categories and other enrichments from previous runs
- Removes bills that are no longer in the source data

**Input:** `data/known_bills.json`, `data/known_bills_visualize.json` (existing)

**Output:** `data/known_bills_fixed.json`, `data/known_bills_visualize.json`

**API used:** None (local processing only)

### Script 4: `migrate_iapp_categories.py` - Categorize Bills (IAPP Framework)

**What it does:**

- Analyzes each bill's text using OpenAI to categorize it under the IAPP AI governance framework
- Assigns bills to 4 main categories: Governance, Transparency, Assurance, Individual Rights
- Each category has specific subcategories (e.g., "Program and documentation", "Opt out/appeal")
- Uses a local cache (`data/iapp_categories_cache.json`) to avoid reprocessing unchanged bills
- Falls back to default categories if the API call fails

**Input:** `data/known_bills_fixed.json`, `data/known_bills_visualize.json`

**Output:** `data/known_bills_visualize.json` (updated with `iapp_categories` field), `data/iapp_categories_cache.json`

**API used:** OpenAI Chat API (`OPENAI_API_KEY`)

**Cost:** ~$0.02 per bill

### Script 5: `mark_no_text_bills.py` - Flag Bills Without Text

**What it does:**

- Scans all bills and marks those with missing or very short text (< 50 characters)
- Sets `iapp_categories` to `null` for these bills so they're excluded from categorization displays
- This prevents empty bills from cluttering analysis results

**Input:** `data/known_bills_visualize.json`

**Output:** `data/known_bills_visualize.json` (updated in place)

**API used:** None (local processing only)

### Script 6: `generate_summaries.py` - Generate Bill Summaries

**What it does:**

- For each bill with text, generates a concise AI summary explaining what the bill does
- Skips bills that already have a valid summary in the cache
- Saves progress every 10 bills (safe to interrupt and resume)
- Summaries are stored separately from bill data to avoid reprocessing

**Input:** `data/known_bills_visualize.json`, `data/bill_summaries.json` (existing cache)

**Output:** `data/bill_summaries.json`

**API used:** OpenAI Chat API

**Cost:** ~$0.025 per bill

### Script 7: `generate_suggested_questions.py` - Generate Discussion Questions

**What it does:**

- For each bill, generates 5 specific questions that a user might ask about the bill
- These questions appear in the UI as clickable suggestions when viewing a bill
- Falls back to generic questions if generation fails
- Saves progress every 10 bills

**Input:** `data/known_bills_visualize.json`, `data/bill_suggested_questions.json` (existing cache)

**Output:** `data/bill_suggested_questions.json`

**API used:** OpenAI Chat API

**Cost:** ~$0.02 per bill

### Script 8: `generate_reports.py` - Generate Detailed Reports

**What it does:**

- For each bill, generates a detailed Markdown report covering: title, status, sponsors, goals, key provisions, regulatory approaches, enforcement mechanisms, and notable features
- Reports are the longest AI-generated content (~1-2 pages per bill)
- Saves progress every 10 bills (resumable)

**Input:** `data/known_bills_visualize.json`, `data/bill_reports.json` (existing cache)

**Output:** `data/bill_reports.json`

**API used:** OpenAI Chat API (uses `gpt-4o`)

**Cost:** ~$0.045 per bill

### Script 9: `build_bills_vectorstore.py` - Build Searchable Vectorstore

**What it does:**

- Converts all bills into vector embeddings for semantic search
- Splits long bill text into overlapping chunks (1500 chars, 200 overlap)
- Stores embeddings in a ChromaDB vectorstore on disk
- Uses a manifest file to skip bills that haven't changed since last build
- This vectorstore powers the "Ask a Question" and "Compare Bills" features in the app

**Input:** `data/known_bills_visualize.json`

**Output:** `data/bills_vectorstore/` (ChromaDB files), `data/bills_vectorstore_manifest.json`

**API used:** OpenAI Embeddings API (`text-embedding-3-small`)

**Cost:** ~$0.0001 per bill (very cheap)

### Script 10: `eu_vectorstore.py` - Build EU AI Act Vectorstore

**What it does:**

- Extracts text from the EU AI Act PDF (`data_updating_scripts/eu-ai-act.pdf`)
- Splits it into chunks and creates a FAISS vectorstore
- This vectorstore powers the "Compare with EU AI Act" feature in the app

**Input:** `data_updating_scripts/eu-ai-act.pdf`

**Output:** `data/eu_ai_act_vectorstore/` (FAISS index + metadata)

**API used:** OpenAI Embeddings API

**Cost:** ~$0.01 (one-time, small document)

---

## Data Flow Diagram

```
LegiScan API
    |
    v
[1] get_data.py
    |
    v
data/known_bills.json + data/bill_cache.json
    |
    v
[2] fix_pdf_bills.py (extract text from PDFs)
    |
    v
data/known_bills.json (text extracted)
    |
    v
[3] known_bills_status.py (merge + clean statuses)
    |
    v
data/known_bills_fixed.json + data/known_bills_visualize.json
    |
    +---> [4] migrate_iapp_categories.py (IAPP categorization via OpenAI)
    |         |
    |         v
    |     data/known_bills_visualize.json (with iapp_categories)
    |
    +---> [5] mark_no_text_bills.py (flag empty bills)
    |         |
    |         v
    |     data/known_bills_visualize.json (final version)
    |
    +---> [6] generate_summaries.py ---------> data/bill_summaries.json
    |
    +---> [7] generate_suggested_questions.py -> data/bill_suggested_questions.json
    |
    +---> [8] generate_reports.py ------------> data/bill_reports.json
    |
    +---> [9] build_bills_vectorstore.py -----> data/bills_vectorstore/
    |
    +---> [10] eu_vectorstore.py -------------> data/eu_ai_act_vectorstore/
    |
    v
[Optional] huggingface_upload.py (sync all JSONs to HuggingFace)
    |
    v
HuggingFace Datasets Hub
    |
    v
streamlit_app.py reads
    data/known_bills_visualize.json
  + data/bill_summaries.json
  + data/bill_suggested_questions.json
  + data/bill_reports.json
  + data/bills_vectorstore/
  + data/eu_ai_act_vectorstore/
    |
    v
Website displays all bills with interactive features
```

---

## Files Produced by the Pipeline

| File | Description | Used By |
|------|-------------|---------|
| `data/known_bills.json` | Raw bill data from LegiScan | Pipeline scripts |
| `data/known_bills_backup.json` | Backup before PDF text extraction | Recovery only |
| `data/known_bills_fixed.json` | Bills with cleaned statuses | Pipeline scripts |
| `data/known_bills_visualize.json` | **Main data file**: bills with IAPP categories | Website + all scripts |
| `data/bill_cache.json` | LegiScan change hashes (skip unchanged bills) | `get_data.py` |
| `data/iapp_categories_cache.json` | IAPP categorization cache | `migrate_iapp_categories.py` |
| `data/bill_summaries.json` | AI-generated bill summaries | Website |
| `data/bill_suggested_questions.json` | AI-generated discussion questions | Website |
| `data/bill_reports.json` | AI-generated detailed reports | Website |
| `data/bills_vectorstore/` | ChromaDB vectorstore for bill search | Website (Q&A, Compare) |
| `data/bills_vectorstore_manifest.json` | Tracks which bills are in vectorstore | `build_bills_vectorstore.py` |
| `data/eu_ai_act_vectorstore/` | FAISS vectorstore for EU AI Act | Website (EU comparison) |

---

## Running from the Command Line

### Full pipeline (pull new data + process everything):

```bash
python update_data.py --pull --overwrite-pdf --continue-on-error
```

### Skip LegiScan pull (reprocess existing local data only):

```bash
python update_data.py --no-pull --continue-on-error
```

### Skip HuggingFace upload:

```bash
python update_data.py --pull --overwrite-pdf --continue-on-error --skip-upload
```

### Test mode (pull only 3 bills from CA to verify pipeline works):

```bash
python update_data.py --test --continue-on-error --skip-upload
```

### Run individual scripts:

```bash
# Re-run just summaries (e.g., to catch up after a hang)
python data_updating_scripts/generate_summaries.py

# Re-run just questions
python data_updating_scripts/generate_suggested_questions.py

# Re-run just reports
python data_updating_scripts/generate_reports.py

# Rebuild vectorstore
python data_updating_scripts/build_bills_vectorstore.py

# Upload all data to HuggingFace
python huggingface_upload.py
```

---

## Environment Variables Required

Set these in a `.env` file in the project root:

| Variable | Required For | Description |
|----------|-------------|-------------|
| `LEGISCAN_API_KEY` | Script 1 | LegiScan API key for pulling bill data |
| `OPENAI_API_KEY` | Scripts 4, 6-10 | OpenAI API key for categorization, summaries, questions, reports, and embeddings |
| `HUGGINGFACE_HUB_TOKEN` | HF upload | HuggingFace API token |
| `HF_REPO_ID` | HF upload | HuggingFace dataset repo (e.g., `username/dataset-name`) |

---

## Troubleshooting

### Pipeline hangs on a bill

All OpenAI API calls have a 120-second timeout. If a call hangs, it times out, an error is logged, and the script skips to the next bill. The skipped bill is retried on the next pipeline run.

### Bills show up but summaries/questions/reports are missing

The generate scripts cache their results. If a script was interrupted mid-run, re-run it individually to catch up:

```bash
python data_updating_scripts/generate_summaries.py
```

### Main page shows old data after pipeline runs

Streamlit caches data in memory. Hard-refresh the page (`Cmd+Shift+R` on Mac, `Ctrl+Shift+R` on Windows) or restart Streamlit.

### HuggingFace upload fails

Check that `HUGGINGFACE_HUB_TOKEN` and `HF_REPO_ID` are set in `.env`. Test the connection:

```bash
python huggingface_upload.py
```

### "OPENAI_API_KEY not set" error

Ensure your `.env` file contains the key and that `python-dotenv` is installed. The pipeline loads `.env` automatically.

---

## Estimated Runtime

| Scenario | Duration |
|----------|----------|
| Full pipeline, all bills new | 4-6 hours |
| Incremental (100 new bills) | 30-60 minutes |
| Re-run summaries only (catch-up) | ~15-30 min per 1,000 bills |
| Vectorstore rebuild (all bills) | 15-30 minutes |
| HuggingFace upload | 2-5 minutes |
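
Alongside runtime, the approximate per-bill OpenAI costs quoted in the script descriptions (Scripts 4 and 6-9, plus the one-time EU AI Act embedding) can be combined into a rough budget figure. This is a back-of-the-envelope sketch only: the function name `estimate_openai_cost` is hypothetical, the figures are the approximate values from this guide rather than measured billing data, and it assumes every counted bill is new to every cache.

```python
# Approximate per-bill costs in USD, copied from the script descriptions in this guide.
PER_BILL_COST_USD = {
    "migrate_iapp_categories.py": 0.02,
    "generate_summaries.py": 0.025,
    "generate_suggested_questions.py": 0.02,
    "generate_reports.py": 0.045,
    "build_bills_vectorstore.py": 0.0001,
}

# One-time cost of embedding the EU AI Act PDF (Script 10).
EU_VECTORSTORE_ONE_TIME_USD = 0.01


def estimate_openai_cost(new_bills: int, include_eu_vectorstore: bool = False) -> float:
    """Estimate OpenAI spend for `new_bills` bills that miss every cache."""
    if new_bills < 0:
        raise ValueError("new_bills must be non-negative")
    total = new_bills * sum(PER_BILL_COST_USD.values())
    if include_eu_vectorstore:
        total += EU_VECTORSTORE_ONE_TIME_USD
    return round(total, 2)


if __name__ == "__main__":
    for count in (100, 1000):
        print(f"{count} new bills: ~${estimate_openai_cost(count):.2f}")
```

At roughly $0.11 per bill across all five steps, an incremental run of 100 new bills comes to about $11. Bills already present in the caches are skipped by the scripts, so actual spend on repeat runs should be lower.
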