| # Data Update Pipeline Guide | |
| ## Overview | |
| The AI Legislation Tracker pipeline pulls AI-related bills from all 50 US states (+ US Congress) via the LegiScan API, processes them through multiple stages (text extraction, categorization, AI-generated summaries/questions/reports), builds searchable vectorstores, and optionally syncs everything to HuggingFace for cloud storage. | |
| --- | |
| ## Quick Reference | |
| **From the Admin Panel (recommended):** | |
| 1. Open the app -> navigate to Admin -> log in -> "Update Data" tab -> click "Update Data" | |
| **From the command line:** | |
| ```bash | |
| python update_data.py --pull --overwrite-pdf --continue-on-error | |
| ``` | |
| **To sync to the cloud manually (for example, after a run with `--skip-upload`):** | |
| ```bash | |
| python huggingface_upload.py | |
| ``` | |
| --- | |
| ## Step-by-Step: Running from the Admin Panel | |
| ### Step 1: Start the Streamlit App | |
| Open a terminal in the project root and run: | |
| ```bash | |
| streamlit run streamlit_app.py | |
| ``` | |
| The app opens in your browser (typically at `http://localhost:8501`). | |
| > **What you see:** The main VAILL AI Governance Bills Tracker dashboard with a map, bill table, filters, and analysis tools. | |
| ### Step 2: Navigate to the Admin Panel | |
| In the left sidebar, click **"Admin"** to open the Admin page. | |
| > **What you see:** A login form asking for username and password. | |
| ### Step 3: Log In | |
| Enter your admin credentials (configured in `auth_config.json` or HuggingFace). | |
| > **What you see:** After login, the Admin Panel with three tabs: **Overview**, **Update Data**, and **Manage Users**. Your name and username appear in the sidebar with a Logout button. | |
| ### Step 4: Review Current Data (Overview Tab) | |
| The **Overview** tab shows: | |
| - **System Status**: Connection status for OpenAI API, LegiScan API, and HuggingFace | |
| - **Current Data**: Counts for Total Bills, Bills with Details, Summaries, Question Sets, Reports, and Cached Bills | |
| - **Last Pipeline Run**: Timestamp of the most recent run | |
| - **Admin Users**: Table of all registered admin accounts | |
| > **Check that all three API connections show "Connected" before running the pipeline.** If any show "Missing", set the corresponding environment variable in your `.env` file. | |
| ### Step 5: Run the Pipeline (Update Data Tab) | |
| 1. Click the **"Update Data"** tab | |
| 2. Review the **Current Data** counts (same metrics as Overview) | |
| 3. **Optional**: Check "Skip uploading to HuggingFace after update" if you only want local updates | |
| 4. Click the blue **"Update Data"** button | |
| > **What you see:** A live status panel showing real-time log output as each script runs. The pipeline runs 10 scripts sequentially (detailed below). Runtime depends on how many new bills need processing: roughly 30-60 minutes for a small incremental update and **4-6 hours** when every bill is new (see Estimated Runtime). | |
| ### Step 6: Monitor Progress | |
| The status panel streams output from each script. You'll see messages like: | |
| ``` | |
| --- Running data_updating_scripts/get_data.py --- | |
| Fetching bills for AL (2023-2026)... | |
| ... | |
| --- Running data_updating_scripts/generate_summaries.py --- | |
| Processing 1/3487: AL_SJR94 | |
| ... | |
| ``` | |
| > **Important:** You can navigate away from the page — the pipeline runs as a background process. To check on it later, look at the terminal where Streamlit is running, or check the log files in `data_updating_scripts/logs/`. | |
| ### Step 7: Review Results | |
| When the pipeline finishes, the Update Data tab shows: | |
| - **New / Updated Bills**: How many bills were fetched from LegiScan | |
| - **Unchanged Bills**: Bills that hadn't changed (API calls saved) | |
| - **Steps Completed**: How many scripts passed vs failed | |
| - **Data Changes**: Before/after comparison of all data file counts | |
| If HuggingFace upload was enabled, it automatically syncs all JSON files to the cloud. | |
| ### Step 8: View Updated Data | |
| Hard-refresh the main page (`Cmd+Shift+R` on Mac, `Ctrl+Shift+R` on Windows) or restart Streamlit to clear the cache and see the updated bills. | |
| --- | |
| ## What the Pipeline Does (All 10 Scripts) | |
| The pipeline orchestrator (`update_data.py`) runs these 10 scripts in order: | |
| ### Script 1: `get_data.py` — Pull Bills from LegiScan | |
| **What it does:** | |
| - Queries the LegiScan API for "artificial intelligence" bills across all 50 states + US Congress | |
| - Searches years 2023 through the current year | |
| - For each bill found, fetches full bill details (text, sponsors, status, history) | |
| - Uses a local cache (`data/bill_cache.json`) to skip bills that haven't changed since last pull | |
| - Extracts bill text from base64-encoded documents (HTML or PDF) | |
| **Input:** LegiScan API | |
| **Output:** `data/known_bills.json`, `data/bill_cache.json` | |
| **API used:** LegiScan (uses `LEGISCAN_API_KEY`) | |
| **Cost:** Free tier allows ~30,000 requests/month | |
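The change-detection step above can be sketched as follows. This is a minimal illustration, not the script's actual code: it assumes the cache maps each bill ID to the last `change_hash` seen from LegiScan, and the function name `bills_to_fetch` and the cache layout are hypothetical.

```python
def bills_to_fetch(master_list, cache):
    """Return IDs of bills whose change_hash differs from the cached value.

    master_list: iterable of dicts with "bill_id" and "change_hash" keys
    cache: dict mapping str(bill_id) -> last seen change_hash (assumed layout)
    """
    stale = []
    for entry in master_list:
        bill_id = str(entry["bill_id"])
        # A missing cache entry or a different hash both mean "fetch again".
        if cache.get(bill_id) != entry["change_hash"]:
            stale.append(bill_id)
    return stale


cache = {"101": "abc", "102": "def"}
master_list = [
    {"bill_id": 101, "change_hash": "abc"},  # unchanged, skipped
    {"bill_id": 102, "change_hash": "zzz"},  # changed, fetched again
    {"bill_id": 103, "change_hash": "new"},  # never seen, fetched
]
print(bills_to_fetch(master_list, cache))
```

Only the bills returned here would need a full detail request, which is how the cache keeps the run within the monthly request allowance.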
| ### Script 2: `fix_pdf_bills.py` — Extract Text from PDF Bills | |
| **What it does:** | |
| - Finds bills where the text is still raw base64-encoded PDF content | |
| - Decodes the base64, extracts readable text using PyPDF2 | |
| - Also handles HTML-encoded bill text via BeautifulSoup | |
| - Marks successfully processed bills with `text_fixed: true` to avoid reprocessing | |
| - Creates a backup before modifying data | |
| **Input:** `data/known_bills.json` | |
| **Output:** `data/known_bills.json` (updated in place), `data/known_bills_backup.json` | |
| **API used:** None (local processing only) | |
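As a rough sketch of the decoding step, the snippet below decodes a base64 document and decides whether it is a PDF or HTML. It uses only the standard library, so `html.parser` stands in for BeautifulSoup, and the PDF bytes are returned as-is for a PDF library such as PyPDF2 to process. The function name `decode_bill_document` is hypothetical.

```python
import base64
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def decode_bill_document(encoded):
    """Decode a base64 document; return ("pdf", bytes) or ("html", text)."""
    raw = base64.b64decode(encoded)
    # PDF files begin with the "%PDF" magic bytes.
    if raw.startswith(b"%PDF"):
        return "pdf", raw
    parser = _TextCollector()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    # Collapse runs of whitespace left behind by the markup.
    text = " ".join(" ".join(parser.parts).split())
    return "html", text
```

Checking the magic bytes before parsing avoids feeding binary PDF data to an HTML parser, which would otherwise produce the unreadable text this script exists to fix.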
| ### Script 3: `known_bills_status.py` — Clean and Merge Bill Data | |
| **What it does:** | |
| - Merges raw bill data from `known_bills.json` with the existing visualization dataset | |
| - Maps numeric status codes to human-readable labels (e.g., 1 -> "Introduced", 4 -> "Signed Into Law") | |
| - Preserves existing IAPP categories and other enrichments from previous runs | |
| - Removes bills that are no longer in the source data | |
| **Input:** `data/known_bills.json`, `data/known_bills_visualize.json` (existing) | |
| **Output:** `data/known_bills_fixed.json`, `data/known_bills_visualize.json` | |
| **API used:** None (local processing only) | |
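The merge logic described above might look roughly like this. Only the two status labels mentioned in this guide are included; the remaining codes, the field names (`status`, `status_label`), and the function names are assumptions made for illustration.

```python
# Only the two mappings named in this guide; the real table has more entries.
STATUS_LABELS = {
    1: "Introduced",
    4: "Signed Into Law",
}


def status_label(code):
    """Map a numeric status code to a label, with a visible fallback."""
    return STATUS_LABELS.get(code, f"Unknown ({code})")


def merge_bills(raw_bills, existing):
    """Merge fresh bills with the previous dataset.

    raw_bills / existing: dicts keyed by bill ID (assumed layout).
    Bills absent from raw_bills are dropped; earlier enrichments are kept.
    """
    merged = {}
    for bill_id, bill in raw_bills.items():
        record = dict(bill)
        record["status_label"] = status_label(bill.get("status"))
        previous = existing.get(bill_id, {})
        if "iapp_categories" in previous:
            record["iapp_categories"] = previous["iapp_categories"]
        merged[bill_id] = record
    return merged
```

Because the result is built only from `raw_bills`, a bill that has disappeared from the source data is removed without any explicit delete step.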
| ### Script 4: `migrate_iapp_categories.py` — Categorize Bills (IAPP Framework) | |
| **What it does:** | |
| - Analyzes each bill's text using OpenAI to categorize it under the IAPP AI governance framework | |
| - Assigns bills to 4 main categories: Governance, Transparency, Assurance, Individual Rights | |
| - Each category has specific subcategories (e.g., "Program and documentation", "Opt out/appeal") | |
| - Uses a local cache (`data/iapp_categories_cache.json`) to avoid reprocessing unchanged bills | |
| - Falls back to default categories if the API call fails | |
| **Input:** `data/known_bills_fixed.json`, `data/known_bills_visualize.json` | |
| **Output:** `data/known_bills_visualize.json` (updated with `iapp_categories` field), `data/iapp_categories_cache.json` | |
| **API used:** OpenAI Chat API (`OPENAI_API_KEY`) | |
| **Cost:** ~$0.02 per bill | |
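A minimal sketch of the cache-then-fallback behavior, assuming the OpenAI call is wrapped in a `classify` callable that raises on failure. The shape of the default categories (an empty list per main category) and the choice not to cache a fallback result are assumptions, not confirmed details of the script.

```python
IAPP_CATEGORIES = ("Governance", "Transparency", "Assurance", "Individual Rights")


def default_categories():
    """Fallback used when the API call fails (assumed shape)."""
    return {name: [] for name in IAPP_CATEGORIES}


def categorize_bill(bill_id, text, cache, classify):
    """Return cached categories, or classify the bill and cache the result."""
    if bill_id in cache:
        return cache[bill_id]
    try:
        result = classify(text)
    except Exception:
        # Not cached, so the bill is retried on the next run (assumption).
        return default_categories()
    cache[bill_id] = result
    return result
```

Leaving failures out of the cache means a temporary API outage does not permanently pin a bill to the default categories.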
| ### Script 5: `mark_no_text_bills.py` — Flag Bills Without Text | |
| **What it does:** | |
| - Scans all bills and marks those with missing or very short text (< 50 characters) | |
| - Sets `iapp_categories` to `null` for these bills so they're excluded from categorization displays | |
| - This prevents empty bills from cluttering analysis results | |
| **Input:** `data/known_bills_visualize.json` | |
| **Output:** `data/known_bills_visualize.json` (updated in place) | |
| **API used:** None (local processing only) | |
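The flagging rule can be illustrated in a few lines. The `text` field name and the function name are assumptions; the 50-character threshold and the `null` value (Python `None`, written as `null` in JSON) come from the description above.

```python
MIN_TEXT_LENGTH = 50  # bills with fewer characters count as having no text


def mark_no_text_bills(bills):
    """Set iapp_categories to None for bills with missing or very short text.

    bills: list of bill dicts, modified in place. Returns the number flagged.
    """
    flagged = 0
    for bill in bills:
        text = (bill.get("text") or "").strip()
        if len(text) < MIN_TEXT_LENGTH:
            bill["iapp_categories"] = None
            flagged += 1
    return flagged
```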
| ### Script 6: `generate_summaries.py` — Generate Bill Summaries | |
| **What it does:** | |
| - For each bill with text, generates a concise AI summary explaining what the bill does | |
| - Skips bills that already have a valid summary in the cache | |
| - Saves progress every 10 bills (safe to interrupt and resume) | |
| - Summaries are stored separately from bill data to avoid reprocessing | |
| **Input:** `data/known_bills_visualize.json`, `data/bill_summaries.json` (existing cache) | |
| **Output:** `data/bill_summaries.json` | |
| **API used:** OpenAI Chat API | |
| **Cost:** ~$0.025 per bill | |
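The skip-and-checkpoint pattern shared by the generate scripts could be sketched like this. The `summarize` callable stands in for the OpenAI request, and the function names and the cache layout (bill ID mapped to summary text) are hypothetical.

```python
import json
import os

SAVE_EVERY = 10  # write the cache to disk after this many new results


def _save(cache, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cache, f, indent=2)


def generate_with_checkpoints(bills, cache_path, summarize):
    """Generate results for uncached bills, saving progress periodically.

    bills: dict mapping bill ID -> bill text (assumed layout)
    """
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            cache = json.load(f)
    generated = 0
    for bill_id, text in bills.items():
        if cache.get(bill_id):
            continue  # already has a non-empty result from an earlier run
        cache[bill_id] = summarize(text)
        generated += 1
        if generated % SAVE_EVERY == 0:
            _save(cache, cache_path)
    _save(cache, cache_path)  # final save covers the last partial batch
    return cache
```

If the process is interrupted, at most the last nine results are lost, and the next run resumes from the saved cache instead of starting over.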
| ### Script 7: `generate_suggested_questions.py` — Generate Discussion Questions | |
| **What it does:** | |
| - For each bill, generates 5 specific questions that a user might ask about the bill | |
| - These questions appear in the UI as clickable suggestions when viewing a bill | |
| - Falls back to generic questions if generation fails | |
| - Saves progress every 10 bills | |
| **Input:** `data/known_bills_visualize.json`, `data/bill_suggested_questions.json` (existing cache) | |
| **Output:** `data/bill_suggested_questions.json` | |
| **API used:** OpenAI Chat API | |
| **Cost:** ~$0.02 per bill | |
| ### Script 8: `generate_reports.py` — Generate Detailed Reports | |
| **What it does:** | |
| - For each bill, generates a detailed Markdown report covering: title, status, sponsors, goals, key provisions, regulatory approaches, enforcement mechanisms, and notable features | |
| - Reports are the longest AI-generated content (~1-2 pages per bill) | |
| - Saves progress every 10 bills (resumable) | |
| **Input:** `data/known_bills_visualize.json`, `data/bill_reports.json` (existing cache) | |
| **Output:** `data/bill_reports.json` | |
| **API used:** OpenAI Chat API (uses `gpt-4o`) | |
| **Cost:** ~$0.045 per bill | |
| ### Script 9: `build_bills_vectorstore.py` — Build Searchable Vectorstore | |
| **What it does:** | |
| - Converts all bills into vector embeddings for semantic search | |
| - Splits long bill text into overlapping chunks (1500 chars, 200 overlap) | |
| - Stores embeddings in a ChromaDB vectorstore on disk | |
| - Uses a manifest file to skip bills that haven't changed since last build | |
| - This vectorstore powers the "Ask a Question" and "Compare Bills" features in the app | |
| **Input:** `data/known_bills_visualize.json` | |
| **Output:** `data/bills_vectorstore/` (ChromaDB files), `data/bills_vectorstore_manifest.json` | |
| **API used:** OpenAI Embeddings API (`text-embedding-3-small`) | |
| **Cost:** ~$0.0001 per bill (very cheap) | |
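To make the chunking parameters concrete, here is a plain fixed-window splitter using the same sizes (1500 characters, 200 overlap). The actual script may rely on a library splitter that prefers paragraph or sentence boundaries, so treat this as an illustration of the overlap arithmetic only; `split_into_chunks` is a hypothetical name.

```python
def split_into_chunks(text, chunk_size=1500, overlap=200):
    """Split text into windows of chunk_size that overlap by `overlap` chars."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each window starts 1300 chars after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With these settings a 3,000-character bill yields three chunks, starting at offsets 0, 1300, and 2600. The overlap keeps a sentence that straddles a boundary fully visible in at least one chunk.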
| ### Script 10: `eu_vectorstore.py` — Build EU AI Act Vectorstore | |
| **What it does:** | |
| - Extracts text from the EU AI Act PDF (`data_updating_scripts/eu-ai-act.pdf`) | |
| - Splits it into chunks and creates a FAISS vectorstore | |
| - This vectorstore powers the "Compare with EU AI Act" feature in the app | |
| **Input:** `data_updating_scripts/eu-ai-act.pdf` | |
| **Output:** `data/eu_ai_act_vectorstore/` (FAISS index + metadata) | |
| **API used:** OpenAI Embeddings API | |
| **Cost:** ~$0.01 (one-time, small document) | |
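The per-bill cost figures quoted for scripts 4 and 6-9 can be combined into a rough budget estimate. The sketch below assumes every new bill passes through every OpenAI step and misses every cache, which overstates the cost for bills without text; the function name is hypothetical.

```python
# Approximate per-bill costs in USD, taken from the script descriptions above.
PER_BILL_COST_USD = {
    "migrate_iapp_categories": 0.02,
    "generate_summaries": 0.025,
    "generate_suggested_questions": 0.02,
    "generate_reports": 0.045,
    "build_bills_vectorstore": 0.0001,
}
EU_VECTORSTORE_COST_USD = 0.01  # one-time cost for the EU AI Act document


def estimate_openai_cost(new_bills, include_eu_vectorstore=False):
    """Estimate OpenAI spend in USD for processing `new_bills` uncached bills."""
    total = new_bills * sum(PER_BILL_COST_USD.values())
    if include_eu_vectorstore:
        total += EU_VECTORSTORE_COST_USD
    return round(total, 2)


print(estimate_openai_cost(100))
```

The per-bill costs sum to about $0.11, so an incremental run with 100 new bills comes to roughly $11.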
| --- | |
| ## Data Flow Diagram | |
| ``` | |
| LegiScan API | |
| | | |
| v | |
| [1] get_data.py | |
| | | |
| v | |
| data/known_bills.json + data/bill_cache.json | |
| | | |
| v | |
| [2] fix_pdf_bills.py (extract text from PDFs) | |
| | | |
| v | |
| data/known_bills.json (text extracted) | |
| | | |
| v | |
| [3] known_bills_status.py (merge + clean statuses) | |
| | | |
| v | |
| data/known_bills_fixed.json + data/known_bills_visualize.json | |
| | | |
| +---> [4] migrate_iapp_categories.py (IAPP categorization via OpenAI) | |
| | | | |
| | v | |
| | data/known_bills_visualize.json (with iapp_categories) | |
| | | |
| +---> [5] mark_no_text_bills.py (flag empty bills) | |
| | | | |
| | v | |
| | data/known_bills_visualize.json (final version) | |
| | | |
| +---> [6] generate_summaries.py ---------> data/bill_summaries.json | |
| | | |
| +---> [7] generate_suggested_questions.py -> data/bill_suggested_questions.json | |
| | | |
| +---> [8] generate_reports.py ------------> data/bill_reports.json | |
| | | |
| +---> [9] build_bills_vectorstore.py -----> data/bills_vectorstore/ | |
| | | |
| +---> [10] eu_vectorstore.py -------------> data/eu_ai_act_vectorstore/ | |
| | | |
| v | |
| [Optional] huggingface_upload.py (sync all JSONs to HuggingFace) | |
| | | |
| v | |
| HuggingFace Datasets Hub | |
| | | |
| v | |
| streamlit_app.py reads data/known_bills_visualize.json | |
| + data/bill_summaries.json | |
| + data/bill_suggested_questions.json | |
| + data/bill_reports.json | |
| + data/bills_vectorstore/ | |
| + data/eu_ai_act_vectorstore/ | |
| | | |
| v | |
| Website displays all bills with interactive features | |
| ``` | |
| --- | |
| ## Files Produced by the Pipeline | |
| | File | Description | Used By | | |
| |------|-------------|---------| | |
| | `data/known_bills.json` | Raw bill data from LegiScan | Pipeline scripts | | |
| | `data/known_bills_backup.json` | Backup before PDF text extraction | Recovery only | | |
| | `data/known_bills_fixed.json` | Bills with cleaned statuses | Pipeline scripts | | |
| | `data/known_bills_visualize.json` | **Main data file** — bills with IAPP categories | Website + all scripts | | |
| | `data/bill_cache.json` | LegiScan change hashes (skip unchanged bills) | `get_data.py` | | |
| | `data/iapp_categories_cache.json` | IAPP categorization cache | `migrate_iapp_categories.py` | | |
| | `data/bill_summaries.json` | AI-generated bill summaries | Website | | |
| | `data/bill_suggested_questions.json` | AI-generated discussion questions | Website | | |
| | `data/bill_reports.json` | AI-generated detailed reports | Website | | |
| | `data/bills_vectorstore/` | ChromaDB vectorstore for bill search | Website (Q&A, Compare) | | |
| | `data/bills_vectorstore_manifest.json` | Tracks which bills are in vectorstore | `build_bills_vectorstore.py` | | |
| | `data/eu_ai_act_vectorstore/` | FAISS vectorstore for EU AI Act | Website (EU comparison) | | |
| --- | |
| ## Running from the Command Line | |
| ### Full pipeline (pull new data + process everything): | |
| ```bash | |
| python update_data.py --pull --overwrite-pdf --continue-on-error | |
| ``` | |
| ### Skip LegiScan pull (reprocess existing local data only): | |
| ```bash | |
| python update_data.py --no-pull --continue-on-error | |
| ``` | |
| ### Skip HuggingFace upload: | |
| ```bash | |
| python update_data.py --pull --overwrite-pdf --continue-on-error --skip-upload | |
| ``` | |
| ### Test mode (pull only 3 bills from CA to verify the pipeline works): | |
| ```bash | |
| python update_data.py --test --continue-on-error --skip-upload | |
| ``` | |
| ### Run individual scripts: | |
| ```bash | |
| # Re-run just summaries (e.g., to catch up after a hang) | |
| python data_updating_scripts/generate_summaries.py | |
| # Re-run just questions | |
| python data_updating_scripts/generate_suggested_questions.py | |
| # Re-run just reports | |
| python data_updating_scripts/generate_reports.py | |
| # Rebuild vectorstore | |
| python data_updating_scripts/build_bills_vectorstore.py | |
| # Upload all data to HuggingFace | |
| python huggingface_upload.py | |
| ``` | |
| --- | |
| ## Environment Variables Required | |
| Set these in a `.env` file in the project root: | |
| | Variable | Required For | Description | | |
| |----------|-------------|-------------| | |
| | `LEGISCAN_API_KEY` | Script 1 | LegiScan API key for pulling bill data | |
| | `OPENAI_API_KEY` | Scripts 4, 6-10 | OpenAI API key for categorization, summaries, questions, reports, and embeddings | |
| | `HUGGINGFACE_HUB_TOKEN` | HF upload | HuggingFace API token | | |
| | `HF_REPO_ID` | HF upload | HuggingFace dataset repo (e.g., `username/dataset-name`) | | |
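A quick pre-flight check for the variables in this table might look like the following. It reads the process environment, so it assumes `.env` has already been loaded (for example with `python-dotenv`); the function name and the `REQUIRED_VARS` mapping are hypothetical.

```python
import os

# Variable name -> what needs it, mirroring the table above.
REQUIRED_VARS = {
    "LEGISCAN_API_KEY": "script 1",
    "OPENAI_API_KEY": "scripts 4, 6-10",
    "HUGGINGFACE_HUB_TOKEN": "HuggingFace upload",
    "HF_REPO_ID": "HuggingFace upload",
}


def missing_env_vars(environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


for name in missing_env_vars():
    print(f"Missing {name} (needed for {REQUIRED_VARS[name]})")
```

Running a check like this before a multi-hour pipeline run catches a missing key at the start rather than partway through.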
| --- | |
| ## Troubleshooting | |
| ### Pipeline hangs on a bill | |
| All OpenAI API calls have a 120-second timeout. If a call hangs, it will time out, log an error, and skip to the next bill. The skipped bill will be retried on the next pipeline run. | |
| ### Bills show up but summaries/questions/reports are missing | |
| The generate scripts cache their results. If a script was interrupted mid-run, re-run it individually to catch up: | |
| ```bash | |
| python data_updating_scripts/generate_summaries.py | |
| ``` | |
| ### Main page shows old data after pipeline runs | |
| Streamlit caches data in memory. Hard-refresh the page (`Cmd+Shift+R` on Mac, `Ctrl+Shift+R` on Windows) or restart Streamlit. | |
| ### HuggingFace upload fails | |
| Check that `HUGGINGFACE_HUB_TOKEN` and `HF_REPO_ID` are set in `.env`, then re-run the upload on its own to confirm the connection works: | |
| ```bash | |
| python huggingface_upload.py | |
| ``` | |
| ### "OPENAI_API_KEY not set" error | |
| Ensure your `.env` file contains the key and that `python-dotenv` is installed. The pipeline loads `.env` automatically. | |
| --- | |
| ## Estimated Runtime | |
| | Scenario | Duration | | |
| |----------|----------| | |
| | Full pipeline, all bills new | 4-6 hours | | |
| | Incremental (100 new bills) | 30-60 minutes | | |
| | Re-run summaries only (catch-up) | ~15-30 min per 1,000 bills | | |
| | Vectorstore rebuild (all bills) | 15-30 minutes | | |
| | HuggingFace upload | 2-5 minutes | | |