	# Data Update Pipeline Guide

	## Overview

	The AI Legislation Tracker pipeline pulls AI-related bills for all 50 US states and the US Congress from the LegiScan API, then runs them through several processing stages: text extraction, categorization, and AI-generated summaries, questions, and reports. It then builds searchable vectorstores and can optionally sync the resulting data to HuggingFace for cloud storage.

	---

	## Quick Reference

	From the Admin Panel (recommended):
	1. Open the app -> navigate to Admin -> log in -> "Update Data" tab -> click "Update Data"

	From the command line:
	```bash
	python update_data.py --pull --overwrite-pdf --continue-on-error
	```

	After pipeline completes, sync to cloud:
	```bash
	python huggingface_upload.py
	```

	---

	## Step-by-Step: Running from the Admin Panel

	### Step 1: Start the Streamlit App

	Open a terminal in the project root and run:

	```bash
	streamlit run streamlit_app.py
	```

	The app opens in your browser (typically at `http://localhost:8501`).

	> What you see: The main VAILL AI Governance Bills Tracker dashboard with a map, bill table, filters, and analysis tools.

	### Step 2: Navigate to the Admin Panel

	In the left sidebar, click "Admin" to open the Admin page.

	> What you see: A login form asking for username and password.

	### Step 3: Log In

	Enter your admin credentials (configured in `auth_config.json` or HuggingFace).

	> What you see: After login, the Admin Panel with three tabs: Overview, Update Data, and Manage Users. Your name and username appear in the sidebar with a Logout button.

	### Step 4: Review Current Data (Overview Tab)

	The Overview tab shows:
	- System Status: Connection status for OpenAI API, LegiScan API, and HuggingFace
	- Current Data: Counts for Total Bills, Bills with Details, Summaries, Question Sets, Reports, and Cached Bills
	- Last Pipeline Run: Timestamp of the most recent run
	- Admin Users: Table of all registered admin accounts

	> Check that all three API connections show "Connected" before running the pipeline. If any show "Missing", set the corresponding environment variable in your `.env` file.
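
The same check can be run from a terminal before starting a long run. The following is a minimal sketch, assuming the four variable names listed under "Environment Variables Required"; `missing_env_vars` is a hypothetical helper, not part of the project.

```python
import os

# Variable names taken from the "Environment Variables Required" table.
REQUIRED_VARS = (
    "LEGISCAN_API_KEY",
    "OPENAI_API_KEY",
    "HUGGINGFACE_HUB_TOKEN",
    "HF_REPO_ID",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

This only inspects the process environment, so values kept in `.env` need to be loaded before calling it.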

	### Step 5: Run the Pipeline (Update Data Tab)

	1. Click the "Update Data" tab
	2. Review the Current Data counts (same metrics as Overview)
	3. Optional: Check "Skip uploading to HuggingFace after update" if you only want local updates
	4. Click the blue "Update Data" button

	> What you see: A live status panel showing real-time log output as each script runs. The pipeline runs 10 scripts sequentially (detailed below). Runtime depends on how many new bills need processing: roughly 30-60 minutes for an incremental run of about 100 new bills, and 4-6 hours when every bill is new.

	### Step 6: Monitor Progress

	The status panel streams output from each script. You'll see messages like:
	```
	--- Running data_updating_scripts/get_data.py ---
	Fetching bills for AL (2023-2026)...
	...
	--- Running data_updating_scripts/generate_summaries.py ---
	Processing 1/3487: AL_SJR94
	...
	```

	> Important: You can navigate away from the page — the pipeline runs as a background process. To check on it later, look at the terminal where Streamlit is running, or check the log files in `data_updating_scripts/logs/`.
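
When checking a saved log later, the `--- Running ... ---` markers show how far a run got. The sketch below assumes the marker format matches the sample output above; `scripts_started` is a hypothetical helper, and the log file names in `data_updating_scripts/logs/` are not specified here.

```python
import re

# Matches the marker lines shown in the sample output above.
MARKER = re.compile(r"^--- Running (?P<script>\S+) ---$")


def scripts_started(log_text):
    """Return script paths in the order their start markers appear."""
    started = []
    for line in log_text.splitlines():
        match = MARKER.match(line.strip())
        if match:
            started.append(match.group("script"))
    return started
```

The last entry in the returned list is the script that was running (or had just finished) when the log ends.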

	### Step 7: Review Results

	When the pipeline finishes, the Update Data tab shows:
	- New / Updated Bills: How many bills were fetched from LegiScan
	- Unchanged Bills: Bills that hadn't changed (API calls saved)
	- Steps Completed: How many scripts passed vs failed
	- Data Changes: Before/after comparison of all data file counts

	If HuggingFace upload was enabled, it automatically syncs all JSON files to the cloud.

	### Step 8: View Updated Data

	Hard-refresh the main page (`Cmd+Shift+R` on Mac, `Ctrl+Shift+R` on Windows) or restart Streamlit to clear the cache and see the updated bills.

	---

	## What the Pipeline Does (All 10 Scripts)

	The pipeline orchestrator (`update_data.py`) runs these 10 scripts in order:

	### Script 1: `get_data.py` — Pull Bills from LegiScan

	What it does:
	- Queries the LegiScan API for "artificial intelligence" bills across all 50 states + US Congress
	- Searches years 2023 through the current year
	- For each bill found, fetches full bill details (text, sponsors, status, history)
	- Uses a local cache (`data/bill_cache.json`) to skip bills that haven't changed since last pull
	- Extracts bill text from base64-encoded documents (HTML or PDF)

	Input: LegiScan API
	Output: `data/known_bills.json`, `data/bill_cache.json`
	API used: LegiScan (uses `LEGISCAN_API_KEY`)
	Cost: Free tier allows ~30,000 requests/month
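
The cache check can be pictured as a comparison of change hashes. This is a sketch under stated assumptions: the cache is taken to map a bill ID to the last seen `change_hash` string, which may not match the real layout of `data/bill_cache.json`, and `bills_to_fetch` is a hypothetical name.

```python
def bills_to_fetch(master_list, cache):
    """Return IDs of bills whose change hash is new or differs from the cache.

    master_list: iterable of dicts with "bill_id" and "change_hash" keys
                 (assumed shape, for illustration only).
    cache:       dict mapping str(bill_id) -> last seen change hash.
    """
    stale = []
    for entry in master_list:
        bill_id = str(entry["bill_id"])
        if cache.get(bill_id) != entry["change_hash"]:
            stale.append(bill_id)
    return stale
```

Only the IDs returned here would need a full detail request, which is how unchanged bills save API calls.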

	### Script 2: `fix_pdf_bills.py` — Extract Text from PDF Bills

	What it does:
	- Finds bills where the text is still raw base64-encoded PDF content
	- Decodes the base64, extracts readable text using PyPDF2
	- Also handles HTML-encoded bill text via BeautifulSoup
	- Marks successfully processed bills with `text_fixed: true` to avoid reprocessing
	- Creates a backup before modifying data

	Input: `data/known_bills.json`
	Output: `data/known_bills.json` (updated in place), `data/known_bills_backup.json`
	API used: None (local processing only)
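
The decode step can be sketched with the standard library alone. The example below detects a PDF by its `%PDF` magic bytes and strips tags from HTML with `html.parser`, used here as a stand-in for BeautifulSoup; PDF text extraction itself is left to PyPDF2 and is not shown. `decode_document` and `_TextCollector` are hypothetical names.

```python
import base64
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes; a stdlib stand-in for BeautifulSoup text extraction."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def decode_document(b64_payload):
    """Decode a base64 document and return (kind, content).

    kind is "pdf" (content is raw bytes for a PDF library to parse)
    or "html" (content is the text found between tags, as a string).
    """
    raw = base64.b64decode(b64_payload)
    if raw.lstrip().startswith(b"%PDF"):
        return "pdf", raw
    collector = _TextCollector()
    collector.feed(raw.decode("utf-8", errors="replace"))
    collector.close()  # flush any text still buffered by the parser
    return "html", " ".join(collector.parts)
```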

	### Script 3: `known_bills_status.py` — Clean and Merge Bill Data

	What it does:
	- Merges raw bill data from `known_bills.json` with the existing visualization dataset
	- Maps numeric status codes to human-readable labels (e.g., 1 -> "Introduced", 4 -> "Signed Into Law")
	- Preserves existing IAPP categories and other enrichments from previous runs
	- Removes bills that are no longer in the source data

	Input: `data/known_bills.json`, `data/known_bills_visualize.json` (existing)
	Output: `data/known_bills_fixed.json`, `data/known_bills_visualize.json`
	API used: None (local processing only)
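
A minimal sketch of the merge logic follows, assuming both datasets are dicts keyed by bill ID. Only the two status codes named above are mapped, and the `status_label` field name is an assumption made for illustration.

```python
# Only the two codes named in this guide; the real script maps more.
STATUS_LABELS = {1: "Introduced", 4: "Signed Into Law"}


def merge_bills(raw_bills, existing):
    """Merge raw bills with the previous visualization dataset.

    Both arguments are dicts keyed by bill ID (an assumed layout).
    Bills missing from raw_bills are dropped; iapp_categories from
    the previous run is carried over when present.
    """
    merged = {}
    for bill_id, bill in raw_bills.items():
        record = dict(bill)
        record["status_label"] = STATUS_LABELS.get(bill.get("status"), "Unknown")
        previous = existing.get(bill_id, {})
        if "iapp_categories" in previous:
            record["iapp_categories"] = previous["iapp_categories"]
        merged[bill_id] = record
    return merged
```

Iterating over the raw data rather than the existing dataset is what drops bills that no longer appear in the source.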

	### Script 4: `migrate_iapp_categories.py` — Categorize Bills (IAPP Framework)

	What it does:
	- Analyzes each bill's text using OpenAI to categorize it under the IAPP AI governance framework
	- Assigns bills to 4 main categories: Governance, Transparency, Assurance, Individual Rights
	- Each category has specific subcategories (e.g., "Program and documentation", "Opt out/appeal")
	- Uses a local cache (`data/iapp_categories_cache.json`) to avoid reprocessing unchanged bills
	- Falls back to default categories if the API call fails

	Input: `data/known_bills_fixed.json`, `data/known_bills_visualize.json`
	Output: `data/known_bills_visualize.json` (updated with `iapp_categories` field), `data/iapp_categories_cache.json`
	API used: OpenAI Chat API (`OPENAI_API_KEY`)
	Cost: ~$0.02 per bill

	### Script 5: `mark_no_text_bills.py` — Flag Bills Without Text

	What it does:
	- Scans all bills and marks those with missing or very short text (< 50 characters)
	- Sets `iapp_categories` to `null` for these bills so they're excluded from categorization displays
	- This prevents empty bills from cluttering analysis results

	Input: `data/known_bills_visualize.json`
	Output: `data/known_bills_visualize.json` (updated in place)
	API used: None (local processing only)
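
The flagging rule is simple enough to sketch directly. The example assumes each bill is a dict with an optional `text` key; that field name and the helper `mark_no_text` are illustrative and not taken from the script.

```python
MIN_TEXT_CHARS = 50  # threshold stated above


def mark_no_text(bills):
    """Set iapp_categories to None for bills with missing or very short text.

    bills: list of dicts with an optional "text" key (assumed field name).
    Returns the number of bills flagged. None serializes to JSON null.
    """
    flagged = 0
    for bill in bills:
        text = (bill.get("text") or "").strip()
        if len(text) < MIN_TEXT_CHARS:
            bill["iapp_categories"] = None
            flagged += 1
    return flagged
```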

	### Script 6: `generate_summaries.py` — Generate Bill Summaries

	What it does:
	- For each bill with text, generates a concise AI summary explaining what the bill does
	- Skips bills that already have a valid summary in the cache
	- Saves progress every 10 bills (safe to interrupt and resume)
	- Summaries are stored separately from bill data to avoid reprocessing

	Input: `data/known_bills_visualize.json`, `data/bill_summaries.json` (existing cache)
	Output: `data/bill_summaries.json`
	API used: OpenAI Chat API
	Cost: ~$0.025 per bill
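
The skip-and-checkpoint behavior described above can be sketched as follows. This is an illustration, not the script's code: `summarize` stands in for the OpenAI call, and the cache is assumed to be a flat JSON object mapping bill ID to summary text.

```python
import json
import os


def generate_with_checkpoints(bills, cache_path, summarize, save_every=10):
    """Fill a JSON cache of summaries, saving progress every `save_every` bills.

    bills:     dict mapping bill ID -> bill text (assumed layout).
    summarize: callable(text) -> str; stands in for the OpenAI call.
    Returns the number of summaries generated in this run.
    """
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as fh:
            cache = json.load(fh)

    def save():
        with open(cache_path, "w", encoding="utf-8") as fh:
            json.dump(cache, fh, indent=2)

    generated = 0
    for bill_id, text in bills.items():
        if cache.get(bill_id):  # already summarized in an earlier run
            continue
        cache[bill_id] = summarize(text)
        generated += 1
        if generated % save_every == 0:
            save()
    save()  # flush the final partial batch
    return generated
```

Because completed bills are skipped on the next run, an interruption costs at most the work done since the last save.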

	### Script 7: `generate_suggested_questions.py` — Generate Discussion Questions

	What it does:
	- For each bill, generates 5 specific questions that a user might ask about the bill
	- These questions appear in the UI as clickable suggestions when viewing a bill
	- Falls back to generic questions if generation fails
	- Saves progress every 10 bills

	Input: `data/known_bills_visualize.json`, `data/bill_suggested_questions.json` (existing cache)
	Output: `data/bill_suggested_questions.json`
	API used: OpenAI Chat API
	Cost: ~$0.02 per bill

	### Script 8: `generate_reports.py` — Generate Detailed Reports

	What it does:
	- For each bill, generates a detailed Markdown report covering: title, status, sponsors, goals, key provisions, regulatory approaches, enforcement mechanisms, and notable features
	- Reports are the longest AI-generated content (~1-2 pages per bill)
	- Saves progress every 10 bills (resumable)

	Input: `data/known_bills_visualize.json`, `data/bill_reports.json` (existing cache)
	Output: `data/bill_reports.json`
	API used: OpenAI Chat API (uses `gpt-4o`)
	Cost: ~$0.045 per bill

	### Script 9: `build_bills_vectorstore.py` — Build Searchable Vectorstore

	What it does:
	- Converts all bills into vector embeddings for semantic search
	- Splits long bill text into overlapping chunks (1500 chars, 200 overlap)
	- Stores embeddings in a ChromaDB vectorstore on disk
	- Uses a manifest file to skip bills that haven't changed since last build
	- This vectorstore powers the "Ask a Question" and "Compare Bills" features in the app

	Input: `data/known_bills_visualize.json`
	Output: `data/bills_vectorstore/` (ChromaDB files), `data/bills_vectorstore_manifest.json`
	API used: OpenAI Embeddings API (`text-embedding-3-small`)
	Cost: ~$0.0001 per bill (very cheap)
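
To make the chunking parameters concrete, here is a plain sliding-window sketch using the 1500/200 figures above. The actual script may split on separators such as paragraphs or sentences, which this example does not attempt; `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size=1500, overlap=200):
    """Split text into chunks of up to chunk_size chars sharing `overlap` chars."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Each chunk starts 1300 characters after the previous one, so the final 200 characters of one chunk are repeated at the start of the next.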

	### Script 10: `eu_vectorstore.py` — Build EU AI Act Vectorstore

	What it does:
	- Extracts text from the EU AI Act PDF (`data_updating_scripts/eu-ai-act.pdf`)
	- Splits it into chunks and creates a FAISS vectorstore
	- This vectorstore powers the "Compare with EU AI Act" feature in the app

	Input: `data_updating_scripts/eu-ai-act.pdf`
	Output: `data/eu_ai_act_vectorstore/` (FAISS index + metadata)
	API used: OpenAI Embeddings API
	Cost: ~$0.01 (one-time, small document)
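
Adding up the per-bill estimates for Scripts 4 and 6-9 gives a rough budget for a run. The sketch below only sums the figures quoted in this guide; actual spend depends on bill length, model pricing, and how many bills are already cached.

```python
# Per-bill cost estimates quoted in the script descriptions above (USD).
PER_BILL_COST = {
    "migrate_iapp_categories.py": 0.02,
    "generate_summaries.py": 0.025,
    "generate_suggested_questions.py": 0.02,
    "generate_reports.py": 0.045,
    "build_bills_vectorstore.py": 0.0001,
}
EU_VECTORSTORE_ONE_TIME = 0.01


def estimate_cost(new_bills, include_eu_vectorstore=False):
    """Estimate OpenAI spend in USD for processing `new_bills` uncached bills."""
    total = new_bills * sum(PER_BILL_COST.values())
    if include_eu_vectorstore:
        total += EU_VECTORSTORE_ONE_TIME
    return round(total, 2)
```

With these figures the per-bill total is about $0.11, so 100 new bills come to roughly $11.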

	---

	## Data Flow Diagram

	```
	LegiScan API
	    |
	    v
	[1] get_data.py
	    |
	    v
	data/known_bills.json + data/bill_cache.json
	    |
	    v
	[2] fix_pdf_bills.py (extract text from PDFs)
	    |
	    v
	data/known_bills.json (text extracted)
	    |
	    v
	[3] known_bills_status.py (merge + clean statuses)
	    |
	    v
	data/known_bills_fixed.json + data/known_bills_visualize.json
	    |
	    +---> [4] migrate_iapp_categories.py (IAPP categorization via OpenAI)
	    |         |
	    |         v
	    |     data/known_bills_visualize.json (with iapp_categories)
	    |
	    +---> [5] mark_no_text_bills.py (flag empty bills)
	    |         |
	    |         v
	    |     data/known_bills_visualize.json (final version)
	    |
	    +---> [6] generate_summaries.py ------------> data/bill_summaries.json
	    |
	    +---> [7] generate_suggested_questions.py --> data/bill_suggested_questions.json
	    |
	    +---> [8] generate_reports.py --------------> data/bill_reports.json
	    |
	    +---> [9] build_bills_vectorstore.py -------> data/bills_vectorstore/
	    |
	    +---> [10] eu_vectorstore.py ---------------> data/eu_ai_act_vectorstore/
	    |
	    v
	[Optional] huggingface_upload.py (sync all JSONs to HuggingFace)
	    |
	    v
	HuggingFace Datasets Hub
	    |
	    v
	streamlit_app.py reads data/known_bills_visualize.json
	                     + data/bill_summaries.json
	                     + data/bill_suggested_questions.json
	                     + data/bill_reports.json
	                     + data/bills_vectorstore/
	                     + data/eu_ai_act_vectorstore/
	    |
	    v
	Website displays all bills with interactive features
	```

	---

	## Files Produced by the Pipeline

	| File | Description | Used By |
	|------|-------------|---------|
	| `data/known_bills.json` | Raw bill data from LegiScan | Pipeline scripts |
	| `data/known_bills_backup.json` | Backup before PDF text extraction | Recovery only |
	| `data/known_bills_fixed.json` | Bills with cleaned statuses | Pipeline scripts |
	| `data/known_bills_visualize.json` | Main data file: bills with IAPP categories | Website + all scripts |
	| `data/bill_cache.json` | LegiScan change hashes (skip unchanged bills) | `get_data.py` |
	| `data/iapp_categories_cache.json` | IAPP categorization cache | `migrate_iapp_categories.py` |
	| `data/bill_summaries.json` | AI-generated bill summaries | Website |
	| `data/bill_suggested_questions.json` | AI-generated discussion questions | Website |
	| `data/bill_reports.json` | AI-generated detailed reports | Website |
	| `data/bills_vectorstore/` | ChromaDB vectorstore for bill search | Website (Q&A, Compare) |
	| `data/bills_vectorstore_manifest.json` | Tracks which bills are in vectorstore | `build_bills_vectorstore.py` |
	| `data/eu_ai_act_vectorstore/` | FAISS vectorstore for EU AI Act | Website (EU comparison) |

	---

	## Running from the Command Line

	### Full pipeline (pull new data + process everything):
	```bash
	python update_data.py --pull --overwrite-pdf --continue-on-error
	```

	### Skip LegiScan pull (reprocess existing local data only):
	```bash
	python update_data.py --no-pull --continue-on-error
	```

	### Skip HuggingFace upload:
	```bash
	python update_data.py --pull --overwrite-pdf --continue-on-error --skip-upload
	```

	### Test mode (pull only 3 bills from CA to verify pipeline works):
	```bash
	python update_data.py --test --continue-on-error --skip-upload
	```

	### Run individual scripts:
	```bash
	# Re-run just summaries (e.g., to catch up after a hang)
	python data_updating_scripts/generate_summaries.py

	# Re-run just questions
	python data_updating_scripts/generate_suggested_questions.py

	# Re-run just reports
	python data_updating_scripts/generate_reports.py

	# Rebuild vectorstore
	python data_updating_scripts/build_bills_vectorstore.py

	# Upload all data to HuggingFace
	python huggingface_upload.py
	```
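
The `--continue-on-error` flag used in the commands above suggests that a failing script does not have to stop the run. The sketch below illustrates that behavior and is not the code in `update_data.py`: `run_steps` is a hypothetical helper that runs each command in order and reports which passed and which failed.

```python
import subprocess
import sys


def run_steps(commands, continue_on_error=False):
    """Run each command in order; return (passed, failed) lists of commands.

    commands: list of argv lists, e.g. [sys.executable, "script.py"].
    Without continue_on_error, the first failure stops the run.
    """
    passed, failed = [], []
    for argv in commands:
        result = subprocess.run(argv)
        if result.returncode == 0:
            passed.append(argv)
        else:
            failed.append(argv)
            if not continue_on_error:
                break
    return passed, failed
```

Running later steps after a failure is useful here because the generate scripts cache their results, so a failed step can be re-run on its own afterwards.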

	---

	## Environment Variables Required

	Set these in a `.env` file in the project root:

	| Variable | Required For | Description |
	|----------|--------------|-------------|
	| `LEGISCAN_API_KEY` | Script 1 | LegiScan API key for pulling bill data |
	| `OPENAI_API_KEY` | Scripts 4, 6-10 | OpenAI API key for categorization, summaries, questions, reports, and embeddings |
	| `HUGGINGFACE_HUB_TOKEN` | HF upload | HuggingFace API token |
	| `HF_REPO_ID` | HF upload | HuggingFace dataset repo (e.g., `username/dataset-name`) |

	---

	## Troubleshooting

	### Pipeline hangs on a bill
	All OpenAI API calls have a 120-second timeout. If a call hangs, it will time out, log an error, and skip to the next bill. The skipped bill will be retried on the next pipeline run.

	### Bills show up but summaries/questions/reports are missing
	The generate scripts cache their results. If a script was interrupted mid-run, re-run it individually to catch up:
	```bash
	python data_updating_scripts/generate_summaries.py
	```

	### Main page shows old data after pipeline runs
	Streamlit caches data in memory. Hard-refresh the page (`Cmd+Shift+R` on Mac, `Ctrl+Shift+R` on Windows) or restart Streamlit.

	### HuggingFace upload fails
	Check that `HUGGINGFACE_HUB_TOKEN` and `HF_REPO_ID` are set in `.env`. Test the connection:
	```bash
	python huggingface_upload.py
	```

	### "OPENAI_API_KEY not set" error
	Ensure your `.env` file contains the key and that `python-dotenv` is installed. The pipeline loads `.env` automatically.

	---

	## Estimated Runtime

	| Scenario | Duration |
	|----------|----------|
	| Full pipeline, all bills new | 4-6 hours |
	| Incremental (100 new bills) | 30-60 minutes |
	| Re-run summaries only (catch-up) | ~15-30 min per 1,000 bills |
	| Vectorstore rebuild (all bills) | 15-30 minutes |
	| HuggingFace upload | 2-5 minutes |