Spaces:
Running on CPU Upgrade
Running on CPU Upgrade
Upload 30 files
Browse files- .gitattributes +1 -0
- data/data_updating_scripts/PROMPTS/__pycache__/bill_summary_prompt.cpython-313.pyc +0 -0
- data/data_updating_scripts/PROMPTS/bill_summary_prompt.py +29 -0
- data/data_updating_scripts/PROMPTS/suggested_questions_prompt.md +25 -0
- data/data_updating_scripts/__pycache__/config.cpython-313.pyc +0 -0
- data/data_updating_scripts/build_bills_vectorstore.py +46 -0
- data/data_updating_scripts/build_bills_vectorstore_pinecone_delta.py +43 -0
- data/data_updating_scripts/config.py +43 -0
- data/data_updating_scripts/eu-ai-act.pdf +3 -0
- data/data_updating_scripts/eu_vectorstore.py +269 -0
- data/data_updating_scripts/fix_pdf_bills.py +282 -0
- data/data_updating_scripts/generate_reports.py +274 -0
- data/data_updating_scripts/generate_suggested_questions.py +269 -0
- data/data_updating_scripts/generate_summaries.py +204 -0
- data/data_updating_scripts/get_data.py +251 -0
- data/data_updating_scripts/get_data_ORIGINAL.py +251 -0
- data/data_updating_scripts/known_bills_status.py +199 -0
- data/data_updating_scripts/logs/eu_vectorstore.log +128 -0
- data/data_updating_scripts/logs/fetch_ai_bills.log +0 -0
- data/data_updating_scripts/logs/fix_pdf_bills.log +0 -0
- data/data_updating_scripts/logs/generate_reports.log +0 -0
- data/data_updating_scripts/logs/generate_suggested_questions.log +0 -0
- data/data_updating_scripts/logs/generate_summaries.log +0 -0
- data/data_updating_scripts/logs/mark_no_text_bills.log +293 -0
- data/data_updating_scripts/logs/migrate_iapp_categories.log +0 -0
- data/data_updating_scripts/mark_no_text_bills.py +120 -0
- data/data_updating_scripts/migrate_iapp_categories.py +358 -0
- data/generate_password_hash.py +135 -0
- data/huggingface_upload.py +251 -0
- data/pages/Admin.py +459 -0
- data/update_data.py +64 -0
.gitattributes
CHANGED
|
@@ -35,3 +35,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
data/eu_ai_act_vectorstore/index.faiss filter=lfs diff=lfs merge=lfs -text
|
| 37 |
data/known_bills_visualize.json filter=lfs diff=lfs merge=lfs -text
|
|
|
|
|
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
data/eu_ai_act_vectorstore/index.faiss filter=lfs diff=lfs merge=lfs -text
|
| 37 |
data/known_bills_visualize.json filter=lfs diff=lfs merge=lfs -text
|
| 38 |
+
data/data_updating_scripts/eu-ai-act.pdf filter=lfs diff=lfs merge=lfs -text
|
data/data_updating_scripts/PROMPTS/__pycache__/bill_summary_prompt.cpython-313.pyc
ADDED
|
Binary file (1.37 kB). View file
|
|
|
data/data_updating_scripts/PROMPTS/bill_summary_prompt.py
ADDED
|
@@ -0,0 +1,29 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# PROMPTS/bill_summary_prompt.py
#
# Prompt template used to ask an LLM for a structured bill summary.
# The placeholders ({bill_number}, {bill_title}, {state}, {bill_text})
# are filled in via str.format() by the summary-generation script.
# NOTE: this is a runtime string consumed by the model — edit wording
# with care, as changes directly alter generated summaries.
BILL_SUMMARY_PROMPT = """
You are an expert legislative analyst specializing in AI governance and technology policy. Your task is to provide a clear, concise summary of the given bill text.

Please analyze the bill and provide a comprehensive summary that includes:

1. **Main Purpose**: What is the primary objective of this bill?
2. **Key Provisions**: What are the main requirements, prohibitions, or authorizations?
3. **AI-Related Elements**: How does this bill relate to artificial intelligence, if at all?
4. **Scope and Impact**: Who does this bill affect and what are the potential consequences?
5. **Implementation**: What mechanisms or processes does the bill establish?

**Requirements:**
- Keep the summary concise but comprehensive (aim for 200-400 words)
- Use clear, professional language
- Focus on the most important aspects of the bill
- If the bill is not related to AI, clearly state this
- Structure the response with clear sections using markdown formatting

**Bill Information:**
- Bill Number: {bill_number}
- Bill Title: {bill_title}
- State: {state}

**Bill Text:**
{bill_text}

Please provide your analysis:
"""
|
data/data_updating_scripts/PROMPTS/suggested_questions_prompt.md
ADDED
|
@@ -0,0 +1,25 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
You are an AI governance legislation expert. Your task is to analyze the provided bill and generate exactly 5 relevant, specific questions that users might want to ask about this particular bill.
|
| 2 |
+
|
| 3 |
+
The questions should:
|
| 4 |
+
- Be specific to the content and provisions of this bill
|
| 5 |
+
- Cover different aspects of the legislation (definitions, scope, enforcement, compliance, etc.)
|
| 6 |
+
- Be phrased as user-friendly questions that someone analyzing AI governance would ask
|
| 7 |
+
- Be practical and actionable for understanding the bill's impact
|
| 8 |
+
- Avoid generic questions that could apply to any bill
|
| 9 |
+
|
| 10 |
+
Focus on aspects like:
|
| 11 |
+
- Key definitions and terminology
|
| 12 |
+
- Scope and applicability
|
| 13 |
+
- Enforcement mechanisms and penalties
|
| 14 |
+
- Compliance requirements
|
| 15 |
+
- Rights and obligations
|
| 16 |
+
- Implementation timelines
|
| 17 |
+
- Regulatory oversight
|
| 18 |
+
- Specific AI technologies or systems mentioned
|
| 19 |
+
|
| 20 |
+
Format your response as exactly 5 questions, one per line, with no numbering or bullet points. Each question should be complete and ready to use.
|
| 21 |
+
|
| 22 |
+
### Bill Content
|
| 23 |
+
{context}
|
| 24 |
+
|
| 25 |
+
Generate 5 specific questions about this bill:
|
data/data_updating_scripts/__pycache__/config.cpython-313.pyc
ADDED
|
Binary file (2.43 kB). View file
|
|
|
data/data_updating_scripts/build_bills_vectorstore.py
ADDED
|
@@ -0,0 +1,46 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
import argparse, os
|
| 3 |
+
from pathlib import Path
|
| 4 |
+
from dotenv import load_dotenv
|
| 5 |
+
load_dotenv(dotenv_path=Path.cwd() / ".env")
|
| 6 |
+
|
| 7 |
+
import sys
|
| 8 |
+
from pathlib import Path
|
| 9 |
+
sys.path.append(str(Path(__file__).resolve().parents[1]))
|
| 10 |
+
|
| 11 |
+
def main():
    """CLI entry point: rebuild the bills vectorstore from the bills JSON dump.

    Selects the chroma or pinecone backend (flag or VECTOR_BACKEND env var)
    and reports the upsert statistics on stdout.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--source", default="data/known_bills_visualize.json")
    parser.add_argument("--backend", choices=["chroma", "pinecone"],
                        default=os.getenv("VECTOR_BACKEND", "chroma"))
    parser.add_argument("--persist", default="data/bills_vectorstore")
    parser.add_argument("--collection", default="bills")
    parser.add_argument("--manifest", default="data/bills_vectorstore_manifest.json")
    parser.add_argument("--model", default=None)
    parser.add_argument("--batch", type=int, default=128)
    opts = parser.parse_args()

    # Arguments shared by both backends; chroma additionally needs a
    # persist directory and collection name.
    shared_kwargs = dict(
        source_json_path=opts.source,
        manifest_path=opts.manifest,
        embed_model=opts.model,
        batch_size=opts.batch,
    )

    if opts.backend == "pinecone":
        from vectorstore.pinecone_bills_vectorstore import upsert_from_bills_json
        stats = upsert_from_bills_json(**shared_kwargs)
    else:
        from vectorstore.bills_vectorstore import upsert_from_bills_json
        stats = upsert_from_bills_json(
            persist_dir=opts.persist,
            collection=opts.collection,
            **shared_kwargs,
        )

    print("✅ Vectorstore updated")
    for key, value in stats.items():
        print(f"  {key}: {value}")

if __name__ == "__main__":
    main()
|
data/data_updating_scripts/build_bills_vectorstore_pinecone_delta.py
ADDED
|
@@ -0,0 +1,43 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
import os, json, time
|
| 3 |
+
from pathlib import Path
|
| 4 |
+
from typing import List, Dict, Any
|
| 5 |
+
from dotenv import load_dotenv
|
| 6 |
+
import sys
|
| 7 |
+
sys.path.append(str(Path(__file__).resolve().parents[1]))
|
| 8 |
+
|
| 9 |
+
load_dotenv(dotenv_path=Path.cwd() / ".env")
|
| 10 |
+
|
| 11 |
+
from vectorstore.pinecone_delta_upsert import chunk_bill, upsert_changed_vectors
|
| 12 |
+
|
| 13 |
+
SRC = "data/known_bills_visualize.json"
|
| 14 |
+
BATCH = int(os.getenv("PINECONE_BATCH", "128"))
|
| 15 |
+
|
| 16 |
+
def main():
    """Delta-upsert every bill chunk into Pinecone, in batches of BATCH.

    Reads the bills JSON, drops entries with no usable text, chunks each
    bill, and pushes only changed vectors; prints progress and a summary.
    """
    source = Path(SRC)
    if not source.exists():
        raise SystemExit(f"Missing {SRC}")

    bills: List[Dict[str, Any]] = json.loads(source.read_text(encoding="utf-8"))
    # Keep only bills that carry at least some text-like content.
    bills = [b for b in bills if (b.get("text") or b.get("description") or b.get("title"))]

    chunks: List[Dict[str, Any]] = []
    for bill in bills:
        chunks.extend(chunk_bill(bill))

    print(f"Total chunks computed: {len(chunks):,}")

    changed = 0
    started = time.time()
    for offset in range(0, len(chunks), BATCH):
        batch = chunks[offset:offset + BATCH]
        changed += upsert_changed_vectors(batch)
        # Progress line every 10 batches.
        if (offset // BATCH) % 10 == 0:
            print(f"… {offset + len(batch):,}/{len(chunks):,} processed")
    elapsed = time.time() - started

    print("✅ Pinecone delta upsert complete")
    print(f"  changed_upserts: {changed}")
    print(f"  elapsed_sec: {elapsed:.1f}")

if __name__ == "__main__":
    main()
|
data/data_updating_scripts/config.py
ADDED
|
@@ -0,0 +1,43 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Configuration settings for LegiScan AI Governance Bills Tracker."""
|
| 2 |
+
import os
|
| 3 |
+
from pathlib import Path
|
| 4 |
+
import dotenv
|
| 5 |
+
|
| 6 |
+
dotenv.load_dotenv()
|
| 7 |
+
|
| 8 |
+
class ConfigManager:
    """Process-wide configuration read from environment variables.

    Values are read once at construction time; call :meth:`reload` to
    pick up environment changes (e.g. after re-loading a .env file).
    """

    def __init__(self):
        """Initialize configuration from the current environment."""
        self._load_base_config()

    def _load_base_config(self):
        """Load the base configuration shared by every deployment.

        OPENAI_API_KEY may legitimately be absent (offline/test runs);
        consumers are expected to check for None.
        """
        self.OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
        self.OPENAI_LLM_MODEL = os.getenv("OPENAI_LLM_MODEL", "gpt-4o")

    def reload(self):
        """Re-read configuration from the environment.

        Bug fix: the previous implementation also called
        _load_profile_config() and _validate_config(), neither of which
        exists on this class, so reload() always raised AttributeError.
        """
        self._load_base_config()

    def __str__(self) -> str:
        """Return a printable dump of the config with secrets masked.

        Bug fix: the previous header referenced ``self.profile``, an
        attribute that is never set, so str(config) always raised.
        """
        sensitive_keys = ["OPENAI_API_KEY", "LEGISCAN_API_KEY"]
        config_str = "Configuration:\n"
        for key, value in self.__dict__.items():
            if key.startswith("_"):
                continue
            if key in sensitive_keys:
                config_str += f"{key}: {'*' * 8}\n"
            else:
                config_str += f"{key}: {value}\n"
        return config_str
|
| 41 |
+
|
| 42 |
+
# Create default instance
|
| 43 |
+
config = ConfigManager()
|
data/data_updating_scripts/eu-ai-act.pdf
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:bba630444b3278e881066774002a1d7824308934f49ccfa203e65be43692f55e
|
| 3 |
+
size 2583319
|
data/data_updating_scripts/eu_vectorstore.py
ADDED
|
@@ -0,0 +1,269 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
# scripts/create_eu_ai_act_vectorstore.py
|
| 3 |
+
|
| 4 |
+
"""
|
| 5 |
+
Script to create and save a vectorstore from the EU AI Act PDF.
|
| 6 |
+
This creates a FAISS vectorstore that can be loaded quickly in the main app.
|
| 7 |
+
"""
|
| 8 |
+
|
| 9 |
+
import os
|
| 10 |
+
import logging
|
| 11 |
+
from pathlib import Path
|
| 12 |
+
import pickle
|
| 13 |
+
from typing import Optional
|
| 14 |
+
import dotenv
|
| 15 |
+
|
| 16 |
+
# Import config
|
| 17 |
+
from config import config
|
| 18 |
+
|
| 19 |
+
# PDF processing
|
| 20 |
+
import PyPDF2
|
| 21 |
+
|
| 22 |
+
# LangChain components
|
| 23 |
+
from langchain.text_splitter import RecursiveCharacterTextSplitter
|
| 24 |
+
from langchain_openai import OpenAIEmbeddings
|
| 25 |
+
from langchain_community.vectorstores import FAISS
|
| 26 |
+
from langchain.schema import Document
|
| 27 |
+
|
| 28 |
+
# Load environment variables
|
| 29 |
+
dotenv.load_dotenv()
|
| 30 |
+
|
| 31 |
+
# Create logs directory if it doesn't exist
|
| 32 |
+
os.makedirs("data_updating_scripts/logs", exist_ok=True)
|
| 33 |
+
|
| 34 |
+
# Configure logging
|
| 35 |
+
logging.basicConfig(
|
| 36 |
+
level=logging.INFO,
|
| 37 |
+
format="%(asctime)s [%(levelname)s] %(message)s",
|
| 38 |
+
handlers=[logging.StreamHandler(), logging.FileHandler("data_updating_scripts/logs/eu_vectorstore.log")],
|
| 39 |
+
)
|
| 40 |
+
|
| 41 |
+
logger = logging.getLogger(__name__)
|
| 42 |
+
|
| 43 |
+
def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract the full text of a PDF, page by page.

    Args:
        pdf_path: Path to the PDF file on disk.

    Returns:
        The concatenated text of all pages, each prefixed with a
        "--- Page N ---" marker. Pages that fail to extract are skipped
        with a warning.

    Raises:
        Exception: Re-raises any error from opening or parsing the PDF
            itself (not per-page extraction failures).
    """
    try:
        with open(pdf_path, 'rb') as file:
            pdf_reader = PyPDF2.PdfReader(file)
            # Collect per-page strings and join once at the end:
            # repeated `text +=` is quadratic on large documents.
            parts = []

            logger.info(f"Processing {len(pdf_reader.pages)} pages from {pdf_path}")

            for page_num, page in enumerate(pdf_reader.pages):
                try:
                    page_text = page.extract_text()
                    parts.append(f"\n\n--- Page {page_num + 1} ---\n\n{page_text}")
                except Exception as e:
                    logger.warning(f"Error extracting text from page {page_num + 1}: {e}")
                    continue

            text = "".join(parts)
            logger.info(f"Extracted {len(text)} characters from PDF")
            return text

    except Exception as e:
        logger.error(f"Error reading PDF {pdf_path}: {e}")
        # Bare raise preserves the original traceback (was `raise e`).
        raise
|
| 66 |
+
|
| 67 |
+
def create_eu_ai_act_documents(text_content: str) -> list:
    """Split the EU AI Act text into chunked Document objects.

    Every chunk carries regulation-level metadata plus its own index
    (`chunk_id`) and the total chunk count (`total_chunks`).
    """
    try:
        # Chunking tuned for legal prose: large chunks, generous overlap.
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=1500,
            chunk_overlap=200,
            length_function=len,
            separators=["\n\n", "\n", ". ", " ", ""]
        )

        base_metadata = {
            'source': 'EU AI Act',
            'document_type': 'regulation',
            'jurisdiction': 'European Union',
            'title': 'Regulation (EU) 2024/1689 on Artificial Intelligence (AI Act)'
        }
        source_doc = Document(page_content=text_content, metadata=base_metadata)

        chunks = splitter.split_documents([source_doc])

        # Tag each chunk with its position so retrieval hits can be located.
        total = len(chunks)
        for index, chunk in enumerate(chunks):
            chunk.metadata.update({
                'chunk_id': index,
                'total_chunks': total
            })

        logger.info(f"Created {total} document chunks")
        return chunks

    except Exception as e:
        logger.error(f"Error creating documents: {e}")
        raise e
|
| 105 |
+
|
| 106 |
+
def create_and_save_eu_vectorstore(
    pdf_path: str = "data_updating_scripts/eu-ai-act.pdf",
    vectorstore_path: str = "data/eu_ai_act_vectorstore",
    openai_api_key: Optional[str] = None
) -> bool:
    """
    Create FAISS vectorstore from EU AI Act PDF and save it locally.

    Args:
        pdf_path: Path to the EU AI Act PDF file
        vectorstore_path: Directory to save the vectorstore
        openai_api_key: OpenAI API key (if not provided, uses config/env)

    Returns:
        bool: True if successful, False otherwise (all errors are logged
        and swallowed so callers can decide how to react).
    """
    try:
        # Check if PDF exists
        if not Path(pdf_path).exists():
            logger.error(f"PDF file not found: {pdf_path}")
            return False

        # Get API key
        api_key = openai_api_key or config.OPENAI_API_KEY
        if not api_key:
            logger.error("OpenAI API key not found")
            return False

        logger.info("Starting EU AI Act vectorstore creation...")

        # Extract text from PDF
        logger.info("Extracting text from PDF...")
        text_content = extract_text_from_pdf(pdf_path)

        # A real extraction of the Act is far larger than 1 KB; anything
        # shorter indicates the PDF parse silently produced garbage.
        if not text_content or len(text_content) < 1000:
            logger.error("Insufficient text extracted from PDF")
            return False

        # Create documents
        logger.info("Creating document chunks...")
        documents = create_eu_ai_act_documents(text_content)
        if not documents:
            logger.error("No documents created")
            return False

        # Initialize embeddings
        logger.info("Initializing embeddings...")
        embeddings = OpenAIEmbeddings(
            api_key=api_key,
            model="text-embedding-3-small"
        )

        # Create vectorstore
        logger.info("Creating FAISS vectorstore...")
        vectorstore = FAISS.from_documents(documents, embeddings)

        # Bug fix: parents=True so a missing intermediate directory
        # (e.g. "data/") does not abort the save with FileNotFoundError.
        Path(vectorstore_path).mkdir(parents=True, exist_ok=True)

        # Save vectorstore
        logger.info(f"Saving vectorstore to {vectorstore_path}...")
        vectorstore.save_local(vectorstore_path)

        # Persist build metadata next to the index for later inspection
        # (see get_vectorstore_info).
        metadata = {
            'pdf_path': pdf_path,
            'total_chunks': len(documents),
            'text_length': len(text_content),
            'embedding_model': 'text-embedding-3-small',
            'chunk_size': 1500,
            'chunk_overlap': 200
        }
        metadata_path = Path(vectorstore_path) / "metadata.pickle"
        with open(metadata_path, 'wb') as f:
            pickle.dump(metadata, f)

        logger.info("✅ EU AI Act vectorstore created successfully!")
        logger.info(f"  - Total chunks: {len(documents)}")
        logger.info(f"  - Text length: {len(text_content):,} characters")
        logger.info(f"  - Saved to: {vectorstore_path}")

        return True

    except Exception as e:
        logger.error(f"Error creating EU AI Act vectorstore: {e}")
        return False
|
| 194 |
+
|
| 195 |
+
def load_eu_vectorstore(
    vectorstore_path: str = "data/eu_ai_act_vectorstore",
    openai_api_key: Optional[str] = None
) -> Optional[FAISS]:
    """
    Load the EU AI Act vectorstore from disk.

    Args:
        vectorstore_path: Path to the saved vectorstore. Consistency fix:
            the default now matches the location written by
            create_and_save_eu_vectorstore() ("data/eu_ai_act_vectorstore");
            the old default ("eu_ai_act_vectorstore") pointed at a
            directory nothing ever writes, so no-arg calls always failed.
        openai_api_key: OpenAI API key (falls back to config/env).

    Returns:
        FAISS vectorstore or None if loading failed (errors are logged).
    """
    try:
        if not Path(vectorstore_path).exists():
            logger.error(f"Vectorstore not found: {vectorstore_path}")
            return None

        # Get API key
        api_key = openai_api_key or config.OPENAI_API_KEY
        if not api_key:
            logger.error("OpenAI API key not found")
            return None

        # Embedding model must match the one used at build time.
        embeddings = OpenAIEmbeddings(
            api_key=api_key,
            model="text-embedding-3-small"
        )

        # The saved index contains pickled objects, so FAISS requires an
        # explicit opt-in to deserialize. Only load stores this project
        # created itself — never an untrusted index.
        vectorstore = FAISS.load_local(
            vectorstore_path,
            embeddings,
            allow_dangerous_deserialization=True
        )

        logger.info(f"✅ EU AI Act vectorstore loaded from {vectorstore_path}")
        return vectorstore

    except Exception as e:
        logger.error(f"Error loading EU AI Act vectorstore: {e}")
        return None
|
| 239 |
+
|
| 240 |
+
def get_vectorstore_info(vectorstore_path: str = "data/eu_ai_act_vectorstore") -> dict:
    """Return the metadata dict saved alongside the vectorstore.

    Yields an {"error": ...} dict when the metadata file is missing or
    cannot be read, instead of raising.
    """
    try:
        metadata_file = Path(vectorstore_path) / "metadata.pickle"
        if not metadata_file.exists():
            return {"error": "Metadata not found"}
        with open(metadata_file, 'rb') as handle:
            return pickle.load(handle)
    except Exception as e:
        return {"error": str(e)}
|
| 252 |
+
|
| 253 |
+
if __name__ == "__main__":
    # Build the vectorstore, then print a summary of what was created.
    success = create_and_save_eu_vectorstore()

    if success:
        info = get_vectorstore_info()
        print("\n" + "=" * 50)
        print("EU AI Act Vectorstore Information:")
        print("=" * 50)
        for key, value in info.items():
            if key != 'error':
                print(f"{key}: {value}")
        print("=" * 50)
    else:
        print("❌ Failed to create EU AI Act vectorstore")
        # Fix: exit() is a site-module convenience that may be absent
        # (e.g. under `python -S`); raise SystemExit for a reliable
        # non-zero exit code.
        raise SystemExit(1)
|
data/data_updating_scripts/fix_pdf_bills.py
ADDED
|
@@ -0,0 +1,282 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import os
|
| 2 |
+
import json
|
| 3 |
+
import base64
|
| 4 |
+
import logging
|
| 5 |
+
import sys
|
| 6 |
+
from datetime import datetime, timezone
|
| 7 |
+
import requests
|
| 8 |
+
from dotenv import load_dotenv
|
| 9 |
+
import PyPDF2
|
| 10 |
+
from io import BytesIO
|
| 11 |
+
import re
|
| 12 |
+
import shutil
|
| 13 |
+
|
| 14 |
+
# Load environment variables
|
| 15 |
+
load_dotenv()
|
| 16 |
+
API_KEY = os.getenv("LEGISCAN_API_KEY")
|
| 17 |
+
|
| 18 |
+
# Files
|
| 19 |
+
INPUT_FILE = "data/known_bills.json"
|
| 20 |
+
OUTPUT_FILE = "data/known_bills_fixed.json"
|
| 21 |
+
BACKUP_FILE = "data/known_bills_backup.json"
|
| 22 |
+
|
| 23 |
+
# Rate limiting
|
| 24 |
+
import time
|
| 25 |
+
RATE_LIMIT = 0.2 # seconds between API requests
|
| 26 |
+
|
| 27 |
+
# Logging configuration
|
| 28 |
+
LOG_FILE = "data_updating_scripts/logs/fix_pdf_bills.log"
|
| 29 |
+
os.makedirs("data_updating_scripts/logs", exist_ok=True)
|
| 30 |
+
|
| 31 |
+
logging.basicConfig(
|
| 32 |
+
level=logging.INFO,
|
| 33 |
+
format="%(asctime)s [%(levelname)s] %(message)s",
|
| 34 |
+
handlers=[
|
| 35 |
+
logging.StreamHandler(sys.stdout),
|
| 36 |
+
logging.FileHandler(LOG_FILE)
|
| 37 |
+
]
|
| 38 |
+
)
|
| 39 |
+
logger = logging.getLogger(__name__)
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
def is_pdf_content(text):
    """Return True if *text* looks like raw (unextracted) PDF bytes.

    Detects a %PDF version header — both the standard "%PDF-1.x" form
    and the dashless "%PDF1.x" variant seen in some payloads — at the
    very start of the string.

    Args:
        text: Bill text to inspect; may be None or empty.

    Returns:
        bool: True when the text starts with a known PDF signature.
    """
    if not text:
        return False
    pdf_signatures = ("%PDF-1.3", "%PDF-1.4", "%PDF-1.5", "%PDF-1.6", "%PDF-1.7",
                      "%PDF1.3", "%PDF1.4", "%PDF1.5", "%PDF1.6", "%PDF1.7")
    # str.startswith accepts a tuple of prefixes, so the defensive
    # text[:20] slice in the original (startswith only inspects the
    # prefix anyway) and the any() loop are both unnecessary.
    return text.startswith(pdf_signatures)
|
| 50 |
+
|
| 51 |
+
|
| 52 |
+
def extract_text_from_pdf_bytes(pdf_bytes):
    """Extract and lightly normalize text from raw PDF bytes.

    Args:
        pdf_bytes: Raw PDF file content (e.g. base64-decoded API payload).

    Returns:
        str | None: Cleaned text — runs of blank lines collapsed to one
        paragraph break and runs of spaces collapsed to one — or None if
        the PDF could not be parsed.
    """
    try:
        pdf_reader = PyPDF2.PdfReader(BytesIO(pdf_bytes))

        # Iterate page objects directly rather than indexing via
        # range(len(...)); skip pages that yield no text.
        text_content = []
        for page in pdf_reader.pages:
            page_text = page.extract_text()
            if page_text:
                text_content.append(page_text)

        full_text = "\n".join(text_content)

        # Clean up the extracted text: normalize whitespace while
        # preserving paragraph breaks.
        full_text = re.sub(r'\n{3,}', '\n\n', full_text)
        full_text = re.sub(r' {2,}', ' ', full_text)
        full_text = full_text.strip()

        return full_text
    except Exception as e:
        logger.error(f"Error extracting text from PDF: {e}")
        return None
|
| 77 |
+
|
| 78 |
+
|
| 79 |
+
def legi_request(op, params):
    """Call the LegiScan API and return the parsed JSON payload.

    Mutates *params* by adding the API key and operation, then issues a
    GET. Returns None (after logging) on transport errors or when the
    API reports a non-OK status.
    """
    endpoint = "https://api.legiscan.com/"
    params.update({"key": API_KEY, "op": op})
    try:
        response = requests.get(endpoint, params=params, timeout=10)
        response.raise_for_status()
        payload = response.json()
    except requests.RequestException as e:
        logger.error(f"Request failed ({op}): {e}")
        return None
    if payload.get("status") != "OK":
        logger.error(f"API error {op}: {payload.get('message', payload)}")
        return None
    return payload
|
| 94 |
+
|
| 95 |
+
|
| 96 |
+
def fix_pdf_bill(bill):
    """Re-fetch a bill whose stored text is raw PDF bytes and extract real text.

    Args:
        bill: Bill dict containing at least "bill_id", "state" and
            "bill_number".

    Returns:
        str | None: Extracted plain text (PDF first, HTML fallback), or
        None when the document could not be recovered. All failure modes
        are logged rather than raised.
    """
    bill_id = bill.get("bill_id")
    state = bill.get("state")
    bill_num = bill.get("bill_number")

    logger.info(f"Fixing PDF content for {state} {bill_num} (ID: {bill_id})")

    # Re-fetch the bill so we get a fresh list of text documents.
    details_resp = legi_request("getBill", {"id": bill_id})
    if not details_resp:
        logger.warning(f"Could not fetch bill details for {bill_id}")
        return None

    details = details_resp.get("bill", {})
    texts = details.get("texts", [])
    if not texts:
        logger.warning(f"No text documents available for {bill_id}")
        return None

    # Use the first listed document version.
    doc_id = texts[0].get("doc_id")
    text_resp = legi_request("getBillText", {"id": doc_id})
    if not text_resp or "text" not in text_resp:
        logger.warning(f"Could not fetch text for {bill_id}")
        return None

    raw_b64 = text_resp["text"].get("doc", "")
    if not raw_b64:
        logger.warning(f"No document content for {bill_id}")
        return None

    try:
        # Decode the base64 content
        decoded = base64.b64decode(raw_b64)

        # PDF files start with the %PDF magic bytes.
        if decoded[:4] == b'%PDF':
            extracted_text = extract_text_from_pdf_bytes(decoded)
            # Require a minimum length so junk extractions are rejected.
            if extracted_text and len(extracted_text.strip()) > 100:
                logger.info(f"Successfully extracted {len(extracted_text)} characters from PDF for {bill_id}")
                return extracted_text
            else:
                logger.warning(f"Extracted text too short or empty for {bill_id}")
                return None
        else:
            # Not a PDF: try to decode as HTML (shouldn't happen for
            # these cases, but just in case).
            # Bug fix: the original bare `except:` also swallowed
            # KeyboardInterrupt/SystemExit; narrowed to Exception and
            # the failure is now logged instead of silently dropped.
            try:
                from bs4 import BeautifulSoup
                html = decoded.decode("utf-8", errors="ignore")
                soup = BeautifulSoup(html, "html.parser")
                plain_text = soup.get_text(separator="\n", strip=True)
                if plain_text and len(plain_text.strip()) > 100:
                    logger.info(f"Successfully extracted HTML text for {bill_id}")
                    return plain_text
            except Exception as e:
                logger.debug(f"HTML fallback failed for {bill_id}: {e}")

            logger.warning(f"Could not process document for {bill_id}")
            return None

    except Exception as e:
        logger.error(f"Error processing document for {bill_id}: {e}")
        return None
|
| 163 |
+
|
| 164 |
+
|
| 165 |
+
def main(overwrite: bool | None = None) -> None:
    """Re-extract readable text for bills whose stored text is unprocessed PDF content.

    Loads bills from INPUT_FILE, backs them up to BACKUP_FILE, runs
    fix_pdf_bill() on every bill flagged by is_pdf_content(), and writes the
    results to OUTPUT_FILE. Optionally copies OUTPUT_FILE back over
    INPUT_FILE at the end.

    Args:
        overwrite: None  -> interactive CLI mode; ask via input() before
                           overwriting INPUT_FILE.
                   True  -> overwrite INPUT_FILE without asking
                           (non-interactive pipeline mode).
                   False -> never overwrite INPUT_FILE.
    """
    # Load the bills
    logger.info(f"Loading bills from {INPUT_FILE}")
    try:
        with open(INPUT_FILE, 'r') as f:
            bills = json.load(f)
    except Exception as e:
        logger.error(f"Could not load bills file: {e}")
        sys.exit(1)

    logger.info(f"Loaded {len(bills)} bills")

    # Create a backup before mutating anything, so a bad run is recoverable.
    logger.info(f"Creating backup at {BACKUP_FILE}")
    with open(BACKUP_FILE, 'w') as f:
        json.dump(bills, f, indent=2)

    # Find bills with unprocessed PDF content (indices, so we can update in place)
    pdf_bills = []
    for i, bill in enumerate(bills):
        if is_pdf_content(bill.get("text")):
            pdf_bills.append(i)

    logger.info(f"Found {len(pdf_bills)} bills with unprocessed PDF content")

    # Process each PDF bill
    fixed_count = 0
    failed_count = 0

    for idx, bill_idx in enumerate(pdf_bills):
        bill = bills[bill_idx]
        logger.info(f"Processing {idx + 1}/{len(pdf_bills)}: {bill.get('state')} {bill.get('bill_number')}")

        # Try to fix the PDF content
        fixed_text = fix_pdf_bill(bill)

        if fixed_text:
            # Update the bill with the fixed text
            bills[bill_idx]["text"] = fixed_text
            bills[bill_idx]["lastUpdatedAt"] = datetime.now(timezone.utc).isoformat()
            bills[bill_idx]["text_fixed"] = True  # Mark that we fixed this
            fixed_count += 1
            logger.info(f"Successfully fixed bill {bill.get('bill_id')}")
        else:
            # Mark that we tried but failed, so later runs can tell the
            # difference between "never attempted" and "attempted and failed".
            bills[bill_idx]["text_extraction_failed"] = True
            bills[bill_idx]["lastUpdatedAt"] = datetime.now(timezone.utc).isoformat()
            failed_count += 1
            logger.warning(f"Failed to fix bill {bill.get('bill_id')}")

        # Rate limiting between upstream API calls
        time.sleep(RATE_LIMIT)

        # Save progress every 50 bills so an interrupted run loses little work
        if (idx + 1) % 50 == 0:
            logger.info(f"Saving progress... ({idx + 1}/{len(pdf_bills)} processed)")
            with open(OUTPUT_FILE, 'w') as f:
                json.dump(bills, f, indent=2)

    # Save final results
    logger.info(f"Saving final results to {OUTPUT_FILE}")
    with open(OUTPUT_FILE, 'w') as f:
        json.dump(bills, f, indent=2)

    logger.info(f"Processing complete!")
    logger.info(f"Successfully fixed: {fixed_count} bills")
    logger.info(f"Failed to fix: {failed_count} bills")
    logger.info(f"Output saved to: {OUTPUT_FILE}")

    if fixed_count > 0:
        # Decide overwrite behavior
        if overwrite is None:
            # CLI mode: ask the user (guardrail preserved)
            try:
                response = input(
                    f"\nDo you want to overwrite {INPUT_FILE} with the fixed data? (y/n): "
                )
            except EOFError:
                # stdin is closed (e.g. run under a scheduler): fail safe and
                # keep the original file; OUTPUT_FILE already holds the fixes.
                logger.error(
                    "No input available (EOF). Leaving original file unchanged."
                )
                return
            overwrite_flag = response.strip().lower().startswith("y")
        else:
            # Non-interactive mode (e.g. Streamlit pipeline)
            overwrite_flag = overwrite

        if overwrite_flag:
            shutil.copy2(OUTPUT_FILE, INPUT_FILE)
            logger.info(f"Original file {INPUT_FILE} has been updated with fixed data.")
        else:
            logger.info("Overwrite declined; original file left unchanged.")
|
| 257 |
+
|
| 258 |
+
|
| 259 |
+
|
| 260 |
+
if __name__ == "__main__":
    # Entry point. When running under Streamlit / the pipeline, the
    # FIX_PDF_OVERWRITE environment variable decides overwrite behaviour:
    #   "yes", "y", "true", "1"  -> overwrite=True
    #   "no", "n", "false", "0"  -> overwrite=False
    # When it is unset we fall back to CLI mode and main() asks via input().
    raw_setting = os.getenv("FIX_PDF_OVERWRITE")

    if raw_setting is None:
        # Local CLI run -> still interactive
        main(overwrite=None)
    else:
        normalized = raw_setting.strip().lower()
        if normalized in ("yes", "y", "true", "1"):
            decision = True
        elif normalized in ("no", "n", "false", "0"):
            decision = False
        else:
            logger.warning(
                f"Invalid FIX_PDF_OVERWRITE='{raw_setting}', defaulting to no overwrite."
            )
            decision = False
        main(overwrite=decision)
|
| 281 |
+
|
| 282 |
+
|
data/data_updating_scripts/generate_reports.py
ADDED
|
@@ -0,0 +1,274 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
generate_reports.py
|
| 3 |
+
--------------------
|
| 4 |
+
|
| 5 |
+
Generates detailed Markdown reports for AI-related bills from `known_bills_visualize.json`
|
| 6 |
+
using the latest LangChain pipeline syntax.
|
| 7 |
+
|
| 8 |
+
Now includes resume functionality - can be safely stopped and restarted.
|
| 9 |
+
"""
|
| 10 |
+
|
| 11 |
+
from __future__ import annotations
|
| 12 |
+
|
| 13 |
+
import json
|
| 14 |
+
import logging
|
| 15 |
+
import os
|
| 16 |
+
import time
|
| 17 |
+
from dataclasses import dataclass
|
| 18 |
+
from typing import Any, Dict, List, Optional
|
| 19 |
+
import dotenv
|
| 20 |
+
|
| 21 |
+
|
| 22 |
+
dotenv.load_dotenv()
|
| 23 |
+
|
| 24 |
+
# Create logs directory if it doesn't exist
|
| 25 |
+
os.makedirs("data_updating_scripts/logs", exist_ok=True)
|
| 26 |
+
|
| 27 |
+
# Latest LangChain imports
|
| 28 |
+
try:
|
| 29 |
+
from langchain_openai import ChatOpenAI
|
| 30 |
+
from langchain.prompts import ChatPromptTemplate
|
| 31 |
+
except ImportError: # pragma: no cover
|
| 32 |
+
ChatOpenAI = None # type: ignore
|
| 33 |
+
ChatPromptTemplate = None # type: ignore
|
| 34 |
+
|
| 35 |
+
# Configure logging
|
| 36 |
+
logging.basicConfig(
|
| 37 |
+
level=logging.INFO,
|
| 38 |
+
format="%(asctime)s [%(levelname)s] %(message)s",
|
| 39 |
+
handlers=[logging.StreamHandler(), logging.FileHandler("data_updating_scripts/logs/generate_reports.log")],
|
| 40 |
+
)
|
| 41 |
+
|
| 42 |
+
logger = logging.getLogger(__name__)
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
@dataclass
class BillReport:
    """Stores a bill ID and its generated detailed report."""
    # Bill identifier, stringified by create_detailed_report() from the
    # bill's "bill_id" field.
    bill_id: str
    # Full Markdown report text produced by the LLM.
    report_markdown: str
|
| 50 |
+
|
| 51 |
+
|
| 52 |
+
# Prompt template for the detailed-report chain.
#
# BUGFIX: the import guard above sets ChatPromptTemplate = None when
# langchain is unavailable, but the original code unconditionally called
# ChatPromptTemplate.from_template(...) at module level, crashing on import
# and defeating that graceful fallback. The prompt is now only built when
# the import succeeded; _ensure_llm() still raises a clear RuntimeError when
# the packages are missing.
_DETAILED_REPORT_TEMPLATE = """You are a seasoned legislative analyst adept at interpreting and
summarising bills related to artificial intelligence. Using the bill
information provided as JSON, produce a detailed report in Markdown
format for stakeholders.

Include:
- Bill's title, number, and state
- Status and key dates
- URL to the bill on legiscan
- Sponsors and scope
- Goals and intent
- Key provisions, regulatory approaches, implementation & enforcement
- Unique aspects or notable features

Format:
- Use Markdown headings and bullet points
- Paraphrase content
- Do not invent facts
- If bill text is truncated in source JSON, note this at the end

Bill JSON:
```json
{bill_json}
```

Now craft the detailed report.
"""

DETAILED_REPORT_PROMPT = (
    ChatPromptTemplate.from_template(_DETAILED_REPORT_TEMPLATE)
    if ChatPromptTemplate is not None
    else None  # langchain missing; _ensure_llm() reports the actionable error
)
|
| 82 |
+
|
| 83 |
+
|
| 84 |
+
def _ensure_llm() -> ChatOpenAI:
    """Build a deterministic ChatOpenAI client after validating prerequisites.

    Raises:
        RuntimeError: if langchain/openai are not installed, or the
            OPENAI_API_KEY environment variable is unset/empty.
    """
    if ChatOpenAI is None:
        raise RuntimeError(
            "The 'langchain' and 'openai' packages are required. Install them via 'pip install langchain openai'."
        )
    if not os.getenv("OPENAI_API_KEY"):
        raise RuntimeError("The OPENAI_API_KEY environment variable is not set.")
    chosen_model = os.getenv("MODEL_NAME", "gpt-4o")
    logger.debug("Initialising ChatOpenAI with model %s", chosen_model)
    # temperature=0 keeps report generation as repeatable as possible.
    return ChatOpenAI(model=chosen_model, temperature=0)
|
| 96 |
+
|
| 97 |
+
|
| 98 |
+
def create_detailed_report(
    bill: Dict[str, Any], *, llm: Optional[ChatOpenAI] = None
) -> BillReport:
    """Produce a Markdown report for one bill via the prompt | llm pipeline."""
    active_llm = llm if llm is not None else _ensure_llm()

    serialized = json.dumps(bill, ensure_ascii=False, indent=2)

    # LCEL composition: prompt output feeds straight into the model.
    pipeline = DETAILED_REPORT_PROMPT | active_llm
    outcome = pipeline.invoke({"bill_json": serialized})

    # invoke() may return an AIMessage (has .content) or a plain value.
    markdown = getattr(outcome, "content", str(outcome))

    return BillReport(bill_id=str(bill.get("bill_id")), report_markdown=markdown)
|
| 115 |
+
|
| 116 |
+
|
| 117 |
+
def load_existing_reports(output_path: str) -> Dict[str, str]:
    """Return previously generated reports keyed by bill_id ({} if none usable)."""
    if not os.path.exists(output_path):
        return {}
    try:
        with open(output_path, "r", encoding="utf-8") as f:
            stored = json.load(f)
        # File holds a list of {"bill_id": ..., "report_markdown": ...} records;
        # index them by bill_id, ignoring malformed entries.
        by_id = {
            entry["bill_id"]: entry["report_markdown"]
            for entry in stored
            if "bill_id" in entry and "report_markdown" in entry
        }
        logger.info(f"Loaded {len(by_id)} existing reports from {output_path}")
        return by_id
    except Exception as e:
        # A corrupt file just means we regenerate from scratch.
        logger.warning(f"Could not load existing reports: {e}")
        return {}
|
| 135 |
+
|
| 136 |
+
|
| 137 |
+
def save_reports_to_file(reports_dict: Dict[str, str], output_path: str) -> None:
    """Persist reports as a JSON list of {bill_id, report_markdown} records."""
    serializable = [
        {"bill_id": key, "report_markdown": markdown}
        for key, markdown in reports_dict.items()
    ]
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(serializable, f, ensure_ascii=False, indent=2)
    logger.info("Saved %d reports to %s", len(serializable), output_path)
|
| 147 |
+
|
| 148 |
+
|
| 149 |
+
def create_reports_with_resume(
    bills: List[Dict[str, Any]],
    output_path: str,
    *,
    llm: Optional[ChatOpenAI] = None,
    save_interval: int = 10
) -> Dict[str, str]:
    """
    Generate detailed reports for multiple bills with resume capability.

    Reports already present at ``output_path`` (and not marked "ERROR:") are
    kept and their bills skipped, so the run can be stopped and restarted
    safely; failed bills are retried on the next run.

    Args:
        bills: List of bill dictionaries
        output_path: Path to save reports
        llm: Optional LLM instance (created via _ensure_llm() when omitted)
        save_interval: Save progress every N bills

    Returns:
        Dictionary of bill_id -> report_markdown
    """
    if not bills:
        return {}

    if llm is None:
        llm = _ensure_llm()

    # Load existing reports so completed work is not redone after a restart.
    reports_dict = load_existing_reports(output_path)

    # Track progress
    total_bills = len(bills)
    processed = 0
    skipped = 0
    errors = 0

    logger.info(f"Starting report generation for {total_bills} bills")

    for i, bill in enumerate(bills, 1):
        bill_id = str(bill.get("bill_id"))

        # Skip if already processed successfully (ERROR: entries are retried)
        if bill_id in reports_dict and reports_dict[bill_id] and not reports_dict[bill_id].startswith("ERROR:"):
            logger.info(f"Skipping bill {bill_id} - already processed ({i}/{total_bills})")
            skipped += 1
            continue

        logger.info(f"Processing {i}/{total_bills}: Bill ID {bill_id}")

        try:
            report = create_detailed_report(bill, llm=llm)
            reports_dict[bill_id] = report.report_markdown
            processed += 1

        except Exception as exc:
            logger.exception(
                "Failed to generate report for bill %s: %s", bill_id, exc
            )
            # Record the failure in-line so the next run knows to retry it.
            reports_dict[bill_id] = f"ERROR: Failed to generate report - {str(exc)}"
            errors += 1

        # Save progress periodically
        if i % save_interval == 0:
            save_reports_to_file(reports_dict, output_path)
            logger.info(f"Progress: {i}/{total_bills} - Processed: {processed}, Skipped: {skipped}, Errors: {errors}")

        # Rate limiting to avoid API throttling. Every bill reaching this
        # point made an API call (skipped bills `continue` above).
        # BUGFIX: the previous condition (`bill_id not in reports_dict or
        # reports_dict[bill_id].startswith("ERROR:")`) was always False after
        # a successful call, so only *failed* calls were throttled.
        time.sleep(1)

    # Final save
    save_reports_to_file(reports_dict, output_path)

    logger.info("Report generation complete!")
    logger.info(f"Total bills: {total_bills}")
    logger.info(f"Successfully processed: {processed}")
    logger.info(f"Skipped (already done): {skipped}")
    logger.info(f"Errors: {errors}")

    return reports_dict
|
| 227 |
+
|
| 228 |
+
|
| 229 |
+
def read_bills_from_file(path: str) -> List[Dict[str, Any]]:
    """Load *path* as JSON and return its contents, which must be a list of bills.

    Raises:
        ValueError: if the file's top-level JSON value is not a list.
    """
    with open(path, "r", encoding="utf-8") as f:
        parsed = json.load(f)
    if not isinstance(parsed, list):
        raise ValueError(f"Expected list of bills in {path}, got {type(parsed)}")
    return parsed
|
| 236 |
+
|
| 237 |
+
|
| 238 |
+
def generate_reports_from_files(
    input_path: str = "data/known_bills_visualize.json",
    output_path: str = "data/bill_reports.json",
) -> None:
    """Read bills, generate reports with resume capability, and write them to disk.

    Thin convenience wrapper: loads the bill list from ``input_path`` and
    delegates to create_reports_with_resume(), which persists results to
    ``output_path`` incrementally.
    """
    bills = read_bills_from_file(input_path)
    create_reports_with_resume(bills, output_path)
|
| 245 |
+
|
| 246 |
+
|
| 247 |
+
def main() -> None:
    """CLI entry point: parse arguments and run resumable report generation."""
    # Local imports keep the module importable as a library without argparse.
    import argparse
    import sys

    # NOTE: logging is configured once at module import time; the previous
    # in-function logging.basicConfig() call was redundant (basicConfig is a
    # no-op once handlers are installed) and has been removed.
    parser = argparse.ArgumentParser(
        description="Generate detailed AI legislation reports from bill data with resume capability."
    )
    parser.add_argument("--input", default="data/known_bills_visualize.json", help="Path to input JSON file")
    parser.add_argument("--output", default="data/bill_reports.json", help="Path to output JSON file")
    parser.add_argument("--save-interval", type=int, default=10, help="Save progress every N bills (default: 10)")
    args = parser.parse_args()

    try:
        bills = read_bills_from_file(args.input)
        create_reports_with_resume(bills, args.output, save_interval=args.save_interval)
        print("✅ Report generation completed successfully!")
        print(f" Reports saved to: {args.output}")
    except Exception as e:
        logger.error(f"Fatal error: {e}")
        print(f"❌ Error: {e}")
        sys.exit(1)


if __name__ == "__main__":
    main()
|
data/data_updating_scripts/generate_suggested_questions.py
ADDED
|
@@ -0,0 +1,269 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Script to generate suggested questions for all bills in known_bills_visualize.json.
|
| 4 |
+
|
| 5 |
+
This script reads all bills from known_bills_visualize.json, generates 5 suggested questions using OpenAI API,
|
| 6 |
+
and saves them to data/bill_suggested_questions.json to avoid repeated API calls.
|
| 7 |
+
"""
|
| 8 |
+
|
| 9 |
+
import json
|
| 10 |
+
import logging
|
| 11 |
+
import time
|
| 12 |
+
import pandas as pd
|
| 13 |
+
from pathlib import Path
|
| 14 |
+
from typing import Dict, List, Optional
|
| 15 |
+
import sys
|
| 16 |
+
import os
|
| 17 |
+
|
| 18 |
+
# Add the project root to the path
|
| 19 |
+
sys.path.append(str(Path(__file__).parent.parent))
|
| 20 |
+
|
| 21 |
+
from config import ConfigManager
|
| 22 |
+
from langchain_openai import ChatOpenAI
|
| 23 |
+
from langchain_core.prompts import ChatPromptTemplate
|
| 24 |
+
from langchain.chains.combine_documents import create_stuff_documents_chain
|
| 25 |
+
from langchain_core.documents import Document
|
| 26 |
+
|
| 27 |
+
# Create logs directory if it doesn't exist
|
| 28 |
+
os.makedirs("data_updating_scripts/logs", exist_ok=True)
|
| 29 |
+
|
| 30 |
+
# Configure logging
|
| 31 |
+
logging.basicConfig(
|
| 32 |
+
level=logging.INFO,
|
| 33 |
+
format="%(asctime)s [%(levelname)s] %(message)s",
|
| 34 |
+
handlers=[logging.StreamHandler(), logging.FileHandler("data_updating_scripts/logs/generate_suggested_questions.log")]
|
| 35 |
+
)
|
| 36 |
+
logger = logging.getLogger(__name__)
|
| 37 |
+
|
| 38 |
+
class SuggestedQuestionsGenerator:
    """Generates suggested questions for all bills in known_bills_visualize.json.

    Workflow: load bills, skip any bill that already has 5 stored questions,
    run the stuff-documents chain on each remaining bill's text, parse the
    LLM output into exactly 5 questions (padding with fallbacks), and persist
    results to data/bill_suggested_questions.json every 10 bills.
    """

    def __init__(self) -> None:
        """Initialize the questions generator with configuration.

        Raises:
            ValueError: if OPENAI_API_KEY is missing from the environment.
            FileNotFoundError: if the system-prompt markdown file is absent.
        """
        self.config = ConfigManager()
        self.known_bills_file = Path("data/known_bills_visualize.json")
        self.questions_file = Path("data/bill_suggested_questions.json")

        # Initialize OpenAI LLM
        if not self.config.OPENAI_API_KEY:
            raise ValueError("OPENAI_API_KEY not found in environment variables")

        self.llm = ChatOpenAI(
            model=self.config.OPENAI_LLM_MODEL,
            temperature=0.3,
            max_tokens=500
        )

        # Load the system prompt from markdown file
        prompt_path = "data_updating_scripts/PROMPTS/suggested_questions_prompt.md"
        if not os.path.exists(prompt_path):
            raise FileNotFoundError(f"The specified file was not found: {prompt_path}")

        with open(prompt_path, "r") as file:
            system_prompt = file.read()

        # Create the prompt and chain
        self.prompt = ChatPromptTemplate.from_messages(
            [
                ("system", system_prompt),
                ("human", "Generate 5 specific questions about this bill based on its content."),
            ]
        )

        # Stuff-documents chain: the bill Document(s) are injected into the
        # prompt's {context} variable in a single LLM call.
        self.question_generation_chain = create_stuff_documents_chain(
            self.llm, self.prompt
        )

        # Fallback questions used whenever generation fails or yields < 5.
        self.fallback_questions = [
            "What are the key definitions in this bill?",
            "What are the enforcement mechanisms?",
            "Who does this bill apply to?",
            "What are the compliance requirements?",
            "What penalties are specified?"
        ]

        logger.info(f"Initialized SuggestedQuestionsGenerator with model: {self.config.OPENAI_LLM_MODEL}")

    def dataframe_to_documents(self, df: pd.DataFrame) -> List[Document]:
        """Convert DataFrame to list of Document objects.

        Rows with a missing/blank 'text' column are dropped; bill metadata is
        carried on each Document for traceability.
        """
        documents = []
        for _, row in df.iterrows():
            if 'text' in row and pd.notna(row['text']) and row['text'].strip():
                doc = Document(
                    page_content=row['text'],
                    metadata={
                        'bill_key': f"{row.get('state', 'Unknown')}_{row.get('bill_number', 'Unknown')}",
                        'state': row.get('state', 'Unknown'),
                        'bill_number': row.get('bill_number', 'Unknown'),
                        'title': row.get('title', 'No title')
                    }
                )
                documents.append(doc)
        return documents

    def load_known_bills(self) -> List[Dict]:
        """Load bills from known_bills_visualize.json.

        Raises:
            FileNotFoundError: if the bills file is missing.
            json.JSONDecodeError: if the file is not valid JSON.
        """
        try:
            with open(self.known_bills_file, 'r', encoding='utf-8') as f:
                bills = json.load(f)
            logger.info(f"Loaded {len(bills)} bills from {self.known_bills_file}")
            return bills
        except FileNotFoundError:
            logger.error(f"File not found: {self.known_bills_file}")
            raise
        except json.JSONDecodeError as e:
            logger.error(f"Error parsing JSON: {e}")
            raise

    def load_existing_questions(self) -> Dict:
        """Load existing questions if available.

        Returns {} when the file is absent or unreadable, so generation
        simply starts from scratch.
        """
        if self.questions_file.exists():
            try:
                with open(self.questions_file, 'r', encoding='utf-8') as f:
                    questions = json.load(f)
                logger.info(f"Loaded {len(questions)} existing question sets")
                return questions
            except Exception as e:
                logger.warning(f"Could not load existing questions: {e}")
                return {}
        return {}

    def save_questions(self, questions: Dict) -> None:
        """Save questions to JSON file.

        Raises:
            Exception: re-raised after logging if the write fails.
        """
        try:
            with open(self.questions_file, 'w', encoding='utf-8') as f:
                json.dump(questions, f, indent=2, ensure_ascii=False)
            logger.info(f"Saved {len(questions)} question sets to {self.questions_file}")
        except Exception as e:
            logger.error(f"Error saving questions: {e}")
            raise

    def parse_questions_response(self, response: str) -> List[str]:
        """Parse the LLM response into individual questions.

        Keeps only lines ending in '?', strips common list prefixes, and
        pads with fallback questions so exactly 5 are always returned.
        """
        questions = []
        if isinstance(response, str):
            # Split by lines and clean up
            lines = [line.strip() for line in response.split('\n') if line.strip()]
            # Filter out any numbering or bullet points
            for line in lines:
                # Remove common prefixes like "1.", "2.", "3.", "4.", "5.", "•", "-", "*", etc.
                clean_line = line
                if line.startswith(('1.', '2.', '3.', '4.', '5.', '•', '-', '*')):
                    # prefix + following space is 2 chars for all of the above
                    clean_line = line[2:].strip()
                elif line.startswith(('1)', '2)', '3)', '4)', '5)')):
                    clean_line = line[2:].strip()

                if clean_line and clean_line.endswith('?'):
                    questions.append(clean_line)

        # Ensure we have exactly 5 questions
        if len(questions) < 5:
            # Use fallback questions to fill up to 5
            questions.extend(self.fallback_questions[len(questions):])

        return questions[:5]  # Return only the first 5

    def generate_questions(self, bill: Dict) -> Optional[List[str]]:
        """Generate suggested questions for a single bill.

        Returns the fallback questions (never raises) when the bill has no
        text, no Document could be built, or the chain call fails.
        """
        try:
            bill_number = bill.get('bill_number', 'Unknown')
            bill_title = bill.get('title', 'No title')
            bill_text = bill.get('text', '')

            if not bill_text:
                logger.warning(f"No text found for bill {bill_number}")
                return self.fallback_questions

            # Convert bill to document format
            df = pd.DataFrame([bill])
            docs = self.dataframe_to_documents(df)

            if not docs:
                logger.warning(f"No document created for bill {bill_number}")
                return self.fallback_questions

            # Generate questions using the chain
            response = self.question_generation_chain.invoke({"context": docs})

            # Parse the response into questions
            questions = self.parse_questions_response(response)

            logger.info(f"Generated {len(questions)} questions for {bill_number}")
            return questions

        except Exception as e:
            logger.error(f"Error generating questions for bill {bill.get('bill_number', 'Unknown')}: {e}")
            return self.fallback_questions

    def generate_all_questions(self) -> None:
        """Generate suggested questions for all bills.

        Resumable: bills already holding 5 stored questions are skipped, and
        progress is flushed to disk every 10 bills.
        """
        # Load bills and existing questions
        bills = self.load_known_bills()
        existing_questions = self.load_existing_questions()

        # Track progress
        total_bills = len(bills)
        processed = 0
        errors = 0

        logger.info(f"Starting question generation for {total_bills} bills")

        for i, bill in enumerate(bills, 1):
            bill_key = f"{bill.get('state', 'Unknown')}_{bill.get('bill_number', 'Unknown')}"

            # Skip if already processed successfully
            if bill_key in existing_questions and len(existing_questions[bill_key].get('suggested_questions', [])) == 5:
                logger.info(f"Skipping {bill_key} - already processed")
                processed += 1
                continue

            logger.info(f"Processing {i}/{total_bills}: {bill_key}")

            # Generate questions
            questions = self.generate_questions(bill)

            # Store result
            existing_questions[bill_key] = {
                'bill_number': bill.get('bill_number', 'Unknown'),
                'title': bill.get('title', 'No title'),
                'suggested_questions': questions
            }

            # NOTE(review): only a result exactly equal to the full fallback
            # list is counted as an error; partially-padded results count as
            # processed.
            if questions == self.fallback_questions:
                errors += 1
            else:
                processed += 1

            # Save progress every 10 bills
            if i % 10 == 0:
                self.save_questions(existing_questions)
                logger.info(f"Progress: {i}/{total_bills} processed, {errors} errors")

            # Rate limiting
            time.sleep(1)  # 1 second delay between API calls

        # Final save
        self.save_questions(existing_questions)

        logger.info(f"Question generation complete!")
        logger.info(f"Total bills: {total_bills}")
        logger.info(f"Successfully processed: {processed}")
        logger.info(f"Errors: {errors}")
        logger.info(f"Questions saved to: {self.questions_file}")
|
| 254 |
+
|
| 255 |
+
def main():
    """Entry point: build the generator and produce questions for every bill."""
    try:
        runner = SuggestedQuestionsGenerator()
        runner.generate_all_questions()
        print("✅ Suggested questions generation completed successfully!")
        print(f" Questions saved to: {runner.questions_file}")
    except Exception as e:
        # Log and surface a concise message, then exit non-zero for pipelines.
        logger.error(f"Fatal error: {e}")
        print(f"❌ Error: {e}")
        sys.exit(1)


if __name__ == "__main__":
    main()
|
data/data_updating_scripts/generate_summaries.py
ADDED
|
@@ -0,0 +1,204 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Script to generate summaries for all bills in known_bills_visualize.json.
|
| 4 |
+
|
| 5 |
+
This script reads all bills from known_bills_visualize.json, generates summaries using OpenAI API,
|
| 6 |
+
and saves them to data/bill_summaries.json to avoid repeated API calls.
|
| 7 |
+
"""
|
| 8 |
+
|
| 9 |
+
import json
|
| 10 |
+
import logging
|
| 11 |
+
import time
|
| 12 |
+
from pathlib import Path
|
| 13 |
+
from typing import Dict, List, Optional
|
| 14 |
+
import sys
|
| 15 |
+
import os
|
| 16 |
+
|
| 17 |
+
# Add the project root to the path
|
| 18 |
+
sys.path.append(str(Path(__file__).parent.parent))
|
| 19 |
+
|
| 20 |
+
from config import ConfigManager
|
| 21 |
+
from langchain_openai import ChatOpenAI
|
| 22 |
+
from langchain_core.prompts import PromptTemplate
|
| 23 |
+
from langchain_core.output_parsers import StrOutputParser
|
| 24 |
+
from PROMPTS.bill_summary_prompt import BILL_SUMMARY_PROMPT
|
| 25 |
+
|
| 26 |
+
# Create logs directory if it doesn't exist — the FileHandler below raises
# at import time if the parent directory of its log file is missing.
os.makedirs("data_updating_scripts/logs", exist_ok=True)

# Configure logging: INFO level, mirrored to the console (StreamHandler
# defaults to stderr) and to a per-script log file.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[logging.StreamHandler(), logging.FileHandler("data_updating_scripts/logs/generate_summaries.log")]
)
# Module-level logger used throughout this script.
logger = logging.getLogger(__name__)
|
| 36 |
+
|
| 37 |
+
class BillSummaryGenerator:
    """Generates and caches LLM summaries for bills in known_bills_visualize.json.

    Summaries are persisted to ``data/bill_summaries.json`` keyed by
    ``"<state>_<bill_number>"``; bills that already have a successful
    (non-"ERROR:") summary are skipped on later runs to avoid repeat API calls.
    """

    def __init__(self):
        """Initialize configuration, the OpenAI chat model, and the prompt chain.

        Raises:
            ValueError: If OPENAI_API_KEY is not set in the environment.
        """
        self.config = ConfigManager()
        self.known_bills_file = Path("data/known_bills_visualize.json")
        self.summaries_file = Path("data/bill_summaries.json")

        # Fail fast here instead of erroring once per bill later.
        if not self.config.OPENAI_API_KEY:
            raise ValueError("OPENAI_API_KEY not found in environment variables")

        # Low temperature for factual, reproducible summaries.
        self.llm = ChatOpenAI(
            model=self.config.OPENAI_LLM_MODEL,
            temperature=0.1,
            max_tokens=1000
        )

        self.prompt_template = PromptTemplate(
            template=BILL_SUMMARY_PROMPT,
            input_variables=["bill_number", "bill_title", "state", "bill_text"]
        )

        # LCEL pipeline: prompt -> LLM -> plain string output.
        self.chain = self.prompt_template | self.llm | StrOutputParser()

        logger.info(f"Initialized BillSummaryGenerator with model: {self.config.OPENAI_LLM_MODEL}")

    def load_known_bills(self) -> List[Dict]:
        """Load and return the list of bills from known_bills_visualize.json.

        Raises:
            FileNotFoundError: If the bills file does not exist.
            json.JSONDecodeError: If the file is not valid JSON.
        """
        try:
            with open(self.known_bills_file, 'r', encoding='utf-8') as f:
                bills = json.load(f)
            logger.info(f"Loaded {len(bills)} bills from {self.known_bills_file}")
            return bills
        except FileNotFoundError:
            logger.error(f"File not found: {self.known_bills_file}")
            raise
        except json.JSONDecodeError as e:
            logger.error(f"Error parsing JSON: {e}")
            raise

    def load_existing_summaries(self) -> Dict:
        """Return previously generated summaries, or {} if none are readable."""
        if self.summaries_file.exists():
            try:
                with open(self.summaries_file, 'r', encoding='utf-8') as f:
                    summaries = json.load(f)
                logger.info(f"Loaded {len(summaries)} existing summaries")
                return summaries
            except Exception as e:
                # A corrupt cache is non-fatal: regenerate from scratch.
                logger.warning(f"Could not load existing summaries: {e}")
                return {}
        return {}

    def save_summaries(self, summaries: Dict) -> None:
        """Persist *summaries* to the summaries JSON file.

        Raises:
            Exception: Re-raised after logging if the file cannot be written.
        """
        try:
            with open(self.summaries_file, 'w', encoding='utf-8') as f:
                json.dump(summaries, f, indent=2, ensure_ascii=False)
            logger.info(f"Saved {len(summaries)} summaries to {self.summaries_file}")
        except Exception as e:
            logger.error(f"Error saving summaries: {e}")
            raise

    def generate_summary(self, bill: Dict) -> str:
        """Generate a summary for one bill.

        Returns the summary text, or a string starting with "ERROR:" on
        failure (missing text or API error) so callers can detect and retry.
        (Annotation tightened from Optional[str]: this never returns None.)
        """
        try:
            bill_number = bill.get('bill_number', 'Unknown')
            bill_title = bill.get('title', 'No title')
            state = bill.get('state', 'Unknown')
            bill_text = bill.get('text', '')

            if not bill_text:
                logger.warning(f"No text found for bill {bill_number}")
                return "ERROR: No bill text available"

            # Prepare the input for the chain.
            chain_input = {
                "bill_number": bill_number,
                "bill_title": bill_title,
                "state": state,
                "bill_text": bill_text[:8000]  # Limit text length to avoid token limits
            }

            # Generate summary using the chain.
            summary = self.chain.invoke(chain_input)

            logger.info(f"Generated summary for {bill_number}")
            return summary

        except Exception as e:
            logger.error(f"Error generating summary for bill {bill.get('bill_number', 'Unknown')}: {e}")
            return f"ERROR: {str(e)}"

    def generate_all_summaries(self) -> None:
        """Generate summaries for every bill, resuming past successful work.

        Progress is flushed to disk every 10 bills so a crash loses at most a
        handful of API calls; failed ("ERROR:") bills are retried next run.
        """
        bills = self.load_known_bills()
        existing_summaries = self.load_existing_summaries()

        total_bills = len(bills)
        processed = 0
        skipped = 0   # previously counted into `processed`, inflating the stat
        errors = 0

        logger.info(f"Starting summary generation for {total_bills} bills")

        for i, bill in enumerate(bills, 1):
            bill_key = f"{bill.get('state', 'Unknown')}_{bill.get('bill_number', 'Unknown')}"

            # Skip if already processed successfully.  `or ''` guards against
            # a stored None summary, which would crash startswith().
            prior = (existing_summaries.get(bill_key) or {}).get('summary') or ''
            if bill_key in existing_summaries and not prior.startswith('ERROR:'):
                logger.info(f"Skipping {bill_key} - already processed")
                skipped += 1
                continue

            logger.info(f"Processing {i}/{total_bills}: {bill_key}")

            summary = self.generate_summary(bill)

            existing_summaries[bill_key] = {
                'bill_number': bill.get('bill_number', 'Unknown'),
                'title': bill.get('title', 'No title'),
                'summary': summary
            }

            if summary.startswith('ERROR:'):
                errors += 1
            else:
                processed += 1

            # Checkpoint every 10 bills so a crash loses little work.
            if i % 10 == 0:
                self.save_summaries(existing_summaries)
                logger.info(f"Progress: {i}/{total_bills} processed, {errors} errors")

            # Rate limiting: 1 second delay between API calls.
            time.sleep(1)

        # Final save.
        self.save_summaries(existing_summaries)

        logger.info(f"Summary generation complete!")
        logger.info(f"Total bills: {total_bills}")
        logger.info(f"Successfully processed: {processed}")
        logger.info(f"Skipped (already summarized): {skipped}")
        logger.info(f"Errors: {errors}")
        logger.info(f"Summaries saved to: {self.summaries_file}")
|
| 189 |
+
|
| 190 |
+
def main():
    """Command-line entry point: run one full summary-generation pass."""
    try:
        summary_generator = BillSummaryGenerator()
        summary_generator.generate_all_summaries()
        print("✅ Summary generation completed successfully!")
        print(f" Summaries saved to: {summary_generator.summaries_file}")
    except Exception as exc:
        # Log the failure and exit non-zero so schedulers/CI notice it.
        logger.error(f"Fatal error: {exc}")
        print(f"❌ Error: {exc}")
        sys.exit(1)


if __name__ == "__main__":
    main()
|
data/data_updating_scripts/get_data.py
ADDED
|
@@ -0,0 +1,251 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import os
|
| 2 |
+
import sys
|
| 3 |
+
import json
|
| 4 |
+
import time
|
| 5 |
+
import logging
|
| 6 |
+
import base64
|
| 7 |
+
from datetime import datetime, timezone
|
| 8 |
+
import requests
|
| 9 |
+
from dotenv import load_dotenv
|
| 10 |
+
from bs4 import BeautifulSoup
|
| 11 |
+
|
| 12 |
+
# Load environment variables from .env file
load_dotenv()
# Pull API key from environment; abort immediately if absent — every request
# below requires it.
API_KEY = os.getenv("LEGISCAN_API_KEY") # Set your LegiScan API key in .env
if not API_KEY:
    print("Error: Please set LEGISCAN_API_KEY in your .env file.")
    sys.exit(1)

# Modes for testing
# Quick test: pulls only TEST_MAX_BILLS bills
TESTING_MODE = False
# Full test: pulls all bills for TEST_STATE and TEST_YEAR without bill count cap
# NOTE(review): only TESTING_MODE caps the bill count; FULL_TESTING_MODE just
# narrows state/year — confirm that is the intended difference.
FULL_TESTING_MODE = False
TEST_STATE = 'CA'
TEST_YEAR = 2023
TEST_MAX_BILLS = 3

# Output files
CACHE_FILE = "data/bill_cache.json" # Stores bill_id -> change_hash
OUTPUT_FILE = "data/known_bills.json" # Final bills data

# Query settings
QUERY = "artificial intelligence"
START_YEAR = 2023
END_YEAR = datetime.now(timezone.utc).year

# Include all state legislatures plus U.S. Congress (both chambers)
STATES = [
    "AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA",
    "HI","ID","IL","IN","IA","KS","KY","LA","ME","MD",
    "MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ",
    "NM","NY","NC","ND","OH","OK","OR","PA","RI","SC",
    "SD","TN","TX","UT","VT","VA","WA","WV","WI","WY",
    "US" # U.S. Congress
]

# Rate limiting (seconds between requests)
RATE_LIMIT = 0.2

# Create logs directory if it doesn't exist — the FileHandler below fails if
# the log file's parent directory is missing.
os.makedirs("data_updating_scripts/logs", exist_ok=True)

# Logging configuration: INFO level to stdout and to the fetch log file.
LOG_FILE = "data_updating_scripts/logs/fetch_ai_bills.log"
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.StreamHandler(sys.stdout),
        logging.FileHandler(LOG_FILE)
    ]
)
logger = logging.getLogger(__name__)

# Apply testing overrides: both test modes restrict the run to TEST_STATE.
if TESTING_MODE:
    logger.info(f"*** TESTING MODE: fetching only {TEST_MAX_BILLS} bills from {TEST_STATE} ({TEST_YEAR}) ***")
    STATES = [TEST_STATE]
if FULL_TESTING_MODE:
    logger.info(f"*** FULL TESTING MODE: fetching all bills from {TEST_STATE} ({TEST_YEAR}) ***")
    STATES = [TEST_STATE]
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
def load_json(path, default):
    """Load JSON from *path*, returning *default* if it is missing or invalid.

    Args:
        path: Filesystem path of the JSON file.
        default: Value returned when the file cannot be read or parsed.
    """
    try:
        # Explicit encoding so JSON decodes identically on every platform
        # (Windows would otherwise use a locale-dependent codec).
        with open(path, 'r', encoding='utf-8') as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return default
|
| 81 |
+
|
| 82 |
+
|
| 83 |
+
def save_json(path, data):
    """Write *data* as pretty-printed JSON to *path*, creating parent dirs.

    Args:
        path: Destination file path.
        data: JSON-serializable object.
    """
    # Only create the parent directory when the path actually has one:
    # os.path.dirname() is "" for a bare filename, and os.makedirs("")
    # raises FileNotFoundError.
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2)
    logger.info(f"Saved JSON to {path}")
|
| 89 |
+
|
| 90 |
+
|
| 91 |
+
def legi_request(op, params):
    """Call LegiScan API operation *op* with query parameters *params*.

    Returns the decoded JSON payload on success, or None on any HTTP,
    network, or API-level error (all errors are logged).
    """
    base = "https://api.legiscan.com/"
    # Build a fresh dict instead of mutating the caller's params — the
    # original in-place update() leaked the API key back into caller state.
    query = dict(params)
    query.update({"key": API_KEY, "op": op})
    try:
        resp = requests.get(base, params=query, timeout=10)
        resp.raise_for_status()
        data = resp.json()
        if data.get("status") != "OK":
            logger.error(f"API error {op}: {data.get('message', data)}")
            return None
        return data
    except requests.RequestException as e:
        logger.error(f"Request failed ({op}): {e}")
        return None
|
| 105 |
+
|
| 106 |
+
|
| 107 |
+
def extract_plain_text(html_content: str) -> str:
    """Strip markup from *html_content*, returning newline-joined visible text."""
    parsed = BeautifulSoup(html_content, "html.parser")
    return parsed.get_text(separator="\n", strip=True)
|
| 110 |
+
|
| 111 |
+
|
| 112 |
+
def main():
    """Fetch AI-related bills from LegiScan across all configured states/years.

    Workflow: search page-by-page per (state, year); for each hit fetch full
    bill details; if the bill's change_hash matches the cache, reuse the
    previously saved entry (keeping its downloaded text) and refresh only the
    metadata; otherwise download and decode the bill text.  Results are
    deduplicated by bill_id and written to OUTPUT_FILE, and the hash cache to
    CACHE_FILE.

    NOTE(review): the output file is rebuilt from this run's results only —
    bills present in the existing file but not returned by this search are
    dropped.  Confirm that is intended.
    """
    cache = load_json(CACHE_FILE, {})
    existing = load_json(OUTPUT_FILE, [])
    existing_map = {b.get("bill_id"): b for b in existing}
    logger.info(f"Loaded cache entries: {len(cache)}, existing bills: {len(existing)}")

    collected = []
    total_fetched = 0
    # Test modes query a single year; normal runs cover the full range.
    years = [TEST_YEAR] if (TESTING_MODE or FULL_TESTING_MODE) else list(range(START_YEAR, END_YEAR + 1))

    for state in STATES:
        for year in years:
            page = 1
            while True:
                if TESTING_MODE and total_fetched >= TEST_MAX_BILLS:
                    logger.info("Reached TEST_MAX_BILLS limit, stopping early.")
                    break
                params = {"state": state, "year": year, "query": QUERY, "page": page}
                logger.info(f"Searching {state} for {year}, page {page}")
                data = legi_request("getSearch", params)
                if not data:
                    break

                results = data.get("searchresult", {})
                summary = results.get("summary", {})
                # Every key in the search result except "summary" is a bill.
                bills = [v for k, v in results.items() if k != "summary"]
                if not bills:
                    logger.info(f"No bills on page {page} for {state} {year}")
                    break

                logger.info(f"Found {len(bills)} bills on {state} {year} page {page}")
                for bill in bills:
                    if TESTING_MODE and total_fetched >= TEST_MAX_BILLS:
                        break
                    bill_id = str(bill.get("bill_id"))
                    state_code = bill.get("state")
                    bill_num = bill.get("bill_number")
                    logger.info(f"Processing bill {state_code}_{bill_num} (ID: {bill_id})")

                    details_resp = legi_request("getBill", {"id": bill_id})
                    if not details_resp:
                        continue
                    details = details_resp.get("bill", {})
                    # Skip bills whose session started before the query window.
                    sess_year = details.get("session", {}).get("year_start", 0)
                    if sess_year < START_YEAR:
                        continue

                    new_hash = details.get("change_hash")
                    old_hash = cache.get(bill_id)
                    now_iso = datetime.now(timezone.utc).isoformat()

                    # Extract all relevant dates
                    explicit = details.get("last_action_date")
                    status_date = details.get("status_date")
                    last_vote_date = details.get("last_vote_date")
                    last_amendment_date = details.get("last_amendment_date")
                    actions = details.get("actions", [])
                    action_dates = [a.get("action_date") for a in actions if a.get("action_date")]
                    most_recent_action = max(action_dates) if action_dates else None
                    # ISO-format date strings compare correctly as plain strings,
                    # so max() yields the most recent candidate.
                    candidates = [d for d in [explicit, status_date, last_vote_date, last_amendment_date, most_recent_action] if d]
                    last_action_date = max(candidates) if candidates else None

                    bill_url = details.get("url") # Bill detail page URL

                    if new_hash and new_hash == old_hash and bill_id in existing_map:
                        # Unchanged bill: reuse the cached entry (keeps the
                        # previously downloaded text), refresh metadata only.
                        entry = existing_map[bill_id]
                        entry.update({
                            "status": details.get("status"),
                            "session_year": f"{details.get('session', {}).get('year_start', '')}-{details.get('session', {}).get('year_end', '')}",
                            "last_action_date": last_action_date,
                            "status_date": status_date,
                            "last_vote_date": last_vote_date,
                            "last_amendment_date": last_amendment_date,
                            "actions": actions,
                            "bill_url": bill_url,
                            "lastUpdatedAt": now_iso
                        })
                        logger.info(f"Reused cache; updated status={entry['status']}, last_action_date={entry['last_action_date']}")
                    else:
                        # New/changed bill: fetch the first text document and
                        # decode it (LegiScan returns base64-encoded HTML).
                        plain_text = None
                        texts = details.get("texts", [])
                        if texts:
                            doc_id = texts[0].get("doc_id")
                            text_resp = legi_request("getBillText", {"id": doc_id})
                            if text_resp and "text" in text_resp:
                                raw_b64 = text_resp["text"].get("doc", "")
                                try:
                                    decoded = base64.b64decode(raw_b64)
                                    html = decoded.decode("utf-8", errors="ignore")
                                    plain_text = extract_plain_text(html)
                                except Exception as e:
                                    logger.error(f"Failed decoding HTML for {bill_id}: {e}")

                        entry = {
                            "bill_id": bill_id,
                            "state": state_code,
                            "bill_number": bill_num,
                            "session_year": f"{details.get('session', {}).get('year_start', '')}-{details.get('session', {}).get('year_end', '')}",
                            "title": details.get("title"),
                            "description": details.get("description"),
                            "status": details.get("status"),
                            "sponsors": [s.get("name") for s in details.get("sponsors", [])],
                            "text": plain_text,
                            "last_action_date": last_action_date,
                            "status_date": status_date,
                            "last_vote_date": last_vote_date,
                            "last_amendment_date": last_amendment_date,
                            "actions": actions,
                            "bill_url": bill_url,
                            "change_hash": new_hash,
                            "lastUpdatedAt": now_iso
                        }
                        cache[bill_id] = new_hash
                        logger.info(
                            f"Entry data: title='{entry['title']}', sponsors={len(entry['sponsors'])}, "
                            f"status={entry['status']}, last_action_date={entry['last_action_date']}"
                        )

                    collected.append(entry)
                    total_fetched += 1
                    time.sleep(RATE_LIMIT)

                if page >= summary.get("page_total", 1):
                    break
                page += 1
                time.sleep(RATE_LIMIT)
            # Propagate the early-stop out of the year loop ...
            if TESTING_MODE and total_fetched >= TEST_MAX_BILLS:
                break
        # ... and out of the state loop.
        if TESTING_MODE and total_fetched >= TEST_MAX_BILLS:
            break

    # Deduplicate by bill_id (later entries win) and persist results + cache.
    dedup = {e["bill_id"]: e for e in collected}
    all_bills = list(dedup.values())
    save_json(OUTPUT_FILE, all_bills)
    save_json(CACHE_FILE, cache)
    logger.info(f"Completed run, saved {len(all_bills)} bills to {OUTPUT_FILE}")


if __name__ == "__main__":
    main()
|
data/data_updating_scripts/get_data_ORIGINAL.py
ADDED
|
@@ -0,0 +1,251 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import os
|
| 2 |
+
import sys
|
| 3 |
+
import json
|
| 4 |
+
import time
|
| 5 |
+
import logging
|
| 6 |
+
import base64
|
| 7 |
+
from datetime import datetime, timezone
|
| 8 |
+
import requests
|
| 9 |
+
from dotenv import load_dotenv
|
| 10 |
+
from bs4 import BeautifulSoup
|
| 11 |
+
|
| 12 |
+
# Load environment variables from .env file
load_dotenv()
# Pull API key from environment; abort immediately if absent — every request
# below requires it.
API_KEY = os.getenv("LEGISCAN_API_KEY") # Set your LegiScan API key in .env
if not API_KEY:
    print("Error: Please set LEGISCAN_API_KEY in your .env file.")
    sys.exit(1)

# Modes for testing
# Quick test: pulls only TEST_MAX_BILLS bills
TESTING_MODE = False
# Full test: pulls all bills for TEST_STATE and TEST_YEAR without bill count cap
FULL_TESTING_MODE = False
TEST_STATE = 'CA'
TEST_YEAR = 2023
TEST_MAX_BILLS = 3

# Output files
CACHE_FILE = "data/bill_cache.json" # Stores bill_id -> change_hash
OUTPUT_FILE = "data/known_bills.json" # Final bills data

# Query settings
QUERY = "artificial intelligence"
START_YEAR = 2023
END_YEAR = datetime.now(timezone.utc).year

# Include all state legislatures plus U.S. Congress (both chambers)
STATES = [
    "AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA",
    "HI","ID","IL","IN","IA","KS","KY","LA","ME","MD",
    "MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ",
    "NM","NY","NC","ND","OH","OK","OR","PA","RI","SC",
    "SD","TN","TX","UT","VT","VA","WA","WV","WI","WY",
    "US" # U.S. Congress
]

# Rate limiting (seconds between requests)
RATE_LIMIT = 0.2

# Create logs directory if it doesn't exist — the FileHandler below fails if
# the log file's parent directory is missing.
os.makedirs("data_updating_scripts/logs", exist_ok=True)

# Logging configuration: INFO level to stdout and to the fetch log file.
LOG_FILE = "data_updating_scripts/logs/fetch_ai_bills.log"
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.StreamHandler(sys.stdout),
        logging.FileHandler(LOG_FILE)
    ]
)
logger = logging.getLogger(__name__)

# Apply testing overrides: both test modes restrict the run to TEST_STATE.
if TESTING_MODE:
    logger.info(f"*** TESTING MODE: fetching only {TEST_MAX_BILLS} bills from {TEST_STATE} ({TEST_YEAR}) ***")
    STATES = [TEST_STATE]
if FULL_TESTING_MODE:
    logger.info(f"*** FULL TESTING MODE: fetching all bills from {TEST_STATE} ({TEST_YEAR}) ***")
    STATES = [TEST_STATE]
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
def load_json(path, default):
    """Load JSON from *path*; return *default* if the file is missing or unparsable."""
    try:
        with open(path, 'r') as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return default
|
| 81 |
+
|
| 82 |
+
|
| 83 |
+
def save_json(path, data):
    """Write *data* as indented JSON to *path*, creating the parent directory.

    NOTE(review): os.path.dirname() is "" for a bare filename, which would
    make os.makedirs raise — call sites here all pass paths under data/.
    """
    # Create directory if it doesn't exist
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'w') as f:
        json.dump(data, f, indent=2)
    logger.info(f"Saved JSON to {path}")
|
| 89 |
+
|
| 90 |
+
|
| 91 |
+
def legi_request(op, params):
    """Call LegiScan operation *op*; return the decoded JSON or None on error.

    NOTE(review): *params* is mutated in place — the API key and op name are
    written back into the caller's dict.
    """
    base = "https://api.legiscan.com/"
    params.update({"key": API_KEY, "op": op})
    try:
        resp = requests.get(base, params=params, timeout=10)
        resp.raise_for_status()
        data = resp.json()
        # LegiScan signals application-level failures via a "status" field
        # even on HTTP 200 responses.
        if data.get("status") != "OK":
            logger.error(f"API error {op}: {data.get('message', data)}")
            return None
        return data
    except requests.RequestException as e:
        logger.error(f"Request failed ({op}): {e}")
        return None
|
| 105 |
+
|
| 106 |
+
|
| 107 |
+
def extract_plain_text(html_content: str) -> str:
    """Return the visible text of *html_content*, nodes joined by newlines."""
    soup = BeautifulSoup(html_content, "html.parser")
    return soup.get_text(separator="\n", strip=True)
|
| 110 |
+
|
| 111 |
+
|
| 112 |
+
def main():
    """Fetch AI-related bills from LegiScan across all configured states/years.

    Preserved ORIGINAL version of get_data.py's main().  Workflow: search
    page-by-page per (state, year); for each hit fetch full bill details; on a
    change_hash cache hit reuse the previously saved entry (keeping its
    downloaded text) and refresh only metadata; otherwise download and decode
    the bill text.  Results are deduplicated by bill_id and written to
    OUTPUT_FILE, with the hash cache to CACHE_FILE.

    NOTE(review): the output file is rebuilt from this run's results only —
    bills present in the existing file but not returned by this search are
    dropped.  Confirm that is intended.
    """
    cache = load_json(CACHE_FILE, {})
    existing = load_json(OUTPUT_FILE, [])
    existing_map = {b.get("bill_id"): b for b in existing}
    logger.info(f"Loaded cache entries: {len(cache)}, existing bills: {len(existing)}")

    collected = []
    total_fetched = 0
    # Test modes query a single year; normal runs cover the full range.
    years = [TEST_YEAR] if (TESTING_MODE or FULL_TESTING_MODE) else list(range(START_YEAR, END_YEAR + 1))

    for state in STATES:
        for year in years:
            page = 1
            while True:
                if TESTING_MODE and total_fetched >= TEST_MAX_BILLS:
                    logger.info("Reached TEST_MAX_BILLS limit, stopping early.")
                    break
                params = {"state": state, "year": year, "query": QUERY, "page": page}
                logger.info(f"Searching {state} for {year}, page {page}")
                data = legi_request("getSearch", params)
                if not data:
                    break

                results = data.get("searchresult", {})
                summary = results.get("summary", {})
                # Every key in the search result except "summary" is a bill.
                bills = [v for k, v in results.items() if k != "summary"]
                if not bills:
                    logger.info(f"No bills on page {page} for {state} {year}")
                    break

                logger.info(f"Found {len(bills)} bills on {state} {year} page {page}")
                for bill in bills:
                    if TESTING_MODE and total_fetched >= TEST_MAX_BILLS:
                        break
                    bill_id = str(bill.get("bill_id"))
                    state_code = bill.get("state")
                    bill_num = bill.get("bill_number")
                    logger.info(f"Processing bill {state_code}_{bill_num} (ID: {bill_id})")

                    details_resp = legi_request("getBill", {"id": bill_id})
                    if not details_resp:
                        continue
                    details = details_resp.get("bill", {})
                    # Skip bills whose session started before the query window.
                    sess_year = details.get("session", {}).get("year_start", 0)
                    if sess_year < START_YEAR:
                        continue

                    new_hash = details.get("change_hash")
                    old_hash = cache.get(bill_id)
                    now_iso = datetime.now(timezone.utc).isoformat()

                    # Extract all relevant dates
                    explicit = details.get("last_action_date")
                    status_date = details.get("status_date")
                    last_vote_date = details.get("last_vote_date")
                    last_amendment_date = details.get("last_amendment_date")
                    actions = details.get("actions", [])
                    action_dates = [a.get("action_date") for a in actions if a.get("action_date")]
                    most_recent_action = max(action_dates) if action_dates else None
                    # ISO-format date strings compare correctly as plain strings,
                    # so max() yields the most recent candidate.
                    candidates = [d for d in [explicit, status_date, last_vote_date, last_amendment_date, most_recent_action] if d]
                    last_action_date = max(candidates) if candidates else None

                    bill_url = details.get("url") # Bill detail page URL

                    if new_hash and new_hash == old_hash and bill_id in existing_map:
                        # Unchanged bill: reuse the cached entry (keeps the
                        # previously downloaded text), refresh metadata only.
                        entry = existing_map[bill_id]
                        entry.update({
                            "status": details.get("status"),
                            "session_year": f"{details.get('session', {}).get('year_start', '')}-{details.get('session', {}).get('year_end', '')}",
                            "last_action_date": last_action_date,
                            "status_date": status_date,
                            "last_vote_date": last_vote_date,
                            "last_amendment_date": last_amendment_date,
                            "actions": actions,
                            "bill_url": bill_url,
                            "lastUpdatedAt": now_iso
                        })
                        logger.info(f"Reused cache; updated status={entry['status']}, last_action_date={entry['last_action_date']}")
                    else:
                        # New/changed bill: fetch the first text document and
                        # decode it (LegiScan returns base64-encoded HTML).
                        plain_text = None
                        texts = details.get("texts", [])
                        if texts:
                            doc_id = texts[0].get("doc_id")
                            text_resp = legi_request("getBillText", {"id": doc_id})
                            if text_resp and "text" in text_resp:
                                raw_b64 = text_resp["text"].get("doc", "")
                                try:
                                    decoded = base64.b64decode(raw_b64)
                                    html = decoded.decode("utf-8", errors="ignore")
                                    plain_text = extract_plain_text(html)
                                except Exception as e:
                                    logger.error(f"Failed decoding HTML for {bill_id}: {e}")

                        entry = {
                            "bill_id": bill_id,
                            "state": state_code,
                            "bill_number": bill_num,
                            "session_year": f"{details.get('session', {}).get('year_start', '')}-{details.get('session', {}).get('year_end', '')}",
                            "title": details.get("title"),
                            "description": details.get("description"),
                            "status": details.get("status"),
                            "sponsors": [s.get("name") for s in details.get("sponsors", [])],
                            "text": plain_text,
                            "last_action_date": last_action_date,
                            "status_date": status_date,
                            "last_vote_date": last_vote_date,
                            "last_amendment_date": last_amendment_date,
                            "actions": actions,
                            "bill_url": bill_url,
                            "change_hash": new_hash,
                            "lastUpdatedAt": now_iso
                        }
                        cache[bill_id] = new_hash
                        logger.info(
                            f"Entry data: title='{entry['title']}', sponsors={len(entry['sponsors'])}, "
                            f"status={entry['status']}, last_action_date={entry['last_action_date']}"
                        )

                    collected.append(entry)
                    total_fetched += 1
                    time.sleep(RATE_LIMIT)

                if page >= summary.get("page_total", 1):
                    break
                page += 1
                time.sleep(RATE_LIMIT)
            # Propagate the early-stop out of the year loop ...
            if TESTING_MODE and total_fetched >= TEST_MAX_BILLS:
                break
        # ... and out of the state loop.
        if TESTING_MODE and total_fetched >= TEST_MAX_BILLS:
            break

    # Deduplicate by bill_id (later entries win) and persist results + cache.
    dedup = {e["bill_id"]: e for e in collected}
    all_bills = list(dedup.values())
    save_json(OUTPUT_FILE, all_bills)
    save_json(CACHE_FILE, cache)
    logger.info(f"Completed run, saved {len(all_bills)} bills to {OUTPUT_FILE}")


if __name__ == "__main__":
    main()
|
data/data_updating_scripts/known_bills_status.py
ADDED
|
@@ -0,0 +1,199 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
known_bills_status.py
|
| 4 |
+
|
| 5 |
+
Reads known_bills_fixed.json and updates existing known_bills_visualize.json.
|
| 6 |
+
Merges new bills and updates existing ones while preserving clean status fields.
|
| 7 |
+
"""
|
| 8 |
+
import json
|
| 9 |
+
from pathlib import Path
|
| 10 |
+
from datetime import datetime, timezone
|
| 11 |
+
|
| 12 |
+
def map_status(original_status):
    """Map a LegiScan status code (or free-text status) to clean display text.

    Args:
        original_status: LegiScan numeric status code (``int`` or ``str``)
            or a free-text status description. May be ``None``.

    Returns:
        One of ``"Active"``, ``"Inactive"``, ``"Signed Into Law"``, or
        ``"Vetoed"``. Unknown or empty statuses fall back to ``"Inactive"``.
    """
    # Single str-keyed table; int codes are normalized to str below, so we
    # no longer need to duplicate every entry for both int and str keys.
    status_mapping = {
        "0": "Inactive",           # Pre-filed
        "1": "Active",             # Introduced
        "2": "Active",             # Engrossed
        "3": "Active",             # Enrolled
        "4": "Signed Into Law",    # Passed
        "5": "Vetoed",             # Vetoed
        "6": "Inactive",           # Failed
        "7": "Signed Into Law",    # Override
        "8": "Signed Into Law",    # Chaptered
        "9": "Active",             # Refer
        "10": "Active",            # Report Pass
        "11": "Inactive",          # Report DNP
        "12": "Active",            # Draft
    }

    # Try the direct numeric-code mapping first (handles int and str codes).
    if original_status is not None:
        code = str(original_status)
        if code in status_mapping:
            return status_mapping[code]

    # Fall back to keyword matching on free-text statuses.
    # Order matters: "pass"/"signed"/"enacted" wins over "veto"/"fail".
    if original_status:
        status_str = str(original_status).lower()
        if "pass" in status_str or "signed" in status_str or "enacted" in status_str:
            return "Signed Into Law"
        elif "veto" in status_str:
            return "Vetoed"
        elif "fail" in status_str or "dead" in status_str or "killed" in status_str:
            return "Inactive"
        elif "active" in status_str or "intro" in status_str or "pending" in status_str:
            return "Active"

    # Default fallback for unknown, empty, or None statuses.
    return "Inactive"
|
| 56 |
+
|
| 57 |
+
def create_bill_key(bill):
    """Build the unique lookup key ("<state>_<bill_number>") for a bill dict."""
    state = bill.get('state', 'Unknown')
    number = bill.get('bill_number', 'Unknown')
    return f"{state}_{number}"
|
| 60 |
+
|
| 61 |
+
def merge_bill_data(new_bill, existing_bill=None):
    """Merge new bill data with an existing bill, preserving processed fields.

    For a brand-new bill the raw status is recorded as ``original_status``
    and mapped to a clean display status. For an existing bill, all source
    fields except the processed status fields are overlaid; the clean status
    is only recomputed when the raw status actually changed (compared as
    strings so int vs. str codes are treated as equal).
    """
    processed_fields = ('status', 'original_status', 'status_updated_at')

    if not existing_bill:
        # First time we see this bill: record the raw status and derive
        # the clean display status from it.
        fresh = dict(new_bill)
        raw_status = fresh.get('status')
        fresh['original_status'] = raw_status
        fresh['status'] = map_status(raw_status)
        fresh['status_updated_at'] = datetime.now(timezone.utc).isoformat()
        return fresh

    # Start from the existing record and overlay everything from the
    # source except the processed status fields.
    merged = dict(existing_bill)
    merged.update(
        (key, value)
        for key, value in new_bill.items()
        if key not in processed_fields
    )

    # Compare raw statuses as strings so int and str codes compare equal.
    incoming_raw = new_bill.get('status')
    previous_raw = existing_bill.get('original_status')
    incoming_key = str(incoming_raw) if incoming_raw is not None else None
    previous_key = str(previous_raw) if previous_raw is not None else None

    if incoming_key != previous_key:
        # Real change in the underlying data: refresh all processed fields.
        merged['original_status'] = incoming_raw
        merged['status'] = map_status(incoming_raw)
        merged['status_updated_at'] = datetime.now(timezone.utc).isoformat()
        return merged

    # No change — keep the existing clean status, but remap it when it is
    # missing or still a raw numeric code (needs cleaning).
    numeric_codes = tuple(str(code) for code in range(13))
    if 'status' not in merged or merged['status'] in numeric_codes:
        merged['status'] = map_status(previous_raw)

    return merged
|
| 102 |
+
|
| 103 |
+
def main():
    """Merge data/known_bills_fixed.json into data/known_bills_visualize.json.

    Loads the freshly fetched bills, merges them with any previously
    processed visualization data (preserving clean status fields via
    merge_bill_data), writes the merged list back, and prints a summary of
    new/updated/unchanged/removed bills plus a status distribution.
    """
    # File paths
    input_file = Path("data/known_bills_fixed.json")
    output_file = Path("data/known_bills_visualize.json")

    print(f"Reading source bills from: {input_file}")

    # Load source bills data
    with open(input_file, 'r', encoding='utf-8') as f:
        source_bills = json.load(f)

    print(f"Loaded {len(source_bills)} bills from source")

    # Load existing visualization data if it exists
    existing_bills = []
    if output_file.exists():
        print(f"Reading existing visualization data from: {output_file}")
        with open(output_file, 'r', encoding='utf-8') as f:
            existing_bills = json.load(f)
        print(f"Loaded {len(existing_bills)} existing bills")
    else:
        print("No existing visualization data found - will create new file")

    # Create lookup dictionary for existing bills, keyed by "<state>_<bill_number>"
    existing_bills_dict = {}
    for bill in existing_bills:
        key = create_bill_key(bill)
        existing_bills_dict[key] = bill

    # Process and merge bills
    merged_bills = []
    new_bills_count = 0
    updated_bills_count = 0
    unchanged_bills_count = 0

    print(f"\nProcessing {len(source_bills)} bills...")

    for source_bill in source_bills:
        bill_key = create_bill_key(source_bill)
        existing_bill = existing_bills_dict.get(bill_key)

        if existing_bill:
            # Check if anything actually changed
            # NOTE(review): this compares raw values directly (no str
            # normalization), so an int-vs-str code difference is counted
            # as "updated" here even though merge_bill_data's normalized
            # comparison treats it as unchanged — confirm this is intended.
            old_original_status = existing_bill.get('original_status')
            new_original_status = source_bill.get('status')

            if old_original_status != new_original_status:
                updated_bills_count += 1
            else:
                unchanged_bills_count += 1
        else:
            new_bills_count += 1

        merged_bill = merge_bill_data(source_bill, existing_bill)
        merged_bills.append(merged_bill)

    # Check for bills that exist in visualization but not in source (removed bills)
    # Removed bills are only counted and reported; they are not kept in the
    # output because merged_bills is built solely from source_bills.
    source_keys = {create_bill_key(bill) for bill in source_bills}
    existing_keys = set(existing_bills_dict.keys())
    removed_keys = existing_keys - source_keys

    # Save updated bills
    print(f"\nSaving updated bills to: {output_file}")
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(merged_bills, f, indent=2, ensure_ascii=False)

    # Show status distribution (counts per clean display status)
    status_counts = {}
    for bill in merged_bills:
        status = bill['status']
        status_counts[status] = status_counts.get(status, 0) + 1

    # Summary
    print(f"\n✅ Update complete!")
    print(f" 📊 Total bills: {len(merged_bills)}")
    if new_bills_count > 0:
        print(f" 🆕 New bills: {new_bills_count}")
    if updated_bills_count > 0:
        print(f" 🔄 Updated bills: {updated_bills_count}")
    if unchanged_bills_count > 0:
        print(f" ✅ Unchanged bills: {unchanged_bills_count}")
    if removed_keys:
        print(f" 🗑️ Removed bills: {len(removed_keys)}")

    if new_bills_count == 0 and updated_bills_count == 0:
        print(f" 🎉 All bills are up to date - no changes needed!")

    print(f"\n📈 Status distribution:")
    for status, count in sorted(status_counts.items()):
        print(f" {status}: {count}")

    print(f"\n📁 Clean data saved to: {output_file}")
    print("Now run: streamlit run scripts/visualize-MIT.py")

if __name__ == "__main__":
    main()
|
data/data_updating_scripts/logs/eu_vectorstore.log
ADDED
|
@@ -0,0 +1,128 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
2025-11-03 11:40:25,451 [INFO] Starting EU AI Act vectorstore creation...
|
| 2 |
+
2025-11-03 11:40:25,451 [INFO] Extracting text from PDF...
|
| 3 |
+
2025-11-03 11:40:25,480 [INFO] Processing 144 pages from data_updating_scripts/eu-ai-act.pdf
|
| 4 |
+
2025-11-03 11:40:27,260 [INFO] Extracted 612396 characters from PDF
|
| 5 |
+
2025-11-03 11:40:27,260 [INFO] Creating document chunks...
|
| 6 |
+
2025-11-03 11:40:27,268 [INFO] Created 648 document chunks
|
| 7 |
+
2025-11-03 11:40:27,268 [INFO] Initializing embeddings...
|
| 8 |
+
2025-11-03 11:40:27,397 [INFO] Creating FAISS vectorstore...
|
| 9 |
+
2025-11-03 11:40:31,088 [INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
|
| 10 |
+
2025-11-03 11:40:31,414 [INFO] Loading faiss.
|
| 11 |
+
2025-11-03 11:40:31,881 [INFO] Successfully loaded faiss.
|
| 12 |
+
2025-11-03 11:40:31,936 [INFO] Saving vectorstore to data/eu_ai_act_vectorstore...
|
| 13 |
+
2025-11-03 11:40:31,945 [INFO] ✅ EU AI Act vectorstore created successfully!
|
| 14 |
+
2025-11-03 11:40:31,945 [INFO] - Total chunks: 648
|
| 15 |
+
2025-11-03 11:40:31,945 [INFO] - Text length: 612,396 characters
|
| 16 |
+
2025-11-03 11:40:31,945 [INFO] - Saved to: data/eu_ai_act_vectorstore
|
| 17 |
+
2025-11-03 12:24:44,470 [INFO] Starting EU AI Act vectorstore creation...
|
| 18 |
+
2025-11-03 12:24:44,471 [INFO] Extracting text from PDF...
|
| 19 |
+
2025-11-03 12:24:44,492 [INFO] Processing 144 pages from data_updating_scripts/eu-ai-act.pdf
|
| 20 |
+
2025-11-03 12:24:46,209 [INFO] Extracted 612396 characters from PDF
|
| 21 |
+
2025-11-03 12:24:46,209 [INFO] Creating document chunks...
|
| 22 |
+
2025-11-03 12:24:46,217 [INFO] Created 648 document chunks
|
| 23 |
+
2025-11-03 12:24:46,217 [INFO] Initializing embeddings...
|
| 24 |
+
2025-11-03 12:24:46,357 [INFO] Creating FAISS vectorstore...
|
| 25 |
+
2025-11-03 12:24:49,286 [INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
|
| 26 |
+
2025-11-03 12:24:49,669 [INFO] Loading faiss.
|
| 27 |
+
2025-11-03 12:24:49,700 [INFO] Successfully loaded faiss.
|
| 28 |
+
2025-11-03 12:24:49,749 [INFO] Saving vectorstore to data/eu_ai_act_vectorstore...
|
| 29 |
+
2025-11-03 12:24:49,754 [INFO] ✅ EU AI Act vectorstore created successfully!
|
| 30 |
+
2025-11-03 12:24:49,754 [INFO] - Total chunks: 648
|
| 31 |
+
2025-11-03 12:24:49,754 [INFO] - Text length: 612,396 characters
|
| 32 |
+
2025-11-03 12:24:49,754 [INFO] - Saved to: data/eu_ai_act_vectorstore
|
| 33 |
+
2025-11-04 15:55:15,879 [INFO] Starting EU AI Act vectorstore creation...
|
| 34 |
+
2025-11-04 15:55:15,879 [INFO] Extracting text from PDF...
|
| 35 |
+
2025-11-04 15:55:15,899 [INFO] Processing 144 pages from data_updating_scripts/eu-ai-act.pdf
|
| 36 |
+
2025-11-04 15:55:17,629 [INFO] Extracted 612396 characters from PDF
|
| 37 |
+
2025-11-04 15:55:17,629 [INFO] Creating document chunks...
|
| 38 |
+
2025-11-04 15:55:17,637 [INFO] Created 648 document chunks
|
| 39 |
+
2025-11-04 15:55:17,637 [INFO] Initializing embeddings...
|
| 40 |
+
2025-11-04 15:55:17,768 [INFO] Creating FAISS vectorstore...
|
| 41 |
+
2025-11-04 15:55:21,406 [INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
|
| 42 |
+
2025-11-04 15:55:21,846 [INFO] Loading faiss.
|
| 43 |
+
2025-11-04 15:55:21,917 [INFO] Successfully loaded faiss.
|
| 44 |
+
2025-11-04 15:55:21,968 [INFO] Saving vectorstore to data/eu_ai_act_vectorstore...
|
| 45 |
+
2025-11-04 15:55:21,981 [INFO] ✅ EU AI Act vectorstore created successfully!
|
| 46 |
+
2025-11-04 15:55:21,981 [INFO] - Total chunks: 648
|
| 47 |
+
2025-11-04 15:55:21,981 [INFO] - Text length: 612,396 characters
|
| 48 |
+
2025-11-04 15:55:21,981 [INFO] - Saved to: data/eu_ai_act_vectorstore
|
| 49 |
+
2025-11-14 15:36:40,441 [INFO] Starting EU AI Act vectorstore creation...
|
| 50 |
+
2025-11-14 15:36:40,442 [INFO] Extracting text from PDF...
|
| 51 |
+
2025-11-14 15:36:40,455 [INFO] Processing 144 pages from data_updating_scripts/eu-ai-act.pdf
|
| 52 |
+
2025-11-14 15:36:41,830 [INFO] Extracted 612396 characters from PDF
|
| 53 |
+
2025-11-14 15:36:41,830 [INFO] Creating document chunks...
|
| 54 |
+
2025-11-14 15:36:41,837 [INFO] Created 648 document chunks
|
| 55 |
+
2025-11-14 15:36:41,837 [INFO] Initializing embeddings...
|
| 56 |
+
2025-11-14 15:36:41,983 [INFO] Creating FAISS vectorstore...
|
| 57 |
+
2025-11-14 15:36:46,413 [INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
|
| 58 |
+
2025-11-14 15:36:46,791 [INFO] Loading faiss.
|
| 59 |
+
2025-11-14 15:36:47,362 [INFO] Successfully loaded faiss.
|
| 60 |
+
2025-11-14 15:36:47,404 [INFO] Saving vectorstore to data/eu_ai_act_vectorstore...
|
| 61 |
+
2025-11-14 15:36:47,410 [INFO] ✅ EU AI Act vectorstore created successfully!
|
| 62 |
+
2025-11-14 15:36:47,410 [INFO] - Total chunks: 648
|
| 63 |
+
2025-11-14 15:36:47,410 [INFO] - Text length: 612,396 characters
|
| 64 |
+
2025-11-14 15:36:47,410 [INFO] - Saved to: data/eu_ai_act_vectorstore
|
| 65 |
+
2025-11-20 14:15:10,012 [INFO] Starting EU AI Act vectorstore creation...
|
| 66 |
+
2025-11-20 14:15:10,013 [INFO] Extracting text from PDF...
|
| 67 |
+
2025-11-20 14:15:10,029 [INFO] Processing 144 pages from data_updating_scripts/eu-ai-act.pdf
|
| 68 |
+
2025-11-20 14:15:11,997 [INFO] Extracted 612396 characters from PDF
|
| 69 |
+
2025-11-20 14:15:11,998 [INFO] Creating document chunks...
|
| 70 |
+
2025-11-20 14:15:12,006 [INFO] Created 648 document chunks
|
| 71 |
+
2025-11-20 14:15:12,006 [INFO] Initializing embeddings...
|
| 72 |
+
2025-11-20 14:15:12,200 [INFO] Creating FAISS vectorstore...
|
| 73 |
+
2025-11-20 14:15:16,058 [INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
|
| 74 |
+
2025-11-20 14:15:16,386 [INFO] Loading faiss.
|
| 75 |
+
2025-11-20 14:15:16,477 [INFO] Successfully loaded faiss.
|
| 76 |
+
2025-11-20 14:15:16,521 [INFO] Saving vectorstore to data/eu_ai_act_vectorstore...
|
| 77 |
+
2025-11-20 14:15:16,529 [INFO] ✅ EU AI Act vectorstore created successfully!
|
| 78 |
+
2025-11-20 14:15:16,529 [INFO] - Total chunks: 648
|
| 79 |
+
2025-11-20 14:15:16,529 [INFO] - Text length: 612,396 characters
|
| 80 |
+
2025-11-20 14:15:16,529 [INFO] - Saved to: data/eu_ai_act_vectorstore
|
| 81 |
+
2025-12-01 12:38:49,653 [INFO] Starting EU AI Act vectorstore creation...
|
| 82 |
+
2025-12-01 12:38:49,653 [INFO] Extracting text from PDF...
|
| 83 |
+
2025-12-01 12:38:49,669 [INFO] Processing 144 pages from data_updating_scripts/eu-ai-act.pdf
|
| 84 |
+
2025-12-01 12:38:51,518 [INFO] Extracted 612396 characters from PDF
|
| 85 |
+
2025-12-01 12:38:51,518 [INFO] Creating document chunks...
|
| 86 |
+
2025-12-01 12:38:51,526 [INFO] Created 648 document chunks
|
| 87 |
+
2025-12-01 12:38:51,526 [INFO] Initializing embeddings...
|
| 88 |
+
2025-12-01 12:38:51,709 [INFO] Creating FAISS vectorstore...
|
| 89 |
+
2025-12-01 12:38:54,252 [INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
|
| 90 |
+
2025-12-01 12:38:54,675 [INFO] Loading faiss.
|
| 91 |
+
2025-12-01 12:38:54,817 [INFO] Successfully loaded faiss.
|
| 92 |
+
2025-12-01 12:38:54,859 [INFO] Saving vectorstore to data/eu_ai_act_vectorstore...
|
| 93 |
+
2025-12-01 12:38:54,865 [INFO] ✅ EU AI Act vectorstore created successfully!
|
| 94 |
+
2025-12-01 12:38:54,866 [INFO] - Total chunks: 648
|
| 95 |
+
2025-12-01 12:38:54,866 [INFO] - Text length: 612,396 characters
|
| 96 |
+
2025-12-01 12:38:54,866 [INFO] - Saved to: data/eu_ai_act_vectorstore
|
| 97 |
+
2025-12-01 13:21:15,236 [INFO] Starting EU AI Act vectorstore creation...
|
| 98 |
+
2025-12-01 13:21:15,237 [INFO] Extracting text from PDF...
|
| 99 |
+
2025-12-01 13:21:15,253 [INFO] Processing 144 pages from data_updating_scripts/eu-ai-act.pdf
|
| 100 |
+
2025-12-01 13:21:17,069 [INFO] Extracted 612396 characters from PDF
|
| 101 |
+
2025-12-01 13:21:17,069 [INFO] Creating document chunks...
|
| 102 |
+
2025-12-01 13:21:17,078 [INFO] Created 648 document chunks
|
| 103 |
+
2025-12-01 13:21:17,078 [INFO] Initializing embeddings...
|
| 104 |
+
2025-12-01 13:21:17,343 [INFO] Creating FAISS vectorstore...
|
| 105 |
+
2025-12-01 13:21:20,254 [INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
|
| 106 |
+
2025-12-01 13:21:20,654 [INFO] Loading faiss.
|
| 107 |
+
2025-12-01 13:21:20,768 [INFO] Successfully loaded faiss.
|
| 108 |
+
2025-12-01 13:21:20,815 [INFO] Saving vectorstore to data/eu_ai_act_vectorstore...
|
| 109 |
+
2025-12-01 13:21:20,821 [INFO] ✅ EU AI Act vectorstore created successfully!
|
| 110 |
+
2025-12-01 13:21:20,821 [INFO] - Total chunks: 648
|
| 111 |
+
2025-12-01 13:21:20,821 [INFO] - Text length: 612,396 characters
|
| 112 |
+
2025-12-01 13:21:20,822 [INFO] - Saved to: data/eu_ai_act_vectorstore
|
| 113 |
+
2025-12-03 11:09:39,059 [INFO] Starting EU AI Act vectorstore creation...
|
| 114 |
+
2025-12-03 11:09:39,060 [INFO] Extracting text from PDF...
|
| 115 |
+
2025-12-03 11:09:39,075 [INFO] Processing 144 pages from data_updating_scripts/eu-ai-act.pdf
|
| 116 |
+
2025-12-03 11:09:40,933 [INFO] Extracted 612396 characters from PDF
|
| 117 |
+
2025-12-03 11:09:40,934 [INFO] Creating document chunks...
|
| 118 |
+
2025-12-03 11:09:40,942 [INFO] Created 648 document chunks
|
| 119 |
+
2025-12-03 11:09:40,942 [INFO] Initializing embeddings...
|
| 120 |
+
2025-12-03 11:09:41,136 [INFO] Creating FAISS vectorstore...
|
| 121 |
+
2025-12-03 11:09:44,436 [INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
|
| 122 |
+
2025-12-03 11:09:44,820 [INFO] Loading faiss.
|
| 123 |
+
2025-12-03 11:09:44,925 [INFO] Successfully loaded faiss.
|
| 124 |
+
2025-12-03 11:09:44,968 [INFO] Saving vectorstore to data/eu_ai_act_vectorstore...
|
| 125 |
+
2025-12-03 11:09:44,974 [INFO] ✅ EU AI Act vectorstore created successfully!
|
| 126 |
+
2025-12-03 11:09:44,974 [INFO] - Total chunks: 648
|
| 127 |
+
2025-12-03 11:09:44,974 [INFO] - Text length: 612,396 characters
|
| 128 |
+
2025-12-03 11:09:44,974 [INFO] - Saved to: data/eu_ai_act_vectorstore
|
data/data_updating_scripts/logs/fetch_ai_bills.log
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
data/data_updating_scripts/logs/fix_pdf_bills.log
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
data/data_updating_scripts/logs/generate_reports.log
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
data/data_updating_scripts/logs/generate_suggested_questions.log
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
data/data_updating_scripts/logs/generate_summaries.log
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
data/data_updating_scripts/logs/mark_no_text_bills.log
ADDED
|
@@ -0,0 +1,293 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
2025-11-03 11:36:37 [INFO] Starting no-text bill marking process
|
| 2 |
+
2025-11-03 11:36:37 [INFO] Loaded 1 bills from data/known_bills_visualize.json
|
| 3 |
+
2025-11-03 11:36:37 [INFO] Processing 1 bills to mark no-text bills
|
| 4 |
+
2025-11-03 11:36:37 [INFO] Saved 1 bills to data/known_bills_visualize.json
|
| 5 |
+
2025-11-03 11:36:37 [INFO] Processing complete!
|
| 6 |
+
2025-11-03 11:36:37 [INFO] Total bills processed: 1
|
| 7 |
+
2025-11-03 11:36:37 [INFO] Bills without text: 0
|
| 8 |
+
2025-11-03 11:36:37 [INFO] Already had None categories: 0
|
| 9 |
+
2025-11-03 11:36:37 [INFO] Newly marked as None: 0
|
| 10 |
+
2025-11-03 11:36:37 [INFO] No-text bill marking process completed
|
| 11 |
+
2025-11-03 11:40:21 [INFO] Starting no-text bill marking process
|
| 12 |
+
2025-11-03 11:40:21 [INFO] Loaded 1 bills from data/known_bills_visualize.json
|
| 13 |
+
2025-11-03 11:40:21 [INFO] Processing 1 bills to mark no-text bills
|
| 14 |
+
2025-11-03 11:40:21 [INFO] Saved 1 bills to data/known_bills_visualize.json
|
| 15 |
+
2025-11-03 11:40:21 [INFO] Processing complete!
|
| 16 |
+
2025-11-03 11:40:21 [INFO] Total bills processed: 1
|
| 17 |
+
2025-11-03 11:40:21 [INFO] Bills without text: 0
|
| 18 |
+
2025-11-03 11:40:21 [INFO] Already had None categories: 0
|
| 19 |
+
2025-11-03 11:40:21 [INFO] Newly marked as None: 0
|
| 20 |
+
2025-11-03 11:40:21 [INFO] No-text bill marking process completed
|
| 21 |
+
2025-11-03 12:24:40 [INFO] Starting no-text bill marking process
|
| 22 |
+
2025-11-03 12:24:40 [INFO] Loaded 1 bills from data/known_bills_visualize.json
|
| 23 |
+
2025-11-03 12:24:40 [INFO] Processing 1 bills to mark no-text bills
|
| 24 |
+
2025-11-03 12:24:40 [INFO] Saved 1 bills to data/known_bills_visualize.json
|
| 25 |
+
2025-11-03 12:24:40 [INFO] Processing complete!
|
| 26 |
+
2025-11-03 12:24:40 [INFO] Total bills processed: 1
|
| 27 |
+
2025-11-03 12:24:40 [INFO] Bills without text: 0
|
| 28 |
+
2025-11-03 12:24:40 [INFO] Already had None categories: 0
|
| 29 |
+
2025-11-03 12:24:40 [INFO] Newly marked as None: 0
|
| 30 |
+
2025-11-03 12:24:40 [INFO] No-text bill marking process completed
|
| 31 |
+
2025-11-04 15:55:11 [INFO] Starting no-text bill marking process
|
| 32 |
+
2025-11-04 15:55:11 [INFO] Loaded 10 bills from data/known_bills_visualize.json
|
| 33 |
+
2025-11-04 15:55:11 [INFO] Processing 10 bills to mark no-text bills
|
| 34 |
+
2025-11-04 15:55:11 [INFO] Saved 10 bills to data/known_bills_visualize.json
|
| 35 |
+
2025-11-04 15:55:11 [INFO] Processing complete!
|
| 36 |
+
2025-11-04 15:55:11 [INFO] Total bills processed: 10
|
| 37 |
+
2025-11-04 15:55:11 [INFO] Bills without text: 0
|
| 38 |
+
2025-11-04 15:55:11 [INFO] Already had None categories: 0
|
| 39 |
+
2025-11-04 15:55:11 [INFO] Newly marked as None: 0
|
| 40 |
+
2025-11-04 15:55:11 [INFO] No-text bill marking process completed
|
| 41 |
+
2025-11-14 15:31:16 [INFO] Starting no-text bill marking process
|
| 42 |
+
2025-11-14 15:31:16 [INFO] Loaded 2564 bills from data/known_bills_visualize.json
|
| 43 |
+
2025-11-14 15:31:16 [INFO] Processing 2564 bills to mark no-text bills
|
| 44 |
+
2025-11-14 15:31:16 [INFO] Progress: 100/2564 processed
|
| 45 |
+
2025-11-14 15:31:16 [INFO] Progress: 200/2564 processed
|
| 46 |
+
2025-11-14 15:31:16 [INFO] Progress: 300/2564 processed
|
| 47 |
+
2025-11-14 15:31:16 [INFO] Progress: 400/2564 processed
|
| 48 |
+
2025-11-14 15:31:16 [INFO] Progress: 500/2564 processed
|
| 49 |
+
2025-11-14 15:31:16 [INFO] Progress: 600/2564 processed
|
| 50 |
+
2025-11-14 15:31:16 [INFO] Progress: 700/2564 processed
|
| 51 |
+
2025-11-14 15:31:16 [INFO] Progress: 800/2564 processed
|
| 52 |
+
2025-11-14 15:31:16 [INFO] Progress: 900/2564 processed
|
| 53 |
+
2025-11-14 15:31:16 [INFO] Progress: 1000/2564 processed
|
| 54 |
+
2025-11-14 15:31:16 [INFO] Progress: 1100/2564 processed
|
| 55 |
+
2025-11-14 15:31:16 [INFO] Progress: 1200/2564 processed
|
| 56 |
+
2025-11-14 15:31:16 [INFO] Progress: 1300/2564 processed
|
| 57 |
+
2025-11-14 15:31:16 [INFO] Progress: 1400/2564 processed
|
| 58 |
+
2025-11-14 15:31:16 [INFO] Progress: 1500/2564 processed
|
| 59 |
+
2025-11-14 15:31:16 [INFO] Progress: 1600/2564 processed
|
| 60 |
+
2025-11-14 15:31:16 [INFO] Progress: 1700/2564 processed
|
| 61 |
+
2025-11-14 15:31:16 [INFO] Progress: 1800/2564 processed
|
| 62 |
+
2025-11-14 15:31:16 [INFO] Progress: 1900/2564 processed
|
| 63 |
+
2025-11-14 15:31:16 [INFO] Progress: 2000/2564 processed
|
| 64 |
+
2025-11-14 15:31:16 [INFO] Progress: 2100/2564 processed
|
| 65 |
+
2025-11-14 15:31:16 [INFO] Progress: 2200/2564 processed
|
| 66 |
+
2025-11-14 15:31:16 [INFO] Progress: 2300/2564 processed
|
| 67 |
+
2025-11-14 15:31:16 [INFO] Progress: 2400/2564 processed
|
| 68 |
+
2025-11-14 15:31:16 [INFO] Progress: 2500/2564 processed
|
| 69 |
+
2025-11-14 15:31:17 [INFO] Saved 2564 bills to data/known_bills_visualize.json
|
| 70 |
+
2025-11-14 15:31:17 [INFO] Processing complete!
|
| 71 |
+
2025-11-14 15:31:17 [INFO] Total bills processed: 2564
|
| 72 |
+
2025-11-14 15:31:17 [INFO] Bills without text: 9
|
| 73 |
+
2025-11-14 15:31:17 [INFO] Already had None categories: 9
|
| 74 |
+
2025-11-14 15:31:17 [INFO] Newly marked as None: 0
|
| 75 |
+
2025-11-14 15:31:17 [INFO] No-text bill marking process completed
|
| 76 |
+
2025-11-17 21:13:12 [INFO] Starting no-text bill marking process
|
| 77 |
+
2025-11-17 21:13:13 [INFO] Loaded 2564 bills from data/known_bills_visualize.json
|
| 78 |
+
2025-11-17 21:13:13 [INFO] Processing 2564 bills to mark no-text bills
|
| 79 |
+
2025-11-17 21:13:13 [INFO] Progress: 100/2564 processed
|
| 80 |
+
2025-11-17 21:13:13 [INFO] Progress: 200/2564 processed
|
| 81 |
+
2025-11-17 21:13:13 [INFO] Progress: 300/2564 processed
|
| 82 |
+
2025-11-17 21:13:13 [INFO] Progress: 400/2564 processed
|
| 83 |
+
2025-11-17 21:13:13 [INFO] Progress: 500/2564 processed
|
| 84 |
+
2025-11-17 21:13:13 [INFO] Progress: 600/2564 processed
|
| 85 |
+
2025-11-17 21:13:13 [INFO] Progress: 700/2564 processed
|
| 86 |
+
2025-11-17 21:13:13 [INFO] Progress: 800/2564 processed
|
| 87 |
+
2025-11-17 21:13:13 [INFO] Progress: 900/2564 processed
|
| 88 |
+
2025-11-17 21:13:13 [INFO] Progress: 1000/2564 processed
|
| 89 |
+
2025-11-17 21:13:13 [INFO] Progress: 1100/2564 processed
|
| 90 |
+
2025-11-17 21:13:13 [INFO] Progress: 1200/2564 processed
|
| 91 |
+
2025-11-17 21:13:13 [INFO] Progress: 1300/2564 processed
|
| 92 |
+
2025-11-17 21:13:13 [INFO] Progress: 1400/2564 processed
|
| 93 |
+
2025-11-17 21:13:13 [INFO] Progress: 1500/2564 processed
|
| 94 |
+
2025-11-17 21:13:13 [INFO] Progress: 1600/2564 processed
|
| 95 |
+
2025-11-17 21:13:13 [INFO] Progress: 1700/2564 processed
|
| 96 |
+
2025-11-17 21:13:13 [INFO] Progress: 1800/2564 processed
|
| 97 |
+
2025-11-17 21:13:13 [INFO] Progress: 1900/2564 processed
|
| 98 |
+
2025-11-17 21:13:13 [INFO] Progress: 2000/2564 processed
|
| 99 |
+
2025-11-17 21:13:13 [INFO] Progress: 2100/2564 processed
|
| 100 |
+
2025-11-17 21:13:13 [INFO] Progress: 2200/2564 processed
|
| 101 |
+
2025-11-17 21:13:13 [INFO] Progress: 2300/2564 processed
|
| 102 |
+
2025-11-17 21:13:13 [INFO] Progress: 2400/2564 processed
|
| 103 |
+
2025-11-17 21:13:13 [INFO] Progress: 2500/2564 processed
|
| 104 |
+
2025-11-17 21:13:14 [INFO] Saved 2564 bills to data/known_bills_visualize.json
|
| 105 |
+
2025-11-17 21:13:14 [INFO] Processing complete!
|
| 106 |
+
2025-11-17 21:13:14 [INFO] Total bills processed: 2564
|
| 107 |
+
2025-11-17 21:13:14 [INFO] Bills without text: 9
|
| 108 |
+
2025-11-17 21:13:14 [INFO] Already had None categories: 9
|
| 109 |
+
2025-11-17 21:13:14 [INFO] Newly marked as None: 0
|
| 110 |
+
2025-11-17 21:13:14 [INFO] No-text bill marking process completed
|
| 111 |
+
2025-11-20 13:52:45 [INFO] Starting no-text bill marking process
|
| 112 |
+
2025-11-20 13:52:46 [INFO] Loaded 2596 bills from data/known_bills_visualize.json
|
| 113 |
+
2025-11-20 13:52:46 [INFO] Processing 2596 bills to mark no-text bills
|
| 114 |
+
2025-11-20 13:52:46 [INFO] Progress: 100/2596 processed
|
| 115 |
+
2025-11-20 13:52:46 [INFO] Progress: 200/2596 processed
|
| 116 |
+
2025-11-20 13:52:46 [INFO] Progress: 300/2596 processed
|
| 117 |
+
2025-11-20 13:52:46 [INFO] Progress: 400/2596 processed
|
| 118 |
+
2025-11-20 13:52:46 [INFO] Progress: 500/2596 processed
|
| 119 |
+
2025-11-20 13:52:46 [INFO] Progress: 600/2596 processed
|
| 120 |
+
2025-11-20 13:52:46 [INFO] Progress: 700/2596 processed
|
| 121 |
+
2025-11-20 13:52:46 [INFO] Progress: 800/2596 processed
|
| 122 |
+
2025-11-20 13:52:46 [INFO] Progress: 900/2596 processed
|
| 123 |
+
2025-11-20 13:52:46 [INFO] Progress: 1000/2596 processed
|
| 124 |
+
2025-11-20 13:52:46 [INFO] Progress: 1100/2596 processed
|
| 125 |
+
2025-11-20 13:52:46 [INFO] Progress: 1200/2596 processed
|
| 126 |
+
2025-11-20 13:52:46 [INFO] Progress: 1300/2596 processed
|
| 127 |
+
2025-11-20 13:52:46 [INFO] Progress: 1400/2596 processed
|
| 128 |
+
2025-11-20 13:52:46 [INFO] Progress: 1500/2596 processed
|
| 129 |
+
2025-11-20 13:52:46 [INFO] Progress: 1600/2596 processed
|
| 130 |
+
2025-11-20 13:52:46 [INFO] Progress: 1700/2596 processed
|
| 131 |
+
2025-11-20 13:52:46 [INFO] Progress: 1800/2596 processed
|
| 132 |
+
2025-11-20 13:52:46 [INFO] Progress: 1900/2596 processed
|
| 133 |
+
2025-11-20 13:52:46 [INFO] Progress: 2000/2596 processed
|
| 134 |
+
2025-11-20 13:52:46 [INFO] Progress: 2100/2596 processed
|
| 135 |
+
2025-11-20 13:52:46 [INFO] Progress: 2200/2596 processed
|
| 136 |
+
2025-11-20 13:52:46 [INFO] Progress: 2300/2596 processed
|
| 137 |
+
2025-11-20 13:52:46 [INFO] Progress: 2400/2596 processed
|
| 138 |
+
2025-11-20 13:52:46 [INFO] Progress: 2500/2596 processed
|
| 139 |
+
2025-11-20 13:52:47 [INFO] Saved 2596 bills to data/known_bills_visualize.json
|
| 140 |
+
2025-11-20 13:52:47 [INFO] Processing complete!
|
| 141 |
+
2025-11-20 13:52:47 [INFO] Total bills processed: 2596
|
| 142 |
+
2025-11-20 13:52:47 [INFO] Bills without text: 13
|
| 143 |
+
2025-11-20 13:52:47 [INFO] Already had None categories: 13
|
| 144 |
+
2025-11-20 13:52:47 [INFO] Newly marked as None: 0
|
| 145 |
+
2025-11-20 13:52:47 [INFO] No-text bill marking process completed
|
| 146 |
+
2025-12-01 12:30:17 [INFO] Starting no-text bill marking process
|
| 147 |
+
2025-12-01 12:30:17 [INFO] Loaded 2605 bills from data/known_bills_visualize.json
|
| 148 |
+
2025-12-01 12:30:17 [INFO] Processing 2605 bills to mark no-text bills
|
| 149 |
+
2025-12-01 12:30:17 [INFO] Progress: 100/2605 processed
|
| 150 |
+
2025-12-01 12:30:17 [INFO] Progress: 200/2605 processed
|
| 151 |
+
2025-12-01 12:30:17 [INFO] Progress: 300/2605 processed
|
| 152 |
+
2025-12-01 12:30:17 [INFO] Progress: 400/2605 processed
|
| 153 |
+
2025-12-01 12:30:17 [INFO] Progress: 500/2605 processed
|
| 154 |
+
2025-12-01 12:30:17 [INFO] Progress: 600/2605 processed
|
| 155 |
+
2025-12-01 12:30:17 [INFO] Progress: 700/2605 processed
|
| 156 |
+
2025-12-01 12:30:17 [INFO] Progress: 800/2605 processed
|
| 157 |
+
2025-12-01 12:30:17 [INFO] Progress: 900/2605 processed
|
| 158 |
+
2025-12-01 12:30:17 [INFO] Progress: 1000/2605 processed
|
| 159 |
+
2025-12-01 12:30:17 [INFO] Progress: 1100/2605 processed
|
| 160 |
+
2025-12-01 12:30:17 [INFO] Progress: 1200/2605 processed
|
| 161 |
+
2025-12-01 12:30:17 [INFO] Progress: 1300/2605 processed
|
| 162 |
+
2025-12-01 12:30:17 [INFO] Progress: 1400/2605 processed
|
| 163 |
+
2025-12-01 12:30:17 [INFO] Progress: 1500/2605 processed
|
| 164 |
+
2025-12-01 12:30:17 [INFO] Progress: 1600/2605 processed
|
| 165 |
+
2025-12-01 12:30:17 [INFO] Progress: 1700/2605 processed
|
| 166 |
+
2025-12-01 12:30:17 [INFO] Progress: 1800/2605 processed
|
| 167 |
+
2025-12-01 12:30:17 [INFO] Progress: 1900/2605 processed
|
| 168 |
+
2025-12-01 12:30:17 [INFO] Progress: 2000/2605 processed
|
| 169 |
+
2025-12-01 12:30:17 [INFO] Progress: 2100/2605 processed
|
| 170 |
+
2025-12-01 12:30:17 [INFO] Progress: 2200/2605 processed
|
| 171 |
+
2025-12-01 12:30:17 [INFO] Progress: 2300/2605 processed
|
| 172 |
+
2025-12-01 12:30:17 [INFO] Progress: 2400/2605 processed
|
| 173 |
+
2025-12-01 12:30:17 [INFO] Progress: 2500/2605 processed
|
| 174 |
+
2025-12-01 12:30:17 [INFO] Progress: 2600/2605 processed
|
| 175 |
+
2025-12-01 12:30:19 [INFO] Saved 2605 bills to data/known_bills_visualize.json
|
| 176 |
+
2025-12-01 12:30:19 [INFO] Processing complete!
|
| 177 |
+
2025-12-01 12:30:19 [INFO] Total bills processed: 2605
|
| 178 |
+
2025-12-01 12:30:19 [INFO] Bills without text: 16
|
| 179 |
+
2025-12-01 12:30:19 [INFO] Already had None categories: 16
|
| 180 |
+
2025-12-01 12:30:19 [INFO] Newly marked as None: 0
|
| 181 |
+
2025-12-01 12:30:19 [INFO] No-text bill marking process completed
|
| 182 |
+
2025-12-01 13:11:46 [INFO] Starting no-text bill marking process
|
| 183 |
+
2025-12-01 13:11:47 [INFO] Loaded 2605 bills from data/known_bills_visualize.json
|
| 184 |
+
2025-12-01 13:11:47 [INFO] Processing 2605 bills to mark no-text bills
|
| 185 |
+
2025-12-01 13:11:47 [INFO] Progress: 100/2605 processed
|
| 186 |
+
2025-12-01 13:11:47 [INFO] Progress: 200/2605 processed
|
| 187 |
+
2025-12-01 13:11:47 [INFO] Progress: 300/2605 processed
|
| 188 |
+
2025-12-01 13:11:47 [INFO] Progress: 400/2605 processed
|
| 189 |
+
2025-12-01 13:11:47 [INFO] Progress: 500/2605 processed
|
| 190 |
+
2025-12-01 13:11:47 [INFO] Progress: 600/2605 processed
|
| 191 |
+
2025-12-01 13:11:47 [INFO] Progress: 700/2605 processed
|
| 192 |
+
2025-12-01 13:11:47 [INFO] Progress: 800/2605 processed
|
| 193 |
+
2025-12-01 13:11:47 [INFO] Progress: 900/2605 processed
|
| 194 |
+
2025-12-01 13:11:47 [INFO] Progress: 1000/2605 processed
|
| 195 |
+
2025-12-01 13:11:47 [INFO] Progress: 1100/2605 processed
|
| 196 |
+
2025-12-01 13:11:47 [INFO] Progress: 1200/2605 processed
|
| 197 |
+
2025-12-01 13:11:47 [INFO] Progress: 1300/2605 processed
|
| 198 |
+
2025-12-01 13:11:47 [INFO] Progress: 1400/2605 processed
|
| 199 |
+
2025-12-01 13:11:47 [INFO] Progress: 1500/2605 processed
|
| 200 |
+
2025-12-01 13:11:47 [INFO] Progress: 1600/2605 processed
|
| 201 |
+
2025-12-01 13:11:47 [INFO] Progress: 1700/2605 processed
|
| 202 |
+
2025-12-01 13:11:47 [INFO] Progress: 1800/2605 processed
|
| 203 |
+
2025-12-01 13:11:47 [INFO] Progress: 1900/2605 processed
|
| 204 |
+
2025-12-01 13:11:47 [INFO] Progress: 2000/2605 processed
|
| 205 |
+
2025-12-01 13:11:47 [INFO] Progress: 2100/2605 processed
|
| 206 |
+
2025-12-01 13:11:47 [INFO] Progress: 2200/2605 processed
|
| 207 |
+
2025-12-01 13:11:47 [INFO] Progress: 2300/2605 processed
|
| 208 |
+
2025-12-01 13:11:47 [INFO] Progress: 2400/2605 processed
|
| 209 |
+
2025-12-01 13:11:47 [INFO] Progress: 2500/2605 processed
|
| 210 |
+
2025-12-01 13:11:47 [INFO] Progress: 2600/2605 processed
|
| 211 |
+
2025-12-01 13:11:48 [INFO] Saved 2605 bills to data/known_bills_visualize.json
|
| 212 |
+
2025-12-01 13:11:48 [INFO] Processing complete!
|
| 213 |
+
2025-12-01 13:11:48 [INFO] Total bills processed: 2605
|
| 214 |
+
2025-12-01 13:11:48 [INFO] Bills without text: 16
|
| 215 |
+
2025-12-01 13:11:48 [INFO] Already had None categories: 16
|
| 216 |
+
2025-12-01 13:11:48 [INFO] Newly marked as None: 0
|
| 217 |
+
2025-12-01 13:11:48 [INFO] No-text bill marking process completed
|
| 218 |
+
2025-12-01 13:16:12 [INFO] Starting no-text bill marking process
|
| 219 |
+
2025-12-01 13:16:13 [ERROR] Error loading bills: Expecting ',' delimiter: line 70396 column 683331 (char 189968803)
|
| 220 |
+
2025-12-01 13:16:13 [ERROR] No bills loaded. Exiting.
|
| 221 |
+
2025-12-01 13:16:13 [INFO] No-text bill marking process completed
|
| 222 |
+
2025-12-01 13:16:16 [INFO] Starting no-text bill marking process
|
| 223 |
+
2025-12-01 13:16:17 [INFO] Loaded 2605 bills from data/known_bills_visualize.json
|
| 224 |
+
2025-12-01 13:16:17 [INFO] Processing 2605 bills to mark no-text bills
|
| 225 |
+
2025-12-01 13:16:17 [INFO] Progress: 100/2605 processed
|
| 226 |
+
2025-12-01 13:16:17 [INFO] Progress: 200/2605 processed
|
| 227 |
+
2025-12-01 13:16:17 [INFO] Progress: 300/2605 processed
|
| 228 |
+
2025-12-01 13:16:17 [INFO] Progress: 400/2605 processed
|
| 229 |
+
2025-12-01 13:16:17 [INFO] Progress: 500/2605 processed
|
| 230 |
+
2025-12-01 13:16:17 [INFO] Progress: 600/2605 processed
|
| 231 |
+
2025-12-01 13:16:17 [INFO] Progress: 700/2605 processed
|
| 232 |
+
2025-12-01 13:16:17 [INFO] Progress: 800/2605 processed
|
| 233 |
+
2025-12-01 13:16:17 [INFO] Progress: 900/2605 processed
|
| 234 |
+
2025-12-01 13:16:17 [INFO] Progress: 1000/2605 processed
|
| 235 |
+
2025-12-01 13:16:17 [INFO] Progress: 1100/2605 processed
|
| 236 |
+
2025-12-01 13:16:17 [INFO] Progress: 1200/2605 processed
|
| 237 |
+
2025-12-01 13:16:17 [INFO] Progress: 1300/2605 processed
|
| 238 |
+
2025-12-01 13:16:17 [INFO] Progress: 1400/2605 processed
|
| 239 |
+
2025-12-01 13:16:17 [INFO] Progress: 1500/2605 processed
|
| 240 |
+
2025-12-01 13:16:17 [INFO] Progress: 1600/2605 processed
|
| 241 |
+
2025-12-01 13:16:17 [INFO] Progress: 1700/2605 processed
|
| 242 |
+
2025-12-01 13:16:17 [INFO] Progress: 1800/2605 processed
|
| 243 |
+
2025-12-01 13:16:17 [INFO] Progress: 1900/2605 processed
|
| 244 |
+
2025-12-01 13:16:17 [INFO] Progress: 2000/2605 processed
|
| 245 |
+
2025-12-01 13:16:17 [INFO] Progress: 2100/2605 processed
|
| 246 |
+
2025-12-01 13:16:17 [INFO] Progress: 2200/2605 processed
|
| 247 |
+
2025-12-01 13:16:17 [INFO] Progress: 2300/2605 processed
|
| 248 |
+
2025-12-01 13:16:17 [INFO] Progress: 2400/2605 processed
|
| 249 |
+
2025-12-01 13:16:17 [INFO] Progress: 2500/2605 processed
|
| 250 |
+
2025-12-01 13:16:17 [INFO] Progress: 2600/2605 processed
|
| 251 |
+
2025-12-01 13:16:18 [INFO] Saved 2605 bills to data/known_bills_visualize.json
|
| 252 |
+
2025-12-01 13:16:18 [INFO] Processing complete!
|
| 253 |
+
2025-12-01 13:16:18 [INFO] Total bills processed: 2605
|
| 254 |
+
2025-12-01 13:16:18 [INFO] Bills without text: 16
|
| 255 |
+
2025-12-01 13:16:18 [INFO] Already had None categories: 16
|
| 256 |
+
2025-12-01 13:16:18 [INFO] Newly marked as None: 0
|
| 257 |
+
2025-12-01 13:16:18 [INFO] No-text bill marking process completed
|
| 258 |
+
2025-12-03 11:02:34 [INFO] Starting no-text bill marking process
|
| 259 |
+
2025-12-03 11:02:35 [INFO] Loaded 2608 bills from data/known_bills_visualize.json
|
| 260 |
+
2025-12-03 11:02:35 [INFO] Processing 2608 bills to mark no-text bills
|
| 261 |
+
2025-12-03 11:02:35 [INFO] Progress: 100/2608 processed
|
| 262 |
+
2025-12-03 11:02:35 [INFO] Progress: 200/2608 processed
|
| 263 |
+
2025-12-03 11:02:35 [INFO] Progress: 300/2608 processed
|
| 264 |
+
2025-12-03 11:02:35 [INFO] Progress: 400/2608 processed
|
| 265 |
+
2025-12-03 11:02:35 [INFO] Progress: 500/2608 processed
|
| 266 |
+
2025-12-03 11:02:35 [INFO] Progress: 600/2608 processed
|
| 267 |
+
2025-12-03 11:02:35 [INFO] Progress: 700/2608 processed
|
| 268 |
+
2025-12-03 11:02:35 [INFO] Progress: 800/2608 processed
|
| 269 |
+
2025-12-03 11:02:35 [INFO] Progress: 900/2608 processed
|
| 270 |
+
2025-12-03 11:02:35 [INFO] Progress: 1000/2608 processed
|
| 271 |
+
2025-12-03 11:02:35 [INFO] Progress: 1100/2608 processed
|
| 272 |
+
2025-12-03 11:02:35 [INFO] Progress: 1200/2608 processed
|
| 273 |
+
2025-12-03 11:02:35 [INFO] Progress: 1300/2608 processed
|
| 274 |
+
2025-12-03 11:02:35 [INFO] Progress: 1400/2608 processed
|
| 275 |
+
2025-12-03 11:02:35 [INFO] Progress: 1500/2608 processed
|
| 276 |
+
2025-12-03 11:02:35 [INFO] Progress: 1600/2608 processed
|
| 277 |
+
2025-12-03 11:02:35 [INFO] Progress: 1700/2608 processed
|
| 278 |
+
2025-12-03 11:02:35 [INFO] Progress: 1800/2608 processed
|
| 279 |
+
2025-12-03 11:02:35 [INFO] Progress: 1900/2608 processed
|
| 280 |
+
2025-12-03 11:02:35 [INFO] Progress: 2000/2608 processed
|
| 281 |
+
2025-12-03 11:02:35 [INFO] Progress: 2100/2608 processed
|
| 282 |
+
2025-12-03 11:02:35 [INFO] Progress: 2200/2608 processed
|
| 283 |
+
2025-12-03 11:02:35 [INFO] Progress: 2300/2608 processed
|
| 284 |
+
2025-12-03 11:02:35 [INFO] Progress: 2400/2608 processed
|
| 285 |
+
2025-12-03 11:02:35 [INFO] Progress: 2500/2608 processed
|
| 286 |
+
2025-12-03 11:02:35 [INFO] Progress: 2600/2608 processed
|
| 287 |
+
2025-12-03 11:02:36 [INFO] Saved 2608 bills to data/known_bills_visualize.json
|
| 288 |
+
2025-12-03 11:02:36 [INFO] Processing complete!
|
| 289 |
+
2025-12-03 11:02:36 [INFO] Total bills processed: 2608
|
| 290 |
+
2025-12-03 11:02:36 [INFO] Bills without text: 16
|
| 291 |
+
2025-12-03 11:02:36 [INFO] Already had None categories: 16
|
| 292 |
+
2025-12-03 11:02:36 [INFO] Newly marked as None: 0
|
| 293 |
+
2025-12-03 11:02:36 [INFO] No-text bill marking process completed
|
data/data_updating_scripts/logs/migrate_iapp_categories.log
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
data/data_updating_scripts/mark_no_text_bills.py
ADDED
|
@@ -0,0 +1,120 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
"""
Script to mark bills without text as having None IAPP categories.

This script reads known_bills_visualize.json, identifies bills without text,
and sets their IAPP categories to None. The file is modified in-place.
"""

import json
import os
import logging
from pathlib import Path
from typing import Dict, List
import sys

# Add the project root to the path so sibling modules are importable when
# this file is run directly as a script.
sys.path.append(str(Path(__file__).parent.parent))

# Create logs directory if it doesn't exist -- the FileHandler below fails
# when the directory is missing.  NOTE(review): the path is relative to the
# current working directory, so this presumably runs from the data/ folder;
# confirm against the invoking script.
os.makedirs("data_updating_scripts/logs", exist_ok=True)

# Set up logging: mirror every record to the console and to a per-script log file.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [%(levelname)s] %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S',
    handlers=[logging.StreamHandler(), logging.FileHandler("data_updating_scripts/logs/mark_no_text_bills.log")]
)
logger = logging.getLogger(__name__)
|
| 30 |
+
|
| 31 |
+
|
| 32 |
+
class NoTextBillMarker:
    """Marks bills that lack usable text by nulling their IAPP categories."""

    def __init__(self):
        # Path of the visualize JSON file that is read and rewritten in place.
        self.visualize_file = "data/known_bills_visualize.json"

    def load_bills(self) -> List[Dict]:
        """Read all bill records from the visualize file; return [] on any failure."""
        try:
            with open(self.visualize_file, 'r', encoding='utf-8') as fh:
                bills = json.load(fh)
            logger.info(f"Loaded {len(bills)} bills from {self.visualize_file}")
            return bills
        except Exception as e:
            logger.error(f"Error loading bills: {e}")
            return []

    def save_bills(self, bills: List[Dict]) -> None:
        """Write the bill records back to the visualize file in place."""
        try:
            with open(self.visualize_file, 'w', encoding='utf-8') as fh:
                json.dump(bills, fh, indent=2, ensure_ascii=False)
        except Exception as e:
            logger.error(f"Error saving bills: {e}")
        else:
            logger.info(f"Saved {len(bills)} bills to {self.visualize_file}")

    def has_text(self, bill: Dict) -> bool:
        """Return True when the bill carries a meaningful amount of text (>50 chars)."""
        text = bill.get('text')
        if not isinstance(text, str):
            return False
        return len(text.strip()) > 50

    def mark_no_text_bills(self) -> None:
        """Set iapp_categories to None for every bill lacking text, then save in place."""
        bills = self.load_bills()
        if not bills:
            logger.error("No bills loaded. Exiting.")
            return

        total_bills = len(bills)
        no_text_count = 0       # bills whose text is missing/too short
        already_none_count = 0  # of those, how many already had None categories

        logger.info(f"Processing {total_bills} bills to mark no-text bills")

        for i, bill in enumerate(bills, 1):
            if not self.has_text(bill):
                no_text_count += 1
                bill_key = f"{bill.get('state', 'Unknown')}_{bill.get('bill_number', 'Unknown')}"
                if bill.get('iapp_categories') is None:
                    already_none_count += 1
                    logger.debug(f"Bill {bill_key} already has None IAPP categories")
                else:
                    bill['iapp_categories'] = None
                    logger.info(f"Marked bill {bill_key} as having None IAPP categories (no text)")

            # Heartbeat so long runs show forward motion in the log.
            if i % 100 == 0:
                logger.info(f"Progress: {i}/{total_bills} processed")

        # Persist the (possibly unchanged) records back to disk.
        self.save_bills(bills)

        logger.info(f"Processing complete!")
        logger.info(f"Total bills processed: {total_bills}")
        logger.info(f"Bills without text: {no_text_count}")
        logger.info(f"Already had None categories: {already_none_count}")
        logger.info(f"Newly marked as None: {no_text_count - already_none_count}")
|
| 107 |
+
|
| 108 |
+
|
| 109 |
+
def main():
    """Main function to run the no-text bill marker."""
    logger.info("Starting no-text bill marking process")
    NoTextBillMarker().mark_no_text_bills()
    logger.info("No-text bill marking process completed")


if __name__ == "__main__":
    main()
|
data/data_updating_scripts/migrate_iapp_categories.py
ADDED
|
@@ -0,0 +1,358 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
"""
Script to migrate IAPP categories for bills with missing or invalid subcategories.

This script reads bills from known_bills_fixed.json, analyzes bills with missing IAPP categories
using OpenAI API, and saves the results to known_bills_visualize.json.
"""

import json
import logging
import time
from pathlib import Path
from typing import Dict, List, Optional
import sys
import os
import re
import hashlib
import argparse

# Add the project root to the path so `config` (and other siblings) import
# when this file is run directly as a script.
sys.path.append(str(Path(__file__).parent.parent))

from config import ConfigManager
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.documents import Document

# Paths
INPUT_FILE = Path("data/known_bills_fixed.json")    # source bills (read-only input)
VIS_FILE = Path("data/known_bills_visualize.json")  # enriched output written by this script
CACHE_FILE = Path("data/iapp_categories_cache.json")  # per-bill_id category cache

# Create logs directory if it doesn't exist -- the FileHandler below needs it.
# NOTE(review): path is relative to the current working directory; confirm the
# script is always launched from the expected folder.
os.makedirs("data_updating_scripts/logs", exist_ok=True)

# Configure logging to both console and a per-script log file.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[logging.StreamHandler(), logging.FileHandler("data_updating_scripts/logs/migrate_iapp_categories.log")]
)
logger = logging.getLogger(__name__)

# Exact subcategory lists for validation -- LLM output must match these verbatim.
EXACT_SUBCATEGORIES = {
    "Governance": ["Program and documentation", "Assessments", "Training", "Responsible individual"],
    "Transparency": ["General notice", "Labeling/notification", "Explanation/incident reporting", "Developer documentation"],
    "Assurance": ["Registration", "Third-party review"],
    "Individual Rights": ["Opt out/appeal", "Nondiscrimination"]
}

# Fallback categories for failed API calls (one conservative subcategory each).
FALLBACK_CATEGORIES = {
    "Governance": ["Program and documentation"],
    "Transparency": ["General notice"],
    "Assurance": ["Registration"],
    "Individual Rights": ["Opt out/appeal"]
}
|
| 60 |
+
|
| 61 |
+
def bill_key(b: Dict) -> str:
    """Return the unique '<state>_<bill_number>' key for a bill record."""
    state = b.get('state', 'Unknown')
    number = b.get('bill_number', 'Unknown')
    return f"{state}_{number}"
|
| 63 |
+
|
| 64 |
+
def sha256(s: Optional[str]) -> Optional[str]:
    """Hex SHA-256 digest of a non-empty string; None for anything else."""
    if isinstance(s, str) and s.strip():
        return hashlib.sha256(s.encode("utf-8")).hexdigest()
    return None
|
| 68 |
+
|
| 69 |
+
def load_json(path: Path, default):
    """Parse JSON from *path*; return *default* on any failure (missing file, bad JSON)."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except Exception:
        return default
|
| 75 |
+
|
| 76 |
+
def save_json(path: Path, data):
    """Serialize *data* as pretty-printed UTF-8 JSON, creating parent dirs as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = json.dumps(data, indent=2, ensure_ascii=False)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(payload)
|
| 80 |
+
|
| 81 |
+
class IAPPCategoriesMigrator:
|
| 82 |
+
"""Migrates IAPP categories for bills with missing or invalid subcategories."""
|
| 83 |
+
|
| 84 |
+
def __init__(self, force: bool = False, rebuild_cache: bool = False, sleep_sec: float = 0.0):
|
| 85 |
+
"""Initialize the migrator with configuration."""
|
| 86 |
+
self.config = ConfigManager()
|
| 87 |
+
if not self.config.OPENAI_API_KEY:
|
| 88 |
+
raise ValueError("OPENAI_API_KEY not found in environment variables")
|
| 89 |
+
|
| 90 |
+
self.force = force
|
| 91 |
+
self.sleep_sec = max(0.0, sleep_sec)
|
| 92 |
+
# Cache
|
| 93 |
+
self.cache: Dict[str, Dict] = {} if rebuild_cache else load_json(CACHE_FILE, {})
|
| 94 |
+
|
| 95 |
+
self.llm = ChatOpenAI(
|
| 96 |
+
model=self.config.OPENAI_LLM_MODEL,
|
| 97 |
+
temperature=0.1,
|
| 98 |
+
max_tokens=1000
|
| 99 |
+
)
|
| 100 |
+
|
| 101 |
+
self.iapp_prompt = self._create_prompt()
|
| 102 |
+
self.chain = self.iapp_prompt | self.llm | StrOutputParser()
|
| 103 |
+
logger.info(
|
| 104 |
+
f"Initialized IAPPCategoriesMigrator | model={self.config.OPENAI_LLM_MODEL} | "
|
| 105 |
+
f"force={self.force} | rebuild_cache={rebuild_cache}"
|
| 106 |
+
)
|
| 107 |
+
|
| 108 |
+
def _create_prompt(self):
|
| 109 |
+
"""Create the IAPP analysis prompt with relaxed subcategory matching."""
|
| 110 |
+
prompt_text = """
|
| 111 |
+
Analyze the following AI-related bill content using the IAPP (International Association of Privacy Professionals) framework for AI governance categorization.
|
| 112 |
+
|
| 113 |
+
Your response must be ONLY a JSON object in this exact format with nothing else before or after:
|
| 114 |
+
{{"iapp_categories": {{"Governance": ["subcategory1", "subcategory2"], "Transparency": [], "Assurance": [], "Individual Rights": []}}}}
|
| 115 |
+
|
| 116 |
+
Use these four main categories and their EXACT subcategories (no variations allowed):
|
| 117 |
+
|
| 118 |
+
**Governance:**
|
| 119 |
+
- Program and documentation
|
| 120 |
+
- Assessments
|
| 121 |
+
- Training
|
| 122 |
+
- Responsible individual
|
| 123 |
+
|
| 124 |
+
**Transparency:**
|
| 125 |
+
- General notice
|
| 126 |
+
- Labeling/notification
|
| 127 |
+
- Explanation/incident reporting
|
| 128 |
+
- Developer documentation
|
| 129 |
+
|
| 130 |
+
**Assurance:**
|
| 131 |
+
- Registration
|
| 132 |
+
- Third-party review
|
| 133 |
+
|
| 134 |
+
**Individual Rights:**
|
| 135 |
+
- Opt out/appeal
|
| 136 |
+
- Nondiscrimination
|
| 137 |
+
|
| 138 |
+
Guidelines for categorization:
|
| 139 |
+
- Select ALL applicable subcategories that the bill directly addresses or substantially discusses
|
| 140 |
+
- If a category has no applicable subcategories, try to label it anyway based on surrounding context
|
| 141 |
+
- Be specific – prioritize subcategories that are clearly supported, but use judgment if AI or governance themes are present
|
| 142 |
+
- Focus on what the bill addresses or emphasizes, even if it doesn’t explicitly mandate requirements
|
| 143 |
+
- If the bill discusses AI, automation, decision systems, digital governance, or national technology strategy, categorize it as best as possible
|
| 144 |
+
- Avoid returning no categories when possible assuming that the bill is AI governance related, unless it truly could not be categorized into any of the four categories.
|
| 145 |
+
|
| 146 |
+
Bill content to analyze: {context}
|
| 147 |
+
"""
|
| 148 |
+
return ChatPromptTemplate.from_messages([
|
| 149 |
+
("system", prompt_text),
|
| 150 |
+
("human", "Analyze this bill for IAPP categories:")
|
| 151 |
+
])
|
| 152 |
+
|
| 153 |
+
def docs_from_bill(self, bill: Dict) -> List[Document]:
|
| 154 |
+
txt = bill.get("text", "")
|
| 155 |
+
if not isinstance(txt, str) or not txt.strip():
|
| 156 |
+
return []
|
| 157 |
+
return [
|
| 158 |
+
Document(
|
| 159 |
+
page_content=txt,
|
| 160 |
+
metadata={
|
| 161 |
+
"bill_key": bill_key(bill),
|
| 162 |
+
"state": bill.get("state", "Unknown"),
|
| 163 |
+
"bill_number": bill.get("bill_number", "Unknown"),
|
| 164 |
+
"title": bill.get("title", "No title"),
|
| 165 |
+
},
|
| 166 |
+
)
|
| 167 |
+
]
|
| 168 |
+
|
| 169 |
+
def is_valid_categories(self, iapp: Dict) -> bool:
|
| 170 |
+
if not isinstance(iapp, dict):
|
| 171 |
+
return False
|
| 172 |
+
for cat in ("Governance", "Transparency", "Assurance", "Individual Rights"):
|
| 173 |
+
if cat not in iapp or not isinstance(iapp[cat], list):
|
| 174 |
+
return False
|
| 175 |
+
for sub in iapp[cat]:
|
| 176 |
+
if sub not in EXACT_SUBCATEGORIES[cat]:
|
| 177 |
+
return False
|
| 178 |
+
return True
|
| 179 |
+
|
| 180 |
+
def parse_llm(self, response: str) -> Optional[Dict]:
|
| 181 |
+
m = re.search(r"\{.*\}", response, re.DOTALL)
|
| 182 |
+
if not m:
|
| 183 |
+
return None
|
| 184 |
+
try:
|
| 185 |
+
obj = json.loads(m.group(0))
|
| 186 |
+
return obj.get("iapp_categories")
|
| 187 |
+
except Exception:
|
| 188 |
+
return None
|
| 189 |
+
|
| 190 |
+
def cached_match(self, b: Dict) -> Optional[Dict]:
|
| 191 |
+
bid = str(b.get("bill_id"))
|
| 192 |
+
ch = b.get("change_hash")
|
| 193 |
+
txt_hash = sha256(b.get("text"))
|
| 194 |
+
c = self.cache.get(bid)
|
| 195 |
+
if not c:
|
| 196 |
+
return None
|
| 197 |
+
if (ch and c.get("change_hash") == ch) or (txt_hash and c.get("text_sha256") == txt_hash):
|
| 198 |
+
return c.get("iapp_categories")
|
| 199 |
+
return None
|
| 200 |
+
|
| 201 |
+
def remember(self, b: Dict, iapp: Dict):
|
| 202 |
+
bid = str(b.get("bill_id"))
|
| 203 |
+
self.cache[bid] = {
|
| 204 |
+
"bill_id": bid,
|
| 205 |
+
"change_hash": b.get("change_hash"),
|
| 206 |
+
"text_sha256": sha256(b.get("text")),
|
| 207 |
+
"iapp_categories": iapp,
|
| 208 |
+
"updated_at": b.get("lastUpdatedAt"),
|
| 209 |
+
"state": b.get("state"),
|
| 210 |
+
"bill_number": b.get("bill_number"),
|
| 211 |
+
"title": b.get("title"),
|
| 212 |
+
}
|
| 213 |
+
|
| 214 |
+
def run(self):
|
| 215 |
+
if not INPUT_FILE.exists():
|
| 216 |
+
raise FileNotFoundError(f"Missing {INPUT_FILE}")
|
| 217 |
+
src_bills: List[Dict] = load_json(INPUT_FILE, [])
|
| 218 |
+
vis_bills: List[Dict] = load_json(VIS_FILE, [])
|
| 219 |
+
|
| 220 |
+
vis_map = {bill_key(b): b for b in vis_bills}
|
| 221 |
+
|
| 222 |
+
total = len(src_bills)
|
| 223 |
+
reused_cache = 0
|
| 224 |
+
reused_vis = 0
|
| 225 |
+
computed = 0
|
| 226 |
+
skipped_no_text = 0
|
| 227 |
+
errors = 0
|
| 228 |
+
|
| 229 |
+
out_bills: List[Dict] = []
|
| 230 |
+
|
| 231 |
+
logger.info(f"Loaded {total} source bills; visualize has {len(vis_bills)} existing entries; cache size={len(self.cache)}")
|
| 232 |
+
|
| 233 |
+
for i, b in enumerate(src_bills, 1):
|
| 234 |
+
key = bill_key(b)
|
| 235 |
+
txt = b.get("text", "")
|
| 236 |
+
|
| 237 |
+
out_rec = b.copy()
|
| 238 |
+
|
| 239 |
+
if not isinstance(txt, str) or len(txt.strip()) <= 50:
|
| 240 |
+
prev = vis_map.get(key)
|
| 241 |
+
if prev and "iapp_categories" in prev:
|
| 242 |
+
out_rec["iapp_categories"] = prev["iapp_categories"]
|
| 243 |
+
else:
|
| 244 |
+
out_rec["iapp_categories"] = None
|
| 245 |
+
out_bills.append(out_rec)
|
| 246 |
+
skipped_no_text += 1
|
| 247 |
+
if i % 50 == 0:
|
| 248 |
+
logger.info(f"[{i}/{total}] progress...")
|
| 249 |
+
continue
|
| 250 |
+
|
| 251 |
+
if self.force:
|
| 252 |
+
iapp = self._compute_categories(b)
|
| 253 |
+
if iapp is None:
|
| 254 |
+
iapp = FALLBACK_CATEGORIES
|
| 255 |
+
errors += 1
|
| 256 |
+
else:
|
| 257 |
+
computed += 1
|
| 258 |
+
out_rec["iapp_categories"] = iapp
|
| 259 |
+
self.remember(b, iapp)
|
| 260 |
+
out_bills.append(out_rec)
|
| 261 |
+
if self.sleep_sec:
|
| 262 |
+
time.sleep(self.sleep_sec)
|
| 263 |
+
if i % 10 == 0:
|
| 264 |
+
save_json(VIS_FILE, out_bills)
|
| 265 |
+
save_json(CACHE_FILE, self.cache)
|
| 266 |
+
continue
|
| 267 |
+
|
| 268 |
+
cached = self.cached_match(b)
|
| 269 |
+
if cached and self.is_valid_categories(cached):
|
| 270 |
+
out_rec["iapp_categories"] = cached
|
| 271 |
+
out_bills.append(out_rec)
|
| 272 |
+
reused_cache += 1
|
| 273 |
+
if i % 50 == 0:
|
| 274 |
+
logger.info(f"[{i}/{total}] progress...")
|
| 275 |
+
continue
|
| 276 |
+
|
| 277 |
+
prev = vis_map.get(key)
|
| 278 |
+
if prev and "iapp_categories" in prev:
|
| 279 |
+
prev_bid = str(prev.get("bill_id"))
|
| 280 |
+
prev_cache = self.cache.get(prev_bid)
|
| 281 |
+
if prev_cache:
|
| 282 |
+
same = False
|
| 283 |
+
if b.get("change_hash") and prev_cache.get("change_hash") == b.get("change_hash"):
|
| 284 |
+
same = True
|
| 285 |
+
elif sha256(b.get("text")) == prev_cache.get("text_sha256"):
|
| 286 |
+
same = True
|
| 287 |
+
if same and self.is_valid_categories(prev.get("iapp_categories", {})):
|
| 288 |
+
out_rec["iapp_categories"] = prev["iapp_categories"]
|
| 289 |
+
out_bills.append(out_rec)
|
| 290 |
+
reused_vis += 1
|
| 291 |
+
if i % 50 == 0:
|
| 292 |
+
logger.info(f"[{i}/{total}] progress...")
|
| 293 |
+
continue
|
| 294 |
+
|
| 295 |
+
iapp = self._compute_categories(b)
|
| 296 |
+
if iapp is None:
|
| 297 |
+
iapp = FALLBACK_CATEGORIES
|
| 298 |
+
errors += 1
|
| 299 |
+
else:
|
| 300 |
+
computed += 1
|
| 301 |
+
out_rec["iapp_categories"] = iapp
|
| 302 |
+
self.remember(b, iapp)
|
| 303 |
+
out_bills.append(out_rec)
|
| 304 |
+
|
| 305 |
+
if self.sleep_sec:
|
| 306 |
+
time.sleep(self.sleep_sec)
|
| 307 |
+
if i % 10 == 0:
|
| 308 |
+
save_json(VIS_FILE, out_bills)
|
| 309 |
+
save_json(CACHE_FILE, self.cache)
|
| 310 |
+
|
| 311 |
+
save_json(VIS_FILE, out_bills)
|
| 312 |
+
save_json(CACHE_FILE, self.cache)
|
| 313 |
+
|
| 314 |
+
logger.info("IAPP migration complete.")
|
| 315 |
+
logger.info(f"Total: {total} | reused_cache: {reused_cache} | reused_visualize: {reused_vis} | computed: {computed} | no_text: {skipped_no_text} | errors: {errors}")
|
| 316 |
+
print("✅ IAPP categories migration completed successfully!")
|
| 317 |
+
print(f" Total: {total}")
|
| 318 |
+
print(f" Reused (cache): {reused_cache}")
|
| 319 |
+
print(f" Reused (visualize match): {reused_vis}")
|
| 320 |
+
print(f" Newly computed: {computed}")
|
| 321 |
+
print(f" No text: {skipped_no_text}")
|
| 322 |
+
print(f" Fallback/errors: {errors}")
|
| 323 |
+
print(f" Results: {VIS_FILE}")
|
| 324 |
+
print(f" Cache: {CACHE_FILE}")
|
| 325 |
+
|
| 326 |
+
def _compute_categories(self, bill: Dict) -> Optional[Dict]:
    """Classify a bill into IAPP categories via the LLM chain.

    Invokes the chain up to two times (one retry, since LLM output is
    occasionally malformed). If both attempts fail, falls back to
    FALLBACK_CATEGORIES when the bill text mentions AI, otherwise to a
    dict of empty category lists.

    Args:
        bill: Bill record; must provide the fields used by docs_from_bill()
            and may carry a "text" field (str or None).

    Returns:
        A categories dict, or None when the bill yields no documents or
        the LLM call raised.
    """
    try:
        docs = self.docs_from_bill(bill)
        if not docs:
            return None

        # Up to 2 attempts; a single retry usually recovers from a
        # malformed/unparsable LLM response. (Previously this was two
        # copy-pasted invoke/parse blocks.)
        for _ in range(2):
            resp = self.chain.invoke({"context": docs})
            parsed = self.parse_llm(resp)
            if parsed and self.is_valid_categories(parsed):
                return parsed

        # Both attempts failed: pick a fallback based on the raw text.
        # Guard against "text" being None (some bills have no text), which
        # previously raised AttributeError into the generic handler.
        # NOTE: substring check — "ai" also matches words like "maintain";
        # kept as-is to preserve existing behavior.
        txt = (bill.get("text") or "").lower()
        if "ai" in txt or "artificial intelligence" in txt:
            return FALLBACK_CATEGORIES
        return {"Governance": [], "Transparency": [], "Assurance": [], "Individual Rights": []}
    except Exception as e:
        logger.exception(f"LLM error for {bill_key(bill)}: {e}")
        return None
|
| 346 |
+
|
| 347 |
+
def main():
    """CLI entry point: parse arguments and run the IAPP migration."""
    arg_parser = argparse.ArgumentParser()
    arg_parser.add_argument(
        "--force",
        action="store_true",
        help="Recompute categories for all bills with text",
    )
    arg_parser.add_argument(
        "--rebuild-cache",
        action="store_true",
        help="Ignore existing cache file and rebuild it",
    )
    arg_parser.add_argument(
        "--sleep-sec",
        type=float,
        default=0.0,
        help="Sleep seconds between LLM calls (rate limiting)",
    )
    opts = arg_parser.parse_args()

    IAPPCategoriesMigrator(
        force=opts.force,
        rebuild_cache=opts.rebuild_cache,
        sleep_sec=opts.sleep_sec,
    ).run()
|
| 356 |
+
|
| 357 |
+
# Script entry point: run the IAPP categories migration from the CLI.
if __name__ == "__main__":
    main()
|
data/generate_password_hash.py
ADDED
|
@@ -0,0 +1,135 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Password Hash Generator for streamlit-authenticator
|
| 4 |
+
|
| 5 |
+
Usage:
|
| 6 |
+
python generate_password_hash.py
|
| 7 |
+
|
| 8 |
+
This will prompt you for a password and generate the bcrypt hash.
|
| 9 |
+
"""
|
| 10 |
+
|
| 11 |
+
import streamlit_authenticator as stauth
|
| 12 |
+
|
| 13 |
+
def generate_hash():
    """Interactively hash a single password and print config snippets.

    Prompts for a password (input is NOT echoed to the terminal), hashes it
    with streamlit-authenticator's bcrypt-based Hasher, and prints
    ready-to-paste snippets for both config.yaml and secrets.toml.
    """
    # Local import: getpass is only needed for this interactive flow.
    import getpass

    print("=" * 50)
    print("Password Hash Generator")
    print("=" * 50)
    print()

    # Read the password without echoing it (previously plain input(),
    # which displayed the password on screen).
    password = getpass.getpass("Enter password to hash: ").strip()

    if not password:
        print("Password cannot be empty!")
        return

    # Generate hash
    print("\nGenerating hash...")
    hashed_passwords = stauth.Hasher([password]).generate()
    hash_value = hashed_passwords[0]

    print("\nHash generated successfully!")
    print("=" * 50)
    print(f"\nYour hashed password:\n{hash_value}")
    print("=" * 50)

    # Show example usage
    print("\nAdd to config.yaml:")
    print("-" * 50)
    print(f"""
credentials:
  usernames:
    username_here:
      email: user@example.com
      name: User Name
      password: {hash_value}
""")

    print("\nOr add to secrets.toml:")
    print("-" * 50)
    print(f"""
[auth.credentials.usernames.username_here]
email = "user@example.com"
name = "User Name"
password = "{hash_value}"
""")

    print("\nDone! Copy the hash above to your config file.")
|
| 58 |
+
|
| 59 |
+
def generate_multiple():
    """Interactively collect several users and print hashed credentials.

    Repeatedly prompts for username/password/email/name until an empty
    username is entered, hashes all passwords in one batch (passwords are
    read without terminal echo), and prints the resulting credentials in
    both config.yaml and secrets.toml formats.
    """
    # Local import: getpass is only needed for this interactive flow.
    import getpass

    print("=" * 50)
    print("Multiple User Password Hash Generator")
    print("=" * 50)
    print()

    users = {}

    while True:
        username = input("\nEnter username (or press Enter to finish): ").strip()
        if not username:
            break

        # Read the password without echoing it (previously plain input()).
        password = getpass.getpass(f"Enter password for {username}: ").strip()
        if not password:
            print("Password cannot be empty! Skipping user.")
            continue

        email = input(f"Enter email for {username}: ").strip()
        name = input(f"Enter full name for {username}: ").strip()

        users[username] = {
            'password': password,
            'email': email or f"{username}@example.com",
            'name': name or username.title()
        }

    if not users:
        print("\n No users to process!")
        return

    # Generate all hashes in one batch (dict order is insertion order,
    # so hashes line up with the users that provided them).
    print("\nGenerating hashes...")
    passwords = [data['password'] for data in users.values()]
    hashed_passwords = stauth.Hasher(passwords).generate()

    # Attach each hash to its user (zip instead of index bookkeeping).
    for data, hashed in zip(users.values(), hashed_passwords):
        data['hashed'] = hashed

    # Display results
    print("\nHashes generated successfully!")
    print("=" * 50)

    print("\nconfig.yaml format:")
    print("-" * 50)
    print("credentials:")
    print("  usernames:")
    for username, data in users.items():
        print(f"    {username}:")
        print(f"      email: {data['email']}")
        print(f"      name: {data['name']}")
        print(f"      password: {data['hashed']}")

    print("\nsecrets.toml format:")
    print("-" * 50)
    for username, data in users.items():
        print(f"[auth.credentials.usernames.{username}]")
        print(f'email = "{data["email"]}"')
        print(f'name = "{data["name"]}"')
        print(f'password = "{data["hashed"]}"')
        print()

    print("Done! Copy the configuration above to your config file.")
|
| 123 |
+
|
| 124 |
+
if __name__ == "__main__":
    # Interactive menu: single hash vs. batch of users.
    print("\nChoose an option:")
    print("1. Generate single password hash")
    print("2. Generate multiple user hashes")
    choice = input("\nEnter choice (1 or 2): ").strip()

    # Dispatch table instead of an if/elif chain.
    actions = {"1": generate_hash, "2": generate_multiple}
    action = actions.get(choice)
    if action is not None:
        action()
    else:
        print("Invalid choice!")
|
data/huggingface_upload.py
ADDED
|
@@ -0,0 +1,251 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
HuggingFace Dataset Upload Module
|
| 3 |
+
- Tests HF connection
|
| 4 |
+
- Uploads known_bills_visualize.json (legacy function)
|
| 5 |
+
- Uploads ALL core data JSONs (new function) to HuggingFace Datasets Hub
|
| 6 |
+
Works with the Admin panel HuggingFace tab
|
| 7 |
+
"""
|
| 8 |
+
|
| 9 |
+
from huggingface_hub import HfApi, create_repo
|
| 10 |
+
import streamlit as st
|
| 11 |
+
import os
|
| 12 |
+
import json
|
| 13 |
+
from pathlib import Path
|
| 14 |
+
from typing import Dict, List, Tuple, Optional
|
| 15 |
+
|
| 16 |
+
|
| 17 |
+
FILES_TO_UPLOAD = {
|
| 18 |
+
"data/known_bills_visualize.json": "known_bills_visualize.json",
|
| 19 |
+
"data/bill_summaries.json": "bill_summaries.json",
|
| 20 |
+
"data/bill_suggested_questions.json": "bill_suggested_questions.json",
|
| 21 |
+
"data/bill_reports.json": "bill_reports.json",
|
| 22 |
+
"data/bill_cache.json": "bill_cache.json",
|
| 23 |
+
"data/known_bills.json": "known_bills.json",
|
| 24 |
+
"data/known_bills_fixed.json": "known_bills_fixed.json",
|
| 25 |
+
}
|
| 26 |
+
|
| 27 |
+
|
| 28 |
+
|
| 29 |
+
def _get_hf_token_and_repo() -> Tuple[str, str]:
    """Resolve the HuggingFace token and dataset repo id.

    Resolution order:
      1. Streamlit secrets (Admin UI): [huggingface.token] / [huggingface.dataset_repo]
      2. Environment variables (CLI scripts like update_data.py):
         HUGGINGFACE_HUB_TOKEN / HF_REPO_ID

    Raises:
        KeyError: when either value cannot be resolved from any source.
    """
    token, repo_id = None, None

    # Prefer Streamlit secrets; accessing them outside a Streamlit context
    # (or when the section is absent) raises, which we treat as "not set".
    try:
        token = st.secrets["huggingface"]["token"]
        repo_id = st.secrets["huggingface"]["dataset_repo"]
    except Exception:
        pass

    # Fall back to environment variables for anything still unset.
    token = token or os.getenv("HUGGINGFACE_HUB_TOKEN")
    repo_id = repo_id or os.getenv("HF_REPO_ID")

    if not (token and repo_id):
        raise KeyError(
            "HuggingFace configuration missing. "
            "Provide either Streamlit secrets "
            "[huggingface.token] and [huggingface.dataset_repo] "
            "or environment variables HUGGINGFACE_HUB_TOKEN and HF_REPO_ID."
        )

    return token, repo_id
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
|
| 65 |
+
def test_hf_connection() -> Tuple[bool, str]:
    """Check that the configured HuggingFace token can authenticate.

    Returns:
        tuple: (success: bool, human-readable status message)
    """
    try:
        token, _ = _get_hf_token_and_repo()
        who = HfApi().whoami(token=token)
        # Fall through a few identity fields; HF responses vary by account type.
        display = who.get("name") or who.get("fullname") or who.get("id") or "User"
        return True, f"Connected as: {display}"
    except KeyError:
        # Raised by _get_hf_token_and_repo when configuration is absent.
        return False, "HuggingFace token or dataset_repo not found in secrets"
    except Exception as e:
        return False, f"Connection failed: {str(e)}"
|
| 82 |
+
|
| 83 |
+
|
| 84 |
+
def get_dataset_url(filename: str = "known_bills_visualize.json") -> Optional[str]:
    """
    Get the public URL of a file inside the dataset.

    Args:
        filename: Name of the file in the HF dataset repo.

    Returns:
        str | None: URL to the dataset file, or None if config missing
    """
    repo = None
    # Try Streamlit secrets first; any failure (missing section, running
    # outside Streamlit) falls through to the env var, mirroring
    # _get_hf_token_and_repo so CLI callers also get a URL.
    try:
        repo = st.secrets["huggingface"]["dataset_repo"]
    except Exception:
        pass
    if not repo:
        repo = os.getenv("HF_REPO_ID")
    if not repo:
        return None
    # BUG FIX: the URL previously hard-coded the literal "(unknown)"
    # instead of interpolating the requested filename, so every caller
    # received a broken link.
    return f"https://huggingface.co/datasets/{repo}/resolve/main/{filename}"
|
| 99 |
+
|
| 100 |
+
|
| 101 |
+
def _find_and_validate_json(possible_paths: List[Path]) -> Path:
    """
    Return the first existing path from *possible_paths*, after verifying
    that it parses as JSON whose top-level value is a dict or list.

    Args:
        possible_paths: Candidate file locations, checked in order.

    Returns:
        Path: The first candidate that exists.

    Raises:
        FileNotFoundError: If none of the candidates exist.
        ValueError: If the file is not valid JSON, or its top-level value
            is not a dict/list.
    """
    # First existing candidate, in priority order.
    file_path = next((p for p in possible_paths if p.exists()), None)

    if file_path is None:
        raise FileNotFoundError(
            "File not found.\n"
            "Checked locations:\n" + "\n".join(f" - {p}" for p in possible_paths)
        )

    try:
        with open(file_path, "r", encoding="utf-8") as f:
            data = json.load(f)
        if not isinstance(data, (dict, list)):
            raise ValueError("JSON file must contain a dict or list")
    except json.JSONDecodeError as e:
        # Chain the decode error so the original line/column info survives.
        raise ValueError(f"Invalid JSON file: {str(e)}") from e

    return file_path
|
| 127 |
+
|
| 128 |
+
|
| 129 |
+
def _ensure_dataset_exists(api: HfApi, repo_id: str, token: str) -> None:
    """Create the dataset repo if it does not already exist."""
    # Best-effort: exist_ok=True already tolerates a pre-existing repo, and
    # any other failure (network, permissions) will surface more clearly
    # when the subsequent upload_file call fails — so errors here are
    # deliberately swallowed.
    try:
        create_repo(
            repo_id=repo_id,
            repo_type="dataset",
            token=token,
            exist_ok=True,
            private=False,  # dataset is published publicly
        )
    except Exception:
        pass
|
| 141 |
+
|
| 142 |
+
|
| 143 |
+
def upload_to_huggingface() -> str:
    """
    Legacy function: Upload ONLY known_bills_visualize.json to HuggingFace Datasets Hub.
    Used by existing Admin panel code. New code should prefer upload_all_to_huggingface().

    Returns:
        str: Public URL to the uploaded file

    Raises:
        FileNotFoundError: If JSON file doesn't exist
        Exception: If upload fails
    """
    try:
        token, repo_id = _get_hf_token_and_repo()
        api = HfApi()

        _ensure_dataset_exists(api, repo_id, token)

        possible_paths = [
            Path("data/known_bills_visualize.json"),
            Path("known_bills_visualize.json"),
        ]
        file_path = _find_and_validate_json(possible_paths)

        file_size_mb = os.path.getsize(file_path) / (1024 * 1024)

        api.upload_file(
            path_or_fileobj=str(file_path),
            path_in_repo="known_bills_visualize.json",
            repo_id=repo_id,
            repo_type="dataset",
            token=token,
            commit_message=f"Update AI legislation data ({file_size_mb:.2f}MB)",
        )

        return get_dataset_url("known_bills_visualize.json")

    except FileNotFoundError:
        # Propagate unchanged (bare raise instead of "raise e").
        raise
    except KeyError as e:
        # Chain the cause so the original traceback is preserved.
        raise Exception(f"Missing configuration in secrets.toml: {e}") from e
    except Exception as e:
        raise Exception(f"Upload failed: {str(e)}") from e
|
| 187 |
+
|
| 188 |
+
|
| 189 |
+
def upload_all_to_huggingface() -> Dict[str, str]:
    """
    NEW: Upload ALL core JSON files to HuggingFace Datasets Hub.

    Returns:
        dict: mapping from dataset filename -> public URL (for successfully uploaded files)
    """
    token, repo_id = _get_hf_token_and_repo()
    hf_api = HfApi()
    _ensure_dataset_exists(hf_api, repo_id, token)

    results: Dict[str, str] = {}

    for local_path, dest_name in FILES_TO_UPLOAD.items():
        # Check both the configured location and the bare filename.
        candidates = [Path(local_path), Path(dest_name)]

        try:
            source_file = _find_and_validate_json(candidates)
        except FileNotFoundError:
            notice = f"Skipping missing file: {local_path}"
            print(notice)
            st.write(notice)
            continue
        except ValueError as e:
            notice = f"Skipping invalid JSON in {local_path}: {e}"
            print(notice)
            st.write(notice)
            continue

        size_mb = os.path.getsize(source_file) / (1024 * 1024)

        print(f"Uploading {source_file} → {repo_id}/{dest_name} ...")
        hf_api.upload_file(
            path_or_fileobj=str(source_file),
            path_in_repo=dest_name,
            repo_id=repo_id,
            repo_type="dataset",
            token=token,
            commit_message=f"Update {dest_name} ({size_mb:.2f}MB)",
        )

        file_url = get_dataset_url(dest_name)
        if file_url:
            results[dest_name] = file_url

    return results
|
| 236 |
+
|
| 237 |
+
|
| 238 |
+
if __name__ == "__main__":
    # Smoke test: verify credentials, then push every configured file.
    print("Testing HuggingFace connection...")
    ok, message = test_hf_connection()
    print(message)

    if ok:
        print("\nAttempting upload of ALL files...")
        try:
            uploaded = upload_all_to_huggingface()
            print("\nUpload successful!")
            for fname, link in uploaded.items():
                print(f"- {fname}: {link}")
        except Exception as e:
            print(f"\nUpload failed: {e}")
|
data/pages/Admin.py
ADDED
|
@@ -0,0 +1,459 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import streamlit as st
|
| 2 |
+
import streamlit_authenticator as stauth
|
| 3 |
+
from pathlib import Path
|
| 4 |
+
import sys
|
| 5 |
+
import pandas as pd
|
| 6 |
+
import subprocess
|
| 7 |
+
from datetime import datetime
|
| 8 |
+
import os
|
| 9 |
+
from huggingface_upload import upload_all_to_huggingface
|
| 10 |
+
|
| 11 |
+
# Allow imports of project modules
|
| 12 |
+
sys.path.insert(0, str(Path(__file__).parent.parent))
|
| 13 |
+
from user_management import HuggingFaceUserManager, load_user_config
|
| 14 |
+
|
| 15 |
+
st.set_page_config(page_title="Admin Panel", layout="wide", page_icon="🛠️")
|
| 16 |
+
|
| 17 |
+
# CSS
|
| 18 |
+
st.markdown("""
|
| 19 |
+
<style>
|
| 20 |
+
.main .block-container { padding-top: 2rem; max-width: 1200px; }
|
| 21 |
+
h2 { color: #e0e0e0 !important; font-weight: 400 !important; font-size: 1.5rem !important; }
|
| 22 |
+
</style>
|
| 23 |
+
""", unsafe_allow_html=True)
|
| 24 |
+
|
| 25 |
+
# CONFIG
|
| 26 |
+
config, using_hf = load_user_config()
|
| 27 |
+
|
| 28 |
+
if config is None:
|
| 29 |
+
st.error("Authentication configuration not found!")
|
| 30 |
+
st.stop()
|
| 31 |
+
|
| 32 |
+
# AUTH SYSTEM
|
| 33 |
+
authenticator = stauth.Authenticate(
|
| 34 |
+
config['credentials'],
|
| 35 |
+
config['cookie']['name'],
|
| 36 |
+
config['cookie']['key'],
|
| 37 |
+
config['cookie']['expiry_days']
|
| 38 |
+
)
|
| 39 |
+
|
| 40 |
+
try:
|
| 41 |
+
authenticator.login('main')
|
| 42 |
+
except Exception as e:
|
| 43 |
+
st.error(f"Login error: {e}")
|
| 44 |
+
|
| 45 |
+
name = st.session_state.get("name")
|
| 46 |
+
authentication_status = st.session_state.get("authentication_status")
|
| 47 |
+
username = st.session_state.get("username")
|
| 48 |
+
|
| 49 |
+
if authentication_status == False:
|
| 50 |
+
st.error('Username/password is incorrect')
|
| 51 |
+
st.stop()
|
| 52 |
+
|
| 53 |
+
if authentication_status == None:
|
| 54 |
+
st.warning('Please enter your username and password')
|
| 55 |
+
st.stop()
|
| 56 |
+
|
| 57 |
+
# AUTH VIEW
|
| 58 |
+
if authentication_status:
|
| 59 |
+
|
| 60 |
+
with st.sidebar:
|
| 61 |
+
st.markdown("---")
|
| 62 |
+
st.markdown(f"**Logged in as:** {name}")
|
| 63 |
+
st.markdown(f"**Username:** {username}")
|
| 64 |
+
authenticator.logout('Logout', 'sidebar')
|
| 65 |
+
|
| 66 |
+
ALLOWED_USERNAMES = set(config['credentials']['usernames'].keys())
|
| 67 |
+
if username not in ALLOWED_USERNAMES:
|
| 68 |
+
st.error(f"User '{username}' is not authorized.")
|
| 69 |
+
st.stop()
|
| 70 |
+
|
| 71 |
+
# HEADER
|
| 72 |
+
st.success(f"Welcome, {name}!")
|
| 73 |
+
st.markdown("---")
|
| 74 |
+
st.markdown("""
|
| 75 |
+
<div style='text-align: center; padding: 1rem 0 2rem 0;'>
|
| 76 |
+
<h1 style='color: #1f2937;'>Admin Panel</h1>
|
| 77 |
+
<p style='color: #6b7280;'>Cloud data sync controls</p>
|
| 78 |
+
</div>
|
| 79 |
+
""", unsafe_allow_html=True)
|
| 80 |
+
st.markdown("---")
|
| 81 |
+
|
| 82 |
+
# Tabs
|
| 83 |
+
tab1, tab2, tab3 = st.tabs(["Dashboard", "Data Pipeline", "User Management"])
|
| 84 |
+
|
| 85 |
+
# ------------------------------------------------------------------
|
| 86 |
+
# TAB 1 — Dashboard
|
| 87 |
+
# ------------------------------------------------------------------
|
| 88 |
+
with tab1:
|
| 89 |
+
st.subheader("Admin Dashboard")
|
| 90 |
+
|
| 91 |
+
users = config['credentials']['usernames']
|
| 92 |
+
admin_data = [
|
| 93 |
+
{
|
| 94 |
+
"Username": uname,
|
| 95 |
+
"Name": data.get("name"),
|
| 96 |
+
"Email": data.get("email"),
|
| 97 |
+
"Current User": "Admin" if uname == username else ""
|
| 98 |
+
}
|
| 99 |
+
for uname, data in users.items()
|
| 100 |
+
]
|
| 101 |
+
|
| 102 |
+
st.dataframe(pd.DataFrame(admin_data), width="stretch", hide_index=True)
|
| 103 |
+
|
| 104 |
+
# ------------------------------------------------------------------
|
| 105 |
+
# TAB 2 — DATA PIPELINE
|
| 106 |
+
# ------------------------------------------------------------------
|
| 107 |
+
with tab2:
|
| 108 |
+
st.subheader("Data Pipeline")
|
| 109 |
+
|
| 110 |
+
if 'huggingface' not in st.secrets:
|
| 111 |
+
st.warning("Add HuggingFace credentials to `.streamlit/secrets.toml`")
|
| 112 |
+
st.stop()
|
| 113 |
+
|
| 114 |
+
from huggingface_upload import upload_to_huggingface, test_hf_connection
|
| 115 |
+
|
| 116 |
+
# --- Connection Test
|
| 117 |
+
st.markdown("Connection Status")
|
| 118 |
+
col1, col2 = st.columns(2)
|
| 119 |
+
|
| 120 |
+
with col1:
|
| 121 |
+
if st.button("Test HuggingFace Connection", width='stretch'):
|
| 122 |
+
ok, msg = test_hf_connection()
|
| 123 |
+
(st.success if ok else st.error)(msg)
|
| 124 |
+
|
| 125 |
+
with col2:
|
| 126 |
+
repo = st.secrets["huggingface"]["dataset_repo"]
|
| 127 |
+
st.info(f"Dataset: {repo}")
|
| 128 |
+
|
| 129 |
+
st.markdown("---")
|
| 130 |
+
|
| 131 |
+
# --- Full Data Update Section
|
| 132 |
+
st.subheader("Full Data Update")
|
| 133 |
+
st.info("Pull new data, process PDFs, generate embeddings, and upload to HuggingFace.")
|
| 134 |
+
|
| 135 |
+
# ➤ NEW UI CONTROL — Pull new data?
|
| 136 |
+
pull_new_data = st.radio(
|
| 137 |
+
"Pull new data from LegiScan?",
|
| 138 |
+
options=[
|
| 139 |
+
("no", "No - Use existing local data"),
|
| 140 |
+
("yes", "Yes - Pull fresh data (costs API quota)"),
|
| 141 |
+
],
|
| 142 |
+
format_func=lambda x: x[1],
|
| 143 |
+
index=0,
|
| 144 |
+
key="pull_option"
|
| 145 |
+
)
|
| 146 |
+
|
| 147 |
+
# ➤ NEW UI CONTROL — overwrite known_bills.json?
|
| 148 |
+
overwrite_pdf = st.radio(
|
| 149 |
+
"After fixing PDF bills, overwrite data/known_bills.json?",
|
| 150 |
+
options=[
|
| 151 |
+
("no", "No - keep original file"),
|
| 152 |
+
("yes", "Yes - overwrite with cleaned PDF text"),
|
| 153 |
+
],
|
| 154 |
+
format_func=lambda x: x[1],
|
| 155 |
+
index=0,
|
| 156 |
+
key="overwrite_option"
|
| 157 |
+
)
|
| 158 |
+
|
| 159 |
+
# Run full update
|
| 160 |
+
if st.button("Run Full Update & Upload", type="primary", width='stretch'):
|
| 161 |
+
status_container = st.container()
|
| 162 |
+
|
| 163 |
+
with status_container:
|
| 164 |
+
st.markdown("### Step 1: Running Data Pipeline")
|
| 165 |
+
|
| 166 |
+
with st.status("Processing data...", expanded=True) as status:
|
| 167 |
+
|
| 168 |
+
try:
|
| 169 |
+
update_cmd = [sys.executable, "update_data.py"]
|
| 170 |
+
legiscan_answer = "y\n" if pull_new_data[0] == "yes" else "n\n"
|
| 171 |
+
|
| 172 |
+
import os
|
| 173 |
+
from dotenv import load_dotenv
|
| 174 |
+
load_dotenv()
|
| 175 |
+
|
| 176 |
+
env = os.environ.copy()
|
| 177 |
+
|
| 178 |
+
# Pass OpenAI keys (existing logic)
|
| 179 |
+
openai_key = (
|
| 180 |
+
st.secrets.get("openai_api_key")
|
| 181 |
+
or st.secrets.get("OPENAI_API_KEY")
|
| 182 |
+
or env.get("openai_api_key")
|
| 183 |
+
or env.get("OPENAI_API_KEY")
|
| 184 |
+
)
|
| 185 |
+
|
| 186 |
+
if openai_key:
|
| 187 |
+
env["OPENAI_API_KEY"] = openai_key
|
| 188 |
+
env["openai_api_key"] = openai_key
|
| 189 |
+
st.success("OpenAI key found")
|
| 190 |
+
else:
|
| 191 |
+
st.warning("OpenAI API key missing!")
|
| 192 |
+
|
| 193 |
+
# ➤ NEW: Pass PDF overwrite decision into environment
|
| 194 |
+
env["FIX_PDF_OVERWRITE"] = (
|
| 195 |
+
"yes" if overwrite_pdf[0] == "yes" else "no"
|
| 196 |
+
)
|
| 197 |
+
|
| 198 |
+
log_file = Path("pipeline_last_run.log")
|
| 199 |
+
|
| 200 |
+
with log_file.open("w", encoding="utf-8") as lf:
|
| 201 |
+
proc = subprocess.Popen(
|
| 202 |
+
update_cmd,
|
| 203 |
+
stdout=subprocess.PIPE,
|
| 204 |
+
stderr=subprocess.STDOUT,
|
| 205 |
+
stdin=subprocess.PIPE,
|
| 206 |
+
text=True,
|
| 207 |
+
bufsize=1,
|
| 208 |
+
env=env,
|
| 209 |
+
)
|
| 210 |
+
|
| 211 |
+
# Send LegiScan yes/no
|
| 212 |
+
try:
|
| 213 |
+
proc.stdin.write(legiscan_answer)
|
| 214 |
+
proc.stdin.write("n\n") # continue-on-error prompt
|
| 215 |
+
proc.stdin.flush()
|
| 216 |
+
proc.stdin.close()
|
| 217 |
+
except:
|
| 218 |
+
pass
|
| 219 |
+
|
| 220 |
+
# Stream output
|
| 221 |
+
for line in proc.stdout:
|
| 222 |
+
line = line.rstrip("\n")
|
| 223 |
+
st.text(line)
|
| 224 |
+
lf.write(line + "\n")
|
| 225 |
+
|
| 226 |
+
rc = proc.wait()
|
| 227 |
+
|
| 228 |
+
if rc == 0:
|
| 229 |
+
status.update(label="Data pipeline completed", state="complete")
|
| 230 |
+
st.success("Processing successful!")
|
| 231 |
+
|
| 232 |
+
st.markdown("---")
|
| 233 |
+
st.markdown("### Step 2: Uploading to HuggingFace")
|
| 234 |
+
|
| 235 |
+
with st.spinner("Uploading..."):
|
| 236 |
+
url = upload_to_huggingface()
|
| 237 |
+
st.success("Uploaded to HuggingFace!")
|
| 238 |
+
st.code(url)
|
| 239 |
+
st.cache_data.clear()
|
| 240 |
+
|
| 241 |
+
else:
|
| 242 |
+
status.update(label="Pipeline failed", state="error")
|
| 243 |
+
st.error(f"Pipeline exited with code {rc}")
|
| 244 |
+
|
| 245 |
+
except Exception as e:
|
| 246 |
+
st.error(f"Pipeline error: {e}")
|
| 247 |
+
st.exception(e)
|
| 248 |
+
|
| 249 |
+
st.markdown("---")
|
| 250 |
+
|
| 251 |
+
with st.expander("Manual Upload Only"):
|
| 252 |
+
st.info("Use this only when skipping update_data.py")
|
| 253 |
+
|
| 254 |
+
if st.button("Upload Existing Data", width='stretch'):
|
| 255 |
+
with st.spinner("Uploading..."):
|
| 256 |
+
url = upload_to_huggingface()
|
| 257 |
+
st.success("Uploaded!")
|
| 258 |
+
st.code(url)
|
| 259 |
+
|
| 260 |
+
|
| 261 |
+
with tab3:
|
| 262 |
+
st.subheader("User Management")
|
| 263 |
+
|
| 264 |
+
if using_hf:
|
| 265 |
+
st.success("Using HuggingFace for persistent user storage")
|
| 266 |
+
|
| 267 |
+
try:
|
| 268 |
+
user_manager = HuggingFaceUserManager()
|
| 269 |
+
|
| 270 |
+
st.markdown("Add New Admin")
|
| 271 |
+
|
| 272 |
+
with st.form("add_user_form"):
|
| 273 |
+
col1, col2 = st.columns(2)
|
| 274 |
+
with col1:
|
| 275 |
+
new_username = st.text_input("Username", key="new_username")
|
| 276 |
+
new_email = st.text_input("Email", key="new_email")
|
| 277 |
+
with col2:
|
| 278 |
+
new_name = st.text_input("Full Name", key="new_name")
|
| 279 |
+
new_password = st.text_input("Password", type="password", key="new_password")
|
| 280 |
+
|
| 281 |
+
submit_add = st.form_submit_button("Add Admin", type="primary", width='stretch')
|
| 282 |
+
|
| 283 |
+
if submit_add:
|
| 284 |
+
if not all([new_username, new_email, new_name, new_password]):
|
| 285 |
+
st.error("Please fill in all fields")
|
| 286 |
+
else:
|
| 287 |
+
with st.spinner("Adding user..."):
|
| 288 |
+
import bcrypt
|
| 289 |
+
hashed_password = bcrypt.hashpw(new_password.encode(), bcrypt.gensalt()).decode()
|
| 290 |
+
|
| 291 |
+
success, message, commit_url = user_manager.add_user(
|
| 292 |
+
new_username, new_email, new_name, hashed_password
|
| 293 |
+
)
|
| 294 |
+
|
| 295 |
+
if success:
|
| 296 |
+
st.success(f"{message}")
|
| 297 |
+
st.cache_data.clear()
|
| 298 |
+
if commit_url:
|
| 299 |
+
with st.expander("View commit"):
|
| 300 |
+
st.code(commit_url)
|
| 301 |
+
st.rerun()
|
| 302 |
+
else:
|
| 303 |
+
st.error(f"{message}")
|
| 304 |
+
|
| 305 |
+
st.markdown("---")
|
| 306 |
+
|
| 307 |
+
st.markdown("Edit Admin")
|
| 308 |
+
|
| 309 |
+
users = config['credentials']['usernames']
|
| 310 |
+
usernames_list = list(users.keys())
|
| 311 |
+
|
| 312 |
+
with st.form("edit_user_form"):
|
| 313 |
+
user_to_edit = st.selectbox(
|
| 314 |
+
"Select user to edit",
|
| 315 |
+
options=usernames_list,
|
| 316 |
+
key="edit_username"
|
| 317 |
+
)
|
| 318 |
+
|
| 319 |
+
current_user = users.get(user_to_edit, {})
|
| 320 |
+
|
| 321 |
+
st.markdown("**Current Details:**")
|
| 322 |
+
st.text(f"Email: {current_user.get('email', 'N/A')}")
|
| 323 |
+
st.text(f"Name: {current_user.get('name', 'N/A')}")
|
| 324 |
+
|
| 325 |
+
st.markdown("**New Details** (leave blank to keep current):")
|
| 326 |
+
|
| 327 |
+
col1, col2 = st.columns(2)
|
| 328 |
+
with col1:
|
| 329 |
+
new_email = st.text_input("New Email", key="edit_email", placeholder="Leave blank to keep current")
|
| 330 |
+
new_password = st.text_input("New Password", type="password", key="edit_password", placeholder="Leave blank to keep current")
|
| 331 |
+
with col2:
|
| 332 |
+
new_name = st.text_input("New Name", key="edit_name", placeholder="Leave blank to keep current")
|
| 333 |
+
|
| 334 |
+
submit_edit = st.form_submit_button("Update Admin", type="primary", width='stretch')
|
| 335 |
+
|
| 336 |
+
if submit_edit:
|
| 337 |
+
if not any([new_email, new_name, new_password]):
|
| 338 |
+
st.warning("Please enter at least one field to update")
|
| 339 |
+
else:
|
| 340 |
+
with st.spinner("Updating user..."):
|
| 341 |
+
hashed_password = None
|
| 342 |
+
if new_password:
|
| 343 |
+
import bcrypt
|
| 344 |
+
hashed_password = bcrypt.hashpw(new_password.encode(), bcrypt.gensalt()).decode()
|
| 345 |
+
|
| 346 |
+
success, message, commit_url = user_manager.update_user(
|
| 347 |
+
user_to_edit,
|
| 348 |
+
new_email=new_email if new_email else None,
|
| 349 |
+
new_name=new_name if new_name else None,
|
| 350 |
+
new_password=hashed_password
|
| 351 |
+
)
|
| 352 |
+
|
| 353 |
+
if success:
|
| 354 |
+
st.success(f"{message}")
|
| 355 |
+
st.info("Refreshing user data...")
|
| 356 |
+
st.cache_data.clear()
|
| 357 |
+
if commit_url:
|
| 358 |
+
with st.expander("View commit"):
|
| 359 |
+
st.code(commit_url)
|
| 360 |
+
st.info("Please log out and log back in if you changed your own password")
|
| 361 |
+
st.rerun()
|
| 362 |
+
else:
|
| 363 |
+
st.error(f"{message}")
|
| 364 |
+
|
| 365 |
+
st.markdown("---")
|
| 366 |
+
|
| 367 |
+
# Remove user
|
| 368 |
+
st.markdown("Remove Admin")
|
| 369 |
+
|
| 370 |
+
users = config['credentials']['usernames']
|
| 371 |
+
usernames_list = list(users.keys())
|
| 372 |
+
|
| 373 |
+
if len(usernames_list) > 1:
|
| 374 |
+
with st.form("remove_user_form"):
|
| 375 |
+
user_to_remove = st.selectbox(
|
| 376 |
+
"Select user to remove",
|
| 377 |
+
options=usernames_list,
|
| 378 |
+
key="remove_username"
|
| 379 |
+
)
|
| 380 |
+
|
| 381 |
+
st.warning(f"This will permanently delete user: **{user_to_remove}**")
|
| 382 |
+
|
| 383 |
+
confirm_remove = st.checkbox("I confirm I want to remove this user")
|
| 384 |
+
submit_remove = st.form_submit_button("Remove Admin", type="secondary", width='stretch')
|
| 385 |
+
|
| 386 |
+
if submit_remove:
|
| 387 |
+
if not confirm_remove:
|
| 388 |
+
st.error("Please confirm the removal")
|
| 389 |
+
elif user_to_remove == username:
|
| 390 |
+
st.error("You cannot remove yourself!")
|
| 391 |
+
else:
|
| 392 |
+
with st.spinner("Removing user..."):
|
| 393 |
+
success, message, commit_url = user_manager.remove_user(user_to_remove)
|
| 394 |
+
|
| 395 |
+
if success:
|
| 396 |
+
st.success(f"✅ {message}")
|
| 397 |
+
st.cache_data.clear()
|
| 398 |
+
if commit_url:
|
| 399 |
+
with st.expander("View commit"):
|
| 400 |
+
st.code(commit_url)
|
| 401 |
+
st.rerun()
|
| 402 |
+
else:
|
| 403 |
+
st.error(f"{message}")
|
| 404 |
+
else:
|
| 405 |
+
st.info("ℹCannot remove the last admin user")
|
| 406 |
+
|
| 407 |
+
st.markdown("---")
|
| 408 |
+
|
| 409 |
+
# Show current users
|
| 410 |
+
st.markdown("Current Admins")
|
| 411 |
+
for uname, udata in users.items():
|
| 412 |
+
with st.expander(f"{udata.get('name', uname)} (@{uname})"):
|
| 413 |
+
st.write(f"**Email:** {udata.get('email', 'N/A')}")
|
| 414 |
+
st.write(f"**Username:** {uname}")
|
| 415 |
+
st.write(f"**Admin Status:**Admin")
|
| 416 |
+
|
| 417 |
+
if uname == username:
|
| 418 |
+
st.info("This is you!")
|
| 419 |
+
|
| 420 |
+
except Exception as e:
|
| 421 |
+
st.error(f"Error initializing user manager: {e}")
|
| 422 |
+
st.exception(e)
|
| 423 |
+
|
| 424 |
+
else:
|
| 425 |
+
st.warning("Using secrets.toml (read-only)")
|
| 426 |
+
st.info("For persistent user management, add HuggingFace credentials to secrets.toml")
|
| 427 |
+
|
| 428 |
+
with st.expander("How to add users manually"):
|
| 429 |
+
st.markdown("""
|
| 430 |
+
**To add new users when using secrets.toml:**
|
| 431 |
+
|
| 432 |
+
1. **Generate password hash:**
|
| 433 |
+
```bash
|
| 434 |
+
python generate_password_hash.py
|
| 435 |
+
```
|
| 436 |
+
|
| 437 |
+
2. **Add to secrets.toml:**
|
| 438 |
+
```toml
|
| 439 |
+
[auth.credentials.usernames.newuser]
|
| 440 |
+
email = "user@vanderbilt.edu"
|
| 441 |
+
name = "New User"
|
| 442 |
+
password = "$2b$12$HASH_FROM_STEP_1"
|
| 443 |
+
```
|
| 444 |
+
|
| 445 |
+
3. **Update on HuggingFace Spaces** (re-upload secrets.toml)
|
| 446 |
+
|
| 447 |
+
All registered users automatically get admin access.
|
| 448 |
+
""")
|
| 449 |
+
|
| 450 |
+
st.markdown("---")
|
| 451 |
+
|
| 452 |
+
st.markdown("Current Admins")
|
| 453 |
+
if 'credentials' in config and 'usernames' in config['credentials']:
|
| 454 |
+
users = config['credentials']['usernames']
|
| 455 |
+
for uname, udata in users.items():
|
| 456 |
+
with st.expander(f"{udata.get('name', uname)} (@{uname})"):
|
| 457 |
+
st.write(f"**Email:** {udata.get('email', 'N/A')}")
|
| 458 |
+
st.write(f"**Username:** {uname}")
|
| 459 |
+
st.write(f"**Admin Status:Admin")
|
data/update_data.py
ADDED
|
@@ -0,0 +1,64 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""Pipeline orchestrator for the AI-bills dataset.

Runs the data-updating scripts in order, then uploads the resulting JSON
datasets to HuggingFace.  Intended to be executed directly:

    python update_data.py

The user is prompted whether to pull fresh data from LegiScan; answering
"no" skips the first DATA_PULL_SCRIPT_COUNT scripts and reuses the data
already on disk.
"""

import subprocess
import sys

# Full pipeline, in execution order.  The first DATA_PULL_SCRIPT_COUNT
# entries fetch/repair raw data from LegiScan; everything after them works
# from data already on disk.
ALL_SCRIPTS = [
    "data_updating_scripts/get_data.py",
    "data_updating_scripts/fix_pdf_bills.py",
    "data_updating_scripts/known_bills_status.py",
    "data_updating_scripts/migrate_iapp_categories.py",
    "data_updating_scripts/mark_no_text_bills.py",
    "data_updating_scripts/generate_summaries.py",
    "data_updating_scripts/generate_suggested_questions.py",
    "data_updating_scripts/generate_reports.py",
    "data_updating_scripts/eu_vectorstore.py",
]

# Number of leading scripts that make up the "pull new data" phase
# (previously a bare magic index in `all_scripts[2:]`).
DATA_PULL_SCRIPT_COUNT = 2


def select_scripts(response):
    """Map a yes/no answer to the list of scripts to run.

    Args:
        response: Raw user input; surrounding whitespace and case are ignored.

    Returns:
        The full pipeline for "y"/"yes", the pipeline minus the data-pull
        phase for "n"/"no", or None for any other (invalid) answer.
    """
    answer = response.strip().lower()
    if answer in ("y", "yes"):
        return list(ALL_SCRIPTS)
    if answer in ("n", "no"):
        return ALL_SCRIPTS[DATA_PULL_SCRIPT_COUNT:]
    return None


def _run_pipeline(scripts_to_run):
    """Run each script as a subprocess; on failure, ask whether to continue.

    Exits the process with status 1 if the user declines to continue after
    a failed script.
    """
    for script in scripts_to_run:
        print(f"\n--- Running {script} ---")
        print("=" * 50)

        # Run with the same interpreter as this process; output streams
        # straight to the console (no capture), matching interactive use.
        result = subprocess.run([sys.executable, script])

        if result.returncode != 0:
            print(f"\n✗ Script {script} failed with return code {result.returncode}")
            print("Do you want to continue with the remaining scripts? (y/n):")
            if input().strip().lower() not in ("y", "yes"):
                print("Stopping pipeline execution.")
                sys.exit(1)
        else:
            print(f"✓ {script} completed successfully")


def main():
    """Prompt for the data-pull choice, run the pipeline, then upload."""
    print("Do you want to pull new data from LegiScan?")
    print("Enter 'y' or 'yes' to pull new data, or 'n' or 'no' to skip and use existing data:")
    response = input().strip().lower()

    scripts_to_run = select_scripts(response)
    if scripts_to_run is None:
        print(f"\n✗ Invalid response '{response}'. Please run the script again and enter 'y' or 'n'.")
        sys.exit(1)

    if len(scripts_to_run) == len(ALL_SCRIPTS):
        print("\n✓ Will pull new data from LegiScan")
    else:
        print("\n✓ Skipping data pull, using existing data")

    print(f"\nWill run {len(scripts_to_run)} scripts:")
    for script in scripts_to_run:
        print(f"  - {script}")

    print("\n" + "=" * 50)

    _run_pipeline(scripts_to_run)

    print("\n" + "=" * 50)
    print("✓ Pipeline execution completed!")

    print("\nUploading all JSON datasets to HuggingFace…")

    # Imported lazily so this module can be imported (e.g. to reuse
    # select_scripts) without requiring the project's huggingface_upload
    # module and its dependencies to be importable.
    from huggingface_upload import upload_all_to_huggingface

    try:
        upload_all_to_huggingface()
        print("✓ HuggingFace upload complete!")
    except KeyError as e:
        # Missing credentials/keys in the upload configuration.
        print(f"✗ HuggingFace config error: {e}")
    except Exception as e:
        # Best-effort: the pipeline itself already succeeded, so report
        # the upload failure instead of crashing.
        print(f"✗ Upload failed: {e}")


if __name__ == "__main__":
    main()