Spaces:

CommunityOne
/

open-navigator

Running on CPU Upgrade

File size: 6,969 Bytes

61d29fc

# 🚀 Intel Arc + DuckDB Quick Reference

**Get started with local AI legislative analysis in 5 minutes**

## ⚡ Performance at a Glance

| Task | Standard (Postgres + CPU) | Optimized (DuckDB + Arc GPU) | Speedup |
|------|--------------------------|------------------------------|---------|
| Context injection (100 bills) | 500ms | 20ms | **25x** |
| Vector search (10K records) | 800ms | 18ms | **44x** |
| LLM inference (3B model) | 350 tok/s | 1,200 tok/s | **3.4x** |
| Full testimony analysis | 2,000ms | 80ms | **25x** |

## 🎯 Three-Step Setup

### 1. Install (5 minutes)

```bash
cd /path/to/open-navigator
./scripts/intel_llm_setup.sh
source .venv-intel/bin/activate
```

### 2. Test DuckDB VSS (30 seconds)

```bash
python scripts/duckdb_vss_demo.py
```

Expected output:
```
📊 Creating demo DuckDB database with VSS...
✅ Demo database created!
📈 Results (searching 1,000 bills):
   Average: 18.45ms
🎯 Top 3 most similar bills: ...
```

### 3. Run Analysis (1 minute)

```bash
python scripts/legislative_analysis_intel.py
```

## 🧠 Code Examples

### Example 1: Fast Bill Search

```python
from scripts.legislative_analysis_intel import DuckDBLegislativeAnalyzer

with DuckDBLegislativeAnalyzer() as analyzer:
    # Get bill context in < 50ms
    bill = analyzer.get_bill_context("HB1234")
    testimony = analyzer.get_all_testimony_for_bill("HB1234")
    
    print(f"Bill: {bill['title']}")
    print(f"Testimony records: {len(testimony)}")
```

### Example 2: Vector Similarity Search

```python
import numpy as np

# Your query embedding (384 dimensions from sentence-transformers)
query_embedding = model.encode("water fluoridation policy")

# Fast vector search (< 20ms for 10K bills)
similar_bills = analyzer.search_similar_testimony(
    query_embedding.tolist(),
    limit=10
)

for bill in similar_bills:
    print(f"{bill['bill_id']}: {bill['text'][:100]}... (similarity: {bill['similarity']:.2f})")
```

### Example 3: Extract Interest Groups

```python
from scripts.legislative_analysis_intel import IntelOptimizedLLM, InterestGroup

# Initialize Intel-optimized LLM (uses Arc GPU)
llm = IntelOptimizedLLM(model_name="meta-llama/Llama-3.2-3B-Instruct")
llm.load_model(use_openvino=True)  # OpenVINO = best Arc GPU performance

# Extract structured data
groups = llm.extract_interest_groups(bill_context, testimony)

# Results
for group in groups:
    print(f"""
    Group: {group.group_name}
    Lobbyist: {group.lobbyist}
    Stance: {group.stance} (score: {group.stance_score})
    Tradeoffs: {group.tradeoff_notes}
    Confidence: {group.confidence}
    """)
```

### Example 4: Query Hugging Face Datasets Directly

```python
import duckdb

conn = duckdb.connect()

# No download needed - streams from HF!
df = conn.execute("""
    SELECT * 
    FROM read_parquet(
        'hf://datasets/CommunityOne/states-al-nonprofits-locations/data/train-*.parquet'
    )
    WHERE city = 'Birmingham'
    LIMIT 100
""").fetchdf()

print(f"Found {len(df)} organizations in Birmingham, AL")
```

## 🎨 Output Schema

**Interest Group Extraction:**

```json
{
  "groups": [
    {
      "group_name": "Alabama Dental Association",
      "lobbyist": "John Smith, DDS",
      "stance": "conditional",
      "stance_score": 0.6,
      "tradeoff_notes": "Support if Section 4 amended to include rural exemption and phased implementation timeline",
      "testimony_excerpt": "While we have concerns about Section 4's implementation timeline, we support the overall goals if rural communities receive proper resources...",
      "bill_id": "HB1234",
      "confidence": 0.85
    },
    {
      "group_name": "Sierra Club Alabama Chapter",
      "lobbyist": null,
      "stance": "oppose",
      "stance_score": -0.9,
      "tradeoff_notes": null,
      "testimony_excerpt": "We strongly oppose this bill due to environmental concerns...",
      "bill_id": "HB1234",
      "confidence": 0.92
    }
  ]
}
```

## 🔧 Environment Variables

```bash
# Enable Intel GPU
export ZES_ENABLE_SYSMAN=1

# Ollama GPU usage (if using Ollama)
export OLLAMA_NUM_GPU=999

# IPEX-LLM optimizations
export IPEX_LLM_NUM_GPU=1
export ONEAPI_DEVICE_SELECTOR=level_zero:0
```

## 💡 Best Practices

### 1. Cache Embeddings

**DON'T** recompute every time:
```python
# Slow - recomputes embeddings every run
for bill in bills:
    embedding = model.encode(bill['text'])
    analyze(embedding)
```

**DO** cache in DuckDB:
```python
# Fast - compute once, reuse forever
conn.execute("""
    CREATE TABLE bill_embeddings AS
    SELECT bill_id, embedding
    FROM ... -- computed once
""")

# Then just query
similar = conn.execute("""
    SELECT * FROM bill_embeddings
    ORDER BY array_distance(embedding, ?) 
    LIMIT 10
""", [query]).fetchall()
```

### 2. Batch Processing

**DON'T** process one at a time:
```python
for bill_id in bill_ids:  # Slow!
    result = analyze_single_bill(bill_id)
```

**DO** batch efficiently:
```python
# Fast - processes 100 bills in parallel
results = llm.extract_interest_groups_batch(
    bill_contexts=bills,
    testimony_batches=all_testimony,
    batch_size=32  # Fits in Arc GPU memory
)
```

### 3. Monitor GPU Usage

```bash
# Linux: intel_gpu_top
sudo apt install intel-gpu-tools
intel_gpu_top

# Windows: Task Manager → Performance → GPU
# Look for "GPU 0 - Intel Arc Graphics"
```

## 🐛 Troubleshooting

### Issue: "ModuleNotFoundError: optimum"

```bash
pip install optimum[openvino]
```

### Issue: Slow inference (still using CPU)

Check device:
```python
import torch
print(f"Device: {torch.cuda.get_device_name(0)}")  # Should show Arc GPU

# Force GPU
model = OVModelForCausalLM.from_pretrained(
    model_name,
    device="GPU"  # Explicitly set
)
```

### Issue: Out of memory

Use smaller model or reduce batch size:
```python
# Use 3B instead of 8B
model_name = "meta-llama/Llama-3.2-3B-Instruct"

# Reduce context
testimony = testimony[:10]  # Top 10 only
```

## 📚 Resources

- **Full Guide**: [website/docs/guides/intel-arc-optimization.md](../website/docs/guides/intel-arc-optimization.md)
- **DuckDB Docs**: https://duckdb.org/docs/
- **Intel IPEX**: https://github.com/intel/intel-extension-for-pytorch
- **OpenVINO**: https://docs.openvino.ai/

## 🎯 Next Steps

1. ✅ Run the demo: `python scripts/duckdb_vss_demo.py`
2. ✅ Test analysis: `python scripts/legislative_analysis_intel.py`
3. 📚 Read full guide: [Intel Arc Optimization Guide](../website/docs/guides/intel-arc-optimization.md)
4. 🚀 Build your own: Use the `DuckDBLegislativeAnalyzer` class
5. 🤝 Share results: Open an issue with your findings!

## 💬 Questions?

- **GitHub Issues**: https://github.com/getcommunityone/open-navigator/issues
- **Documentation**: https://www.communityone.com/docs
- **Intel AI Forums**: https://community.intel.com/t5/Intel-AI-Analytics-and/bd-p/software-ai

---

**Built with ❤️ for Data Engineering Managers who want local, private, fast legislative intelligence.**