sidebar_position: 1
displayed_sidebar: developersSidebar

For Developers & Technical Users

Welcome! This section contains technical documentation for developers, data scientists, and system administrators working with Open Navigator.

Platform Scale & Data Volume

Open Navigator processes data at scale across the United States:

| Category | Count | Source |
|---|---|---|
| Total Jurisdictions | 90,000+ | Census Bureau Gazetteer 2024 |
| Counties | 3,144 | All U.S. counties (FIPS coded) |
| Municipalities | 19,500+ | Cities, towns, villages, boroughs |
| Townships | 36,000+ | County subdivisions, census divisions |
| School Districts | 13,000+ | NCES Common Core of Data |
| Nonprofit Organizations | 3,000,000+ | IRS TEOS + ProPublica Nonprofit Explorer |
| State Legislatures | 50 | All U.S. states |
| Video Channels | 50+ | YouTube state legislature channels |
| Meeting Datasets | 1,000+ | MeetingBank, LocalView, City Scrapers |
| .gov Domains | 15,000+ | CISA validated government websites |

Storage & Processing Requirements

Estimated Data Volumes:

  • Meeting Minutes: 10-100 MB per municipality × 1,000+ cities = 10-100 GB
  • Financial Documents: 5-50 MB per jurisdiction × 90,000 = 450 GB - 4.5 TB
  • Nonprofit 990s: 1-5 MB per org × 3M = 3-15 TB
  • Video Content: Variable (streaming recommended over storage)
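
The estimates above are simple products of a per-entity size range and an entity count. A minimal Python sketch that reproduces the arithmetic, assuming decimal units (1 GB = 1,000 MB); the function name is illustrative and not part of the codebase:

```python
def estimate_range_gb(min_mb: float, max_mb: float, entities: int) -> tuple[float, float]:
    """Return the (low, high) storage estimate in GB, using 1 GB = 1,000 MB."""
    return (min_mb * entities / 1000, max_mb * entities / 1000)


# Inputs are the per-entity ranges and counts listed above.
meeting_minutes = estimate_range_gb(10, 100, 1_000)    # 10-100 GB
financial_docs = estimate_range_gb(5, 50, 90_000)      # 450-4,500 GB (450 GB - 4.5 TB)
nonprofit_990s = estimate_range_gb(1, 5, 3_000_000)    # 3,000-15,000 GB (3-15 TB)
```

These are upper-level planning figures only; actual footprints depend on how many entities publish documents at all.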

Medallion Architecture (Delta Lake):

  • Bronze Layer: Raw scraped data (largest storage footprint)
  • Silver Layer: Cleaned/standardized (50-70% compression)
  • Gold Layer: Analyzed/aggregated (90%+ compression)
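
Reading the compression figures above as size reductions relative to the Bronze layer (an interpretation, since the text does not define the baseline), rough layer footprints can be sketched as follows. The helper name and the 1,000 GB Bronze figure are illustrative only.

```python
def reduced_size_gb(bronze_gb: float, reduction_pct: float) -> float:
    """Size left after shrinking `bronze_gb` by `reduction_pct` percent."""
    if not 0 <= reduction_pct <= 100:
        raise ValueError("reduction_pct must be between 0 and 100")
    return bronze_gb * (100 - reduction_pct) / 100


bronze_gb = 1_000  # illustrative Bronze footprint, not a measured value

silver_low = reduced_size_gb(bronze_gb, 70)   # Silver at a 70% reduction
silver_high = reduced_size_gb(bronze_gb, 50)  # Silver at a 50% reduction
gold_max = reduced_size_gb(bronze_gb, 90)     # Gold at a 90% reduction (upper bound for "90%+")
```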

API Rate Limits & Quotas

Free Tier (No Cost):

  • Census Bureau: Unlimited downloads
  • NCES: Unlimited bulk downloads
  • ProPublica API: Respectful use (~1 req/sec suggested)
  • IRS TEOS: Bulk data downloads (monthly updates)
  • CISA .gov Domains: GitHub dataset (updated daily)
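
One way to honor a suggested pace such as ~1 request per second is a small client-side limiter that sleeps between calls. The sketch below uses only the standard library; the class name is hypothetical, and it is not taken from the project's own scraper code.

```python
import time


class MinIntervalLimiter:
    """Block so that successive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval: float = 1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._last_call = None   # time of the previous call; None before the first

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        self._last_call = now + delay
        return delay


# Usage: call limiter.wait() immediately before each outbound request.
# limiter = MinIntervalLimiter(min_interval=1.0)
# for url in urls:
#     limiter.wait()
#     ...  # issue the request here
```

The clock and sleep functions are injectable so the pacing logic can be unit-tested without real delays.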

Paid/Limited:

  • OpenAI API: Pay per token (required for LLM features)
  • Harvard Dataverse: API key recommended (free registration)

:::info[Complete Technical Citations & Standards]
For full citations, licenses, API documentation, and technical specifications:

Citations & Data Sources

Includes:

  • Academic Research: MeetingBank (ACL 2023), LocalView (Harvard), Council Data Project, City Scrapers
  • Government APIs: U.S. Census, NCES, IRS, Open States
  • Standards: OCD-ID (OCDEP 2), Popolo Project, Schema.org, CEDS, OMOP CDM (OHDSI), IATI v2.03
  • Data Models: Microsoft CDM for Nonprofits, OMOP vocabulary system
  • Fact-Checking: N/A (not currently integrated)
  • Nonprofit Data: IRS BMF (43,726 orgs from 5 states)
  • Churches & Faith-Based: 4,372 congregations from IRS data
  • Enterprise Tech: Microsoft (Nonprofit CDM), Google (Data Commons), AWS (Open Data), Databricks (Unity Catalog, MLflow), Snowflake, Salesforce (NPSP)
  • BibTeX citations for academic papers and research use

:::

What You'll Find Here

🚀 Setup & Installation

Get the platform running:

📊 Data Sources (Technical)

Technical details on data ingestion:

🛠️ How-To Guides

Step-by-step technical guides:

🔌 Integrations

Connect external services:

🚀 Deployment

Production deployment:

💻 Development

Contributing and development:

Quick Start (TL;DR)

# Clone and install
git clone https://github.com/getcommunityone/open-navigator-for-engagement.git
cd open-navigator-for-engagement
./install.sh

# Install frontend and docs
cd frontend && npm install && cd ..
cd website && npm install && cd ..

# Start all services
./start-all.sh

# Visit:
# - Main App:  http://localhost:5173
# - API Docs:  http://localhost:8000/docs
# - This Site: http://localhost:3000

Architecture Overview

┌─────────────────────────────────────────┐
│         Open Navigator Platform         │
├─────────────────────────────────────────┤
│                                         │
│   ┌──────────────┐   ┌──────────────┐   │
│   │  React App   │   │  FastAPI     │   │
│   │  (Frontend)  │──▶│  (Backend)   │   │
│   │  Port 5173   │   │  Port 8000   │   │
│   └──────────────┘   └──────┬───────┘   │
│                             │           │
│   ┌─────────────────────────▼───────┐   │
│   │    Delta Lake (Data Storage)    │   │
│   │  • Bronze: Raw data             │   │
│   │  • Silver: Cleaned data         │   │
│   │  • Gold: Analyzed data          │   │
│   └─────────────────────────────────┘   │
└─────────────────────────────────────────┘

Common Tasks

Run Jurisdiction Discovery

source .venv/bin/activate

# Test run (100 jurisdictions)
python main.py discover-jurisdictions --limit 100

# Single state
python main.py discover-jurisdictions --state CA

# Full discovery (~30k jurisdictions)
python main.py discover-jurisdictions

Ingest Reference Data

# Census jurisdictions (90,000+ entities)
python -m discovery.census_ingestion

# NCES school districts (13,000+)
python -m discovery.nces_ingestion

# Pre-built datasets
python discovery/meetingbank_ingestion.py
python discovery/city_scrapers_urls.py
python discovery/openstates_sources.py

Scrape Meeting Minutes

# Batch scraping from discovered sites
python main.py scrape-batch --source discovered --limit 50

# Single jurisdiction
python main.py scrape --url "https://chicago.legistar.com" \
                      --state "IL" \
                      --municipality "Chicago"

Publish to HuggingFace

# Requires HUGGINGFACE_TOKEN in .env
python main.py publish-to-hf --dataset all
python main.py publish-to-hf --dataset discovered-urls
python main.py publish-to-hf --dataset census --sample

Technology Stack

Backend

  • Python 3.11+ - Core language
  • FastAPI - REST API framework
  • Delta Lake - Data lakehouse storage
  • Databricks - Production data platform
  • OpenAI API - LLM capabilities

Frontend

  • React 18 - UI framework
  • Vite - Build tool
  • TypeScript - Type safety
  • Leaflet - Interactive maps

Data Processing

  • Pandas - Data manipulation
  • BeautifulSoup - HTML parsing
  • PyPDF2 - PDF extraction
  • Tesseract OCR - Image to text

Deployment

  • Docker - Containerization
  • tmux - Session management
  • Databricks Apps - Production hosting

API Reference

Start API Server

python main.py serve --host 0.0.0.0 --port 8000

Visit http://localhost:8000/docs for interactive API documentation.

Example: Start Workflow

curl -X POST "http://localhost:8000/workflow/start" \
     -H "Content-Type: application/json" \
     -d '{
       "scrape_targets": [
         {
           "url": "https://chicago.legistar.com",
           "municipality": "Chicago",
           "state": "IL",
           "platform": "legistar"
         }
       ]
     }'
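
The same request can be built from Python with the standard library. This is a sketch under the assumption that the API server is running locally on port 8000; the helper name is illustrative, and the response format is not documented here, so the send step is left commented out.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # local dev server; adjust for other environments


def build_workflow_request(targets: list[dict]) -> urllib.request.Request:
    """Build (but do not send) the POST /workflow/start request."""
    body = json.dumps({"scrape_targets": targets}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/workflow/start",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_workflow_request([
    {
        "url": "https://chicago.legistar.com",
        "municipality": "Chicago",
        "state": "IL",
        "platform": "legistar",
    }
])

# To send it against a running server:
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(json.load(response))
```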

Example: Query Opportunities

curl "http://localhost:8000/opportunities?state=CA&urgency=critical"
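
When building this query from code, encoding the parameters with `urllib.parse.urlencode` avoids manual string concatenation. A minimal sketch; only the `state` and `urgency` filters appear in the example above, so any other filter name would be an assumption about the API.

```python
from urllib.parse import urlencode


def opportunities_url(base: str = "http://localhost:8000", **filters) -> str:
    """Return the /opportunities URL with any non-None filters as query parameters."""
    query = urlencode({key: value for key, value in filters.items() if value is not None})
    return f"{base}/opportunities?{query}" if query else f"{base}/opportunities"


url = opportunities_url(state="CA", urgency="critical")
# url == "http://localhost:8000/opportunities?state=CA&urgency=critical"
```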

Development Workflow

1. Local Development

# Terminal 1: API (with hot reload)
source .venv/bin/activate
python main.py serve --reload

# Terminal 2: Frontend (with hot reload)
cd frontend
npm run dev

# Terminal 3: Documentation
cd website
npm start

2. Testing

# Run all tests
pytest

# With coverage
pytest --cov=agents --cov=pipeline --cov=visualization

# Specific test file
pytest tests/test_agents.py

3. Deployment

# Deploy to Databricks
export DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
export DATABRICKS_TOKEN=dapi...
./scripts/deploy-databricks-app.sh

Data Pipeline

Medallion Architecture

Bronze (Raw)          Silver (Cleaned)       Gold (Analyzed)
────────────────────────────────────────────────────────────
Scraped PDFs     →    Extracted text    →    Classifications
Meeting videos   →    Transcripts       →    Sentiment scores
Budget docs      →    Line items        →    Budget analysis
Form 990s        →    Financial data    →    Spending patterns

File Locations

  • Bronze: data/bronze/ - Raw downloaded files
  • Silver: data/silver/ - Cleaned and standardized
  • Gold: data/gold/ - Enriched with analysis
  • Cache: cache/ - Temporary processing files
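
Scripts that read or write these directories can resolve them in one place rather than hard-coding strings. A small `pathlib` sketch using the locations listed above; the helper name and the example sub-path (`IL/chicago.pdf`) are hypothetical and do not describe the project's actual file layout.

```python
from pathlib import Path

# Directory for each layer, relative to the project root (from the list above).
LAYER_DIRS = {
    "bronze": Path("data/bronze"),
    "silver": Path("data/silver"),
    "gold": Path("data/gold"),
    "cache": Path("cache"),
}


def layer_path(layer: str, *parts: str, root: Path = Path(".")) -> Path:
    """Return the path for `parts` inside the given layer directory."""
    if layer not in LAYER_DIRS:
        raise ValueError(f"unknown layer: {layer!r}")
    return root.joinpath(LAYER_DIRS[layer], *parts)


raw_file = layer_path("bronze", "IL", "chicago.pdf")  # hypothetical sub-path
```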

Configuration

Environment Variables

Create a .env file:

# Required
OPENAI_API_KEY=sk-...

# Optional (for production)
DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
DATABRICKS_TOKEN=dapi...

# Optional (for publishing)
HUGGINGFACE_TOKEN=hf_...

# Optional (for Harvard Dataverse)
DATAVERSE_API_KEY=...
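
A quick pre-flight check can catch a missing required key before any service starts. The sketch below parses simple `KEY=VALUE` lines with the standard library; it is an illustration only, and the project may load `.env` differently (for example through a dotenv library).

```python
REQUIRED = ("OPENAI_API_KEY",)
OPTIONAL = ("DATABRICKS_HOST", "DATABRICKS_TOKEN", "HUGGINGFACE_TOKEN", "DATAVERSE_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_required(env: dict[str, str]) -> list[str]:
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED if not env.get(name)]


# Usage (assumes a .env file in the current directory):
# from pathlib import Path
# problems = missing_required(parse_env(Path(".env").read_text()))
# if problems:
#     raise SystemExit(f"Missing required settings: {', '.join(problems)}")
```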

Settings File

Edit config/settings.py for:

  • Delta Lake paths
  • Scraping rate limits
  • Batch sizes
  • Model configurations

Contributing

1. Fork & Clone

git clone https://github.com/YOUR-USERNAME/open-navigator-for-engagement.git
cd open-navigator-for-engagement
git remote add upstream https://github.com/getcommunityone/open-navigator-for-engagement.git

2. Create Branch

git checkout -b feature/your-feature-name

3. Make Changes

  • Add tests for new features
  • Update documentation
  • Follow existing code style
  • Keep commits focused and atomic

4. Submit PR

git push origin feature/your-feature-name
# Then create PR on GitHub

See CONTRIBUTING.md for details.

Troubleshooting

Port Already in Use

# Find process using port
lsof -i :8000
lsof -i :5173
lsof -i :3000

# Kill process
kill -9 <PID>
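
Where `lsof` is unavailable, a port can also be probed from Python by attempting a TCP connection. A minimal sketch with the standard library; the function name is illustrative. It reports whether something is listening, not which process owns the port.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # Ports used by the API, the frontend, and the docs site.
    for port in (8000, 5173, 3000):
        print(port, "in use" if port_in_use(port) else "free")
```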

Dependencies Not Installing

# Clear cache and reinstall
rm -rf .venv
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

Scraping Failures

Check logs:

tail -f logs/scraper.log

Adjust rate limits in config/settings.py.

Next Steps

  1. Read Architecture → System Design
  2. Set Up Environment → Quick Start
  3. Run Discovery → Jurisdiction Setup
  4. Deploy to Production → Databricks Apps
  5. Contribute → GitHub Issues

Support