open-navigator / website /docs /for-developers.md

jcbowyer

Clean HuggingFace deployment without binary files

61d29fc 28 days ago


sidebar_position: 1
displayed_sidebar: developersSidebar

For Developers & Technical Users

Welcome! This section contains technical documentation for developers, data scientists, and system administrators working with Open Navigator.

Platform Scale & Data Volume

Open Navigator processes data at scale across the United States:

| Category | Count | Source |
|---|---|---|
| Total Jurisdictions | 90,000+ | Census Bureau Gazetteer 2024 |
| Counties | 3,144 | All U.S. counties (FIPS coded) |
| Municipalities | 19,500+ | Cities, towns, villages, boroughs |
| Townships | 36,000+ | County subdivisions, census divisions |
| School Districts | 13,000+ | NCES Common Core of Data |
| Nonprofit Organizations | 3,000,000+ | IRS TEOS + ProPublica Nonprofit Explorer |
| State Legislatures | 50 | All U.S. states |
| Video Channels | 50+ | YouTube state legislature channels |
| Meeting Datasets | 1,000+ | MeetingBank, LocalView, City Scrapers |
| .gov Domains | 15,000+ | CISA validated government websites |

Storage & Processing Requirements

Estimated Data Volumes:

Meeting Minutes: 10-100 MB per municipality × 1,000+ cities = 10-100 GB
Financial Documents: 5-50 MB per jurisdiction × 90,000 = 450 GB - 4.5 TB
Nonprofit 990s: 1-5 MB per org × 3M = 3-15 TB
Video Content: Variable (streaming recommended over storage)
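
The arithmetic behind these ranges can be checked with a short script. This is only a back-of-envelope sketch: the per-entity sizes and counts come from the list above, while the function and dictionary names are illustrative and not part of the Open Navigator codebase. It uses decimal units (1 GB = 1,000 MB), which is what the figures above imply.

```python
# Back-of-envelope storage estimate using the per-entity sizes listed above.
MB_PER_GB = 1_000          # decimal units, matching the figures in the list
MB_PER_TB = 1_000_000

# (low MB per entity, high MB per entity, entity count)
CATEGORIES = {
    "meeting_minutes": (10, 100, 1_000),
    "financial_documents": (5, 50, 90_000),
    "nonprofit_990s": (1, 5, 3_000_000),
}


def estimate_range_mb(low_mb, high_mb, count):
    """Return the (low, high) total size in MB for `count` entities."""
    return low_mb * count, high_mb * count


def format_size(mb):
    """Render a size in MB as GB or TB, whichever reads better."""
    if mb >= MB_PER_TB:
        return f"{mb / MB_PER_TB:g} TB"
    return f"{mb / MB_PER_GB:g} GB"


totals = {name: estimate_range_mb(*spec) for name, spec in CATEGORIES.items()}
for name, (low, high) in totals.items():
    print(f"{name}: {format_size(low)} - {format_size(high)}")
```

Changing a count or a per-entity size in `CATEGORIES` is enough to re-run the estimate for a subset, for example a single state.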

Medallion Architecture (Delta Lake):

Bronze Layer: Raw scraped data (largest storage footprint)
Silver Layer: Cleaned/standardized (50-70% compression)
Gold Layer: Analyzed/aggregated (90%+ compression)
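
To get a feel for what these percentages mean for capacity planning, the sketch below converts a Bronze footprint into rough Silver and Gold sizes. It assumes the percentages are size reductions relative to the Bronze layer, which the list above does not state explicitly, and the 1,000 GB Bronze figure is purely hypothetical.

```python
def layer_size_gb(bronze_gb, reduction_pct):
    """Size remaining after shrinking `bronze_gb` by `reduction_pct` percent."""
    if not 0 <= reduction_pct <= 100:
        raise ValueError("reduction_pct must be between 0 and 100")
    return bronze_gb * (100 - reduction_pct) / 100


bronze_gb = 1_000  # hypothetical Bronze footprint, for illustration only

# Assumption: reductions are measured against Bronze, not the previous layer.
silver_low_gb = layer_size_gb(bronze_gb, 70)   # best case for Silver
silver_high_gb = layer_size_gb(bronze_gb, 50)  # worst case for Silver
gold_max_gb = layer_size_gb(bronze_gb, 90)     # upper bound for Gold ("90%+")

print(f"Silver: {silver_low_gb:g}-{silver_high_gb:g} GB, Gold: at most {gold_max_gb:g} GB")
```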

API Rate Limits & Quotas

Free Tier (No Cost):

Census Bureau: Unlimited downloads
NCES: Unlimited bulk downloads
ProPublica API: Respectful use (~1 req/sec suggested)
IRS TEOS: Bulk data downloads (monthly updates)
CISA .gov Domains: GitHub dataset (updated daily)
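
One way to honor a "roughly one request per second" guideline such as the ProPublica suggestion above is a small client-side throttle. The class below is a minimal sketch and not the platform's own rate limiter; the `Throttle` name and its parameters are illustrative.

```python
import time


class Throttle:
    """Block so that successive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._last = None      # time of the previous call; None before the first

    def wait(self):
        """Sleep if the previous call was too recent. Returns seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling `throttle.wait()` immediately before each outgoing request keeps a single-threaded client at or below one request per `min_interval` seconds.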

Paid/Limited:

OpenAI API: Pay per token (required for LLM features)
Harvard Dataverse: API key recommended (free registration)

:::info[Complete Technical Citations & Standards]
For full citations, licenses, API documentation, and technical specifications, see:

Citations & Data Sources

Includes:

Academic Research: MeetingBank (ACL 2023), LocalView (Harvard), Council Data Project, City Scrapers
Government APIs: U.S. Census, NCES, IRS, Open States
Standards: OCD-ID (OCDEP 2), Popolo Project, Schema.org, CEDS, OMOP CDM (OHDSI), IATI v2.03
Data Models: Microsoft CDM for Nonprofits, OMOP vocabulary system
Fact-Checking: N/A (not currently integrated)
Nonprofit Data: IRS BMF (43,726 orgs from 5 states)
Churches & Faith-Based: 4,372 congregations from IRS data
Enterprise Tech: Microsoft (Nonprofit CDM), Google (Data Commons), AWS (Open Data), Databricks (Unity Catalog, MLflow), Snowflake, Salesforce (NPSP)
BibTeX citations for academic papers and research use

:::

What You'll Find Here

🚀 Setup & Installation

Get the platform running:

Quick Start - Detailed installation instructions
Quick Reference - CLI commands cheat sheet
Architecture - System design and components

📊 Data Sources (Technical)

Technical details on data ingestion:

Jurisdiction Discovery - Finding 90,000+ government websites
Census Data - Ingesting Census Bureau datasets
HuggingFace Datasets - Pre-built meeting collections
YouTube Discovery - Video channel scraping

🛠️ How-To Guides

Step-by-step technical guides:

Jurisdiction Setup - Configure discovery for your area
HuggingFace Publishing - Publish datasets to HuggingFace Hub
Handling Formats - Process different document types
Scraper Improvements - Enhance scraping capabilities

🔌 Integrations

Connect external services:

Dataverse Integration - Harvard Dataverse API
Frontend Integration - React application setup
LocalView - LocalView dataset ingestion

🚀 Deployment

Production deployment:

Databricks Apps - Deploy to Databricks
Scale Deployment - Handle large datasets
Cost Management - Optimize expenses

💻 Development

Contributing and development:

Changelog - Version history
Migration Guides - Upgrading between versions
Refactoring Summary - Recent changes

Quick Start (TL;DR)

# Clone and install
git clone https://github.com/getcommunityone/open-navigator-for-engagement.git
cd open-navigator-for-engagement
./install.sh

# Install frontend and docs
cd frontend && npm install && cd ..
cd website && npm install && cd ..

# Start all services
./start-all.sh

# Visit:
# - Main App:  http://localhost:5173
# - API Docs:  http://localhost:8000/docs
# - This Site: http://localhost:3000

Architecture Overview

┌─────────────────────────────────────────┐
│         Open Navigator Platform         │
├─────────────────────────────────────────┤
│                                         │
│  ┌──────────────┐   ┌──────────────┐  │
│  │  React App   │   │  FastAPI     │  │
│  │  (Frontend)  │──▶│  (Backend)   │  │
│  │  Port 5173   │   │  Port 8000   │  │
│  └──────────────┘   └──────┬───────┘  │
│                             │           │
│  ┌──────────────────────────▼────────┐ │
│  │      Delta Lake (Data Storage)   │ │
│  │  • Bronze: Raw data              │ │
│  │  • Silver: Cleaned data          │ │
│  │  • Gold: Analyzed data           │ │
│  └──────────────────────────────────┘ │
└─────────────────────────────────────────┘

Common Tasks

Run Jurisdiction Discovery

source .venv/bin/activate

# Test run (100 jurisdictions)
python main.py discover-jurisdictions --limit 100

# Single state
python main.py discover-jurisdictions --state CA

# Full discovery (~30k jurisdictions)
python main.py discover-jurisdictions

Ingest Reference Data

# Census jurisdictions (90,000+ entities)
python -m discovery.census_ingestion

# NCES school districts (13,000+)
python -m discovery.nces_ingestion

# Pre-built datasets
python discovery/meetingbank_ingestion.py
python discovery/city_scrapers_urls.py
python discovery/openstates_sources.py

Scrape Meeting Minutes

# Batch scraping from discovered sites
python main.py scrape-batch --source discovered --limit 50

# Single jurisdiction
python main.py scrape --url "https://chicago.legistar.com" \
                      --state "IL" \
                      --municipality "Chicago"

Publish to HuggingFace

# Requires HUGGINGFACE_TOKEN in .env
python main.py publish-to-hf --dataset all
python main.py publish-to-hf --dataset discovered-urls
python main.py publish-to-hf --dataset census --sample

Technology Stack

Backend

Python 3.11+ - Core language
FastAPI - REST API framework
Delta Lake - Data lakehouse storage
Databricks - Production data platform
OpenAI API - LLM capabilities

Frontend

React 18 - UI framework
Vite - Build tool
TypeScript - Type safety
Leaflet - Interactive maps

Data Processing

Pandas - Data manipulation
BeautifulSoup - HTML parsing
PyPDF2 - PDF extraction
Tesseract OCR - Image to text

Deployment

Docker - Containerization
tmux - Session management
Databricks Apps - Production hosting

API Reference

Start API Server

python main.py serve --host 0.0.0.0 --port 8000

Visit http://localhost:8000/docs for interactive API documentation.

Example: Start Workflow

curl -X POST "http://localhost:8000/workflow/start" \
     -H "Content-Type: application/json" \
     -d '{
       "scrape_targets": [
         {
           "url": "https://chicago.legistar.com",
           "municipality": "Chicago",
           "state": "IL",
           "platform": "legistar"
         }
       ]
     }'
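
The same request can be built from Python with only the standard library. This is a sketch that mirrors the curl call above; the helper name `build_workflow_request` is illustrative, and the shape of the response is not documented here, so the example stops at constructing the request.

```python
import json
import urllib.request


def build_workflow_request(targets, base_url="http://localhost:8000"):
    """Build (but do not send) the POST request for /workflow/start."""
    body = json.dumps({"scrape_targets": targets}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/workflow/start",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_workflow_request([
    {
        "url": "https://chicago.legistar.com",
        "municipality": "Chicago",
        "state": "IL",
        "platform": "legistar",
    }
])

# To send it against a running API server:
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
```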

Example: Query Opportunities

curl "http://localhost:8000/opportunities?state=CA&urgency=critical"
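
When the filters come from variables rather than a hand-written URL, it is safer to let the standard library encode the query string. The sketch below only assumes the `state` and `urgency` parameters shown in the curl example; the helper name is illustrative, and other filters may or may not be supported by the API.

```python
from urllib.parse import urlencode


def opportunities_url(base_url="http://localhost:8000", **filters):
    """Return the /opportunities URL with `filters` encoded as query parameters."""
    query = urlencode(filters)
    return f"{base_url}/opportunities" + (f"?{query}" if query else "")


url = opportunities_url(state="CA", urgency="critical")
print(url)

# To fetch it against a running API server:
# import urllib.request
# with urllib.request.urlopen(url) as response:
#     print(response.read().decode("utf-8"))
```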

Development Workflow

1. Local Development

# Terminal 1: API (with hot reload)
source .venv/bin/activate
python main.py serve --reload

# Terminal 2: Frontend (with hot reload)
cd frontend
npm run dev

# Terminal 3: Documentation
cd website
npm start

2. Testing

# Run all tests
pytest

# With coverage (requires the pytest-cov plugin)
pytest --cov=agents --cov=pipeline --cov=visualization

# Specific test file
pytest tests/test_agents.py

3. Deployment

# Deploy to Databricks
export DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
export DATABRICKS_TOKEN=dapi...
./scripts/deploy-databricks-app.sh

Data Pipeline

Medallion Architecture

Bronze (Raw)          Silver (Cleaned)       Gold (Analyzed)
────────────────────────────────────────────────────────────
Scraped PDFs     →    Extracted text    →    Classifications
Meeting videos   →    Transcripts       →    Sentiment scores
Budget docs      →    Line items        →    Budget analysis
Form 990s        →    Financial data    →    Spending patterns

File Locations

Bronze: data/bronze/ - Raw downloaded files
Silver: data/silver/ - Cleaned and standardized
Gold: data/gold/ - Enriched with analysis
Cache: cache/ - Temporary processing files
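
Scripts that read or write these directories can avoid hard-coded strings with a small path helper. The sketch below uses only the directory names listed above; the helper name and the `IL/chicago.pdf` sub-path are hypothetical and do not describe the project's actual file layout.

```python
from pathlib import Path

# Layer directories as listed above, relative to the project root.
LAYER_DIRS = {
    "bronze": Path("data/bronze"),
    "silver": Path("data/silver"),
    "gold": Path("data/gold"),
    "cache": Path("cache"),
}


def layer_path(layer, *parts, root=Path(".")):
    """Return the path for `parts` inside the given layer directory."""
    if layer not in LAYER_DIRS:
        raise ValueError(f"unknown layer: {layer!r}")
    return root / LAYER_DIRS[layer].joinpath(*parts)


# Hypothetical sub-path, for illustration only.
print(layer_path("bronze", "IL", "chicago.pdf"))
```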

Configuration

Environment Variables

Create a .env file:

# Required
OPENAI_API_KEY=sk-...

# Optional (for production)
DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
DATABRICKS_TOKEN=dapi...

# Optional (for publishing)
HUGGINGFACE_TOKEN=hf_...

# Optional (for Harvard Dataverse)
DATAVERSE_API_KEY=...
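
A quick pre-flight check can catch a missing key before a long job starts. The sketch below only inspects variables already present in the process environment; it does not parse the .env file itself, and how Open Navigator loads that file is not covered here. The function name is illustrative.

```python
import os

# Variable names taken from the .env example above.
REQUIRED = ("OPENAI_API_KEY",)
OPTIONAL = (
    "DATABRICKS_HOST",
    "DATABRICKS_TOKEN",
    "HUGGINGFACE_TOKEN",
    "DATAVERSE_API_KEY",
)


def missing_variables(env=None, names=REQUIRED):
    """Return the names from `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_variables()
if missing:
    print("Missing required settings:", ", ".join(missing))
```

Passing `names=OPTIONAL` reports which optional integrations are not yet configured.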

Settings File

Edit config/settings.py to change:

Delta Lake paths
Scraping rate limits
Batch sizes
Model configurations

Contributing

1. Fork & Clone

git clone https://github.com/YOUR-USERNAME/open-navigator-for-engagement.git
cd open-navigator-for-engagement
git remote add upstream https://github.com/getcommunityone/open-navigator-for-engagement.git

2. Create Branch

git checkout -b feature/your-feature-name

3. Make Changes

Add tests for new features
Update documentation
Follow existing code style
Keep commits focused and atomic

4. Submit PR

git push origin feature/your-feature-name
# Then create PR on GitHub

See CONTRIBUTING.md for details.

Troubleshooting

Port Already in Use

# Find process using port
lsof -i :8000
lsof -i :5173
lsof -i :3000

# Kill process
kill -9 <PID>
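
On systems without lsof, a short Python check can tell whether anything is listening on the three default ports. This is a sketch of one possible approach: it reports only whether a TCP connection succeeds on localhost, not which process owns the port.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


# Default ports from this guide: API, frontend, documentation site.
for port in (8000, 5173, 3000):
    print(port, "in use" if port_in_use(port) else "free")
```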

Dependencies Not Installing

# Recreate the virtual environment and reinstall dependencies
rm -rf .venv
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

Scraping Failures

Check logs:

tail -f logs/scraper.log

Adjust rate limits in config/settings.py.

Next Steps

Read Architecture → System Design
Set Up Environment → Quick Start
Run Discovery → Jurisdiction Setup
Deploy to Production → Databricks Apps
Contribute → GitHub Issues

Support

GitHub Issues: Report bugs or request features
Documentation: Browse the sidebar
API Docs: http://localhost:8000/docs
Email: johnbowyer@communityone.com