open-navigator / website /docs /for-developers.md
jcbowyer's picture
Clean HuggingFace deployment without binary files
61d29fc
---
sidebar_position: 1
displayed_sidebar: developersSidebar
---
# For Developers & Technical Users
Welcome! This section contains **technical documentation** for developers, data scientists, and system administrators working with Open Navigator.
## Platform Scale & Data Volume
Open Navigator processes data at scale across the United States:
| Category | Count | Source |
|----------|-------|--------|
| **Total Jurisdictions** | 90,000+ | Census Bureau Gazetteer 2024 |
| **Counties** | 3,144 | All U.S. counties (FIPS coded) |
| **Municipalities** | 19,500+ | Cities, towns, villages, boroughs |
| **Townships** | 36,000+ | County subdivisions, census divisions |
| **School Districts** | 13,000+ | NCES Common Core of Data |
| **Nonprofit Organizations** | 3,000,000+ | IRS TEOS + ProPublica Nonprofit Explorer |
| **State Legislatures** | 50 | All U.S. states |
| **Video Channels** | 50+ | YouTube state legislature channels |
| **Meeting Datasets** | 1,000+ | MeetingBank, LocalView, City Scrapers |
| **.gov Domains** | 15,000+ | CISA validated government websites |
### Storage & Processing Requirements
**Estimated Data Volumes:**
- **Meeting Minutes**: 10-100 MB per municipality Γ— 1,000+ cities = 10-100 GB
- **Financial Documents**: 5-50 MB per jurisdiction Γ— 90,000 = 450 GB - 4.5 TB
- **Nonprofit 990s**: 1-5 MB per org Γ— 3M = 3-15 TB
- **Video Content**: Variable (streaming recommended over storage)
**Medallion Architecture (Delta Lake):**
- **Bronze Layer**: Raw scraped data (largest storage footprint)
- **Silver Layer**: Cleaned/standardized (50-70% compression)
- **Gold Layer**: Analyzed/aggregated (90%+ compression)
### API Rate Limits & Quotas
**Free Tier (No Cost):**
- Census Bureau: Unlimited downloads
- NCES: Unlimited bulk downloads
- ProPublica API: Respectful use (~1 req/sec suggested)
- IRS TEOS: Bulk data downloads (monthly updates)
- CISA .gov Domains: GitHub dataset (updated daily)
**Paid/Limited:**
- OpenAI API: Pay per token (required for LLM features)
- Harvard Dataverse: API key recommended (free registration)
:::info[Complete Technical Citations & Standards]
For full citations, licenses, API documentation, and technical specifications:
**[Citations & Data Sources](/docs/data-sources/citations)**
Includes:
- **Academic Research**: MeetingBank (ACL 2023), LocalView (Harvard), Council Data Project, City Scrapers
- **Government APIs**: U.S. Census, NCES, IRS, Open States
- **Standards**: OCD-ID (OCDEP 2), Popolo Project, Schema.org, CEDS, OMOP CDM (OHDSI), IATI v2.03
- **Data Models**: Microsoft CDM for Nonprofits, OMOP vocabulary system
- **Fact-Checking**: N/A (not currently integrated)
- **Nonprofit Data**: IRS BMF (43,726 orgs from 5 states)
- **Churches & Faith-Based**: 4,372 congregations from IRS data
- **Enterprise Tech**: Microsoft (Nonprofit CDM), Google (Data Commons), AWS (Open Data), Databricks (Unity Catalog, MLflow), Snowflake, Salesforce (NPSP)
- **BibTeX citations** for academic papers and research use
:::
---
## What You'll Find Here
### πŸš€ Setup & Installation
Get the platform running:
- **[Quick Start](/docs/quickstart)** - Detailed installation instructions
- **[Quick Reference](/docs/quick-reference)** - CLI commands cheat sheet
- **[Architecture](/docs/architecture)** - System design and components
### πŸ“Š Data Sources (Technical)
Technical details on data ingestion:
- **[Jurisdiction Discovery](/docs/data-sources/jurisdiction-discovery)** - Finding 90,000+ government websites
- **[Census Data](/docs/data-sources/census-data)** - Ingesting Census Bureau datasets
- **[HuggingFace Datasets](/docs/data-sources/huggingface-datasets)** - Pre-built meeting collections
- **[YouTube Discovery](/docs/data-sources/youtube-discovery)** - Video channel scraping
### πŸ› οΈ How-To Guides
Step-by-step technical guides:
- **[Jurisdiction Setup](/docs/guides/jurisdiction-setup)** - Configure discovery for your area
- **[HuggingFace Publishing](/docs/guides/huggingface-publishing)** - Publish datasets to HuggingFace Hub
- **[Handling Formats](/docs/guides/handling-formats)** - Process different document types
- **[Scraper Improvements](/docs/guides/scraper-improvements)** - Enhance scraping capabilities
### πŸ”Œ Integrations
Connect external services:
- **[Dataverse Integration](/docs/integrations/dataverse)** - Harvard Dataverse API
- **[Frontend Integration](/docs/integrations/frontend)** - React application setup
- **[LocalView](/docs/integrations/localview)** - LocalView dataset ingestion
### πŸš€ Deployment
Production deployment:
- **[Databricks Apps](/docs/deployment/databricks-apps)** - Deploy to Databricks
- **[Scale Deployment](/docs/deployment/scale)** - Handle large datasets
- **[Cost Management](/docs/deployment/costs)** - Optimize expenses
### πŸ’» Development
Contributing and development:
- **[Changelog](/docs/development/changelog)** - Version history
- **[Migration Guides](/docs/development/migration-v2)** - Upgrading between versions
- **[Refactoring Summary](/docs/development/refactoring-summary)** - Recent changes
## Quick Start (TL;DR)
```bash
# Clone and install
git clone https://github.com/getcommunityone/open-navigator-for-engagement.git
cd oral-health-policy-pulse
./install.sh
# Install frontend and docs
cd frontend && npm install && cd ..
cd website && npm install && cd ..
# Start all services
./start-all.sh
# Visit:
# - Main App: http://localhost:5173
# - API Docs: http://localhost:8000/docs
# - This Site: http://localhost:3000
```
## Architecture Overview
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Open Navigator Platform β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ React App β”‚ β”‚ FastAPI β”‚ β”‚
β”‚ β”‚ (Frontend) │──▢│ (Backend) β”‚ β”‚
β”‚ β”‚ Port 5173 β”‚ β”‚ Port 8000 β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ β”‚ β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ Delta Lake (Data Storage) β”‚ β”‚
β”‚ β”‚ β€’ Bronze: Raw data β”‚ β”‚
β”‚ β”‚ β€’ Silver: Cleaned data β”‚ β”‚
β”‚ β”‚ β€’ Gold: Analyzed data β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
## Common Tasks
### Run Jurisdiction Discovery
```bash
source .venv/bin/activate
# Test run (100 jurisdictions)
python main.py discover-jurisdictions --limit 100
# Single state
python main.py discover-jurisdictions --state CA
# Full discovery (~30k jurisdictions)
python main.py discover-jurisdictions
```
### Ingest Reference Data
```bash
# Census jurisdictions (90,000+ entities)
python -m discovery.census_ingestion
# NCES school districts (13,000+)
python -m discovery.nces_ingestion
# Pre-built datasets
python discovery/meetingbank_ingestion.py
python discovery/city_scrapers_urls.py
python discovery/openstates_sources.py
```
### Scrape Meeting Minutes
```bash
# Batch scraping from discovered sites
python main.py scrape-batch --source discovered --limit 50
# Single jurisdiction
python main.py scrape --url "https://chicago.legistar.com" \
--state "IL" \
--municipality "Chicago"
```
### Publish to HuggingFace
```bash
# Requires HUGGINGFACE_TOKEN in .env
python main.py publish-to-hf --dataset all
python main.py publish-to-hf --dataset discovered-urls
python main.py publish-to-hf --dataset census --sample
```
## Technology Stack
### Backend
- **Python 3.11+** - Core language
- **FastAPI** - REST API framework
- **Delta Lake** - Data lakehouse storage
- **Databricks** - Production data platform
- **OpenAI API** - LLM capabilities
### Frontend
- **React 18** - UI framework
- **Vite** - Build tool
- **TypeScript** - Type safety
- **Leaflet** - Interactive maps
### Data Processing
- **Pandas** - Data manipulation
- **BeautifulSoup** - HTML parsing
- **PyPDF2** - PDF extraction
- **Tesseract OCR** - Image to text
### Deployment
- **Docker** - Containerization
- **tmux** - Session management
- **Databricks Apps** - Production hosting
## API Reference
### Start API Server
```bash
python main.py serve --host 0.0.0.0 --port 8000
```
Visit http://localhost:8000/docs for interactive API documentation.
### Example: Start Workflow
```bash
curl -X POST "http://localhost:8000/workflow/start" \
-H "Content-Type: application/json" \
-d '{
"scrape_targets": [
{
"url": "https://chicago.legistar.com",
"municipality": "Chicago",
"state": "IL",
"platform": "legistar"
}
]
}'
```
### Example: Query Opportunities
```bash
curl "http://localhost:8000/opportunities?state=CA&urgency=critical"
```
## Development Workflow
### 1. Local Development
```bash
# Terminal 1: API (with hot reload)
source .venv/bin/activate
python main.py serve --reload
# Terminal 2: Frontend (with hot reload)
cd frontend
npm run dev
# Terminal 3: Documentation
cd website
npm start
```
### 2. Testing
```bash
# Run all tests
pytest
# With coverage
pytest --cov=agents --cov=pipeline --cov=visualization
# Specific test file
pytest tests/test_agents.py
```
### 3. Deployment
```bash
# Deploy to Databricks
export DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
export DATABRICKS_TOKEN=dapi...
./scripts/deploy-databricks-app.sh
```
## Data Pipeline
### Medallion Architecture
```
Bronze (Raw) Silver (Cleaned) Gold (Analyzed)
────────────────────────────────────────────────────────────
Scraped PDFs β†’ Extracted text β†’ Classifications
Meeting videos β†’ Transcripts β†’ Sentiment scores
Budget docs β†’ Line items β†’ Budget analysis
Form 990s β†’ Financial data β†’ Spending patterns
```
### File Locations
- **Bronze**: `data/bronze/` - Raw downloaded files
- **Silver**: `data/silver/` - Cleaned and standardized
- **Gold**: `data/gold/` - Enriched with analysis
- **Cache**: `cache/` - Temporary processing files
## Configuration
### Environment Variables
Create `.env` file:
```bash
# Required
OPENAI_API_KEY=sk-...
# Optional (for production)
DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
DATABRICKS_TOKEN=dapi...
# Optional (for publishing)
HUGGINGFACE_TOKEN=hf_...
# Optional (for Harvard Dataverse)
DATAVERSE_API_KEY=...
```
### Settings File
Edit `config/settings.py` for:
- Delta Lake paths
- Scraping rate limits
- Batch sizes
- Model configurations
## Contributing
### 1. Fork & Clone
```bash
git clone https://github.com/YOUR-USERNAME/oral-health-policy-pulse.git
cd oral-health-policy-pulse
git remote add upstream https://github.com/getcommunityone/open-navigator-for-engagement.git
```
### 2. Create Branch
```bash
git checkout -b feature/your-feature-name
```
### 3. Make Changes
- Add tests for new features
- Update documentation
- Follow existing code style
- Keep commits focused and atomic
### 4. Submit PR
```bash
git push origin feature/your-feature-name
# Then create PR on GitHub
```
See [CONTRIBUTING.md](https://github.com/getcommunityone/open-navigator-for-engagement/blob/main/CONTRIBUTING.md) for details.
## Troubleshooting
### Port Already in Use
```bash
# Find process using port
lsof -i :8000
lsof -i :5173
lsof -i :3000
# Kill process
kill -9 <PID>
```
### Dependencies Not Installing
```bash
# Clear cache and reinstall
rm -rf .venv
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
### Scraping Failures
Check logs:
```bash
tail -f logs/scraper.log
```
Adjust rate limits in `config/settings.py`.
## Next Steps
1. **Read Architecture** β†’ [System Design](/docs/architecture)
2. **Set Up Environment** β†’ [Quick Start](/docs/quickstart)
3. **Run Discovery** β†’ [Jurisdiction Setup](/docs/guides/jurisdiction-setup)
4. **Deploy to Production** β†’ [Databricks Apps](/docs/deployment/databricks-apps)
5. **Contribute** β†’ [GitHub Issues](https://github.com/getcommunityone/open-navigator-for-engagement/issues)
## Support
- **GitHub Issues**: [Report bugs or request features](https://github.com/getcommunityone/open-navigator-for-engagement/issues)
- **Documentation**: Browse the sidebar
- **API Docs**: http://localhost:8000/docs
- **Email**: johnbowyer@communityone.com