---
sidebar_position: 1
displayed_sidebar: developersSidebar
---

# For Developers & Technical Users

Welcome! This section contains **technical documentation** for developers, data scientists, and system administrators working with Open Navigator.

## Platform Scale & Data Volume

Open Navigator processes data at scale across the United States:

| Category | Count | Source |
|----------|-------|--------|
| **Total Jurisdictions** | 90,000+ | Census Bureau Gazetteer 2024 |
| **Counties** | 3,144 | All U.S. counties (FIPS coded) |
| **Municipalities** | 19,500+ | Cities, towns, villages, boroughs |
| **Townships** | 36,000+ | County subdivisions, census divisions |
| **School Districts** | 13,000+ | NCES Common Core of Data |
| **Nonprofit Organizations** | 3,000,000+ | IRS TEOS + ProPublica Nonprofit Explorer |
| **State Legislatures** | 50 | All U.S. states |
| **Video Channels** | 50+ | YouTube state legislature channels |
| **Meeting Datasets** | 1,000+ | MeetingBank, LocalView, City Scrapers |
| **.gov Domains** | 15,000+ | CISA validated government websites |

### Storage & Processing Requirements

**Estimated Data Volumes:**
- **Meeting Minutes**: 10-100 MB per municipality × 1,000+ cities = 10-100 GB
- **Financial Documents**: 5-50 MB per jurisdiction × 90,000 = 450 GB - 4.5 TB
- **Nonprofit 990s**: 1-5 MB per org × 3M = 3-15 TB
- **Video Content**: Variable (streaming recommended over storage)

**Medallion Architecture (Delta Lake):**
- **Bronze Layer**: Raw scraped data (largest storage footprint)
- **Silver Layer**: Cleaned/standardized (50-70% compression)
- **Gold Layer**: Analyzed/aggregated (90%+ compression)

### API Rate Limits & Quotas

**Free Tier (No Cost):**
- Census Bureau: Unlimited downloads
- NCES: Unlimited bulk downloads
- ProPublica API: Respectful use (~1 req/sec suggested)
- IRS TEOS: Bulk data downloads (monthly updates)
- CISA .gov Domains: GitHub dataset (updated daily)

**Paid/Limited:**
- OpenAI API: Pay per token (required for LLM
features)
- Harvard Dataverse: API key recommended (free registration)

:::info[Complete Technical Citations & Standards]

For full citations, licenses, API documentation, and technical specifications:

**[Citations & Data Sources](/docs/data-sources/citations)**

Includes:
- **Academic Research**: MeetingBank (ACL 2023), LocalView (Harvard), Council Data Project, City Scrapers
- **Government APIs**: U.S. Census, NCES, IRS, Open States
- **Standards**: OCD-ID (OCDEP 2), Popolo Project, Schema.org, CEDS, OMOP CDM (OHDSI), IATI v2.03
- **Data Models**: Microsoft CDM for Nonprofits, OMOP vocabulary system
- **Fact-Checking**: N/A (not currently integrated)
- **Nonprofit Data**: IRS BMF (43,726 orgs from 5 states)
- **Churches & Faith-Based**: 4,372 congregations from IRS data
- **Enterprise Tech**: Microsoft (Nonprofit CDM), Google (Data Commons), AWS (Open Data), Databricks (Unity Catalog, MLflow), Snowflake, Salesforce (NPSP)
- **BibTeX citations** for academic papers and research use

:::

---

## What You'll Find Here

### 🚀 Setup & Installation

Get the platform running:

- **[Quick Start](/docs/quickstart)** - Detailed installation instructions
- **[Quick Reference](/docs/quick-reference)** - CLI commands cheat sheet
- **[Architecture](/docs/architecture)** - System design and components

### 📊 Data Sources (Technical)

Technical details on data ingestion:

- **[Jurisdiction Discovery](/docs/data-sources/jurisdiction-discovery)** - Finding 90,000+ government websites
- **[Census Data](/docs/data-sources/census-data)** - Ingesting Census Bureau datasets
- **[HuggingFace Datasets](/docs/data-sources/huggingface-datasets)** - Pre-built meeting collections
- **[YouTube Discovery](/docs/data-sources/youtube-discovery)** - Video channel scraping

### 🛠️ How-To Guides

Step-by-step technical guides:

- **[Jurisdiction Setup](/docs/guides/jurisdiction-setup)** - Configure discovery for your area
- **[HuggingFace Publishing](/docs/guides/huggingface-publishing)** - Publish
datasets to HuggingFace Hub
- **[Handling Formats](/docs/guides/handling-formats)** - Process different document types
- **[Scraper Improvements](/docs/guides/scraper-improvements)** - Enhance scraping capabilities

### 🔌 Integrations

Connect external services:

- **[Dataverse Integration](/docs/integrations/dataverse)** - Harvard Dataverse API
- **[Frontend Integration](/docs/integrations/frontend)** - React application setup
- **[LocalView](/docs/integrations/localview)** - LocalView dataset ingestion

### 🚀 Deployment

Production deployment:

- **[Databricks Apps](/docs/deployment/databricks-apps)** - Deploy to Databricks
- **[Scale Deployment](/docs/deployment/scale)** - Handle large datasets
- **[Cost Management](/docs/deployment/costs)** - Optimize expenses

### 💻 Development

Contributing and development:

- **[Changelog](/docs/development/changelog)** - Version history
- **[Migration Guides](/docs/development/migration-v2)** - Upgrading between versions
- **[Refactoring Summary](/docs/development/refactoring-summary)** - Recent changes

## Quick Start (TL;DR)

```bash
# Clone and install (git creates a directory named after the repository)
git clone https://github.com/getcommunityone/open-navigator-for-engagement.git
cd open-navigator-for-engagement
./install.sh

# Install frontend and docs
cd frontend && npm install && cd ..
cd website && npm install && cd ..

# Start all services
./start-all.sh

# Visit:
# - Main App: http://localhost:5173
# - API Docs: http://localhost:8000/docs
# - This Site: http://localhost:3000
```

## Architecture Overview

```
┌─────────────────────────────────────────┐
│         Open Navigator Platform         │
├─────────────────────────────────────────┤
│                                         │
│  ┌──────────────┐   ┌──────────────┐    │
│  │  React App   │   │   FastAPI    │    │
│  │  (Frontend)  │──▶│  (Backend)   │    │
│  │  Port 5173   │   │  Port 8000   │    │
│  └──────────────┘   └──────┬───────┘    │
│                            │            │
│  ┌─────────────────────────▼────────┐   │
│  │  Delta Lake (Data Storage)       │   │
│  │  • Bronze: Raw data              │   │
│  │  • Silver: Cleaned data          │   │
│  │  • Gold: Analyzed data           │   │
│  └──────────────────────────────────┘   │
└─────────────────────────────────────────┘
```

## Common Tasks

### Run Jurisdiction Discovery

```bash
source .venv/bin/activate

# Test run (100 jurisdictions)
python main.py discover-jurisdictions --limit 100

# Single state
python main.py discover-jurisdictions --state CA

# Full discovery (~30k jurisdictions)
python main.py discover-jurisdictions
```

### Ingest Reference Data

```bash
# Census jurisdictions (90,000+ entities)
python -m discovery.census_ingestion

# NCES school districts (13,000+)
python -m discovery.nces_ingestion

# Pre-built datasets
python discovery/meetingbank_ingestion.py
python discovery/city_scrapers_urls.py
python discovery/openstates_sources.py
```

### Scrape Meeting Minutes

```bash
# Batch scraping from discovered sites
python main.py scrape-batch --source discovered --limit 50

# Single jurisdiction
python main.py scrape --url "https://chicago.legistar.com" \
  --state "IL" \
  --municipality "Chicago"
```

### Publish to HuggingFace

```bash
# Requires HUGGINGFACE_TOKEN in .env
python main.py publish-to-hf --dataset all
python main.py publish-to-hf --dataset discovered-urls
python main.py publish-to-hf --dataset census --sample
```

## Technology Stack

### Backend
- **Python 3.11+** - Core language
- **FastAPI** - REST API framework
- **Delta Lake** - Data lakehouse storage
- **Databricks** - Production data platform
- **OpenAI API** - LLM capabilities

### Frontend
- **React 18** - UI framework
- **Vite** - Build tool
- **TypeScript** - Type safety
- **Leaflet** - Interactive maps

### Data Processing
- **Pandas** - Data manipulation
- **BeautifulSoup** - HTML parsing
- **PyPDF2** - PDF extraction
- **Tesseract OCR** - Image to text

### Deployment
- **Docker** - Containerization
- **tmux** - Session management
- **Databricks Apps** - Production hosting

## API Reference

### Start API Server

```bash
python main.py serve --host 0.0.0.0 --port 8000
```

Visit http://localhost:8000/docs for interactive API documentation.

### Example: Start Workflow

```bash
curl -X POST "http://localhost:8000/workflow/start" \
  -H "Content-Type: application/json" \
  -d '{
    "scrape_targets": [
      {
        "url": "https://chicago.legistar.com",
        "municipality": "Chicago",
        "state": "IL",
        "platform": "legistar"
      }
    ]
  }'
```

### Example: Query Opportunities

```bash
curl "http://localhost:8000/opportunities?state=CA&urgency=critical"
```

## Development Workflow

### 1. Local Development

```bash
# Terminal 1: API (with hot reload)
source .venv/bin/activate
python main.py serve --reload

# Terminal 2: Frontend (with hot reload)
cd frontend
npm run dev

# Terminal 3: Documentation
cd website
npm start
```

### 2. Testing

```bash
# Run all tests
pytest

# With coverage
pytest --cov=agents --cov=pipeline --cov=visualization

# Specific test file
pytest tests/test_agents.py
```

### 3. Deployment

```bash
# Deploy to Databricks
export DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
export DATABRICKS_TOKEN=dapi...
./scripts/deploy-databricks-app.sh
```

## Data Pipeline

### Medallion Architecture

```
Bronze (Raw)          Silver (Cleaned)        Gold (Analyzed)
────────────────────────────────────────────────────────────
Scraped PDFs      →   Extracted text     →    Classifications
Meeting videos    →   Transcripts        →    Sentiment scores
Budget docs       →   Line items         →    Budget analysis
Form 990s         →   Financial data     →    Spending patterns
```

### File Locations

- **Bronze**: `data/bronze/` - Raw downloaded files
- **Silver**: `data/silver/` - Cleaned and standardized
- **Gold**: `data/gold/` - Enriched with analysis
- **Cache**: `cache/` - Temporary processing files

## Configuration

### Environment Variables

Create a `.env` file in the project root:

```bash
# Required
OPENAI_API_KEY=sk-...

# Optional (for production)
DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
DATABRICKS_TOKEN=dapi...

# Optional (for publishing)
HUGGINGFACE_TOKEN=hf_...

# Optional (for Harvard Dataverse)
DATAVERSE_API_KEY=...
```

### Settings File

Edit `config/settings.py` for:
- Delta Lake paths
- Scraping rate limits
- Batch sizes
- Model configurations

## Contributing

### 1. Fork & Clone

```bash
# Clone your fork (a fork keeps the upstream repository name by default)
git clone https://github.com/YOUR-USERNAME/open-navigator-for-engagement.git
cd open-navigator-for-engagement
git remote add upstream https://github.com/getcommunityone/open-navigator-for-engagement.git
```

### 2. Create Branch

```bash
git checkout -b feature/your-feature-name
```

### 3. Make Changes

- Add tests for new features
- Update documentation
- Follow existing code style
- Keep commits focused and atomic

### 4. Submit PR

```bash
git push origin feature/your-feature-name
# Then create PR on GitHub
```

See [CONTRIBUTING.md](https://github.com/getcommunityone/open-navigator-for-engagement/blob/main/CONTRIBUTING.md) for details.

## Troubleshooting

### Port Already in Use

```bash
# Find process using port
lsof -i :8000
lsof -i :5173
lsof -i :3000

# Kill process (replace <PID> with the process ID reported by lsof)
kill -9 <PID>
```

### Dependencies Not Installing

```bash
# Clear cache and reinstall
rm -rf .venv
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

### Scraping Failures

Check logs:

```bash
tail -f logs/scraper.log
```

Adjust rate limits in `config/settings.py`.

## Next Steps

1. **Read Architecture** → [System Design](/docs/architecture)
2. **Set Up Environment** → [Quick Start](/docs/quickstart)
3. **Run Discovery** → [Jurisdiction Setup](/docs/guides/jurisdiction-setup)
4. **Deploy to Production** → [Databricks Apps](/docs/deployment/databricks-apps)
5. **Contribute** → [GitHub Issues](https://github.com/getcommunityone/open-navigator-for-engagement/issues)

## Support

- **GitHub Issues**: [Report bugs or request features](https://github.com/getcommunityone/open-navigator-for-engagement/issues)
- **Documentation**: Browse the sidebar
- **API Docs**: http://localhost:8000/docs
- **Email**: johnbowyer@communityone.com
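
## Appendix: Reproducing the Storage Estimates

The ranges listed under Storage & Processing Requirements are simple per-entity size times entity count products. The sketch below reproduces that arithmetic so the figures can be re-run with different inputs. It is not part of the Open Navigator codebase: the helper names (`estimate_range_mb`, `human`, `summarize`) are hypothetical, and it assumes decimal units (1 GB = 1,000 MB, 1 TB = 1,000,000 MB), which is what the quoted totals imply.

```python
"""Back-of-the-envelope storage estimates (illustrative sketch only).

Assumptions: decimal units, and the per-entity sizes and counts quoted in
the Storage & Processing Requirements section of this page.
"""

# (low MB per entity, high MB per entity, number of entities)
ESTIMATES = {
    "Meeting Minutes": (10, 100, 1_000),
    "Financial Documents": (5, 50, 90_000),
    "Nonprofit 990s": (1, 5, 3_000_000),
}


def estimate_range_mb(low_mb: float, high_mb: float, count: int) -> tuple[float, float]:
    """Return the (low, high) total size in MB for `count` entities."""
    return low_mb * count, high_mb * count


def human(mb: float) -> str:
    """Format a size given in MB as GB or TB, using decimal units."""
    if mb >= 1_000_000:
        return f"{mb / 1_000_000:g} TB"
    return f"{mb / 1_000:g} GB"


def summarize() -> dict[str, str]:
    """Map each category to a 'low - high' string of total storage."""
    result = {}
    for name, (low_mb, high_mb, count) in ESTIMATES.items():
        low, high = estimate_range_mb(low_mb, high_mb, count)
        result[name] = f"{human(low)} - {human(high)}"
    return result


if __name__ == "__main__":
    for name, span in summarize().items():
        print(f"{name}: {span}")
```

Changing a tuple in `ESTIMATES` (for example, raising the meeting-minutes count beyond 1,000 cities) is enough to see how the Bronze-layer footprint would scale before any compression in the Silver and Gold layers.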