| sidebar_position: 1 | |
| displayed_sidebar: developersSidebar | |
| # For Developers & Technical Users | |
| Welcome! This section contains **technical documentation** for developers, data scientists, and system administrators working with Open Navigator. | |
| ## Platform Scale & Data Volume | |
| Open Navigator ingests data on jurisdictions and organizations across the United States: | |
| | Category | Count | Source | | |
| |----------|-------|--------| | |
| | **Total Jurisdictions** | 90,000+ | Census Bureau Gazetteer 2024 | | |
| | **Counties** | 3,144 | All U.S. counties (FIPS coded) | | |
| | **Municipalities** | 19,500+ | Cities, towns, villages, boroughs | | |
| | **Townships** | 36,000+ | County subdivisions, census divisions | | |
| | **School Districts** | 13,000+ | NCES Common Core of Data | | |
| | **Nonprofit Organizations** | 3,000,000+ | IRS TEOS + ProPublica Nonprofit Explorer | | |
| | **State Legislatures** | 50 | All U.S. states | | |
| | **Video Channels** | 50+ | YouTube state legislature channels | | |
| | **Meeting Datasets** | 1,000+ | MeetingBank, LocalView, City Scrapers | | |
| | **.gov Domains** | 15,000+ | CISA validated government websites | | |
| ### Storage & Processing Requirements | |
| **Estimated Data Volumes:** | |
| - **Meeting Minutes**: 10-100 MB per municipality × 1,000+ cities = 10-100 GB | |
| - **Financial Documents**: 5-50 MB per jurisdiction × 90,000 = 450 GB - 4.5 TB | |
| - **Nonprofit 990s**: 1-5 MB per org × 3M = 3-15 TB | |
| - **Video Content**: Variable (streaming recommended over storage) | |
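The arithmetic behind these estimates can be reproduced with a short script. This is only a sketch: the helper names are illustrative, and it assumes decimal units (1 GB = 1,000 MB) and the round entity counts used in the list above.

```python
# Rough storage estimate from the per-entity sizes listed above.
# Assumption: decimal units (1 GB = 1,000 MB; 1 TB = 1,000,000 MB).

MB_PER_GB = 1_000
MB_PER_TB = 1_000_000


def estimate_mb(low_mb: float, high_mb: float, entities: int) -> tuple[float, float]:
    """Return the (low, high) total in MB for a per-entity size range."""
    return low_mb * entities, high_mb * entities


def humanize(mb: float) -> str:
    """Format a size in MB as GB or TB, whichever reads better."""
    if mb >= MB_PER_TB:
        return f"{mb / MB_PER_TB:g} TB"
    return f"{mb / MB_PER_GB:g} GB"


# (label, low MB, high MB, entity count), taken from the list above
SOURCES = [
    ("Meeting minutes", 10, 100, 1_000),
    ("Financial documents", 5, 50, 90_000),
    ("Nonprofit 990s", 1, 5, 3_000_000),
]

for label, low, high, count in SOURCES:
    low_total, high_total = estimate_mb(low, high, count)
    print(f"{label}: {humanize(low_total)} to {humanize(high_total)}")
```

Changing an entity count or size range in `SOURCES` is enough to re-run the estimate for a smaller deployment, such as a single state.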
| **Medallion Architecture (Delta Lake):** | |
| - **Bronze Layer**: Raw scraped data (largest storage footprint) | |
| - **Silver Layer**: Cleaned/standardized (50-70% compression) | |
| - **Gold Layer**: Analyzed/aggregated (90%+ compression) | |
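To see what these ratios imply for capacity planning, the sketch below derives Silver and Gold sizes from a Bronze size. It assumes the percentages describe size reduction relative to Bronze, which is one reading of the list above; the function name is illustrative.

```python
def layer_sizes_tb(bronze_tb: float, silver_reduction: float, gold_reduction: float) -> dict[str, float]:
    """Estimate Silver and Gold sizes as reductions relative to Bronze.

    Assumption: a "70% compression" figure means the layer is 30% of Bronze.
    """
    for name, value in (("silver_reduction", silver_reduction), ("gold_reduction", gold_reduction)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return {
        "bronze": bronze_tb,
        "silver": bronze_tb * (1.0 - silver_reduction),
        "gold": bronze_tb * (1.0 - gold_reduction),
    }


# Example: 10 TB of raw data, 50% reduction in Silver, 90% in Gold
print(layer_sizes_tb(10.0, 0.5, 0.9))
```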
| ### API Rate Limits & Quotas | |
| **Free Tier (No Cost):** | |
| - Census Bureau: Unlimited downloads | |
| - NCES: Unlimited bulk downloads | |
| - ProPublica API: Respectful use (~1 req/sec suggested) | |
| - IRS TEOS: Bulk data downloads (monthly updates) | |
| - CISA .gov Domains: GitHub dataset (updated daily) | |
| **Paid/Limited:** | |
| - OpenAI API: Pay per token (required for LLM features) | |
| - Harvard Dataverse: API key recommended (free registration) | |
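For sources that ask for respectful use, a client-side pause between requests is usually enough. The class below is a minimal sketch of a roughly one-request-per-second limiter. `MinIntervalLimiter` is a hypothetical helper, not part of the Open Navigator codebase, and the clock and sleep functions are injectable so it can be tested without waiting.

```python
import time


class MinIntervalLimiter:
    """Ensure successive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval: float = 1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually allowed to go out.
        self._last_call = now + delay
        return delay
```

Call `limiter.wait()` immediately before each outgoing request; the first call returns at once and later calls block only as long as needed.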
| :::info[Complete Technical Citations & Standards] | |
| For full citations, licenses, API documentation, and technical specifications: | |
| **[Citations & Data Sources](/docs/data-sources/citations)** | |
| Includes: | |
| - **Academic Research**: MeetingBank (ACL 2023), LocalView (Harvard), Council Data Project, City Scrapers | |
| - **Government APIs**: U.S. Census, NCES, IRS, Open States | |
| - **Standards**: OCD-ID (OCDEP 2), Popolo Project, Schema.org, CEDS, OMOP CDM (OHDSI), IATI v2.03 | |
| - **Data Models**: Microsoft CDM for Nonprofits, OMOP vocabulary system | |
| - **Fact-Checking**: N/A (not currently integrated) | |
| - **Nonprofit Data**: IRS BMF (43,726 orgs from 5 states) | |
| - **Churches & Faith-Based**: 4,372 congregations from IRS data | |
| - **Enterprise Tech**: Microsoft (Nonprofit CDM), Google (Data Commons), AWS (Open Data), Databricks (Unity Catalog, MLflow), Snowflake, Salesforce (NPSP) | |
| - **BibTeX citations** for academic papers and research use | |
| ::: | |
| --- | |
| ## What You'll Find Here | |
| ### Setup & Installation | |
| Get the platform running: | |
| - **[Quick Start](/docs/quickstart)** - Detailed installation instructions | |
| - **[Quick Reference](/docs/quick-reference)** - CLI commands cheat sheet | |
| - **[Architecture](/docs/architecture)** - System design and components | |
| ### Data Sources (Technical) | |
| Technical details on data ingestion: | |
| - **[Jurisdiction Discovery](/docs/data-sources/jurisdiction-discovery)** - Finding 90,000+ government websites | |
| - **[Census Data](/docs/data-sources/census-data)** - Ingesting Census Bureau datasets | |
| - **[HuggingFace Datasets](/docs/data-sources/huggingface-datasets)** - Pre-built meeting collections | |
| - **[YouTube Discovery](/docs/data-sources/youtube-discovery)** - Video channel scraping | |
| ### How-To Guides | |
| Step-by-step technical guides: | |
| - **[Jurisdiction Setup](/docs/guides/jurisdiction-setup)** - Configure discovery for your area | |
| - **[HuggingFace Publishing](/docs/guides/huggingface-publishing)** - Publish datasets to HuggingFace Hub | |
| - **[Handling Formats](/docs/guides/handling-formats)** - Process different document types | |
| - **[Scraper Improvements](/docs/guides/scraper-improvements)** - Enhance scraping capabilities | |
| ### Integrations | |
| Connect external services: | |
| - **[Dataverse Integration](/docs/integrations/dataverse)** - Harvard Dataverse API | |
| - **[Frontend Integration](/docs/integrations/frontend)** - React application setup | |
| - **[LocalView](/docs/integrations/localview)** - LocalView dataset ingestion | |
| ### Deployment | |
| Production deployment: | |
| - **[Databricks Apps](/docs/deployment/databricks-apps)** - Deploy to Databricks | |
| - **[Scale Deployment](/docs/deployment/scale)** - Handle large datasets | |
| - **[Cost Management](/docs/deployment/costs)** - Optimize expenses | |
| ### Development | |
| Contributing and development: | |
| - **[Changelog](/docs/development/changelog)** - Version history | |
| - **[Migration Guides](/docs/development/migration-v2)** - Upgrading between versions | |
| - **[Refactoring Summary](/docs/development/refactoring-summary)** - Recent changes | |
| ## Quick Start (TL;DR) | |
| ```bash | |
| # Clone and install | |
| git clone https://github.com/getcommunityone/open-navigator-for-engagement.git | |
| cd open-navigator-for-engagement | |
| ./install.sh | |
| # Install frontend and docs | |
| cd frontend && npm install && cd .. | |
| cd website && npm install && cd .. | |
| # Start all services | |
| ./start-all.sh | |
| # Visit: | |
| # - Main App: http://localhost:5173 | |
| # - API Docs: http://localhost:8000/docs | |
| # - This Site: http://localhost:3000 | |
| ``` | |
| ## Architecture Overview | |
| ``` | |
| ┌─────────────────────────────────────────┐ | |
| │         Open Navigator Platform         │ | |
| ├─────────────────────────────────────────┤ | |
| │                                         │ | |
| │  ┌──────────────┐   ┌──────────────┐    │ | |
| │  │  React App   │   │   FastAPI    │    │ | |
| │  │  (Frontend)  │──▶│  (Backend)   │    │ | |
| │  │  Port 5173   │   │  Port 8000   │    │ | |
| │  └──────────────┘   └──────┬───────┘    │ | |
| │                            │            │ | |
| │  ┌─────────────────────────▼─────────┐  │ | |
| │  │  Delta Lake (Data Storage)        │  │ | |
| │  │  • Bronze: Raw data               │  │ | |
| │  │  • Silver: Cleaned data           │  │ | |
| │  │  • Gold: Analyzed data            │  │ | |
| │  └───────────────────────────────────┘  │ | |
| └─────────────────────────────────────────┘ | |
| ``` | |
| ## Common Tasks | |
| ### Run Jurisdiction Discovery | |
| ```bash | |
| source .venv/bin/activate | |
| # Test run (100 jurisdictions) | |
| python main.py discover-jurisdictions --limit 100 | |
| # Single state | |
| python main.py discover-jurisdictions --state CA | |
| # Full discovery (~30k jurisdictions) | |
| python main.py discover-jurisdictions | |
| ``` | |
| ### Ingest Reference Data | |
| ```bash | |
| # Census jurisdictions (90,000+ entities) | |
| python -m discovery.census_ingestion | |
| # NCES school districts (13,000+) | |
| python -m discovery.nces_ingestion | |
| # Pre-built datasets | |
| python discovery/meetingbank_ingestion.py | |
| python discovery/city_scrapers_urls.py | |
| python discovery/openstates_sources.py | |
| ``` | |
| ### Scrape Meeting Minutes | |
| ```bash | |
| # Batch scraping from discovered sites | |
| python main.py scrape-batch --source discovered --limit 50 | |
| # Single jurisdiction | |
| python main.py scrape --url "https://chicago.legistar.com" \ | |
| --state "IL" \ | |
| --municipality "Chicago" | |
| ``` | |
| ### Publish to HuggingFace | |
| ```bash | |
| # Requires HUGGINGFACE_TOKEN in .env | |
| python main.py publish-to-hf --dataset all | |
| python main.py publish-to-hf --dataset discovered-urls | |
| python main.py publish-to-hf --dataset census --sample | |
| ``` | |
| ## Technology Stack | |
| ### Backend | |
| - **Python 3.11+** - Core language | |
| - **FastAPI** - REST API framework | |
| - **Delta Lake** - Data lakehouse storage | |
| - **Databricks** - Production data platform | |
| - **OpenAI API** - LLM capabilities | |
| ### Frontend | |
| - **React 18** - UI framework | |
| - **Vite** - Build tool | |
| - **TypeScript** - Type safety | |
| - **Leaflet** - Interactive maps | |
| ### Data Processing | |
| - **Pandas** - Data manipulation | |
| - **BeautifulSoup** - HTML parsing | |
| - **PyPDF2** - PDF extraction | |
| - **Tesseract OCR** - Image to text | |
| ### Deployment | |
| - **Docker** - Containerization | |
| - **tmux** - Session management | |
| - **Databricks Apps** - Production hosting | |
| ## API Reference | |
| ### Start API Server | |
| ```bash | |
| python main.py serve --host 0.0.0.0 --port 8000 | |
| ``` | |
| Visit http://localhost:8000/docs for interactive API documentation. | |
| ### Example: Start Workflow | |
| ```bash | |
| curl -X POST "http://localhost:8000/workflow/start" \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "scrape_targets": [ | |
| { | |
| "url": "https://chicago.legistar.com", | |
| "municipality": "Chicago", | |
| "state": "IL", | |
| "platform": "legistar" | |
| } | |
| ] | |
| }' | |
| ``` | |
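The same request can be built from Python using only the standard library. This sketch constructs the request without sending it, so it runs without a live server. The function name is illustrative, and the commented-out send step assumes the endpoint returns JSON.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # local API server started above


def build_workflow_request(targets: list[dict], base_url: str = API_BASE) -> urllib.request.Request:
    """Build (but do not send) the POST request for /workflow/start."""
    body = json.dumps({"scrape_targets": targets}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/workflow/start",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_workflow_request([
    {
        "url": "https://chicago.legistar.com",
        "municipality": "Chicago",
        "state": "IL",
        "platform": "legistar",
    }
])

# To send it against a running server (assumes a JSON response body):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```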
| ### Example: Query Opportunities | |
| ```bash | |
| curl "http://localhost:8000/opportunities?state=CA&urgency=critical" | |
| ``` | |
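When filters are built programmatically, it is safer to let the standard library encode the query string. The helper below is a sketch: `opportunities_url` is an illustrative name, and only the `state` and `urgency` parameters shown above are known to exist.

```python
from urllib.parse import urlencode


def opportunities_url(base_url: str = "http://localhost:8000", **filters: str) -> str:
    """Build the /opportunities URL, skipping filters that are unset."""
    params = {key: value for key, value in filters.items() if value}
    query = urlencode(params)
    return f"{base_url}/opportunities" + (f"?{query}" if query else "")


print(opportunities_url(state="CA", urgency="critical"))
```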
| ## Development Workflow | |
| ### 1. Local Development | |
| ```bash | |
| # Terminal 1: API (with hot reload) | |
| source .venv/bin/activate | |
| python main.py serve --reload | |
| # Terminal 2: Frontend (with hot reload) | |
| cd frontend | |
| npm run dev | |
| # Terminal 3: Documentation | |
| cd website | |
| npm start | |
| ``` | |
| ### 2. Testing | |
| ```bash | |
| # Run all tests | |
| pytest | |
| # With coverage | |
| pytest --cov=agents --cov=pipeline --cov=visualization | |
| # Specific test file | |
| pytest tests/test_agents.py | |
| ``` | |
| ### 3. Deployment | |
| ```bash | |
| # Deploy to Databricks | |
| export DATABRICKS_HOST=https://your-workspace.cloud.databricks.com | |
| export DATABRICKS_TOKEN=dapi... | |
| ./scripts/deploy-databricks-app.sh | |
| ``` | |
| ## Data Pipeline | |
| ### Medallion Architecture | |
| ``` | |
| Bronze (Raw)        Silver (Cleaned)     Gold (Analyzed) | |
| ────────────────────────────────────────────────────────── | |
| Scraped PDFs     →  Extracted text    →  Classifications | |
| Meeting videos   →  Transcripts       →  Sentiment scores | |
| Budget docs      →  Line items        →  Budget analysis | |
| Form 990s        →  Financial data    →  Spending patterns | |
| ``` | |
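As a toy illustration of the three stages, the sketch below moves one record from Bronze to Gold in plain Python. It is not the project's pipeline: the record classes, the whitespace cleanup, and the keyword map are all hypothetical stand-ins for the real extraction and analysis steps.

```python
import re
from dataclasses import dataclass


@dataclass
class BronzeRecord:
    source_url: str
    raw_text: str  # text as extracted, with layout noise


@dataclass
class SilverRecord:
    source_url: str
    text: str  # whitespace-normalized text


@dataclass
class GoldRecord:
    source_url: str
    topics: list[str]  # keyword-derived labels


def to_silver(record: BronzeRecord) -> SilverRecord:
    """Clean step: collapse runs of whitespace and trim the ends."""
    cleaned = re.sub(r"\s+", " ", record.raw_text).strip()
    return SilverRecord(record.source_url, cleaned)


# Hypothetical keyword map, for illustration only.
TOPIC_KEYWORDS = {
    "budget": ("budget", "appropriation"),
    "zoning": ("zoning", "rezoning"),
}


def to_gold(record: SilverRecord) -> GoldRecord:
    """Analyze step: tag topics whose keywords appear in the text."""
    lowered = record.text.lower()
    topics = [
        topic
        for topic, words in TOPIC_KEYWORDS.items()
        if any(word in lowered for word in words)
    ]
    return GoldRecord(record.source_url, topics)
```

Each stage takes the previous layer's record and returns a new one, so raw data stays untouched and any layer can be rebuilt from the one before it.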
| ### File Locations | |
| - **Bronze**: `data/bronze/` - Raw downloaded files | |
| - **Silver**: `data/silver/` - Cleaned and standardized | |
| - **Gold**: `data/gold/` - Enriched with analysis | |
| - **Cache**: `cache/` - Temporary processing files | |
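A small helper can keep layer paths consistent across scripts. This is a sketch based only on the directory names listed above: `layer_path` and the example sub-path are hypothetical, and the real paths are configured in `config/settings.py`.

```python
from pathlib import Path

# Directory names from the list above, relative to the project root.
LAYER_DIRS = {"bronze": "data/bronze", "silver": "data/silver", "gold": "data/gold"}


def layer_path(layer: str, *parts: str, root: Path = Path(".")) -> Path:
    """Resolve a path inside one of the medallion layer directories."""
    try:
        base = LAYER_DIRS[layer.lower()]
    except KeyError:
        raise ValueError(
            f"unknown layer {layer!r}; expected one of {sorted(LAYER_DIRS)}"
        ) from None
    return root / base / Path(*parts)


# Hypothetical file layout, for illustration only.
print(layer_path("silver", "IL", "chicago.parquet").as_posix())
```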
| ## Configuration | |
| ### Environment Variables | |
| Create a `.env` file: | |
| ```bash | |
| # Required | |
| OPENAI_API_KEY=sk-... | |
| # Optional (for production) | |
| DATABRICKS_HOST=https://your-workspace.cloud.databricks.com | |
| DATABRICKS_TOKEN=dapi... | |
| # Optional (for publishing) | |
| HUGGINGFACE_TOKEN=hf_... | |
| # Optional (for Harvard Dataverse) | |
| DATAVERSE_API_KEY=... | |
| ``` | |
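Before starting the services, it can help to confirm that the required key is present. The sketch below parses simple `KEY=VALUE` lines with the standard library only. It is an illustrative checker, not how the application itself loads its configuration, and it handles only the basic syntax shown above.

```python
# Per the example above, only OPENAI_API_KEY is marked as required.
REQUIRED_KEYS = ("OPENAI_API_KEY",)


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_required(values: dict[str, str]) -> list[str]:
    """List required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


sample = "# Required\nOPENAI_API_KEY=sk-test\n\n# Optional\nHUGGINGFACE_TOKEN='hf_x'\n"
print(missing_required(parse_env(sample)))
```

To check a real file, pass `Path(".env").read_text()` (from `pathlib`) to `parse_env` instead of the sample string.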
| ### Settings File | |
| Edit `config/settings.py` for: | |
| - Delta Lake paths | |
| - Scraping rate limits | |
| - Batch sizes | |
| - Model configurations | |
| ## Contributing | |
| ### 1. Fork & Clone | |
| ```bash | |
| git clone https://github.com/YOUR-USERNAME/open-navigator-for-engagement.git | |
| cd open-navigator-for-engagement | |
| git remote add upstream https://github.com/getcommunityone/open-navigator-for-engagement.git | |
| ``` | |
| ### 2. Create Branch | |
| ```bash | |
| git checkout -b feature/your-feature-name | |
| ``` | |
| ### 3. Make Changes | |
| - Add tests for new features | |
| - Update documentation | |
| - Follow existing code style | |
| - Keep commits focused and atomic | |
| ### 4. Submit PR | |
| ```bash | |
| git push origin feature/your-feature-name | |
| # Then create PR on GitHub | |
| ``` | |
| See [CONTRIBUTING.md](https://github.com/getcommunityone/open-navigator-for-engagement/blob/main/CONTRIBUTING.md) for details. | |
| ## Troubleshooting | |
| ### Port Already in Use | |
| ```bash | |
| # Find process using port | |
| lsof -i :8000 | |
| lsof -i :5173 | |
| lsof -i :3000 | |
| # Kill process | |
| kill -9 <PID> | |
| ``` | |
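Where `lsof` is not available, a quick TCP connection attempt shows whether a port is taken. This is a sketch using the standard `socket` module. It reports only whether something is listening on the port, not which process owns it.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


# Ports used by the API, the frontend, and the documentation site.
for port in (8000, 5173, 3000):
    status = "in use" if port_in_use(port) else "free"
    print(f"Port {port}: {status}")
```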
| ### Dependencies Not Installing | |
| ```bash | |
| # Clear cache and reinstall | |
| rm -rf .venv | |
| python3 -m venv .venv | |
| source .venv/bin/activate | |
| pip install --upgrade pip | |
| pip install -r requirements.txt | |
| ``` | |
| ### Scraping Failures | |
| Check logs: | |
| ```bash | |
| tail -f logs/scraper.log | |
| ``` | |
| Adjust rate limits in `config/settings.py`. | |
| ## Next Steps | |
| 1. **Read Architecture** → [System Design](/docs/architecture) | |
| 2. **Set Up Environment** → [Quick Start](/docs/quickstart) | |
| 3. **Run Discovery** → [Jurisdiction Setup](/docs/guides/jurisdiction-setup) | |
| 4. **Deploy to Production** → [Databricks Apps](/docs/deployment/databricks-apps) | |
| 5. **Contribute** → [GitHub Issues](https://github.com/getcommunityone/open-navigator-for-engagement/issues) | |
| ## Support | |
| - **GitHub Issues**: [Report bugs or request features](https://github.com/getcommunityone/open-navigator-for-engagement/issues) | |
| - **Documentation**: Browse the sidebar | |
| - **API Docs**: http://localhost:8000/docs | |
| - **Email**: johnbowyer@communityone.com | |