---
sidebar_position: 1
displayed_sidebar: developersSidebar
---

# For Developers & Technical Users

Welcome! This section contains **technical documentation** for developers, data scientists, and system administrators working with Open Navigator.

## Platform Scale & Data Volume

Open Navigator processes data at scale across the United States:

| Category | Count | Source |
|----------|-------|--------|
| **Total Jurisdictions** | 90,000+ | Census Bureau Gazetteer 2024 |
| **Counties** | 3,144 | All U.S. counties (FIPS coded) |
| **Municipalities** | 19,500+ | Cities, towns, villages, boroughs |
| **Townships** | 36,000+ | County subdivisions, census divisions |
| **School Districts** | 13,000+ | NCES Common Core of Data |
| **Nonprofit Organizations** | 3,000,000+ | IRS TEOS + ProPublica Nonprofit Explorer |
| **State Legislatures** | 50 | All U.S. states |
| **Video Channels** | 50+ | YouTube state legislature channels |
| **Meeting Datasets** | 1,000+ | MeetingBank, LocalView, City Scrapers |
| **.gov Domains** | 15,000+ | CISA validated government websites |

### Storage & Processing Requirements

**Estimated Data Volumes:**
- **Meeting Minutes**: 10-100 MB per municipality × 1,000+ cities = 10-100 GB
- **Financial Documents**: 5-50 MB per jurisdiction × 90,000 = 450 GB - 4.5 TB
- **Nonprofit 990s**: 1-5 MB per org × 3M = 3-15 TB
- **Video Content**: Variable (streaming recommended over storage)
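
The ranges above are plain multiplication, and a short script makes it easy to re-run them if the per-item sizes or counts change. This is only a sketch: the helper names (`estimate_range`, `human`) are illustrative and not part of the Open Navigator codebase, and decimal units (1 GB = 1,000 MB) are assumed.

```python
# Back-of-the-envelope storage estimates using the figures quoted above.
# Sizes are tracked in MB; decimal units are assumed (1 GB = 1,000 MB).
GB = 1_000
TB = 1_000 * GB


def estimate_range(low_mb, high_mb, count):
    """Return the (low, high) total size in MB for `count` items."""
    return low_mb * count, high_mb * count


def human(mb):
    """Format a size given in MB as MB, GB, or TB."""
    if mb >= TB:
        return f"{mb / TB:g} TB"
    if mb >= GB:
        return f"{mb / GB:g} GB"
    return f"{mb:g} MB"


ESTIMATES = {
    "Meeting minutes": estimate_range(10, 100, 1_000),
    "Financial documents": estimate_range(5, 50, 90_000),
    "Nonprofit 990s": estimate_range(1, 5, 3_000_000),
}

for name, (low, high) in ESTIMATES.items():
    print(f"{name}: {human(low)} - {human(high)}")
```

These are raw (Bronze-level) totals; the Silver and Gold layers described below would be smaller.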

**Medallion Architecture (Delta Lake):**
- **Bronze Layer**: Raw scraped data (largest storage footprint)
- **Silver Layer**: Cleaned/standardized (50-70% compression)
- **Gold Layer**: Analyzed/aggregated (90%+ compression)

### API Rate Limits & Quotas

**Free Tier (No Cost):**
- Census Bureau: Unlimited downloads
- NCES: Unlimited bulk downloads
- ProPublica API: Respectful use (~1 req/sec suggested)
- IRS TEOS: Bulk data downloads (monthly updates)
- CISA .gov Domains: GitHub dataset (updated daily)

**Paid/Limited:**
- OpenAI API: Pay per token (required for LLM features)
- Harvard Dataverse: API key recommended (free registration)
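
For sources that ask for respectful use, such as the ~1 request per second suggested for the ProPublica API, a small client-side throttle is usually enough. The class below is a minimal sketch, not the project's actual rate limiter; `Throttle` and its parameters are hypothetical names, and the real limits are configured in `config/settings.py`.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (default: 1 second)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        """Block until at least `min_interval` has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each outgoing request keeps a loop at or below one request per `min_interval` seconds.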

:::info[Complete Technical Citations & Standards]
For full citations, licenses, API documentation, and technical specifications:

**[Citations & Data Sources](/docs/data-sources/citations)**

Includes:
- **Academic Research**: MeetingBank (ACL 2023), LocalView (Harvard), Council Data Project, City Scrapers
- **Government APIs**: U.S. Census, NCES, IRS, Open States
- **Standards**: OCD-ID (OCDEP 2), Popolo Project, Schema.org, CEDS, OMOP CDM (OHDSI), IATI v2.03
- **Data Models**: Microsoft CDM for Nonprofits, OMOP vocabulary system
- **Fact-Checking**: N/A (not currently integrated)
- **Nonprofit Data**: IRS BMF (43,726 orgs from 5 states)
- **Churches & Faith-Based**: 4,372 congregations from IRS data
- **Enterprise Tech**: Microsoft (Nonprofit CDM), Google (Data Commons), AWS (Open Data), Databricks (Unity Catalog, MLflow), Snowflake, Salesforce (NPSP)
- **BibTeX citations** for academic papers and research use
:::

---

## What You'll Find Here

### 🚀 Setup & Installation

Get the platform running:
- **[Quick Start](/docs/quickstart)** - Detailed installation instructions
- **[Quick Reference](/docs/quick-reference)** - CLI commands cheat sheet
- **[Architecture](/docs/architecture)** - System design and components

### 📊 Data Sources (Technical)

Technical details on data ingestion:
- **[Jurisdiction Discovery](/docs/data-sources/jurisdiction-discovery)** - Finding 90,000+ government websites
- **[Census Data](/docs/data-sources/census-data)** - Ingesting Census Bureau datasets
- **[HuggingFace Datasets](/docs/data-sources/huggingface-datasets)** - Pre-built meeting collections
- **[YouTube Discovery](/docs/data-sources/youtube-discovery)** - Video channel scraping

### πŸ› οΈ How-To Guides

Step-by-step technical guides:
- **[Jurisdiction Setup](/docs/guides/jurisdiction-setup)** - Configure discovery for your area
- **[HuggingFace Publishing](/docs/guides/huggingface-publishing)** - Publish datasets to HuggingFace Hub
- **[Handling Formats](/docs/guides/handling-formats)** - Process different document types
- **[Scraper Improvements](/docs/guides/scraper-improvements)** - Enhance scraping capabilities

### 🔌 Integrations

Connect external services:
- **[Dataverse Integration](/docs/integrations/dataverse)** - Harvard Dataverse API
- **[Frontend Integration](/docs/integrations/frontend)** - React application setup
- **[LocalView](/docs/integrations/localview)** - LocalView dataset ingestion

### 🚀 Deployment

Production deployment:
- **[Databricks Apps](/docs/deployment/databricks-apps)** - Deploy to Databricks
- **[Scale Deployment](/docs/deployment/scale)** - Handle large datasets
- **[Cost Management](/docs/deployment/costs)** - Optimize expenses

### 💻 Development

Contributing and development:
- **[Changelog](/docs/development/changelog)** - Version history
- **[Migration Guides](/docs/development/migration-v2)** - Upgrading between versions
- **[Refactoring Summary](/docs/development/refactoring-summary)** - Recent changes

## Quick Start (TL;DR)

```bash
# Clone and install
git clone https://github.com/getcommunityone/open-navigator-for-engagement.git
cd open-navigator-for-engagement
./install.sh

# Install frontend and docs
cd frontend && npm install && cd ..
cd website && npm install && cd ..

# Start all services
./start-all.sh

# Visit:
# - Main App:  http://localhost:5173
# - API Docs:  http://localhost:8000/docs
# - This Site: http://localhost:3000
```

## Architecture Overview

```
┌─────────────────────────────────────────┐
│         Open Navigator Platform         │
├─────────────────────────────────────────┤
│                                         │
│  ┌──────────────┐   ┌──────────────┐    │
│  │  React App   │   │  FastAPI     │    │
│  │  (Frontend)  │──▶│  (Backend)   │    │
│  │  Port 5173   │   │  Port 8000   │    │
│  └──────────────┘   └──────┬───────┘    │
│                            │            │
│  ┌─────────────────────────▼────────┐   │
│  │      Delta Lake (Data Storage)   │   │
│  │  • Bronze: Raw data              │   │
│  │  • Silver: Cleaned data          │   │
│  │  • Gold: Analyzed data           │   │
│  └──────────────────────────────────┘   │
└─────────────────────────────────────────┘
```

## Common Tasks

### Run Jurisdiction Discovery

```bash
source .venv/bin/activate

# Test run (100 jurisdictions)
python main.py discover-jurisdictions --limit 100

# Single state
python main.py discover-jurisdictions --state CA

# Full discovery (~30k jurisdictions)
python main.py discover-jurisdictions
```

### Ingest Reference Data

```bash
# Census jurisdictions (90,000+ entities)
python -m discovery.census_ingestion

# NCES school districts (13,000+)
python -m discovery.nces_ingestion

# Pre-built datasets
python discovery/meetingbank_ingestion.py
python discovery/city_scrapers_urls.py
python discovery/openstates_sources.py
```

### Scrape Meeting Minutes

```bash
# Batch scraping from discovered sites
python main.py scrape-batch --source discovered --limit 50

# Single jurisdiction
python main.py scrape --url "https://chicago.legistar.com" \
                      --state "IL" \
                      --municipality "Chicago"
```

### Publish to HuggingFace

```bash
# Requires HUGGINGFACE_TOKEN in .env
python main.py publish-to-hf --dataset all
python main.py publish-to-hf --dataset discovered-urls
python main.py publish-to-hf --dataset census --sample
```

## Technology Stack

### Backend
- **Python 3.11+** - Core language
- **FastAPI** - REST API framework
- **Delta Lake** - Data lakehouse storage
- **Databricks** - Production data platform
- **OpenAI API** - LLM capabilities

### Frontend
- **React 18** - UI framework
- **Vite** - Build tool
- **TypeScript** - Type safety
- **Leaflet** - Interactive maps

### Data Processing
- **Pandas** - Data manipulation
- **BeautifulSoup** - HTML parsing
- **PyPDF2** - PDF extraction
- **Tesseract OCR** - Image to text

### Deployment
- **Docker** - Containerization
- **tmux** - Session management
- **Databricks Apps** - Production hosting

## API Reference

### Start API Server

```bash
python main.py serve --host 0.0.0.0 --port 8000
```

Visit http://localhost:8000/docs for interactive API documentation.

### Example: Start Workflow

```bash
curl -X POST "http://localhost:8000/workflow/start" \
     -H "Content-Type: application/json" \
     -d '{
       "scrape_targets": [
         {
           "url": "https://chicago.legistar.com",
           "municipality": "Chicago",
           "state": "IL",
           "platform": "legistar"
         }
       ]
     }'
```
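
The same request can be built from Python with only the standard library. This is a sketch under assumptions: the endpoint and payload fields are copied from the `curl` example above, while `build_workflow_request` and `send` are illustrative helper names, and the shape of the response is not documented here, so the code just returns the raw body.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # local API server started with `python main.py serve`


def build_workflow_request(targets, base_url=API_BASE):
    """Build (but do not send) a POST request for /workflow/start."""
    body = json.dumps({"scrape_targets": targets}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/workflow/start",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return (status code, response body as text)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")


request = build_workflow_request([
    {
        "url": "https://chicago.legistar.com",
        "municipality": "Chicago",
        "state": "IL",
        "platform": "legistar",
    }
])
# With the API server running, call: status, body = send(request)
```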

### Example: Query Opportunities

```bash
curl "http://localhost:8000/opportunities?state=CA&urgency=critical"
```
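
When filters come from user input, it is safer to let `urllib.parse.urlencode` build the query string than to concatenate it by hand. The sketch below assumes only the two filters shown above (`state` and `urgency`); `opportunities_url` is a hypothetical helper, and any other parameter names would need to be checked against the interactive API docs.

```python
from urllib.parse import urlencode


def opportunities_url(base_url="http://localhost:8000", **filters):
    """Return the /opportunities URL with the given filters URL-encoded."""
    # Drop filters that were passed as None so they do not appear in the query.
    query = urlencode({key: value for key, value in filters.items() if value is not None})
    return f"{base_url}/opportunities" + (f"?{query}" if query else "")


print(opportunities_url(state="CA", urgency="critical"))
# http://localhost:8000/opportunities?state=CA&urgency=critical
```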

## Development Workflow

### 1. Local Development

```bash
# Terminal 1: API (with hot reload)
source .venv/bin/activate
python main.py serve --reload

# Terminal 2: Frontend (with hot reload)
cd frontend
npm run dev

# Terminal 3: Documentation
cd website
npm start
```

### 2. Testing

```bash
# Run all tests
pytest

# With coverage
pytest --cov=agents --cov=pipeline --cov=visualization

# Specific test file
pytest tests/test_agents.py
```

### 3. Deployment

```bash
# Deploy to Databricks
export DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
export DATABRICKS_TOKEN=dapi...
./scripts/deploy-databricks-app.sh
```

## Data Pipeline

### Medallion Architecture

```
Bronze (Raw)          Silver (Cleaned)       Gold (Analyzed)
────────────────────────────────────────────────────────────
Scraped PDFs     →    Extracted text    →    Classifications
Meeting videos   →    Transcripts       →    Sentiment scores
Budget docs      →    Line items        →    Budget analysis
Form 990s        →    Financial data    →    Spending patterns
```

### File Locations

- **Bronze**: `data/bronze/` - Raw downloaded files
- **Silver**: `data/silver/` - Cleaned and standardized
- **Gold**: `data/gold/` - Enriched with analysis
- **Cache**: `cache/` - Temporary processing files
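
Scripts that read or write these directories can resolve paths through one small helper instead of hard-coding strings. The following is a sketch only: the directory names come from the list above, but `layer_path` and the `IL/chicago.json` sub-layout are hypothetical and may not match how the pipeline actually organizes files.

```python
from pathlib import Path

# Directory for each layer, relative to the repository root (from the list above).
LAYER_DIRS = {
    "bronze": Path("data/bronze"),
    "silver": Path("data/silver"),
    "gold": Path("data/gold"),
    "cache": Path("cache"),
}


def layer_path(layer, *parts, root=Path(".")):
    """Return the path for `parts` inside the given layer directory."""
    try:
        base = LAYER_DIRS[layer]
    except KeyError:
        raise ValueError(
            f"unknown layer {layer!r}; expected one of {sorted(LAYER_DIRS)}"
        ) from None
    return root / base / Path(*parts)


# Hypothetical file layout, for illustration only.
print(layer_path("silver", "IL", "chicago.json").as_posix())
```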

## Configuration

### Environment Variables

Create a `.env` file:

```bash
# Required
OPENAI_API_KEY=sk-...

# Optional (for production)
DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
DATABRICKS_TOKEN=dapi...

# Optional (for publishing)
HUGGINGFACE_TOKEN=hf_...

# Optional (for Harvard Dataverse)
DATAVERSE_API_KEY=...
```
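
A startup check can fail fast when the required key is missing rather than at the first LLM call. This sketch assumes the variables have already been loaded into the process environment (how the project loads `.env` is not shown here), and `read_settings` is an illustrative name, not an existing function.

```python
import os

REQUIRED = ("OPENAI_API_KEY",)
OPTIONAL = (
    "DATABRICKS_HOST",
    "DATABRICKS_TOKEN",
    "HUGGINGFACE_TOKEN",
    "DATAVERSE_API_KEY",
)


def read_settings(env=None):
    """Return a dict of settings, raising if a required variable is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    settings = {name: env[name] for name in REQUIRED}
    # Optional variables default to None when unset or empty.
    settings.update({name: env.get(name) or None for name in OPTIONAL})
    return settings
```

Passing a plain dict as `env` makes the check easy to exercise in tests without touching the real environment.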

### Settings File

Edit `config/settings.py` for:
- Delta Lake paths
- Scraping rate limits
- Batch sizes
- Model configurations

## Contributing

### 1. Fork & Clone

```bash
git clone https://github.com/YOUR-USERNAME/open-navigator-for-engagement.git
cd open-navigator-for-engagement
git remote add upstream https://github.com/getcommunityone/open-navigator-for-engagement.git
```

### 2. Create Branch

```bash
git checkout -b feature/your-feature-name
```

### 3. Make Changes

- Add tests for new features
- Update documentation
- Follow existing code style
- Keep commits focused and atomic

### 4. Submit PR

```bash
git push origin feature/your-feature-name
# Then create PR on GitHub
```

See [CONTRIBUTING.md](https://github.com/getcommunityone/open-navigator-for-engagement/blob/main/CONTRIBUTING.md) for details.

## Troubleshooting

### Port Already in Use

```bash
# Find process using port
lsof -i :8000
lsof -i :5173
lsof -i :3000

# Kill process
kill -9 <PID>
```
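
On systems without `lsof`, a quick connection test gives a similar answer. The sketch below uses only the standard library; `port_in_use` is a hypothetical helper, and it reports only whether something is accepting TCP connections on the port, not which process owns it.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


# Ports used by the services in this guide.
for name, port in {"API": 8000, "Frontend": 5173, "Docs": 3000}.items():
    state = "in use" if port_in_use(port) else "free"
    print(f"{name} (port {port}): {state}")
```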

### Dependencies Not Installing

```bash
# Clear cache and reinstall
rm -rf .venv
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

### Scraping Failures

Check logs:
```bash
tail -f logs/scraper.log
```

Adjust rate limits in `config/settings.py`.

## Next Steps

1. **Read Architecture** → [System Design](/docs/architecture)
2. **Set Up Environment** → [Quick Start](/docs/quickstart)
3. **Run Discovery** → [Jurisdiction Setup](/docs/guides/jurisdiction-setup)
4. **Deploy to Production** → [Databricks Apps](/docs/deployment/databricks-apps)
5. **Contribute** → [GitHub Issues](https://github.com/getcommunityone/open-navigator-for-engagement/issues)

## Support

- **GitHub Issues**: [Report bugs or request features](https://github.com/getcommunityone/open-navigator-for-engagement/issues)
- **Documentation**: Browse the sidebar
- **API Docs**: http://localhost:8000/docs
- **Email**: johnbowyer@communityone.com