---
sidebar_position: 10
---
# D Drive Configuration for Large Datasets
Configure Open Navigator to store large datasets (ACS census data, IRS 990s, etc.) on an external drive or secondary volume to avoid filling your primary disk.
## Why Use External Storage?
Open Navigator downloads and caches large datasets:
| Dataset | Size | Records |
|---------|------|---------|
| **ACS 5-Year (All States)** | ~15 GB | 85,000 tracts |
| **IRS Form 990s** | ~100 GB | 1.8M nonprofits |
| **Meeting Minutes (PDFs)** | ~500 GB | 90,000 jurisdictions |
| **Legislative Bills** | ~50 GB | Millions of bills |
**Total potential storage**: The datasets above add up to roughly 665 GB, so plan for up to 1 TB.
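Before choosing a location, it can help to compare these sizes against the free space on a candidate volume. The sketch below is only an illustration: `DATASET_SIZES_GB` copies the approximate figures from the table, and `required_gb` and `fits_on_volume` are hypothetical helpers, not part of Open Navigator. The 20% headroom is an assumption.

```python
import shutil

# Approximate sizes copied from the table above, in GB.
DATASET_SIZES_GB = {
    "acs": 15,
    "irs_990": 100,
    "meeting_pdfs": 500,
    "legislative": 50,
}


def required_gb(datasets):
    """Sum the approximate sizes of the selected datasets."""
    return sum(DATASET_SIZES_GB[name] for name in datasets)


def fits_on_volume(path, datasets, headroom=0.2):
    """Return True if the volume holding `path` has room for the datasets.

    `headroom` reserves an extra fraction of the required space (an assumption).
    `path` must already exist.
    """
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb(datasets) * (1 + headroom)
```

For example, `fits_on_volume("D:/", ["acs", "irs_990"])` checks whether the D drive can hold the ACS and IRS datasets with headroom to spare.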
## Recommended Directory Structure
Create this structure on your D drive (or external storage):
```
D:/open-navigator-data/
├── acs/                    # American Community Survey demographic data
│   ├── B19013_county_*_2022.parquet
│   ├── B27010_county_*_2022.parquet
│   └── acs_2022_ALL/       # Bulk downloads
├── irs/                    # IRS nonprofit data
│   ├── bmf/                # Business Master File
│   └── 990s/               # Form 990 PDFs and XML
├── legislative/            # Legislative data
│   ├── bills/
│   ├── votes/
│   └── legislators/
├── meetings/               # Meeting minutes and agendas
│   ├── pdfs/
│   └── transcripts/
└── cache/                  # General cache
```
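The same layout can be created from Python on any platform. This is a minimal sketch: `SUBDIRS` mirrors the directories in the tree above, and `create_data_tree` is a hypothetical helper name, not an existing Open Navigator function.

```python
from pathlib import Path

# Subdirectories from the layout above, relative to the base data directory.
SUBDIRS = [
    "acs/acs_2022_ALL",
    "irs/bmf",
    "irs/990s",
    "legislative/bills",
    "legislative/votes",
    "legislative/legislators",
    "meetings/pdfs",
    "meetings/transcripts",
    "cache",
]


def create_data_tree(base_dir):
    """Create the recommended layout under base_dir and return the paths."""
    base = Path(base_dir)
    created = []
    for rel in SUBDIRS:
        target = base / rel
        # parents=True builds intermediate folders; exist_ok=True makes re-runs safe.
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created
```

Calling `create_data_tree("D:/open-navigator-data")` (or the equivalent path on your platform) is safe to repeat, since existing directories are left untouched.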
## Platform-Specific Setup
### Windows (D Drive)
**1. Create the data directory:**
```powershell
# PowerShell
New-Item -Path "D:\open-navigator-data" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\acs" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\irs" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\cache" -ItemType Directory -Force
```
**2. Set permissions (if needed):**
```powershell
# Give yourself full control
icacls "D:\open-navigator-data" /grant "${env:USERNAME}:(OI)(CI)F" /T
```
**3. Use in Python scripts:**
```python
import os
from pathlib import Path

# Option 1: Explicit Windows path
data_dir = Path("D:/open-navigator-data/acs")

# Option 2: Base directory from the environment variable, with a D drive fallback
base_dir = Path(os.environ.get("OPEN_NAV_DATA_DIR", "D:/open-navigator-data"))
data_dir = base_dir / "acs"
```
### Linux (Mounted Drive)
**1. Mount the drive:**
```bash
# Find your drive
lsblk
# Create mount point
sudo mkdir -p /mnt/d
# Mount (replace /dev/sdb1 with your drive)
sudo mount /dev/sdb1 /mnt/d
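# Auto-mount on boot (append to /etc/fstab)
# Adjust "ext4" to your filesystem; /dev/sdX names can change between boots,
# so consider using UUID=<uuid> (from `blkid`) instead of /dev/sdb1
echo "/dev/sdb1 /mnt/d ext4 defaults 0 0" | sudo tee -a /etc/fstab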
```
**2. Create directories:**
```bash
mkdir -p /mnt/d/open-navigator-data/{acs,irs,legislative,meetings,cache}
```
**3. Set ownership:**
```bash
# Give yourself ownership
sudo chown -R $USER:$USER /mnt/d/open-navigator-data
chmod -R 755 /mnt/d/open-navigator-data
```
**4. Use in scripts:**
```python
from pathlib import Path
data_dir = Path("/mnt/d/open-navigator-data/acs")
```
### macOS (External Drive)
**1. External drives auto-mount to `/Volumes`:**
```bash
# Your drive might be at:
ls /Volumes/
# Example: /Volumes/MyExternalDrive
```
**2. Create directories:**
```bash
mkdir -p "/Volumes/MyExternalDrive/open-navigator-data"/{acs,irs,legislative,meetings,cache}
```
**3. Use in scripts:**
```python
from pathlib import Path
data_dir = Path("/Volumes/MyExternalDrive/open-navigator-data/acs")
```
### WSL (Windows Subsystem for Linux)
**Windows drives are auto-mounted in WSL:**
```bash
# D drive is at /mnt/d
cd /mnt/d
# Create directories
mkdir -p /mnt/d/open-navigator-data/{acs,irs,legislative,meetings,cache}
```
**Use in scripts (WSL path):**
```python
from pathlib import Path
# WSL path to Windows D drive
data_dir = Path("/mnt/d/open-navigator-data/acs")
```
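If the same script needs to run on several of these platforms, the per-platform paths above can be selected at runtime. The following is a sketch under stated assumptions: `default_data_dir` is a hypothetical helper, the fallback paths simply repeat the examples from the sections above, and `MyExternalDrive` remains a placeholder volume name.

```python
import os
import sys
from pathlib import Path


def default_data_dir(platform_name=sys.platform, env=None):
    """Return the base data directory for the given platform.

    OPEN_NAV_DATA_DIR, if set, always wins over the platform defaults.
    """
    env = os.environ if env is None else env
    override = env.get("OPEN_NAV_DATA_DIR")
    if override:
        return Path(override)
    if platform_name.startswith("win"):
        return Path("D:/open-navigator-data")
    if platform_name == "darwin":
        # Placeholder volume name from the macOS section; replace with yours.
        return Path("/Volumes/MyExternalDrive/open-navigator-data")
    # Linux and WSL both use the /mnt/d mount point in this guide.
    return Path("/mnt/d/open-navigator-data")
```

Passing `platform_name` and `env` explicitly keeps the function easy to test; in normal use, `default_data_dir() / "acs"` is enough.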
## Configuration Methods
### Method 1: Environment Variable (Recommended)
**Advantages**: Single configuration, all scripts respect it
**Setup:**
Add to `.env` file in project root:
```bash
# .env
OPEN_NAV_DATA_DIR=D:/open-navigator-data
# Or Linux/Mac:
# OPEN_NAV_DATA_DIR=/mnt/d/open-navigator-data
```
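Python does not read a `.env` file on its own; something has to load it into `os.environ` before `os.environ.get` can see the value. Whether Open Navigator already does this is not stated here, so the following is a minimal standard-library sketch. `load_env_file` is a hypothetical helper; the third-party `python-dotenv` package offers `load_dotenv()` for the same purpose.

```python
import os
from pathlib import Path


def load_env_file(path=".env", environ=None):
    """Load KEY=VALUE lines from a .env file into `environ` (default: os.environ).

    Variables that are already set are not overwritten. Blank lines and
    lines starting with '#' are skipped. Returns the values that were set.
    """
    environ = os.environ if environ is None else environ
    loaded = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Drop surrounding whitespace and optional quotes around the value.
        value = value.strip().strip('"').strip("'")
        if key and key not in environ:
            environ[key] = value
            loaded[key] = value
    return loaded
```

Call `load_env_file()` once at the start of a script, before any code reads `OPEN_NAV_DATA_DIR`.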
**Update scripts to use it:**
```python
# scripts/datasources/census/acs_ingestion.py
import os
from pathlib import Path
# Use environment variable with fallback
base_data_dir = os.environ.get("OPEN_NAV_DATA_DIR", "data/cache")
acs_dir = Path(base_data_dir) / "acs"
acs = ACSDataIngestion(data_dir=acs_dir)
```
### Method 2: Config File
**Create `config/paths.py`:**
```python
from pathlib import Path
import os
# Base data directory (customizable)
BASE_DATA_DIR = Path(os.environ.get("OPEN_NAV_DATA_DIR", "data"))
# Subdirectories
ACS_DATA_DIR = BASE_DATA_DIR / "acs"
IRS_DATA_DIR = BASE_DATA_DIR / "irs"
LEGISLATIVE_DATA_DIR = BASE_DATA_DIR / "legislative"
MEETINGS_DATA_DIR = BASE_DATA_DIR / "meetings"
CACHE_DIR = BASE_DATA_DIR / "cache"
# Create directories if they don't exist
for directory in [ACS_DATA_DIR, IRS_DATA_DIR, LEGISLATIVE_DATA_DIR,
                  MEETINGS_DATA_DIR, CACHE_DIR]:
    directory.mkdir(parents=True, exist_ok=True)
```
**Use in scripts:**
```python
from config.paths import ACS_DATA_DIR
acs = ACSDataIngestion(data_dir=ACS_DATA_DIR)
```
### Method 3: Command-Line Argument
**For one-off downloads:**
```python
# scripts/datasources/census/download_acs.py
import argparse
from pathlib import Path
parser = argparse.ArgumentParser()
parser.add_argument("--data-dir", type=Path, default="data/cache/acs",
                    help="Directory to store ACS data")
args = parser.parse_args()
acs = ACSDataIngestion(data_dir=args.data_dir)
```
**Usage:**
```bash
# Use D drive
python download_acs.py --data-dir D:/open-navigator-data/acs
# Use default
python download_acs.py
```
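The three methods can also be combined in one script. The precedence shown here (command-line argument, then `OPEN_NAV_DATA_DIR`, then a project default) is an assumption for illustration, not documented Open Navigator behavior, and `resolve_base_dir` and `parse_args` are hypothetical names.

```python
import argparse
import os
from pathlib import Path

DEFAULT_BASE_DIR = Path("data/cache")  # same fallback as in Method 1


def resolve_base_dir(cli_value=None, env=None):
    """Pick the base data directory: CLI value, then env var, then default."""
    env = os.environ if env is None else env
    if cli_value is not None:
        return Path(cli_value)
    env_value = env.get("OPEN_NAV_DATA_DIR")
    if env_value:
        return Path(env_value)
    return DEFAULT_BASE_DIR


def parse_args(argv=None):
    """Parse an optional --data-dir argument (None when omitted)."""
    parser = argparse.ArgumentParser(description="Resolve the data directory")
    parser.add_argument("--data-dir", type=Path, default=None,
                        help="Base directory for downloaded data")
    return parser.parse_args(argv)
```

A script would then use `resolve_base_dir(parse_args().data_dir) / "acs"`, so a one-off `--data-dir` overrides the environment without editing `.env`.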
## Example Configurations
### Small Project (Default)
**Storage**: Local project directory
**Size**: < 10 GB
**Speed**: Fastest (same disk as code)
```python
# Uses data/cache/ in project directory
acs = ACSDataIngestion()
```
### Medium Project (D Drive)
**Storage**: D drive or external SSD
**Size**: 10-100 GB
**Speed**: Fast
```python
from pathlib import Path
acs = ACSDataIngestion(data_dir=Path("D:/open-navigator-data/acs"))
```
### Large Project (Network Storage)
**Storage**: NAS or cloud-mounted drive
**Size**: 100+ GB
**Speed**: Slower but shared
```python
from pathlib import Path
# Windows network path
data_dir = Path("//server/open-navigator/acs")
# Or mounted network drive (Linux)
# data_dir = Path("/mnt/nas/open-navigator/acs")
acs = ACSDataIngestion(data_dir=data_dir)
```
## Storage Requirements by Dataset
### ACS Census Data
| Download Type | Size | Time (50 Mbps) |
|---------------|------|----------------|
| Single table, single state | ~5 MB | < 1 sec |
| Single table, all states | ~50 MB | ~10 sec |
| All tables, all states (API) | ~500 MB | ~2 min |
| Bulk download (all data) | ~15 GB | ~40 min |
**Recommended**: Start with targeted API downloads; use the bulk download only if you need the full dataset.
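The times in the table follow from simple arithmetic: convert megabytes to megabits (multiply by 8) and divide by the link speed. The sketch below reproduces that estimate for other connection speeds; `download_seconds` and `format_duration` are hypothetical helpers, and real transfers are usually somewhat slower because of protocol and API overhead.

```python
def download_seconds(size_mb, link_mbps=50.0):
    """Estimate transfer time in seconds for size_mb megabytes at link_mbps."""
    # 1 megabyte = 8 megabits, so time = size * 8 / speed.
    return size_mb * 8 / link_mbps


def format_duration(seconds):
    """Render a duration as whole seconds or whole minutes."""
    if seconds < 60:
        return f"{seconds:.0f} sec"
    return f"{seconds / 60:.0f} min"
```

For the 15 GB bulk download, `download_seconds(15_000)` gives 2400 seconds, which matches the 40 minutes listed in the table.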
### IRS Nonprofit Data
| Dataset | Size | Records |
|---------|------|---------|
| Business Master File (BMF) | ~500 MB | 1.8M nonprofits |
| Form 990 XML (1 year) | ~20 GB | ~300K filings |
| Form 990 PDFs (1 year) | ~50 GB | ~300K PDFs |
| Full 990 archive (10 years) | ~500 GB | 3M+ filings |
**Recommended**: Use the D drive or external storage for IRS data.
### Legislative Data
| Dataset | Size | Records |
|---------|------|---------|
| OpenStates bills (1 state, 1 year) | ~100 MB | ~5K bills |
| OpenStates bills (all states, 1 year) | ~5 GB | ~250K bills |
| OpenStates bills (all states, 10 years) | ~50 GB | 2.5M bills |
**Recommended**: D drive for multi-year data.
### Meeting Minutes
| Dataset | Size | Records |
|---------|------|---------|
| Meeting agendas (text) | ~1 GB | 100K meetings |
| Meeting PDFs (cached) | ~500 GB | 1M documents |
**Recommended**: External storage for PDF archive.
## Verify Configuration
**Test your setup:**
```python
import shutil
from pathlib import Path

import pandas as pd


def get_free_space(path: Path) -> float:
    """Get free disk space in GB."""
    stat = shutil.disk_usage(path)
    return stat.free / (1024**3)


def test_storage_config(data_dir: Path):
    """Test that we can write/read data."""
    # Create test directory
    test_dir = data_dir / "test"
    test_dir.mkdir(parents=True, exist_ok=True)

    # Write test file (to_parquet needs a Parquet engine such as pyarrow installed)
    test_file = test_dir / "test.parquet"
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    df.to_parquet(test_file)

    # Read test file
    df2 = pd.read_parquet(test_file)

    # Verify
    assert df.equals(df2), "Data mismatch!"

    # Cleanup
    test_file.unlink()
    test_dir.rmdir()

    print(f"✅ Storage configuration verified: {data_dir.absolute()}")
    print(f"✅ Free space: {get_free_space(data_dir):.2f} GB")


# Test
test_storage_config(Path("D:/open-navigator-data"))
```
## Performance Tips
### 1. Use SSD for Cache
- **C drive (SSD)**: Small, frequently accessed files
- **D drive (HDD)**: Large, rarely accessed archives
### 2. Organize by Access Frequency
```
D:/open-navigator-data/
├── hot/              # Frequently accessed (keep on SSD if possible)
│   └── acs/          # Census data for current analysis
└── cold/             # Rarely accessed (OK on HDD)
    ├── irs/990s/     # Historical 990 PDFs
    └── archive/      # Old datasets
```
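Files can be moved from `hot/` to `cold/` automatically once they have gone unmodified for a while. This is a sketch only: `move_cold_files` is a hypothetical helper, the 90-day threshold is an assumption, and modification time is used as a stand-in for "last accessed" because access times are not reliably recorded on every filesystem.

```python
import shutil
import time
from pathlib import Path


def move_cold_files(hot_dir, cold_dir, days=90, dry_run=True):
    """Move files not modified in `days` days from hot_dir to cold_dir.

    Relative paths are preserved. With dry_run=True nothing is moved;
    the function only returns the files that would be moved.
    """
    hot, cold = Path(hot_dir), Path(cold_dir)
    cutoff = time.time() - days * 86400  # seconds per day
    candidates = [
        p for p in hot.rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    ]
    if not dry_run:
        for src in candidates:
            dest = cold / src.relative_to(hot)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dest))
    return candidates
```

Run it with the default `dry_run=True` first to review the list, then repeat with `dry_run=False` to perform the move.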
### 3. Use Compression for Archives
```python
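# Compress old data
# Note: Parquet files written by pandas are already Snappy-compressed by default,
# so gzip usually gains little there; text formats such as XML or CSV shrink the most.
import gzip
import shutil

with open("data.parquet", "rb") as f_in:
    with gzip.open("data.parquet.gz", "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)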
```
### 4. Clean Up Old Cache
```python
from datetime import datetime, timedelta
from pathlib import Path


def cleanup_old_cache(cache_dir: Path, days: int = 30):
    """Delete cache files older than N days."""
    cutoff = datetime.now() - timedelta(days=days)
    for file in cache_dir.rglob("*.parquet"):
        if datetime.fromtimestamp(file.stat().st_mtime) < cutoff:
            print(f"Deleting old cache: {file}")
            file.unlink()


cleanup_old_cache(Path("D:/open-navigator-data/cache"))
```
## Troubleshooting
### "Permission denied" on D drive
**Windows:**
```powershell
# Give yourself full control
icacls "D:\open-navigator-data" /grant "${env:USERNAME}:(OI)(CI)F" /T
```
**Linux:**
```bash
# Change ownership
sudo chown -R $USER:$USER /mnt/d/open-navigator-data
```
### "No space left on device"
**Check disk space:**
```bash
# Windows (PowerShell)
Get-PSDrive D | Select-Object Used,Free
# Linux/WSL
df -h /mnt/d
# macOS
df -h /Volumes/MyExternalDrive
```
**Free up space:**
1. Delete old cache files
2. Compress archive data
3. Move rarely used data to cloud storage
### Slow read/write performance
**Causes**:
- USB 2.0 external drive (upgrade to USB 3.0+)
- Network drive over slow connection
- HDD instead of SSD (SSDs are typically several times faster for sequential reads and far faster for random access)
**Solutions**:
- Use SSD for frequently accessed data
- Enable write caching (Windows)
- Use local storage for processing, archive to external
### WSL can't access D drive
**Fix**:
```bash
# Verify drive is mounted
ls /mnt/d
# If not, mount it manually:
sudo mkdir -p /mnt/d
sudo mount -t drvfs D: /mnt/d
```
## Next Steps
1. **Create data directory** on your D drive or external storage
2. **Set environment variable** `OPEN_NAV_DATA_DIR` in `.env`
3. **Test configuration** using the verification script above
4. **Download ACS data** to test storage setup
5. **Monitor disk usage** as you add more datasets
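For step 5, a per-dataset size report shows which directories are growing. This is a minimal sketch with hypothetical helper names (`dir_size_bytes`, `usage_report`); it walks every file, so it can take a while on very large archives.

```python
from pathlib import Path


def dir_size_bytes(path):
    """Total size in bytes of all regular files under path."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())


def usage_report(base_dir):
    """Map each top-level subdirectory of base_dir to its size in GB."""
    base = Path(base_dir)
    return {
        child.name: dir_size_bytes(child) / 1024**3
        for child in sorted(base.iterdir())
        if child.is_dir()
    }
```

For example, iterate over `usage_report("D:/open-navigator-data").items()` and print each name with its size to see how `acs`, `irs`, and the other directories compare.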
## Related Documentation
- [Census ACS Integration](../data-sources/census-acs.md) - Download demographic data
- [IRS 990 Downloads](../data-sources/nonprofit-sources.md) - Nonprofit data
- [Data Architecture](../development/data-architecture.md) - Overall data flow