---
sidebar_position: 10
---
# D Drive Configuration for Large Datasets

Configure Open Navigator to store large datasets (ACS census data, IRS 990s, etc.) on an external drive or secondary volume to avoid filling your primary disk.

## Why Use External Storage?

Open Navigator downloads and caches large datasets:

| Dataset | Size | Records |
|---------|------|---------|
| **ACS 5-Year (All States)** | ~15 GB | 85,000 tracts |
| **IRS Form 990s** | ~100 GB | 1.8M nonprofits |
| **Meeting Minutes (PDFs)** | ~500 GB | 90,000 jurisdictions |
| **Legislative Bills** | ~50 GB | Millions of bills |

**Total potential storage**: About 665 GB for the datasets listed above, so plan for up to 1 TB to leave headroom.
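
The four sizes in the table above sum to roughly 665 GB. As a minimal sketch of that arithmetic, combined with a free-space check before starting a download, the snippet below uses only the standard library. The `DATASET_SIZES_GB` constant and the `has_room_for` helper are illustrative names, not part of Open Navigator.

```python
import shutil
from pathlib import Path

# Approximate sizes from the table above, in GB (illustrative constant).
DATASET_SIZES_GB = {
    "acs_5_year": 15,
    "irs_990s": 100,
    "meeting_minutes": 500,
    "legislative_bills": 50,
}


def total_size_gb(sizes: dict) -> float:
    """Sum the estimated dataset sizes in GB."""
    return sum(sizes.values())


def has_room_for(path: Path, required_gb: float) -> bool:
    """Return True if the volume holding `path` has at least `required_gb` free."""
    free_gb = shutil.disk_usage(path).free / (1024 ** 3)
    return free_gb >= required_gb


if __name__ == "__main__":
    needed = total_size_gb(DATASET_SIZES_GB)
    print(f"Estimated total: {needed} GB")
    print(f"Enough room on this volume: {has_room_for(Path('.'), needed)}")
```

Pointing `has_room_for` at the intended data directory rather than `.` checks the target volume instead of the project disk.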
## Recommended Directory Structure

Create this structure on your D drive (or external storage):

```
D:/open-navigator-data/
├── acs/ # American Community Survey demographic data
│   ├── B19013_county_*_2022.parquet
│   ├── B27010_county_*_2022.parquet
│   └── acs_2022_ALL/ # Bulk downloads
├── irs/ # IRS nonprofit data
│   ├── bmf/ # Business Master File
│   └── 990s/ # Form 990 PDFs and XML
├── legislative/ # Legislative data
│   ├── bills/
│   ├── votes/
│   └── legislators/
├── meetings/ # Meeting minutes and agendas
│   ├── pdfs/
│   └── transcripts/
└── cache/ # General cache
```
## Platform-Specific Setup

### Windows (D Drive)

**1. Create the data directory:**

```powershell
# PowerShell (-Force makes each command safe to re-run if the directory already exists)
New-Item -Path "D:\open-navigator-data" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\acs" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\irs" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\legislative" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\meetings" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\cache" -ItemType Directory -Force
```
**2. Set permissions (if needed):**

```powershell
# Give yourself full control
icacls "D:\open-navigator-data" /grant "${env:USERNAME}:(OI)(CI)F" /T
```
**3. Use in Python scripts:**

```python
import os
from pathlib import Path

# Option 1: Explicit Windows path
data_dir = Path("D:/open-navigator-data/acs")

# Option 2: Base directory from the environment variable, with the dataset
# subfolder appended so both the variable and the fallback resolve to .../acs
base_dir = Path(os.environ.get("OPEN_NAV_DATA_DIR", "D:/open-navigator-data"))
data_dir = base_dir / "acs"
```
### Linux (Mounted Drive)

**1. Mount the drive:**

```bash
# Find your drive
lsblk

# Create mount point
sudo mkdir -p /mnt/d

# Mount (replace /dev/sdb1 with your drive)
sudo mount /dev/sdb1 /mnt/d

# Auto-mount on boot (add to /etc/fstab; this entry assumes an ext4 filesystem).
# Device names such as /dev/sdb1 can change between boots, so consider using
# the UUID reported by `sudo blkid /dev/sdb1` (UUID=...) in place of the device path.
echo "/dev/sdb1 /mnt/d ext4 defaults 0 0" | sudo tee -a /etc/fstab
```
**2. Create directories:**

```bash
mkdir -p /mnt/d/open-navigator-data/{acs,irs,legislative,meetings,cache}
```

**3. Set ownership:**

```bash
# Give yourself ownership
sudo chown -R $USER:$USER /mnt/d/open-navigator-data
chmod -R 755 /mnt/d/open-navigator-data
```

**4. Use in scripts:**

```python
from pathlib import Path

data_dir = Path("/mnt/d/open-navigator-data/acs")
```
### macOS (External Drive)

**1. External drives auto-mount to `/Volumes`:**

```bash
# Your drive might be at:
ls /Volumes/
# Example: /Volumes/MyExternalDrive
```

**2. Create directories:**

```bash
mkdir -p "/Volumes/MyExternalDrive/open-navigator-data"/{acs,irs,legislative,meetings,cache}
```

**3. Use in scripts:**

```python
from pathlib import Path

data_dir = Path("/Volumes/MyExternalDrive/open-navigator-data/acs")
```
### WSL (Windows Subsystem for Linux)

**Windows drives are auto-mounted in WSL:**

```bash
# D drive is at /mnt/d
cd /mnt/d

# Create directories
mkdir -p /mnt/d/open-navigator-data/{acs,irs,legislative,meetings,cache}
```

**Use in scripts (WSL path):**

```python
from pathlib import Path

# WSL path to Windows D drive
data_dir = Path("/mnt/d/open-navigator-data/acs")
```
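
If the same configuration value is shared between native Windows and WSL, the drive-letter path has to be translated to the `/mnt/<letter>` form. The following is a minimal sketch of that mapping using `pathlib`; `windows_to_wsl` is a hypothetical helper, and it assumes the default WSL mount root of `/mnt`. Inside WSL, the `wslpath` command performs the same conversion.

```python
from pathlib import PurePosixPath, PureWindowsPath


def windows_to_wsl(path: str) -> PurePosixPath:
    """Translate a drive-letter Windows path to its default WSL mount path."""
    win = PureWindowsPath(path)
    # Reject relative paths and UNC shares, which have no drive letter.
    if not win.drive or not win.drive.endswith(":"):
        raise ValueError(f"Expected a drive-letter path, got: {path!r}")
    drive_letter = win.drive[0].lower()
    # win.parts[0] is the drive anchor (for example "D:\\"); keep the rest.
    return PurePosixPath("/mnt", drive_letter, *win.parts[1:])
```

For example, `windows_to_wsl("D:/open-navigator-data/acs")` yields `/mnt/d/open-navigator-data/acs`.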
## Configuration Methods

### Method 1: Environment Variable (Recommended)

**Advantages**: Single configuration, all scripts respect it

**Setup:**

Add to `.env` file in project root:

```bash
# .env
OPEN_NAV_DATA_DIR=D:/open-navigator-data
# Or Linux/Mac:
# OPEN_NAV_DATA_DIR=/mnt/d/open-navigator-data
```
**Update scripts to use it:**

```python
# scripts/datasources/census/acs_ingestion.py
import os
from pathlib import Path

# Use environment variable with fallback
base_data_dir = os.environ.get("OPEN_NAV_DATA_DIR", "data/cache")
acs_dir = Path(base_data_dir) / "acs"

acs = ACSDataIngestion(data_dir=acs_dir)
```
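
Note that `os.environ.get` only sees variables already present in the process environment, so a `.env` file has no effect until something loads it. Whether Open Navigator does this automatically is not stated here. One common option is `load_dotenv()` from the `python-dotenv` package. The sketch below shows a standard-library alternative. `load_env_file` is a hypothetical helper that handles only plain `KEY=VALUE` lines (no quoting or `export` prefixes).

```python
import os
from pathlib import Path


def load_env_file(env_path: Path) -> dict:
    """Parse simple KEY=VALUE lines and add them to os.environ.

    Variables that are already set in the environment are not overwritten.
    """
    loaded = {}
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        os.environ.setdefault(key, value)
        loaded[key] = os.environ[key]
    return loaded
```

Calling `load_env_file(Path(".env"))` at the top of a script, before reading `OPEN_NAV_DATA_DIR`, makes the value from the file visible to `os.environ.get`.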
### Method 2: Config File

**Create `config/paths.py`:**

```python
from pathlib import Path
import os

# Base data directory (customizable)
BASE_DATA_DIR = Path(os.environ.get("OPEN_NAV_DATA_DIR", "data"))

# Subdirectories
ACS_DATA_DIR = BASE_DATA_DIR / "acs"
IRS_DATA_DIR = BASE_DATA_DIR / "irs"
LEGISLATIVE_DATA_DIR = BASE_DATA_DIR / "legislative"
MEETINGS_DATA_DIR = BASE_DATA_DIR / "meetings"
CACHE_DIR = BASE_DATA_DIR / "cache"

# Create directories if they don't exist
for directory in [ACS_DATA_DIR, IRS_DATA_DIR, LEGISLATIVE_DATA_DIR,
                  MEETINGS_DATA_DIR, CACHE_DIR]:
    directory.mkdir(parents=True, exist_ok=True)
```

**Use in scripts:**

```python
from config.paths import ACS_DATA_DIR

acs = ACSDataIngestion(data_dir=ACS_DATA_DIR)
```
### Method 3: Command-Line Argument

**For one-off downloads:**

```python
# scripts/datasources/census/download_acs.py
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("--data-dir", type=Path, default="data/cache/acs",
                    help="Directory to store ACS data")
args = parser.parse_args()

acs = ACSDataIngestion(data_dir=args.data_dir)
```

**Usage:**

```bash
# Use D drive
python download_acs.py --data-dir D:/open-navigator-data/acs

# Use default
python download_acs.py
```
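
The three methods can also be combined with a fixed precedence. The sketch below assumes one reasonable ordering: an explicit command-line value wins, then `OPEN_NAV_DATA_DIR`, then the project-local default. `resolve_data_dir` and `DEFAULT_DATA_DIR` are hypothetical names introduced for illustration, not an existing Open Navigator API.

```python
import os
from pathlib import Path
from typing import Optional

# Project-local fallback, matching the "data/cache" default used in Method 1.
DEFAULT_DATA_DIR = Path("data/cache")


def resolve_data_dir(cli_value: Optional[Path] = None, dataset: str = "acs") -> Path:
    """Pick the dataset directory: CLI argument, then env var, then default."""
    if cli_value is not None:
        # An explicit --data-dir is taken as the full dataset path.
        return Path(cli_value)
    env_value = os.environ.get("OPEN_NAV_DATA_DIR")
    base = Path(env_value) if env_value else DEFAULT_DATA_DIR
    return base / dataset
```

With this ordering, a one-off `--data-dir` run does not require editing `.env`, and scripts without the flag still follow the shared setting.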
## Example Configurations

### Small Project (Default)

- **Storage**: Local project directory
- **Size**: < 10 GB
- **Speed**: Fastest (same disk as code)

```python
# Uses data/cache/ in project directory
acs = ACSDataIngestion()
```

### Medium Project (D Drive)

- **Storage**: D drive or external SSD
- **Size**: 10-100 GB
- **Speed**: Fast

```python
from pathlib import Path

acs = ACSDataIngestion(data_dir=Path("D:/open-navigator-data/acs"))
```

### Large Project (Network Storage)

- **Storage**: NAS or cloud-mounted drive
- **Size**: 100+ GB
- **Speed**: Slower but shared

```python
from pathlib import Path

# Windows network path
data_dir = Path("//server/open-navigator/acs")

# Or mounted network drive (Linux)
# data_dir = Path("/mnt/nas/open-navigator/acs")

acs = ACSDataIngestion(data_dir=data_dir)
```
## Storage Requirements by Dataset

### ACS Census Data

| Download Type | Size | Time (50 Mbps) |
|---------------|------|----------------|
| Single table, single state | ~5 MB | < 1 sec |
| Single table, all states | ~50 MB | ~10 sec |
| All tables, all states (API) | ~500 MB | ~2 min |
| Bulk download (all data) | ~15 GB | ~40 min |

**Recommended**: Start with API downloads (targeted), only use bulk if needed.
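
The times in the table follow from size and bandwidth: megabytes times 8 bits per byte, divided by megabits per second. A small sketch for estimating other connection speeds is shown here. `estimate_download_seconds` is an illustrative helper, and the result is an ideal lower bound that ignores API rate limits and protocol overhead.

```python
def estimate_download_seconds(size_mb: float, bandwidth_mbps: float = 50.0) -> float:
    """Ideal transfer time in seconds for `size_mb` megabytes at `bandwidth_mbps`."""
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth_mbps must be positive")
    # 1 byte = 8 bits, so megabytes * 8 gives megabits.
    return size_mb * 8 / bandwidth_mbps


if __name__ == "__main__":
    # Bulk download: 15 GB is about 15,000 MB.
    seconds = estimate_download_seconds(15_000)
    print(f"Bulk download at 50 Mbps: about {seconds / 60:.0f} minutes")
```

For the 15 GB bulk download this gives 2,400 seconds, which matches the ~40 min figure in the table.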
### IRS Nonprofit Data

| Dataset | Size | Records |
|---------|------|---------|
| Business Master File (BMF) | ~500 MB | 1.8M nonprofits |
| Form 990 XML (1 year) | ~20 GB | ~300K filings |
| Form 990 PDFs (1 year) | ~50 GB | ~300K PDFs |
| Full 990 archive (10 years) | ~500 GB | 3M+ filings |

**Recommended**: D drive or external storage required.

### Legislative Data

| Dataset | Size | Records |
|---------|------|---------|
| OpenStates bills (1 state, 1 year) | ~100 MB | ~5K bills |
| OpenStates bills (all states, 1 year) | ~5 GB | ~250K bills |
| OpenStates bills (all states, 10 years) | ~50 GB | 2.5M bills |

**Recommended**: D drive for multi-year data.

### Meeting Minutes

| Dataset | Size | Records |
|---------|------|---------|
| Meeting agendas (text) | ~1 GB | 100K meetings |
| Meeting PDFs (cached) | ~500 GB | 1M documents |

**Recommended**: External storage for PDF archive.
## Verify Configuration

**Test your setup:**

```python
import shutil
from pathlib import Path

import pandas as pd  # to_parquet/read_parquet also need a Parquet engine such as pyarrow


def test_storage_config(data_dir: Path):
    """Test that we can write/read data."""
    # Create test directory
    test_dir = data_dir / "test"
    test_dir.mkdir(parents=True, exist_ok=True)

    # Write test file
    test_file = test_dir / "test.parquet"
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    df.to_parquet(test_file)

    # Read test file
    df2 = pd.read_parquet(test_file)

    # Verify
    assert df.equals(df2), "Data mismatch!"

    # Cleanup
    test_file.unlink()
    test_dir.rmdir()

    print(f"✓ Storage configuration verified: {data_dir.absolute()}")
    print(f"✓ Free space: {get_free_space(data_dir):.2f} GB")


def get_free_space(path: Path) -> float:
    """Get free disk space in GB."""
    stat = shutil.disk_usage(path)
    return stat.free / (1024**3)


# Test
test_storage_config(Path("D:/open-navigator-data"))
```
## Performance Tips

### 1. Use SSD for Cache

- **C drive (SSD)**: Small, frequently accessed files
- **D drive (HDD)**: Large, rarely accessed archives

### 2. Organize by Access Frequency

```
D:/open-navigator-data/
├── hot/ # Frequently accessed (keep on SSD if possible)
│   └── acs/ # Census data for current analysis
└── cold/ # Rarely accessed (OK on HDD)
    ├── irs/990s/ # Historical 990 PDFs
    └── archive/ # Old datasets
```
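
Keeping the hot and cold trees in shape can be scripted. The sketch below moves files that have not been modified for a given number of days from the hot tree to the same relative location in the cold tree. `move_stale_files` is a hypothetical helper, the 90-day default is an arbitrary assumption, and it expects `cold_dir` to sit outside `hot_dir`, as in the layout above.

```python
import shutil
import time
from pathlib import Path


def move_stale_files(hot_dir: Path, cold_dir: Path, max_age_days: int = 90) -> list:
    """Move files not modified within `max_age_days` from hot_dir to cold_dir."""
    cutoff = time.time() - max_age_days * 86_400
    moved = []
    # Materialize the listing first so moving files does not disturb iteration.
    for src in [p for p in hot_dir.rglob("*") if p.is_file()]:
        if src.stat().st_mtime < cutoff:
            # Preserve the relative layout (for example acs/file.parquet).
            dest = cold_dir / src.relative_to(hot_dir)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dest))
            moved.append(dest)
    return moved
```

Modification time is used rather than access time because many filesystems are mounted with access-time updates reduced or disabled.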
### 3. Use Compression for Archives

```python
# Compress old data
import gzip
import shutil

# Note: Parquet files are usually already compressed internally, so gzip tends
# to help most with text formats such as CSV, XML, or JSON.
with open("data.parquet", "rb") as f_in:
    with gzip.open("data.parquet.gz", "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
```
### 4. Clean Up Old Cache

```python
from pathlib import Path
from datetime import datetime, timedelta


def cleanup_old_cache(cache_dir: Path, days: int = 30):
    """Delete cache files older than N days."""
    cutoff = datetime.now() - timedelta(days=days)

    for file in cache_dir.rglob("*.parquet"):
        if datetime.fromtimestamp(file.stat().st_mtime) < cutoff:
            print(f"Deleting old cache: {file}")
            file.unlink()


cleanup_old_cache(Path("D:/open-navigator-data/cache"))
```
## Troubleshooting

### "Permission denied" on D drive

**Windows:**

```powershell
# Give yourself full control
icacls "D:\open-navigator-data" /grant "${env:USERNAME}:(OI)(CI)F" /T
```

**Linux:**

```bash
# Change ownership
sudo chown -R $USER:$USER /mnt/d/open-navigator-data
```
### "No space left on device"

**Check disk space:**

```bash
# Windows (PowerShell)
Get-PSDrive D | Select-Object Used,Free

# Linux/Mac
df -h /mnt/d
```

**Free up space:**

1. Delete old cache files
2. Compress archive data
3. Move rarely used data to cloud storage
### Slow read/write performance

**Causes**:

- USB 2.0 external drive (upgrade to USB 3.0+)
- Network drive over a slow connection
- HDD instead of SSD (SSDs can be 10-100x faster, especially for random reads and writes)

**Solutions**:

- Use SSD for frequently accessed data
- Enable write caching (Windows)
- Use local storage for processing, archive to external
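
To tell which of these causes applies, it helps to measure the drive before moving data. The sketch below times a sequential write into the target directory and reports MiB/s. `measure_write_mib_per_s` is a hypothetical helper, the 64 MiB default is an arbitrary assumption, and a single run gives only a rough figure because caching and other activity affect the result.

```python
import os
import tempfile
import time
from pathlib import Path


def measure_write_mib_per_s(target_dir: Path, size_mib: int = 64) -> float:
    """Write `size_mib` MiB into target_dir and return the observed MiB/s."""
    chunk = os.urandom(1024 * 1024)  # 1 MiB of incompressible data
    fd, tmp_name = tempfile.mkstemp(dir=str(target_dir), suffix=".bench")
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(size_mib):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # push data to the device, not just the OS cache
        elapsed = time.perf_counter() - start
    finally:
        os.remove(tmp_name)  # always clean up the benchmark file
    return size_mib / max(elapsed, 1e-9)
```

Running it once against the project directory and once against the external data directory gives a like-for-like comparison of the two volumes.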
### WSL can't access D drive

**Fix**:

```bash
# Verify drive is mounted
ls /mnt/d

# If not, mount it manually:
sudo mkdir -p /mnt/d
sudo mount -t drvfs D: /mnt/d
```
## Next Steps

1. **Create data directory** on your D drive or external storage
2. **Set environment variable** `OPEN_NAV_DATA_DIR` in `.env`
3. **Test configuration** using the verification script above
4. **Download ACS data** to test storage setup
5. **Monitor disk usage** as you add more datasets
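
For the last step, a per-dataset size report shows which folder is growing. This is a minimal sketch using only `pathlib`. `dir_size_gb` and `usage_report` are illustrative helpers, and walking very large trees (such as the PDF archive) may take a while.

```python
from pathlib import Path


def dir_size_gb(path: Path) -> float:
    """Total size of all regular files under `path`, in GB (1024**3 bytes)."""
    total_bytes = sum(p.stat().st_size for p in path.rglob("*") if p.is_file())
    return total_bytes / (1024 ** 3)


def usage_report(base_dir: Path) -> dict:
    """Map each top-level dataset folder under base_dir to its size in GB."""
    return {
        child.name: dir_size_gb(child)
        for child in sorted(base_dir.iterdir())
        if child.is_dir()
    }
```

Calling `usage_report` on the base data directory returns one entry per dataset folder (`acs`, `irs`, and so on), which can be printed or logged on a schedule.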
## Related Documentation

- [Census ACS Integration](../data-sources/census-acs.md) - Download demographic data
- [IRS 990 Downloads](../data-sources/nonprofit-sources.md) - Nonprofit data
- [Data Architecture](../development/data-architecture.md) - Overall data flow