
💰 COST-EFFECTIVE STORAGE STRATEGY (Personal Budget)

TL;DR: Use Hugging Face Datasets - it's FREE and unlimited for public data!


🎯 THE PROBLEM

Challenge:

  • Need to process 22,000+ jurisdictions
  • Each jurisdiction has: agendas, minutes, videos, social media
  • Estimated total: 10-50 TB of raw documents and social media, and ~285 TB once video is included (see STORAGE ESTIMATES)
  • Limited local storage + personal budget

Solution: Don't store everything locally!


✅ RECOMMENDED STRATEGY: HUGGING FACE DATASETS

Why Hugging Face?

  1. 🆓 FREE - Unlimited storage for public datasets
  2. 🌐 Cloud-based - No local storage needed
  3. 📊 Versioned - Git-based dataset management
  4. 🔍 Searchable - Built-in search and filtering
  5. 🤝 Shareable - Public datasets help research community
  6. ⚡ Fast - Optimized for large datasets

⚠️ CRITICAL: File Limits

Hugging Face has repository limits:

  • Files per folder: <10,000
  • Total files per repo: <100,000
  • Large datasets: Use Parquet or WebDataset format

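One file per document would mean roughly 22M files (22,000 jurisdictions × ~1,000 documents each), far beyond these limits.

Solution: Use Parquet format, which packs many documents as rows in a small number of files.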
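
To make the limit concrete, here is a rough sizing sketch. It assumes the 22M document count above and a shard size of 50,000 rows per Parquet file; both the shard size and the `plan_shards` helper are illustrative assumptions, not part of the project.

```python
import math

# Hugging Face repository guidance quoted above
MAX_FILES_PER_FOLDER = 10_000
MAX_FILES_PER_REPO = 100_000


def plan_shards(total_docs: int, rows_per_shard: int) -> int:
    """Return how many Parquet files are needed if each holds rows_per_shard documents."""
    if total_docs < 0 or rows_per_shard <= 0:
        raise ValueError("total_docs must be >= 0 and rows_per_shard must be > 0")
    return math.ceil(total_docs / rows_per_shard)


total_docs = 22_000_000  # one row per document instead of one file per document
shards = plan_shards(total_docs, rows_per_shard=50_000)  # assumed shard size

print(f"{shards} Parquet files")
print("fits in one folder:", shards < MAX_FILES_PER_FOLDER)
print("fits in one repo:", shards < MAX_FILES_PER_REPO)
```

Under these assumptions the whole corpus fits in a few hundred files, so the per-folder limit stops being a concern.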

What to Store

Store ONLY processed/filtered data, not raw content:

✅ Store:

  • Extracted text from PDFs
  • Meeting metadata (date, title, URL)
  • Oral health-related snippets
  • Social media links
  • Discovery results (JSON)

❌ Don't Store:

  • Full video files (link to YouTube instead)
  • Full PDF files (store text + source URL)
  • Website HTML dumps
  • Duplicate content
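
The two lists above amount to storing one small record per document: extracted text, metadata, and a link back to the source. A minimal sketch of such a record follows, with a content hash used to drop duplicates. The field names, the `DocumentRecord` class, and the normalization rule are assumptions made for illustration, not the project's schema.

```python
import hashlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DocumentRecord:
    """One stored row: extracted text plus a pointer back to the source."""
    jurisdiction: str
    state: str
    source_url: str    # link to the original PDF or video instead of a copy
    meeting_date: str  # ISO date string, e.g. "2024-03-12"
    text: str

    @property
    def content_hash(self) -> str:
        """SHA-256 of lowercased, whitespace-normalized text, used to spot duplicates."""
        normalized = " ".join(self.text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(records):
    """Keep the first record for each distinct content hash."""
    seen = set()
    unique = []
    for record in records:
        digest = record.content_hash
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


records = [
    DocumentRecord("Springfield", "IL", "https://example.org/a.pdf", "2024-03-12",
                   "Council discussed  water fluoridation."),
    DocumentRecord("Springfield", "IL", "https://example.org/a-copy.pdf", "2024-03-12",
                   "council discussed water fluoridation."),
]
unique = drop_duplicates(records)
print(len(unique))  # 1: both records have the same normalized text
print(asdict(unique[0])["source_url"])
```

Hashing normalized text only catches exact re-posts of the same document; near-duplicates (for example a revised agenda) would need a fuzzier comparison.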

📊 STORAGE ESTIMATES

Raw Content (DON'T download all):

Videos:        5,000 channels × 100 videos × 500 MB = 250 TB  ❌
PDFs:          15,000 jurisdictions × 1,000 docs × 2 MB = 30 TB  ❌
Social media:  18,000 accounts × archives = 5 TB  ❌
TOTAL RAW:     ~285 TB  🚫 TOO EXPENSIVE!

Processed Content (Hugging Face approach):

Discovery data:     22,000 jurisdictions × 50 KB = 1.1 GB  ✅
Meeting metadata:   500,000 meetings × 5 KB = 2.5 GB  ✅
Extracted text:     500,000 docs × 50 KB = 25 GB  ✅
Oral health subset: 50,000 relevant docs × 100 KB = 5 GB  ✅
TOTAL PROCESSED:    ~34 GB  ✅ TOTALLY FREE on Hugging Face!

Savings: 285 TB → 34 GB = 99.99% reduction!
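
The arithmetic behind these estimates can be re-run when the assumptions change. The sketch below simply restates the figures above in decimal units; the dictionary keys are illustrative names.

```python
# Decimal units, matching the round numbers used in the estimates
KB, MB, GB, TB = 1e3, 1e6, 1e9, 1e12

raw_bytes = {
    "videos": 5_000 * 100 * 500 * MB,
    "pdfs": 15_000 * 1_000 * 2 * MB,
    "social_media": 5 * TB,  # taken directly from the estimate, not derived
}
processed_bytes = {
    "discovery": 22_000 * 50 * KB,
    "meeting_metadata": 500_000 * 5 * KB,
    "extracted_text": 500_000 * 50 * KB,
    "oral_health_subset": 50_000 * 100 * KB,
}

raw_total = sum(raw_bytes.values())
processed_total = sum(processed_bytes.values())
reduction = 1 - processed_total / raw_total

print(f"raw: {raw_total / TB:.0f} TB")               # raw: 285 TB
print(f"processed: {processed_total / GB:.1f} GB")   # processed: 33.6 GB
print(f"reduction: {reduction:.2%}")                 # reduction: 99.99%
```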


🚀 STEP-BY-STEP: HUGGING FACE WORKFLOW

Step 1: Create Free Hugging Face Account

# Sign up at https://huggingface.co/join
# Create account (FREE)
# Get your access token from https://huggingface.co/settings/tokens

Step 2: Install Hugging Face Libraries

pip install huggingface_hub datasets

Step 3: Create Your Dataset

import os

import pandas as pd
from datasets import Dataset
from huggingface_hub import create_repo, login

# Login with a token read from the HF_TOKEN environment variable, so the
# token never ends up in source control
# (get one from https://huggingface.co/settings/tokens)
login(token=os.environ["HF_TOKEN"])

# Create dataset repository (exist_ok=True makes re-runs safe)
repo_name = "oral-health-policy-data"
repo_id = f"your-username/{repo_name}"
create_repo(
    repo_id=repo_id,
    repo_type="dataset",
    private=False,  # Public = FREE unlimited storage!
    exist_ok=True,
)

# Upload discovery results
df = pd.read_csv('data/bronze/discovered_sources/discovery_summary_final.csv')
dataset = Dataset.from_pandas(df)
dataset.push_to_hub(repo_id, split="discovery")

print("✅ Dataset uploaded to Hugging Face!")
print(f"View at: https://huggingface.co/datasets/{repo_id}")

Step 4: Process-and-Upload Pipeline

DON'T download everything locally first!

Instead, use this streaming approach:

import tempfile
from pathlib import Path

import httpx

# Keywords that mark a document as oral health-related
KEYWORDS = ['fluoride', 'dental', 'oral health', 'water treatment']

async def process_jurisdiction_streaming(jurisdiction):
    """
    Process jurisdiction WITHOUT storing locally:
    1. Download agenda PDF
    2. Extract text
    3. Delete local file
    4. Filter for oral health keywords
    5. Upload to Hugging Face

    extract_text_from_pdf(), extract_date() and upload_to_huggingface()
    are project helpers that are not defined in this snippet.
    """

    results = []

    # Reuse one HTTP client for every agenda URL of this jurisdiction
    async with httpx.AsyncClient(follow_redirects=True) as client:
        for agenda_url in jurisdiction['agenda_portals']:
            try:
                response = await client.get(agenda_url)
                response.raise_for_status()  # Don't parse an error page as a PDF
            except httpx.HTTPError:
                continue  # Skip unreachable documents, keep the rest

            # Download to temporary file
            with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp:
                tmp.write(response.content)
                tmp_path = tmp.name

            try:
                # Extract text (using pypdf, pdfplumber or similar)
                text = extract_text_from_pdf(tmp_path)
            finally:
                # Delete local file immediately, even if extraction fails
                Path(tmp_path).unlink()

            # Filter for oral health content
            if any(kw in text.lower() for kw in KEYWORDS):
                results.append({
                    'jurisdiction': jurisdiction['name'],
                    'state': jurisdiction['state'],
                    'url': agenda_url,
                    'text': text,
                    'date': extract_date(text),
                    'relevant': True
                })

    # Upload batch to Hugging Face
    if results:
        upload_to_huggingface(results)

    return len(results)
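
The plain substring check in the filter step also fires on words that merely contain a keyword ("dental" inside "accidental"). The sketch below is one possible stricter variant, using the same four keywords; `is_relevant` and `matched_keywords` are illustrative names rather than existing project functions.

```python
import re

KEYWORDS = ["fluoride", "dental", "oral health", "water treatment"]

# One case-insensitive pattern. The leading \b stops "dental" from matching
# inside "accidental" while still allowing suffixes such as "fluorides";
# \s+ lets multi-word keywords match across line breaks in extracted PDF text.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(r"\s+".join(re.escape(part) for part in kw.split()) for kw in KEYWORDS)
    + r")",
    re.IGNORECASE,
)


def matched_keywords(text: str) -> set[str]:
    """Return the distinct keywords found in text, lowercased with single spaces."""
    return {" ".join(m.group(0).lower().split()) for m in _PATTERN.finditer(text)}


def is_relevant(text: str) -> bool:
    """True if at least one oral health keyword starts at a word boundary."""
    return _PATTERN.search(text) is not None


print(is_relevant("An accidental spill was reported."))     # False
print(is_relevant("Item 7: Oral\nhealth program funding"))  # True
print(sorted(matched_keywords("Fluoride levels and dental clinics")))  # ['dental', 'fluoride']
```

Recording which keywords matched, rather than only a boolean, makes it easier to audit false positives later (for example "water treatment" items unrelated to fluoridation).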

💡 COST BREAKDOWN: FREE OPTIONS

Option 1: Hugging Face (RECOMMENDED)

| Item | Cost | Storage |
|------|------|---------|
| Public datasets | FREE | UNLIMITED |
| Private datasets | FREE | 100 GB |
| Bandwidth | FREE | Unlimited downloads |
| Processing | FREE | Use local computer |

Total: $0/month ✅

Option 2: GitHub + Hugging Face

| Item | Cost | Storage |
|------|------|---------|
| GitHub (discovery data) | FREE | 1 GB |
| Hugging Face (processed text) | FREE | Unlimited |
| GitHub LFS (large files) | $5/month | 50 GB |

Total: $0-5/month ✅

Option 3: Cloud Storage (if needed)

Only for temporary processing:

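| Provider | Free Tier | After Free Tier |
|----------|-----------|-----------------|
| AWS S3 | 5 GB for 12 months | $0.023/GB/month |
| Google Cloud | 5 GB always free | $0.020/GB/month |
| Azure Blob | 5 GB for 12 months | $0.018/GB/month |

Cost for 34 GB: ~$0.60-0.80/month, depending on provider ✅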
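
The monthly figure follows directly from the per-GB prices in the table. A small sketch of that calculation, assuming storage-only pricing and ignoring free tiers, request fees, and egress:

```python
# Post-free-tier list prices from the table above, in USD per GB per month
PRICE_PER_GB = {
    "AWS S3": 0.023,
    "Google Cloud": 0.020,
    "Azure Blob": 0.018,
}


def monthly_cost(size_gb: float, price_per_gb: float) -> float:
    """Storage-only cost; ignores free tiers, requests, and egress."""
    return size_gb * price_per_gb


for provider, price in PRICE_PER_GB.items():
    print(f"{provider}: ${monthly_cost(34, price):.2f}/month")
```

For 34 GB this comes to roughly $0.61 to $0.78 per month across the three providers.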


🎯 RECOMMENDED WORKFLOW

Phase 1: Discovery (Run Locally)

# Run discovery for all jurisdictions
python discovery/comprehensive_discovery_pipeline.py --all

# Output: ~1 GB of JSON/CSV (fits on laptop!)
# Upload to Hugging Face immediately

Phase 2: Content Processing (Stream & Upload)

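# For each jurisdiction (the helper functions are placeholders for your own code):
for jurisdiction in all_jurisdictions:
    # 1. Download one PDF
    pdf = download_pdf(jurisdiction.agenda_url)

    try:
        # 2. Extract text
        text = extract_text(pdf)

        # 3. Check if oral health-related
        if is_relevant(text):
            # 4. Upload to Hugging Face, keeping the source URL as metadata
            metadata = {"jurisdiction": jurisdiction.name, "url": jurisdiction.agenda_url}
            upload_to_hf(text, metadata)
    finally:
        # 5. Delete local file, even if a step above failed
        delete(pdf)

    # Local storage stays at ~100 MB (just temp files)!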

Your laptop never stores more than a few hundred MB!
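
One way to guarantee the "delete local file" step is to let a temporary directory own the download, so cleanup happens even when extraction fails. The following is a sketch under that assumption; `process_then_delete` and the stand-in extractor are hypothetical names, and a real pipeline would pass in its own PDF extractor and relevance check.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional


def process_then_delete(
    content: bytes,
    extract: Callable[[Path], str],
    is_relevant: Callable[[str], bool],
) -> Optional[str]:
    """Write content to a temp file, extract text, and return it only if relevant.

    The temporary directory and everything in it are removed when the
    with-block exits, including when extract() raises.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        pdf_path = Path(tmp_dir) / "agenda.pdf"
        pdf_path.write_bytes(content)
        text = extract(pdf_path)
    return text if is_relevant(text) else None


# Stand-ins for a real PDF extractor and relevance check
seen_paths = []


def fake_extract(path: Path) -> str:
    seen_paths.append(path)
    return path.read_bytes().decode("utf-8")


result = process_then_delete(
    b"Fluoride vote scheduled", fake_extract, lambda t: "fluoride" in t.lower()
)
print(result)                  # Fluoride vote scheduled
print(seen_paths[0].exists())  # False: the temp file is already gone
```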

Phase 3: Analysis (Cloud or Local)

# Download ONLY relevant subset from Hugging Face
from datasets import load_dataset

# Load just oral health documents (the "oral_health" split pushed by the upload script)
dataset = load_dataset("your-username/oral-health-policy-data", split="oral_health")

# This might be only 5 GB (totally manageable!)
print(f"Total documents: {len(dataset)}")

# Analyze locally or in Colab (FREE GPU!)

🆓 FREE RESOURCES YOU CAN USE

1. Hugging Face Datasets

  • Storage: Unlimited (public datasets)
  • Cost: FREE
  • Use: Primary storage for all processed data

2. Google Colab

  • Compute: FREE GPU/TPU (15 GB RAM)
  • Cost: FREE (or $10/month for Pro)
  • Use: Process PDFs, run analysis
  • Storage: 15 GB on Google Drive (FREE)

3. GitHub

  • Storage: 1 GB (50 GB more with LFS for $5/month)
  • Cost: FREE for public repos
  • Use: Code + discovery results

4. Internet Archive (archive.org)

  • Storage: Unlimited (for public documents)
  • Cost: FREE
  • Use: Mirror government documents

📦 SAMPLE: UPLOAD TO HUGGING FACE

Create Upload Script

#!/usr/bin/env python3
"""
upload_to_huggingface.py - Stream processed data to Hugging Face
"""

import os
from pathlib import Path

import pandas as pd
from datasets import Dataset
from huggingface_hub import login

# Configuration
# The token is read from the HF_TOKEN environment variable (see "Run Upload");
# create one at https://huggingface.co/settings/tokens
HF_TOKEN = os.environ["HF_TOKEN"]
HF_REPO = "your-username/oral-health-policy-data"

def upload_discovery_results():
    """Upload discovery results (CSV)"""

    login(token=HF_TOKEN)

    # Load discovery data
    discovery_dir = Path("data/bronze/discovered_sources")

    # Load all discovery CSVs; fail early if there is nothing to upload
    csv_files = sorted(discovery_dir.glob("*.csv"))
    if not csv_files:
        raise SystemExit(f"No CSV files found in {discovery_dir}")
    all_data = [pd.read_csv(csv_file) for csv_file in csv_files]

    # Combine and upload
    combined = pd.concat(all_data, ignore_index=True)
    dataset = Dataset.from_pandas(combined)

    dataset.push_to_hub(HF_REPO, split="discovery")

    print(f"✅ Uploaded {len(combined)} jurisdictions to Hugging Face")
    print(f"View at: https://huggingface.co/datasets/{HF_REPO}")

def upload_meeting_data(meetings_df):
    """Upload processed meeting data"""

    # Convert to dataset
    dataset = Dataset.from_pandas(meetings_df)

    # Upload
    dataset.push_to_hub(HF_REPO, split="meetings")

    print(f"✅ Uploaded {len(meetings_df)} meetings")

def upload_oral_health_subset(filtered_df):
    """Upload filtered oral health content"""

    dataset = Dataset.from_pandas(filtered_df)
    dataset.push_to_hub(HF_REPO, split="oral_health")

    print(f"✅ Uploaded {len(filtered_df)} oral health documents")

if __name__ == "__main__":
    upload_discovery_results()

Run Upload

# Set your token
export HF_TOKEN="hf_YOUR_TOKEN"

# Upload discovery results
python scripts/upload_to_huggingface.py

# View your dataset
# https://huggingface.co/datasets/your-username/oral-health-policy-data
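
The streaming pipeline in Step 4 calls an `upload_to_huggingface(results)` helper that is never defined. One caveat when writing it: `push_to_hub(..., split=...)` appears to write the whole split on every call, so calling it once per jurisdiction would likely replace earlier batches instead of appending to them. The sketch below takes a different route and uploads each batch as its own JSON Lines file with `HfApi.upload_file`. The folder layout, the function names, and the choice of JSON Lines are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def batch_path(state: str, jurisdiction: str) -> str:
    """Repo path for one jurisdiction's batch, grouped by state to keep folders small."""
    slug = "-".join(jurisdiction.lower().split())
    return f"data/oral_health/{state.upper()}/{slug}.jsonl"


def write_jsonl(records: list[dict], path: Path) -> int:
    """Write one JSON object per line and return the number of records written."""
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)


def upload_batch(records: list[dict], state: str, jurisdiction: str, repo_id: str) -> None:
    """Upload one batch as a new file, leaving earlier batches untouched."""
    # Imported here so the pure helpers above work without huggingface_hub installed
    from huggingface_hub import HfApi

    with tempfile.TemporaryDirectory() as tmp_dir:
        local = Path(tmp_dir) / "batch.jsonl"
        write_jsonl(records, local)
        HfApi().upload_file(
            path_or_fileobj=str(local),
            path_in_repo=batch_path(state, jurisdiction),
            repo_id=repo_id,
            repo_type="dataset",
        )


print(batch_path("ca", "Los Angeles"))  # data/oral_health/CA/los-angeles.jsonl
```

Each upload is a separate commit, so for thousands of jurisdictions it may be worth buffering several batches and uploading them together to stay within the Hub's rate limits.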

💰 TOTAL COST ESTIMATE

Personal Budget Approach (RECOMMENDED)

| Component | Cost | Notes |
|-----------|------|-------|
| Hugging Face | $0/month | Public datasets = FREE |
| Local computer | $0/month | Use your laptop |
| Internet | $0/month | Use existing connection |
| Google Colab | $0/month | FREE tier (or $10/month Pro) |
| GitHub | $0/month | Public repos FREE |
| TOTAL | $0/month | ✅ 100% FREE! |

Professional Approach (if scaling up)

| Component | Cost | Notes |
|-----------|------|-------|
| Hugging Face Pro | $9/month | Faster processing |
| Google Colab Pro | $10/month | More GPU time |
| AWS S3 (50 GB) | $1/month | Temporary storage |
| TOTAL | $20/month | Still very affordable |

🎓 REAL EXAMPLE: MeetingBank Dataset

MeetingBank (huuuyeah/meetingbank) is an existing dataset of city council meetings hosted on Hugging Face.

You can do the same for oral health policy!

# Load existing MeetingBank data (FREE)
from datasets import load_dataset

meetingbank = load_dataset("huuuyeah/meetingbank")
print(f"Meetings: {len(meetingbank['train'])}")

# Create YOUR oral health dataset (also FREE!)
# create_oral_health_dataset() is a placeholder for your own code that returns a datasets.Dataset
your_dataset = create_oral_health_dataset()
your_dataset.push_to_hub("your-username/oral-health-meetings")

✅ ACTION PLAN FOR YOU

Week 1: Setup (Cost: $0)

  1. ✅ Create Hugging Face account (FREE)
  2. ✅ Get API token
  3. ✅ Install libraries: pip install huggingface_hub datasets
  4. ✅ Create dataset repo: oral-health-policy-data

Week 2: Discovery (Cost: $0)

  1. Run discovery pipeline for all 22,000 jurisdictions
  2. Upload discovery results to Hugging Face (~1 GB)
  3. Free up local storage

Week 3-4: Content Processing (Cost: $0)

  1. Process jurisdictions one at a time (streaming)
  2. Extract text from PDFs
  3. Filter for oral health keywords
  4. Upload to Hugging Face
  5. Delete local files immediately

Local storage never exceeds 1 GB!

Ongoing: Analysis (Cost: $0)

  1. Download relevant subset from Hugging Face
  2. Analyze using Google Colab (FREE GPU)
  3. Publish findings back to Hugging Face

🔑 KEY PRINCIPLES

1. Process, Don't Store

  • Download → Process → Upload → Delete
  • Never keep raw files locally

2. Filter Early

  • Only save oral health-related content
  • Discard irrelevant documents immediately

3. Use Text, Not Files

  • Store extracted text (KB), not PDFs (MB)
  • Link to original sources instead of duplicating

4. Leverage Free Platforms

  • Hugging Face for datasets (FREE)
  • Google Colab for processing (FREE)
  • GitHub for code (FREE)

5. Make It Public

  • Public datasets = unlimited FREE storage
  • Helps other researchers
  • Builds your portfolio

📚 ADDITIONAL FREE RESOURCES

Processing Tools (FREE)

# PDF text extraction (pypdf is the maintained successor to PyPDF2)
pip install pypdf pdfplumber

# Document processing
pip install beautifulsoup4 lxml

# Data handling
pip install pandas pyarrow

# Upload to Hugging Face
pip install huggingface_hub datasets

Computing (FREE)

  1. Google Colab - FREE GPU/TPU

  2. Kaggle Notebooks - FREE GPU

  3. Hugging Face Spaces - FREE hosting


🎯 BOTTOM LINE

YOU CAN DO THIS FOR $0/MONTH!

✅ Storage: Hugging Face (FREE, unlimited)
✅ Processing: Local computer or Google Colab (FREE)
✅ Code: GitHub (FREE)
✅ Analysis: Google Colab (FREE GPU)

The entire 22,000-jurisdiction discovery and analysis can be done on a personal budget with ZERO cloud storage costs!


📞 NEXT STEPS

  1. Create Hugging Face account: https://huggingface.co/join
  2. Create your dataset repo: oral-health-policy-data
  3. Run discovery pipeline (outputs ~1 GB locally)
  4. Upload to Hugging Face (FREE unlimited storage)
  5. Process content streaming (never store >100 MB locally)

Questions? Check Hugging Face docs: https://huggingface.co/docs/datasets/