| # 💰 COST-EFFECTIVE STORAGE STRATEGY (Personal Budget) | |
| **TL;DR: Use Hugging Face Datasets. Public datasets are hosted free of charge on a best-effort basis.** | |
| --- | |
| ## 🎯 THE PROBLEM | |
| **Challenge:** | |
| - Need to process 22,000+ jurisdictions | |
| - Each jurisdiction has: agendas, minutes, videos, social media | |
| - Estimated total: **10-50 TB** of raw documents, and roughly 285 TB once video is included (see the storage estimates) | |
| - Limited local storage + personal budget | |
| **Solution: Don't store everything locally!** | |
| --- | |
| ## ✅ RECOMMENDED STRATEGY: HUGGING FACE DATASETS | |
| ### Why Hugging Face? | |
| 1. **FREE** - No storage charge for public datasets (hosted on a best-effort basis) | |
| 2. **Cloud-based** - No local storage needed | |
| 3. **Versioned** - Git-based dataset management | |
| 4. **Searchable** - Built-in search and filtering | |
| 5. **Shareable** - Public datasets help the research community | |
| 6. **Fast** - Optimized for large datasets | |
| ### ⚠️ CRITICAL: File Limits | |
| **Hugging Face has repository limits:** | |
| - Files per folder: <10,000 | |
| - Total files per repo: <100,000 | |
| - Large datasets: Use Parquet or WebDataset format | |
| **Your scale (22M files) exceeds limits!** | |
| **Solution: Use Parquet format** | |
| - 22 million PDFs → 50 Parquet files ✅ | |
| - See detailed guide: [HUGGINGFACE_FILE_LIMITS.md](HUGGINGFACE_FILE_LIMITS.md) | |
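As a rough illustration of why packing records into Parquet shards sidesteps the file-count limits, the sketch below estimates how many shard files a corpus needs. The function names, the 500 MB target shard size, and the `train-00000-of-00050.parquet` naming pattern are assumptions made for this example; the two limit constants are the figures quoted above and should be checked against the current Hugging Face documentation.

```python
import math

# Limits quoted in this guide; verify against the current Hugging Face docs.
MAX_FILES_PER_FOLDER = 10_000
MAX_FILES_PER_REPO = 100_000


def plan_shards(n_records: int, avg_record_bytes: int, target_shard_bytes: int) -> dict:
    """Estimate how many shard files are needed to hold n_records."""
    if n_records < 0 or avg_record_bytes <= 0 or target_shard_bytes <= 0:
        raise ValueError("sizes must be positive and n_records non-negative")
    total_bytes = n_records * avg_record_bytes
    n_shards = max(1, math.ceil(total_bytes / target_shard_bytes))
    return {
        "n_shards": n_shards,
        "records_per_shard": math.ceil(n_records / n_shards),
        "fits_in_one_folder": n_shards < MAX_FILES_PER_FOLDER,
        "fits_in_repo": n_shards < MAX_FILES_PER_REPO,
    }


def shard_name(index: int, n_shards: int, split: str = "train") -> str:
    """Build a file name such as 'train-00003-of-00050.parquet' (hypothetical pattern)."""
    return f"{split}-{index:05d}-of-{n_shards:05d}.parquet"


if __name__ == "__main__":
    # Assumed inputs: 500,000 extracted documents at ~50 KB each, 500 MB shards.
    plan = plan_shards(500_000, 50_000, 500_000_000)
    print(plan)
    print(shard_name(0, plan["n_shards"]))
```

With those assumed inputs the plan comes out at 50 shards of 10,000 records each, far below either limit. Smaller shards make partial downloads cheaper; larger shards keep the file count down.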
| ### What to Store | |
| **Store ONLY processed/filtered data, not raw content:** | |
| ✅ **Store:** | |
| - Extracted text from PDFs | |
| - Meeting metadata (date, title, URL) | |
| - Oral health-related snippets | |
| - Social media links | |
| - Discovery results (JSON) | |
| ❌ **Don't Store:** | |
| - Full video files (link to YouTube instead) | |
| - Full PDF files (store text + source URL) | |
| - Website HTML dumps | |
| - Duplicate content | |
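The store/don't-store split above could translate into a record layout like the following minimal sketch. It keeps extracted text plus a link to the source and uses a content hash to drop duplicates. The `MeetingRecord` class, its `content_hash` method, and the `deduplicate` helper are hypothetical names introduced here; the field names mirror the dictionary used in the streaming pipeline later in this guide.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MeetingRecord:
    jurisdiction: str
    state: str
    url: str   # link back to the source PDF instead of storing the file
    date: str  # ISO date string, e.g. "2024-03-12"
    text: str  # extracted text only

    def content_hash(self) -> str:
        """SHA-256 of lowercased, whitespace-normalized text, used to spot duplicates."""
        normalized = " ".join(self.text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record seen for each distinct content hash."""
    seen = set()
    unique = []
    for record in records:
        digest = record.content_hash()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


if __name__ == "__main__":
    # Made-up sample rows for illustration only.
    rows = [
        MeetingRecord("Springfield", "IL", "https://example.org/a.pdf", "2024-03-12", "Fluoride  vote"),
        MeetingRecord("Springfield", "IL", "https://example.org/b.pdf", "2024-03-12", "fluoride vote"),
    ]
    for row in deduplicate(rows):
        print(json.dumps(asdict(row)))
```

Hashing normalized text only catches exact repeats such as the same agenda posted at two URLs. Near-duplicates, for example a revised agenda, would need a fuzzier comparison.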
| --- | |
| ## STORAGE ESTIMATES | |
| ### Raw Content (DON'T download all): | |
| ``` | |
| Videos: 5,000 channels × 100 videos × 500 MB = 250 TB ❌ | |
| PDFs: 15,000 jurisdictions × 1,000 docs × 2 MB = 30 TB ❌ | |
| Social media: 18,000 accounts × archives = 5 TB ❌ | |
| TOTAL RAW: ~285 TB 🚫 TOO EXPENSIVE! | |
| ``` | |
| ### Processed Content (Hugging Face approach): | |
| ``` | |
| Discovery data: 22,000 jurisdictions × 50 KB = 1.1 GB ✅ | |
| Meeting metadata: 500,000 meetings × 5 KB = 2.5 GB ✅ | |
| Extracted text: 500,000 docs × 50 KB = 25 GB ✅ | |
| Oral health subset: 50,000 relevant docs × 100 KB = 5 GB ✅ | |
| TOTAL PROCESSED: ~34 GB ✅ TOTALLY FREE on Hugging Face! | |
| ``` | |
| **Savings: 285 TB → 34 GB = 99.99% reduction!** | |
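The estimates above can be reproduced with a few lines of arithmetic. This sketch simply re-derives the totals from the per-item figures in the two tables, using decimal units (1 GB = 10^9 bytes); the dictionary keys are labels chosen for this example.

```python
KB, MB, GB, TB = 10**3, 10**6, 10**9, 10**12

# Per-item figures copied from the raw-content estimate.
raw_bytes = {
    "videos": 5_000 * 100 * 500 * MB,
    "pdfs": 15_000 * 1_000 * 2 * MB,
    "social_media": 5 * TB,
}

# Per-item figures copied from the processed-content estimate.
processed_bytes = {
    "discovery": 22_000 * 50 * KB,
    "meeting_metadata": 500_000 * 5 * KB,
    "extracted_text": 500_000 * 50 * KB,
    "oral_health_subset": 50_000 * 100 * KB,
}

raw_total = sum(raw_bytes.values())
processed_total = sum(processed_bytes.values())
reduction = 1 - processed_total / raw_total

print(f"Raw total:       {raw_total / TB:.0f} TB")
print(f"Processed total: {processed_total / GB:.1f} GB")
print(f"Reduction:       {reduction:.2%}")
```

The processed total works out to 33.6 GB, which the table rounds to ~34 GB.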
| --- | |
| ## STEP-BY-STEP: HUGGING FACE WORKFLOW | |
| ### Step 1: Create Free Hugging Face Account | |
| ```bash | |
| # Sign up at https://huggingface.co/join | |
| # Create account (FREE) | |
| # Get your access token from https://huggingface.co/settings/tokens | |
| ``` | |
| ### Step 2: Install Hugging Face Libraries | |
| ```bash | |
| pip install huggingface_hub datasets | |
| ``` | |
| ### Step 3: Create Your Dataset | |
| ```python | |
| import os | |
| import pandas as pd | |
| from datasets import Dataset | |
| from huggingface_hub import create_repo, login | |
| # Log in with a token from https://huggingface.co/settings/tokens | |
| # (read from the HF_TOKEN environment variable rather than hardcoding it) | |
| login(token=os.environ["HF_TOKEN"]) | |
| # Create the dataset repository (exist_ok avoids an error on re-runs) | |
| repo_id = "your-username/oral-health-policy-data" | |
| create_repo( | |
|     repo_id=repo_id, | |
|     repo_type="dataset", | |
|     private=False,  # Public datasets are hosted free of charge | |
|     exist_ok=True, | |
| ) | |
| # Upload discovery results | |
| df = pd.read_csv("data/bronze/discovered_sources/discovery_summary_final.csv") | |
| dataset = Dataset.from_pandas(df) | |
| dataset.push_to_hub(repo_id, split="discovery") | |
| print("✅ Dataset uploaded to Hugging Face!") | |
| print(f"View at: https://huggingface.co/datasets/{repo_id}") | |
| ``` | |
| ### Step 4: Process-and-Upload Pipeline | |
| **DON'T download everything locally first!** | |
| Instead, use this streaming approach: | |
| ```python | |
| import httpx | |
| import tempfile | |
| from pathlib import Path | |
| KEYWORDS = ['fluoride', 'dental', 'oral health', 'water treatment'] | |
| async def process_jurisdiction_streaming(jurisdiction): | |
|     """ | |
|     Process jurisdiction WITHOUT storing locally: | |
|     1. Download agenda PDF | |
|     2. Extract text | |
|     3. Filter for oral health keywords | |
|     4. Upload to Hugging Face | |
|     5. Delete local file | |
|     extract_text_from_pdf, extract_date and upload_to_huggingface are | |
|     project helpers that are not shown here. | |
|     """ | |
|     results = [] | |
|     # Reuse one HTTP client for every download in this jurisdiction | |
|     async with httpx.AsyncClient() as client: | |
|         for agenda_url in jurisdiction['agenda_portals']: | |
|             response = await client.get(agenda_url) | |
|             response.raise_for_status() | |
|             # Write to a temporary file; leaving the block flushes and closes it | |
|             with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp: | |
|                 tmp.write(response.content) | |
|                 tmp_path = tmp.name | |
|             try: | |
|                 # Extract text (using pypdf, pdfplumber or similar) | |
|                 text = extract_text_from_pdf(tmp_path) | |
|                 # Filter for oral health content | |
|                 lowered = text.lower() | |
|                 if any(kw in lowered for kw in KEYWORDS): | |
|                     results.append({ | |
|                         'jurisdiction': jurisdiction['name'], | |
|                         'state': jurisdiction['state'], | |
|                         'url': agenda_url, | |
|                         'text': text, | |
|                         'date': extract_date(text), | |
|                         'relevant': True | |
|                     }) | |
|             finally: | |
|                 # Delete local file immediately, even if extraction fails | |
|                 Path(tmp_path).unlink() | |
|     # Upload batch to Hugging Face | |
|     if results: | |
|         upload_to_huggingface(results) | |
|     return len(results) | |
| ``` | |
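The keyword check in the pipeline above is a plain substring test, so "dental" also matches "accidental". A stricter variant, sketched below with the same four keywords, matches whole words or phrases only. `matched_keywords` and `is_relevant` are illustrative helper names, not part of any library.

```python
import re

KEYWORDS = ["fluoride", "dental", "oral health", "water treatment"]


def _keyword_pattern(keyword: str) -> str:
    """Escape each word and allow any run of whitespace between words."""
    return r"\s+".join(re.escape(part) for part in keyword.split())


# One compiled, case-insensitive, whole-word pattern per keyword.
_PATTERNS = {
    kw: re.compile(r"\b" + _keyword_pattern(kw) + r"\b", re.IGNORECASE)
    for kw in KEYWORDS
}


def matched_keywords(text: str) -> list:
    """Return the keywords found in text, in the order listed in KEYWORDS."""
    return [kw for kw, pattern in _PATTERNS.items() if pattern.search(text)]


def is_relevant(text: str) -> bool:
    """True if at least one keyword appears as a whole word or phrase."""
    return bool(matched_keywords(text))


if __name__ == "__main__":
    print(matched_keywords("Council vote on fluoride and oral\nhealth funding"))
    print(is_relevant("An accidental omission in the minutes"))
```

The trade-off is that whole-word matching misses inflected forms such as "fluoridation". Adding those forms to the keyword list is one way to cover them.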
| --- | |
| ## 💡 COST BREAKDOWN: FREE OPTIONS | |
| ### Option 1: Hugging Face (RECOMMENDED) | |
| | Item | Cost | Storage | | |
| |------|------|---------| | |
| | **Public datasets** | **FREE** | **Best-effort (generous, but not guaranteed unlimited)** | | |
| | Private datasets | FREE | 100 GB | | |
| | Bandwidth | FREE | Unlimited downloads | | |
| | Processing | FREE | Use local computer | | |
| **Total: $0/month** ✅ | |
| ### Option 2: GitHub + Hugging Face | |
| | Item | Cost | Storage | | |
| |------|------|---------| | |
| | GitHub (discovery data) | FREE | 1 GB | | |
| | Hugging Face (processed text) | FREE | Best-effort (public datasets) | | |
| | GitHub LFS (large files) | $5/month | 50 GB | | |
| **Total: $0-5/month** ✅ | |
| ### Option 3: Cloud Storage (if needed) | |
| **Only for temporary processing:** | |
| | Provider | Free Tier | After Free Tier | | |
| |----------|-----------|-----------------| | |
| | **AWS S3** | 5 GB for 12 months | $0.023/GB/month | | |
| | **Google Cloud** | 5 GB always free | $0.020/GB/month | | |
| | **Azure Blob** | 5 GB for 12 months | $0.018/GB/month | | |
| **Cost for 34 GB:** roughly $0.60-$0.80/month at the post-free-tier rates above ✅ | |
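The monthly figure follows directly from the per-GB prices in the table. The sketch below works it out for each provider, with and without a 5 GB free allowance. The prices are the ones listed above and may be out of date; `monthly_cost` is a helper name chosen for this example.

```python
# Per-GB monthly prices copied from the table above (may have changed since).
PRICE_PER_GB_MONTH = {
    "AWS S3": 0.023,
    "Google Cloud": 0.020,
    "Azure Blob": 0.018,
}
FREE_TIER_GB = 5


def monthly_cost(stored_gb: float, price_per_gb: float, free_gb: float = 0.0) -> float:
    """Monthly storage cost in dollars, rounded to cents, after any free allowance."""
    billable_gb = max(0.0, stored_gb - free_gb)
    return round(billable_gb * price_per_gb, 2)


if __name__ == "__main__":
    for provider, price in PRICE_PER_GB_MONTH.items():
        full = monthly_cost(34, price)
        with_free = monthly_cost(34, price, FREE_TIER_GB)
        print(f"{provider}: ${full:.2f}/month (${with_free:.2f} with {FREE_TIER_GB} GB free)")
```

This covers storage only. Request and egress charges, which vary by provider, are not included.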
| --- | |
| ## 🎯 RECOMMENDED WORKFLOW | |
| ### Phase 1: Discovery (Run Locally) | |
| ```bash | |
| # Run discovery for all jurisdictions | |
| python discovery/comprehensive_discovery_pipeline.py --all | |
| # Output: ~1 GB of JSON/CSV (fits on laptop!) | |
| # Upload to Hugging Face immediately | |
| ``` | |
| ### Phase 2: Content Processing (Stream & Upload) | |
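| ```python | |
| # Sketch only: download_pdf, extract_text, is_relevant, upload_to_hf and | |
| # delete stand in for the project's own helpers. | |
| for jurisdiction in all_jurisdictions: | |
|     # 1. Download one PDF | |
|     pdf = download_pdf(jurisdiction.agenda_url) | |
|     try: | |
|         # 2. Extract text | |
|         text = extract_text(pdf) | |
|         # 3. Check if oral health-related | |
|         if is_relevant(text): | |
|             # 4. Upload to Hugging Face (metadata fields here are assumed) | |
|             metadata = {"jurisdiction": jurisdiction.name, "url": jurisdiction.agenda_url} | |
|             upload_to_hf(text, metadata) | |
|     finally: | |
|         # 5. Delete local file, whether or not it was relevant | |
|         delete(pdf) | |
| # Local storage stays at ~100 MB (just temp files)! | |
| ``` | |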
| **Your laptop never stores more than a few hundred MB!** | |
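Every upload to the Hub is recorded as a commit, so pushing one document at a time adds up to a long history and many network round trips. One way to keep local storage small while uploading less often is to buffer records and flush them in batches, as in this minimal sketch. `UploadBuffer` is a hypothetical helper, and the `upload` callable stands in for whatever function actually sends a batch to Hugging Face.

```python
class UploadBuffer:
    """Collect records and hand them to `upload` in fixed-size batches."""

    def __init__(self, upload, batch_size=500):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._upload = upload          # callable that receives a list of records
        self._batch_size = batch_size
        self._pending = []
        self.uploaded = 0              # running count of records sent

    def add(self, record):
        """Queue one record and flush automatically when the batch is full."""
        self._pending.append(record)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        """Send whatever is queued; records stay queued if the upload raises."""
        if not self._pending:
            return
        self._upload(list(self._pending))
        self.uploaded += len(self._pending)
        self._pending.clear()


if __name__ == "__main__":
    sent_batches = []
    buffer = UploadBuffer(sent_batches.append, batch_size=2)
    for i in range(5):
        buffer.add({"doc_id": i})
    buffer.flush()  # send the final partial batch
    print([len(batch) for batch in sent_batches])
```

Call `flush()` once more after the loop ends, otherwise the last partial batch is never sent. The batch size of 500 is an arbitrary default; pick one that keeps the in-memory buffer comfortably small.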
| ### Phase 3: Analysis (Cloud or Local) | |
| ```python | |
| # Download ONLY relevant subset from Hugging Face | |
| from datasets import load_dataset | |
| # Load just oral health documents (the split name must match the one used at upload time) | |
| dataset = load_dataset("your-username/oral-health-policy-data", split="oral_health") | |
| # This might be only 5 GB (totally manageable!) | |
| print(f"Total documents: {len(dataset)}") | |
| # Analyze locally or in Colab (FREE GPU!) | |
| ``` | |
| --- | |
| ## FREE RESOURCES YOU CAN USE | |
| ### 1. Hugging Face Datasets | |
| - **Storage:** Free, best-effort hosting (public datasets) | |
| - **Cost:** FREE | |
| - **Use:** Primary storage for all processed data | |
| ### 2. Google Colab | |
| - **Compute:** FREE GPU/TPU (15 GB RAM) | |
| - **Cost:** FREE (or $10/month for Pro) | |
| - **Use:** Process PDFs, run analysis | |
| - **Storage:** 15 GB on Google Drive (FREE) | |
| ### 3. GitHub | |
| - **Storage:** 1 GB (50 GB with LFS for $5/month) | |
| - **Cost:** FREE for public repos | |
| - **Use:** Code + discovery results | |
| ### 4. Internet Archive (archive.org) | |
| - **Storage:** Unlimited (for public documents) | |
| - **Cost:** FREE | |
| - **Use:** Mirror government documents | |
| --- | |
| ## 📦 SAMPLE: UPLOAD TO HUGGING FACE | |
| ### Create Upload Script | |
| ```python | |
| #!/usr/bin/env python3 | |
| """ | |
| upload_to_huggingface.py - Stream processed data to Hugging Face | |
| """ | |
| import os | |
| from pathlib import Path | |
| import pandas as pd | |
| from datasets import Dataset | |
| from huggingface_hub import login | |
| # Configuration | |
| HF_TOKEN = os.environ["HF_TOKEN"]  # Token from https://huggingface.co/settings/tokens | |
| HF_REPO = "your-username/oral-health-policy-data" | |
| def upload_discovery_results(): | |
|     """Upload discovery results (CSV)""" | |
|     login(token=HF_TOKEN) | |
|     # Load all discovery CSVs | |
|     discovery_dir = Path("data/bronze/discovered_sources") | |
|     all_data = [pd.read_csv(csv_file) for csv_file in sorted(discovery_dir.glob("*.csv"))] | |
|     if not all_data: | |
|         raise FileNotFoundError(f"No CSV files found in {discovery_dir}") | |
|     # Combine and upload | |
|     combined = pd.concat(all_data, ignore_index=True) | |
|     dataset = Dataset.from_pandas(combined) | |
|     dataset.push_to_hub(HF_REPO, split="discovery") | |
|     print(f"✅ Uploaded {len(combined)} jurisdictions to Hugging Face") | |
|     print(f"View at: https://huggingface.co/datasets/{HF_REPO}") | |
| def upload_meeting_data(meetings_df): | |
|     """Upload processed meeting data""" | |
|     dataset = Dataset.from_pandas(meetings_df) | |
|     dataset.push_to_hub(HF_REPO, split="meetings") | |
|     print(f"✅ Uploaded {len(meetings_df)} meetings") | |
| def upload_oral_health_subset(filtered_df): | |
|     """Upload filtered oral health content""" | |
|     dataset = Dataset.from_pandas(filtered_df) | |
|     dataset.push_to_hub(HF_REPO, split="oral_health") | |
|     print(f"✅ Uploaded {len(filtered_df)} oral health documents") | |
| if __name__ == "__main__": | |
|     upload_discovery_results() | |
| ``` | |
| ### Run Upload | |
| ```bash | |
| # Set your token | |
| export HF_TOKEN="hf_YOUR_TOKEN" | |
| # Upload discovery results | |
| python scripts/upload_to_huggingface.py | |
| # View your dataset | |
| # https://huggingface.co/datasets/your-username/oral-health-policy-data | |
| ``` | |
| --- | |
| ## 💰 TOTAL COST ESTIMATE | |
| ### Personal Budget Approach (RECOMMENDED) | |
| | Component | Cost | Notes | | |
| |-----------|------|-------| | |
| | **Hugging Face** | **$0/month** | Public datasets = FREE | | |
| | **Local computer** | $0/month | Use your laptop | | |
| | **Internet** | $0/month | Use existing connection | | |
| | **Google Colab** | $0/month | FREE tier (or $10/month Pro) | | |
| | **GitHub** | $0/month | Public repos FREE | | |
| | **TOTAL** | **$0/month** | ✅ **100% FREE!** | | |
| ### Professional Approach (if scaling up) | |
| | Component | Cost | Notes | | |
| |-----------|------|-------| | |
| | Hugging Face Pro | $9/month | Faster processing | | |
| | Google Colab Pro | $10/month | More GPU time | | |
| | AWS S3 (50 GB) | $1/month | Temporary storage | | |
| | **TOTAL** | **$20/month** | Still very affordable | | |
| --- | |
| ## REAL EXAMPLE: MeetingBank Dataset | |
| **Existing dataset on Hugging Face:** | |
| - Name: `huuuyeah/meetingbank` | |
| - Size: 1,366 meetings, 121 MB | |
| - Cost: FREE | |
| - Link: https://huggingface.co/datasets/huuuyeah/meetingbank | |
| **You can do the same for oral health policy!** | |
| ```python | |
| # Load existing MeetingBank data (FREE) | |
| from datasets import load_dataset | |
| meetingbank = load_dataset("huuuyeah/meetingbank") | |
| print(f"Meetings: {len(meetingbank['train'])}") | |
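| # Create YOUR oral health dataset (also FREE!) | |
| # create_oral_health_dataset() is a placeholder for your own code that returns a datasets.Dataset | |
| your_dataset = create_oral_health_dataset() | |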
| your_dataset.push_to_hub("your-username/oral-health-meetings") | |
| ``` | |
| --- | |
| ## ✅ ACTION PLAN FOR YOU | |
| ### Week 1: Setup (Cost: $0) | |
| 1. ✅ Create Hugging Face account (FREE) | |
| 2. ✅ Get API token | |
| 3. ✅ Install libraries: `pip install huggingface_hub datasets` | |
| 4. ✅ Create dataset repo: `oral-health-policy-data` | |
| ### Week 2: Discovery (Cost: $0) | |
| 1. Run discovery pipeline for all 22,000 jurisdictions | |
| 2. Upload discovery results to Hugging Face (~1 GB) | |
| 3. Free up local storage | |
| ### Week 3-4: Content Processing (Cost: $0) | |
| 1. Process jurisdictions one at a time (streaming) | |
| 2. Extract text from PDFs | |
| 3. Filter for oral health keywords | |
| 4. Upload to Hugging Face | |
| 5. Delete local files immediately | |
| **Local storage never exceeds 1 GB!** | |
| ### Ongoing: Analysis (Cost: $0) | |
| 1. Download relevant subset from Hugging Face | |
| 2. Analyze using Google Colab (FREE GPU) | |
| 3. Publish findings back to Hugging Face | |
| --- | |
| ## KEY PRINCIPLES | |
| **1. Process, Don't Store** | |
| - Download → Process → Upload → Delete | |
| - Never keep raw files locally | |
| **2. Filter Early** | |
| - Only save oral health-related content | |
| - Discard irrelevant documents immediately | |
| **3. Use Text, Not Files** | |
| - Store extracted text (KB), not PDFs (MB) | |
| - Link to original sources instead of duplicating | |
| **4. Leverage Free Platforms** | |
| - Hugging Face for datasets (FREE) | |
| - Google Colab for processing (FREE) | |
| - GitHub for code (FREE) | |
| **5. Make It Public** | |
| - Public datasets = FREE hosting with no storage bill | |
| - Helps other researchers | |
| - Builds your portfolio | |
| --- | |
| ## ADDITIONAL FREE RESOURCES | |
| ### Processing Tools (FREE) | |
| ```bash | |
| # PDF text extraction | |
| pip install pypdf pdfplumber  # pypdf is the maintained successor to PyPDF2 | |
| # Document processing | |
| pip install beautifulsoup4 lxml | |
| # Data handling | |
| pip install pandas pyarrow | |
| # Upload to Hugging Face | |
| pip install huggingface_hub datasets | |
| ``` | |
| ### Computing (FREE) | |
| 1. **Google Colab** - FREE GPU/TPU | |
| - https://colab.research.google.com/ | |
| - 15 GB RAM, 100 GB disk (temporary) | |
| 2. **Kaggle Notebooks** - FREE GPU | |
| - https://www.kaggle.com/code | |
| - 20 GB RAM, 73 GB disk (temporary) | |
| 3. **Hugging Face Spaces** - FREE hosting | |
| - https://huggingface.co/spaces | |
| - Run demos and apps | |
| --- | |
| ## 🎯 BOTTOM LINE | |
| **YOU CAN DO THIS FOR $0/MONTH!** | |
| ✅ **Storage:** Hugging Face (FREE for public datasets) | |
| ✅ **Processing:** Local computer or Google Colab (FREE) | |
| ✅ **Code:** GitHub (FREE) | |
| ✅ **Analysis:** Google Colab (FREE GPU) | |
| **The entire 22,000-jurisdiction discovery and analysis can be done on a personal budget with ZERO cloud storage costs!** | |
| --- | |
| ## NEXT STEPS | |
| 1. **Create Hugging Face account:** https://huggingface.co/join | |
| 2. **Create your dataset repo:** `oral-health-policy-data` | |
| 3. **Run discovery pipeline** (outputs ~1 GB locally) | |
| 4. **Upload to Hugging Face** (FREE hosting for public datasets) | |
| 5. **Process content streaming** (never store >100 MB locally) | |
| **Questions?** Check Hugging Face docs: https://huggingface.co/docs/datasets/ | |