| # QUICK START: FREE STORAGE WITH HUGGING FACE | |
| **TL;DR: Store large public datasets for FREE on Hugging Face (public storage is offered on a best-effort basis)!** | |
| **⚠️ IMPORTANT: Use Parquet format, NOT individual PDFs! See the [file limits guide](HUGGINGFACE_FILE_LIMITS.md).** | |
| --- | |
| ## ⚡ 3-MINUTE SETUP | |
| ### 1. Create Hugging Face Account (1 minute) | |
| ```bash | |
| # Go to https://huggingface.co/join | |
| # Sign up (FREE) | |
| # Verify email | |
| ``` | |
| ### 2. Get API Token (1 minute) | |
| ```bash | |
| # Go to https://huggingface.co/settings/tokens | |
| # Click "New token" | |
| # Name it "oral-health-upload" | |
| # Token Type: Write (required for publishing datasets) | |
| # Repository permissions: All repositories | |
| # Copy the token (hf_xxxxxxxxxxxx) | |
| ``` | |
| **⚠️ Important: Token Permissions** | |
| - **Write** access required for publishing datasets | |
| - **Read** access sufficient for downloading public datasets only | |
| - For this project: Use **Write** token to publish your scraped data | |
| ### 3. Install & Login (1 minute) | |
| ```bash | |
| pip install huggingface_hub datasets | |
| # Set your token | |
| export HF_TOKEN="hf_YOUR_TOKEN_HERE" | |
| ``` | |
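Before running any upload, it can help to confirm that the token is visible to Python. The following is a minimal sketch, assuming the token was exported as `HF_TOKEN` as shown above. `read_hf_token` is a hypothetical helper written for this guide (it is not part of `huggingface_hub`), and the `hf_` prefix check only mirrors the token format shown in step 2.

```python
import os


def read_hf_token(env=None):
    """Return the token stored in HF_TOKEN, or raise a readable error.

    Hypothetical helper for this guide: it only inspects the environment
    and never contacts Hugging Face.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; run: export HF_TOKEN=...")
    if not token.startswith("hf_"):
        # Assumption: user access tokens use the hf_ prefix shown in step 2.
        raise RuntimeError("HF_TOKEN does not start with the expected hf_ prefix")
    return token


if __name__ == "__main__":
    try:
        read_hf_token()
        print("HF_TOKEN found")
    except RuntimeError as err:
        print(f"Token problem: {err}")
```

This check only catches a missing or mistyped variable; whether the token has Write access is still decided on the Hugging Face side.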
| --- | |
| ## ⚠️ CRITICAL: FILE LIMITS | |
| **Hugging Face Limits:** | |
| - Files per folder: <10,000 | |
| - Total files per repo: <100,000 | |
| - For large datasets: Use Parquet or WebDataset format | |
| **Your Scale:** | |
| - 22,000 jurisdictions × 1,000 docs = 22 MILLION files ❌ | |
| **Solution:** | |
| - Extract text from PDFs | |
| - Store in Parquet format | |
| - Result: 50 files instead of 22 million ✅ | |
| **See detailed guide:** [HUGGINGFACE_FILE_LIMITS.md](HUGGINGFACE_FILE_LIMITS.md) | |
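The arithmetic behind this section can be checked with a short sketch. It uses only the limits and counts quoted above. The helper names `fits_limits` and `docs_per_shard`, and the assumption of one folder per jurisdiction, are illustrative and do not come from the upload script.

```python
import math

MAX_FILES_PER_FOLDER = 10_000   # limit quoted above
MAX_FILES_PER_REPO = 100_000    # limit quoted above


def fits_limits(files_per_folder, total_files):
    """True when a layout stays under both file-count limits."""
    return files_per_folder < MAX_FILES_PER_FOLDER and total_files < MAX_FILES_PER_REPO


def docs_per_shard(total_docs, shard_count):
    """Number of documents each Parquet file must hold (rounded up)."""
    return math.ceil(total_docs / shard_count)


total_docs = 22_000 * 1_000  # 22 million documents

# One file per PDF, assuming one folder per jurisdiction: too many files in the repo.
print(fits_limits(files_per_folder=1_000, total_files=total_docs))  # False

# 50 Parquet files in a single folder: well inside both limits.
print(fits_limits(files_per_folder=50, total_files=50))  # True
print(docs_per_shard(total_docs, 50))  # 440000
```

With 50 files, each Parquet file holds about 440,000 documents' worth of extracted text.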
| --- | |
| ## 📤 UPLOAD YOUR DATA | |
| ### Option 1: Use the Upload Script (Recommended) | |
| **For discovery data:** | |
| ```bash | |
| # Go to your project | |
| cd /home/developer/projects/open-navigator | |
| # Activate environment | |
| source venv/bin/activate | |
| # Upload discovery results | |
| python scripts/upload_to_huggingface.py \ | |
| --repo "YOUR_USERNAME/oral-health-policy-data" \ | |
| --discovery | |
| # View your dataset | |
| # https://huggingface.co/datasets/YOUR_USERNAME/oral-health-policy-data | |
| ``` | |
| **For meeting PDFs (extract text first!):** | |
| ```bash | |
| # DON'T upload individual PDFs! | |
| # Instead, extract text and save as Parquet | |
| # 1. Create a file with PDF URLs (one per line) | |
| cat > pdf_urls.txt << EOF | |
| https://tuscaloosaal.suiteonemedia.com/agenda1.pdf | |
| https://tuscaloosaal.suiteonemedia.com/agenda2.pdf | |
| ... | |
| EOF | |
| # 2. Process PDFs to Parquet (extracts text, deletes PDFs) | |
| python scripts/upload_to_huggingface.py \ | |
| --repo "YOUR_USERNAME/oral-health-policy-data" \ | |
| --process-pdfs pdf_urls.txt | |
| # 3. Upload the Parquet file (1 file, not thousands!) | |
| python scripts/upload_to_huggingface.py \ | |
| --repo "YOUR_USERNAME/oral-health-policy-data" \ | |
| --meetings meetings_processed.parquet | |
| ``` | |
| ### Option 2: Upload Directly from Python | |
| ```python | |
| import os | |
| import pandas as pd | |
| from datasets import Dataset | |
| from huggingface_hub import login | |
| # Log in with the token exported in step 3 instead of hard-coding it | |
| login(token=os.environ["HF_TOKEN"]) | |
| # Load your data | |
| df = pd.read_csv('data/bronze/discovered_sources/discovery_summary_final.csv') | |
| # Convert to dataset | |
| dataset = Dataset.from_pandas(df) | |
| # Upload to Hugging Face (FREE!) | |
| dataset.push_to_hub("YOUR_USERNAME/oral-health-policy-data", split="discovery") | |
| print("✅ Data uploaded! View at:") | |
| print("https://huggingface.co/datasets/YOUR_USERNAME/oral-health-policy-data") | |
| ``` | |
| --- | |
| ## 💰 COST BREAKDOWN | |
| | What You Get | Cost | | |
| |--------------|------| | |
| | **Storage for public datasets** (best effort) | **FREE** | |
| | Unlimited downloads | FREE | | |
| | Built-in viewer | FREE | | |
| | Version control | FREE | | |
| | Search & filtering | FREE | | |
| | API access | FREE | | |
| | **TOTAL** | **$0/month** ✅ | |
| --- | |
| ## STORAGE COMPARISON | |
| ### Bad Approach (Expensive) | |
| ``` | |
| ❌ Download all videos: 250 TB = $5,000/month | |
| ❌ Store all PDFs: 30 TB = $600/month | |
| ❌ Total: $5,600/month 💸 | |
| ``` | |
| ### Good Approach (FREE) | |
| ``` | |
| ✅ Store discovery data: 1 GB = FREE | |
| ✅ Store extracted text: 25 GB = FREE | |
| ✅ Store oral health subset: 5 GB = FREE | |
| ✅ Total: $0/month | |
| ``` | |
| **Savings: $5,600/month → $0/month** | |
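The paid figures above are consistent with a flat rate of $20 per TB per month ($5,000 / 250 TB and $600 / 30 TB). That rate is inferred from the numbers in this section and is not a quoted vendor price. A small sketch under that assumption reproduces the totals:

```python
PRICE_PER_TB_MONTH = 20.0  # USD; inferred from the figures above, not a vendor quote


def monthly_cost(terabytes, price_per_tb=PRICE_PER_TB_MONTH):
    """Monthly storage cost in USD at a flat per-terabyte rate."""
    return terabytes * price_per_tb


videos = monthly_cost(250)
pdfs = monthly_cost(30)
print(f"Videos: ${videos:,.0f}/month")         # Videos: $5,000/month
print(f"PDFs:   ${pdfs:,.0f}/month")           # PDFs:   $600/month
print(f"Total:  ${videos + pdfs:,.0f}/month")  # Total:  $5,600/month

# The text-only approach keeps 1 + 25 + 5 = 31 GB in total.
print(f"Text-only footprint: {1 + 25 + 5} GB")
```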
| --- | |
| ## 🎯 WHAT TO UPLOAD | |
| ### ✅ Upload These: | |
| 1. **Discovery Results** (~1 GB) | |
| - Jurisdiction websites | |
| - YouTube channels | |
| - Meeting platforms | |
| - Social media links | |
| 2. **Meeting Metadata** (~2 GB) | |
| - Meeting dates/titles | |
| - Agenda item lists | |
| - Source URLs | |
| 3. **Extracted Text** (~25 GB) | |
| - Text from PDFs | |
| - Meeting transcripts | |
| - Filtered for oral health | |
| ### ❌ Don't Upload These: | |
| 1. **Videos** - Link to YouTube instead | |
| 2. **Full PDFs** - Store text + URL to original | |
| 3. **Website HTML** - Just store the data you extracted | |
| 4. **Duplicates** - Filter first | |
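The "store text + URL" and "filter duplicates first" rules above can be sketched as one small record builder. The field names (`jurisdiction`, `source_url`, `text`, `text_sha256`) and the example URLs are assumptions made for illustration; they are not the schema used by `upload_to_huggingface.py`.

```python
import hashlib


def make_record(jurisdiction, source_url, text):
    """Build one row: extracted text plus a link back to the original PDF."""
    normalized = " ".join(text.split())  # collapse whitespace before hashing
    return {
        "jurisdiction": jurisdiction,
        "source_url": source_url,
        "text": normalized,
        "text_sha256": hashlib.sha256(normalized.encode("utf-8")).hexdigest(),
    }


def drop_duplicates(records):
    """Keep the first record for each distinct text hash."""
    seen = set()
    unique = []
    for record in records:
        if record["text_sha256"] not in seen:
            seen.add(record["text_sha256"])
            unique.append(record)
    return unique


rows = [
    make_record("Tuscaloosa, AL", "https://example.org/agenda1.pdf", "Fluoride  vote"),
    make_record("Tuscaloosa, AL", "https://example.org/agenda1-copy.pdf", "Fluoride vote"),
    make_record("Tuscaloosa, AL", "https://example.org/agenda2.pdf", "Budget hearing"),
]
print(len(drop_duplicates(rows)))  # 2
```

Hashing the normalized text catches the same agenda posted under two URLs. Near-duplicates with small wording changes would need a fuzzier comparison.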
| --- | |
| ## EXAMPLE WORKFLOW | |
| ### Step 1: Run Discovery | |
| ```bash | |
| # Discover all Alabama jurisdictions | |
| python discovery/comprehensive_discovery_pipeline.py --state AL | |
| # Output: data/bronze/discovered_sources/discovery_summary_AL.csv (~50 KB) | |
| ``` | |
| ### Step 2: Upload to Hugging Face | |
| ```bash | |
| # Upload discovery results | |
| python scripts/upload_to_huggingface.py \ | |
| --repo "YOUR_USERNAME/oral-health-policy-data" \ | |
| --discovery | |
| ``` | |
| ### Step 3: Free Up Local Space | |
| ```bash | |
| # Optional: delete local files once you have confirmed the upload succeeded | |
| rm -rf data/bronze/discovered_sources/*.csv | |
| # You can always download from Hugging Face later! | |
| ``` | |
| ### Step 4: Share & Analyze | |
| ```python | |
| # Anyone can now use your data (including you!) | |
| from datasets import load_dataset | |
| data = load_dataset("YOUR_USERNAME/oral-health-policy-data", split="discovery") | |
| alabama = data.filter(lambda x: x['state'] == 'AL') | |
| print(f"Alabama jurisdictions: {len(alabama)}") | |
| ``` | |
| --- | |
| ## CONTINUOUS WORKFLOW | |
| ### Keep Local Storage Low (~100 MB) | |
| ```python | |
| import os | |
| # download_agenda, extract_text, and upload_to_hf are placeholders for your own helpers | |
| # Process one jurisdiction at a time | |
| for jurisdiction in all_jurisdictions: | |
|     # 1. Download PDF (2 MB); returns the local file path | |
|     pdf = download_agenda(jurisdiction) | |
|     # 2. Extract text (50 KB) | |
|     text = extract_text(pdf) | |
|     # 3. Upload to Hugging Face | |
|     upload_to_hf(text) | |
|     # 4. Delete local file | |
|     os.remove(pdf) | |
| # Local storage: Never exceeds 100 MB! ✅ | |
| ``` | |
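Uploading one small file per document would run into the file-count limits described earlier, so one option is to buffer extracted records and write them out in larger batches. The sketch below shows only the batching step, with placeholder records. `batched` is a hypothetical helper, and the batch size of 1,000 is an arbitrary choice for illustration.

```python
def batched(records, batch_size):
    """Yield lists of at most batch_size records, in order."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, partially filled batch
        yield batch


# Placeholder data standing in for extracted text records.
records = ({"doc_id": i, "text": f"document {i}"} for i in range(2_500))

for index, batch in enumerate(batched(records, 1_000)):
    # In a real run, this is where one Parquet shard would be written and uploaded.
    print(f"shard {index}: {len(batch)} records")
```

Because `batched` consumes an iterator lazily, only one batch is held in memory at a time, which keeps the local footprint small.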
| --- | |
| ## HUGGING FACE BASICS | |
| ### Load Your Data Anywhere | |
| ```python | |
| from datasets import load_dataset | |
| # Load on your laptop | |
| data = load_dataset("YOUR_USERNAME/oral-health-policy-data") | |
| # Or in Google Colab (FREE GPU) | |
| # Or on a friend's computer | |
| # Or 5 years from now | |
| # Your data is always available, forever, for FREE! | |
| ``` | |
| ### Search & Filter | |
| ```python | |
| # Find cities with YouTube channels | |
| with_youtube = data.filter(lambda x: x['youtube_channels'] > 0) | |
| # Find high-quality sources | |
| high_quality = data.filter(lambda x: x['completeness'] > 0.8) | |
| # Find specific state | |
| indiana = data.filter(lambda x: x['state'] == 'IN') | |
| ``` | |
| ### Download Subset | |
| ```python | |
| # Only download what you need (save bandwidth) | |
| oral_health_only = load_dataset( | |
| "YOUR_USERNAME/oral-health-policy-data", | |
| split="oral_health" # Only the filtered subset | |
| ) | |
| # Maybe only 5 GB instead of 50 GB! | |
| ``` | |
| --- | |
| ## BENEFITS | |
| ### 1. **FREE Storage** | |
| - Free storage for public datasets (offered on a best-effort basis) | |
| - No bandwidth limits | |
| - No time limits | |
| ### 2. **Accessible Anywhere** | |
| - Download from any computer | |
| - Share with collaborators | |
| - Use in Google Colab | |
| ### 3. **Version Control** | |
| - Git-based system | |
| - Track all changes | |
| - Revert if needed | |
| ### 4. **Discovery** | |
| - Your dataset appears in Hugging Face search | |
| - Other researchers can use it | |
| - Builds your portfolio | |
| ### 5. **Integration** | |
| - Works with PyTorch, TensorFlow | |
| - Built-in data viewer | |
| - API access | |
| --- | |
| ## LEARN MORE | |
| ### Official Docs | |
| - **Hugging Face Datasets:** https://huggingface.co/docs/datasets/ | |
| - **Quick Start:** https://huggingface.co/docs/datasets/quickstart | |
| - **Upload Guide:** https://huggingface.co/docs/datasets/upload_dataset | |
| ### Examples | |
| - **MeetingBank:** https://huggingface.co/datasets/huuuyeah/meetingbank | |
| - **Browse Datasets:** https://huggingface.co/datasets | |
| --- | |
| ## TROUBLESHOOTING | |
| ### "Authentication failed" | |
| ```bash | |
| # Make sure token is set | |
| echo $HF_TOKEN | |
| # If empty, set it | |
| export HF_TOKEN="hf_YOUR_TOKEN" | |
| # Or login interactively | |
| huggingface-cli login | |
| ``` | |
| ### "Permission denied" | |
| ```bash | |
| # Make sure repo name includes your username | |
| # ✅ Correct: "myusername/oral-health-policy-data" | |
| # ❌ Wrong: "oral-health-policy-data" | |
| ``` | |
| ### "Dataset too large" | |
| ```python | |
| # Don't upload raw files! | |
| # Upload processed/filtered data only | |
| # ❌ Bad: Upload 50 GB of PDFs | |
| # ✅ Good: Upload 5 GB of extracted text | |
| ``` | |
| --- | |
| ## 🎯 NEXT STEPS | |
| 1. Create Hugging Face account | |
| 2. Get API token | |
| 3. Run discovery for your state | |
| 4. Upload to Hugging Face | |
| 5. Delete local files to free space | |
| 6. Scale to all 22,000+ jurisdictions! | |
| **Your data is safe in the cloud, FREE, forever!** | |
| --- | |
| ## 💡 PRO TIP | |
| Make your dataset **public** (not private): | |
| - ✅ FREE storage (best effort for public repos) | |
| - ✅ Helps research community | |
| - ✅ Builds your portfolio | |
| - ✅ Appears in search results | |
| Private datasets are limited to 100 GB and don't help anyone! | |
| **Public = Win-Win-Win** | |