# 🚀 QUICK START: FREE STORAGE WITH HUGGING FACE

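**TL;DR: Store your public datasets for FREE on Hugging Face!** Public storage is offered on a best-effort basis, so check the current Hugging Face storage limits before uploading terabytes.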

**⚠️ IMPORTANT: Use Parquet format, NOT individual PDFs! See [file limits guide](HUGGINGFACE_FILE_LIMITS.md)**

---

## ⚡ 3-MINUTE SETUP

### 1. Create Hugging Face Account (1 minute)
```bash
# Go to https://huggingface.co/join
# Sign up (FREE)
# Verify email
```

### 2. Get API Token (1 minute)
```bash
# Go to https://huggingface.co/settings/tokens
# Click "New token"
# Name it "oral-health-upload"
# Token Type: Write (required for publishing datasets)
# Repository permissions: All repositories
# Copy the token (hf_xxxxxxxxxxxx)
```

**⚠️ Important: Token Permissions**
- **Write** access required for publishing datasets
- **Read** access sufficient for downloading public datasets only
- For this project: Use **Write** token to publish your scraped data

### 3. Install & Login (1 minute)
```bash
pip install huggingface_hub datasets

# Set your token (huggingface_hub reads HF_TOKEN from the environment)
export HF_TOKEN="hf_YOUR_TOKEN_HERE"
```
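
Before running any upload, it can help to confirm that the token is actually visible to Python. The following is a minimal sketch, not part of the project scripts: `read_hf_token` and `check_login` are hypothetical helper names, and the `hf_` prefix check only mirrors the token format shown above. `whoami` is the `huggingface_hub` function that returns the account behind a token.

```python
import os


def read_hf_token(env=None):
    """Return the token stored in HF_TOKEN, or raise a helpful error."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError('HF_TOKEN is not set. Run: export HF_TOKEN="hf_..."')
    if not token.startswith("hf_"):
        # Assumption: tokens look like the hf_xxxxxxxxxxxx example above.
        raise RuntimeError("HF_TOKEN does not look like a Hugging Face token")
    return token


def check_login():
    """Ask the Hub which account the token belongs to (needs network access)."""
    # Imported here so the token check above also works without the library.
    from huggingface_hub import whoami

    return whoami(token=read_hf_token())["name"]
```

Calling `check_login()` in a Python shell should return your username if the token is valid; an error at this point is cheaper to debug than a failed upload.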

---

## ⚠️ CRITICAL: FILE LIMITS

**Hugging Face Limits:**
- Files per folder: <10,000
- Total files per repo: <100,000
- For large datasets: Use Parquet or WebDataset format

**Your Scale:**
- 22,000 jurisdictions × 1,000 docs = 22 MILLION files ❌

**Solution:**
- Extract text from PDFs
- Store in Parquet format
- Result: 50 files instead of 22 million ✅

**See detailed guide:** [HUGGINGFACE_FILE_LIMITS.md](HUGGINGFACE_FILE_LIMITS.md)
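
The arithmetic behind these numbers can be checked with a few lines of Python. This is an illustrative sketch: the limits are the ones quoted above, while the shard size of 440,000 documents per Parquet file is an assumption chosen so that the result matches the 50 files mentioned here.

```python
# Recommended limits quoted in the section above
MAX_FILES_PER_FOLDER = 10_000
MAX_FILES_PER_REPO = 100_000


def total_files(jurisdictions, docs_per_jurisdiction):
    """Number of files if every document is uploaded as its own PDF."""
    return jurisdictions * docs_per_jurisdiction


def fits_in_repo(n_files):
    """True if the file count stays under the per-repo limit."""
    return n_files < MAX_FILES_PER_REPO


def shards_needed(n_docs, docs_per_shard):
    """Number of Parquet files when each shard holds docs_per_shard rows."""
    return (n_docs + docs_per_shard - 1) // docs_per_shard  # ceiling division


pdf_count = total_files(22_000, 1_000)
shard_count = shards_needed(pdf_count, 440_000)  # shard size is an assumption

print(f"Individual PDFs: {pdf_count:,} (fits: {fits_in_repo(pdf_count)})")
print(f"Parquet shards:  {shard_count} (fits: {fits_in_repo(shard_count)})")
```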

---

## 📤 UPLOAD YOUR DATA

### Option 1: Use the Upload Script (Recommended)

**For discovery data:**

```bash
# Go to your project
cd /home/developer/projects/open-navigator

# Activate environment
source venv/bin/activate

# Upload discovery results
python scripts/upload_to_huggingface.py \
    --repo "YOUR_USERNAME/oral-health-policy-data" \
    --discovery

# View your dataset
# https://huggingface.co/datasets/YOUR_USERNAME/oral-health-policy-data
```

**For meeting PDFs (extract text first!):**

```bash
# DON'T upload individual PDFs!
# Instead, extract text and save as Parquet

# 1. Create a file with PDF URLs (one per line)
cat > pdf_urls.txt << EOF
https://tuscaloosaal.suiteonemedia.com/agenda1.pdf
https://tuscaloosaal.suiteonemedia.com/agenda2.pdf
...
EOF

# 2. Process PDFs to Parquet (extracts text, deletes PDFs)
python scripts/upload_to_huggingface.py \
    --repo "YOUR_USERNAME/oral-health-policy-data" \
    --process-pdfs pdf_urls.txt

# 3. Upload the Parquet file (1 file, not thousands!)
python scripts/upload_to_huggingface.py \
    --repo "YOUR_USERNAME/oral-health-policy-data" \
    --meetings meetings_processed.parquet
```
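
The internals of `upload_to_huggingface.py` are not shown here, so the following is only a sketch of what "extract text and save as Parquet" could look like. The function names and column names (`doc_id`, `source_url`, `text`, and so on) are hypothetical, the sample text is a placeholder standing in for real extraction output, and PDF text extraction itself is left out. `DataFrame.to_parquet` is the pandas call for writing Parquet and needs `pyarrow` or `fastparquet` installed.

```python
import hashlib
from datetime import datetime, timezone


def make_record(url, text):
    """Build one Parquet row: the extracted text plus a link to the original PDF."""
    return {
        "doc_id": hashlib.sha256(url.encode("utf-8")).hexdigest()[:16],
        "source_url": url,
        "text": text,
        "n_chars": len(text),
        "extracted_at": datetime.now(timezone.utc).isoformat(),
    }


def write_shard(records, path):
    """Write all records to a single Parquet file."""
    import pandas as pd  # needs pandas plus pyarrow (or fastparquet)

    pd.DataFrame(records).to_parquet(path, index=False)


# Placeholder text; in practice this comes from your PDF text extractor.
records = [
    make_record("https://tuscaloosaal.suiteonemedia.com/agenda1.pdf", "Call to order ..."),
    make_record("https://tuscaloosaal.suiteonemedia.com/agenda2.pdf", "Public comment ..."),
]

# Uncomment to write the file once pandas and pyarrow are installed:
# write_shard(records, "meetings_processed.parquet")
```

Keeping `source_url` in every row means the original PDF never has to be stored: anyone who needs it can follow the link.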

### Option 2: Upload Directly from Python

```python
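import os

import pandas as pd
from datasets import Dataset
from huggingface_hub import login

# Log in with the token from the HF_TOKEN environment variable
# (avoids hard-coding a secret in the script)
login(token=os.environ["HF_TOKEN"])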

# Load your data
df = pd.read_csv('data/bronze/discovered_sources/discovery_summary_final.csv')

# Convert to dataset
dataset = Dataset.from_pandas(df)

# Upload to Hugging Face (FREE!)
dataset.push_to_hub("YOUR_USERNAME/oral-health-policy-data", split="discovery")

print("✅ Data uploaded! View at:")
print("https://huggingface.co/datasets/YOUR_USERNAME/oral-health-policy-data")
```

---

## 💰 COST BREAKDOWN

| What You Get | Cost |
|--------------|------|
| **Storage** (public datasets) | **FREE** |
| Downloads | FREE |
| Built-in viewer | FREE |
| Version control | FREE |
| Search & filtering | FREE |
| API access | FREE |
| **TOTAL** | **$0/month** ✅ |

---

## 📊 STORAGE COMPARISON

### Bad Approach (Expensive)
```
❌ Download all videos: 250 TB = $5,000/month
❌ Store all PDFs: 30 TB = $600/month
❌ Total: $5,600/month 💸
```

### Good Approach (FREE)
```
✅ Store discovery data: 1 GB = FREE
✅ Store extracted text: 25 GB = FREE
✅ Store oral health subset: 5 GB = FREE
✅ Total: $0/month 🎉
```

**Savings: $5,600/month → $0/month**
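
The comparison can be reproduced with simple arithmetic. In this sketch the rate of $20 per TB per month is not a quoted price from any provider; it is just what the figures above imply ($5,000 / 250 TB).

```python
# Implied by the figures above: $5,000 / 250 TB = $20 per TB per month
PRICE_PER_TB_MONTH = 5_000 / 250


def monthly_cost(terabytes, price_per_tb=PRICE_PER_TB_MONTH):
    """Monthly storage cost in dollars at a flat per-TB rate."""
    return terabytes * price_per_tb


raw_tb = {"videos": 250, "pdfs": 30}
processed_gb = {"discovery": 1, "extracted_text": 25, "oral_health_subset": 5}

raw_cost = sum(monthly_cost(tb) for tb in raw_tb.values())
processed_total_gb = sum(processed_gb.values())
# Using 1 TB = 1,000 GB to price the processed data at the same rate
processed_cost_elsewhere = monthly_cost(processed_total_gb / 1_000)

print(f"Raw files:      {sum(raw_tb.values())} TB -> ${raw_cost:,.0f}/month")
print(f"Processed data: {processed_total_gb} GB -> ${processed_cost_elsewhere:.2f}/month at the same rate")
```

Even on paid storage, the processed data would cost well under a dollar a month, which shows that most of the saving comes from not storing raw videos and PDFs in the first place.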

---

## 🎯 WHAT TO UPLOAD

### ✅ Upload These:

1. **Discovery Results** (~1 GB)
   - Jurisdiction websites
   - YouTube channels
   - Meeting platforms
   - Social media links

2. **Meeting Metadata** (~2 GB)
   - Meeting dates/titles
   - Agenda item lists
   - Source URLs

3. **Extracted Text** (~25 GB)
   - Text from PDFs
   - Meeting transcripts
   - Filtered for oral health

### ❌ Don't Upload These:

1. **Videos** - Link to YouTube instead
2. **Full PDFs** - Store text + URL to original
3. **Website HTML** - Just store the data you extracted
4. **Duplicates** - Filter first
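
For the "filter first" step, one simple approach is to drop records whose extracted text is identical after normalizing whitespace and case. This is a sketch under that assumption; the record layout (`source_url`, `text`) and the example URLs are hypothetical, and near-duplicates with small wording differences would need a fuzzier comparison.

```python
import hashlib


def content_key(text):
    """Hash of the text with whitespace collapsed and case folded."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(records):
    """Keep the first record for each distinct text, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = content_key(record["text"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


docs = [
    {"source_url": "https://example.org/a.pdf", "text": "Fluoridation  vote"},
    {"source_url": "https://example.org/b.pdf", "text": "fluoridation vote"},
    {"source_url": "https://example.org/c.pdf", "text": "Budget hearing"},
]
unique_docs = drop_duplicates(docs)
print([d["source_url"] for d in unique_docs])
```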

---

## 📝 EXAMPLE WORKFLOW

### Step 1: Run Discovery
```bash
# Discover all Alabama jurisdictions
python discovery/comprehensive_discovery_pipeline.py --state AL

# Output: data/bronze/discovered_sources/discovery_summary_AL.csv (~50 KB)
```

### Step 2: Upload to Hugging Face
```bash
# Upload discovery results
python scripts/upload_to_huggingface.py \
    --repo "YOUR_USERNAME/oral-health-policy-data" \
    --discovery
```

### Step 3: Free Up Local Space
```bash
# Optional: delete local copies once you have confirmed the upload on the dataset page
rm -f data/bronze/discovered_sources/*.csv

# You can download the data from Hugging Face again later
```
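
Since deleting is irreversible, a quick row-count comparison before removing local files is a cheap safeguard. The sketch below assumes the data was uploaded as a `discovery` split, as in the examples in this guide; the helper names are hypothetical. `load_dataset` and `Dataset.num_rows` are part of the `datasets` library.

```python
import csv


def count_csv_rows(path):
    """Number of data rows in a CSV file (header excluded)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return sum(1 for _ in csv.DictReader(handle))


def remote_row_count(repo_id, split="discovery"):
    """Row count of the uploaded split (needs network access)."""
    from datasets import load_dataset

    return load_dataset(repo_id, split=split).num_rows


def safe_to_delete(csv_path, remote_rows):
    """True only if the Hub copy has at least as many rows as the local file."""
    return remote_rows >= count_csv_rows(csv_path)
```

A typical use would be `safe_to_delete(path, remote_row_count("YOUR_USERNAME/oral-health-policy-data"))`, deleting the file only when it returns `True`.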

### Step 4: Share & Analyze
```python
# Anyone can now use your data (including you!)
from datasets import load_dataset

data = load_dataset("YOUR_USERNAME/oral-health-policy-data", split="discovery")
alabama = data.filter(lambda x: x['state'] == 'AL')

print(f"Alabama jurisdictions: {len(alabama)}")
```

---

## 🔄 CONTINUOUS WORKFLOW

### Keep Local Storage Low (~100 MB)

```python
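import os

# download_agenda, extract_text, and upload_to_hf are placeholders for your
# own project helpers; all_jurisdictions is your list of jurisdictions.

# Process one jurisdiction at a time
for jurisdiction in all_jurisdictions:
    # 1. Download the PDF (~2 MB) and get its local path
    pdf_path = download_agenda(jurisdiction)

    # 2. Extract text (~50 KB)
    text = extract_text(pdf_path)

    # 3. Upload to Hugging Face
    upload_to_hf(text)

    # 4. Delete the local PDF before moving on
    os.remove(pdf_path)

# Only one PDF is on disk at a time, so local storage stays far below 100 MB ✅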
```
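
Uploading one small file per jurisdiction would eventually run into the file-count limits described earlier, so it is worth batching rows before each upload. The class below is a minimal sketch of that idea, with hypothetical names throughout: it buffers rows in memory and hands them to a `flush` callable in fixed-size batches. In a real pipeline, `flush` could write a Parquet shard and upload it; here it only records batch sizes so the behavior is easy to see.

```python
class ShardBuffer:
    """Collect rows in memory and flush them in fixed-size batches."""

    def __init__(self, flush, max_rows=1_000):
        self.flush = flush  # callable that receives a list of rows
        self.max_rows = max_rows
        self.rows = []

    def add(self, row):
        """Buffer one row; flush automatically when the batch is full."""
        self.rows.append(row)
        if len(self.rows) >= self.max_rows:
            self.close()

    def close(self):
        """Flush whatever is buffered (call once more at the end)."""
        if self.rows:
            self.flush(self.rows)
            self.rows = []


flushed = []  # batch sizes, recorded instead of doing a real upload
buffer = ShardBuffer(flush=lambda rows: flushed.append(len(rows)), max_rows=3)
for i in range(7):
    buffer.add({"jurisdiction": f"place-{i}", "text": "..."})
buffer.close()

print(flushed)  # [3, 3, 1]
```

Memory use is bounded by `max_rows`, and the number of files in the repo grows with the number of batches rather than the number of jurisdictions.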

---

## 📚 HUGGING FACE BASICS

### Load Your Data Anywhere

```python
from datasets import load_dataset

# Load on your laptop
data = load_dataset("YOUR_USERNAME/oral-health-policy-data")

# Or in Google Colab (FREE GPU)
# Or on a friend's computer
# Or 5 years from now

# Your data is always available, forever, for FREE!
```

### Search & Filter

```python
# Assumes `data` is the discovery split and contains the columns used below
# Find cities with YouTube channels
with_youtube = data.filter(lambda x: x['youtube_channels'] > 0)

# Find high-quality sources
high_quality = data.filter(lambda x: x['completeness'] > 0.8)

# Find specific state
indiana = data.filter(lambda x: x['state'] == 'IN')
```
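
Several conditions can be combined into one named predicate, which is easier to test than a chain of lambdas. This sketch reuses the column names from the examples above (`state`, `youtube_channels`, `completeness`), which are assumed to exist in your dataset; `is_strong_source` is a hypothetical helper and the sample rows are made up for illustration.

```python
def is_strong_source(row, state=None, min_completeness=0.8):
    """True for rows with a YouTube channel and a high completeness score."""
    if state is not None and row["state"] != state:
        return False
    return row["youtube_channels"] > 0 and row["completeness"] > min_completeness


# Made-up rows to show the behavior without downloading anything
rows = [
    {"state": "IN", "youtube_channels": 2, "completeness": 0.9},
    {"state": "IN", "youtube_channels": 0, "completeness": 0.95},
    {"state": "AL", "youtube_channels": 1, "completeness": 0.85},
]
matches = [is_strong_source(r, state="IN") for r in rows]
print(matches)  # [True, False, False]

# With a loaded split, the same predicate plugs into filter():
# strong_indiana = data.filter(lambda r: is_strong_source(r, state="IN"))
```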

### Download Subset

```python
# Only download what you need (save bandwidth)
oral_health_only = load_dataset(
    "YOUR_USERNAME/oral-health-policy-data",
    split="oral_health"  # Only the filtered subset
)

# Maybe only 5 GB instead of 50 GB!
```

---

## ✅ BENEFITS

### 1. **FREE Storage**
- No storage fees for public datasets
- No download fees
- No expiry date

### 2. **Accessible Anywhere**
- Download from any computer
- Share with collaborators
- Use in Google Colab

### 3. **Version Control**
- Git-based system
- Track all changes
- Revert if needed

### 4. **Discovery**
- Your dataset appears in Hugging Face search
- Other researchers can use it
- Builds your portfolio

### 5. **Integration**
- Works with PyTorch, TensorFlow
- Built-in data viewer
- API access

---

## 🎓 LEARN MORE

### Official Docs
- **Hugging Face Datasets:** https://huggingface.co/docs/datasets/
- **Quick Start:** https://huggingface.co/docs/datasets/quickstart
- **Upload Guide:** https://huggingface.co/docs/datasets/upload_dataset

### Examples
- **MeetingBank:** https://huggingface.co/datasets/huuuyeah/meetingbank
- **Browse Datasets:** https://huggingface.co/datasets

---

## 🆘 TROUBLESHOOTING

### "Authentication failed"
```bash
# Make sure token is set
echo $HF_TOKEN

# If empty, set it
export HF_TOKEN="hf_YOUR_TOKEN"

# Or login interactively
huggingface-cli login
```

### "Permission denied"
```bash
# Make sure repo name includes your username
# ✅ Correct: "myusername/oral-health-policy-data"
# ❌ Wrong: "oral-health-policy-data"
```
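
A missing username is easy to catch before the upload starts. The check below is a sketch: the regular expression is a simplified assumption about what a `username/dataset-name` repo ID looks like, not the Hub's full naming rules, and `has_namespace` is a hypothetical helper. If the repo ID is correct and the error persists, check that your token has **Write** access.

```python
import re

# Assumption: a simplified pattern, not the Hub's complete naming rules.
REPO_ID = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*")


def has_namespace(repo_id):
    """True if repo_id has the form 'username/dataset-name'."""
    return REPO_ID.fullmatch(repo_id) is not None


for candidate in ("myusername/oral-health-policy-data", "oral-health-policy-data"):
    status = "ok" if has_namespace(candidate) else "missing username"
    print(f"{candidate}: {status}")
```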

### "Dataset too large"
```python
# Don't upload raw files!
# Upload processed/filtered data only

# ❌ Bad: Upload 50 GB of PDFs
# ✅ Good: Upload 5 GB of extracted text
```

---

## 🎯 NEXT STEPS

1. ✅ Create Hugging Face account
2. ✅ Get API token
3. ✅ Run discovery for your state
4. ✅ Upload to Hugging Face
5. ✅ Delete local files to free space
6. ✅ Scale to all 22,000+ jurisdictions!

**Your data is safe in the cloud, FREE, forever!** 🎉

---

## 💡 PRO TIP

Make your dataset **public** (not private):
- ✅ FREE storage
- ✅ Helps research community
- ✅ Builds your portfolio
- ✅ Appears in search results

Private datasets are limited to 100 GB on a free account and are invisible to other researchers!

**Public = Win-Win-Win** 🏆