Spaces:

CommunityOne
/

open-navigator

Running on CPU Upgrade

File size: 12,029 Bytes

61d29fc

# ⚠️ HUGGING FACE FILE LIMITS & SOLUTIONS

**IMPORTANT: Don't upload individual PDFs! Use structured formats instead.**

---

## 🚨 THE PROBLEM

### Hugging Face Limits:
```
Files per folder:      < 10,000 recommended
Total files per repo:  < 100,000 recommended
Large-scale handling:  Use WebDataset or Parquet, NOT individual files
```

### Your Scale:
```
22,000 jurisdictions × 1,000 documents each = 22 MILLION files
❌ This would BREAK Hugging Face limits!
```

---

## ✅ THE SOLUTION: PARQUET FORMAT

**Instead of uploading 22 million PDFs, store extracted data in Parquet files.**

### Why Parquet?

1. ✅ **Efficient** - Columnar storage, highly compressed
2. ✅ **Scalable** - Handle millions of rows in single file
3. ✅ **Fast** - Optimized for filtering and querying
4. ✅ **Native** - Hugging Face Datasets uses Parquet internally
5. ✅ **Small** - 10-100x smaller than individual files

### Size Comparison:

```
❌ Bad: 22 million PDF files (30 TB)
   - Exceeds 100k file limit by 220x
   - Slow to upload/download
   - Impossible to manage

✅ Good: 220 Parquet files (25 GB compressed)
   - 1 file per jurisdiction type per state
   - Fast to query
   - Easy to manage
   - Within all limits
```

---

## 📊 RECOMMENDED STRUCTURE

### Option 1: Parquet Files (RECOMMENDED)

**Store all text content in Parquet tables:**

```python
import pandas as pd
from datasets import Dataset

# Instead of storing individual PDFs...
# Store rows in a DataFrame

meetings_data = []

for jurisdiction in all_jurisdictions:
    for meeting in jurisdiction.meetings:
        meetings_data.append({
            'jurisdiction_name': 'Tuscaloosa',
            'state': 'AL',
            'meeting_date': '2025-03-15',
            'meeting_title': 'City Council Regular Meeting',
            'agenda_text': 'extracted text from PDF...',  # ← TEXT, not PDF bytes
            'minutes_text': 'extracted minutes...',
            'video_url': 'https://youtube.com/watch?v=...',  # ← LINK, not video
            'source_url': 'https://tuscaloosaal.suiteonemedia.com/agenda.pdf',
            'keywords_found': ['fluoride', 'dental'],
            'is_oral_health_related': True
        })

# Convert to DataFrame
df = pd.DataFrame(meetings_data)

# Save as Parquet (highly compressed)
df.to_parquet('meetings_all.parquet', compression='snappy')

# Upload to Hugging Face
dataset = Dataset.from_pandas(df)
dataset.push_to_hub("username/oral-health-policy-data", split="meetings")
```

**File structure on Hugging Face:**
```
your-dataset/
├── discovery.parquet          # 1 file, ~1 GB (22k jurisdictions)
├── meetings.parquet           # 1 file, ~10 GB (500k meetings)
├── oral_health.parquet        # 1 file, ~2 GB (50k relevant docs)
└── README.md

Total: 3 files, 13 GB ✅ (vs 22 million files, 30 TB ❌)
```

---

## 🎯 CORRECT WORKFLOW

### ❌ WRONG: Download & Upload PDFs

```python
# DON'T DO THIS!
for jurisdiction in all_jurisdictions:
    for meeting in get_meetings(jurisdiction):
        # Download PDF
        pdf_bytes = download_pdf(meeting.pdf_url)
        
        # Upload to Hugging Face
        upload_file(pdf_bytes, f"pdfs/{jurisdiction}/{meeting.id}.pdf")
        # ❌ Results in 22 million files!
```

### ✅ CORRECT: Extract & Store Text in Parquet

```python
# DO THIS!
import pandas as pd
from PyPDF2 import PdfReader
import io

all_meetings = []

for jurisdiction in all_jurisdictions:
    for meeting in get_meetings(jurisdiction):
        # Download PDF temporarily
        pdf_bytes = download_pdf(meeting.pdf_url)
        
        # Extract text (don't store PDF!)
        pdf_reader = PdfReader(io.BytesIO(pdf_bytes))
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text()
        
        # Store metadata + text (not PDF bytes)
        all_meetings.append({
            'id': f"{jurisdiction.name}_{meeting.date}_{meeting.id}",
            'jurisdiction': jurisdiction.name,
            'state': jurisdiction.state,
            'date': meeting.date,
            'title': meeting.title,
            'text': text,  # ← Extracted text
            'source_pdf_url': meeting.pdf_url,  # ← Link to original
            'file_size_kb': len(pdf_bytes) // 1024,
            'page_count': len(pdf_reader.pages)
        })
        
        # Delete PDF immediately (free memory)
        del pdf_bytes

# Save all to single Parquet file
df = pd.DataFrame(all_meetings)
df.to_parquet('all_meetings.parquet', compression='snappy')

# Upload 1 file instead of 22 million!
from datasets import Dataset
dataset = Dataset.from_pandas(df)
dataset.push_to_hub("username/oral-health-meetings")
```

**Result:**
- ✅ 1 file (not 22 million)
- ✅ 10 GB (not 30 TB)
- ✅ Fast queries
- ✅ Easy downloads

---

## 📦 PARTITIONED PARQUET (For Very Large Datasets)

If you have 100+ GB of data, partition by state:

```python
import pandas as pd
from pathlib import Path

# Process state by state
for state in all_states:
    state_meetings = []
    
    for jurisdiction in get_jurisdictions(state):
        # Extract meetings for this jurisdiction
        meetings = process_jurisdiction(jurisdiction)
        state_meetings.extend(meetings)
    
    # Save one Parquet per state
    df = pd.DataFrame(state_meetings)
    df.to_parquet(f'meetings_{state}.parquet')

# Upload to Hugging Face with state-based splits
from datasets import Dataset, DatasetDict

dataset_dict = {}
for state_file in Path('.').glob('meetings_*.parquet'):
    state = state_file.stem.split('_')[1]
    df = pd.read_parquet(state_file)
    dataset_dict[state] = Dataset.from_pandas(df)

# Upload all states
datasets = DatasetDict(dataset_dict)
datasets.push_to_hub("username/oral-health-meetings")
```

**File structure:**
```
your-dataset/
├── AL/
│   └── data-00000-of-00001.parquet  # Alabama meetings
├── CA/
│   └── data-00000-of-00001.parquet  # California meetings
├── TX/
│   └── data-00000-of-00001.parquet  # Texas meetings
...
└── README.md

Total: 50 files (one per state) ✅
```

**Load specific state:**
```python
# Only download Alabama data
al_data = load_dataset("username/oral-health-meetings", split="AL")
```

---

## 🗜️ COMPRESSION COMPARISON

### Parquet Compression:

```python
# Same data, different compression

df.to_parquet('meetings.parquet', compression='snappy')  # Fast, good compression
# Size: 8 GB

df.to_parquet('meetings.parquet', compression='gzip')    # Slower, better compression
# Size: 5 GB

df.to_parquet('meetings.parquet', compression='brotli')  # Slowest, best compression
# Size: 3 GB
```

**Recommendation:** Use `snappy` (default) - good balance of speed and size.

---

## 🔢 SIZE ESTIMATES

### Real Numbers for 22,000 Jurisdictions:

| Data Type | Storage Method | Files | Size |
|-----------|----------------|-------|------|
| **PDFs (raw)** | Individual files | 22M | 30 TB ❌ |
| **PDFs (text)** | Parquet | 50 | 25 GB ✅ |
| **Oral health subset** | Parquet | 1 | 5 GB ✅ |
| **Discovery results** | Parquet | 1 | 1 GB ✅ |

**Total storage needed: ~30 GB (not 30 TB!)** ✅

---

## 💡 ALTERNATIVE: WebDataset Format

For image-heavy or binary data, use WebDataset `.tar` files:

```python
import webdataset as wds

# Create sharded tar files
sink = wds.ShardWriter("meetings-%06d.tar", maxcount=10000)

for jurisdiction in all_jurisdictions:
    for meeting in jurisdiction.meetings:
        # Extract text from PDF
        text = extract_text(meeting.pdf_url)
        
        sink.write({
            "__key__": f"{jurisdiction.name}_{meeting.id}",
            "txt": text.encode('utf-8'),
            "json": json.dumps(meeting.metadata).encode('utf-8')
        })

sink.close()

# Results in:
# meetings-000000.tar (10k documents)
# meetings-000001.tar (10k documents)
# ...
# meetings-002200.tar (remaining documents)
# Total: ~2,200 tar files ✅ (under 10k file limit per folder)
```

---

## 🎯 RECOMMENDED APPROACH

### For Your Project:

**1. Store Metadata + Text in Parquet (Primary)**
```python
# Structure your data
meetings_df = pd.DataFrame({
    'id': [...],
    'jurisdiction': [...],
    'state': [...],
    'date': [...],
    'title': [...],
    'agenda_text': [...],      # Extracted text
    'minutes_text': [...],     # Extracted text
    'source_url': [...],       # Link to original PDF
    'video_url': [...],        # Link to YouTube
    'oral_health_keywords': [...]
})

# Save as Parquet
meetings_df.to_parquet('meetings.parquet', compression='snappy')

# Upload to Hugging Face (1 file, ~10 GB)
dataset = Dataset.from_pandas(meetings_df)
dataset.push_to_hub("username/oral-health-meetings")
```

**2. Partition by State (If >50 GB)**
```python
# One Parquet per state
for state in all_states:
    state_df = meetings_df[meetings_df['state'] == state]
    state_df.to_parquet(f'meetings_{state}.parquet')

# Upload with splits
dataset_dict = {...}  # Load each state
datasets.push_to_hub("username/oral-health-meetings")

# Total: 50 files (one per state) ✅
```

**3. Never Upload Individual PDFs**
```python
# ❌ NEVER do this
for pdf in all_pdfs:
    upload_file(pdf)  # Results in millions of files

# ✅ ALWAYS do this
text = extract_text(pdf)
df.append({'text': text, 'source_url': pdf_url})
df.to_parquet('data.parquet')  # One file
```

---

## 📚 UPDATED UPLOAD SCRIPT

```python
#!/usr/bin/env python3
"""
Correctly upload large-scale data to Hugging Face using Parquet format.
"""

import pandas as pd
from datasets import Dataset
from huggingface_hub import login
from PyPDF2 import PdfReader
import io

def process_and_upload_correct_way():
    """Process jurisdictions and upload as Parquet (not individual files)."""
    
    all_meetings = []
    
    # Process all jurisdictions
    for jurisdiction in all_jurisdictions:
        print(f"Processing {jurisdiction.name}...")
        
        for agenda_url in jurisdiction.agenda_urls:
            # Download PDF temporarily
            pdf_bytes = download_pdf(agenda_url)
            
            # Extract text
            pdf_reader = PdfReader(io.BytesIO(pdf_bytes))
            text = "\n".join(page.extract_text() for page in pdf_reader.pages)
            
            # Store metadata + text (NOT PDF bytes)
            all_meetings.append({
                'jurisdiction': jurisdiction.name,
                'state': jurisdiction.state,
                'date': extract_date(text),
                'text': text,
                'source_url': agenda_url,
                'page_count': len(pdf_reader.pages)
            })
            
            # Delete PDF immediately
            del pdf_bytes
            
            # Keep local storage low!
    
    # Convert to DataFrame
    df = pd.DataFrame(all_meetings)
    
    # Save as Parquet (compressed)
    df.to_parquet('all_meetings.parquet', compression='snappy')
    
    print(f"Total meetings: {len(df)}")
    print(f"File size: {Path('all_meetings.parquet').stat().st_size / 1e9:.2f} GB")
    
    # Upload to Hugging Face (1 file instead of millions!)
    dataset = Dataset.from_pandas(df)
    dataset.push_to_hub("username/oral-health-meetings")
    
    print("✅ Uploaded 1 Parquet file containing all meetings!")
```

---

## ✅ SUMMARY

### Do This:
1. ✅ Extract text from PDFs (don't store PDF bytes)
2. ✅ Store in Parquet format (1-50 files total)
3. ✅ Link to original sources (not duplicate content)
4. ✅ Compress with snappy
5. ✅ Partition by state if >50 GB

### Don't Do This:
1. ❌ Upload individual PDFs (millions of files)
2. ❌ Store video files (link to YouTube)
3. ❌ Duplicate raw content
4. ❌ Exceed 100k file limit
5. ❌ Use uncompressed formats

### Result:
- **22 million files → 50 files** ✅
- **30 TB → 30 GB** ✅
- **Slow uploads → Fast uploads** ✅
- **Hard to manage → Easy to manage** ✅
- **Expensive → FREE** ✅

**You can store ALL 22,000 jurisdictions in ~50 Parquet files totaling 30 GB!**