# ⚠️ HUGGING FACE FILE LIMITS & SOLUTIONS

**IMPORTANT: Don't upload individual PDFs! Use structured formats instead.**

---
## 🚨 THE PROBLEM

### Hugging Face Limits:

```
Files per folder:      < 10,000 recommended
Total files per repo:  < 100,000 recommended
Large-scale handling:  Use WebDataset or Parquet, NOT individual files
```

### Your Scale:

```
22,000 jurisdictions × 1,000 documents each = 22 MILLION files
→ This would BREAK Hugging Face limits!
```
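The arithmetic behind that estimate can be checked in a few lines. This is a minimal sketch using only the figures quoted above; the constant names are illustrative and not part of any Hugging Face API.

```python
# Figures quoted in this guide; the names are illustrative.
JURISDICTIONS = 22_000
DOCS_PER_JURISDICTION = 1_000
MAX_FILES_PER_REPO = 100_000    # recommended upper bound per repository
MAX_FILES_PER_FOLDER = 10_000   # recommended upper bound per folder

total_files = JURISDICTIONS * DOCS_PER_JURISDICTION
repo_overshoot = total_files / MAX_FILES_PER_REPO

print(f"Total files: {total_files:,}")
print(f"Over the per-repo recommendation by {repo_overshoot:.0f}x")
# Each jurisdiction folder alone (1,000 files) would fit the per-folder
# recommendation; it is the repository-wide total that does not.
print(f"Files per jurisdiction folder OK: {DOCS_PER_JURISDICTION < MAX_FILES_PER_FOLDER}")
```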

---

## ✅ THE SOLUTION: PARQUET FORMAT

**Instead of uploading 22 million PDFs, store the extracted data in Parquet files.**

### Why Parquet?

1. ✅ **Efficient** - Columnar storage, highly compressed
2. ✅ **Scalable** - Handles millions of rows in a single file
3. ✅ **Fast** - Optimized for filtering and querying
4. ✅ **Native** - `Dataset.push_to_hub` uploads data as Parquet, and `load_dataset` reads Parquet directly
5. ✅ **Small** - Extracted text in Parquet is often 10-100x smaller than the source PDFs

### Size Comparison:

```
❌ Bad: 22 million PDF files (30 TB)
   - Exceeds 100k file limit by 220x
   - Slow to upload/download
   - Impossible to manage

✅ Good: 220 Parquet files (25 GB compressed)
   - 1 file per jurisdiction type per state
   - Fast to query
   - Easy to manage
   - Within all limits
```

---
## RECOMMENDED STRUCTURE

### Option 1: Parquet Files (RECOMMENDED)

**Store all text content in Parquet tables:**

```python
import pandas as pd
from datasets import Dataset

# Instead of storing individual PDFs, store one row per meeting.
# `all_jurisdictions` is a placeholder: an iterable of objects with a `.meetings` list.
meetings_data = []
for jurisdiction in all_jurisdictions:
    for meeting in jurisdiction.meetings:
        # The literal values show the shape of one row; in practice, fill them from `meeting`.
        meetings_data.append({
            'jurisdiction_name': 'Tuscaloosa',
            'state': 'AL',
            'meeting_date': '2025-03-15',
            'meeting_title': 'City Council Regular Meeting',
            'agenda_text': 'extracted text from PDF...',  # ← TEXT, not PDF bytes
            'minutes_text': 'extracted minutes...',
            'video_url': 'https://youtube.com/watch?v=...',  # ← LINK, not video
            'source_url': 'https://tuscaloosaal.suiteonemedia.com/agenda.pdf',
            'keywords_found': ['fluoride', 'dental'],
            'is_oral_health_related': True
        })

# Convert to DataFrame
df = pd.DataFrame(meetings_data)

# Save a local Parquet copy (snappy compression)
df.to_parquet('meetings_all.parquet', compression='snappy')

# Upload to Hugging Face (push_to_hub serializes the dataset to Parquet itself)
dataset = Dataset.from_pandas(df)
dataset.push_to_hub("username/oral-health-policy-data", split="meetings")
```
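The `keywords_found` and `is_oral_health_related` columns have to be computed from the extracted text. One way to do that is a whole-word, case-insensitive match, sketched below. The keyword list and the `find_keywords` helper are assumptions made for illustration; they are not part of the original pipeline. Note that whole-word matching means "fluoridation" does not count as a hit for "fluoride", so list each variant you care about.

```python
import re

# Hypothetical keyword list; replace it with the terms your project tracks.
ORAL_HEALTH_KEYWORDS = ["fluoride", "dental", "oral health"]


def find_keywords(text, keywords=ORAL_HEALTH_KEYWORDS):
    """Return the keywords that occur in `text` as whole words, ignoring case."""
    found = []
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(keyword)
    return found


agenda_text = "Item 4: Resolution on community water fluoride levels and dental screenings."
keywords_found = find_keywords(agenda_text)
is_oral_health_related = bool(keywords_found)
print(keywords_found, is_oral_health_related)
```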
**File structure on Hugging Face:**

```
your-dataset/
├── discovery.parquet      # 1 file, ~1 GB (22k jurisdictions)
├── meetings.parquet       # 1 file, ~10 GB (500k meetings)
├── oral_health.parquet    # 1 file, ~2 GB (50k relevant docs)
└── README.md

Total: 3 files, 13 GB ✅ (vs 22 million files, 30 TB ❌)
```

This is the layout you get when the Parquet files are uploaded directly. `push_to_hub` instead writes sharded files, typically named like `data/meetings-00000-of-000NN.parquet`, so a 10 GB table becomes a few dozen files rather than one.

---
## 🎯 CORRECT WORKFLOW

### ❌ WRONG: Download & Upload PDFs

```python
# DON'T DO THIS!
for jurisdiction in all_jurisdictions:
    for meeting in get_meetings(jurisdiction):
        # Download PDF
        pdf_bytes = download_pdf(meeting.pdf_url)

        # Upload to Hugging Face
        upload_file(pdf_bytes, f"pdfs/{jurisdiction}/{meeting.id}.pdf")

# ❌ Results in 22 million files!
```
### ✅ CORRECT: Extract & Store Text in Parquet

```python
# DO THIS!
import io

import pandas as pd
from datasets import Dataset
from PyPDF2 import PdfReader  # the maintained successor package is `pypdf`

# `all_jurisdictions`, `get_meetings` and `download_pdf` are placeholders for your own code.
all_meetings = []
for jurisdiction in all_jurisdictions:
    for meeting in get_meetings(jurisdiction):
        # Download PDF temporarily
        pdf_bytes = download_pdf(meeting.pdf_url)

        # Extract text (don't store PDF!)
        pdf_reader = PdfReader(io.BytesIO(pdf_bytes))
        # Scanned pages may yield no text, so fall back to an empty string
        text = "\n".join(page.extract_text() or "" for page in pdf_reader.pages)

        # Store metadata + text (not PDF bytes)
        all_meetings.append({
            'id': f"{jurisdiction.name}_{meeting.date}_{meeting.id}",
            'jurisdiction': jurisdiction.name,
            'state': jurisdiction.state,
            'date': meeting.date,
            'title': meeting.title,
            'text': text,  # ← Extracted text
            'source_pdf_url': meeting.pdf_url,  # ← Link to original
            'file_size_kb': len(pdf_bytes) // 1024,
            'page_count': len(pdf_reader.pages)
        })

        # Drop the PDF bytes right away (free memory)
        del pdf_bytes

# Save all to single Parquet file
df = pd.DataFrame(all_meetings)
df.to_parquet('all_meetings.parquet', compression='snappy')

# Upload one table instead of 22 million PDFs!
dataset = Dataset.from_pandas(df)
dataset.push_to_hub("username/oral-health-meetings")
```

**Result:**

- ✅ 1 table in a few Parquet shards (not 22 million files)
- ✅ 10 GB (not 30 TB)
- ✅ Fast queries
- ✅ Easy downloads
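The `id` field above is built directly from `jurisdiction.name`, which may contain spaces, periods or other punctuation ("St. Louis", for example). If the ids are later reused as file names or archive keys, a normalized form is safer. The helper below is a hypothetical sketch, not part of the original workflow.

```python
import re


def make_meeting_id(jurisdiction_name, meeting_date, meeting_id):
    """Build a stable id: lower-case slug of the name, then date, then meeting id."""
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", jurisdiction_name.lower()).strip("-")
    return f"{slug}_{meeting_date}_{meeting_id}"


print(make_meeting_id("St. Louis", "2025-03-15", 42))
print(make_meeting_id("Tuscaloosa", "2025-03-15", 7))
```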

---

## 📦 PARTITIONED PARQUET (For Very Large Datasets)

If you have more than about 50 GB of data, partition by state:

```python
from pathlib import Path

import pandas as pd
from datasets import Dataset, DatasetDict

# Process state by state (`all_states`, `get_jurisdictions` and
# `process_jurisdiction` are placeholders for your own code)
for state in all_states:
    state_meetings = []
    for jurisdiction in get_jurisdictions(state):
        # Extract meetings for this jurisdiction
        meetings = process_jurisdiction(jurisdiction)
        state_meetings.extend(meetings)

    # Save one Parquet per state
    df = pd.DataFrame(state_meetings)
    df.to_parquet(f'meetings_{state}.parquet')

# Upload to Hugging Face with state-based splits
dataset_dict = {}
for state_file in Path('.').glob('meetings_*.parquet'):
    state = state_file.stem.split('_')[1]
    df = pd.read_parquet(state_file)
    dataset_dict[state] = Dataset.from_pandas(df)

# Upload all states (one split per state)
state_datasets = DatasetDict(dataset_dict)
state_datasets.push_to_hub("username/oral-health-meetings")
```

**File structure:**

```
your-dataset/
├── data/
│   ├── AL-00000-of-00001.parquet    # Alabama meetings
│   ├── CA-00000-of-00001.parquet    # California meetings
│   ├── TX-00000-of-00001.parquet    # Texas meetings
│   ...
└── README.md

Total: about 50 files (one per state; large states may be split into several shards) ✅
```

**Load specific state:**

```python
from datasets import load_dataset

# Load only the Alabama split
al_data = load_dataset("username/oral-health-meetings", split="AL")
```
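One pitfall with the `meetings_*.parquet` glob: it also matches files such as `meetings_all.parquet` from the earlier examples, which would then be uploaded as a split named `all`. A stricter parser avoids that. The sketch below assumes two-letter upper-case state codes; `state_from_path` is a hypothetical helper.

```python
import re
from pathlib import Path


def state_from_path(path):
    """Extract the state code from a file name like 'meetings_AL.parquet'."""
    match = re.fullmatch(r"meetings_([A-Z]{2})", Path(path).stem)
    if match is None:
        raise ValueError(f"Not a per-state file: {path}")
    return match.group(1)


candidates = ["meetings_AL.parquet", "meetings_TX.parquet", "meetings_all.parquet"]
states = []
for name in candidates:
    try:
        states.append(state_from_path(name))
    except ValueError:
        # Skip files that do not follow the per-state naming convention.
        pass
print(states)
```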

---

## 🗜️ COMPRESSION COMPARISON

### Parquet Compression:

```python
# Same data, different compression (sizes below are estimates)
df.to_parquet('meetings_snappy.parquet', compression='snappy')  # Fast, good compression
# Size: 8 GB

df.to_parquet('meetings_gzip.parquet', compression='gzip')      # Slower, better compression
# Size: 5 GB

df.to_parquet('meetings_brotli.parquet', compression='brotli')  # Slowest, best compression
# Size: 3 GB
```

**Recommendation:** Use `snappy` (the pandas default) for a good balance of speed and size.
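The speed-versus-size trade-off can be observed without any Parquet tooling. The sketch below uses codecs from the Python standard library as stand-ins; they are not the codecs Parquet uses for `snappy` or `brotli`, and the highly repetitive sample text exaggerates the compression ratios. Treat the output as an illustration of the trade-off only, and measure on a sample of your own data before choosing.

```python
import bz2
import lzma
import time
import zlib

# Repetitive sample text standing in for extracted meeting minutes.
sample = (
    "The City Council discussed community water fluoridation "
    "and approved the consent agenda. "
) * 20_000
raw = sample.encode("utf-8")

codecs = {
    "zlib level 1": lambda data: zlib.compress(data, 1),  # fast, lighter compression
    "zlib level 9": lambda data: zlib.compress(data, 9),  # slower, tighter
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}

results = {}
for name, compress in codecs.items():
    start = time.perf_counter()
    compressed = compress(raw)
    elapsed_ms = (time.perf_counter() - start) * 1000
    results[name] = len(compressed)
    print(f"{name:>12}: {len(compressed):>8,} bytes in {elapsed_ms:.1f} ms")

print(f"{'raw':>12}: {len(raw):>8,} bytes")
```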

---

## 🔢 SIZE ESTIMATES

### Real Numbers for 22,000 Jurisdictions:

| Data Type | Storage Method | Files | Size |
|-----------|----------------|-------|------|
| **PDFs (raw)** | Individual files | 22M | 30 TB ❌ |
| **PDFs (text)** | Parquet | 50 | 25 GB ✅ |
| **Oral health subset** | Parquet | 1 | 5 GB ✅ |
| **Discovery results** | Parquet | 1 | 1 GB ✅ |

**Total storage needed: ~30 GB (not 30 TB!)** ✅
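It helps to translate these totals into per-document averages, because the averages are easy to compare against a small sample of real documents. The sketch below only rearranges the figures from the table above (decimal units, 1 TB = 10^12 bytes); the constant names are illustrative. If your sample shows much more than roughly 1 KB of compressed text per document, the 25 GB figure is likely too low for your corpus.

```python
# Figures from the table above.
TOTAL_DOCS = 22_000_000
RAW_PDF_BYTES = 30e12       # 30 TB of raw PDFs
TEXT_PARQUET_BYTES = 25e9   # 25 GB of extracted text in Parquet

avg_pdf_mb = RAW_PDF_BYTES / TOTAL_DOCS / 1e6
avg_text_kb = TEXT_PARQUET_BYTES / TOTAL_DOCS / 1e3
reduction = RAW_PDF_BYTES / TEXT_PARQUET_BYTES

print(f"Implied average PDF size:        {avg_pdf_mb:.2f} MB")
print(f"Implied average compressed text: {avg_text_kb:.2f} KB")
print(f"Implied size reduction:          {reduction:,.0f}x")
```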

---

## 💡 ALTERNATIVE: WebDataset Format

For image-heavy or binary data, use WebDataset `.tar` files:

```python
import json

import webdataset as wds

# Create sharded tar files (at most 10,000 samples per shard)
sink = wds.ShardWriter("meetings-%06d.tar", maxcount=10000)

for jurisdiction in all_jurisdictions:
    for meeting in jurisdiction.meetings:
        # Extract text from PDF (`extract_text` is a placeholder for your own helper)
        text = extract_text(meeting.pdf_url)

        sink.write({
            "__key__": f"{jurisdiction.name}_{meeting.id}",
            "txt": text.encode('utf-8'),
            "json": json.dumps(meeting.metadata).encode('utf-8')
        })

sink.close()

# Results in:
# meetings-000000.tar (10k documents)
# meetings-000001.tar (10k documents)
# ...
# meetings-002199.tar (remaining documents)
# Total: ~2,200 tar files ✅ (under 10k file limit per folder)
```
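The shard count follows from the document total and the `maxcount` value. A short check, assuming the 22 million documents estimated earlier and zero-based shard numbering as produced by the `%06d` pattern:

```python
import math

TOTAL_DOCS = 22_000_000
MAX_PER_SHARD = 10_000  # same value as maxcount in the ShardWriter call

shard_count = math.ceil(TOTAL_DOCS / MAX_PER_SHARD)
# Shards are numbered from 0, so the last index is shard_count - 1.
last_shard = f"meetings-{shard_count - 1:06d}.tar"

print(shard_count, last_shard)
```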

---

## 🎯 RECOMMENDED APPROACH

### For Your Project:

**1. Store Metadata + Text in Parquet (Primary)**

```python
import pandas as pd
from datasets import Dataset

# Structure your data (each [...] stands for a list of column values)
meetings_df = pd.DataFrame({
    'id': [...],
    'jurisdiction': [...],
    'state': [...],
    'date': [...],
    'title': [...],
    'agenda_text': [...],   # Extracted text
    'minutes_text': [...],  # Extracted text
    'source_url': [...],    # Link to original PDF
    'video_url': [...],     # Link to YouTube
    'oral_health_keywords': [...]
})

# Save as Parquet
meetings_df.to_parquet('meetings.parquet', compression='snappy')

# Upload to Hugging Face (~10 GB)
dataset = Dataset.from_pandas(meetings_df)
dataset.push_to_hub("username/oral-health-meetings")
```

**2. Partition by State (If >50 GB)**

```python
from datasets import Dataset, DatasetDict

# One Parquet per state
for state in all_states:
    state_df = meetings_df[meetings_df['state'] == state]
    state_df.to_parquet(f'meetings_{state}.parquet')

# Upload with one split per state
dataset_dict = DatasetDict({
    state: Dataset.from_pandas(
        meetings_df[meetings_df['state'] == state], preserve_index=False
    )
    for state in all_states
})
dataset_dict.push_to_hub("username/oral-health-meetings")

# Total: about 50 files (one per state) ✅
```

**3. Never Upload Individual PDFs**

```python
# ❌ NEVER do this
for pdf in all_pdfs:
    upload_file(pdf)  # Results in millions of files

# ✅ ALWAYS do this (`all_pdf_urls` is a placeholder list parallel to `all_pdfs`)
rows = []
for pdf, pdf_url in zip(all_pdfs, all_pdf_urls):
    text = extract_text(pdf)
    rows.append({'text': text, 'source_url': pdf_url})
pd.DataFrame(rows).to_parquet('data.parquet')  # One file
```

---
## UPDATED UPLOAD SCRIPT

```python
#!/usr/bin/env python3
"""
Correctly upload large-scale data to Hugging Face using Parquet format.
"""
import io
from pathlib import Path

import pandas as pd
from datasets import Dataset
from huggingface_hub import login
from PyPDF2 import PdfReader

# `all_jurisdictions`, `download_pdf` and `extract_date` are placeholders
# that your own project code is expected to provide.


def process_and_upload_correct_way():
    """Process jurisdictions and upload as Parquet (not individual files)."""
    all_meetings = []

    # Process all jurisdictions
    for jurisdiction in all_jurisdictions:
        print(f"Processing {jurisdiction.name}...")

        for agenda_url in jurisdiction.agenda_urls:
            # Download PDF temporarily
            pdf_bytes = download_pdf(agenda_url)

            # Extract text (pages without extractable text contribute "")
            pdf_reader = PdfReader(io.BytesIO(pdf_bytes))
            text = "\n".join(page.extract_text() or "" for page in pdf_reader.pages)

            # Store metadata + text (NOT PDF bytes)
            all_meetings.append({
                'jurisdiction': jurisdiction.name,
                'state': jurisdiction.state,
                'date': extract_date(text),
                'text': text,
                'source_url': agenda_url,
                'page_count': len(pdf_reader.pages)
            })

            # Delete PDF immediately to keep memory use low
            del pdf_bytes

    # Convert to DataFrame
    df = pd.DataFrame(all_meetings)

    # Save as Parquet (compressed)
    df.to_parquet('all_meetings.parquet', compression='snappy')

    print(f"Total meetings: {len(df)}")
    print(f"File size: {Path('all_meetings.parquet').stat().st_size / 1e9:.2f} GB")

    # Upload to Hugging Face (Parquet shards instead of millions of PDFs!)
    dataset = Dataset.from_pandas(df)
    dataset.push_to_hub("username/oral-health-meetings")

    print("✅ Uploaded all meetings as Parquet!")


if __name__ == "__main__":
    login()  # authenticate with a Hugging Face access token
    process_and_upload_correct_way()
```
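The script above keeps every row in memory until the end, which may not be practical at the scale of millions of documents. One option is to process rows in fixed-size batches and write each batch out before starting the next. The generator below is a standard-library sketch of the batching step only; `batched` is a hypothetical helper, and writing each batch to its own Parquet file is left as a comment because it depends on your pipeline.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to `batch_size` items until the iterable is exhausted."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for a stream of meeting rows.
rows = ({"row": i} for i in range(25))

shard_sizes = []
for shard_index, batch in enumerate(batched(rows, 10)):
    # In a real pipeline, each batch could be written to its own Parquet file here,
    # e.g. a file named with `shard_index`, and then released from memory.
    shard_sizes.append(len(batch))

print(shard_sizes)
```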

---

## ✅ SUMMARY

### Do This:

1. ✅ Extract text from PDFs (don't store PDF bytes)
2. ✅ Store in Parquet format (1-50 files total)
3. ✅ Link to original sources (not duplicate content)
4. ✅ Compress with snappy
5. ✅ Partition by state if >50 GB

### Don't Do This:

1. ❌ Upload individual PDFs (millions of files)
2. ❌ Store video files (link to YouTube)
3. ❌ Duplicate raw content
4. ❌ Exceed 100k file limit
5. ❌ Use uncompressed formats

### Result:

- **22 million files → 50 files** ✅
- **30 TB → 30 GB** ✅
- **Slow uploads → Fast uploads** ✅
- **Hard to manage → Easy to manage** ✅
- **Expensive → FREE** ✅

**You can store ALL 22,000 jurisdictions in ~50 Parquet files totaling 30 GB!**