open-navigator / docs /HUGGINGFACE_QUICK_START.md

jcbowyer

Deploy: Consolidated gold tables, fixed nginx docs routing

896453f verified 28 days ago

	# 🚀 QUICK START: FREE STORAGE WITH HUGGING FACE

	TL;DR: Store your public datasets for FREE on Hugging Face, as long as you stay within its file-count and storage guidelines!

	⚠️ IMPORTANT: Use Parquet format, NOT individual PDFs! See [file limits guide](HUGGINGFACE_FILE_LIMITS.md)

	---

	## ⚡ 3-MINUTE SETUP

	### 1. Create Hugging Face Account (1 minute)
	```bash
	# Go to https://huggingface.co/join
	# Sign up (FREE)
	# Verify email
	```

	### 2. Get API Token (1 minute)
	```bash
	# Go to https://huggingface.co/settings/tokens
	# Click "New token"
	# Name it "oral-health-upload"
	# Token Type: Write (required for publishing datasets)
	# Repository permissions: All repositories
	# Copy the token (hf_xxxxxxxxxxxx)
	```

	⚠️ Important: Token Permissions
	- Write access required for publishing datasets
	- Read access sufficient for downloading public datasets only
	- For this project: Use Write token to publish your scraped data

	### 3. Install & Login (1 minute)
	```bash
	pip install huggingface_hub datasets

	# Set your token (huggingface_hub and datasets read HF_TOKEN from the environment)
	export HF_TOKEN="hf_YOUR_TOKEN_HERE"
	```
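Before running any upload, it can help to confirm that the token is actually visible to Python. The sketch below is a minimal local check, not part of this project's scripts: `read_hf_token` and `mask` are hypothetical helper names, and the `hf_` prefix test only assumes that tokens follow the `hf_xxxxxxxxxxxx` pattern shown in step 2. It does not contact Hugging Face, so it cannot tell whether the token is valid or has Write permission.

```python
import os


def read_hf_token(env=os.environ):
    """Return the HF_TOKEN value, or raise with a hint if it is missing or malformed."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError('HF_TOKEN is not set; run: export HF_TOKEN="hf_..."')
    if not token.startswith("hf_"):
        # Assumption: tokens follow the hf_... pattern shown in step 2.
        raise RuntimeError("HF_TOKEN does not start with the expected hf_ prefix")
    return token


def mask(token):
    """Show only the prefix and the last 4 characters so the secret is not printed in full."""
    return token[:3] + "..." + token[-4:]


# Example with a made-up token passed in explicitly instead of the real environment.
print(mask(read_hf_token({"HF_TOKEN": "hf_abc123456"})))
```

Printing a masked value is a small safeguard against leaking the token into terminal logs or screenshots.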

	---

	## ⚠️ CRITICAL: FILE LIMITS

	Hugging Face Limits:
	- Files per folder: <10,000
	- Total files per repo: <100,000
	- For large datasets: Use Parquet or WebDataset format

	Your Scale:
	- 22,000 jurisdictions × 1,000 docs = 22 MILLION files ❌

	Solution:
	- Extract text from PDFs
	- Store in Parquet format
	- Result: 50 files instead of 22 million ✅

	See detailed guide: [HUGGINGFACE_FILE_LIMITS.md](HUGGINGFACE_FILE_LIMITS.md)
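The arithmetic behind "50 files instead of 22 million" can be sketched as follows. The two limits are the ones quoted above. The shard size of 440,000 documents per Parquet file is a hypothetical value, chosen only because it reproduces the 50-file figure. `plan_shards` and `fits_limits` are illustrative names, not functions from this project.

```python
# Limits quoted in this guide.
MAX_FILES_PER_FOLDER = 10_000
MAX_FILES_PER_REPO = 100_000


def plan_shards(total_docs, docs_per_shard):
    """Return how many Parquet shards are needed to hold total_docs rows."""
    if docs_per_shard <= 0:
        raise ValueError("docs_per_shard must be positive")
    # Integer ceiling division: a partial last shard still counts as one file.
    return (total_docs + docs_per_shard - 1) // docs_per_shard


def fits_limits(file_count):
    """Check a flat, single-folder layout against both limits."""
    return file_count < MAX_FILES_PER_FOLDER and file_count < MAX_FILES_PER_REPO


total_docs = 22_000 * 1_000  # 22 million documents, as estimated above
shards = plan_shards(total_docs, docs_per_shard=440_000)  # hypothetical shard size

print(fits_limits(total_docs))      # one file per document: False
print(shards, fits_limits(shards))  # 50 True
```

In practice the shard size would be chosen from the average text length per document rather than from a target file count.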

	---

	## 📤 UPLOAD YOUR DATA

	### Option 1: Use the Upload Script (Recommended)

	For discovery data:

	```bash
	# Go to your project
	cd /home/developer/projects/open-navigator

	# Activate environment
	source venv/bin/activate

	# Upload discovery results
	python scripts/upload_to_huggingface.py \
	    --repo "YOUR_USERNAME/oral-health-policy-data" \
	    --discovery

	# View your dataset
	# https://huggingface.co/datasets/YOUR_USERNAME/oral-health-policy-data
	```

	For meeting PDFs (extract text first!):

	```bash
	# DON'T upload individual PDFs!
	# Instead, extract text and save as Parquet

	# 1. Create a file with PDF URLs (one per line)
	cat > pdf_urls.txt << EOF
	https://tuscaloosaal.suiteonemedia.com/agenda1.pdf
	https://tuscaloosaal.suiteonemedia.com/agenda2.pdf
	...
	EOF

	# 2. Process PDFs to Parquet (extracts text, deletes PDFs)
	python scripts/upload_to_huggingface.py \
	    --repo "YOUR_USERNAME/oral-health-policy-data" \
	    --process-pdfs pdf_urls.txt

	# 3. Upload the Parquet file (1 file, not thousands!)
	python scripts/upload_to_huggingface.py \
	    --repo "YOUR_USERNAME/oral-health-policy-data" \
	    --meetings meetings_processed.parquet
	```
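To make the "text plus URL in one table" idea concrete, here is a rough sketch of how a URL list could be turned into table rows and written as a single Parquet file. This is not the implementation of `scripts/upload_to_huggingface.py`, whose internals are not shown in this guide. The function names and the column names (`source_url`, `host`, `text`, `n_chars`) are assumptions made for illustration, and the PDF text extraction step itself is left out.

```python
from urllib.parse import urlparse


def read_pdf_urls(lines):
    """Yield http(s) URLs, skipping blank lines, comments, and placeholder lines."""
    for line in lines:
        url = line.strip()
        parts = urlparse(url)
        if parts.scheme in ("http", "https") and parts.netloc:
            yield url


def make_record(url, text):
    """Build one table row: extracted text plus a link back to the original PDF."""
    return {
        "source_url": url,
        "host": urlparse(url).netloc,
        "text": text,
        "n_chars": len(text),
    }


def write_parquet(records, path):
    """Write all rows to a single Parquet file (requires pandas and pyarrow)."""
    import pandas as pd  # imported lazily so the helpers above work without it

    pd.DataFrame(records).to_parquet(path, index=False)


# Example: the placeholder and blank lines are ignored.
lines = ["https://tuscaloosaal.suiteonemedia.com/agenda1.pdf\n", "\n", "...\n"]
rows = [make_record(url, "Call to order") for url in read_pdf_urls(lines)]
print(rows[0]["host"], rows[0]["n_chars"])
```

Keeping `source_url` in every row is what lets you drop the PDF itself while still pointing readers to the original document.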

	### Option 2: Upload from Python

	```python
	import os

	import pandas as pd
	from datasets import Dataset
	from huggingface_hub import login

	# Log in with the token exported in step 3 (avoid hardcoding secrets in scripts)
	login(token=os.environ["HF_TOKEN"])

	# Load your data
	df = pd.read_csv('data/bronze/discovered_sources/discovery_summary_final.csv')

	# Convert to dataset
	dataset = Dataset.from_pandas(df)

	# Upload to Hugging Face (FREE!)
	dataset.push_to_hub("YOUR_USERNAME/oral-health-policy-data", split="discovery")

	print("✅ Data uploaded! View at:")
	print("https://huggingface.co/datasets/YOUR_USERNAME/oral-health-policy-data")
	```

	---

	## 💰 COST BREAKDOWN

	| What You Get | Cost |
	|--------------|------|
	| Storage for public datasets (within Hugging Face's guidelines) | FREE |
	| Downloads of public datasets | FREE |
	| Built-in viewer | FREE |
	| Version control | FREE |
	| Search & filtering | FREE |
	| API access | FREE |
	| TOTAL | $0/month ✅ |

	---

	## 📊 STORAGE COMPARISON

	### Bad Approach (Expensive)
	```
	❌ Download all videos: 250 TB = $5,000/month
	❌ Store all PDFs: 30 TB = $600/month
	❌ Total: $5,600/month 💸
	```

	### Good Approach (FREE)
	```
	✅ Store discovery data: 1 GB = FREE
	✅ Store extracted text: 25 GB = FREE
	✅ Store oral health subset: 5 GB = FREE
	✅ Total: $0/month 🎉
	```

	Savings: $5,600/month → $0/month
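The figures above are consistent with a flat rate of $20 per TB per month, which is what both "bad approach" lines imply ($5,000 / 250 TB and $600 / 30 TB). The short sketch below just reproduces that arithmetic. The rate is derived from this guide's own numbers, not from any provider's price list, so treat it as an assumption when estimating your own costs.

```python
def monthly_cost(terabytes, usd_per_tb_month):
    """Cost of keeping `terabytes` in paid storage for one month."""
    return terabytes * usd_per_tb_month


# Rate implied by the video line above: $5,000 for 250 TB.
RATE = 5_000 / 250  # 20.0 USD per TB per month (assumed flat rate)

videos = monthly_cost(250, RATE)
pdfs = monthly_cost(30, RATE)
total = videos + pdfs

print(videos, pdfs, total)  # 5000.0 600.0 5600.0
```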

	---

	## 🎯 WHAT TO UPLOAD

	### ✅ Upload These:

	1. Discovery Results (~1 GB)
	   - Jurisdiction websites
	   - YouTube channels
	   - Meeting platforms
	   - Social media links

	2. Meeting Metadata (~2 GB)
	   - Meeting dates/titles
	   - Agenda item lists
	   - Source URLs

	3. Extracted Text (~25 GB)
	   - Text from PDFs
	   - Meeting transcripts
	   - Filtered for oral health

	### ❌ Don't Upload These:

	1. Videos - Link to YouTube instead
	2. Full PDFs - Store text + URL to original
	3. Website HTML - Just store the data you extracted
	4. Duplicates - Filter first
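One simple way to "filter first" is to drop records whose extracted text is identical after normalizing case and whitespace. The sketch below assumes each record is a dict with a `text` field. That layout, the function names, and the `example.org` URLs are illustrative, not taken from this project. It only catches exact duplicates after normalization, so near-duplicates such as an amended agenda would need a fuzzier method.

```python
import hashlib


def content_key(text):
    """Hash lowercased, whitespace-normalized text so trivially different copies collide."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(records):
    """Keep the first record for each distinct text; later copies are discarded."""
    seen = set()
    unique = []
    for record in records:
        key = content_key(record["text"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    {"source_url": "https://example.org/a.pdf", "text": "Agenda: water fluoridation"},
    {"source_url": "https://example.org/b.pdf", "text": "agenda:  water   fluoridation"},
    {"source_url": "https://example.org/c.pdf", "text": "Budget hearing"},
]
unique = drop_duplicates(records)
print(len(unique))  # 2
```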

	---

	## 📝 EXAMPLE WORKFLOW

	### Step 1: Run Discovery
	```bash
	# Discover all Alabama jurisdictions
	python discovery/comprehensive_discovery_pipeline.py --state AL

	# Output: data/bronze/discovered_sources/discovery_summary_AL.csv (~50 KB)
	```

	### Step 2: Upload to Hugging Face
	```bash
	# Upload discovery results
	python scripts/upload_to_huggingface.py \
	    --repo "YOUR_USERNAME/oral-health-policy-data" \
	    --discovery
	```

	### Step 3: Free Up Local Space
	```bash
	# Optional: once you have confirmed the upload on the dataset page, delete the local copies
	rm -f data/bronze/discovered_sources/*.csv

	# You can always download from Hugging Face later!
	```

	### Step 4: Share & Analyze
	```python
	# Anyone can now use your data (including you!)
	from datasets import load_dataset

	data = load_dataset("YOUR_USERNAME/oral-health-policy-data", split="discovery")
	alabama = data.filter(lambda x: x['state'] == 'AL')

	print(f"Alabama jurisdictions: {len(alabama)}")
	```

	---

	## 🔄 CONTINUOUS WORKFLOW

	### Keep Local Storage Low (~100 MB)

	```python
	import os

	# download_agenda, extract_text, and upload_to_hf are placeholders for your own helpers
	# Process one jurisdiction at a time
	for jurisdiction in all_jurisdictions:
	    # 1. Download PDF (2 MB); returns the local file path
	    pdf = download_agenda(jurisdiction)

	    # 2. Extract text (50 KB)
	    text = extract_text(pdf)

	    # 3. Upload to Hugging Face
	    upload_to_hf(text)

	    # 4. Delete local file
	    os.remove(pdf)

	# Local storage: Never exceeds 100 MB! ✅
	```
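One caveat with the loop above: if every call to the upload step creates its own file, the repository drifts back toward the file-count limits described earlier. A possible refinement, sketched below under that assumption, is to buffer extracted texts and hand them to the upload step in batches. `batched`, `run_pipeline`, and the `flush` callback are hypothetical names. In a real pipeline `flush` might write one Parquet shard and upload it.

```python
def batched(items, batch_size):
    """Yield lists of up to batch_size items, so each upload carries many documents."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


def run_pipeline(texts, batch_size, flush):
    """Send texts to flush() one batch at a time; return the number of flush calls."""
    calls = 0
    for batch in batched(texts, batch_size):
        flush(batch)  # hypothetical callback, e.g. write one shard and upload it
        calls += 1
    return calls


# Example: 2,500 documents in batches of 1,000, collected in memory instead of uploaded.
uploaded = []
calls = run_pipeline((f"doc {i}" for i in range(2_500)), batch_size=1_000, flush=uploaded.append)
print(calls, [len(b) for b in uploaded])  # 3 [1000, 1000, 500]
```

Because the input is consumed lazily, only one batch of text is held in memory at a time, which keeps the local footprint small in the same spirit as the loop above.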

	---

	## 📚 HUGGING FACE BASICS

	### Load Your Data Anywhere

	```python
	from datasets import load_dataset

	# Load on your laptop
	data = load_dataset("YOUR_USERNAME/oral-health-policy-data")

	# Or in Google Colab (FREE GPU)
	# Or on a friend's computer
	# Or 5 years from now

	# Your data is always available, forever, for FREE!
	```

	### Search & Filter

	```python
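	from datasets import load_dataset

	data = load_dataset("YOUR_USERNAME/oral-health-policy-data", split="discovery")

	# Find cities with YouTube channels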
	with_youtube = data.filter(lambda x: x['youtube_channels'] > 0)

	# Find high-quality sources
	high_quality = data.filter(lambda x: x['completeness'] > 0.8)

	# Find specific state
	indiana = data.filter(lambda x: x['state'] == 'IN')
	```

	### Download Subset

	```python
	from datasets import load_dataset

	# Only download what you need (save bandwidth)
	oral_health_only = load_dataset(
	    "YOUR_USERNAME/oral-health-policy-data",
	    split="oral_health",  # Only the filtered subset
	)

	# Maybe only 5 GB instead of 50 GB!
	```

	---

	## ✅ BENEFITS

	### 1. FREE Storage for Public Datasets
	- No charge for hosting public datasets (within Hugging Face's storage guidelines)
	- No charge for downloads
	- No expiry date on hosted data

	### 2. Accessible Anywhere
	- Download from any computer
	- Share with collaborators
	- Use in Google Colab

	### 3. Version Control
	- Git-based system
	- Track all changes
	- Revert if needed

	### 4. Discovery
	- Your dataset appears in Hugging Face search
	- Other researchers can use it
	- Builds your portfolio

	### 5. Integration
	- Works with PyTorch, TensorFlow
	- Built-in data viewer
	- API access

	---

	## 🎓 LEARN MORE

	### Official Docs
	- Hugging Face Datasets: https://huggingface.co/docs/datasets/
	- Quick Start: https://huggingface.co/docs/datasets/quickstart
	- Upload Guide: https://huggingface.co/docs/datasets/upload_dataset

	### Examples
	- MeetingBank: https://huggingface.co/datasets/huuuyeah/meetingbank
	- Browse Datasets: https://huggingface.co/datasets

	---

	## 🆘 TROUBLESHOOTING

	### "Authentication failed"
	```bash
	# Make sure token is set
	echo $HF_TOKEN

	# If empty, set it
	export HF_TOKEN="hf_YOUR_TOKEN"

	# Or login interactively
	huggingface-cli login
	```

	### "Permission denied"
	```bash
	# Make sure repo name includes your username
	# ✅ Correct: "myusername/oral-health-policy-data"
	# ❌ Wrong: "oral-health-policy-data"
	```
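A quick local check can catch the missing-username mistake before any upload is attempted. The pattern below is a simplified assumption about what a `username/dataset-name` identifier looks like (letters, digits, `.`, `_`, `-`). The Hub applies its own, more detailed naming rules, so passing this check does not guarantee the name is accepted. `is_namespaced_repo_id` is a hypothetical helper name.

```python
import re

# Assumption: each part starts with a letter or digit and contains only
# letters, digits, ".", "_" or "-". The Hub's real rules may differ.
_PART = r"[A-Za-z0-9][A-Za-z0-9._-]*"
_REPO_ID = re.compile(rf"{_PART}/{_PART}")


def is_namespaced_repo_id(repo_id):
    """Return True if repo_id has exactly the form 'namespace/name'."""
    return _REPO_ID.fullmatch(repo_id) is not None


print(is_namespaced_repo_id("myusername/oral-health-policy-data"))  # True
print(is_namespaced_repo_id("oral-health-policy-data"))             # False
```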

	### "Dataset too large"
	```python
	# Don't upload raw files!
	# Upload processed/filtered data only

	# ❌ Bad: Upload 50 GB of PDFs
	# ✅ Good: Upload 5 GB of extracted text
	```

	---

	## 🎯 NEXT STEPS

	1. ✅ Create Hugging Face account
	2. ✅ Get API token
	3. ✅ Run discovery for your state
	4. ✅ Upload to Hugging Face
	5. ✅ Delete local files to free space
	6. ✅ Scale to all 22,000+ jurisdictions!

	Your data is safe in the cloud, FREE, forever! 🎉

	---

	## 💡 PRO TIP

	Make your dataset public (not private):
	- ✅ FREE storage
	- ✅ Helps research community
	- ✅ Builds your portfolio
	- ✅ Appears in search results

	Private datasets are limited to 100 GB and can't be found or reused by other researchers!

	Public = Win-Win-Win 🏆