Deploy to HuggingFace Spaces

What Got Fixed

The HuggingFace Spaces deployment error was caused by the cross-encoder reranker model loading during app initialization. This has been fixed with lazy loading: the model now loads only when the first query is processed.
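The lazy-loading fix can be sketched like this. Names (`LazyModel`, `loader`) are illustrative, not the project's actual identifiers — the point is only that the expensive load happens on first access, not at startup:

```python
class LazyModel:
    """Defer an expensive model load until first use (a sketch, not the app's code)."""

    def __init__(self, loader):
        self._loader = loader    # e.g. lambda: CrossEncoder("cross-encoder/...")
        self._model = None

    @property
    def model(self):
        if self._model is None:          # first access: load and cache
            self._model = self._loader()
        return self._model               # later accesses: cached instance
```

At app initialization only the cheap `LazyModel(...)` wrapper is created; the model weights are pulled in when the first query touches `.model`.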

Deployment Steps

1. Push to Main Branch

# Make sure you're on dev
git checkout dev

# Merge to main
git checkout main
git merge dev

# Push to GitHub (which auto-deploys to HF Spaces)
git push origin main

2. HuggingFace Spaces Will Auto-Deploy

Your repo is connected to HF Spaces, so it will automatically:

  1. Pull the latest code
  2. Install requirements from requirements.txt
  3. Start the app via app.py

3. Monitor Deployment

Go to your HF Space and check the logs. You should see:

✅ App initializes without loading reranker
✅ Gradio interface created
Running on: https://your-space.hf.space

The reranker will only load when someone submits the first query.

What the App Does

Mode Toggle Interface

Users can choose between:

Simple (Fast) - Default

  • Single LLM call
  • ~5-15 seconds
  • Uses GPT-5-mini with reasoning_effort="minimal"

Multi-Agent (Quality)

  • Full pipeline: Intent + Compose + Fact-Check
  • ~2 minutes
  • Async parallelism (15.8% faster)
  • Higher accuracy
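A minimal sketch of how such a mode toggle might dispatch, with stand-in coroutines in place of the real LLM calls (all function names here are assumptions, not the app's API). The independent intent and compose stages run concurrently, which is where the async speedup comes from:

```python
import asyncio

# Stand-ins for the real LLM-backed stages
async def single_llm_call(q):        return f"simple:{q}"
async def classify_intent(q):        return "faq"
async def compose_answer(q):         return f"draft:{q}"
async def fact_check(draft, intent): return f"checked:{draft} ({intent})"

async def answer(query: str, mode: str = "Simple") -> str:
    if mode == "Simple":
        return await single_llm_call(query)   # one fast call
    intent, draft = await asyncio.gather(     # independent stages in parallel
        classify_intent(query),
        compose_answer(query),
    )
    return await fact_check(draft, intent)    # fact-check needs both results
```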

Environment Variables on HF Spaces

Make sure these are set in your Space settings:

OPENAI_API_KEY=your-key-here
LLM_MODEL=gpt-5-mini
USE_PARALLEL=true
AGENT_TIMEOUT=120
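For reference, the app presumably reads these with something like the sketch below (variable names are from this doc; the defaults and function shape are assumptions):

```python
import os

def load_settings(env=os.environ):
    """Read Space settings; only OPENAI_API_KEY is strictly required."""
    return {
        "api_key": env["OPENAI_API_KEY"],                        # required, no default
        "model": env.get("LLM_MODEL", "gpt-5-mini"),
        "parallel": env.get("USE_PARALLEL", "true").lower() == "true",
        "timeout": int(env.get("AGENT_TIMEOUT", "120")),         # seconds
    }
```

On HF Spaces, set `OPENAI_API_KEY` as a secret (not a plain variable) so it is not exposed in the Space settings.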

Troubleshooting HF Spaces

"Can't load cross-encoder model"

  • ✅ Fixed! Reranker now lazy-loads

"Module not found: _griffe"

  • ✅ Fixed! Lazy imports in __init__.py files

Timeout errors

  • Increase AGENT_TIMEOUT=180 in Space settings
  • Or switch to LLM_MODEL=gpt-4o-mini

Out of memory

  • HF Spaces has limited RAM
  • The reranker model (~100MB) loads on first query
  • Consider using smaller RETRIEVAL_TOP_K=3
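Capping `RETRIEVAL_TOP_K` bounds how many chunks reach the reranker, which limits both memory and latency. A sketch, assuming a list of `(score, chunk)` pairs (the function name and data shape are illustrative):

```python
import os

TOP_K = int(os.getenv("RETRIEVAL_TOP_K", "3"))   # doc's suggested free-tier value

def rerank_top_k(scored_chunks, k=TOP_K):
    """Keep only the k highest-scoring (score, chunk) pairs."""
    return sorted(scored_chunks, key=lambda sc: sc[0], reverse=True)[:k]
```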

Cost Optimization

For HF Spaces (free tier):

LLM_MODEL=gpt-4o-mini      # Cheapest, fastest
USE_PARALLEL=false         # Simpler for free tier
SKIP_FACT_CHECK=true       # Save API calls

For best quality:

LLM_MODEL=gpt-5-mini       # Good balance
USE_PARALLEL=true          # 15% faster
AGENT_TIMEOUT=120

Monitoring

Check your Space logs for:

  • "Loading cross-encoder" = First query processed
  • "✅ Parallel pipeline completed" = Multi-agent working
  • Timeout errors = Increase AGENT_TIMEOUT

Next Steps

After deployment:

  1. Test both Simple and Multi-Agent modes
  2. Monitor API costs in OpenAI dashboard
  3. Adjust timeouts based on actual performance
  4. Consider switching models based on usage patterns