Deploy to HuggingFace Spaces

What Got Fixed

The HuggingFace Spaces deployment error was caused by the cross-encoder reranker model loading during app initialization. This has been fixed with lazy loading: the model now loads only when the first query is processed.
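The lazy-loading fix can be sketched like this. Names (`LazyModel`, `loader`) are illustrative, not the project's actual identifiers — the point is only that the expensive load happens on first access, not at startup:

```python
class LazyModel:
    """Defer an expensive model load until first use (a sketch, not the app's code)."""

    def __init__(self, loader):
        self._loader = loader    # e.g. lambda: CrossEncoder("cross-encoder/...")
        self._model = None

    @property
    def model(self):
        if self._model is None:          # first access: load and cache
            self._model = self._loader()
        return self._model               # later accesses: cached instance
```

At app initialization only the cheap `LazyModel(...)` wrapper is created; the model weights are pulled in when the first query touches `.model`.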

Deployment Steps

1. Push to Main Branch

# Make sure you're on dev
git checkout dev

# Merge to main
git checkout main
git merge dev

# Push to GitHub (which auto-deploys to HF Spaces)
git push origin main

2. HuggingFace Spaces Will Auto-Deploy

Your repo is connected to HF Spaces, so it will automatically:

  1. Pull the latest code
  2. Install requirements from requirements.txt
  3. Start the app via app.py

3. Monitor Deployment

Go to your HF Space and check the logs. You should see:

✅ App initializes without loading reranker
✅ Gradio interface created
Running on: https://your-space.hf.space

The reranker will only load when someone submits the first query.

What the App Does

Mode Toggle Interface

Users can choose between:

Simple (Fast) - Default

  • Single LLM call
  • ~5-15 seconds
  • Uses GPT-5-mini with reasoning_effort="minimal"

Multi-Agent (Quality)

  • Full pipeline: Intent + Compose + Fact-Check
  • ~2 minutes
  • Async parallelism (15.8% faster)
  • Higher accuracy
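A minimal sketch of how such a mode toggle might dispatch, with stand-in coroutines in place of the real LLM calls (all function names here are assumptions, not the app's API). The independent intent and compose stages run concurrently, which is where the async speedup comes from:

```python
import asyncio

# Stand-ins for the real LLM-backed stages
async def single_llm_call(q):        return f"simple:{q}"
async def classify_intent(q):        return "faq"
async def compose_answer(q):         return f"draft:{q}"
async def fact_check(draft, intent): return f"checked:{draft} ({intent})"

async def answer(query: str, mode: str = "Simple") -> str:
    if mode == "Simple":
        return await single_llm_call(query)   # one fast call
    intent, draft = await asyncio.gather(     # independent stages in parallel
        classify_intent(query),
        compose_answer(query),
    )
    return await fact_check(draft, intent)    # fact-check needs both results
```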

Environment Variables on HF Spaces

Make sure these are set in your Space settings:

OPENAI_API_KEY=your-key-here
LLM_MODEL=gpt-5-mini
USE_PARALLEL=true
AGENT_TIMEOUT=120
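For reference, the app presumably reads these with something like the sketch below (variable names are from this doc; the defaults and function shape are assumptions):

```python
import os

def load_settings(env=os.environ):
    """Read Space settings; only OPENAI_API_KEY is strictly required."""
    return {
        "api_key": env["OPENAI_API_KEY"],                        # required, no default
        "model": env.get("LLM_MODEL", "gpt-5-mini"),
        "parallel": env.get("USE_PARALLEL", "true").lower() == "true",
        "timeout": int(env.get("AGENT_TIMEOUT", "120")),         # seconds
    }
```

On HF Spaces, set `OPENAI_API_KEY` as a secret (not a plain variable) so it is not exposed in the Space settings.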

Troubleshooting HF Spaces

"Can't load cross-encoder model"

  • ✅ Fixed! Reranker now lazy-loads

"Module not found: _griffe"

  • ✅ Fixed! Lazy imports in __init__.py files

Timeout errors

  • Increase AGENT_TIMEOUT=180 in Space settings
  • Or switch to LLM_MODEL=gpt-4o-mini

Out of memory

  • HF Spaces has limited RAM
  • The reranker model (~100MB) loads on first query
  • Consider using smaller RETRIEVAL_TOP_K=3
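Capping `RETRIEVAL_TOP_K` bounds how many chunks reach the reranker, which limits both memory and latency. A sketch, assuming a list of `(score, chunk)` pairs (the function name and data shape are illustrative):

```python
import os

TOP_K = int(os.getenv("RETRIEVAL_TOP_K", "3"))   # doc's suggested free-tier value

def rerank_top_k(scored_chunks, k=TOP_K):
    """Keep only the k highest-scoring (score, chunk) pairs."""
    return sorted(scored_chunks, key=lambda sc: sc[0], reverse=True)[:k]
```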

Cost Optimization

For HF Spaces (free tier):

LLM_MODEL=gpt-4o-mini      # Cheapest, fastest
USE_PARALLEL=false         # Simpler for free tier
SKIP_FACT_CHECK=true       # Save API calls

For best quality:

LLM_MODEL=gpt-5-mini       # Good balance
USE_PARALLEL=true          # 15% faster
AGENT_TIMEOUT=120

Monitoring

Check your Space logs for:

  • "Loading cross-encoder" = First query processed
  • "✅ Parallel pipeline completed" = Multi-agent working
  • Timeout errors = Increase AGENT_TIMEOUT

Next Steps

After deployment:

  1. Test both Simple and Multi-Agent modes
  2. Monitor API costs in OpenAI dashboard
  3. Adjust timeouts based on actual performance
  4. Consider switching models based on usage patterns