Deploy to HuggingFace Spaces
What Got Fixed
The HuggingFace Spaces deployment error was caused by the cross-encoder reranker model loading during app initialization. This has been fixed with lazy loading: the model now loads only when the first query is processed.
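A minimal sketch of the lazy-loading pattern (the names `_reranker` and `get_reranker` are illustrative, not the app's actual identifiers): the heavy model is built on first use instead of at import time, so startup stays fast.

```python
# Lazy-loading sketch: the expensive object is created on first call,
# then cached for every call after that.
_reranker = None

def get_reranker(loader):
    """Return the cached reranker, building it via `loader` on first use."""
    global _reranker
    if _reranker is None:
        # e.g. loader = lambda: CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
        _reranker = loader()
    return _reranker
```

The first query pays the load cost once; every later call reuses the cached model, and app startup never touches it.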
Deployment Steps
1. Push to Main Branch
# Make sure you're on dev
git checkout dev
# Merge to main
git checkout main
git merge dev
# Push to GitHub (which auto-deploys to HF Spaces)
git push origin main
2. HuggingFace Spaces Will Auto-Deploy
Your repo is connected to HF Spaces, so it will automatically:
- Pull the latest code
- Install requirements from requirements.txt
- Start the app via app.py
3. Monitor Deployment
Go to your HF Space and check the logs. You should see:
✅ App initializes without loading reranker
✅ Gradio interface created
Running on: https://your-space.hf.space
The reranker will only load when someone submits the first query.
What the App Does
Mode Toggle Interface
Users can choose between:
Simple (Fast) - Default
- Single LLM call
- ~5-15 seconds
- Uses GPT-5-mini with reasoning_effort="minimal"
Multi-Agent (Quality)
- Full pipeline: Intent + Compose + Fact-Check
- ~2 minutes
- Async parallelism (15.8% faster)
- Higher accuracy
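The parallel stage can be sketched with asyncio.gather. This is a generic illustration, not the app's actual code: the agent names are placeholders, and the real pipeline's dependency order (e.g. Compose waiting on Intent) may differ.

```python
import asyncio

async def run_agent(name: str, delay: float) -> str:
    """Stand-in for an LLM-backed agent call (placeholder name)."""
    await asyncio.sleep(delay)  # simulates the API round-trip
    return f"{name}: done"

async def pipeline() -> list:
    # Independent steps run concurrently, so wall time is the slowest
    # step rather than the sum of all steps -- the source of the speedup.
    return await asyncio.gather(
        run_agent("intent", 0.01),
        run_agent("fact_check", 0.01),
    )

results = asyncio.run(pipeline())
```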
Environment Variables on HF Spaces
Make sure these are set in your Space settings:
OPENAI_API_KEY=your-key-here
LLM_MODEL=gpt-5-mini
USE_PARALLEL=true
AGENT_TIMEOUT=120
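Inside the app, these settings would typically be read with os.getenv. A sketch, assuming the variable names above and fallback defaults of my own choosing:

```python
import os

# Read Space settings from the environment; the defaults here are
# assumptions, matching the values suggested above.
LLM_MODEL = os.getenv("LLM_MODEL", "gpt-5-mini")
USE_PARALLEL = os.getenv("USE_PARALLEL", "true").lower() == "true"
AGENT_TIMEOUT = int(os.getenv("AGENT_TIMEOUT", "120"))
```

Note that Space secrets and variables arrive as strings, so booleans and integers must be parsed explicitly as shown.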
Troubleshooting HF Spaces
"Can't load cross-encoder model"
- ✅ Fixed! Reranker now lazy-loads
"Module not found: _griffe"
- ✅ Fixed! Lazy imports in __init__.py files
Timeout errors
- Increase AGENT_TIMEOUT=180 in Space settings
- Or switch to LLM_MODEL=gpt-4o-mini
Out of memory
- HF Spaces has limited RAM
- The reranker model (~100MB) loads on first query
- Consider a smaller retrieval depth, e.g. RETRIEVAL_TOP_K=3
Cost Optimization
For HF Spaces (free tier):
LLM_MODEL=gpt-4o-mini # Cheapest, fastest
USE_PARALLEL=false # Simpler for free tier
SKIP_FACT_CHECK=true # Save API calls
For best quality:
LLM_MODEL=gpt-5-mini # Good balance
USE_PARALLEL=true # 15% faster
AGENT_TIMEOUT=120
Monitoring
Check your Space logs for:
- "Loading cross-encoder" = First query processed
- "✅ Parallel pipeline completed" = Multi-agent working
- Timeout errors = Increase AGENT_TIMEOUT
Next Steps
After deployment:
- Test both Simple and Multi-Agent modes
- Monitor API costs in OpenAI dashboard
- Adjust timeouts based on actual performance
- Consider switching models based on usage patterns