Spaces:
Sleeping
π€ Available Models - Feature Guide
What Changed?
We added 4 different free HuggingFace models that you can switch between in real-time. Each has different trade-offs.
Available Models
1. β‘ Qwen 2.5 7B (Default)
- Speed: β‘β‘β‘ Very Fast
- Quality: βββββ Excellent
- Model:
Qwen/Qwen2.5-7B-Instruct - Best for: Production use, balanced quality & speed
- Size: 7B parameters
- Why it's first: Most reliable on free-tier HuggingFace
2. π Gemma 2 9B
- Speed: β‘β‘β‘ Fast
- Quality: βββββ Excellent
- Model:
google/gemma-2-9b-it - Best for: High-quality responses
- Size: 9B parameters
- Note: Google model, very reliable
3. π Mistral 7B
- Speed: β‘β‘β‘ Fast
- Quality: ββββ Very Good
- Model:
mistralai/Mistral-7B-Instruct-v0.2 - Best for: Balanced option, newer version
- Size: 7B parameters
- Note: Good alternative to Qwen
4. βοΈ TinyLlama 1.1B
- Speed: β‘β‘β‘β‘β‘ Ultra Fast
- Quality: βββ Good
- Model:
TinyLlama/TinyLlama-1.1B-Chat-v1.0 - Best for: Testing, learning, super fast responses
- Size: 1.1B parameters (tiny!)
- Note: Much smaller, faster, but less capable
How Switching Works
User Selects Model:
1. User picks "Gemma 2 9B" from dropdown
2. Frontend sends: {"model": "google/gemma-2-9b-it"}
3. Backend receives request
4. LangGraph agent uses Gemma instead of Qwen
Smart Fallback (NEW!):
If the selected model fails, the agent automatically tries fallback models:
User selected: Qwen
β
Qwen fails
β
Try: Gemma 2 9B
β
Gemma works!
β
Returns result using Gemma
(User sees: "Model switched to Gemma")
Fallback order:
- Gemma 2 9B
- Mistral 7B
- TinyLlama 1.1B
When to Use Each Model
| Scenario | Model | Why |
|---|---|---|
| Production | Qwen 2.5 | Reliable + fast + free-tier |
| Quality matters | Gemma 2 | Highest quality responses |
| Testing | TinyLlama | Instant responses |
| Unsure | Qwen 2.5 | Default is always safe |
Code Changes Made
1. app.py - Added 4 Models
AVAILABLE_MODELS = [
{"id":"Qwen/Qwen2.5-7B-Instruct","name":"Qwen 2.5 7B","badge":"β‘ Fast & Reliable"},
{"id":"google/gemma-2-9b-it","name":"Gemma 2 9B","badge":"π Quality"},
{"id":"mistralai/Mistral-7B-Instruct-v0.2","name":"Mistral 7B","badge":"π Balanced"},
{"id":"TinyLlama/TinyLlama-1.1B-Chat-v1.0","name":"TinyLlama 1.1B","badge":"βοΈ Lightweight"},
]
FALLBACK_MODEL = "Qwen/Qwen2.5-7B-Instruct"
2. agent/llm.py - Added Fallback Logic
def call_llm_with_fallback(client, primary_model, fallback_models, messages, emit_token):
"""Try primary model, then fallback models if it fails"""
# Tries models in order until one works
# Returns (result, model_used)
3. agent/nodes.py - Using Fallback
full_text, used_model = call_llm_with_fallback(
client, state["model_name"], FALLBACK_MODELS, messages, emit_token
)
# Logs if model was switched
if used_model != state["model_name"]:
ev.emit(sid, {"type":"model_switch","from":state["model_name"],"to":used_model})
4. templates/index.html - Already Works!
The dropdown automatically loads all 4 models from backend.
Testing Models
Try each model with the same question to see differences:
Test Question: "What is your return policy?"
- Qwen 2.5 - Fast, concise answer
- Gemma 2 - Detailed, thorough answer
- Mistral 7B - Clear, structured answer
- TinyLlama - Shorter but good answer
Next Steps You Can Do
Add more models from HuggingFace:
{"id":"meta-llama/Llama-2-7b-chat-hf","name":"Llama 2","badge":"π¦"},Compare models on your own use cases
Use TinyLlama for testing (instant responses)
Set Gemma as default if you prefer quality:
AVAILABLE_MODELS = [Gemma, ...] # Move Gemma to top
Important Notes
β All models are free - HuggingFace Inference API free tier
β No API keys needed - Uses your HF_TOKEN (already set)
β Auto-fallback - Never get stuck without response
β Easy to switch - Just select from dropdown
β Not real APIs - All are HuggingFace inference endpoints
β Free tier limits - May have rate limits (but generous)
How Free-Tier HF Works
Your Space
β
HuggingFace Inference API (Free)
β
ββ Qwen 2.5 7B β
ββ Gemma 2 9B β
ββ Mistral 7B β
ββ TinyLlama β
All free, no payment needed!
Your HF_TOKEN gives you access to all these models.
Performance Tips
- Use TinyLlama for testing (1.1B = super fast)
- Use Qwen for production (best balance)
- Use Gemma when you need better quality
- Use Mistral as middle ground
Speed comparison:
- TinyLlama: ~500ms
- Qwen: ~1-2s
- Gemma: ~2-3s
- Mistral: ~2-3s
Enjoy your multi-model support! π