Spaces:

mnoorchenar
/

langgraph-support-agent

Sleeping

1. User picks "Gemma 2 9B" from dropdown
2. Frontend sends: {"model": "google/gemma-2-9b-it"}
3. Backend receives request
4. LangGraph agent uses Gemma instead of Qwen

Smart Fallback (NEW!):

If the selected model fails, the agent automatically tries fallback models:

User selected: Qwen
         ↓
    Qwen fails
         ↓
    Try: Gemma 2 9B
         ↓
    Gemma works!
         ↓
    Returns result using Gemma
    (User sees: "Model switched to Gemma")

Fallback order:

Gemma 2 9B
Mistral 7B
TinyLlama 1.1B

When to Use Each Model

Scenario	Model	Why
Production	Qwen 2.5	Reliable + fast + free-tier
Quality matters	Gemma 2	Highest quality responses
Testing	TinyLlama	Instant responses
Unsure	Qwen 2.5	Default is always safe

Code Changes Made

1. `app.py` - Added 4 Models

AVAILABLE_MODELS = [
    {"id":"Qwen/Qwen2.5-7B-Instruct","name":"Qwen 2.5 7B","badge":"⚡ Fast & Reliable"},
    {"id":"google/gemma-2-9b-it","name":"Gemma 2 9B","badge":"💎 Quality"},
    {"id":"mistralai/Mistral-7B-Instruct-v0.2","name":"Mistral 7B","badge":"🌀 Balanced"},
    {"id":"TinyLlama/TinyLlama-1.1B-Chat-v1.0","name":"TinyLlama 1.1B","badge":"⚙️ Lightweight"},
]

FALLBACK_MODEL = "Qwen/Qwen2.5-7B-Instruct"

2. `agent/llm.py` - Added Fallback Logic

def call_llm_with_fallback(client, primary_model, fallback_models, messages, emit_token):
    """Try primary model, then fallback models if it fails"""
    # Tries models in order until one works
    # Returns (result, model_used)

3. `agent/nodes.py` - Using Fallback

full_text, used_model = call_llm_with_fallback(
    client, state["model_name"], FALLBACK_MODELS, messages, emit_token
)

# Logs if model was switched
if used_model != state["model_name"]:
    ev.emit(sid, {"type":"model_switch","from":state["model_name"],"to":used_model})

4. `templates/index.html` - Already Works!

The dropdown automatically loads all 4 models from backend.

Testing Models

Try each model with the same question to see differences:

Test Question: "What is your return policy?"

Qwen 2.5 - Fast, concise answer
Gemma 2 - Detailed, thorough answer
Mistral 7B - Clear, structured answer
TinyLlama - Shorter but good answer

Next Steps You Can Do

Add more models from HuggingFace:

{"id":"meta-llama/Llama-2-7b-chat-hf","name":"Llama 2","badge":"🦙"},

Compare models on your own use cases
Use TinyLlama for testing (instant responses)

Set Gemma as default if you prefer quality:

AVAILABLE_MODELS = [Gemma, ...]  # Move Gemma to top

Important Notes

✅ All models are free - HuggingFace Inference API free tier

✅ No API keys needed - Uses your HF_TOKEN (already set)

✅ Auto-fallback - Never get stuck without response

✅ Easy to switch - Just select from dropdown

❌ Not real APIs - All are HuggingFace inference endpoints

❌ Free tier limits - May have rate limits (but generous)

How Free-Tier HF Works

Your Space
    ↓
HuggingFace Inference API (Free)
    ↓
    ├─ Qwen 2.5 7B ✓
    ├─ Gemma 2 9B ✓
    ├─ Mistral 7B ✓
    └─ TinyLlama ✓

All free, no payment needed!

Your HF_TOKEN gives you access to all these models.

Performance Tips

Use TinyLlama for testing (1.1B = super fast)
Use Qwen for production (best balance)
Use Gemma when you need better quality
Use Mistral as middle ground

Speed comparison:

TinyLlama: ~500ms
Qwen: ~1-2s
Gemma: ~2-3s
Mistral: ~2-3s

Enjoy your multi-model support! 🚀

🤖 Available Models - Feature Guide

What Changed?

Available Models

1. ⚡ Qwen 2.5 7B (Default)

2. 💎 Gemma 2 9B

3. 🌀 Mistral 7B

4. ⚙️ TinyLlama 1.1B

How Switching Works

User Selects Model: