---
title: Cascade - Intelligent LLM Router
emoji: 🌊
colorFrom: purple
colorTo: blue
sdk: streamlit
sdk_version: 1.31.0
app_file: app.py
pinned: false
---
# Cascade 🌊

**Intelligent LLM Request Router** - Reduce API costs by 60%+ through smart routing and semantic caching.
## What is Cascade?
Cascade is an intelligent proxy that automatically routes LLM requests to the most cost-effective model based on query complexity:
- **Simple queries** → Free local models (Ollama) or GPT-3.5
- **Medium queries** → GPT-4o-mini ($0.15/1M tokens)
- **Complex queries** → GPT-4o ($2.50/1M tokens)
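As a rough illustration of the tiering above, a complexity score from the classifier can be mapped to the cheapest adequate model. This is a sketch only: the thresholds and model identifiers are assumptions, not Cascade's actual configuration.

```python
# Hypothetical tiered routing: a complexity score in [0, 1] is mapped to the
# cheapest model tier that can handle it. Bounds and names are illustrative.
ROUTES = [
    (0.33, "ollama/llama3"),  # simple  -> free local model
    (0.66, "gpt-4o-mini"),    # medium  -> $0.15/1M tokens
    (1.00, "gpt-4o"),         # complex -> $2.50/1M tokens
]

def route(complexity: float) -> str:
    """Return the first tier whose upper bound covers the complexity score."""
    for upper_bound, model in ROUTES:
        if complexity <= upper_bound:
            return model
    return ROUTES[-1][1]  # clamp out-of-range scores to the top tier
```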
## Features
- 🎯 **ML-Powered Routing** - Predicts query complexity in <20ms
- 💰 **60%+ Cost Savings** - Routes simple queries to cheaper models
- ⚡ **Semantic Caching** - Vector similarity search for cached responses
- 📊 **Real-Time Analytics** - Dashboard showing savings and usage metrics
- 🔌 **OpenAI Compatible** - Drop-in replacement for OpenAI API
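The semantic-caching idea can be shown with a minimal in-memory version: store (embedding, response) pairs and serve a cached response when the cosine similarity to a past query's embedding clears a threshold. Cascade itself uses Qdrant for this; the pure-Python `SemanticCache` below, including its 0.9 threshold, is a toy stand-in, and embeddings are assumed to come from an external model such as Sentence Transformers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Serve a cached response when a past query is 'close enough' in embedding space."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, embedding):
        best = max(self.entries, key=lambda e: cosine(e[0], embedding), default=None)
        if best and cosine(best[0], embedding) >= self.threshold:
            return best[1]
        return None  # cache miss -> caller falls through to the LLM

    def put(self, embedding, response):
        self.entries.append((embedding, response))
```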
## Using This Space
This Space provides a demo UI where you can:
- **Test Routing** - See how different queries get routed to different models
- **View Analytics** - Track cost savings and cache hit rates
- **Interactive Chat** - Try example queries and see real-time responses
## Configuration
To use this with real LLM APIs, set these environment variables:
- `OPENAI_API_KEY` - Your OpenAI API key
- `OLLAMA_BASE_URL` - Ollama server URL (optional)
- `REDIS_URL` - Redis connection for caching (optional)
- `QDRANT_URL` - Qdrant server for semantic cache (optional)
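Since Cascade advertises itself as an OpenAI-compatible drop-in, configuration and a quick smoke test might look like the following. The host, port, and `"model": "auto"` value are assumptions for illustration; check the actual server defaults before relying on them.

```shell
export OPENAI_API_KEY="sk-..."                    # your OpenAI API key
export OLLAMA_BASE_URL="http://localhost:11434"   # optional: local models
export REDIS_URL="redis://localhost:6379"         # optional: exact-match cache
export QDRANT_URL="http://localhost:6333"         # optional: semantic cache

# Point any OpenAI-compatible client at Cascade instead of api.openai.com
# (port and model name are illustrative):
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "Hello"}]}'
```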
## Learn More
### Architecture

```
Request → Cache Check → ML Classifier → Route to Model
               ↓              ↓                ↓
           Semantic      Simple/Med/    Ollama/GPT-4o-mini/
             Cache         Complex           GPT-4o
```
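The diagrammed flow can be sketched as one small, dependency-injected function. The parameter names (`cache`, `classify`, `route`, `call_model`) are placeholders standing in for Cascade's actual components, not its real API.

```python
def handle_request(prompt, cache, classify, route, call_model):
    """Sketch of the request pipeline: cache check -> classify -> route -> call."""
    # 1. Cache check: a hit (exact or semantic) skips the LLM call entirely.
    cached = cache.get(prompt)
    if cached is not None:
        return cached

    # 2. The ML classifier scores complexity; 3. the router picks a model tier.
    model = route(classify(prompt))

    # 4. Call the chosen model and populate the cache for future requests.
    response = call_model(model, prompt)
    cache.put(prompt, response)
    return response
```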
### Tech Stack
- Backend: FastAPI, Python 3.11+
- ML: DistilBERT (ONNX), Sentence Transformers
- Caching: Redis (exact match), Qdrant (semantic)
- UI: Streamlit, Plotly
- Deployment: Docker, Hugging Face Spaces
Built with ❤️ by ayushm98