title: LLMOpt
emoji: 🚀
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
LLMOpt: The Adaptive Inference Optimization Framework (V2)
Intelligent Routing. Minimal Latency. Maximum ROI.
In the era of sprawling Large Language Models (LLMs), routing every query to a flagship model like GPT-4o or Claude 3.5 Sonnet is financially unsustainable and computationally wasteful.
LLMOpt is an enterprise-grade middleware layer that sits between your application and your LLM providers. By dynamically analyzing the semantic complexity of incoming queries, LLMOpt automatically selects the most cost-effective model capable of handling the request, compresses context windows to reduce token waste, and caches responses—all while giving you full observability into its decision-making process.
Your App → llmopt.generate(query)
→ [Semantic Cache → NLI Analyze → GBR Estimate → Bayesian Optimize → LLMLingua Compress → Route]
→ LLM API → Response
Table of Contents
- The V2 Architecture
- Core ML Components
- Graceful Degradation
- Quick Start & Installation
- Python SDK Usage
- REST API Integration
- Supported Providers & Models
- Explainability & Observability
The V2 Architecture
LLMOpt V2 has transitioned from a static, heuristic-based router to a fully Machine Learning-powered pipeline. The framework acts as an intelligent funnel, progressively optimizing the request before it ever reaches an LLM provider.
flowchart TD
A[Incoming Query] --> B(Semantic Cache)
B -->|Cache Hit| Z[Return Cached Response]
B -->|Cache Miss| C(Query Analyzer)
C --> D(Complexity Estimator)
D --> E(Optimization Engine)
E --> F(Prompt Optimizer)
F --> G(Model Router)
G --> H((LLM Provider))
H --> I[LLM-as-a-Judge Evaluator]
I -->|Feedback Loop| E
I --> Z
Pipeline Stages
- Semantic Cache: Checks Redis for highly similar past queries.
- Query Analyzer: Extracts structural features and semantic domains from the prompt.
- Complexity Estimator: Predicts the cognitive load required to answer the query (0.0 to 1.0).
- Optimization Engine: Minimizes a cost/quality objective function and picks the model with the lowest (best) score.
- Prompt Optimizer: Intelligently compresses the prompt to shed unnecessary tokens.
- Model Router: Dispatches the request via LiteLLM to OpenAI, Anthropic, Google, Ollama, etc.
- Evaluator (Optional): Scores the response quality and feeds it back to the optimization engine.
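The funnel described above can be sketched as a chain of stages with a cache short-circuit in front. This is a minimal illustration, not LLMOpt's actual code: the stage functions, the `run_pipeline` helper, the exact-match dictionary cache, and the `fake_provider` stub are all hypothetical stand-ins.

```python
def run_pipeline(query, cache, stages, call_model):
    """Run the query through each stage unless the cache already has an answer."""
    cached = cache.get(query)
    if cached is not None:
        return cached, ["semantic_cache:hit"]

    trace = ["semantic_cache:miss"]
    state = {"query": query}
    for name, stage in stages:
        state = stage(state)      # each stage returns an enriched copy of the state
        trace.append(name)

    response = call_model(state)
    cache[query] = response       # store the answer for future identical queries
    return response, trace


def analyze(state):
    # Placeholder for the Query Analyzer: tag a coarse domain.
    domain = "code" if "implement" in state["query"].lower() else "general"
    return {**state, "domain": domain}


def estimate(state):
    # Placeholder for the Complexity Estimator: a score in [0.0, 1.0].
    return {**state, "complexity": min(len(state["query"].split()) / 20, 1.0)}


def choose_model(state):
    # Placeholder for the Optimization Engine: a simple complexity threshold.
    return {**state, "model": "flagship" if state["complexity"] > 0.5 else "small"}


def fake_provider(state):
    # Stand-in for the provider call; no network access.
    return f"[{state['model']}] answer to: {state['query']}"


STAGES = [
    ("query_analyzer", analyze),
    ("complexity_estimator", estimate),
    ("optimization_engine", choose_model),
]

cache = {}
first, first_trace = run_pipeline("What is Python?", cache, STAGES, fake_provider)
second, second_trace = run_pipeline("What is Python?", cache, STAGES, fake_provider)
```

The second call returns from the cache without touching any later stage, which is the property that makes the cache the first step in the pipeline.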
Core ML Components
The V2 release introduces state-of-the-art machine learning to every layer of the pipeline:
1. Zero-Shot NLI Query Analyzer
Instead of relying on brittle regex patterns to determine if a query is asking for "code" or "math," LLMOpt utilizes HuggingFace's cross-encoder/nli-distilroberta-base. This semantic reasoning engine accurately categorizes query intent on the fly without requiring labeled datasets.
2. Sentence-Transformer Semantic Cache
Before spending API credits, the framework embeds the incoming query using a lightweight, local all-MiniLM-L6-v2 model and compares it against a Redis-backed vector store using cosine similarity. If an existing query matches with a cosine similarity above 0.95, the cached response is served with no API cost and near-zero latency.
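The lookup logic can be illustrated with a small in-memory sketch. It assumes embeddings are plain lists of floats and replaces the Redis-backed store with a Python list; `ToySemanticCache` and the three-dimensional toy vectors are hypothetical, and the real embeddings would come from the sentence-transformer model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToySemanticCache:
    """In-memory stand-in for the Redis-backed store (illustrative only)."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, embedding, response):
        self.entries.append((embedding, response))

    def get(self, embedding):
        """Return the most similar cached response above the threshold, else None."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None


cache = ToySemanticCache()
cache.put([1.0, 0.0, 0.0], "TCP is connection-oriented; UDP is not.")

hit = cache.get([0.99, 0.05, 0.0])  # nearly the same direction as the cached vector
miss = cache.get([0.0, 1.0, 0.0])   # orthogonal to the cached vector
```

A linear scan is enough for a sketch; a production store would typically use an indexed vector search instead.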
3. Gradient Boosting Complexity Estimator
To predict how "hard" a query is, LLMOpt leverages a scikit-learn Gradient Boosting Regressor (GBR) trained on hundreds of annotated examples. The predicted score sets the minimum capability a model needs, so that "What is Python?" gets routed to a fast, cheap model, while "Implement a distributed Paxos consensus algorithm" gets routed to a flagship reasoning model.
4. Bayesian Weight Optimization (Optuna)
The Optimization Engine selects models by minimizing the objective function:
J(x) = α·Cost + β·Tokens - γ·Quality
Instead of hardcoding α, β, and γ, LLMOpt integrates Optuna. Using feedback scores from the LLM evaluator, Optuna applies Bayesian optimization to keep adjusting these weights, steering routing toward higher-quality responses at lower cost.
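To make the objective concrete, the sketch below evaluates J for three candidates and picks the minimum. The capability scores come from the model table in this README; everything else is an assumption made for illustration: `cost_norm` is each model's input price divided by gpt-4o's input price, `tokens_norm` is fixed at 0.5 for every model, the weights are arbitrary, and the `min_quality` floor is a hypothetical way to express a required capability threshold.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    cost_norm: float    # normalized cost in [0, 1] (assumed scaling)
    tokens_norm: float  # normalized token usage in [0, 1] (assumed scaling)
    quality: float      # capability score from the model registry


def objective(c, alpha, beta, gamma):
    """J(x) = alpha*Cost + beta*Tokens - gamma*Quality; lower is better."""
    return alpha * c.cost_norm + beta * c.tokens_norm - gamma * c.quality


def select_model(candidates, alpha, beta, gamma, min_quality=0.0):
    """Pick the candidate with the lowest J among those meeting the quality floor."""
    eligible = [c for c in candidates if c.quality >= min_quality]
    if not eligible:
        raise ValueError("no candidate meets the quality floor")
    return min(eligible, key=lambda c: objective(c, alpha, beta, gamma))


candidates = [
    Candidate("gpt-4o", cost_norm=1.00, tokens_norm=0.5, quality=0.930),
    Candidate("gpt-4o-mini", cost_norm=0.06, tokens_norm=0.5, quality=0.784),
    Candidate("gemini-1.5-flash", cost_norm=0.03, tokens_norm=0.5, quality=0.742),
]

# Cost-heavy weights favor the cheapest model.
cheap = select_model(candidates, alpha=0.6, beta=0.3, gamma=0.1)

# A quality floor removes the cheaper models from consideration.
strict = select_model(candidates, alpha=0.6, beta=0.3, gamma=0.1, min_quality=0.9)
```

With these numbers, J is 0.657 for gpt-4o, 0.1076 for gpt-4o-mini, and 0.0938 for gemini-1.5-flash, so the unconstrained choice is gemini-1.5-flash and the constrained one is gpt-4o.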
5. LLMLingua Semantic Compression
Large context windows are expensive. LLMOpt integrates Microsoft's llmlingua-2 to perform semantic token pruning. It identifies and removes non-essential tokens (filler words, redundant context) from the prompt while preserving the core semantic meaning, reducing input costs by up to 40% before the LLM is even called.
6. LLM-as-a-Judge Evaluation Loop
When explicitly requested (evaluate=True), LLMOpt uses a highly efficient judge model (gpt-4o-mini) to score the returned response across Accuracy, Completeness, Clarity, and Conciseness. This score is automatically fed back into the Bayesian Optimizer to improve future routing decisions.
Graceful Degradation
Enterprise systems must be resilient. LLMOpt is designed to never crash if an ML dependency is missing or unavailable.
If you choose not to install the heavy [ml] dependencies (like PyTorch or sentence-transformers), or if your Redis cache goes offline, LLMOpt silently and seamlessly falls back to its robust V1 heuristic rules. This ensures that your application continues to route requests efficiently under all circumstances.
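One common way to implement this kind of fallback is an optional import guarded by `try`/`except ImportError`. The sketch below shows the pattern only; `load_optional`, `classify_domain`, and the keyword rules in `heuristic_domain` are hypothetical and are not LLMOpt's actual V1 rules.

```python
import importlib


def load_optional(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def heuristic_domain(query):
    """Keyword fallback used when no ML backend is available (illustrative rules)."""
    lowered = query.lower()
    if any(word in lowered for word in ("function", "implement", "code")):
        return "code"
    if any(word in lowered for word in ("solve", "integral", "equation")):
        return "math"
    return "general"


def classify_domain(query, ml_classifier=None):
    """Use the ML classifier when one is supplied, otherwise fall back to keywords."""
    if ml_classifier is not None:
        return ml_classifier(query)
    return heuristic_domain(query)


# A missing dependency yields None instead of raising, so callers can branch on it.
missing = load_optional("definitely_not_installed_llmopt_dep")
present = load_optional("json")

domain = classify_domain("Implement a binary search function")
```

The same branching applies to external services: a failed Redis connection can be treated like a missing module, with the pipeline continuing on the cache-miss path.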
Quick Start & Installation
Requirements
- Python 3.10+
- At least one API key (OpenAI, Anthropic, Google, Mistral, DeepSeek) OR a local Ollama instance.
Installation
# Clone the repository
git clone https://github.com/Shrot101/llmopt.git
cd llmopt
# Install with all Machine Learning capabilities (Highly Recommended for V2)
pip install -e ".[ml]"
# Install Core only (uses V1 heuristic fallbacks)
pip install -e .
# Install with Local Model support
pip install -e ".[ml,local]"
Configuration
Copy the environment template and add your API keys. You only need to provide keys for the providers you intend to use.
cp config/.env.example config/.env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=AIza...
OLLAMA_API_BASE=http://localhost:11434
# Required for V2 Semantic Caching
REDIS_URL=redis://localhost:6379/0
Python SDK Usage
Basic Generation
from llmopt import LLMOpt
client = LLMOpt()
# The framework handles analysis, optimization, and routing automatically
result = client.generate(
    query="Explain the difference between TCP and UDP",
    budget_mode="balanced",  # Options: "cheap" | "balanced" | "quality"
)
print(result.response)
print(f"Model used : {result.model_used}")
print(f"Cost : ${result.estimated_cost:.6f}")
print(f"Tokens saved: {result.tokens_saved}")
Advanced Constraints & Evaluation
result = client.generate(
    query="Design a highly available distributed rate limiter.",
    budget_mode="quality",
    # Hard cap: never spend more than this per request (USD)
    max_cost_per_request=0.01,
    # Provider filtering
    exclude_providers=["openai"],
    only_providers=["anthropic", "google"],
    # Opt-in to the LLM-as-a-judge feedback loop
    evaluate=True,
    # dry_run=True → runs full optimization pipeline but skips the actual API call
    dry_run=False,
)

# result.evaluation is only populated when evaluate=True
if result.evaluation:
    print(f"Quality Score: {result.evaluation.overall}/10")
REST API Integration
LLMOpt includes a built-in FastAPI server for easy integration into non-Python architectures.
Start the server
python run.py --host 0.0.0.0 --port 8000
POST /generate
Request Body:
{
"query": "Explain quantum computing",
"budget_mode": "balanced",
"api_keys": {
"openai": "sk-...",
"anthropic": "sk-ant-...",
"google": "AIza..."
}
}
BYOK (Bring Your Own Key) Mode: This server is configured as a public utility. It provides the Routing Intelligence and Shared Semantic Cache, but you must provide your own provider API keys in the api_keys object. The server does not store your keys; they are used only for the duration of the request.
Example cURL
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{
"query": "Write a recursive Fibonacci function in Rust",
"budget_mode": "balanced",
"evaluate": true
}'
Response payload includes deep insights into the optimization process:
{
"response": "Here is the Rust implementation...",
"model_used": "claude-3-5-haiku-20241022",
"provider": "anthropic",
"input_tokens": 105,
"output_tokens": 342,
"total_tokens": 447,
"estimated_cost": 0.001452,
"tokens_saved": 28,
"compression_ratio": 0.21,
"complexity_score": 0.62,
"complexity_tier": "hard",
"latency_ms": 1140,
"evaluation": {
"overall": 9.5,
"accuracy": 10.0,
"feedback": "The code is idiomatic and correctly implements recursion."
}
}
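Python callers that do not use the SDK can send the same request with the standard library. This is a minimal sketch that assumes the server is running at http://localhost:8000 and accepts the fields shown in the request body above; the helper names `build_generate_request` and `generate` are hypothetical.

```python
import json
import urllib.request


def build_generate_request(base_url, query, budget_mode="balanced",
                           evaluate=False, api_keys=None):
    """Build a POST /generate request without sending it."""
    payload = {"query": query, "budget_mode": budget_mode, "evaluate": evaluate}
    if api_keys:
        payload["api_keys"] = api_keys  # BYOK: keys travel with each request
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, query, timeout=60, **options):
    """Send the request and decode the JSON response body."""
    request = build_generate_request(base_url, query, **options)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_generate_request(
    "http://localhost:8000",
    "Write a recursive Fibonacci function in Rust",
    evaluate=True,
)
```

Calling `generate("http://localhost:8000", "Explain quantum computing")` against a running server should return a dictionary with the fields shown in the sample payload, such as `model_used` and `estimated_cost`.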
Supported Providers & Models
The routing engine dynamically compares models across providers based on their unified capability scores and per-token pricing. Add or update models simply by modifying data/model_registry.json.
| Model | Provider | Input ($/1k tokens) | Output ($/1k tokens) | Capability | Best For |
|---|---|---|---|---|---|
| gpt-4o | OpenAI | $0.0025 | $0.010 | 0.930 | Complex reasoning |
| gpt-4o-mini | OpenAI | $0.00015 | $0.0006 | 0.784 | Balanced tasks |
| claude-3-5-sonnet-20241022 | Anthropic | $0.003 | $0.015 | 0.934 | Coding, analysis |
| claude-3-5-haiku-20241022 | Anthropic | $0.0008 | $0.004 | 0.794 | Fast tasks |
| gemini-1.5-flash | Google | $0.000075 | $0.0003 | 0.742 | Cheapest cloud |
| mistral-large-latest | Mistral | $0.003 | $0.009 | 0.852 | EU + quality |
| deepseek-chat | DeepSeek | $0.00014 | $0.00028 | 0.887 | Best value math/code |
| llama3.1:70b | Ollama | FREE | FREE | 0.823 | Local high-quality |
(See the registry file for the complete list of supported models).
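The per-1k prices translate into a request cost with simple arithmetic. The sketch below reproduces the estimated_cost from the sample response in the REST section (105 input tokens and 342 output tokens on claude-3-5-haiku-20241022). The `compression_ratio` helper is an assumption: it reads the ratio as tokens saved divided by the pre-compression prompt size, which happens to match the sample figures, but the project may define it differently.

```python
def estimate_cost(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Cost in USD given token counts and per-1k-token prices."""
    input_cost = (input_tokens / 1000) * input_price_per_1k
    output_cost = (output_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost


def compression_ratio(tokens_saved, tokens_sent):
    """Assumed definition: share of the original prompt removed by compression."""
    original = tokens_saved + tokens_sent
    return tokens_saved / original if original else 0.0


# claude-3-5-haiku-20241022: $0.0008 per 1k input tokens, $0.004 per 1k output tokens
cost = estimate_cost(105, 342, 0.0008, 0.004)
ratio = compression_ratio(tokens_saved=28, tokens_sent=105)

print(f"{cost:.6f}")   # 0.001452
print(f"{ratio:.2f}")  # 0.21
```

The input side contributes $0.000084 and the output side $0.001368, so output tokens dominate the bill for this request.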
Explainability & Observability
Unlike black-box routing systems, LLMOpt is completely transparent. You can ask the framework to explain exactly why it chose a specific model for a specific query without spending any money (dry_run=True or the /explain endpoint).
explanation = client.explain(
query="What is the capital of France?",
budget_mode="cheap"
)
Explanation Output:
```text
LLMOpt Decision Explanation

Query complexity : 0.050 (trivial)
Primary domain   : factual

Selected model   : gemini-1.5-flash (google)
Fallback model   : gpt-4o-mini
Compression      : yes
System prompt    : minimal

Scoring rationale:
  • model=gemini-1.5-flash
  • capability=0.742
  • cost_norm=0.0042
  • J=-0.124 (α=0.6,β=0.3,γ=0.1)

Cost saved       : $0.009850 vs GPT-4o baseline
```