---
title: LLMOpt
emoji: 🚀
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---
# LLMOpt: The Adaptive Inference Optimization Framework (V2)
> **Intelligent Routing. Minimal Latency. Maximum ROI.**
In the era of sprawling Large Language Models (LLMs), routing every query to a flagship model like GPT-4o or Claude 3.5 Sonnet is financially unsustainable and computationally wasteful.
**LLMOpt** is an enterprise-grade middleware layer that sits between your application and your LLM providers. By analyzing the semantic complexity of each incoming query, LLMOpt selects the most cost-effective model capable of handling the request, compresses prompts to reduce token usage, and caches responses, all while giving you full observability into its decision-making process.
```text
Your App → llmopt.generate(query)
→ [Semantic Cache → NLI Analyze → GBR Estimate → Bayesian Optimize → LLMLingua Compress → Route]
→ LLM API → Response
```
## Table of Contents
- [The V2 Architecture](#the-v2-architecture)
- [Core ML Components](#core-ml-components)
- [Graceful Degradation](#graceful-degradation)
- [Quick Start & Installation](#quick-start--installation)
- [Python SDK Usage](#python-sdk-usage)
- [REST API Integration](#rest-api-integration)
- [Supported Providers & Models](#supported-providers--models)
- [Explainability & Observability](#explainability--observability)
---
## The V2 Architecture
LLMOpt V2 has transitioned from a static, heuristic-based router to a fully **Machine Learning-powered pipeline**. The framework acts as an intelligent funnel, progressively optimizing the request before it ever reaches an LLM provider.
```mermaid
flowchart TD
A[Incoming Query] --> B(Semantic Cache)
B -->|Cache Hit| Z[Return Cached Response]
B -->|Cache Miss| C(Query Analyzer)
C --> D(Complexity Estimator)
D --> E(Optimization Engine)
E --> F(Prompt Optimizer)
F --> G(Model Router)
G --> H((LLM Provider))
H --> I[LLM-as-a-Judge Evaluator]
I -->|Feedback Loop| E
I --> Z
```
### Pipeline Stages
1. **Semantic Cache**: Checks Redis for highly similar past queries.
2. **Query Analyzer**: Extracts structural features and semantic domains from the prompt.
3. **Complexity Estimator**: Predicts the cognitive load required to answer the query (0.0 to 1.0).
4. **Optimization Engine**: Minimizes a cost/quality objective function to select the model with the best trade-off for the query.
5. **Prompt Optimizer**: Intelligently compresses the prompt to shed unnecessary tokens.
6. **Model Router**: Dispatches the request via LiteLLM to OpenAI, Anthropic, Google, Ollama, etc.
7. **Evaluator (Optional)**: Scores the response quality and feeds it back to the optimization engine.
---
## Core ML Components
The V2 release introduces state-of-the-art machine learning to every layer of the pipeline:
### 1. Zero-Shot NLI Query Analyzer
Instead of relying on brittle regex patterns to determine whether a query is asking for "code" or "math," LLMOpt uses the `cross-encoder/nli-distilroberta-base` model from the Hugging Face Hub as a zero-shot classifier. It categorizes query intent on the fly without requiring a labeled training set.
### 2. Sentence-Transformer Semantic Cache
Before spending API credits, the framework embeds the incoming query using a lightweight, local `all-MiniLM-L6-v2` model and compares it against a Redis-backed vector store using cosine similarity. If an existing query matches with a cosine similarity above 0.95, the cached response is served at **$0.00 cost** and near-zero latency.
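The lookup described above can be sketched without any ML dependencies. This is a minimal illustration, not LLMOpt's implementation: `InMemorySemanticCache` is a hypothetical stand-in for the Redis-backed store, and the three-dimensional vectors are hand-written placeholders for real `all-MiniLM-L6-v2` embeddings.

```python
import math

SIMILARITY_THRESHOLD = 0.95  # threshold quoted in the text


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemorySemanticCache:
    """Hypothetical in-memory stand-in for the Redis-backed vector store."""

    def __init__(self, threshold=SIMILARITY_THRESHOLD):
        self.threshold = threshold
        self._entries = []  # list of (embedding, response) pairs

    def put(self, embedding, response):
        self._entries.append((list(embedding), response))

    def get(self, embedding):
        """Return the best-matching response, or None if nothing clears the threshold."""
        best_score, best_response = 0.0, None
        for stored, response in self._entries:
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None


cache = InMemorySemanticCache()
cache.put([1.0, 0.0, 0.0], "TCP is connection-oriented; UDP is not.")

near_hit = cache.get([0.99, 0.05, 0.0])  # almost the same direction: cache hit
miss = cache.get([0.0, 1.0, 0.0])        # orthogonal vector: cache miss
```

A linear scan is only reasonable for a toy cache; a production store would typically use an approximate nearest-neighbor index.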
### 3. Gradient Boosting Complexity Estimator
To predict how "hard" a query is, LLMOpt uses a `scikit-learn` Gradient Boosting Regressor (GBR) trained on hundreds of annotated examples. The predicted score sets the capability threshold a model must meet, so that "What is Python?" is routed to a fast, cheap model, while "Implement a distributed Paxos consensus algorithm" is routed to a flagship reasoning model.
### 4. Bayesian Weight Optimization (Optuna)
The Optimization Engine selects models by minimizing the objective function:
`J(x) = α·Cost + β·Tokens - γ·Quality`
Instead of hardcoding `α`, `β`, and `γ`, LLMOpt integrates **Optuna**. Using real-world feedback from the LLM evaluator, Optuna applies Bayesian optimization to keep adjusting these weights, steering routing toward the best observed balance of response quality and price.
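To make the objective concrete, here is a minimal sketch of scoring candidates with `J(x)` and picking the minimum. The `Candidate` class, the normalization of cost and tokens to the range 0 to 1, and the two example models are assumptions for illustration; only the formula itself comes from the text. The first weight set matches the one shown in the explanation output later in this document.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    cost_norm: float    # cost scaled to [0, 1] (normalization scheme is assumed)
    tokens_norm: float  # expected token usage scaled to [0, 1] (assumed)
    quality: float      # capability score in [0, 1]


def objective(c, alpha, beta, gamma):
    """J(x) = alpha*Cost + beta*Tokens - gamma*Quality; lower is better."""
    return alpha * c.cost_norm + beta * c.tokens_norm - gamma * c.quality


def select_model(candidates, alpha, beta, gamma):
    """Pick the candidate that minimizes the objective."""
    return min(candidates, key=lambda c: objective(c, alpha, beta, gamma))


# Two made-up candidates, for illustration only.
candidates = [
    Candidate("flagship", cost_norm=1.00, tokens_norm=0.5, quality=0.93),
    Candidate("budget", cost_norm=0.05, tokens_norm=0.5, quality=0.74),
]

cost_sensitive = select_model(candidates, alpha=0.6, beta=0.3, gamma=0.1)
quality_sensitive = select_model(candidates, alpha=0.1, beta=0.1, gamma=0.8)
```

With a heavy cost weight the cheaper candidate wins; shifting weight to `γ` flips the choice to the higher-capability one. This sensitivity is why the weights are worth tuning from feedback.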
### 5. LLMLingua Semantic Compression
Large context windows are expensive. LLMOpt integrates Microsoft's `llmlingua-2` to perform semantic token pruning. It identifies and removes non-essential tokens (filler words, redundant context) from the prompt while preserving the core semantic meaning, reducing input costs by up to 40% before the LLM is even called.
### 6. LLM-as-a-Judge Evaluation Loop
When explicitly requested (`evaluate=True`), LLMOpt uses a highly efficient judge model (`gpt-4o-mini`) to score the returned response across Accuracy, Completeness, Clarity, and Conciseness. This score is automatically fed back into the Bayesian Optimizer to improve future routing decisions.
---
## Graceful Degradation
Enterprise systems must be resilient. **LLMOpt is designed to never crash if an ML dependency is missing or unavailable.**
If you choose not to install the heavy `[ml]` dependencies (like PyTorch or sentence-transformers), or if your Redis cache goes offline, LLMOpt silently **falls back to its V1 heuristic rules**, so your application keeps routing requests even when ML components are unavailable.
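One common way to structure this kind of fallback is sketched below. It is not LLMOpt's actual code: the function names and the keyword rules in `heuristic_domain` are hypothetical, since the V1 rules themselves are not shown in this README.

```python
import importlib.util


def ml_stack_available(modules=("torch", "sentence_transformers")):
    """Return True only if every optional ML module is installed."""
    return all(importlib.util.find_spec(name) is not None for name in modules)


def heuristic_domain(query):
    """Hypothetical V1-style keyword rule used when ML is unavailable."""
    lowered = query.lower()
    if any(word in lowered for word in ("def ", "function", "code", "compile")):
        return "code"
    if any(word in lowered for word in ("integral", "equation", "prove")):
        return "math"
    return "general"


def classify_domain(query, ml_classifier=None):
    """Use the ML classifier when present and working, else the heuristic."""
    if ml_classifier is not None:
        try:
            return ml_classifier(query)
        except Exception:
            pass  # any ML failure falls through to the heuristic path
    return heuristic_domain(query)


def broken_classifier(query):
    """Simulates an ML component that fails at call time."""
    raise RuntimeError("model failed to load")


result_without_ml = classify_domain("Write a function to reverse a list")
result_with_broken_ml = classify_domain(
    "Solve this equation", ml_classifier=broken_classifier
)
```

Checking for the modules with `importlib.util.find_spec` avoids importing heavy packages just to discover that they are missing.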
---
## Quick Start & Installation
### Requirements
- Python 3.10+
- At least one API key (OpenAI, Anthropic, Google, Mistral, DeepSeek) OR a local Ollama instance.
### Installation
```bash
# Clone the repository
git clone https://github.com/Shrot101/llmopt.git
cd llmopt
# Install with all Machine Learning capabilities (Highly Recommended for V2)
pip install -e ".[ml]"
# Install Core only (uses V1 heuristic fallbacks)
pip install -e .
# Install with Local Model support
pip install -e ".[ml,local]"
```
### Configuration
Copy the environment template and add your API keys. You only need to provide keys for the providers you intend to use.
```bash
cp config/.env.example config/.env
```
```env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=AIza...
OLLAMA_API_BASE=http://localhost:11434
# Required for V2 Semantic Caching
REDIS_URL=redis://localhost:6379/0
```
---
## Python SDK Usage
### Basic Generation
```python
from llmopt import LLMOpt
client = LLMOpt()
# The framework handles analysis, optimization, and routing automatically
result = client.generate(
    query="Explain the difference between TCP and UDP",
    budget_mode="balanced"  # Options: "cheap" | "balanced" | "quality"
)
print(result.response)
print(f"Model used : {result.model_used}")
print(f"Cost : ${result.estimated_cost:.6f}")
print(f"Tokens saved: {result.tokens_saved}")
```
### Advanced Constraints & Evaluation
```python
result = client.generate(
    query="Design a highly available distributed rate limiter.",
    budget_mode="quality",
    # Hard cap: never spend more than this per request (USD)
    max_cost_per_request=0.01,
    # Provider filtering
    exclude_providers=["openai"],
    only_providers=["anthropic", "google"],
    # Opt-in to the LLM-as-a-judge feedback loop
    evaluate=True,
    # dry_run=True runs the full optimization pipeline but skips the actual API call
    dry_run=False,
)
if result.evaluation:
    print(f"Quality Score: {result.evaluation.overall}/10")
```
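As a rough illustration of how such constraints might narrow the candidate set, the sketch below filters a few models by provider and by an estimated per-request cost. The function names, the token estimates, and the rule that both provider filters apply together are assumptions; the prices are taken from the table in "Supported Providers & Models".

```python
MODELS = [
    # (model, provider, input $/1k tokens, output $/1k tokens)
    ("gpt-4o", "openai", 0.0025, 0.010),
    ("claude-3-5-sonnet-20241022", "anthropic", 0.003, 0.015),
    ("claude-3-5-haiku-20241022", "anthropic", 0.0008, 0.004),
    ("gemini-1.5-flash", "google", 0.000075, 0.0003),
]


def estimate_cost(in_price, out_price, input_tokens, output_tokens):
    """Linear per-token cost estimate in USD."""
    return input_tokens / 1000 * in_price + output_tokens / 1000 * out_price


def filter_candidates(models, input_tokens, output_tokens, max_cost=None,
                      only_providers=None, exclude_providers=None):
    """Return model names that pass the provider filters and the cost cap."""
    allowed = []
    for name, provider, in_price, out_price in models:
        if only_providers is not None and provider not in only_providers:
            continue
        if exclude_providers is not None and provider in exclude_providers:
            continue
        cost = estimate_cost(in_price, out_price, input_tokens, output_tokens)
        if max_cost is not None and cost > max_cost:
            continue
        allowed.append(name)
    return allowed


# Assumed token estimates for a long design answer.
allowed = filter_candidates(
    MODELS,
    input_tokens=500,
    output_tokens=800,
    max_cost=0.01,
    only_providers=["anthropic", "google"],
    exclude_providers=["openai"],
)
```

With these assumed token counts, the Sonnet model exceeds the $0.01 cap and is dropped, which shows how a hard cost cap can override a `"quality"` budget mode.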
---
## REST API Integration
LLMOpt includes a built-in FastAPI server for easy integration into non-Python architectures.
### Start the server
```bash
python run.py --host 0.0.0.0 --port 8000
```
### `POST /generate`
**Request Body:**
```json
{
  "query": "Explain quantum computing",
  "budget_mode": "balanced",
  "api_keys": {
    "openai": "sk-...",
    "anthropic": "sk-ant-...",
    "google": "AIza..."
  }
}
```
> [!TIP]
> **BYOK (Bring Your Own Key) Mode:**
> This server is configured as a public utility. It provides the **Routing Intelligence** and **Shared Semantic Cache**, but you must provide your own provider API keys in the `api_keys` object. The server does not store your keys; they are used only for the duration of the request.
### Example cURL
```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Write a recursive Fibonacci function in Rust",
    "budget_mode": "balanced",
    "evaluate": true
  }'
```
**Response payload includes deep insights into the optimization process:**
```json
{
  "response": "Here is the Rust implementation...",
  "model_used": "claude-3-5-haiku-20241022",
  "provider": "anthropic",
  "input_tokens": 105,
  "output_tokens": 342,
  "total_tokens": 447,
  "estimated_cost": 0.001452,
  "tokens_saved": 28,
  "compression_ratio": 0.21,
  "complexity_score": 0.62,
  "complexity_tier": "hard",
  "latency_ms": 1140,
  "evaluation": {
    "overall": 9.5,
    "accuracy": 10.0,
    "feedback": "The code is idiomatic and correctly implements recursion."
  }
}
```
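For callers that prefer Python over cURL, the same request can be built with the standard library alone. This is a sketch under assumptions: the helper names are hypothetical, and it assumes the server accepts the `query`, `budget_mode`, `evaluate`, and `api_keys` fields shown in the examples above.

```python
import json
import urllib.request


def build_generate_request(base_url, query, budget_mode="balanced",
                           api_keys=None, evaluate=False):
    """Build (but do not send) a POST request for the /generate endpoint."""
    payload = {"query": query, "budget_mode": budget_mode, "evaluate": evaluate}
    if api_keys:
        payload["api_keys"] = api_keys  # BYOK: keys travel with the request
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_generate_request(
    "http://localhost:8000",
    "Write a recursive Fibonacci function in Rust",
    evaluate=True,
)
# result = send(request)  # requires a running LLMOpt server
# print(result["model_used"], result["estimated_cost"])
```

Separating request construction from sending keeps the payload easy to inspect and test without a live server.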
---
## Supported Providers & Models
The routing engine dynamically compares models across providers based on their unified capability scores and per-token pricing. Add or update models simply by modifying `data/model_registry.json`.
| Model | Provider | Input $/1k | Output $/1k | Capability | Best For |
|-------|----------|-----------|------------|------------|----------|
| `gpt-4o` | OpenAI | $0.0025 | $0.010 | 0.930 | Complex reasoning |
| `gpt-4o-mini` | OpenAI | $0.00015 | $0.0006 | 0.784 | Balanced tasks |
| `claude-3-5-sonnet-20241022` | Anthropic | $0.003 | $0.015 | 0.934 | Coding, analysis |
| `claude-3-5-haiku-20241022` | Anthropic | $0.0008 | $0.004 | 0.794 | Fast tasks |
| `gemini-1.5-flash` | Google | $0.000075 | $0.0003 | 0.742 | Cheapest cloud |
| `mistral-large-latest` | Mistral | $0.003 | $0.009 | 0.852 | EU + quality |
| `deepseek-chat` | DeepSeek | $0.00014 | $0.00028 | 0.887 | Best value math/code |
| `llama3.1:70b` | Ollama | FREE | FREE | 0.823 | Local high-quality |
*(See the registry file for the complete list of supported models).*
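As a worked check of the pricing columns, the sketch below recomputes the `estimated_cost` from the sample `/generate` response (105 input tokens, 342 output tokens on `claude-3-5-haiku-20241022`). It assumes cost is linear in token count at the listed per-1k prices; the framework's actual accounting is not shown in this README, and `request_cost` is a hypothetical helper.

```python
def request_cost(input_tokens, output_tokens, input_per_1k, output_per_1k):
    """Cost in USD, assuming linear per-1k-token pricing."""
    return input_tokens / 1000 * input_per_1k + output_tokens / 1000 * output_per_1k


# Prices from the table above; token counts from the sample response.
haiku_cost = request_cost(105, 342, input_per_1k=0.0008, output_per_1k=0.004)
gpt4o_cost = request_cost(105, 342, input_per_1k=0.0025, output_per_1k=0.010)

print(f"{haiku_cost:.6f}")  # 0.001452, matching the sample payload
print(f"gpt-4o would cost {gpt4o_cost / haiku_cost:.1f}x as much for the same tokens")
```

The arithmetic is 0.105 × 0.0008 + 0.342 × 0.004 = 0.000084 + 0.001368 = 0.001452 USD.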
---
## Explainability & Observability
Unlike black-box routing systems, LLMOpt is completely transparent. You can ask the framework to explain exactly why it chose a specific model for a specific query without spending any money (`dry_run=True` or the `/explain` endpoint).
```python
explanation = client.explain(
query="What is the capital of France?",
budget_mode="cheap"
)
```
**Explanation Output:**
```text
=======================================================
LLMOpt Decision Explanation
=======================================================
Query complexity : 0.050 (trivial)
Primary domain : factual
Selected model : gemini-1.5-flash (google)
Fallback model : gpt-4o-mini
Compression : yes
System prompt : minimal
Scoring rationale:
• model=gemini-1.5-flash
• capability=0.742
• cost_norm=0.0042
• J=-0.124 (α=0.6,β=0.3,γ=0.1)
Cost saved : $0.009850 vs GPT-4o baseline
=======================================================
```