---
title: LLMOpt
emoji: 🚀
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---
# LLMOpt: The Adaptive Inference Optimization Framework (V2)
> **Intelligent Routing. Minimal Latency. Maximum ROI.**
In the era of sprawling Large Language Models (LLMs), routing every query to a flagship model like GPT-4o or Claude 3.5 Sonnet is financially unsustainable and computationally wasteful.
**LLMOpt** is an enterprise-grade middleware layer that sits between your application and your LLM providers. By dynamically analyzing the semantic complexity of incoming queries, LLMOpt automatically selects the most cost-effective model capable of handling the request, compresses context windows to reduce token waste, and caches responses—all while giving you full observability into its decision-making process.
```text
Your App → llmopt.generate(query)
→ [Semantic Cache → NLI Analyze → GBR Estimate → Bayesian Optimize → LLMLingua Compress → Route]
→ LLM API → Response
```
## Table of Contents
- [The V2 Architecture](#the-v2-architecture)
- [Core ML Components](#core-ml-components)
- [Graceful Degradation](#graceful-degradation)
- [Quick Start & Installation](#quick-start--installation)
- [Python SDK Usage](#python-sdk-usage)
- [REST API Integration](#rest-api-integration)
- [Supported Providers & Models](#supported-providers--models)
- [Explainability & Observability](#explainability--observability)
---
## The V2 Architecture
LLMOpt V2 has transitioned from a static, heuristic-based router to a fully **Machine Learning-powered pipeline**. The framework acts as an intelligent funnel, progressively optimizing the request before it ever reaches an LLM provider.
```mermaid
flowchart TD
A[Incoming Query] --> B(Semantic Cache)
B -->|Cache Hit| Z[Return Cached Response]
B -->|Cache Miss| C(Query Analyzer)
C --> D(Complexity Estimator)
D --> E(Optimization Engine)
E --> F(Prompt Optimizer)
F --> G(Model Router)
G --> H((LLM Provider))
H --> I[LLM-as-a-Judge Evaluator]
I -->|Feedback Loop| E
I --> Z
```
### Pipeline Stages
1. **Semantic Cache**: Checks Redis for highly similar past queries.
2. **Query Analyzer**: Extracts structural features and semantic domains from the prompt.
3. **Complexity Estimator**: Predicts the cognitive load required to answer the query (0.0 to 1.0).
4. **Optimization Engine**: Minimizes a cost/quality objective function to pick the best-scoring model for the request.
5. **Prompt Optimizer**: Intelligently compresses the prompt to shed unnecessary tokens.
6. **Model Router**: Dispatches the request via LiteLLM to OpenAI, Anthropic, Google, Ollama, etc.
7. **Evaluator (Optional)**: Scores the response quality and feeds it back to the optimization engine.
---
## Core ML Components
The V2 release adds a machine-learning component to each stage of the pipeline:
### 1. Zero-Shot NLI Query Analyzer
Instead of relying on brittle regex patterns to determine whether a query is asking for "code" or "math," LLMOpt runs zero-shot classification with the `cross-encoder/nli-distilroberta-base` model from the Hugging Face Hub. Each candidate domain label is scored as an NLI hypothesis against the query, so intent is categorized on the fly without labeled training data.
### 2. Sentence-Transformer Semantic Cache
Before spending API credits, the framework embeds the incoming query using a lightweight, local `all-MiniLM-L6-v2` model and compares it against a Redis-backed vector store using cosine similarity. If an existing query matches with >95% similarity, the cached response is served at **$0.00 cost** and near-zero latency.
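
The lookup described above can be sketched in plain Python. This is a minimal illustration, not the framework's implementation: `lookup`, `cache_entries`, and the toy 3-dimensional vectors are assumptions made for the example, and a real deployment would query the Redis-backed store with 384-dimensional `all-MiniLM-L6-v2` embeddings.

```python
import math

SIMILARITY_THRESHOLD = 0.95  # the ">95% similarity" cut-off described above


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lookup(query_embedding, cache_entries, threshold=SIMILARITY_THRESHOLD):
    """Return the response of the most similar cached entry, or None on a miss.

    cache_entries is a hypothetical in-memory list of (embedding, response)
    pairs standing in for the vector store.
    """
    best_score, best_response = 0.0, None
    for embedding, response in cache_entries:
        score = cosine_similarity(query_embedding, embedding)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score > threshold else None


# Toy vectors standing in for real sentence embeddings.
cache = [
    ([1.0, 0.0, 0.0], "cached answer A"),
    ([0.0, 1.0, 0.0], "cached answer B"),
]
hit = lookup([0.99, 0.05, 0.0], cache)   # nearly parallel to entry A
miss = lookup([0.7, 0.7, 0.0], cache)    # roughly 0.71 similarity to both entries
```

A linear scan is enough to show the thresholding logic; a vector index would replace the loop once the cache grows.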
### 3. Gradient Boosting Complexity Estimator
To predict how "hard" a query is, LLMOpt leverages a `scikit-learn` Gradient Boosting Regressor (GBR) trained on hundreds of annotated examples. It accurately scales the required capability threshold, ensuring that "What is Python?" gets routed to a fast, cheap model, while "Implement a distributed Paxos consensus algorithm" gets routed to a flagship reasoning model.
### 4. Bayesian Weight Optimization (Optuna)
The Optimization Engine selects models by minimizing the objective function:
`J(x) = α·Cost + β·Tokens - γ·Quality`
Instead of hardcoding `α`, `β`, and `γ`, LLMOpt integrates **Optuna**. Using real-world feedback from the LLM evaluator, Optuna runs Bayesian optimization to keep adjusting these weights, steering routing toward the best observed quality at the lowest cost. Lower `J` is better: cost and token usage are penalized, and quality is rewarded.
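
To make the objective concrete, here is a small sketch of scoring candidates with `J` and picking the minimum. The `Candidate` class, the normalized `[0, 1]` scales, and the two example models are assumptions for illustration; the actual engine's normalization and weight values may differ.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    cost: float     # normalized cost in [0, 1] (assumed scale)
    tokens: float   # normalized token usage in [0, 1] (assumed scale)
    quality: float  # expected quality / capability in [0, 1]


def objective(c, alpha, beta, gamma):
    """J = alpha*Cost + beta*Tokens - gamma*Quality (lower is better)."""
    return alpha * c.cost + beta * c.tokens - gamma * c.quality


def select_model(candidates, alpha, beta, gamma):
    """Pick the candidate with the smallest objective value."""
    return min(candidates, key=lambda c: objective(c, alpha, beta, gamma))


# Hypothetical candidates: one cheap model, one expensive but stronger model.
candidates = [
    Candidate("small-model", cost=0.05, tokens=0.30, quality=0.74),
    Candidate("large-model", cost=0.90, tokens=0.30, quality=0.93),
]

# Cost-heavy weights favor the cheap model; quality-heavy weights flip the choice.
cheap_pick = select_model(candidates, alpha=0.6, beta=0.3, gamma=0.1)
quality_pick = select_model(candidates, alpha=0.1, beta=0.1, gamma=0.8)
```

The same candidate list yields different winners as the weights shift, which is why tuning `α`, `β`, and `γ` from evaluator feedback changes routing behavior.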
### 5. LLMLingua Semantic Compression
Large context windows are expensive. LLMOpt integrates Microsoft's `llmlingua-2` to perform semantic token pruning. It identifies and removes non-essential tokens (filler words, redundant context) from the prompt while preserving the core semantic meaning, reducing input costs by up to 40% before the LLM is even called.
### 6. LLM-as-a-Judge Evaluation Loop
When explicitly requested (`evaluate=True`), LLMOpt uses a highly efficient judge model (`gpt-4o-mini`) to score the returned response across Accuracy, Completeness, Clarity, and Conciseness. This score is automatically fed back into the Bayesian Optimizer to improve future routing decisions.
---
## Graceful Degradation
Enterprise systems must be resilient. **LLMOpt is designed to never crash if an ML dependency is missing or unavailable.**
If you choose not to install the heavy `[ml]` dependencies (like PyTorch or sentence-transformers), or if your Redis cache goes offline, LLMOpt silently **falls back to its V1 heuristic rules**. Your application keeps routing requests even when those components are unavailable.
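
One common way to get this behavior is to treat ML components as optional and catch failures at the call site. The sketch below is an assumption about how such a fallback could look, not LLMOpt's actual code: `load_optional`, `heuristic_domain`, and `classify_with_fallback` are hypothetical names, and the keyword rule is a stand-in for the V1 heuristics.

```python
import importlib


def load_optional(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def heuristic_domain(query):
    """Hypothetical V1-style keyword rule used when the ML path is unavailable."""
    lowered = query.lower()
    if any(word in lowered for word in ("function", "code", "implement")):
        return "code"
    return "general"


def classify_with_fallback(query, ml_classifier=None):
    """Try the ML classifier first; fall back to heuristics on any failure."""
    if ml_classifier is not None:
        try:
            return ml_classifier(query), "ml"
        except Exception:
            pass  # model failed to load or run; fall through to heuristics
    return heuristic_domain(query), "heuristic"


# A missing optional dependency yields None instead of raising.
missing = load_optional("llmopt_missing_ml_dependency")

# With no ML classifier available, the heuristic path answers.
domain, path = classify_with_fallback("Implement a queue in Go")
```

Returning the path taken (`"ml"` or `"heuristic"`) alongside the result keeps the fallback observable rather than fully hidden.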
---
## Quick Start & Installation
### Requirements
- Python 3.10+
- At least one API key (OpenAI, Anthropic, Google, Mistral, DeepSeek) OR a local Ollama instance.
### Installation
```bash
# Clone the repository
git clone https://github.com/Shrot101/llmopt.git
cd llmopt
# Install with all Machine Learning capabilities (Highly Recommended for V2)
pip install -e ".[ml]"
# Install Core only (uses V1 heuristic fallbacks)
pip install -e .
# Install with Local Model support
pip install -e ".[ml,local]"
```
### Configuration
Copy the environment template and add your API keys. You only need to provide keys for the providers you intend to use.
```bash
cp config/.env.example config/.env
```
```env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=AIza...
OLLAMA_API_BASE=http://localhost:11434
# Required for V2 Semantic Caching
REDIS_URL=redis://localhost:6379/0
```
---
## Python SDK Usage
### Basic Generation
```python
from llmopt import LLMOpt
client = LLMOpt()
# The framework handles analysis, optimization, and routing automatically
result = client.generate(
    query="Explain the difference between TCP and UDP",
    budget_mode="balanced",  # Options: "cheap" | "balanced" | "quality"
)
print(result.response)
print(f"Model used : {result.model_used}")
print(f"Cost : ${result.estimated_cost:.6f}")
print(f"Tokens saved: {result.tokens_saved}")
```
### Advanced Constraints & Evaluation
```python
result = client.generate(
    query="Design a highly available distributed rate limiter.",
    budget_mode="quality",
    # Hard cap: never spend more than this per request (USD)
    max_cost_per_request=0.01,
    # Provider filtering
    exclude_providers=["openai"],
    only_providers=["anthropic", "google"],
    # Opt-in to the LLM-as-a-judge feedback loop
    evaluate=True,
    # dry_run=True → runs full optimization pipeline but skips the actual API call
    dry_run=False,
)

# The evaluation is only populated when evaluate=True
if result.evaluation:
    print(f"Quality Score: {result.evaluation.overall}/10")
```
---
## REST API Integration
LLMOpt includes a built-in FastAPI server for easy integration into non-Python architectures.
### Start the server
```bash
python run.py --host 0.0.0.0 --port 8000
```
### `POST /generate`
**Request Body:**
```json
{
  "query": "Explain quantum computing",
  "budget_mode": "balanced",
  "api_keys": {
    "openai": "sk-...",
    "anthropic": "sk-ant-...",
    "google": "AIza..."
  }
}
```
> [!TIP]
> **BYOK (Bring Your Own Key) Mode:**
> This server is configured as a public utility. It provides the **Routing Intelligence** and **Shared Semantic Cache**, but you must provide your own provider API keys in the `api_keys` object. The server does not store your keys; they are used only for the duration of the request.
### Example cURL
```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Write a recursive Fibonacci function in Rust",
    "budget_mode": "balanced",
    "evaluate": true
  }'
```
**Response payload includes deep insights into the optimization process:**
```json
{
  "response": "Here is the Rust implementation...",
  "model_used": "claude-3-5-haiku-20241022",
  "provider": "anthropic",
  "input_tokens": 105,
  "output_tokens": 342,
  "total_tokens": 447,
  "estimated_cost": 0.001452,
  "tokens_saved": 28,
  "compression_ratio": 0.21,
  "complexity_score": 0.62,
  "complexity_tier": "hard",
  "latency_ms": 1140,
  "evaluation": {
    "overall": 9.5,
    "accuracy": 10.0,
    "feedback": "The code is idiomatic and correctly implements recursion."
  }
}
```
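
The same request can be sent from Python using only the standard library. This is a sketch under the assumption that the server is reachable at `http://localhost:8000`; `build_generate_request` and `send` are hypothetical helper names, and the field names follow the request body shown above.

```python
import json
import urllib.request


def build_generate_request(base_url, query, budget_mode="balanced",
                           api_keys=None, evaluate=False):
    """Build a POST /generate request without sending it."""
    payload = {"query": query, "budget_mode": budget_mode, "evaluate": evaluate}
    if api_keys:
        # BYOK mode: provider keys travel with the request body.
        payload["api_keys"] = api_keys
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_generate_request(
    "http://localhost:8000",
    "Write a recursive Fibonacci function in Rust",
    evaluate=True,
)
# With a server running, uncomment to perform the call:
# result = send(request)
# print(result["model_used"], result["estimated_cost"])
```

Separating request construction from sending makes it easy to inspect or log the payload before any network call is made.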
---
## Supported Providers & Models
The routing engine dynamically compares models across providers based on their unified capability scores and per-token pricing. Add or update models simply by modifying `data/model_registry.json`.
| Model | Provider | Input $/1k | Output $/1k | Capability | Best For |
|-------|----------|-----------|------------|------------|----------|
| `gpt-4o` | OpenAI | $0.0025 | $0.010 | 0.930 | Complex reasoning |
| `gpt-4o-mini` | OpenAI | $0.00015 | $0.0006 | 0.784 | Balanced tasks |
| `claude-3-5-sonnet-20241022` | Anthropic | $0.003 | $0.015 | 0.934 | Coding, analysis |
| `claude-3-5-haiku-20241022` | Anthropic | $0.0008 | $0.004 | 0.794 | Fast tasks |
| `gemini-1.5-flash` | Google | $0.000075 | $0.0003 | 0.742 | Cheapest cloud |
| `mistral-large-latest` | Mistral | $0.003 | $0.009 | 0.852 | EU + quality |
| `deepseek-chat` | DeepSeek | $0.00014 | $0.00028 | 0.887 | Best value math/code |
| `llama3.1:70b` | Ollama | FREE | FREE | 0.823 | Local high-quality |
*(See the registry file for the complete list of supported models).*
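
Per-request cost follows directly from the per-1k prices in the table. The sketch below uses a hypothetical in-memory `PRICING_PER_1K` dictionary rather than the real layout of `data/model_registry.json`, which is not shown here. With the token counts from the sample response in the REST API section (105 input, 342 output on `claude-3-5-haiku-20241022`), it reproduces the `estimated_cost` of 0.001452.

```python
# Prices copied from the table above (USD per 1,000 tokens).
# The dictionary layout is an assumption, not the registry file's schema.
PRICING_PER_1K = {
    "claude-3-5-haiku-20241022": {"input": 0.0008, "output": 0.004},
    "gpt-4o": {"input": 0.0025, "output": 0.010},
}


def estimate_cost(model, input_tokens, output_tokens, pricing=PRICING_PER_1K):
    """Cost in USD = input_tokens/1000 * input_rate + output_tokens/1000 * output_rate."""
    rates = pricing[model]
    return (input_tokens / 1000) * rates["input"] + (output_tokens / 1000) * rates["output"]


haiku_cost = estimate_cost("claude-3-5-haiku-20241022", 105, 342)
gpt4o_cost = estimate_cost("gpt-4o", 105, 342)

print(f"haiku : ${haiku_cost:.6f}")
print(f"gpt-4o: ${gpt4o_cost:.6f}")
print(f"saved : ${gpt4o_cost - haiku_cost:.6f}")
```

Running the same token counts through two models gives a quick estimate of what a routing decision saves against a flagship baseline.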
---
## Explainability & Observability
Unlike black-box routing systems, LLMOpt is completely transparent. You can ask the framework to explain exactly why it chose a specific model for a specific query without spending any money (`dry_run=True` or the `/explain` endpoint).
```python
explanation = client.explain(
    query="What is the capital of France?",
    budget_mode="cheap",
)
```
**Explanation Output:**
```text
=======================================================
LLMOpt Decision Explanation
=======================================================
Query complexity : 0.050 (trivial)
Primary domain : factual
Selected model : gemini-1.5-flash (google)
Fallback model : gpt-4o-mini
Compression : yes
System prompt : minimal
Scoring rationale:
• model=gemini-1.5-flash
• capability=0.742
• cost_norm=0.0042
• J=-0.124 (α=0.6,β=0.3,γ=0.1)
Cost saved : $0.009850 vs GPT-4o baseline
=======================================================
```