---
title: LLMOpt
emoji: 🚀
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---

# LLMOpt: The Adaptive Inference Optimization Framework (V2)

> **Intelligent Routing. Minimal Latency. Maximum ROI.**

In the era of sprawling Large Language Models (LLMs), routing every query to a flagship model like GPT-4o or Claude 3.5 Sonnet is financially unsustainable and computationally wasteful. 

**LLMOpt** is an enterprise-grade middleware layer that sits between your application and your LLM providers. By dynamically analyzing the semantic complexity of incoming queries, LLMOpt automatically selects the most cost-effective model capable of handling the request, compresses prompts to reduce token waste, and caches responses, while giving you full observability into its decision-making process.

```text
Your App → llmopt.generate(query) 
    → [Semantic Cache → NLI Analyze → GBR Estimate → Bayesian Optimize → LLMLingua Compress → Route] 
    → LLM API → Response
```

## Table of Contents

- [The V2 Architecture](#the-v2-architecture)
- [Core ML Components](#core-ml-components)
- [Graceful Degradation](#graceful-degradation)
- [Quick Start & Installation](#quick-start--installation)
- [Python SDK Usage](#python-sdk-usage)
- [REST API Integration](#rest-api-integration)
- [Supported Providers & Models](#supported-providers--models)
- [Explainability & Observability](#explainability--observability)

---

## The V2 Architecture

LLMOpt V2 has transitioned from a static, heuristic-based router to a fully **Machine Learning-powered pipeline**. The framework acts as an intelligent funnel, progressively optimizing the request before it ever reaches an LLM provider.

```mermaid
flowchart TD
    A[Incoming Query] --> B(Semantic Cache)
    B -->|Cache Hit| Z[Return Cached Response]
    B -->|Cache Miss| C(Query Analyzer)
    C --> D(Complexity Estimator)
    D --> E(Optimization Engine)
    E --> F(Prompt Optimizer)
    F --> G(Model Router)
    G --> H((LLM Provider))
    H --> I[LLM-as-a-Judge Evaluator]
    I -->|Feedback Loop| E
    I --> Z
```

### Pipeline Stages
1. **Semantic Cache**: Checks Redis for highly similar past queries.
2. **Query Analyzer**: Extracts structural features and semantic domains from the prompt.
3. **Complexity Estimator**: Predicts the cognitive load required to answer the query (0.0 to 1.0).
4. **Optimization Engine**: Minimizes a cost/quality objective function to select the model with the best trade-off.
5. **Prompt Optimizer**: Intelligently compresses the prompt to shed unnecessary tokens.
6. **Model Router**: Dispatches the request via LiteLLM to OpenAI, Anthropic, Google, Ollama, etc.
7. **Evaluator (Optional)**: Scores the response quality and feeds it back to the optimization engine.

---

## Core ML Components

The V2 release adds a machine learning component to each stage of the pipeline:

### 1. Zero-Shot NLI Query Analyzer
Instead of relying on brittle regex patterns to determine whether a query is asking for "code" or "math," LLMOpt uses the `cross-encoder/nli-distilroberta-base` model from the Hugging Face Hub as a zero-shot classifier. It categorizes query intent on the fly without requiring labeled datasets.
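
The idea can be sketched with the Hugging Face `zero-shot-classification` pipeline. This is a minimal illustration, not LLMOpt's internal code: the function names and the candidate label list are assumptions, and the scorer is injectable so the routing logic can be exercised without downloading the model.

```python
from typing import Callable, Dict, Sequence

# Illustrative label set; the labels LLMOpt actually uses are not documented here.
CANDIDATE_DOMAINS = ("code", "math", "factual", "creative writing")

Scorer = Callable[[str, Sequence[str]], Dict[str, float]]


def hf_zero_shot_scorer(query: str, labels: Sequence[str]) -> Dict[str, float]:
    """Score each label with the NLI model (requires the `transformers` package)."""
    from transformers import pipeline  # lazy import: heavy optional dependency

    # In real use, build the pipeline once and reuse it across calls.
    classifier = pipeline(
        "zero-shot-classification",
        model="cross-encoder/nli-distilroberta-base",
    )
    result = classifier(query, list(labels))
    return dict(zip(result["labels"], result["scores"]))


def primary_domain(
    query: str,
    scorer: Scorer = hf_zero_shot_scorer,
    labels: Sequence[str] = CANDIDATE_DOMAINS,
) -> str:
    """Return the label with the highest score for this query."""
    scores = scorer(query, labels)
    return max(scores, key=scores.get)
```

Passing a custom `scorer` (for example a keyword-based one) keeps the same call site working when the model is unavailable.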

### 2. Sentence-Transformer Semantic Cache
Before spending API credits, the framework embeds the incoming query using a lightweight, local `all-MiniLM-L6-v2` model and compares it against a Redis-backed vector store using cosine similarity. If a stored query matches with a cosine similarity above 0.95, the cached response is served with **no API cost** and near-zero latency.
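
The lookup logic can be sketched in a few lines. The class below is a hypothetical in-memory stand-in for the Redis-backed store, and `embed` is any function that maps text to a vector. With `sentence-transformers` installed, `SentenceTransformer("all-MiniLM-L6-v2").encode` would be one such function.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemorySemanticCache:
    """Toy stand-in for the Redis vector store (hypothetical, for illustration)."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self._entries: List[Tuple[Vector, str]] = []

    def put(self, query: str, response: str) -> None:
        self._entries.append((self.embed(query), response))

    def get(self, query: str) -> Optional[str]:
        """Return the cached response of the closest stored query, if close enough."""
        if not self._entries:
            return None
        q = self.embed(query)
        best_vec, best_response = max(
            self._entries, key=lambda entry: cosine_similarity(q, entry[0])
        )
        if cosine_similarity(q, best_vec) > self.threshold:
            return best_response
        return None
```

A linear scan is fine for a sketch; a production store would use an approximate nearest-neighbour index instead.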

### 3. Gradient Boosting Complexity Estimator
To predict how "hard" a query is, LLMOpt uses a `scikit-learn` Gradient Boosting Regressor (GBR) trained on hundreds of annotated examples. The predicted score sets the capability threshold a model must meet, so that "What is Python?" gets routed to a fast, cheap model, while "Implement a distributed Paxos consensus algorithm" gets routed to a flagship reasoning model.

### 4. Bayesian Weight Optimization (Optuna)
The Optimization Engine selects models by minimizing the objective function:

`J(x) = α·Cost + β·Tokens - γ·Quality`

Rather than hardcoding `α`, `β`, and `γ`, LLMOpt integrates **Optuna**. Using real-world feedback from the LLM evaluator, Optuna applies Bayesian optimization to keep adjusting these weights toward the best observed trade-off between response quality and price.
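
As a worked illustration of the objective, the sketch below scores two made-up candidates and picks the one with the lowest `J`. The candidate names and their normalized values are hypothetical, and the default weights are the ones shown in the sample explanation output later in this README (α=0.6, β=0.3, γ=0.1).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Candidate:
    name: str
    cost: float     # normalized cost in [0, 1] (assumed scaling)
    tokens: float   # normalized token usage in [0, 1] (assumed scaling)
    quality: float  # expected quality in [0, 1]


def objective(c: Candidate, alpha: float, beta: float, gamma: float) -> float:
    """J(x) = alpha*Cost + beta*Tokens - gamma*Quality (lower is better)."""
    return alpha * c.cost + beta * c.tokens - gamma * c.quality


def select_model(
    candidates: Iterable[Candidate],
    alpha: float = 0.6,
    beta: float = 0.3,
    gamma: float = 0.1,
) -> Candidate:
    """Return the candidate that minimizes the objective."""
    return min(candidates, key=lambda c: objective(c, alpha, beta, gamma))


# Hypothetical candidates, not entries from the model registry.
CANDIDATES = [
    Candidate("small-model", cost=0.05, tokens=0.4, quality=0.74),
    Candidate("flagship-model", cost=1.0, tokens=0.4, quality=0.93),
]

print(select_model(CANDIDATES).name)                 # cost-heavy weights
print(select_model(CANDIDATES, 0.1, 0.1, 0.8).name)  # quality-heavy weights
```

With the cost-heavy weights the small model wins (J = 0.076 versus 0.627); shifting most of the weight to `γ` flips the choice to the flagship model. This is the lever the weight tuner adjusts.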

### 5. LLMLingua Semantic Compression
Large context windows are expensive. LLMOpt integrates Microsoft's `llmlingua-2` to perform semantic token pruning. It identifies and removes non-essential tokens (filler words, redundant context) from the prompt while preserving the core semantic meaning, reducing input costs by up to 40% before the LLM is even called.

### 6. LLM-as-a-Judge Evaluation Loop
When explicitly requested (`evaluate=True`), LLMOpt uses a highly efficient judge model (`gpt-4o-mini`) to score the returned response across Accuracy, Completeness, Clarity, and Conciseness. This score is automatically fed back into the Bayesian Optimizer to improve future routing decisions.

---

## Graceful Degradation

Enterprise systems must be resilient. **LLMOpt is designed to keep running when an ML dependency is missing or unavailable.**

If you choose not to install the heavy `[ml]` dependencies (like PyTorch or sentence-transformers), or if your Redis cache goes offline, LLMOpt **falls back to its V1 heuristic rules** instead of raising an error, so your application keeps routing requests.
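
The fallback pattern can be sketched as follows, using the complexity estimator as the example. Everything here is an assumption made for illustration: the function names, the logger name, and especially the heuristic rules, which are invented and are not LLMOpt's actual V1 rules.

```python
import importlib.util
import logging
import re
from typing import Callable, Optional

logger = logging.getLogger("llmopt.sketch")  # hypothetical logger name


def ml_available(module_name: str = "sklearn") -> bool:
    """True if an optional top-level package is importable."""
    return importlib.util.find_spec(module_name) is not None


def heuristic_complexity(query: str) -> float:
    """Invented rule-based score in [0, 1]: longer queries and 'hard' terms score higher."""
    length_score = min(len(query.split()) / 50.0, 1.0)
    hard_terms = len(
        re.findall(r"\b(distributed|consensus|prove|optimi[sz]e|design)\b", query.lower())
    )
    return min(0.7 * length_score + 0.15 * hard_terms, 1.0)


def estimate_complexity(
    query: str, ml_predict: Optional[Callable[[str], float]] = None
) -> float:
    """Prefer the ML predictor; fall back to heuristics if it is absent or fails."""
    if ml_predict is not None:
        try:
            return min(max(float(ml_predict(query)), 0.0), 1.0)  # clamp to [0, 1]
        except Exception as exc:
            logger.warning("ML estimator failed (%s); using heuristic fallback", exc)
    return heuristic_complexity(query)
```

This sketch logs a warning on fallback so that degraded routing is visible in monitoring; whether to log or stay quiet is a design choice.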

---

## Quick Start & Installation

### Requirements
- Python 3.10+
- At least one API key (OpenAI, Anthropic, Google, Mistral, DeepSeek) OR a local Ollama instance.

### Installation

```bash
# Clone the repository
git clone https://github.com/Shrot101/llmopt.git
cd llmopt

# Install with all Machine Learning capabilities (Highly Recommended for V2)
pip install -e ".[ml]"

# Install Core only (uses V1 heuristic fallbacks)
pip install -e .

# Install with Local Model support
pip install -e ".[ml,local]"
```

### Configuration
Copy the environment template and add your API keys. You only need to provide keys for the providers you intend to use.

```bash
cp config/.env.example config/.env
```

```env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=AIza...
OLLAMA_API_BASE=http://localhost:11434

# Required for V2 Semantic Caching
REDIS_URL=redis://localhost:6379/0
```
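
To check which providers a given environment enables, a small helper like the one below can be useful. It is a hypothetical utility, not part of the LLMOpt API: the variable names come from the template above, while the lowercase provider identifiers (in particular `"ollama"`) are assumptions.

```python
import os
from typing import List, Mapping

# Environment variable per provider, taken from the .env template.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GEMINI_API_KEY",
    "ollama": "OLLAMA_API_BASE",
}


def configured_providers(env: Mapping[str, str] = os.environ) -> List[str]:
    """List providers whose variable is set to a non-blank value."""
    return [
        provider
        for provider, var in PROVIDER_ENV_VARS.items()
        if env.get(var, "").strip()
    ]


print(configured_providers({"ANTHROPIC_API_KEY": "sk-ant-example"}))
```

The helper only inspects the mapping it is given, so values in `config/.env` must already be loaded into the process environment (for example with `python-dotenv`) before calling it with the default argument.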

---

## Python SDK Usage

### Basic Generation

```python
from llmopt import LLMOpt

client = LLMOpt()

# The framework handles analysis, optimization, and routing automatically
result = client.generate(
    query="Explain the difference between TCP and UDP",
    budget_mode="balanced"   # Options: "cheap" | "balanced" | "quality"
)

print(result.response)
print(f"Model used  : {result.model_used}")
print(f"Cost        : ${result.estimated_cost:.6f}")
print(f"Tokens saved: {result.tokens_saved}")
```

### Advanced Constraints & Evaluation

```python
result = client.generate(
    query="Design a highly available distributed rate limiter.",
    budget_mode="quality",

    # Hard cap — never spend more than this per request (USD)
    max_cost_per_request=0.01,

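    # Provider filtering: a deny-list and an allow-list
    # ("openai" is already absent from the allow-list; both are shown for illustration)
    exclude_providers=["openai"],
    only_providers=["anthropic", "google"],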

    # Opt-in to the LLM-as-a-judge feedback loop
    evaluate=True,

    # dry_run=True → runs full optimization pipeline but skips the actual API call
    dry_run=False,
)

if result.evaluation:
    print(f"Quality Score: {result.evaluation.overall}/10")
```

---

## REST API Integration

LLMOpt includes a built-in FastAPI server for easy integration into non-Python architectures.

### Start the server
```bash
python run.py --host 0.0.0.0 --port 8000
```

### `POST /generate`

**Request Body:**
```json
{
  "query": "Explain quantum computing",
  "budget_mode": "balanced",
  "api_keys": {
    "openai": "sk-...",
    "anthropic": "sk-ant-...",
    "google": "AIza..."
  }
}
```

> [!TIP]
> **BYOK (Bring Your Own Key) Mode:** 
> This server is configured as a public utility. It provides the **Routing Intelligence** and **Shared Semantic Cache**, but you must provide your own provider API keys in the `api_keys` object. The server does not store your keys; they are used only for the duration of the request.

### Example cURL
```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Write a recursive Fibonacci function in Rust",
    "budget_mode": "balanced",
    "evaluate": true
  }'
```

**The response payload reports the routing decision alongside the model output:**
```json
{
  "response": "Here is the Rust implementation...",
  "model_used": "claude-3-5-haiku-20241022",
  "provider": "anthropic",
  "input_tokens": 105,
  "output_tokens": 342,
  "total_tokens": 447,
  "estimated_cost": 0.001452,
  "tokens_saved": 28,
  "compression_ratio": 0.21,
  "complexity_score": 0.62,
  "complexity_tier": "hard",
  "latency_ms": 1140,
  "evaluation": {
      "overall": 9.5,
      "accuracy": 10.0,
      "feedback": "The code is idiomatic and correctly implements recursion."
  }
}
```
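
The same call can be made from Python with only the standard library. This is a minimal client sketch under the assumption that the server accepts the request fields shown above (`query`, `budget_mode`, `evaluate`, `api_keys`); the helper names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_generate_request(
    base_url: str,
    query: str,
    budget_mode: str = "balanced",
    evaluate: bool = False,
    api_keys: Optional[Dict[str, str]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST /generate request."""
    payload: Dict[str, Any] = {
        "query": query,
        "budget_mode": budget_mode,
        "evaluate": evaluate,
    }
    if api_keys:
        payload["api_keys"] = api_keys  # BYOK mode: keys travel with the request
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, query: str, **kwargs: Any) -> Dict[str, Any]:
    """Send the request and decode the JSON response body."""
    request = build_generate_request(base_url, query, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running locally, `generate("http://localhost:8000", "Explain quantum computing")["model_used"]` would return the model the router selected.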

---

## Supported Providers & Models

The routing engine dynamically compares models across providers based on their unified capability scores and per-token pricing. Add or update models simply by modifying `data/model_registry.json`.

| Model | Provider | Input ($/1K tokens) | Output ($/1K tokens) | Capability | Best For |
|-------|----------|-----------|------------|------------|----------|
| `gpt-4o` | OpenAI | $0.0025 | $0.010 | 0.930 | Complex reasoning |
| `gpt-4o-mini` | OpenAI | $0.00015 | $0.0006 | 0.784 | Balanced tasks |
| `claude-3-5-sonnet-20241022` | Anthropic | $0.003 | $0.015 | 0.934 | Coding, analysis |
| `claude-3-5-haiku-20241022` | Anthropic | $0.0008 | $0.004 | 0.794 | Fast tasks |
| `gemini-1.5-flash` | Google | $0.000075 | $0.0003 | 0.742 | Cheapest cloud |
| `mistral-large-latest` | Mistral | $0.003 | $0.009 | 0.852 | EU + quality |
| `deepseek-chat` | DeepSeek | $0.00014 | $0.00028 | 0.887 | Best value math/code |
| `llama3.1:70b` | Ollama | FREE | FREE | 0.823 | Local high-quality |

*(See the registry file for the complete list of supported models).*
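
Per-request cost follows directly from the per-1K-token prices in the table. The formula below is an assumption about how `estimated_cost` is computed, but it reproduces the value in the sample `/generate` response above (105 input and 342 output tokens on `claude-3-5-haiku-20241022`).

```python
# Per-1K-token prices in USD (input, output), copied from three rows of the table.
PRICES_PER_1K = {
    "gpt-4o": (0.0025, 0.010),
    "claude-3-5-haiku-20241022": (0.0008, 0.004),
    "gemini-1.5-flash": (0.000075, 0.0003),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD = (input_tokens * input_price + output_tokens * output_price) / 1000."""
    input_price, output_price = PRICES_PER_1K[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1000


haiku_cost = estimate_cost("claude-3-5-haiku-20241022", 105, 342)
gpt4o_cost = estimate_cost("gpt-4o", 105, 342)

print(f"{haiku_cost:.6f}")  # 0.001452
print(f"Routing to Haiku costs {haiku_cost / gpt4o_cost:.0%} of the GPT-4o price")
```

Comparing candidates this way makes the saving from routing a mid-complexity query away from a flagship model easy to quantify.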

---

## Explainability & Observability

Unlike black-box routing systems, LLMOpt exposes its reasoning. You can ask the framework to explain why it chose a specific model for a specific query without making a paid API call (`dry_run=True` or the `/explain` endpoint).

```python
explanation = client.explain(
    query="What is the capital of France?",
    budget_mode="cheap"
)
```

**Explanation Output:**
```text
=======================================================
LLMOpt Decision Explanation
=======================================================
Query complexity : 0.050 (trivial)
Primary domain   : factual

Selected model   : gemini-1.5-flash (google)
Fallback model   : gpt-4o-mini
Compression      : yes
System prompt    : minimal

Scoring rationale:
  • model=gemini-1.5-flash
  • capability=0.742
  • cost_norm=0.0042
  • J=-0.124 (α=0.6,β=0.3,γ=0.1)

Cost saved       : $0.009850 vs GPT-4o baseline
=======================================================
```