---
title: LLMOpt
emoji: 🚀
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---

# LLMOpt: The Adaptive Inference Optimization Framework (V2)

> **Intelligent Routing. Minimal Latency. Maximum ROI.**

In the era of sprawling Large Language Models (LLMs), routing every query to a flagship model like GPT-4o or Claude 3.5 Sonnet is financially unsustainable and computationally wasteful. 

**LLMOpt** is an enterprise-grade middleware layer that sits between your application and your LLM providers. It analyzes the semantic complexity of each incoming query, selects the most cost-effective model capable of handling the request, compresses context windows to reduce token waste, and caches responses, while giving you full observability into its decision-making process.

```text
Your App → llmopt.generate(query) 
    → [Semantic Cache → NLI Analyze → GBR Estimate → Bayesian Optimize → LLMLingua Compress → Route] 
    → LLM API → Response
```

## Table of Contents

- [The V2 Architecture](#the-v2-architecture)
- [Core ML Components](#core-ml-components)
- [Graceful Degradation](#graceful-degradation)
- [Quick Start & Installation](#quick-start--installation)
- [Python SDK Usage](#python-sdk-usage)
- [REST API Integration](#rest-api-integration)
- [Supported Providers & Models](#supported-providers--models)
- [Explainability & Observability](#explainability--observability)

---

## The V2 Architecture

LLMOpt V2 has transitioned from a static, heuristic-based router to a fully **Machine Learning-powered pipeline**. The framework acts as an intelligent funnel, progressively optimizing the request before it ever reaches an LLM provider.

```mermaid
flowchart TD
    A[Incoming Query] --> B(Semantic Cache)
    B -->|Cache Hit| Z[Return Cached Response]
    B -->|Cache Miss| C(Query Analyzer)
    C --> D(Complexity Estimator)
    D --> E(Optimization Engine)
    E --> F(Prompt Optimizer)
    F --> G(Model Router)
    G --> H((LLM Provider))
    H --> I[LLM-as-a-Judge Evaluator]
    I -->|Feedback Loop| E
    I --> Z
```

### Pipeline Stages
1. **Semantic Cache**: Checks Redis for highly similar past queries.
2. **Query Analyzer**: Extracts structural features and semantic domains from the prompt.
3. **Complexity Estimator**: Predicts the cognitive load required to answer the query (0.0 to 1.0).
4. **Optimization Engine**: Minimizes a cost/quality objective function to select the model with the best trade-off for the active budget mode.
5. **Prompt Optimizer**: Intelligently compresses the prompt to shed unnecessary tokens.
6. **Model Router**: Dispatches the request via LiteLLM to OpenAI, Anthropic, Google, Ollama, etc.
7. **Evaluator (Optional)**: Scores the response quality and feeds it back to the optimization engine.
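
The stages above can be read as a funnel with an early exit on a cache hit. The following is a minimal sketch of that control flow only, not LLMOpt's actual implementation: `run_pipeline`, the stage functions, and the state dictionary are hypothetical names introduced for illustration, and the optional evaluator feedback loop is left out.

```python
from typing import Any, Callable, Optional

State = dict[str, Any]


def run_pipeline(
    query: str,
    cache_lookup: Callable[[str], Optional[str]],
    stages: list[Callable[[State], State]],
    call_llm: Callable[[State], str],
) -> State:
    """Check the cache first, then run each stage in order, then call the LLM."""
    cached = cache_lookup(query)
    if cached is not None:
        # Cache hit: skip analysis, optimization, and the provider call.
        return {"query": query, "response": cached, "cache_hit": True}

    state: State = {"query": query, "cache_hit": False}
    for stage in stages:
        state = stage(state)  # each stage returns an enriched copy of the state
    state["response"] = call_llm(state)
    return state


# Toy stand-ins for stages 2-5 (hypothetical; real stages would use the ML components).
def analyze(state: State) -> State:
    return {**state, "domain": "factual"}


def estimate(state: State) -> State:
    return {**state, "complexity": 0.05}


def optimize(state: State) -> State:
    return {**state, "model": "gemini-1.5-flash"}


def compress(state: State) -> State:
    return {**state, "prompt": state["query"].strip()}


result = run_pipeline(
    "  What is the capital of France?  ",
    cache_lookup=lambda q: None,  # always a cache miss in this demo
    stages=[analyze, estimate, optimize, compress],
    call_llm=lambda s: f"[{s['model']}] answer",  # stub instead of a real provider call
)
print(result["model"], result["cache_hit"])
```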

---

## Core ML Components

The V2 release introduces state-of-the-art machine learning to every layer of the pipeline:

### 1. Zero-Shot NLI Query Analyzer
Instead of relying on brittle regex patterns to determine whether a query is asking for "code" or "math," LLMOpt uses the `cross-encoder/nli-distilroberta-base` model from the Hugging Face Hub as a zero-shot classifier. It categorizes query intent on the fly without requiring labeled datasets.

### 2. Sentence-Transformer Semantic Cache
Before spending API credits, the framework embeds the incoming query with a lightweight, local `all-MiniLM-L6-v2` model and compares it against a Redis-backed vector store using cosine similarity. If a stored query matches with a cosine similarity above 0.95, the cached response is served at **$0.00 API cost** and near-zero latency.
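
As a rough illustration of the lookup described above, the sketch below implements a threshold-based cosine-similarity cache in plain Python. It is an assumption-laden stand-in: the `InMemorySemanticCache` class is hypothetical, a Python list replaces the Redis-backed vector store, and hand-written toy vectors replace real `all-MiniLM-L6-v2` embeddings.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemorySemanticCache:
    """Hypothetical in-memory stand-in for the Redis-backed vector store."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._entries: list[tuple[list[float], str]] = []

    def put(self, embedding: Sequence[float], response: str) -> None:
        self._entries.append((list(embedding), response))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the best-matching response if it clears the threshold."""
        best_score, best_response = 0.0, None
        for stored, response in self._entries:
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None


cache = InMemorySemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "Paris is the capital of France.")

near_duplicate = cache.get([0.99, 0.05])  # similarity is about 0.999: cache hit
unrelated = cache.get([0.5, 0.5])         # similarity is about 0.707: cache miss
print(near_duplicate, unrelated)
```

A linear scan is enough to show the idea; a production store would typically use an indexed vector search instead.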

### 3. Gradient Boosting Complexity Estimator
To predict how "hard" a query is, LLMOpt uses a `scikit-learn` Gradient Boosting Regressor (GBR) trained on hundreds of annotated examples. The predicted score sets the capability threshold a model must meet, so "What is Python?" is routed to a fast, cheap model, while "Implement a distributed Paxos consensus algorithm" is routed to a flagship reasoning model.

### 4. Bayesian Weight Optimization (Optuna)
The Optimization Engine selects a model by minimizing the objective function:

`J(x) = α·Cost + β·Tokens - γ·Quality`

Rather than hardcoding `α`, `β`, and `γ`, LLMOpt integrates **Optuna**. Using real-world feedback from the LLM evaluator, Optuna applies Bayesian optimization to keep adjusting these weights, steering routing toward the highest observed response quality at the lowest cost.
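
To make the objective concrete, here is a minimal sketch of ranking candidates by `J` and picking the minimum. Several things are assumptions rather than documented behavior: the `Candidate` type and `select_model` helper are hypothetical, cost is normalized against the `gpt-4o` input price, the token term is held constant at 0.5, and the registry capability score stands in for quality. The "cheap" weights match the sample explanation output later in this README; the "quality" weights are invented for contrast.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_norm: float    # cost scaled to [0, 1] (assumed normalization)
    tokens_norm: float  # token usage scaled to [0, 1] (assumed normalization)
    quality: float      # capability score used as a proxy for quality


def objective(c: Candidate, alpha: float, beta: float, gamma: float) -> float:
    """J = alpha*Cost + beta*Tokens - gamma*Quality; lower is better."""
    return alpha * c.cost_norm + beta * c.tokens_norm - gamma * c.quality


def select_model(
    candidates: list[Candidate],
    alpha: float,
    beta: float,
    gamma: float,
    min_quality: float = 0.0,
) -> Candidate:
    """Pick the candidate with the lowest J among those meeting min_quality."""
    eligible = [c for c in candidates if c.quality >= min_quality]
    if not eligible:
        raise ValueError("no candidate meets the required quality threshold")
    return min(eligible, key=lambda c: objective(c, alpha, beta, gamma))


# Input prices divided by the gpt-4o input price; capability scores from the registry table.
candidates = [
    Candidate("gpt-4o", cost_norm=1.0, tokens_norm=0.5, quality=0.930),
    Candidate("gpt-4o-mini", cost_norm=0.06, tokens_norm=0.5, quality=0.784),
    Candidate("gemini-1.5-flash", cost_norm=0.03, tokens_norm=0.5, quality=0.742),
]

cheap_pick = select_model(candidates, alpha=0.6, beta=0.3, gamma=0.1)
quality_pick = select_model(candidates, alpha=0.1, beta=0.1, gamma=0.8)
print(cheap_pick.name)    # gemini-1.5-flash
print(quality_pick.name)  # gpt-4o
```

Shifting weight from `α` to `γ` is what moves the selection from the cheapest model to the most capable one, which is the behavior a weight-tuning loop would exploit.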

### 5. LLMLingua Semantic Compression
Large context windows are expensive. LLMOpt integrates Microsoft's `llmlingua-2` to perform semantic token pruning. It identifies and removes non-essential tokens (filler words, redundant context) from the prompt while preserving the core semantic meaning, reducing input costs by up to 40% before the LLM is even called.

### 6. LLM-as-a-Judge Evaluation Loop
When explicitly requested (`evaluate=True`), LLMOpt uses a highly efficient judge model (`gpt-4o-mini`) to score the returned response across Accuracy, Completeness, Clarity, and Conciseness. This score is automatically fed back into the Bayesian Optimizer to improve future routing decisions.

---

## Graceful Degradation

Enterprise systems must be resilient. **LLMOpt is designed to never crash if an ML dependency is missing or unavailable.** 

If you choose not to install the heavy `[ml]` dependencies (such as PyTorch or sentence-transformers), or if your Redis cache goes offline, LLMOpt **falls back to its V1 heuristic rules** without raising an error, so your application keeps routing requests even when the ML components are unavailable.
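
One common way to structure this kind of fallback is sketched below. It is illustrative only: `ml_available`, `heuristic_domain`, and `analyze` are hypothetical names, and the keyword rules are simple placeholders rather than the actual V1 heuristics.

```python
import importlib.util
from typing import Callable, Optional


def ml_available(module_name: str = "sentence_transformers") -> bool:
    """Return True if an optional top-level ML package is installed."""
    return importlib.util.find_spec(module_name) is not None


def heuristic_domain(query: str) -> str:
    """Placeholder keyword rules standing in for the V1 heuristics."""
    lowered = query.lower()
    if any(word in lowered for word in ("implement", "function", "code", "bug")):
        return "code"
    if any(word in lowered for word in ("solve", "equation", "integral", "prove")):
        return "math"
    return "general"


def analyze(
    query: str,
    ml_classifier: Optional[Callable[[str], str]] = None,
) -> tuple[str, str]:
    """Return (domain, source), preferring the ML path and degrading on any failure."""
    if ml_classifier is not None:
        try:
            return ml_classifier(query), "ml"
        except Exception:
            # Model failed to load or run: fall through to the heuristics.
            pass
    return heuristic_domain(query), "heuristic"


def broken_classifier(query: str) -> str:
    raise RuntimeError("model weights unavailable")


print(analyze("Implement quicksort in Go"))                    # no ML classifier supplied
print(analyze("Solve this equation", broken_classifier))       # ML path fails, heuristics answer
print(analyze("Solve this equation", lambda q: "math"))        # ML path succeeds
```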

---

## Quick Start & Installation

### Requirements
- Python 3.10+
- At least one API key (OpenAI, Anthropic, Google, Mistral, DeepSeek) OR a local Ollama instance.

### Installation

```bash
# Clone the repository
git clone https://github.com/Shrot101/llmopt.git
cd llmopt

# Install with all Machine Learning capabilities (Highly Recommended for V2)
pip install -e ".[ml]"

# Install Core only (uses V1 heuristic fallbacks)
pip install -e .

# Install with Local Model support
pip install -e ".[ml,local]"
```

### Configuration
Copy the environment template and add your API keys. You only need to provide keys for the providers you intend to use.

```bash
cp config/.env.example config/.env
```

```env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=AIza...
OLLAMA_API_BASE=http://localhost:11434

# Required for V2 Semantic Caching
REDIS_URL=redis://localhost:6379/0
```

---

## Python SDK Usage

### Basic Generation

```python
from llmopt import LLMOpt

client = LLMOpt()

# The framework handles analysis, optimization, and routing automatically
result = client.generate(
    query="Explain the difference between TCP and UDP",
    budget_mode="balanced"   # Options: "cheap" | "balanced" | "quality"
)

print(result.response)
print(f"Model used  : {result.model_used}")
print(f"Cost        : ${result.estimated_cost:.6f}")
print(f"Tokens saved: {result.tokens_saved}")
```

### Advanced Constraints & Evaluation

```python
result = client.generate(
    query="Design a highly available distributed rate limiter.",
    budget_mode="quality",

    # Hard cap — never spend more than this per request (USD)
    max_cost_per_request=0.01,

    # Provider filtering
    exclude_providers=["openai"],      
    only_providers=["anthropic", "google"],      

    # Opt-in to the LLM-as-a-judge feedback loop
    evaluate=True,

    # dry_run=True → runs full optimization pipeline but skips the actual API call
    dry_run=False,
)

if result.evaluation:
    print(f"Quality Score: {result.evaluation.overall}/10")
```

---

## REST API Integration

LLMOpt includes a built-in FastAPI server for easy integration into non-Python architectures.

### Start the server
```bash
python run.py --host 0.0.0.0 --port 8000
```

### `POST /generate`

**Request Body:**
```json
{
  "query": "Explain quantum computing",
  "budget_mode": "balanced",
  "api_keys": {
    "openai": "sk-...",
    "anthropic": "sk-ant-...",
    "google": "AIza..."
  }
}
```

> [!TIP]
> **BYOK (Bring Your Own Key) Mode:** 
> This server is configured as a public utility. It provides the **Routing Intelligence** and **Shared Semantic Cache**, but you must provide your own provider API keys in the `api_keys` object. The server does not store your keys; they are used only for the duration of the request.

### Example cURL
```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Write a recursive Fibonacci function in Rust",
    "budget_mode": "balanced",
    "evaluate": true
  }'
```

**Response payload includes deep insights into the optimization process:**
```json
{
  "response": "Here is the Rust implementation...",
  "model_used": "claude-3-5-haiku-20241022",
  "provider": "anthropic",
  "input_tokens": 105,
  "output_tokens": 342,
  "total_tokens": 447,
  "estimated_cost": 0.001452,
  "tokens_saved": 28,
  "compression_ratio": 0.21,
  "complexity_score": 0.62,
  "complexity_tier": "hard",
  "latency_ms": 1140,
  "evaluation": {
      "overall": 9.5,
      "accuracy": 10.0,
      "feedback": "The code is idiomatic and correctly implements recursion."
  }
}
```
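
The same request can also be issued from Python with only the standard library. This is a sketch under assumptions: `build_generate_request` and `generate` are hypothetical helpers, not part of the LLMOpt SDK, and the payload uses only the fields shown in the examples above.

```python
import json
import urllib.request
from typing import Any, Optional


def build_generate_request(
    base_url: str,
    query: str,
    budget_mode: str = "balanced",
    api_keys: Optional[dict[str, str]] = None,
    evaluate: bool = False,
) -> urllib.request.Request:
    """Build (but do not send) a POST /generate request."""
    payload: dict[str, Any] = {
        "query": query,
        "budget_mode": budget_mode,
        "evaluate": evaluate,
    }
    if api_keys:
        payload["api_keys"] = api_keys  # BYOK: keys travel with the request
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(request: urllib.request.Request, timeout: float = 30.0) -> dict[str, Any]:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_generate_request(
    "http://localhost:8000",
    "Write a recursive Fibonacci function in Rust",
    evaluate=True,
)
print(request.full_url)
# With a server running: result = generate(request); print(result["model_used"])
```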

---

## Supported Providers & Models

The routing engine dynamically compares models across providers based on their unified capability scores and per-token pricing. Add or update models simply by modifying `data/model_registry.json`.

| Model | Provider | Input ($/1K tokens) | Output ($/1K tokens) | Capability | Best For |
|-------|----------|-----------|------------|------------|----------|
| `gpt-4o` | OpenAI | $0.0025 | $0.010 | 0.930 | Complex reasoning |
| `gpt-4o-mini` | OpenAI | $0.00015 | $0.0006 | 0.784 | Balanced tasks |
| `claude-3-5-sonnet-20241022` | Anthropic | $0.003 | $0.015 | 0.934 | Coding, analysis |
| `claude-3-5-haiku-20241022` | Anthropic | $0.0008 | $0.004 | 0.794 | Fast tasks |
| `gemini-1.5-flash` | Google | $0.000075 | $0.0003 | 0.742 | Cheapest cloud |
| `mistral-large-latest` | Mistral | $0.003 | $0.009 | 0.852 | EU + quality |
| `deepseek-chat` | DeepSeek | $0.00014 | $0.00028 | 0.887 | Best value math/code |
| `llama3.1:70b` | Ollama | FREE | FREE | 0.823 | Local high-quality |

*(See the registry file for the complete list of supported models).*
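
As a worked example of the per-token pricing, the sketch below recomputes the `estimated_cost` from the sample response in the REST API section (105 input tokens and 342 output tokens on `claude-3-5-haiku-20241022`). The `estimate_cost` helper and the `PRICES_PER_1K` dictionary are hypothetical; only the prices are taken from the table above, and the real registry file may be structured differently.

```python
# Prices in USD per 1,000 tokens (input, output), copied from the table above.
PRICES_PER_1K: dict[str, tuple[float, float]] = {
    "gpt-4o": (0.0025, 0.010),
    "claude-3-5-haiku-20241022": (0.0008, 0.004),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD: tokens times price per 1K tokens, divided by 1,000."""
    input_price, output_price = PRICES_PER_1K[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1000


haiku_cost = estimate_cost("claude-3-5-haiku-20241022", 105, 342)
gpt4o_cost = estimate_cost("gpt-4o", 105, 342)

print(f"{haiku_cost:.6f}")  # 0.001452, matching the sample response
print(f"{gpt4o_cost:.6f}")
```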

---

## Explainability & Observability

Unlike black-box routing systems, LLMOpt is completely transparent. You can ask the framework to explain exactly why it chose a specific model for a specific query without spending any money (`dry_run=True` or the `/explain` endpoint).

```python
explanation = client.explain(
    query="What is the capital of France?",
    budget_mode="cheap"
)
```

**Explanation Output:**
```text
=======================================================
LLMOpt Decision Explanation
=======================================================
Query complexity : 0.050 (trivial)
Primary domain   : factual

Selected model   : gemini-1.5-flash (google)
Fallback model   : gpt-4o-mini
Compression      : yes
System prompt    : minimal

Scoring rationale:
  • model=gemini-1.5-flash
  • capability=0.742
  • cost_norm=0.0042
  • J=-0.124 (α=0.6,β=0.3,γ=0.1)

Cost saved       : $0.009850 vs GPT-4o baseline
=======================================================
```