title: LLMOpt
emoji: 🚀
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false

LLMOpt: The Adaptive Inference Optimization Framework (V2)

Intelligent Routing. Minimal Latency. Maximum ROI.

In the era of sprawling Large Language Models (LLMs), routing every query to a flagship model like GPT-4o or Claude 3.5 Sonnet is financially unsustainable and computationally wasteful.

LLMOpt is an enterprise-grade middleware layer that sits between your application and your LLM providers. By dynamically analyzing the semantic complexity of incoming queries, LLMOpt automatically selects the most cost-effective model capable of handling the request, compresses context windows to reduce token waste, and caches responses—all while giving you full observability into its decision-making process.

Your App → llmopt.generate(query) 
    → [Semantic Cache → NLI Analyze → GBR Estimate → Bayesian Optimize → LLMLingua Compress → Route] 
    → LLM API → Response

The V2 Architecture

LLMOpt V2 has transitioned from a static, heuristic-based router to a fully Machine Learning-powered pipeline. The framework acts as an intelligent funnel, progressively optimizing the request before it ever reaches an LLM provider.

flowchart TD
    A[Incoming Query] --> B(Semantic Cache)
    B -->|Cache Hit| Z[Return Cached Response]
    B -->|Cache Miss| C(Query Analyzer)
    C --> D(Complexity Estimator)
    D --> E(Optimization Engine)
    E --> F(Prompt Optimizer)
    F --> G(Model Router)
    G --> H((LLM Provider))
    H --> I[LLM-as-a-Judge Evaluator]
    I -->|Feedback Loop| E
    I --> Z

Pipeline Stages

  1. Semantic Cache: Checks Redis for highly similar past queries.
  2. Query Analyzer: Extracts structural features and semantic domains from the prompt.
  3. Complexity Estimator: Predicts the cognitive load required to answer the query (0.0 to 1.0).
  4. Optimization Engine: Minimizes a cost/quality objective function to select the best-scoring model.
  5. Prompt Optimizer: Intelligently compresses the prompt to shed unnecessary tokens.
  6. Model Router: Dispatches the request via LiteLLM to OpenAI, Anthropic, Google, Ollama, etc.
  7. Evaluator (Optional): Scores the response quality and feeds it back to the optimization engine.
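
The stage order above can be sketched as a single function with a cache short-circuit. This is a minimal illustration and not LLMOpt's internals: `Result`, `run_pipeline`, the stage names, and the exact-match dictionary cache are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Result:
    response: str = ""
    trace: List[str] = field(default_factory=list)  # stages visited, in order


# Hypothetical stand-ins for stages 2-6: each one only records that it ran.
STAGE_NAMES = ("analyze", "estimate", "optimize", "compress", "route")


def run_pipeline(query: str, cache: Dict[str, str], call_llm: Callable[[str], str]) -> Result:
    result = Result()
    # Stage 1: a hit returns at once and skips every later stage.
    # (Exact-match lookup here; the real cache compares embeddings.)
    if query in cache:
        result.trace.append("cache_hit")
        result.response = cache[query]
        return result
    for name in STAGE_NAMES:
        result.trace.append(name)  # a real stage would transform the request here
    result.response = call_llm(query)
    cache[query] = result.response  # store the answer for the next matching query
    return result
```

Calling `run_pipeline` twice with the same query and the same `cache` dictionary runs all five stages the first time and only the cache lookup the second time.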

Core ML Components

The V2 release introduces state-of-the-art machine learning to every layer of the pipeline:

1. Zero-Shot NLI Query Analyzer

Instead of relying on brittle regex patterns to determine whether a query is asking for "code" or "math," LLMOpt uses the cross-encoder/nli-distilroberta-base model from the Hugging Face Hub. Each candidate domain is treated as an NLI hypothesis about the query, so intent is categorized on the fly without labeled training data.

2. Sentence-Transformer Semantic Cache

Before spending API credits, the framework embeds the incoming query using a lightweight, local all-MiniLM-L6-v2 model and compares it against a Redis-backed vector store using cosine similarity. If an existing query matches with a cosine similarity above 0.95, the cached response is served with no API cost and near-zero latency.
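
The lookup rule can be sketched with plain Python lists in place of real embeddings. This is an illustration under stated assumptions: `InMemorySemanticCache` is a hypothetical stand-in for the Redis-backed store, and the two-dimensional vectors in the usage note are toy values, not all-MiniLM-L6-v2 output.

```python
import math
from typing import List, Optional, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.95  # the cache-hit cutoff described above


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as no match
    return dot / (norm_a * norm_b)


class InMemorySemanticCache:
    """Toy stand-in for the vector store (hypothetical class, not LLMOpt's API)."""

    def __init__(self, threshold: float = SIMILARITY_THRESHOLD) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def put(self, embedding: Sequence[float], response: str) -> None:
        self._entries.append((list(embedding), response))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        # Linear scan for the most similar stored query.
        best_score, best_response = 0.0, None
        for stored, response in self._entries:
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_response = score, response
        # Serve the cached response only when the best match clears the cutoff.
        return best_response if best_score > self.threshold else None
```

After `cache.put([1.0, 0.0], "cached answer")`, a near-duplicate such as `cache.get([0.99, 0.05])` is a hit, while an unrelated vector such as `cache.get([0.0, 1.0])` returns `None` and the request continues down the pipeline.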

3. Gradient Boosting Complexity Estimator

To predict how "hard" a query is, LLMOpt leverages a scikit-learn Gradient Boosting Regressor (GBR) trained on hundreds of annotated examples. It accurately scales the required capability threshold, ensuring that "What is Python?" gets routed to a fast, cheap model, while "Implement a distributed Paxos consensus algorithm" gets routed to a flagship reasoning model.

4. Bayesian Weight Optimization (Optuna)

The Optimization Engine selects a model by minimizing the objective function:

J(x) = α·Cost + β·Tokens - γ·Quality

Instead of hardcoding α, β, and γ, LLMOpt integrates Optuna. Using real-world feedback from the LLM evaluator, Optuna applies Bayesian optimization to keep adjusting these weights, steering routing toward the best available quality at the lowest cost.
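
A minimal sketch of how such an objective picks a model, using three rows from the pricing table in this README. Several details are assumptions made for illustration only: cost is taken as input plus output price per 1k tokens and normalized by the most expensive candidate, the capability score is used as the Quality term, and the token term is a fixed placeholder shared by all models. `Candidate`, `objective`, and `select_model` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_1k: float        # input + output price per 1k tokens (USD)
    capability: float         # used here as the Quality term
    tokens_norm: float = 0.2  # assumed normalized token usage, same for every model


def objective(c: Candidate, max_cost: float, alpha: float, beta: float, gamma: float) -> float:
    cost_norm = c.cost_per_1k / max_cost  # scale cost into [0, 1]
    return alpha * cost_norm + beta * c.tokens_norm - gamma * c.capability


def select_model(candidates: List[Candidate], alpha: float, beta: float, gamma: float) -> Candidate:
    max_cost = max(c.cost_per_1k for c in candidates)
    # The lowest J wins: cheap and capable models score best.
    return min(candidates, key=lambda c: objective(c, max_cost, alpha, beta, gamma))


CANDIDATES = [
    Candidate("gpt-4o", 0.0125, 0.930),
    Candidate("gpt-4o-mini", 0.00075, 0.784),
    Candidate("gemini-1.5-flash", 0.000375, 0.742),
]
```

With cost-heavy weights such as `select_model(CANDIDATES, 0.6, 0.3, 0.1)` the cheapest model, gemini-1.5-flash, has the lowest J. Shifting weight onto quality, for example `select_model(CANDIDATES, 0.1, 0.1, 0.8)`, moves the choice to gpt-4o. This is the lever that learned weights adjust.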

5. LLMLingua Semantic Compression

Large context windows are expensive. LLMOpt integrates Microsoft's LLMLingua-2 to perform semantic token pruning. It identifies and removes non-essential tokens (filler words, redundant context) from the prompt while preserving the core meaning, reducing input costs by up to 40% before the LLM is even called.

6. LLM-as-a-Judge Evaluation Loop

When explicitly requested (evaluate=True), LLMOpt uses a highly efficient judge model (gpt-4o-mini) to score the returned response across Accuracy, Completeness, Clarity, and Conciseness. This score is automatically fed back into the Bayesian Optimizer to improve future routing decisions.


Graceful Degradation

Enterprise systems must be resilient. LLMOpt is designed to never crash if an ML dependency is missing or unavailable.

If you choose not to install the heavy [ml] dependencies (such as PyTorch or sentence-transformers), or if your Redis cache goes offline, LLMOpt silently falls back to its V1 heuristic rules, so your application keeps routing requests even when those components are unavailable.
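
One common way to get this behaviour is to wrap each ML-backed step so that a missing dependency or a dropped connection triggers a rule-based path. The sketch below illustrates that pattern only and makes no claim about LLMOpt's own code: `with_fallback`, `heuristic_domain`, and the keyword table are hypothetical, and the built-in `ConnectionError` stands in for whatever exception the cache client actually raises.

```python
from typing import Callable, Tuple, Type

# Hypothetical keyword rules standing in for the V1 heuristics.
KEYWORDS = {
    "code": ("function", "implement", "bug", "compile"),
    "math": ("solve", "integral", "equation", "prove"),
}


def heuristic_domain(query: str) -> str:
    lowered = query.lower()
    for domain, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return domain
    return "general"


def with_fallback(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    errors: Tuple[Type[BaseException], ...] = (ImportError, ConnectionError),
) -> Callable[[str], str]:
    """Return a callable that tries `primary` and uses `fallback` on the listed errors."""

    def wrapped(query: str) -> str:
        try:
            return primary(query)
        except errors:
            # Missing ML extras or an unreachable service: use the rule-based path.
            return fallback(query)

    return wrapped
```

Errors outside the `errors` tuple still propagate, so genuine bugs in the primary path are not hidden by the fallback.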


Quick Start & Installation

Requirements

  • Python 3.10+
  • At least one API key (OpenAI, Anthropic, Google, Mistral, DeepSeek) OR a local Ollama instance.

Installation

# Clone the repository
git clone https://github.com/Shrot101/llmopt.git
cd llmopt

# Install with all Machine Learning capabilities (Highly Recommended for V2)
pip install -e ".[ml]"

# Install Core only (uses V1 heuristic fallbacks)
pip install -e .

# Install with Local Model support
pip install -e ".[ml,local]"

Configuration

Copy the environment template and add your API keys. You only need to provide keys for the providers you intend to use.

cp config/.env.example config/.env

# config/.env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=AIza...
OLLAMA_API_BASE=http://localhost:11434

# Required for V2 Semantic Caching
REDIS_URL=redis://localhost:6379/0

Python SDK Usage

Basic Generation

from llmopt import LLMOpt

client = LLMOpt()

# The framework handles analysis, optimization, and routing automatically
result = client.generate(
    query="Explain the difference between TCP and UDP",
    budget_mode="balanced"   # Options: "cheap" | "balanced" | "quality"
)

print(result.response)
print(f"Model used  : {result.model_used}")
print(f"Cost        : ${result.estimated_cost:.6f}")
print(f"Tokens saved: {result.tokens_saved}")

Advanced Constraints & Evaluation

result = client.generate(
    query="Design a highly available distributed rate limiter.",
    budget_mode="quality",

    # Hard cap — never spend more than this per request (USD)
    max_cost_per_request=0.01,

    # Provider filtering
    exclude_providers=["openai"],      
    only_providers=["anthropic", "google"],      

    # Opt-in to the LLM-as-a-judge feedback loop
    evaluate=True,

    # dry_run=True → runs full optimization pipeline but skips the actual API call
    dry_run=False,
)

if result.evaluation:
    print(f"Quality Score: {result.evaluation.overall}/10")
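
The effect of these constraints on the candidate set can be sketched as a filter that runs before model selection. This is an assumption-laden illustration: `Model`, `estimate_cost`, and `apply_constraints` are hypothetical names, the token counts are example figures, and the prices are copied from the provider table in this README.

```python
from typing import Iterable, List, NamedTuple, Optional


class Model(NamedTuple):
    name: str
    provider: str
    input_per_1k: float   # USD per 1k input tokens
    output_per_1k: float  # USD per 1k output tokens


def estimate_cost(model: Model, input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * model.input_per_1k + output_tokens * model.output_per_1k) / 1000


def apply_constraints(
    models: Iterable[Model],
    input_tokens: int,
    output_tokens: int,
    only_providers: Optional[List[str]] = None,
    exclude_providers: Optional[List[str]] = None,
    max_cost_per_request: Optional[float] = None,
) -> List[Model]:
    allowed = []
    for model in models:
        if only_providers is not None and model.provider not in only_providers:
            continue  # not on the allow-list
        if exclude_providers is not None and model.provider in exclude_providers:
            continue  # explicitly excluded
        if max_cost_per_request is not None:
            if estimate_cost(model, input_tokens, output_tokens) > max_cost_per_request:
                continue  # would exceed the hard cap
        allowed.append(model)
    return allowed


MODELS = [
    Model("gpt-4o", "openai", 0.0025, 0.010),
    Model("claude-3-5-sonnet-20241022", "anthropic", 0.003, 0.015),
    Model("claude-3-5-haiku-20241022", "anthropic", 0.0008, 0.004),
    Model("gemini-1.5-flash", "google", 0.000075, 0.0003),
]
```

For an estimated 500 input and 1,000 output tokens with `only_providers=["anthropic", "google"]` and `max_cost_per_request=0.01`, gpt-4o is removed by the provider filter and claude-3-5-sonnet-20241022 by the cost cap, leaving the haiku and flash models for the optimizer to choose between.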

REST API Integration

LLMOpt includes a built-in FastAPI server for easy integration into non-Python architectures.

Start the server

python run.py --host 0.0.0.0 --port 8000

POST /generate

Request Body:

{
  "query": "Explain quantum computing",
  "budget_mode": "balanced",
  "api_keys": {
    "openai": "sk-...",
    "anthropic": "sk-ant-...",
    "google": "AIza..."
  }
}

BYOK (Bring Your Own Key) Mode: This server is configured as a public utility. It provides the Routing Intelligence and Shared Semantic Cache, but you must provide your own provider API keys in the api_keys object. The server does not store your keys; they are used only for the duration of the request.

Example cURL

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Write a recursive Fibonacci function in Rust",
    "budget_mode": "balanced",
    "evaluate": true
  }'
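
The same request can be sent from Python using only the standard library. This is a sketch: the helper names `build_generate_request` and `generate` are hypothetical, and the body fields mirror only those shown in the examples above (`query`, `budget_mode`, `evaluate`, `api_keys`).

```python
import json
import urllib.request
from typing import Dict, Optional


def build_generate_request(
    base_url: str,
    query: str,
    budget_mode: str = "balanced",
    api_keys: Optional[Dict[str, str]] = None,
    evaluate: bool = False,
) -> urllib.request.Request:
    payload = {"query": query, "budget_mode": budget_mode, "evaluate": evaluate}
    if api_keys:
        payload["api_keys"] = api_keys  # BYOK: keys travel with each request
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, query: str, **options) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_generate_request(base_url, query, **options)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server from the previous step running, `generate("http://localhost:8000", "Write a recursive Fibonacci function in Rust", evaluate=True)` returns the parsed response dictionary.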

The response payload reports how the request was optimized:

{
  "response": "Here is the Rust implementation...",
  "model_used": "claude-3-5-haiku-20241022",
  "provider": "anthropic",
  "input_tokens": 105,
  "output_tokens": 342,
  "total_tokens": 447,
  "estimated_cost": 0.001452,
  "tokens_saved": 28,
  "compression_ratio": 0.21,
  "complexity_score": 0.62,
  "complexity_tier": "hard",
  "latency_ms": 1140,
  "evaluation": {
      "overall": 9.5,
      "accuracy": 10.0,
      "feedback": "The code is idiomatic and correctly implements recursion."
  }
}
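
The cost and compression fields in this payload can be cross-checked by hand. The sketch below recomputes them from the token counts and the claude-3-5-haiku-20241022 prices listed in the provider table. The compression formula (tokens saved divided by the pre-compression prompt size) is an assumption that happens to reproduce the reported 0.21. The README does not define the field.

```python
# Fields copied from the example response above.
payload = {
    "input_tokens": 105,
    "output_tokens": 342,
    "total_tokens": 447,
    "estimated_cost": 0.001452,
    "tokens_saved": 28,
    "compression_ratio": 0.21,
}

# Prices for claude-3-5-haiku-20241022 from the provider table (USD per 1k tokens).
INPUT_PER_1K = 0.0008
OUTPUT_PER_1K = 0.004


def recompute_cost(p: dict) -> float:
    return (p["input_tokens"] * INPUT_PER_1K + p["output_tokens"] * OUTPUT_PER_1K) / 1000


def recompute_compression_ratio(p: dict) -> float:
    # Assumed definition: share of the original prompt that was pruned.
    original_tokens = p["input_tokens"] + p["tokens_saved"]
    return p["tokens_saved"] / original_tokens


cost = recompute_cost(payload)                 # 0.000084 input + 0.001368 output
ratio = recompute_compression_ratio(payload)   # 28 / 133
```

Both recomputed values agree with the payload: the cost matches `estimated_cost` to six decimal places, and the ratio rounds to the reported `compression_ratio`.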

Supported Providers & Models

The routing engine dynamically compares models across providers based on their unified capability scores and per-token pricing. Add or update models simply by modifying data/model_registry.json.

| Model | Provider | Input $/1k | Output $/1k | Capability | Best For |
|---|---|---|---|---|---|
| gpt-4o | OpenAI | $0.0025 | $0.010 | 0.930 | Complex reasoning |
| gpt-4o-mini | OpenAI | $0.00015 | $0.0006 | 0.784 | Balanced tasks |
| claude-3-5-sonnet-20241022 | Anthropic | $0.003 | $0.015 | 0.934 | Coding, analysis |
| claude-3-5-haiku-20241022 | Anthropic | $0.0008 | $0.004 | 0.794 | Fast tasks |
| gemini-1.5-flash | Google | $0.000075 | $0.0003 | 0.742 | Cheapest cloud |
| mistral-large-latest | Mistral | $0.003 | $0.009 | 0.852 | EU + quality |
| deepseek-chat | DeepSeek | $0.00014 | $0.00028 | 0.887 | Best value math/code |
| llama3.1:70b | Ollama | FREE | FREE | 0.823 | Local high-quality |

(See the registry file for the complete list of supported models.)


Explainability & Observability

Unlike black-box routing systems, LLMOpt is completely transparent. You can ask the framework to explain exactly why it chose a specific model for a specific query without spending any money (dry_run=True or the /explain endpoint).

explanation = client.explain(
    query="What is the capital of France?",
    budget_mode="cheap"
)

Explanation Output:

```text
LLMOpt Decision Explanation

Query complexity : 0.050 (trivial)
Primary domain   : factual

Selected model   : gemini-1.5-flash (google)
Fallback model   : gpt-4o-mini
Compression      : yes
System prompt    : minimal

Scoring rationale:
  • model=gemini-1.5-flash
  • capability=0.742
  • cost_norm=0.0042
  • J=-0.124 (α=0.6,β=0.3,γ=0.1)

Cost saved       : $0.009850 vs GPT-4o baseline
```