---
title: LLMOpt
emoji: 🚀
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---

# LLMOpt: The Adaptive Inference Optimization Framework (V2)

> **Intelligent Routing. Minimal Latency. Maximum ROI.**

In the era of sprawling Large Language Models (LLMs), routing every query to a flagship model like GPT-4o or Claude 3.5 Sonnet is financially unsustainable and computationally wasteful.

**LLMOpt** is an enterprise-grade middleware layer that sits between your application and your LLM providers. By dynamically analyzing the semantic complexity of incoming queries, LLMOpt automatically selects the most cost-effective model capable of handling the request, compresses context windows to reduce token waste, and caches responses, all while giving you full observability into its decision-making process.

```text
Your App → llmopt.generate(query) → [Semantic Cache → NLI Analyze → GBR Estimate → Bayesian Optimize → LLMLingua Compress → Route] → LLM API → Response
```

## Table of Contents

- [The V2 Architecture](#the-v2-architecture)
- [Core ML Components](#core-ml-components)
- [Graceful Degradation](#graceful-degradation)
- [Quick Start & Installation](#quick-start--installation)
- [Python SDK Usage](#python-sdk-usage)
- [REST API Integration](#rest-api-integration)
- [Supported Providers & Models](#supported-providers--models)
- [Explainability & Observability](#explainability--observability)

---

## The V2 Architecture

LLMOpt V2 has transitioned from a static, heuristic-based router to a fully **Machine Learning-powered pipeline**. The framework acts as an intelligent funnel, progressively optimizing the request before it ever reaches an LLM provider.

```mermaid
flowchart TD
    A[Incoming Query] --> B(Semantic Cache)
    B -->|Cache Hit| Z[Return Cached Response]
    B -->|Cache Miss| C(Query Analyzer)
    C --> D(Complexity Estimator)
    D --> E(Optimization Engine)
    E --> F(Prompt Optimizer)
    F --> G(Model Router)
    G --> H((LLM Provider))
    H --> I[LLM-as-a-Judge Evaluator]
    I -->|Feedback Loop| E
    I --> Z
```

### Pipeline Stages

1. **Semantic Cache**: Checks Redis for highly similar past queries.
2. **Query Analyzer**: Extracts structural features and semantic domains from the prompt.
3. **Complexity Estimator**: Predicts the cognitive load required to answer the query (0.0 to 1.0).
4. **Optimization Engine**: Minimizes a cost/quality objective function to pick the best-scoring model.
5. **Prompt Optimizer**: Compresses the prompt to shed unnecessary tokens.
6. **Model Router**: Dispatches the request via LiteLLM to OpenAI, Anthropic, Google, Ollama, etc.
7. **Evaluator (Optional)**: Scores the response quality and feeds it back to the optimization engine.

---

## Core ML Components

The V2 release adds a machine learning component to each stage of the pipeline:

### 1. Zero-Shot NLI Query Analyzer

Instead of relying on brittle regex patterns to determine whether a query is asking for "code" or "math," LLMOpt uses the `cross-encoder/nli-distilroberta-base` model from the Hugging Face Hub. This zero-shot classifier categorizes query intent on the fly without requiring labeled datasets.

### 2. Sentence-Transformer Semantic Cache

Before spending API credits, the framework embeds the incoming query using a lightweight, local `all-MiniLM-L6-v2` model and compares it against a Redis-backed vector store using cosine similarity. If an existing query matches with a cosine similarity above 0.95, the cached response is served at **$0.00 cost** and near-zero latency.

### 3. Gradient Boosting Complexity Estimator

To predict how "hard" a query is, LLMOpt uses a `scikit-learn` Gradient Boosting Regressor (GBR) trained on hundreds of annotated examples. Its prediction sets the required capability threshold, so that "What is Python?" gets routed to a fast, cheap model, while "Implement a distributed Paxos consensus algorithm" gets routed to a flagship reasoning model.

### 4. Bayesian Weight Optimization (Optuna)

The Optimization Engine selects models by minimizing the objective function:

`J(x) = α·Cost + β·Tokens - γ·Quality`

Instead of hardcoding `α`, `β`, and `γ`, LLMOpt integrates **Optuna**. By processing real-world feedback from the LLM evaluator, Optuna uses Bayesian optimization to continuously adjust these weights toward the best observed trade-off between response quality and price.

### 5. LLMLingua Semantic Compression

Large context windows are expensive. LLMOpt integrates Microsoft's `llmlingua-2` to perform semantic token pruning. It identifies and removes non-essential tokens (filler words, redundant context) from the prompt while preserving the core semantic meaning, reducing input costs by up to 40% before the LLM is even called.

### 6. LLM-as-a-Judge Evaluation Loop

When explicitly requested (`evaluate=True`), LLMOpt uses a low-cost judge model (`gpt-4o-mini`) to score the returned response across Accuracy, Completeness, Clarity, and Conciseness. This score is automatically fed back into the Bayesian Optimizer to improve future routing decisions.

---

## Graceful Degradation

Enterprise systems must be resilient. **LLMOpt is designed to never crash if an ML dependency is missing or unavailable.**

If you choose not to install the heavy `[ml]` dependencies (like PyTorch or sentence-transformers), or if your Redis cache goes offline, LLMOpt silently **falls back to its V1 heuristic rules**.
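
As a rough illustration of this fallback pattern (a minimal sketch, not LLMOpt's actual implementation), the snippet below tries an ML classifier when one is supplied and degrades to keyword rules if it is missing or raises. The names `optional_dependency_available`, `heuristic_domain`, and `classify_domain`, and the keyword lists, are hypothetical; the real V1 rules are not described in this README.

```python
import importlib.util
from typing import Callable, Optional


def optional_dependency_available(module_name: str) -> bool:
    """Return True if an optional top-level package can be imported."""
    return importlib.util.find_spec(module_name) is not None


def heuristic_domain(query: str) -> str:
    """Hypothetical stand-in for V1-style keyword rules."""
    text = query.lower()
    if any(word in text for word in ("function", "implement", "code", "bug")):
        return "code"
    if any(word in text for word in ("solve", "integral", "equation", "prove")):
        return "math"
    return "general"


def classify_domain(
    query: str,
    ml_classifier: Optional[Callable[[str], str]] = None,
) -> tuple[str, str]:
    """Return (domain, strategy), preferring the ML path when it works."""
    if ml_classifier is not None:
        try:
            return ml_classifier(query), "ml"
        except Exception:
            # Any ML failure degrades to the heuristic path instead of crashing.
            pass
    return heuristic_domain(query), "heuristic"
```

A caller could check `optional_dependency_available("sentence_transformers")` once at startup and pass `ml_classifier=None` when it returns `False`, so the rest of the pipeline never needs to know which path was taken.
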
This fallback ensures that your application keeps routing requests even when the ML components are unavailable.

---

## Quick Start & Installation

### Requirements

- Python 3.10+
- At least one API key (OpenAI, Anthropic, Google, Mistral, DeepSeek) OR a local Ollama instance.

### Installation

```bash
# Clone the repository
git clone https://github.com/Shrot101/llmopt.git
cd llmopt

# Install with all Machine Learning capabilities (Highly Recommended for V2)
pip install -e ".[ml]"

# Install Core only (uses V1 heuristic fallbacks)
pip install -e .

# Install with Local Model support
pip install -e ".[ml,local]"
```

### Configuration

Copy the environment template and add your API keys. You only need to provide keys for the providers you intend to use.

```bash
cp config/.env.example config/.env
```

```env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=AIza...
OLLAMA_API_BASE=http://localhost:11434

# Required for V2 Semantic Caching
REDIS_URL=redis://localhost:6379/0
```

---

## Python SDK Usage

### Basic Generation

```python
from llmopt import LLMOpt

client = LLMOpt()

# The framework handles analysis, optimization, and routing automatically
result = client.generate(
    query="Explain the difference between TCP and UDP",
    budget_mode="balanced",  # Options: "cheap" | "balanced" | "quality"
)

print(result.response)
print(f"Model used  : {result.model_used}")
print(f"Cost        : ${result.estimated_cost:.6f}")
print(f"Tokens saved: {result.tokens_saved}")
```

### Advanced Constraints & Evaluation

```python
result = client.generate(
    query="Design a highly available distributed rate limiter.",
    budget_mode="quality",
    # Hard cap: never spend more than this per request (USD)
    max_cost_per_request=0.01,
    # Provider filtering
    exclude_providers=["openai"],
    only_providers=["anthropic", "google"],
    # Opt-in to the LLM-as-a-judge feedback loop
    evaluate=True,
    # dry_run=True runs the full optimization pipeline but skips the actual API call
    dry_run=False,
)

if result.evaluation:
    print(f"Quality Score: {result.evaluation.overall}/10")
```

---

## REST API Integration

LLMOpt includes a built-in FastAPI server for easy integration into non-Python architectures.

### Start the server

```bash
python run.py --host 0.0.0.0 --port 8000
```

### `POST /generate`

**Request Body:**

```json
{
  "query": "Explain quantum computing",
  "budget_mode": "balanced",
  "api_keys": {
    "openai": "sk-...",
    "anthropic": "sk-ant-...",
    "google": "AIza..."
  }
}
```

> [!TIP]
> **BYOK (Bring Your Own Key) Mode:**
> This server is configured as a public utility. It provides the **Routing Intelligence** and **Shared Semantic Cache**, but you must provide your own provider API keys in the `api_keys` object. The server does not store your keys; they are used only for the duration of the request.

### Example cURL

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Write a recursive Fibonacci function in Rust",
    "budget_mode": "balanced",
    "evaluate": true
  }'
```

**The response payload includes detailed insight into the optimization process:**

```json
{
  "response": "Here is the Rust implementation...",
  "model_used": "claude-3-5-haiku-20241022",
  "provider": "anthropic",
  "input_tokens": 105,
  "output_tokens": 342,
  "total_tokens": 447,
  "estimated_cost": 0.001452,
  "tokens_saved": 28,
  "compression_ratio": 0.21,
  "complexity_score": 0.62,
  "complexity_tier": "hard",
  "latency_ms": 1140,
  "evaluation": {
    "overall": 9.5,
    "accuracy": 10.0,
    "feedback": "The code is idiomatic and correctly implements recursion."
  }
}
```

---

## Supported Providers & Models

The routing engine dynamically compares models across providers based on their unified capability scores and per-token pricing. To add or update models, edit `data/model_registry.json`.

| Model | Provider | Input $/1K tokens | Output $/1K tokens | Capability | Best For |
|-------|----------|-------------------|--------------------|------------|----------|
| `gpt-4o` | OpenAI | $0.0025 | $0.010 | 0.930 | Complex reasoning |
| `gpt-4o-mini` | OpenAI | $0.00015 | $0.0006 | 0.784 | Balanced tasks |
| `claude-3-5-sonnet-20241022` | Anthropic | $0.003 | $0.015 | 0.934 | Coding, analysis |
| `claude-3-5-haiku-20241022` | Anthropic | $0.0008 | $0.004 | 0.794 | Fast tasks |
| `gemini-1.5-flash` | Google | $0.000075 | $0.0003 | 0.742 | Cheapest cloud |
| `mistral-large-latest` | Mistral | $0.003 | $0.009 | 0.852 | EU + quality |
| `deepseek-chat` | DeepSeek | $0.00014 | $0.00028 | 0.887 | Best value math/code |
| `llama3.1:70b` | Ollama | FREE | FREE | 0.823 | Local high-quality |

*(See the registry file for the complete list of supported models.)*

---

## Explainability & Observability

Unlike black-box routing systems, LLMOpt is completely transparent. You can ask the framework to explain exactly why it chose a specific model for a specific query without spending any money (`dry_run=True` or the `/explain` endpoint).

```python
explanation = client.explain(
    query="What is the capital of France?",
    budget_mode="cheap"
)
```

**Explanation Output:**

```text
=======================================================
LLMOpt Decision Explanation
=======================================================
Query complexity : 0.050 (trivial)
Primary domain   : factual
Selected model   : gemini-1.5-flash (google)
Fallback model   : gpt-4o-mini
Compression      : yes
System prompt    : minimal

Scoring rationale:
  • model=gemini-1.5-flash
  • capability=0.742
  • cost_norm=0.0042
  • J=-0.124 (α=0.6,β=0.3,γ=0.1)

Cost saved       : $0.009850 vs GPT-4o baseline
=======================================================
```
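
The scoring rationale above can be approximated with a small sketch of the objective `J(x) = α·Cost + β·Tokens - γ·Quality`. Everything beyond that formula is an assumption made for illustration: cost is normalized as the summed per-1K price divided by the most expensive candidate, the token term is a caller-supplied value, the capability score stands in for quality, and models below a required capability are filtered out first. The names `Candidate`, `score`, and `select_model` are hypothetical, and the sketch is not expected to reproduce the exact `J` or `cost_norm` values printed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    input_per_1k: float   # USD per 1K input tokens (from the table above)
    output_per_1k: float  # USD per 1K output tokens (from the table above)
    capability: float     # unified capability score in [0, 1]


# A few rows copied from the provider table.
REGISTRY = [
    Candidate("gpt-4o", 0.0025, 0.010, 0.930),
    Candidate("gpt-4o-mini", 0.00015, 0.0006, 0.784),
    Candidate("claude-3-5-sonnet-20241022", 0.003, 0.015, 0.934),
    Candidate("gemini-1.5-flash", 0.000075, 0.0003, 0.742),
]


def score(c: Candidate, max_price: float, token_norm: float,
          alpha: float, beta: float, gamma: float) -> float:
    """J = alpha*Cost + beta*Tokens - gamma*Quality (lower is better)."""
    cost_norm = (c.input_per_1k + c.output_per_1k) / max_price  # assumed normalization
    return alpha * cost_norm + beta * token_norm - gamma * c.capability


def select_model(candidates: list[Candidate], min_capability: float,
                 token_norm: float = 0.0, alpha: float = 0.6,
                 beta: float = 0.3, gamma: float = 0.1) -> Candidate:
    """Pick the eligible candidate with the lowest objective value."""
    eligible = [c for c in candidates if c.capability >= min_capability]
    if not eligible:
        raise ValueError("no model meets the required capability")
    # Guard against an all-free registry, where the maximum price would be zero.
    max_price = max(c.input_per_1k + c.output_per_1k for c in candidates) or 1.0
    return min(eligible,
               key=lambda c: score(c, max_price, token_norm, alpha, beta, gamma))


if __name__ == "__main__":
    for threshold in (0.05, 0.75, 0.90):
        print(threshold, select_model(REGISTRY, threshold).name)
```

Under these assumptions, raising the capability threshold pushes the choice from the cheapest model toward the flagship ones, which mirrors the routing behavior the README describes. In LLMOpt itself the weights `α`, `β`, and `γ` are tuned by Optuna rather than fixed as they are here.
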