| title: LLMOpt | |
| emoji: 🚀 | |
| colorFrom: blue | |
| colorTo: indigo | |
| sdk: docker | |
| pinned: false | |
| # LLMOpt: The Adaptive Inference Optimization Framework (V2) | |
| > **Intelligent Routing. Minimal Latency. Maximum ROI.** | |
| In the era of sprawling Large Language Models (LLMs), routing every query to a flagship model like GPT-4o or Claude 3.5 Sonnet is financially unsustainable and computationally wasteful. | |
| **LLMOpt** is an enterprise-grade middleware layer that sits between your application and your LLM providers. It analyzes the semantic complexity of each incoming query, selects the most cost-effective model capable of handling it, compresses the prompt to reduce token waste, and caches responses. Every decision is exposed for inspection, so you keep full observability into the routing process. | |
| ```text | |
| Your App → llmopt.generate(query) | |
| → [Semantic Cache → NLI Analyze → GBR Estimate → Bayesian Optimize → LLMLingua Compress → Route] | |
| → LLM API → Response | |
| ``` | |
| ## Table of Contents | |
| - [The V2 Architecture](#the-v2-architecture) | |
| - [Core ML Components](#core-ml-components) | |
| - [Graceful Degradation](#graceful-degradation) | |
| - [Quick Start & Installation](#quick-start--installation) | |
| - [Python SDK Usage](#python-sdk-usage) | |
| - [REST API Integration](#rest-api-integration) | |
| - [Supported Providers & Models](#supported-providers--models) | |
| - [Explainability & Observability](#explainability--observability) | |
| --- | |
| ## The V2 Architecture | |
| LLMOpt V2 replaces the static, heuristic-based router of V1 with a **machine learning-powered pipeline**. The framework acts as a funnel, progressively optimizing each request before it reaches an LLM provider. | |
| ```mermaid | |
| flowchart TD | |
| A[Incoming Query] --> B(Semantic Cache) | |
| B -->|Cache Hit| Z[Return Cached Response] | |
| B -->|Cache Miss| C(Query Analyzer) | |
| C --> D(Complexity Estimator) | |
| D --> E(Optimization Engine) | |
| E --> F(Prompt Optimizer) | |
| F --> G(Model Router) | |
| G --> H((LLM Provider)) | |
| H --> I[LLM-as-a-Judge Evaluator] | |
| I -->|Feedback Loop| E | |
| I --> Z | |
| ``` | |
| ### Pipeline Stages | |
| 1. **Semantic Cache**: Checks Redis for highly similar past queries. | |
| 2. **Query Analyzer**: Extracts structural features and semantic domains from the prompt. | |
| 3. **Complexity Estimator**: Predicts the cognitive load required to answer the query (0.0 to 1.0). | |
| 4. **Optimization Engine**: Minimizes a cost/quality objective function to select the model with the best trade-off for the query. | |
| 5. **Prompt Optimizer**: Intelligently compresses the prompt to shed unnecessary tokens. | |
| 6. **Model Router**: Dispatches the request via LiteLLM to OpenAI, Anthropic, Google, Ollama, etc. | |
| 7. **Evaluator (Optional)**: Scores the response quality and feeds it back to the optimization engine. | |
| --- | |
| ## Core ML Components | |
| The V2 release adds a machine learning component to each stage of the pipeline: | |
| ### 1. Zero-Shot NLI Query Analyzer | |
| Instead of relying on brittle regex patterns to decide whether a query is about "code" or "math", LLMOpt uses the `cross-encoder/nli-distilroberta-base` model from the Hugging Face Hub for zero-shot classification. Each query is scored against candidate domain labels as a natural language inference problem, so intent is categorized on the fly without a labeled training set. | |
| ### 2. Sentence-Transformer Semantic Cache | |
| Before spending API credits, the framework embeds the incoming query using a lightweight, local `all-MiniLM-L6-v2` model and compares it against a Redis-backed vector store using cosine similarity. If a stored query matches with a cosine similarity above 0.95, the cached response is served with **no API cost** and near-zero latency. | |
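The lookup logic can be illustrated with a minimal, standard-library sketch. It is not LLMOpt's implementation: the function names `cosine_similarity` and `lookup` are hypothetical, the cache is a plain in-memory list rather than Redis, and toy 3-dimensional vectors stand in for the 384-dimensional embeddings that `all-MiniLM-L6-v2` produces.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, treat it as dissimilar
    return dot / (norm_a * norm_b)

def lookup(query_vec, cache, threshold=0.95):
    """Return the best-matching cached response, or None on a cache miss."""
    best_score, best_response = 0.0, None
    for vec, response in cache:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    # Serve the cached answer only when similarity exceeds the threshold
    return best_response if best_score > threshold else None

# Hypothetical in-memory cache of (embedding, response) pairs
cache = [
    ([1.0, 0.0, 0.0], "TCP is connection-oriented..."),
    ([0.0, 1.0, 0.0], "Paris"),
]

hit = lookup([0.99, 0.05, 0.0], cache)   # nearly parallel to the first entry
miss = lookup([0.7, 0.7, 0.0], cache)    # roughly 45 degrees from both entries
```

Here `hit` is the first cached response and `miss` is `None`, because a vector halfway between two entries has a similarity of about 0.71 to each, well below the 0.95 threshold.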
| ### 3. Gradient Boosting Complexity Estimator | |
| To predict how "hard" a query is, LLMOpt uses a `scikit-learn` Gradient Boosting Regressor (GBR) trained on hundreds of annotated examples. The predicted score sets the minimum capability a model must have, so "What is Python?" is routed to a fast, cheap model, while "Implement a distributed Paxos consensus algorithm" is routed to a flagship reasoning model. | |
| ### 4. Bayesian Weight Optimization (Optuna) | |
| The Optimization Engine selects models by minimizing the objective function: | |
| `J(x) = α·Cost + β·Tokens - γ·Quality` | |
| Instead of hardcoding `α`, `β`, and `γ`, LLMOpt integrates **Optuna**. Using real-world feedback from the LLM evaluator, Optuna applies Bayesian optimization to keep adjusting these weights toward higher-quality responses at lower cost. This is an empirical search, so it improves the trade-off over time rather than guaranteeing an optimum. | |
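To make the objective concrete, here is a small sketch of scoring candidates with `J` and picking the minimum. It rests on several assumptions that the README does not specify: the dictionary layout, the helper names `objective` and `select_model`, normalizing cost by the most expensive candidate's output price, and holding the token term equal across models. The Optuna weight search itself is omitted.

```python
def objective(cost, tokens, quality, alpha, beta, gamma):
    """J = alpha*Cost + beta*Tokens - gamma*Quality (lower is better)."""
    return alpha * cost + beta * tokens - gamma * quality

def select_model(candidates, alpha, beta, gamma):
    """Return the name of the candidate with the lowest J."""
    best = min(
        candidates,
        key=lambda m: objective(
            m["cost_norm"], m["tokens_norm"], m["quality"], alpha, beta, gamma
        ),
    )
    return best["name"]

# Assumed inputs: cost_norm is the output price divided by the most expensive
# candidate's output price; tokens_norm is a placeholder kept equal for all.
candidates = [
    {"name": "gpt-4o", "cost_norm": 0.010 / 0.015, "tokens_norm": 0.5, "quality": 0.930},
    {"name": "claude-3-5-sonnet-20241022", "cost_norm": 1.0, "tokens_norm": 0.5, "quality": 0.934},
    {"name": "gemini-1.5-flash", "cost_norm": 0.0003 / 0.015, "tokens_norm": 0.5, "quality": 0.742},
]

cost_weighted = select_model(candidates, alpha=0.6, beta=0.3, gamma=0.1)
quality_weighted = select_model(candidates, alpha=0.1, beta=0.1, gamma=0.8)
```

With the cost-heavy weights the sketch selects `gemini-1.5-flash`; with the quality-heavy weights it selects `gpt-4o`, which shows how shifting `α`, `β`, and `γ` changes the routing decision without touching the model list.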
| ### 5. LLMLingua Semantic Compression | |
| Large prompts are expensive. LLMOpt integrates Microsoft's LLMLingua-2 to prune tokens semantically: it removes non-essential tokens (filler words, redundant context) from the prompt while preserving its core meaning, reducing input costs by up to 40% before the LLM is called. | |
| ### 6. LLM-as-a-Judge Evaluation Loop | |
| When explicitly requested (`evaluate=True`), LLMOpt uses a highly efficient judge model (`gpt-4o-mini`) to score the returned response across Accuracy, Completeness, Clarity, and Conciseness. This score is automatically fed back into the Bayesian Optimizer to improve future routing decisions. | |
| --- | |
| ## Graceful Degradation | |
| Enterprise systems must be resilient. **LLMOpt is designed to keep running when an ML dependency is missing or unavailable.** | |
| If you choose not to install the heavy `[ml]` dependencies (like PyTorch or sentence-transformers), or if your Redis cache goes offline, LLMOpt **falls back to its V1 heuristic rules** without raising an error. Your application keeps routing requests, with rule-based decisions in place of the ML components. | |
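The fallback pattern described here is commonly built around an optional import. The sketch below shows one way it could look; it is an illustration under assumptions, not LLMOpt's code. The names `load_optional`, `heuristic_complexity`, and `build_estimator` are hypothetical, and the keyword rule is invented for the example.

```python
import importlib

def load_optional(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None

def heuristic_complexity(query):
    """Hypothetical V1-style rule: length plus technical keywords, capped at 1.0."""
    keywords = ("implement", "design", "distributed", "prove", "optimize")
    score = min(len(query.split()) / 50.0, 0.5)
    score += 0.25 * sum(k in query.lower() for k in keywords)
    return min(score, 1.0)

def build_estimator(module_name, ml_factory):
    """Pick the ML estimator when its dependency imports, else the heuristic."""
    module = load_optional(module_name)
    if module is None:
        return heuristic_complexity, "heuristic"
    return ml_factory(module), "ml"

# A module that does not exist triggers the fallback path
estimator, mode = build_estimator(
    "llmopt_missing_dependency", lambda module: (lambda query: 0.5)
)
easy = estimator("What is Python?")
hard = estimator("Implement a distributed Paxos consensus algorithm")
```

Because the choice is made once when the estimator is built, the rest of the pipeline calls `estimator(query)` the same way in both modes and never needs to know which path is active.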
| --- | |
| ## Quick Start & Installation | |
| ### Requirements | |
| - Python 3.10+ | |
| - At least one API key (OpenAI, Anthropic, Google, Mistral, DeepSeek) OR a local Ollama instance. | |
| ### Installation | |
| ```bash | |
| # Clone the repository | |
| git clone https://github.com/Shrot101/llmopt.git | |
| cd llmopt | |
| # Install with all Machine Learning capabilities (Highly Recommended for V2) | |
| pip install -e ".[ml]" | |
| # Install Core only (uses V1 heuristic fallbacks) | |
| pip install -e . | |
| # Install with Local Model support | |
| pip install -e ".[ml,local]" | |
| ``` | |
| ### Configuration | |
| Copy the environment template and add your API keys. You only need to provide keys for the providers you intend to use. | |
| ```bash | |
| cp config/.env.example config/.env | |
| ``` | |
| ```env | |
| OPENAI_API_KEY=sk-... | |
| ANTHROPIC_API_KEY=sk-ant-... | |
| GEMINI_API_KEY=AIza... | |
| OLLAMA_API_BASE=http://localhost:11434 | |
| # Required for V2 Semantic Caching | |
| REDIS_URL=redis://localhost:6379/0 | |
| ``` | |
| --- | |
| ## Python SDK Usage | |
| ### Basic Generation | |
| ```python | |
| from llmopt import LLMOpt | |
| client = LLMOpt() | |
| # The framework handles analysis, optimization, and routing automatically | |
| result = client.generate( | |
|     query="Explain the difference between TCP and UDP", | |
|     budget_mode="balanced",  # Options: "cheap" | "balanced" | "quality" | |
| ) | |
| print(result.response) | |
| print(f"Model used : {result.model_used}") | |
| print(f"Cost : ${result.estimated_cost:.6f}") | |
| print(f"Tokens saved: {result.tokens_saved}") | |
| ``` | |
| ### Advanced Constraints & Evaluation | |
| ```python | |
| result = client.generate( | |
|     query="Design a highly available distributed rate limiter.", | |
|     budget_mode="quality", | |
|     # Hard cap: never spend more than this per request (USD) | |
|     max_cost_per_request=0.01, | |
|     # Provider filtering (the exclusion is redundant once an allow-list is set) | |
|     exclude_providers=["openai"], | |
|     only_providers=["anthropic", "google"], | |
|     # Opt in to the LLM-as-a-judge feedback loop | |
|     evaluate=True, | |
|     # dry_run=True runs the full optimization pipeline but skips the actual API call | |
|     dry_run=False, | |
| ) | |
| if result.evaluation: | |
|     print(f"Quality Score: {result.evaluation.overall}/10") | |
| ``` | |
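How these constraints narrow the candidate set can be sketched as a simple filter. This is a guess at the behavior, not the framework's actual logic: the helper names, the dictionary layout, and the expected token counts (500 input, 1000 output) are assumptions, while the prices are the per-1k figures from the model table in this README.

```python
def estimate_cost(model, input_tokens, output_tokens):
    """USD cost of one request from per-1k-token prices."""
    return (
        input_tokens * model["input_per_1k"] + output_tokens * model["output_per_1k"]
    ) / 1000.0

def filter_candidates(models, input_tokens, output_tokens,
                      only_providers=None, exclude_providers=None, max_cost=None):
    """Keep models that pass the provider filters and the per-request cost cap."""
    kept = []
    for m in models:
        if only_providers is not None and m["provider"] not in only_providers:
            continue
        if exclude_providers is not None and m["provider"] in exclude_providers:
            continue
        if max_cost is not None and estimate_cost(m, input_tokens, output_tokens) > max_cost:
            continue
        kept.append(m["name"])
    return kept

models = [
    {"name": "gpt-4o", "provider": "openai", "input_per_1k": 0.0025, "output_per_1k": 0.010},
    {"name": "claude-3-5-sonnet-20241022", "provider": "anthropic", "input_per_1k": 0.003, "output_per_1k": 0.015},
    {"name": "claude-3-5-haiku-20241022", "provider": "anthropic", "input_per_1k": 0.0008, "output_per_1k": 0.004},
    {"name": "gemini-1.5-flash", "provider": "google", "input_per_1k": 0.000075, "output_per_1k": 0.0003},
]

allowed = filter_candidates(
    models, input_tokens=500, output_tokens=1000,
    only_providers=["anthropic", "google"],
    exclude_providers=["openai"],
    max_cost=0.01,
)
```

Under these assumed token counts, `claude-3-5-sonnet-20241022` would cost $0.0165 and is dropped by the $0.01 cap, leaving `claude-3-5-haiku-20241022` and `gemini-1.5-flash`. A tight cost cap can therefore override `budget_mode="quality"` for long responses.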
| --- | |
| ## REST API Integration | |
| LLMOpt includes a built-in FastAPI server for easy integration into non-Python architectures. | |
| ### Start the server | |
| ```bash | |
| python run.py --host 0.0.0.0 --port 8000 | |
| ``` | |
| ### `POST /generate` | |
| **Request Body:** | |
| ```json | |
| { | |
| "query": "Explain quantum computing", | |
| "budget_mode": "balanced", | |
| "api_keys": { | |
| "openai": "sk-...", | |
| "anthropic": "sk-ant-...", | |
| "google": "AIza..." | |
| } | |
| } | |
| ``` | |
| > [!TIP] | |
| > **BYOK (Bring Your Own Key) Mode:** | |
| > This server is configured as a public utility. It provides the **Routing Intelligence** and **Shared Semantic Cache**, but you must provide your own provider API keys in the `api_keys` object. The server does not store your keys; they are used only for the duration of the request. | |
| ### Example cURL | |
| ```bash | |
| curl -X POST http://localhost:8000/generate \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "query": "Write a recursive Fibonacci function in Rust", | |
| "budget_mode": "balanced", | |
| "evaluate": true | |
| }' | |
| ``` | |
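The same request can be built from Python with only the standard library. This is a sketch: the helper name `build_generate_request` is hypothetical, and the payload fields are limited to those shown in the request examples above. The network call is left commented out because it needs a running server.

```python
import json
import urllib.request

def build_generate_request(base_url, query, budget_mode="balanced",
                           evaluate=False, api_keys=None):
    """Build a POST /generate request; api_keys is included only when given."""
    payload = {"query": query, "budget_mode": budget_mode, "evaluate": evaluate}
    if api_keys:
        payload["api_keys"] = api_keys
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_generate_request(
    "http://localhost:8000",
    "Write a recursive Fibonacci function in Rust",
    evaluate=True,
)

# Sending requires a running LLMOpt server:
# with urllib.request.urlopen(request, timeout=30) as resp:
#     result = json.loads(resp.read())
#     print(result["model_used"], result["estimated_cost"])
```

When calling a server that runs in BYOK mode, pass your provider keys through the `api_keys` argument so they are sent in the request body.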
| **The response payload reports the routing and optimization details alongside the answer:** | |
| ```json | |
| { | |
| "response": "Here is the Rust implementation...", | |
| "model_used": "claude-3-5-haiku-20241022", | |
| "provider": "anthropic", | |
| "input_tokens": 105, | |
| "output_tokens": 342, | |
| "total_tokens": 447, | |
| "estimated_cost": 0.001452, | |
| "tokens_saved": 28, | |
| "compression_ratio": 0.21, | |
| "complexity_score": 0.62, | |
| "complexity_tier": "hard", | |
| "latency_ms": 1140, | |
| "evaluation": { | |
| "overall": 9.5, | |
| "accuracy": 10.0, | |
| "feedback": "The code is idiomatic and correctly implements recursion." | |
| } | |
| } | |
| ``` | |
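The cost and compression figures in this sample can be reproduced by hand, which helps when reading the fields. The check below assumes two things the README does not state: that `input_tokens` is the count after compression, and that `compression_ratio` is the share of the original input tokens that was removed. The prices are the `claude-3-5-haiku-20241022` rates listed under Supported Providers & Models.

```python
# Values copied from the sample response
input_tokens, output_tokens, tokens_saved = 105, 342, 28

# claude-3-5-haiku-20241022 prices in USD per 1k tokens
input_price, output_price = 0.0008, 0.004

# Cost: tokens times the per-1k price, divided by 1000
estimated_cost = (input_tokens * input_price + output_tokens * output_price) / 1000

# Assumed definition: tokens removed divided by tokens before compression
compression_ratio = tokens_saved / (input_tokens + tokens_saved)

print(f"{estimated_cost:.6f}")     # 0.001452
print(f"{compression_ratio:.2f}")  # 0.21
```

Both results match the payload, so under these assumptions the output tokens account for most of the cost ($0.001368 of $0.001452), and compression only affects the smaller input share.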
| --- | |
| ## Supported Providers & Models | |
| The routing engine dynamically compares models across providers based on their unified capability scores and per-token pricing. Add or update models simply by modifying `data/model_registry.json`. | |
| Model | Provider | Input ($/1k tokens) | Output ($/1k tokens) | Capability | Best For | |
| |-------|----------|-----------|------------|------------|----------| | |
| | `gpt-4o` | OpenAI | $0.0025 | $0.010 | 0.930 | Complex reasoning | | |
| | `gpt-4o-mini` | OpenAI | $0.00015 | $0.0006 | 0.784 | Balanced tasks | | |
| | `claude-3-5-sonnet-20241022` | Anthropic | $0.003 | $0.015 | 0.934 | Coding, analysis | | |
| | `claude-3-5-haiku-20241022` | Anthropic | $0.0008 | $0.004 | 0.794 | Fast tasks | | |
| | `gemini-1.5-flash` | Google | $0.000075 | $0.0003 | 0.742 | Cheapest cloud | | |
| | `mistral-large-latest` | Mistral | $0.003 | $0.009 | 0.852 | EU + quality | | |
| | `deepseek-chat` | DeepSeek | $0.00014 | $0.00028 | 0.887 | Best value math/code | | |
| | `llama3.1:70b` | Ollama | FREE | FREE | 0.823 | Local high-quality | | |
| *(See the registry file for the complete list of supported models).* | |
| --- | |
| ## Explainability & Observability | |
| Unlike black-box routing systems, LLMOpt exposes its reasoning. You can ask the framework to explain why it chose a specific model for a specific query without making a paid provider call, using `client.explain()`, `dry_run=True`, or the `/explain` endpoint. | |
| ```python | |
| explanation = client.explain( | |
| query="What is the capital of France?", | |
| budget_mode="cheap" | |
| ) | |
| ``` | |
| **Explanation Output:** | |
| ```text | |
| ======================================================= | |
| LLMOpt Decision Explanation | |
| ======================================================= | |
| Query complexity : 0.050 (trivial) | |
| Primary domain : factual | |
| Selected model : gemini-1.5-flash (google) | |
| Fallback model : gpt-4o-mini | |
| Compression : yes | |
| System prompt : minimal | |
| Scoring rationale: | |
| • model=gemini-1.5-flash | |
| • capability=0.742 | |
| • cost_norm=0.0042 | |
| • J=-0.124 (α=0.6,β=0.3,γ=0.1) | |
| Cost saved : $0.009850 vs GPT-4o baseline | |
| ======================================================= | |
| ``` | |