	---
	title: LLMOpt
	emoji: 🚀
	colorFrom: blue
	colorTo: indigo
	sdk: docker
	pinned: false
	---

	# LLMOpt: The Adaptive Inference Optimization Framework (V2)

	> Intelligent Routing. Minimal Latency. Maximum ROI.

	In the era of sprawling Large Language Models (LLMs), routing every query to a flagship model like GPT-4o or Claude 3.5 Sonnet is financially unsustainable and computationally wasteful.

	LLMOpt is an enterprise-grade middleware layer that sits between your application and your LLM providers. By dynamically analyzing the semantic complexity of incoming queries, LLMOpt automatically selects the most cost-effective model capable of handling the request, compresses context windows to reduce token waste, and caches responses—all while giving you full observability into its decision-making process.

	```text
	Your App → llmopt.generate(query)
	→ [Semantic Cache → NLI Analyze → GBR Estimate → Bayesian Optimize → LLMLingua Compress → Route]
	→ LLM API → Response
	```

	## Table of Contents

	- [The V2 Architecture](#the-v2-architecture)
	- [Core ML Components](#core-ml-components)
	- [Graceful Degradation](#graceful-degradation)
	- [Quick Start & Installation](#quick-start--installation)
	- [Python SDK Usage](#python-sdk-usage)
	- [REST API Integration](#rest-api-integration)
	- [Supported Providers & Models](#supported-providers--models)
	- [Explainability & Observability](#explainability--observability)

	---

	## The V2 Architecture

	LLMOpt V2 has transitioned from a static, heuristic-based router to a fully Machine Learning-powered pipeline. The framework acts as an intelligent funnel, progressively optimizing the request before it ever reaches an LLM provider.

	```mermaid
	flowchart TD
	A[Incoming Query] --> B(Semantic Cache)
	B -->|Cache Hit| Z[Return Cached Response]
	B -->|Cache Miss| C(Query Analyzer)
	C --> D(Complexity Estimator)
	D --> E(Optimization Engine)
	E --> F(Prompt Optimizer)
	F --> G(Model Router)
	G --> H((LLM Provider))
	H --> I[LLM-as-a-Judge Evaluator]
	I -->|Feedback Loop| E
	I --> Z
	```

	### Pipeline Stages
	1. Semantic Cache: Checks Redis for highly similar past queries.
	2. Query Analyzer: Extracts structural features and semantic domains from the prompt.
	3. Complexity Estimator: Predicts the cognitive load required to answer the query (0.0 to 1.0).
	4. Optimization Engine: Minimizes a cost/quality objective function and picks the best-scoring model.
	5. Prompt Optimizer: Intelligently compresses the prompt to shed unnecessary tokens.
	6. Model Router: Dispatches the request via LiteLLM to OpenAI, Anthropic, Google, Ollama, etc.
	7. Evaluator (Optional): Scores the response quality and feeds it back to the optimization engine.
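
The stage order above can be sketched as a simple chain. This is an illustrative outline only, not LLMOpt's actual code: `run_pipeline`, the toy `analyze` and `estimate` stages, and the exact-match dictionary standing in for the Redis-backed semantic cache are all hypothetical.

```python
def analyze(ctx):
    # Toy analyzer: record the word count as the only "feature".
    ctx["n_words"] = len(ctx["query"].split())
    return ctx


def estimate(ctx):
    # Toy complexity score in [0.0, 1.0]; the real estimator is a trained model.
    ctx["complexity"] = min(1.0, ctx["n_words"] / 50)
    return ctx


def run_pipeline(query, cache, stages, call_llm):
    """Run the stages in order, short-circuiting on a cache hit."""
    trace = ["semantic_cache"]
    if query in cache:  # cache hit: every later stage is skipped
        return {"response": cache[query], "cache_hit": True, "trace": trace}
    ctx = {"query": query, "cache_hit": False, "trace": trace}
    for stage in stages:
        ctx = stage(ctx)
        trace.append(stage.__name__)
    ctx["response"] = call_llm(ctx)
    cache[query] = ctx["response"]  # store for future identical queries
    return ctx


calls = []

def fake_llm(ctx):
    # Stand-in for the provider call so the sketch runs offline.
    calls.append(ctx["query"])
    return f"answer to: {ctx['query']}"


cache = {}
first = run_pipeline("What is Python?", cache, [analyze, estimate], fake_llm)
second = run_pipeline("What is Python?", cache, [analyze, estimate], fake_llm)
print(first["trace"])       # ['semantic_cache', 'analyze', 'estimate']
print(second["cache_hit"])  # True
```

The second call never reaches `fake_llm`, which is the behavior the cache-hit branch in the diagram describes.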

	---

	## Core ML Components

	The V2 release introduces state-of-the-art machine learning to every layer of the pipeline:

	### 1. Zero-Shot NLI Query Analyzer
	Instead of relying on brittle regex patterns to determine if a query is asking for "code" or "math," LLMOpt utilizes HuggingFace's `cross-encoder/nli-distilroberta-base`. This semantic reasoning engine accurately categorizes query intent on the fly without requiring labeled datasets.

	### 2. Sentence-Transformer Semantic Cache
	Before spending API credits, the framework embeds the incoming query using a lightweight, local `all-MiniLM-L6-v2` model and compares it against a Redis-backed vector store using cosine similarity. If an existing query matches with a cosine similarity above 0.95, the cached response is served with no API cost and near-zero latency.
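
A minimal sketch of the lookup rule, assuming embeddings are plain lists of floats. The `SemanticCache` class and its in-memory list are hypothetical stand-ins: in LLMOpt the vectors would come from `all-MiniLM-L6-v2` and live in Redis, and the toy 2-D vectors below are for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """In-memory stand-in for the Redis-backed vector store (hypothetical)."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self._entries = []  # list of (embedding, response) pairs

    def put(self, embedding, response):
        self._entries.append((embedding, response))

    def get(self, embedding):
        """Return the best-matching response, or None below the threshold."""
        best_score, best_response = 0.0, None
        for stored, response in self._entries:
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "cached answer")

hit = cache.get([0.99, 0.05])   # nearly parallel vector: similarity ~0.999
miss = cache.get([0.5, 0.5])    # 45 degrees apart: similarity ~0.707
```

A linear scan is fine for a sketch; a production store would use an indexed vector search instead.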

	### 3. Gradient Boosting Complexity Estimator
	To predict how "hard" a query is, LLMOpt leverages a `scikit-learn` Gradient Boosting Regressor (GBR) trained on hundreds of annotated examples. It accurately scales the required capability threshold, ensuring that "What is Python?" gets routed to a fast, cheap model, while "Implement a distributed Paxos consensus algorithm" gets routed to a flagship reasoning model.

	### 4. Bayesian Weight Optimization (Optuna)
	The Optimization Engine selects models by minimizing the objective function:
	`J(x) = α·Cost + β·Tokens - γ·Quality`
	Instead of hardcoding `α`, `β`, and `γ`, LLMOpt integrates Optuna. Using real-world feedback from the LLM evaluator, Optuna applies Bayesian optimization to keep adjusting these weights, steering routing toward the best available quality at the lowest price. This is an empirical search: it improves the cost/quality trade-off over time rather than guaranteeing an optimum.
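
To make the objective concrete, here is a small sketch of how `J(x)` could rank candidates for fixed weights. The normalization is an assumption made for this example (cost is each model's input price divided by the highest input price in the list, and the token term is held constant at 0.5); the README does not specify how LLMOpt scales these terms. Prices and capability scores are taken from the model table in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    input_price: float   # USD per 1k input tokens
    capability: float    # used here as the Quality term


CANDIDATES = [
    Candidate("gpt-4o", 0.0025, 0.930),
    Candidate("gpt-4o-mini", 0.00015, 0.784),
    Candidate("claude-3-5-sonnet-20241022", 0.003, 0.934),
    Candidate("gemini-1.5-flash", 0.000075, 0.742),
]

MAX_PRICE = max(c.input_price for c in CANDIDATES)
TOKENS_NORM = 0.5  # assumption: same prompt, so the token term is constant


def objective(c, alpha, beta, gamma):
    """J(x) = alpha*Cost + beta*Tokens - gamma*Quality (lower is better)."""
    cost_norm = c.input_price / MAX_PRICE
    return alpha * cost_norm + beta * TOKENS_NORM - gamma * c.capability


def select_model(candidates, alpha, beta, gamma):
    return min(candidates, key=lambda c: objective(c, alpha, beta, gamma))


cost_focused = select_model(CANDIDATES, alpha=0.6, beta=0.3, gamma=0.1)
quality_focused = select_model(CANDIDATES, alpha=0.1, beta=0.1, gamma=0.8)
print(cost_focused.name)     # gemini-1.5-flash
print(quality_focused.name)  # gpt-4o
```

Shifting weight from `α` to `γ` moves the winner from the cheapest model to a flagship one, which is the lever the weight tuning adjusts.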

	### 5. LLMLingua Semantic Compression
	Large context windows are expensive. LLMOpt integrates Microsoft's `llmlingua-2` to perform semantic token pruning. It identifies and removes non-essential tokens (filler words, redundant context) from the prompt while preserving the core semantic meaning, reducing input costs by up to 40% before the LLM is even called.

	### 6. LLM-as-a-Judge Evaluation Loop
	When explicitly requested (`evaluate=True`), LLMOpt uses a highly efficient judge model (`gpt-4o-mini`) to score the returned response across Accuracy, Completeness, Clarity, and Conciseness. This score is automatically fed back into the Bayesian Optimizer to improve future routing decisions.

	---

	## Graceful Degradation

	Enterprise systems must be resilient. LLMOpt is designed to never crash if an ML dependency is missing or unavailable.

	If you choose not to install the heavy `[ml]` dependencies (like PyTorch or sentence-transformers), or if your Redis cache goes offline, LLMOpt silently and seamlessly falls back to its robust V1 heuristic rules. This ensures that your application continues to route requests efficiently under all circumstances.
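
The fallback behavior can be sketched as follows. This is an assumption-laden illustration, not the project's code: `heuristic_complexity`, its keyword list, and `estimate_complexity` are hypothetical stand-ins for the V1 rules and the ML estimator.

```python
import importlib.util

# Detect the optional dependency without importing it (and without crashing).
ML_AVAILABLE = importlib.util.find_spec("sentence_transformers") is not None

# Hypothetical keyword list; the actual V1 rules are not shown in this README.
HARD_KEYWORDS = ("implement", "design", "prove", "distributed", "optimize")


def heuristic_complexity(query):
    """Rule-based score in [0.0, 1.0] from query length and keywords."""
    words = query.lower().split()
    length_term = min(len(words) / 40, 0.5)
    keyword_term = 0.5 if any(k in words for k in HARD_KEYWORDS) else 0.0
    return length_term + keyword_term


def estimate_complexity(query, ml_estimator=None):
    """Prefer the ML estimator; fall back to heuristics if it is absent or fails."""
    if ml_estimator is not None:
        try:
            return float(ml_estimator(query))
        except Exception:
            pass  # degrade quietly, mirroring the behavior described above
    return heuristic_complexity(query)


def broken_estimator(query):
    raise RuntimeError("model weights not available")


easy = estimate_complexity("What is Python?", broken_estimator)
hard = estimate_complexity("Implement a distributed Paxos consensus algorithm")
```

In a real deployment, logging the swallowed exception at warning level would keep the fallback observable.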

	---

	## Quick Start & Installation

	### Requirements
	- Python 3.10+
	- At least one API key (OpenAI, Anthropic, Google, Mistral, DeepSeek) OR a local Ollama instance.

	### Installation

	```bash
	# Clone the repository
	git clone https://github.com/Shrot101/llmopt.git
	cd llmopt

	# Install with all Machine Learning capabilities (Highly Recommended for V2)
	pip install -e ".[ml]"

	# Install Core only (uses V1 heuristic fallbacks)
	pip install -e .

	# Install with Local Model support
	pip install -e ".[ml,local]"
	```

	### Configuration
	Copy the environment template and add your API keys. You only need to provide keys for the providers you intend to use.

	```bash
	cp config/.env.example config/.env
	```

	```env
	OPENAI_API_KEY=sk-...
	ANTHROPIC_API_KEY=sk-ant-...
	GEMINI_API_KEY=AIza...
	OLLAMA_API_BASE=http://localhost:11434

	# Required for V2 Semantic Caching
	REDIS_URL=redis://localhost:6379/0
	```

	---

	## Python SDK Usage

	### Basic Generation

	```python
	from llmopt import LLMOpt

	client = LLMOpt()

	# The framework handles analysis, optimization, and routing automatically
	result = client.generate(
	    query="Explain the difference between TCP and UDP",
	    budget_mode="balanced",  # Options: "cheap" | "balanced" | "quality"
	)

	print(result.response)
	print(f"Model used : {result.model_used}")
	print(f"Cost : ${result.estimated_cost:.6f}")
	print(f"Tokens saved: {result.tokens_saved}")
	```

	### Advanced Constraints & Evaluation

	```python
	result = client.generate(
	    query="Design a highly available distributed rate limiter.",
	    budget_mode="quality",

	    # Hard cap: never spend more than this per request (USD)
	    max_cost_per_request=0.01,

	    # Provider filtering
	    exclude_providers=["openai"],
	    only_providers=["anthropic", "google"],

	    # Opt-in to the LLM-as-a-judge feedback loop
	    evaluate=True,

	    # dry_run=True → runs full optimization pipeline but skips the actual API call
	    dry_run=False,
	)

	if result.evaluation:
	    print(f"Quality Score: {result.evaluation.overall}/10")
	```
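
One way to picture how these constraints narrow the candidate set is the sketch below. `filter_candidates`, the registry layout, and the assumed token counts (500 in, 1000 out) are hypothetical; only the parameter names and the prices come from this README.

```python
# Prices in USD per 1k tokens, copied from the model table in this README.
REGISTRY = [
    {"name": "gpt-4o", "provider": "openai", "in": 0.0025, "out": 0.010},
    {"name": "claude-3-5-sonnet-20241022", "provider": "anthropic", "in": 0.003, "out": 0.015},
    {"name": "claude-3-5-haiku-20241022", "provider": "anthropic", "in": 0.0008, "out": 0.004},
    {"name": "gemini-1.5-flash", "provider": "google", "in": 0.000075, "out": 0.0003},
]


def estimated_cost(model, input_tokens, output_tokens):
    """Estimated request cost in USD from per-1k-token prices."""
    return (input_tokens * model["in"] + output_tokens * model["out"]) / 1000


def filter_candidates(models, *, only_providers=None, exclude_providers=None,
                      max_cost=None, input_tokens=500, output_tokens=1000):
    """Return the names of models that satisfy every constraint."""
    kept = []
    for m in models:
        if only_providers is not None and m["provider"] not in only_providers:
            continue
        if exclude_providers and m["provider"] in exclude_providers:
            continue
        if max_cost is not None and estimated_cost(m, input_tokens, output_tokens) > max_cost:
            continue
        kept.append(m["name"])
    return kept


allowed = filter_candidates(
    REGISTRY,
    only_providers=["anthropic", "google"],
    exclude_providers=["openai"],
    max_cost=0.01,
)
print(allowed)  # ['claude-3-5-haiku-20241022', 'gemini-1.5-flash']
```

Under these assumed token counts, `claude-3-5-sonnet-20241022` is estimated at $0.0165 and is dropped by the $0.01 cap even though its provider is allowed.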

	---

	## REST API Integration

	LLMOpt includes a built-in FastAPI server for easy integration into non-Python architectures.

	### Start the server
	```bash
	python run.py --host 0.0.0.0 --port 8000
	```

	### `POST /generate`

	Request Body:
	```json
	{
	  "query": "Explain quantum computing",
	  "budget_mode": "balanced",
	  "api_keys": {
	    "openai": "sk-...",
	    "anthropic": "sk-ant-...",
	    "google": "AIza..."
	  }
	}
	```

	> [!TIP]
	> BYOK (Bring Your Own Key) Mode:
	> This server is configured as a public utility. It provides the Routing Intelligence and Shared Semantic Cache, but you must provide your own provider API keys in the `api_keys` object. The server does not store your keys; they are used only for the duration of the request.

	### Example cURL
	```bash
	curl -X POST http://localhost:8000/generate \
	  -H "Content-Type: application/json" \
	  -d '{
	    "query": "Write a recursive Fibonacci function in Rust",
	    "budget_mode": "balanced",
	    "evaluate": true
	  }'
	```
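
The same request can be built from Python with only the standard library. This is a sketch: `build_generate_request` is a hypothetical helper, and the field names (`query`, `budget_mode`, `evaluate`, `api_keys`) are the ones shown in the request examples above.

```python
import json
import urllib.request


def build_generate_request(base_url, query, budget_mode="balanced",
                           api_keys=None, evaluate=False):
    """Build (but do not send) a POST /generate request."""
    payload = {"query": query, "budget_mode": budget_mode, "evaluate": evaluate}
    if api_keys:
        payload["api_keys"] = api_keys  # BYOK: keys travel with the request
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_generate_request(
    "http://localhost:8000",
    "Write a recursive Fibonacci function in Rust",
    evaluate=True,
)

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     result = json.loads(resp.read())
#     print(result["model_used"], result["estimated_cost"])
```

Separating request construction from sending keeps the payload easy to inspect before any provider key leaves your machine.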

	Response payload includes deep insights into the optimization process:
	```json
	{
	  "response": "Here is the Rust implementation...",
	  "model_used": "claude-3-5-haiku-20241022",
	  "provider": "anthropic",
	  "input_tokens": 105,
	  "output_tokens": 342,
	  "total_tokens": 447,
	  "estimated_cost": 0.001452,
	  "tokens_saved": 28,
	  "compression_ratio": 0.21,
	  "complexity_score": 0.62,
	  "complexity_tier": "hard",
	  "latency_ms": 1140,
	  "evaluation": {
	    "overall": 9.5,
	    "accuracy": 10.0,
	    "feedback": "The code is idiomatic and correctly implements recursion."
	  }
	}
	```

	---

	## Supported Providers & Models

	The routing engine dynamically compares models across providers based on their unified capability scores and per-token pricing. Add or update models simply by modifying `data/model_registry.json`.

	| Model | Provider | Input $/1k | Output $/1k | Capability | Best For |
	|-------|----------|-----------|------------|------------|----------|
	| `gpt-4o` | OpenAI | $0.0025 | $0.010 | 0.930 | Complex reasoning |
	| `gpt-4o-mini` | OpenAI | $0.00015 | $0.0006 | 0.784 | Balanced tasks |
	| `claude-3-5-sonnet-20241022` | Anthropic | $0.003 | $0.015 | 0.934 | Coding, analysis |
	| `claude-3-5-haiku-20241022` | Anthropic | $0.0008 | $0.004 | 0.794 | Fast tasks |
	| `gemini-1.5-flash` | Google | $0.000075 | $0.0003 | 0.742 | Cheapest cloud |
	| `mistral-large-latest` | Mistral | $0.003 | $0.009 | 0.852 | EU + quality |
	| `deepseek-chat` | DeepSeek | $0.00014 | $0.00028 | 0.887 | Best value math/code |
	| `llama3.1:70b` | Ollama | FREE | FREE | 0.823 | Local high-quality |

	(See the registry file for the complete list of supported models).
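
As a worked check of the per-token pricing, the sample `/generate` response in this README (105 input tokens, 342 output tokens on `claude-3-5-haiku-20241022`) can be reproduced from the table's prices. The helper name is hypothetical, and the compression-ratio formula is an inference from the sample numbers, not documented behavior.

```python
def estimated_cost(input_tokens, output_tokens, in_price_per_1k, out_price_per_1k):
    """Request cost in USD given per-1k-token prices."""
    return (input_tokens * in_price_per_1k + output_tokens * out_price_per_1k) / 1000


# claude-3-5-haiku-20241022: $0.0008 per 1k input, $0.004 per 1k output
cost = estimated_cost(105, 342, 0.0008, 0.004)
print(f"{cost:.6f}")  # 0.001452

# Assumption: ratio = tokens removed / tokens before compression.
tokens_saved, input_tokens = 28, 105
ratio = tokens_saved / (input_tokens + tokens_saved)
print(f"{ratio:.2f}")  # 0.21
```

Both values agree with the `estimated_cost` and `compression_ratio` fields in the sample response.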

	---

	## Explainability & Observability

	Unlike black-box routing systems, LLMOpt is completely transparent. You can ask the framework to explain exactly why it chose a specific model for a specific query without spending any money (`dry_run=True` or the `/explain` endpoint).

	```python
	explanation = client.explain(
	    query="What is the capital of France?",
	    budget_mode="cheap",
	)
	```

	Explanation Output:
	```text
	=======================================================
	LLMOpt Decision Explanation
	=======================================================
	Query complexity : 0.050 (trivial)
	Primary domain : factual

	Selected model : gemini-1.5-flash (google)
	Fallback model : gpt-4o-mini
	Compression : yes
	System prompt : minimal

	Scoring rationale:
	• model=gemini-1.5-flash
	• capability=0.742
	• cost_norm=0.0042
	• J=-0.124 (α=0.6,β=0.3,γ=0.1)

	Cost saved : $0.009850 vs GPT-4o baseline
	=======================================================
	```