Spaces:
Running
Running
| # NLProxy LLM Client Module Reference | |
| This document describes `llm/client.py`. | |
| ## Purpose | |
| The LLM client module provides a unified interface for multiple providers, implements request retry and concurrency control, and normalizes provider responses for the NLProxy pipeline. | |
| ## Core Concepts | |
| ### Provider Abstraction | |
| - `LLMProvider` enumerates supported providers: | |
| - `gemini` | |
| - `claude` | |
| - `openai` | |
| - `deepseek` | |
| - `qwen` | |
| - `kimi` | |
| - `openrouter` | |
| - `custom` | |
| ### Request Model | |
| - `LLMRequest` | |
| - `prompt: str` | |
| - `provider: LLMProvider` | |
| - `model: str` | |
| - `max_tokens: int` | |
| - `temperature: float` | |
| - `top_p: float` | |
| - `top_k: int` | |
| - `stop_sequences: Optional[List[str]]` | |
| - `metadata: Optional[Dict[str, Any]]` | |
| ### Response Model | |
| - `LLMResponse` | |
| - Contains provider text, token counts, latency, cost, and metadata. | |
| ## Provider Infrastructure | |
| ### `RetryConfig` | |
| Controls retry behavior: | |
| - `max_attempts` | |
| - `base_delay` | |
| - `max_delay` | |
| - `exponential_base` | |
| - `jitter` | |
| - `retryable_exceptions` | |
| ### `TimeoutConfig` | |
| Defines connection and read timeouts for all HTTP-based providers. | |
| ### `RateLimitConfig` | |
| Implements token-bucket rate limiting for request pacing. | |
| ### `TokenBucket` | |
| Used internally by `BaseLLMClient` to prevent burst overload. | |
| ### `CircuitBreaker` | |
| Used to mark unhealthy providers and fail fast when repeated errors occur. | |
| ## Base Client | |
| ### `BaseLLMClient` | |
| #### Responsibilities | |
| - Manages provider-specific HTTP connections. | |
| - Applies concurrency limits with `asyncio.Semaphore`. | |
| - Applies rate limiting and retry logic. | |
| - Tracks metrics such as request count, errors, and latency. | |
| - Exposes `generate()` and `generate_stream()`. | |
| #### HTTP Client Reuse | |
| - `_ensure_http_client(base_url, headers) -> httpx.AsyncClient` | |
| - Ensures a single `httpx.AsyncClient` instance is reused per provider instance. | |
| - Prevents socket exhaustion and reduces connection setup overhead. | |
| #### Async Safety | |
| - All operations are async and reentrant. | |
| - Concurrency is controlled per provider using semaphores. | |
| - `close()` shuts down the HTTP client cleanly. | |
| ## Provider Implementations | |
| ### `GeminiClient` | |
| - Uses `google-genai` async client. | |
| - Implements `_generate_internal()` and `_generate_stream_internal()`. | |
| - Uses Gemini-specific token counting where available. | |
| ### `ClaudeClient` | |
| - Sends requests to Anthropic Claude. | |
| - Uses `/v1/messages` endpoint. | |
| - Implements stream parsing for `content_block_delta` chunks. | |
| ### `OpenAIClient` | |
| - Connects to `https://api.openai.com/v1`. | |
| - Uses `/chat/completions` for standard and streaming responses. | |
| ### `GenericAPIClient` | |
| - Supports OpenAI-compatible providers such as DeepSeek, Qwen, Kimi, OpenRouter, and custom targets. | |
| - Requires `base_url` for `CUSTOM` provider. | |
| ## Factory and Orchestration | |
| ### `LLMClientFactory` | |
| - `create(provider, model, api_key, base_url, retry_config, timeout_config, rate_limit_config, circuit_breaker, max_concurrent_requests)` | |
| - `get_or_create(provider, **kwargs)` returns singleton clients keyed by provider and model. | |
| - `close_all()` cleanly closes all managed provider clients. | |
| ### `LLMOrchestrator` | |
| - Balances traffic across default and fallback providers. | |
| - Supports load balancing across provider list. | |
| - Implements provider fallback when the primary provider fails. | |
| - Exposes `generate()`, `generate_stream()`, `health_check_all()`, `get_metrics()`, and `close()`. | |
| ## Utilities | |
| - `validate_prompt(prompt, max_length=100000)` sanitizes input text. | |
| - `estimate_cost(input_tokens, output_tokens, provider, model)` calculates estimated cost using pricing data. | |
| ## Dependencies | |
| - `httpx` | |
| - `pydantic` | |
| - Optional: `google-genai`, `tiktoken` | |
| ## Performance and Scalability | |
| - Request handling is async, suitable for FastAPI. | |
| - Concurrency limits and backoff reduce overload on provider APIs. | |
| - Shared HTTP clients reduce connection churn. | |
| - Token counting is O(L) where L = text length. | |
| ## Edge Cases | |
| - Missing provider credentials raise explicit configuration errors. | |
| - Circuit breaker prevents repeated calls to failing providers. | |
| - `generate_stream()` preserves provider streaming semantics while protecting against retries during stream consumption. | |