Spaces:
Running
Running
File size: 4,190 Bytes
2129c29 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 | # NLProxy LLM Client Module Reference
This document describes `llm/client.py`.
## Purpose
The LLM client module provides a unified interface for multiple providers, implements request retry and concurrency control, and normalizes provider responses for the NLProxy pipeline.
## Core Concepts
### Provider Abstraction
- `LLMProvider` enumerates supported providers:
- `gemini`
- `claude`
- `openai`
- `deepseek`
- `qwen`
- `kimi`
- `openrouter`
- `custom`
### Request Model
- `LLMRequest`
- `prompt: str`
- `provider: LLMProvider`
- `model: str`
- `max_tokens: int`
- `temperature: float`
- `top_p: float`
- `top_k: int`
- `stop_sequences: Optional[List[str]]`
- `metadata: Optional[Dict[str, Any]]`
### Response Model
- `LLMResponse`
- Contains provider text, token counts, latency, cost, and metadata.
## Provider Infrastructure
### `RetryConfig`
Controls retry behavior:
- `max_attempts`
- `base_delay`
- `max_delay`
- `exponential_base`
- `jitter`
- `retryable_exceptions`
### `TimeoutConfig`
Defines connection and read timeouts for all HTTP-based providers.
### `RateLimitConfig`
Implements token-bucket rate limiting for request pacing.
### `TokenBucket`
Used internally by `BaseLLMClient` to prevent burst overload.
### `CircuitBreaker`
Used to mark unhealthy providers and fail fast when repeated errors occur.
## Base Client
### `BaseLLMClient`
#### Responsibilities
- Manages provider-specific HTTP connections.
- Applies concurrency limits with `asyncio.Semaphore`.
- Applies rate limiting and retry logic.
- Tracks metrics such as request count, errors, and latency.
- Exposes `generate()` and `generate_stream()`.
#### HTTP Client Reuse
- `_ensure_http_client(base_url, headers) -> httpx.AsyncClient`
- Ensures a single `httpx.AsyncClient` instance is reused per provider instance.
- Prevents socket exhaustion and reduces connection setup overhead.
#### Async Safety
- All operations are async and reentrant.
- Concurrency is controlled per provider using semaphores.
- `close()` shuts down the HTTP client cleanly.
## Provider Implementations
### `GeminiClient`
- Uses `google-genai` async client.
- Implements `_generate_internal()` and `_generate_stream_internal()`.
- Uses Gemini-specific token counting where available.
### `ClaudeClient`
- Sends requests to Anthropic Claude.
- Uses `/v1/messages` endpoint.
- Implements stream parsing for `content_block_delta` chunks.
### `OpenAIClient`
- Connects to `https://api.openai.com/v1`.
- Uses `/chat/completions` for standard and streaming responses.
### `GenericAPIClient`
- Supports OpenAI-compatible providers such as DeepSeek, Qwen, Kimi, OpenRouter, and custom targets.
- Requires `base_url` for `CUSTOM` provider.
## Factory and Orchestration
### `LLMClientFactory`
- `create(provider, model, api_key, base_url, retry_config, timeout_config, rate_limit_config, circuit_breaker, max_concurrent_requests)`
- `get_or_create(provider, **kwargs)` returns singleton clients keyed by provider and model.
- `close_all()` cleanly closes all managed provider clients.
### `LLMOrchestrator`
- Balances traffic across default and fallback providers.
- Supports load balancing across provider list.
- Implements provider fallback when the primary provider fails.
- Exposes `generate()`, `generate_stream()`, `health_check_all()`, `get_metrics()`, and `close()`.
## Utilities
- `validate_prompt(prompt, max_length=100000)` sanitizes input text.
- `estimate_cost(input_tokens, output_tokens, provider, model)` calculates estimated cost using pricing data.
## Dependencies
- `httpx`
- `pydantic`
- Optional: `google-genai`, `tiktoken`
## Performance and Scalability
- Request handling is async, suitable for FastAPI.
- Concurrency limits and backoff reduce overload on provider APIs.
- Shared HTTP clients reduce connection churn.
- Token counting is O(L) where L = text length.
## Edge Cases
- Missing provider credentials raise explicit configuration errors.
- Circuit breaker prevents repeated calls to failing providers.
- `generate_stream()` preserves provider streaming semantics while protecting against retries during stream consumption.
|