NLProxy LLM Client Module Reference
This document describes llm/client.py.
Purpose
The LLM client module provides a unified interface for multiple providers, implements request retry and concurrency control, and normalizes provider responses for the NLProxy pipeline.
Core Concepts
Provider Abstraction
LLMProvider enumerates supported providers:
- gemini
- claude
- openai
- deepseek
- qwen
- kimi
- openrouter
- custom
Request Model
LLMRequest
- prompt: str
- provider: LLMProvider
- model: str
- max_tokens: int
- temperature: float
- top_p: float
- top_k: int
- stop_sequences: Optional[List[str]]
- metadata: Optional[Dict[str, Any]]
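To make the request shape above concrete, here is a minimal sketch using standard-library dataclasses. Since pydantic is listed as a dependency, the actual model is probably a pydantic model; the enum member names other than CUSTOM, every default value, and the model name "example-model" are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List, Optional


class LLMProvider(str, Enum):
    """Provider identifiers as listed in this reference."""
    GEMINI = "gemini"
    CLAUDE = "claude"
    OPENAI = "openai"
    DEEPSEEK = "deepseek"
    QWEN = "qwen"
    KIMI = "kimi"
    OPENROUTER = "openrouter"
    CUSTOM = "custom"


@dataclass
class LLMRequest:
    prompt: str
    provider: LLMProvider
    model: str
    # All defaults below are assumed; the reference does not state them.
    max_tokens: int = 1024
    temperature: float = 0.7
    top_p: float = 1.0
    top_k: int = 40
    stop_sequences: Optional[List[str]] = None
    metadata: Optional[Dict[str, Any]] = None


# Build a request; the provider can be looked up from its string value.
request = LLMRequest(
    prompt="Summarize this text.",
    provider=LLMProvider("openai"),
    model="example-model",  # placeholder, not a real model name
)
```

Subclassing both str and Enum lets the provider compare equal to its plain string value, which can be convenient when the value arrives from configuration or JSON.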
Response Model
LLMResponse
- Contains provider text, token counts, latency, cost, and metadata.
Provider Infrastructure
RetryConfig
Controls retry behavior:
- max_attempts
- base_delay
- max_delay
- exponential_base
- jitter
- retryable_exceptions
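The field names above suggest capped exponential backoff with optional jitter. The sketch below shows one common way such fields combine into a delay; the default values, the helper name backoff_delay, and the choice of "full jitter" are assumptions, not the module's confirmed behavior.

```python
import random
from dataclasses import dataclass
from typing import Tuple, Type


@dataclass
class RetryConfig:
    # Field names follow the reference; every default here is assumed.
    max_attempts: int = 3
    base_delay: float = 1.0
    max_delay: float = 30.0
    exponential_base: float = 2.0
    jitter: bool = True
    retryable_exceptions: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError)


def backoff_delay(config: RetryConfig, attempt: int) -> float:
    """Seconds to wait before retry number `attempt` (0-based). Hypothetical helper."""
    # Grow the delay geometrically, then cap it at max_delay.
    delay = min(config.base_delay * config.exponential_base ** attempt, config.max_delay)
    if config.jitter:
        # Full jitter: pick uniformly between 0 and the capped delay.
        delay = random.uniform(0.0, delay)
    return delay


no_jitter = RetryConfig(jitter=False)
delays = [backoff_delay(no_jitter, n) for n in range(6)]
# delays == [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Jitter spreads retries from many concurrent callers over time, so they do not all hit a recovering provider at the same instant.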
TimeoutConfig
Defines connection and read timeouts for all HTTP-based providers.
RateLimitConfig
Implements token-bucket rate limiting for request pacing.
TokenBucket
Used internally by BaseLLMClient to prevent burst overload.
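For readers unfamiliar with the technique, here is a minimal token-bucket sketch. The constructor parameters, the try_acquire method, and the injectable clock are assumptions made for illustration; the module's own TokenBucket is presumably async-aware and may wait for tokens instead of returning False.

```python
import time
from typing import Callable


class TokenBucket:
    """Illustrative token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock          # injectable so the behavior can be tested
        self._tokens = capacity      # start full, allowing an initial burst
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        elapsed = now - self._last
        # Add tokens for the elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Take `tokens` if available; return False instead of blocking."""
        self._refill()
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False
```

The capacity bounds the burst size, while the rate bounds sustained throughput; together they explain how a bucket can "prevent burst overload" without serializing every request.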
CircuitBreaker
Used to mark unhealthy providers and fail fast when repeated errors occur.
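The fail-fast behavior described here is commonly implemented as a small state machine. The sketch below is one such version under stated assumptions: the parameter names failure_threshold and recovery_timeout, the method names, and the half-open trial after the timeout are illustrative and not taken from the module.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Illustrative breaker: opens after N failures, allows a trial after a timeout."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Half-open: once the recovery timeout has passed, let a trial call through.
        return self._clock() - self._opened_at >= self.recovery_timeout

    def record_success(self) -> None:
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # Open (or re-open) the circuit and restart the recovery timer.
            self._opened_at = self._clock()
```

A caller would check allow_request() before contacting the provider, then report the outcome with record_success() or record_failure().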
Base Client
BaseLLMClient
Responsibilities
- Manages provider-specific HTTP connections.
- Applies concurrency limits with asyncio.Semaphore.
- Applies rate limiting and retry logic.
- Tracks metrics such as request count, errors, and latency.
- Exposes generate() and generate_stream().
HTTP Client Reuse
_ensure_http_client(base_url, headers) -> httpx.AsyncClient
- Ensures a single httpx.AsyncClient instance is reused per provider instance.
- Prevents socket exhaustion and reduces connection setup overhead.
Async Safety
- All operations are async and reentrant.
- Concurrency is controlled per provider using semaphores.
- close() shuts down the HTTP client cleanly.
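The per-provider semaphore limit can be pictured with the following sketch. The class ConcurrencyLimiter, its run method, and fake_provider_call are hypothetical names; only asyncio.Semaphore and the max_concurrent_requests parameter name come from this reference.

```python
import asyncio
from typing import Any, Awaitable, Callable


class ConcurrencyLimiter:
    """Illustrative wrapper that caps the number of in-flight calls."""

    def __init__(self, max_concurrent_requests: int):
        self._semaphore = asyncio.Semaphore(max_concurrent_requests)
        self.in_flight = 0
        self.peak_in_flight = 0

    async def run(self, coro_fn: Callable[..., Awaitable[Any]], *args: Any) -> Any:
        # Callers beyond the limit wait here until a slot is released.
        async with self._semaphore:
            self.in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
            try:
                return await coro_fn(*args)
            finally:
                self.in_flight -= 1


async def fake_provider_call(i: int) -> int:
    # Stand-in for a real provider request.
    await asyncio.sleep(0.01)
    return i * 2


async def main():
    limiter = ConcurrencyLimiter(max_concurrent_requests=2)
    results = await asyncio.gather(
        *(limiter.run(fake_provider_call, i) for i in range(6))
    )
    return results, limiter.peak_in_flight


results, peak = asyncio.run(main())
```

Six calls are submitted at once, but the semaphore admits only two at a time, so the recorded peak never exceeds the configured limit.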
Provider Implementations
GeminiClient
- Uses the google-genai async client.
- Implements _generate_internal() and _generate_stream_internal().
- Uses Gemini-specific token counting where available.
ClaudeClient
- Sends requests to Anthropic Claude.
- Uses the /v1/messages endpoint.
- Implements stream parsing for content_block_delta chunks.
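As an illustration of the stream parsing mentioned above, the sketch below extracts text from server-sent event lines. The function name iter_text_deltas and the sample lines are assumptions; the event shape (a content_block_delta event carrying a text_delta) follows Anthropic's published streaming format, and ClaudeClient may handle more event types than shown here.

```python
import json
from typing import Iterable, Iterator


def iter_text_deltas(sse_lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from content_block_delta events. Hypothetical helper."""
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip "event:" lines, comments, and blank separators
        payload = line[len("data:"):].strip()
        if not payload:
            continue
        event = json.loads(payload)
        if event.get("type") != "content_block_delta":
            continue  # ignore message_start, ping, message_stop, and so on
        delta = event.get("delta", {})
        if delta.get("type") == "text_delta":
            yield delta.get("text", "")


# Hand-written sample lines in the shape of an SSE response body.
sample = [
    "event: message_start",
    'data: {"type": "message_start", "message": {}}',
    "",
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, '
    '"delta": {"type": "text_delta", "text": "Hel"}}',
    "",
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, '
    '"delta": {"type": "text_delta", "text": "lo"}}',
    "",
    "event: message_stop",
    'data: {"type": "message_stop"}',
]
text = "".join(iter_text_deltas(sample))
```

In a real client the lines would come from an async HTTP response iterator rather than a list, but the filtering logic stays the same.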
OpenAIClient
- Connects to https://api.openai.com/v1.
- Uses /chat/completions for standard and streaming responses.
GenericAPIClient
- Supports OpenAI-compatible providers such as DeepSeek, Qwen, Kimi, OpenRouter, and custom targets.
- Requires base_url for the CUSTOM provider.
Factory and Orchestration
LLMClientFactory
- create(provider, model, api_key, base_url, retry_config, timeout_config, rate_limit_config, circuit_breaker, max_concurrent_requests)
- get_or_create(provider, **kwargs) returns singleton clients keyed by provider and model.
- close_all() cleanly closes all managed provider clients.
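The caching behavior of get_or_create() and close_all() can be sketched as follows. ClientCache and StubClient are hypothetical stand-ins, and taking model as an explicit argument is a simplification; the real factory accepts it through **kwargs and constructs actual provider clients.

```python
import asyncio
from typing import Any, Callable, Dict, Tuple


class StubClient:
    """Stand-in for a provider client; only tracks whether it was closed."""

    def __init__(self, provider: str, model: str, **kwargs: Any):
        self.provider, self.model, self.closed = provider, model, False

    async def close(self) -> None:
        self.closed = True


class ClientCache:
    """Illustrative cache returning one client per (provider, model) pair."""

    def __init__(self, create: Callable[..., Any]):
        self._create = create
        self._clients: Dict[Tuple[str, str], Any] = {}

    def get_or_create(self, provider: str, model: str, **kwargs: Any) -> Any:
        key = (provider, model)
        if key not in self._clients:
            # First request for this pair: build and remember the client.
            self._clients[key] = self._create(provider=provider, model=model, **kwargs)
        return self._clients[key]

    async def close_all(self) -> None:
        clients = list(self._clients.values())
        self._clients.clear()
        # Close every managed client concurrently.
        await asyncio.gather(*(client.close() for client in clients))
```

Keying on both provider and model means two models from the same provider get separate clients, each with its own connection pool and limits.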
LLMOrchestrator
- Balances traffic across default and fallback providers.
- Supports load balancing across the configured provider list.
- Implements provider fallback when the primary provider fails.
- Exposes generate(), generate_stream(), health_check_all(), get_metrics(), and close().
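Provider fallback, as described above, amounts to trying providers in order until one succeeds. The sketch below shows that idea only; generate_with_fallback, the callable-based client mapping, and the two stub providers are assumptions, and the real LLMOrchestrator also handles load balancing and health checks.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def generate_with_fallback(
    prompt: str,
    providers: List[str],
    clients: Dict[str, Callable[[str], Awaitable[str]]],
) -> str:
    """Try each provider in order; return the first successful response."""
    errors = []
    for name in providers:
        try:
            return await clients[name](prompt)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


async def failing(prompt: str) -> str:
    # Simulates an unavailable primary provider.
    raise ConnectionError("primary unavailable")


async def echo(prompt: str) -> str:
    # Simulates a healthy fallback provider.
    return f"echo: {prompt}"


result = asyncio.run(
    generate_with_fallback(
        "hi", ["primary", "fallback"], {"primary": failing, "fallback": echo}
    )
)
```

Collecting the per-provider errors before raising keeps the final exception informative when every provider in the list fails.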
Utilities
- validate_prompt(prompt, max_length=100000) sanitizes input text.
- estimate_cost(input_tokens, output_tokens, provider, model) calculates estimated cost using pricing data.
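The two utilities might look roughly like the sketch below. The signatures match this reference, but the sanitization rules (stripping whitespace and NUL characters, rejecting empty or overlong prompts), the per-million-token pricing layout, and the prices themselves are assumptions; the placeholder prices are not real provider pricing.

```python
from typing import Dict, Tuple

# Hypothetical prices in USD per 1M tokens: (input, output). Not real pricing.
PRICING: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("example-provider", "example-model"): (1.00, 2.00),
}


def validate_prompt(prompt: str, max_length: int = 100000) -> str:
    """Return a cleaned prompt or raise if it is unusable. Rules are assumed."""
    if not isinstance(prompt, str):
        raise TypeError("prompt must be a string")
    cleaned = prompt.replace("\x00", "").strip()
    if not cleaned:
        raise ValueError("prompt is empty")
    if len(cleaned) > max_length:
        raise ValueError(f"prompt exceeds {max_length} characters")
    return cleaned


def estimate_cost(input_tokens: int, output_tokens: int,
                  provider: str, model: str) -> float:
    """Estimated cost in USD; unknown provider/model pairs cost 0.0 here."""
    input_price, output_price = PRICING.get((provider, model), (0.0, 0.0))
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


cost = estimate_cost(500_000, 250_000, "example-provider", "example-model")
# cost == 1.0 with the placeholder prices above
```

Input and output tokens are priced separately because providers typically charge different rates for each direction.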
Dependencies
- httpx
- pydantic
- Optional: google-genai, tiktoken
Performance and Scalability
- Request handling is async, suitable for FastAPI.
- Concurrency limits and backoff reduce overload on provider APIs.
- Shared HTTP clients reduce connection churn.
- Token counting is O(L) where L = text length.
Edge Cases
- Missing provider credentials raise explicit configuration errors.
- Circuit breaker prevents repeated calls to failing providers.
- generate_stream() preserves provider streaming semantics while protecting against retries during stream consumption.