Spaces:

IntelliDeep
/

NLProxy

Running

App Files Files Community

NLProxy / nlproxy /docs /llm.md

Luiserb

first commit

2129c29 16 days ago

preview code

Raw

History Blame Contribute Delete

4.19 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade

NLProxy LLM Client Module Reference

This document describes llm/client.py.

Purpose

The LLM client module provides a unified interface for multiple providers, implements request retry and concurrency control, and normalizes provider responses for the NLProxy pipeline.

Core Concepts

Provider Abstraction

LLMProvider enumerates supported providers:
- gemini
- claude
- openai
- deepseek
- qwen
- kimi
- openrouter
- custom

Request Model

LLMRequest
- prompt: str
- provider: LLMProvider
- model: str
- max_tokens: int
- temperature: float
- top_p: float
- top_k: int
- stop_sequences: Optional[List[str]]
- metadata: Optional[Dict[str, Any]]

Response Model

LLMResponse
- Contains provider text, token counts, latency, cost, and metadata.

Provider Infrastructure

`RetryConfig`

Controls retry behavior:

max_attempts
base_delay
max_delay
exponential_base
jitter
retryable_exceptions

`TimeoutConfig`

Defines connection and read timeouts for all HTTP-based providers.

`RateLimitConfig`

Implements token-bucket rate limiting for request pacing.

`TokenBucket`

Used internally by BaseLLMClient to prevent burst overload.

`CircuitBreaker`

Used to mark unhealthy providers and fail fast when repeated errors occur.

Base Client

`BaseLLMClient`

Responsibilities

Manages provider-specific HTTP connections.
Applies concurrency limits with asyncio.Semaphore.
Applies rate limiting and retry logic.
Tracks metrics such as request count, errors, and latency.
Exposes generate() and generate_stream().

HTTP Client Reuse

_ensure_http_client(base_url, headers) -> httpx.AsyncClient
Ensures a single httpx.AsyncClient instance is reused per provider instance.
Prevents socket exhaustion and reduces connection setup overhead.

Async Safety

All operations are async and reentrant.
Concurrency is controlled per provider using semaphores.
close() shuts down the HTTP client cleanly.

Provider Implementations

`GeminiClient`

Uses google-genai async client.
Implements _generate_internal() and _generate_stream_internal().
Uses Gemini-specific token counting where available.

`ClaudeClient`

Sends requests to Anthropic Claude.
Uses /v1/messages endpoint.
Implements stream parsing for content_block_delta chunks.

`OpenAIClient`

Connects to https://api.openai.com/v1.
Uses /chat/completions for standard and streaming responses.

`GenericAPIClient`

Supports OpenAI-compatible providers such as DeepSeek, Qwen, Kimi, OpenRouter, and custom targets.
Requires base_url for CUSTOM provider.

Factory and Orchestration

`LLMClientFactory`

create(provider, model, api_key, base_url, retry_config, timeout_config, rate_limit_config, circuit_breaker, max_concurrent_requests)
get_or_create(provider, **kwargs) returns singleton clients keyed by provider and model.
close_all() cleanly closes all managed provider clients.

`LLMOrchestrator`

Balances traffic across default and fallback providers.
Supports load balancing across provider list.
Implements provider fallback when the primary provider fails.
Exposes generate(), generate_stream(), health_check_all(), get_metrics(), and close().

Utilities

validate_prompt(prompt, max_length=100000) sanitizes input text.
estimate_cost(input_tokens, output_tokens, provider, model) calculates estimated cost using pricing data.

Dependencies

httpx
pydantic
Optional: google-genai, tiktoken

Performance and Scalability

Request handling is async, suitable for FastAPI.
Concurrency limits and backoff reduce overload on provider APIs.
Shared HTTP clients reduce connection churn.
Token counting is O(L) where L = text length.

Edge Cases

Missing provider credentials raise explicit configuration errors.
Circuit breaker prevents repeated calls to failing providers.
generate_stream() preserves provider streaming semantics while protecting against retries during stream consumption.