# NLProxy LLM Client Module Reference

This document describes `llm/client.py`.

## Purpose

The LLM client module provides a unified interface for multiple providers, implements request retry and concurrency control, and normalizes provider responses for the NLProxy pipeline.

## Core Concepts

### Provider Abstraction

- `LLMProvider` enumerates supported providers:
  - `gemini`
  - `claude`
  - `openai`
  - `deepseek`
  - `qwen`
  - `kimi`
  - `openrouter`
  - `custom`

### Request Model

- `LLMRequest`
  - `prompt: str`
  - `provider: LLMProvider`
  - `model: str`
  - `max_tokens: int`
  - `temperature: float`
  - `top_p: float`
  - `top_k: int`
  - `stop_sequences: Optional[List[str]]`
  - `metadata: Optional[Dict[str, Any]]`
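As a rough illustration, the provider enumeration and request model could be laid out as below. This is a sketch only: it uses a standard-library dataclass rather than whatever base class `llm/client.py` actually uses, the enum member names are assumed, and every default value is an assumption rather than the module's real default.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List, Optional


class LLMProvider(str, Enum):
    # Values come from the provider list above; member names are assumed.
    GEMINI = "gemini"
    CLAUDE = "claude"
    OPENAI = "openai"
    DEEPSEEK = "deepseek"
    QWEN = "qwen"
    KIMI = "kimi"
    OPENROUTER = "openrouter"
    CUSTOM = "custom"


@dataclass
class LLMRequest:
    prompt: str
    provider: LLMProvider
    model: str
    # All defaults below are illustrative assumptions.
    max_tokens: int = 1024
    temperature: float = 0.7
    top_p: float = 1.0
    top_k: int = 40
    stop_sequences: Optional[List[str]] = None
    metadata: Optional[Dict[str, Any]] = None


# "example-model" is a placeholder, not a real model identifier.
request = LLMRequest(
    prompt="Summarize this text.",
    provider=LLMProvider.CLAUDE,
    model="example-model",
)
```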

### Response Model

- `LLMResponse`
  - Contains provider text, token counts, latency, cost, and metadata.

## Provider Infrastructure

### `RetryConfig`

Controls retry behavior:

- `max_attempts`
- `base_delay`
- `max_delay`
- `exponential_base`
- `jitter`
- `retryable_exceptions`
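The fields above suggest capped exponential backoff with optional jitter. The sketch below shows one common way such fields combine into a delay. The field types, the default values, the `compute_delay` helper, and the choice of "full jitter" are all assumptions; the module may compute delays differently.

```python
import random
from dataclasses import dataclass
from typing import Tuple, Type


@dataclass
class RetryConfig:
    # Field names follow the list above; types and defaults are assumptions.
    max_attempts: int = 3
    base_delay: float = 1.0
    max_delay: float = 30.0
    exponential_base: float = 2.0
    jitter: bool = True
    retryable_exceptions: Tuple[Type[BaseException], ...] = (
        TimeoutError,
        ConnectionError,
    )


def compute_delay(config: RetryConfig, attempt: int) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Hypothetical helper, not a function documented for the module.
    """
    delay = min(
        config.base_delay * config.exponential_base ** attempt,
        config.max_delay,
    )
    if config.jitter:
        # Full jitter: pick uniformly between 0 and the capped delay.
        delay = random.uniform(0.0, delay)
    return delay
```

With jitter disabled and these assumed defaults, the delays grow as 1, 2, 4, 8 seconds and are capped at `max_delay`.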

### `TimeoutConfig`

Defines connection and read timeouts for all HTTP-based providers.

### `RateLimitConfig`

Configures the token-bucket rate limiting used for request pacing.

### `TokenBucket`

Used internally by `BaseLLMClient` to prevent burst overload.
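For readers unfamiliar with the technique, a token bucket refills at a fixed rate up to a capacity, and each request spends one token. The following is a minimal synchronous sketch of that idea. The constructor parameters `rate` and `capacity`, the `try_acquire` method, and the injectable `now` argument are assumptions made for illustration; the module's own class is likely async and may expose a different interface.

```python
import time
from typing import Optional


class TokenBucket:
    """Minimal token bucket; the module's real class may differ."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start full so an initial burst is allowed
        self.updated = time.monotonic()

    def try_acquire(self, tokens: float = 1.0, now: Optional[float] = None) -> bool:
        """Spend `tokens` if available; return False instead of waiting."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False
```

A caller that gets `False` would typically sleep briefly and try again, which is what paces requests once the initial burst is spent.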

### `CircuitBreaker`

Used to mark unhealthy providers and fail fast when repeated errors occur.
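The fail-fast behavior can be pictured with a small state machine: count consecutive failures, open the circuit at a threshold, and allow a trial request after a cool-down. The sketch below is written under those assumptions. The parameter names `failure_threshold` and `recovery_timeout` and all method names are hypothetical, not the module's documented interface.

```python
import time
from typing import Optional


class CircuitBreaker:
    """Illustrative circuit breaker; names and defaults are assumptions."""

    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def allow_request(self, now: Optional[float] = None) -> bool:
        if self.opened_at is None:
            return True
        now = time.monotonic() if now is None else now
        # Once the cool-down has elapsed, let a trial request through.
        return now - self.opened_at >= self.recovery_timeout

    def record_success(self) -> None:
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now
```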

## Base Client

### `BaseLLMClient`

#### Responsibilities

- Manages provider-specific HTTP connections.
- Applies concurrency limits with `asyncio.Semaphore`.
- Applies rate limiting and retry logic.
- Tracks metrics such as request count, errors, and latency.
- Exposes `generate()` and `generate_stream()`.

#### HTTP Client Reuse

- `_ensure_http_client(base_url, headers) -> httpx.AsyncClient`
- Ensures a single `httpx.AsyncClient` instance is reused per provider instance.
- Prevents socket exhaustion and reduces connection setup overhead.

#### Async Safety

- All operations are async and reentrant.
- Concurrency is controlled per provider using semaphores.
- `close()` shuts down the HTTP client cleanly.
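To make the semaphore-based concurrency limit concrete, the sketch below wraps calls in `asyncio.Semaphore` and records the peak number of in-flight calls. `ConcurrencyLimiter`, `fake_call`, and the `in_flight`/`peak` counters are hypothetical stand-ins used only for illustration; they are not part of `BaseLLMClient`.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Tuple


class ConcurrencyLimiter:
    """Hypothetical stand-in for the semaphore handling in a client."""

    def __init__(self, max_concurrent_requests: int) -> None:
        self._semaphore = asyncio.Semaphore(max_concurrent_requests)
        self.in_flight = 0
        self.peak = 0

    async def run(self, coro_fn: Callable[..., Awaitable[Any]], *args: Any) -> Any:
        # At most `max_concurrent_requests` callers are inside this block.
        async with self._semaphore:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await coro_fn(*args)
            finally:
                self.in_flight -= 1


async def fake_call(i: int) -> int:
    await asyncio.sleep(0.01)  # stands in for a provider HTTP request
    return i * 2


async def main() -> Tuple[List[int], int]:
    limiter = ConcurrencyLimiter(max_concurrent_requests=2)
    results = await asyncio.gather(*(limiter.run(fake_call, i) for i in range(5)))
    return list(results), limiter.peak


results, peak = asyncio.run(main())
```

Five calls are submitted at once, but the recorded peak never exceeds the configured limit of two.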

## Provider Implementations

### `GeminiClient`

- Uses the `google-genai` async client.
- Implements `_generate_internal()` and `_generate_stream_internal()`.
- Uses Gemini-specific token counting where available.

### `ClaudeClient`

- Sends requests to the Anthropic Claude API.
- Uses the `/v1/messages` endpoint.
- Implements stream parsing for `content_block_delta` chunks.
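As an illustration of the stream parsing step, the sketch below extracts text from server-sent event lines whose JSON payload has type `content_block_delta`. The helper name `iter_text_deltas` is hypothetical, the sample lines are hand-written rather than captured output, and the real client presumably reads lines from an HTTP response instead of a list.

```python
import json
from typing import Iterable, Iterator


def iter_text_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from `content_block_delta` events."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip "event:" lines and blank separators
        payload = line[len("data:"):].strip()
        if not payload:
            continue
        event = json.loads(payload)
        if event.get("type") != "content_block_delta":
            continue
        delta = event.get("delta", {})
        if delta.get("type") == "text_delta":
            yield delta.get("text", "")


# Hand-written sample events for illustration only.
sample = [
    'event: message_start',
    'data: {"type": "message_start", "message": {}}',
    '',
    'event: content_block_delta',
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hel"}}',
    '',
    'event: content_block_delta',
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "lo"}}',
    '',
    'event: message_stop',
    'data: {"type": "message_stop"}',
]

text = "".join(iter_text_deltas(sample))
```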

### `OpenAIClient`

- Connects to `https://api.openai.com/v1`.
- Uses `/chat/completions` for standard and streaming responses.

### `GenericAPIClient`

- Supports OpenAI-compatible providers such as DeepSeek, Qwen, Kimi, OpenRouter, and custom targets.
- Requires `base_url` for the `CUSTOM` provider.

## Factory and Orchestration

### `LLMClientFactory`

- `create(provider, model, api_key, base_url, retry_config, timeout_config, rate_limit_config, circuit_breaker, max_concurrent_requests)`
- `get_or_create(provider, **kwargs)` returns singleton clients keyed by provider and model.
- `close_all()` cleanly closes all managed provider clients.
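The singleton behavior of `get_or_create()` can be approximated with a dictionary keyed by `(provider, model)`. In the sketch below, `ClientCache` and `DummyClient` are hypothetical stand-ins, and `model` is taken as an explicit argument for clarity even though the documented signature passes it through `**kwargs`.

```python
import asyncio
from typing import Any, Dict, Tuple


class DummyClient:
    """Stand-in for a provider client; not part of the module."""

    def __init__(self, provider: str, model: str, **kwargs: Any) -> None:
        self.provider = provider
        self.model = model
        self.closed = False

    async def close(self) -> None:
        self.closed = True


class ClientCache:
    """Illustrative cache keyed by (provider, model)."""

    def __init__(self) -> None:
        self._clients: Dict[Tuple[str, str], DummyClient] = {}

    def get_or_create(self, provider: str, model: str, **kwargs: Any) -> DummyClient:
        key = (provider, model)
        if key not in self._clients:
            self._clients[key] = DummyClient(provider, model, **kwargs)
        return self._clients[key]

    async def close_all(self) -> None:
        for client in self._clients.values():
            await client.close()
        self._clients.clear()


cache = ClientCache()
a = cache.get_or_create("openai", "model-a")
b = cache.get_or_create("openai", "model-a")  # same key, same instance
c = cache.get_or_create("openai", "model-b")  # different model, new instance
asyncio.run(cache.close_all())
```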

### `LLMOrchestrator`

- Balances traffic across the configured provider list (default and fallback providers).
- Falls back to another provider when the primary provider fails.
- Exposes `generate()`, `generate_stream()`, `health_check_all()`, `get_metrics()`, and `close()`.
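Provider fallback can be sketched as trying each provider in order and returning the first successful result. The function `generate_with_fallback`, the `(name, callable)` pairs, and the two toy providers below are assumptions made for illustration; the orchestrator's real selection and load-balancing logic is not shown here.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, Tuple

Generate = Callable[[str], Awaitable[str]]


async def generate_with_fallback(
    prompt: str, providers: Sequence[Tuple[str, Generate]]
) -> Tuple[str, str]:
    """Return (provider_name, text) from the first provider that succeeds."""
    errors: List[str] = []
    for name, generate in providers:
        try:
            return name, await generate(prompt)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


async def failing(prompt: str) -> str:
    raise ConnectionError("primary unavailable")


async def echo(prompt: str) -> str:
    return prompt.upper()


used, text = asyncio.run(
    generate_with_fallback("hello", [("primary", failing), ("fallback", echo)])
)
```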

## Utilities

- `validate_prompt(prompt, max_length=100000)` sanitizes input text.
- `estimate_cost(input_tokens, output_tokens, provider, model)` calculates estimated cost using pricing data.
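A cost estimate of this kind usually reduces to a price lookup and a weighted sum. The sketch below assumes per-million-token pricing stored in a dictionary keyed by `(provider, model)`. The `PRICING` table, its placeholder entry, and the zero-cost fallback for unknown models are all assumptions and do not reflect real prices or the module's actual pricing data.

```python
from typing import Dict, Tuple

# Placeholder prices in USD per 1M tokens (input, output); NOT real pricing.
PRICING: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("example-provider", "example-model"): (1.00, 2.00),
}


def estimate_cost(input_tokens: int, output_tokens: int, provider: str, model: str) -> float:
    """Estimated cost in USD; unknown models fall back to 0.0 (an assumption)."""
    input_price, output_price = PRICING.get((provider, model), (0.0, 0.0))
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


cost = estimate_cost(500_000, 250_000, "example-provider", "example-model")
```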

## Dependencies

- `httpx`
- `pydantic`
- Optional: `google-genai`, `tiktoken`

## Performance and Scalability

- Request handling is async, suitable for FastAPI.
- Concurrency limits and backoff reduce overload on provider APIs.
- Shared HTTP clients reduce connection churn.
- Token counting runs in O(L) time, where L is the text length.

## Edge Cases

- Missing provider credentials raise explicit configuration errors.
- Circuit breaker prevents repeated calls to failing providers.
- `generate_stream()` preserves provider streaming semantics and guards against retrying a request once stream consumption has begun.