Spaces:

IntelliDeep
/

NLProxy

Running

App Files Files Community

NLProxy / nlproxy /docs /llm.md

Luiserb

first commit

2129c29 17 days ago

preview code

Raw

History Blame Contribute Delete

4.19 kB

	# NLProxy LLM Client Module Reference

	This document describes `llm/client.py`.

	## Purpose

	The LLM client module provides a unified interface for multiple providers, implements request retry and concurrency control, and normalizes provider responses for the NLProxy pipeline.

	## Core Concepts

	### Provider Abstraction

	- `LLMProvider` enumerates supported providers:
	- `gemini`
	- `claude`
	- `openai`
	- `deepseek`
	- `qwen`
	- `kimi`
	- `openrouter`
	- `custom`

	### Request Model

	- `LLMRequest`
	- `prompt: str`
	- `provider: LLMProvider`
	- `model: str`
	- `max_tokens: int`
	- `temperature: float`
	- `top_p: float`
	- `top_k: int`
	- `stop_sequences: Optional[List[str]]`
	- `metadata: Optional[Dict[str, Any]]`

	### Response Model

	- `LLMResponse`
	- Contains provider text, token counts, latency, cost, and metadata.

	## Provider Infrastructure

	### `RetryConfig`

	Controls retry behavior:

	- `max_attempts`
	- `base_delay`
	- `max_delay`
	- `exponential_base`
	- `jitter`
	- `retryable_exceptions`

	### `TimeoutConfig`

	Defines connection and read timeouts for all HTTP-based providers.

	### `RateLimitConfig`

	Implements token-bucket rate limiting for request pacing.

	### `TokenBucket`

	Used internally by `BaseLLMClient` to prevent burst overload.

	### `CircuitBreaker`

	Used to mark unhealthy providers and fail fast when repeated errors occur.

	## Base Client

	### `BaseLLMClient`

	#### Responsibilities

	- Manages provider-specific HTTP connections.
	- Applies concurrency limits with `asyncio.Semaphore`.
	- Applies rate limiting and retry logic.
	- Tracks metrics such as request count, errors, and latency.
	- Exposes `generate()` and `generate_stream()`.

	#### HTTP Client Reuse

	- `_ensure_http_client(base_url, headers) -> httpx.AsyncClient`
	- Ensures a single `httpx.AsyncClient` instance is reused per provider instance.
	- Prevents socket exhaustion and reduces connection setup overhead.

	#### Async Safety

	- All operations are async and reentrant.
	- Concurrency is controlled per provider using semaphores.
	- `close()` shuts down the HTTP client cleanly.

	## Provider Implementations

	### `GeminiClient`

	- Uses `google-genai` async client.
	- Implements `_generate_internal()` and `_generate_stream_internal()`.
	- Uses Gemini-specific token counting where available.

	### `ClaudeClient`

	- Sends requests to Anthropic Claude.
	- Uses `/v1/messages` endpoint.
	- Implements stream parsing for `content_block_delta` chunks.

	### `OpenAIClient`

	- Connects to `https://api.openai.com/v1`.
	- Uses `/chat/completions` for standard and streaming responses.

	### `GenericAPIClient`

	- Supports OpenAI-compatible providers such as DeepSeek, Qwen, Kimi, OpenRouter, and custom targets.
	- Requires `base_url` for `CUSTOM` provider.

	## Factory and Orchestration

	### `LLMClientFactory`

	- `create(provider, model, api_key, base_url, retry_config, timeout_config, rate_limit_config, circuit_breaker, max_concurrent_requests)`
	- `get_or_create(provider, **kwargs)` returns singleton clients keyed by provider and model.
	- `close_all()` cleanly closes all managed provider clients.

	### `LLMOrchestrator`

	- Balances traffic across default and fallback providers.
	- Supports load balancing across provider list.
	- Implements provider fallback when the primary provider fails.
	- Exposes `generate()`, `generate_stream()`, `health_check_all()`, `get_metrics()`, and `close()`.

	## Utilities

	- `validate_prompt(prompt, max_length=100000)` sanitizes input text.
	- `estimate_cost(input_tokens, output_tokens, provider, model)` calculates estimated cost using pricing data.

	## Dependencies

	- `httpx`
	- `pydantic`
	- Optional: `google-genai`, `tiktoken`

	## Performance and Scalability

	- Request handling is async, suitable for FastAPI.
	- Concurrency limits and backoff reduce overload on provider APIs.
	- Shared HTTP clients reduce connection churn.
	- Token counting is O(L) where L = text length.

	## Edge Cases

	- Missing provider credentials raise explicit configuration errors.
	- Circuit breaker prevents repeated calls to failing providers.
	- `generate_stream()` preserves provider streaming semantics while protecting against retries during stream consumption.