NLProxy / nlproxy /docs /llm.md
Luiserb's picture
first commit
2129c29
|
Raw
History Blame Contribute Delete
4.19 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade

NLProxy LLM Client Module Reference

This document describes llm/client.py.

Purpose

The LLM client module provides a unified interface for multiple providers, implements request retry and concurrency control, and normalizes provider responses for the NLProxy pipeline.

Core Concepts

Provider Abstraction

  • LLMProvider enumerates supported providers:
    • gemini
    • claude
    • openai
    • deepseek
    • qwen
    • kimi
    • openrouter
    • custom

Request Model

  • LLMRequest
    • prompt: str
    • provider: LLMProvider
    • model: str
    • max_tokens: int
    • temperature: float
    • top_p: float
    • top_k: int
    • stop_sequences: Optional[List[str]]
    • metadata: Optional[Dict[str, Any]]

Response Model

  • LLMResponse
    • Contains provider text, token counts, latency, cost, and metadata.

Provider Infrastructure

RetryConfig

Controls retry behavior:

  • max_attempts
  • base_delay
  • max_delay
  • exponential_base
  • jitter
  • retryable_exceptions

TimeoutConfig

Defines connection and read timeouts for all HTTP-based providers.

RateLimitConfig

Implements token-bucket rate limiting for request pacing.

TokenBucket

Used internally by BaseLLMClient to prevent burst overload.

CircuitBreaker

Used to mark unhealthy providers and fail fast when repeated errors occur.

Base Client

BaseLLMClient

Responsibilities

  • Manages provider-specific HTTP connections.
  • Applies concurrency limits with asyncio.Semaphore.
  • Applies rate limiting and retry logic.
  • Tracks metrics such as request count, errors, and latency.
  • Exposes generate() and generate_stream().

HTTP Client Reuse

  • _ensure_http_client(base_url, headers) -> httpx.AsyncClient
  • Ensures a single httpx.AsyncClient instance is reused per provider instance.
  • Prevents socket exhaustion and reduces connection setup overhead.

Async Safety

  • All operations are async and reentrant.
  • Concurrency is controlled per provider using semaphores.
  • close() shuts down the HTTP client cleanly.

Provider Implementations

GeminiClient

  • Uses google-genai async client.
  • Implements _generate_internal() and _generate_stream_internal().
  • Uses Gemini-specific token counting where available.

ClaudeClient

  • Sends requests to Anthropic Claude.
  • Uses /v1/messages endpoint.
  • Implements stream parsing for content_block_delta chunks.

OpenAIClient

  • Connects to https://api.openai.com/v1.
  • Uses /chat/completions for standard and streaming responses.

GenericAPIClient

  • Supports OpenAI-compatible providers such as DeepSeek, Qwen, Kimi, OpenRouter, and custom targets.
  • Requires base_url for CUSTOM provider.

Factory and Orchestration

LLMClientFactory

  • create(provider, model, api_key, base_url, retry_config, timeout_config, rate_limit_config, circuit_breaker, max_concurrent_requests)
  • get_or_create(provider, **kwargs) returns singleton clients keyed by provider and model.
  • close_all() cleanly closes all managed provider clients.

LLMOrchestrator

  • Balances traffic across default and fallback providers.
  • Supports load balancing across provider list.
  • Implements provider fallback when the primary provider fails.
  • Exposes generate(), generate_stream(), health_check_all(), get_metrics(), and close().

Utilities

  • validate_prompt(prompt, max_length=100000) sanitizes input text.
  • estimate_cost(input_tokens, output_tokens, provider, model) calculates estimated cost using pricing data.

Dependencies

  • httpx
  • pydantic
  • Optional: google-genai, tiktoken

Performance and Scalability

  • Request handling is async, suitable for FastAPI.
  • Concurrency limits and backoff reduce overload on provider APIs.
  • Shared HTTP clients reduce connection churn.
  • Token counting is O(L) where L = text length.

Edge Cases

  • Missing provider credentials raise explicit configuration errors.
  • Circuit breaker prevents repeated calls to failing providers.
  • generate_stream() preserves provider streaming semantics while protecting against retries during stream consumption.