# NLProxy LLM Client Module Reference

This document describes `llm/client.py`.

## Purpose

The LLM client module provides a unified interface for multiple providers, implements request retry and concurrency control, and normalizes provider responses for the NLProxy pipeline.

## Core Concepts

### Provider Abstraction

- `LLMProvider` enumerates supported providers:
  - `gemini`
  - `claude`
  - `openai`
  - `deepseek`
  - `qwen`
  - `kimi`
  - `openrouter`
  - `custom`

### Request Model

- `LLMRequest`
  - `prompt: str`
  - `provider: LLMProvider`
  - `model: str`
  - `max_tokens: int`
  - `temperature: float`
  - `top_p: float`
  - `top_k: int`
  - `stop_sequences: Optional[List[str]]`
  - `metadata: Optional[Dict[str, Any]]`
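
The provider enumeration and request fields listed above could be modeled roughly as follows. This is a minimal sketch, not the module's code: it uses a standard-library dataclass rather than `pydantic` to stay self-contained, and the enum member names and every default value are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List, Optional


class LLMProvider(str, Enum):
    # Values come from the provider list; member names are assumed.
    GEMINI = "gemini"
    CLAUDE = "claude"
    OPENAI = "openai"
    DEEPSEEK = "deepseek"
    QWEN = "qwen"
    KIMI = "kimi"
    OPENROUTER = "openrouter"
    CUSTOM = "custom"


@dataclass
class LLMRequest:
    prompt: str
    provider: LLMProvider
    model: str
    # The defaults below are illustrative assumptions only.
    max_tokens: int = 1024
    temperature: float = 0.7
    top_p: float = 1.0
    top_k: int = 40
    stop_sequences: Optional[List[str]] = None
    metadata: Optional[Dict[str, Any]] = None


# "example-model" is a placeholder, not a real model name.
request = LLMRequest(
    prompt="Summarize this text.",
    provider=LLMProvider("claude"),
    model="example-model",
)
```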

### Response Model

- `LLMResponse`
  - Contains provider text, token counts, latency, cost, and metadata.

## Provider Infrastructure

### `RetryConfig`

Controls retry behavior:

- `max_attempts`
- `base_delay`
- `max_delay`
- `exponential_base`
- `jitter`
- `retryable_exceptions`

### `TimeoutConfig`

Defines connection and read timeouts for all HTTP-based providers.

### `RateLimitConfig`

Configures the token-bucket rate limiting used for request pacing.

### `TokenBucket`

Used internally by `BaseLLMClient` to prevent burst overload.
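
For readers unfamiliar with the technique, a token bucket can be sketched as below. The constructor parameters (`rate`, `capacity`) and the `acquire()` method are assumptions for illustration; the actual `TokenBucket` interface is not documented here.

```python
import asyncio
import time


class TokenBucket:
    """Illustrative async token bucket; not the module's implementation."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self._tokens = capacity
        self._updated = time.monotonic()
        self._lock = asyncio.Lock()

    def _refill(self) -> None:
        now = time.monotonic()
        elapsed = now - self._updated
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._updated = now

    async def acquire(self, tokens: float = 1.0) -> None:
        if tokens > self.capacity:
            raise ValueError("requested tokens exceed bucket capacity")
        async with self._lock:
            while True:
                self._refill()
                if self._tokens >= tokens:
                    self._tokens -= tokens
                    return
                # Sleep roughly until enough tokens have accumulated.
                await asyncio.sleep((tokens - self._tokens) / self.rate)
```

A burst of up to `capacity` requests passes immediately; further requests are paced at `rate` per second.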

### `CircuitBreaker`

Used to mark unhealthy providers and fail fast when repeated errors occur.
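
The fail-fast behavior can be illustrated with a small state machine. This is a sketch under assumed names: `failure_threshold`, `recovery_timeout`, and the three methods are hypothetical and not taken from the module.

```python
import time
from typing import Optional


class CircuitBreaker:
    """Illustrative breaker: opens after repeated consecutive failures."""

    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: reject until the cool-down has passed, then allow a trial call.
        return time.monotonic() - self._opened_at >= self.recovery_timeout

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = time.monotonic()
```

A caller would check `allow_request()` before contacting the provider and report the outcome with `record_success()` or `record_failure()`.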

## Base Client

### `BaseLLMClient`

#### Responsibilities

- Manages provider-specific HTTP connections.
- Applies concurrency limits with `asyncio.Semaphore`.
- Applies rate limiting and retry logic.
- Tracks metrics such as request count, errors, and latency.
- Exposes `generate()` and `generate_stream()`.

#### HTTP Client Reuse

- `_ensure_http_client(base_url, headers) -> httpx.AsyncClient`
- Ensures a single `httpx.AsyncClient` instance is reused per provider instance.
- Prevents socket exhaustion and reduces connection setup overhead.

#### Async Safety

- All operations are async and reentrant.
- Concurrency is controlled per provider using semaphores.
- `close()` shuts down the HTTP client cleanly.
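
The semaphore-based concurrency limit might look like the following sketch. `ConcurrencyLimitedClient` is a hypothetical stand-in for `BaseLLMClient`, the sleep replaces a real provider call, and the `in_flight` counters exist only to make the limit observable.

```python
import asyncio


class ConcurrencyLimitedClient:
    """Hypothetical stand-in showing a per-client asyncio.Semaphore."""

    def __init__(self, max_concurrent_requests: int = 2) -> None:
        self._semaphore = asyncio.Semaphore(max_concurrent_requests)
        self.in_flight = 0
        self.peak_in_flight = 0

    async def _generate_internal(self, prompt: str) -> str:
        await asyncio.sleep(0.01)  # placeholder for the provider request
        return prompt.upper()

    async def generate(self, prompt: str) -> str:
        # At most max_concurrent_requests callers run this body at once.
        async with self._semaphore:
            self.in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
            try:
                return await self._generate_internal(prompt)
            finally:
                self.in_flight -= 1


async def main() -> None:
    client = ConcurrencyLimitedClient(max_concurrent_requests=2)
    results = await asyncio.gather(*(client.generate(f"p{i}") for i in range(5)))
    print(results, client.peak_in_flight)


if __name__ == "__main__":
    asyncio.run(main())
```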

## Provider Implementations

### `GeminiClient`

- Uses `google-genai` async client.
- Implements `_generate_internal()` and `_generate_stream_internal()`.
- Uses Gemini-specific token counting where available.

### `ClaudeClient`

- Sends requests to Anthropic Claude.
- Uses `/v1/messages` endpoint.
- Implements stream parsing for `content_block_delta` chunks.
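
As an illustration of the stream parsing step, the sketch below extracts text from `content_block_delta` events in a server-sent event stream. The `iter_text_deltas` helper is hypothetical, and the sample lines are hand-written in the shape of Anthropic's streaming events rather than captured from a real response.

```python
import json
from typing import Iterable, Iterator


def iter_text_deltas(sse_lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from `content_block_delta` events."""
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip "event:" lines, comments, and blank separators
        payload = line[len("data:"):].strip()
        if not payload:
            continue
        event = json.loads(payload)
        if event.get("type") != "content_block_delta":
            continue
        delta = event.get("delta", {})
        if delta.get("type") == "text_delta":
            yield delta.get("text", "")


# Hand-written sample stream for illustration.
sample = [
    "event: message_start",
    'data: {"type": "message_start", "message": {}}',
    "",
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, '
    '"delta": {"type": "text_delta", "text": "Hello"}}',
    "",
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, '
    '"delta": {"type": "text_delta", "text": " world"}}',
    "",
    "event: message_stop",
    'data: {"type": "message_stop"}',
]
text = "".join(iter_text_deltas(sample))
```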

### `OpenAIClient`

- Connects to `https://api.openai.com/v1`.
- Uses `/chat/completions` for standard and streaming responses.

### `GenericAPIClient`

- Supports OpenAI-compatible providers such as DeepSeek, Qwen, Kimi, OpenRouter, and custom targets.
- Requires `base_url` for `CUSTOM` provider.

## Factory and Orchestration

### `LLMClientFactory`

- `create(provider, model, api_key, base_url, retry_config, timeout_config, rate_limit_config, circuit_breaker, max_concurrent_requests)`
- `get_or_create(provider, **kwargs)` returns singleton clients keyed by provider and model.
- `close_all()` cleanly closes all managed provider clients.
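
The caching behavior of `get_or_create()` and `close_all()` can be approximated as below. `ClientRegistry` and `DummyClient` are hypothetical names, and passing `model` as an explicit argument is an assumption; the real factory takes it through `**kwargs`.

```python
import asyncio
from typing import Any, Callable, Dict, Tuple


class DummyClient:
    """Placeholder for a provider client; only tracks whether it was closed."""

    def __init__(self, provider: str, model: str, **kwargs: Any) -> None:
        self.provider = provider
        self.model = model
        self.closed = False

    async def close(self) -> None:
        self.closed = True


class ClientRegistry:
    """Caches one client per (provider, model) pair."""

    def __init__(self, create: Callable[..., Any]) -> None:
        self._create = create
        self._clients: Dict[Tuple[str, str], Any] = {}

    def get_or_create(self, provider: str, model: str, **kwargs: Any) -> Any:
        key = (provider, model)
        if key not in self._clients:
            self._clients[key] = self._create(provider=provider, model=model, **kwargs)
        return self._clients[key]

    async def close_all(self) -> None:
        clients = list(self._clients.values())
        self._clients.clear()
        for client in clients:
            await client.close()


registry = ClientRegistry(DummyClient)
first = registry.get_or_create("openai", "example-model")
second = registry.get_or_create("openai", "example-model")  # same cached object
```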

### `LLMOrchestrator`

- Load-balances traffic across the configured provider list (the default provider and its fallbacks).
- Falls back to the next provider when the primary provider fails.
- Exposes `generate()`, `generate_stream()`, `health_check_all()`, `get_metrics()`, and `close()`.
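
Provider fallback can be sketched as trying each provider in order until one succeeds. The `generate_with_fallback` function and the two toy providers are hypothetical; the orchestrator's actual selection and load-balancing logic is not shown in this document.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, Tuple

Generate = Callable[[str], Awaitable[str]]


async def generate_with_fallback(
    prompt: str, providers: Sequence[Tuple[str, Generate]]
) -> Tuple[str, str]:
    """Return (provider_name, text) from the first provider that succeeds."""
    errors: List[str] = []
    for name, generate in providers:
        try:
            return name, await generate(prompt)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


async def failing_primary(prompt: str) -> str:
    raise ConnectionError("primary unavailable")


async def working_backup(prompt: str) -> str:
    return f"echo: {prompt}"


used, text = asyncio.run(
    generate_with_fallback(
        "hi", [("primary", failing_primary), ("backup", working_backup)]
    )
)
```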

## Utilities

- `validate_prompt(prompt, max_length=100000)` sanitizes input text.
- `estimate_cost(input_tokens, output_tokens, provider, model)` calculates estimated cost using pricing data.
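
A cost estimate of this kind is typically a per-token price lookup followed by a weighted sum. In the sketch below the pricing table, its per-million-token unit, and the provider and model names are invented placeholders, not real prices or the module's data.

```python
from typing import Dict, Tuple

# (provider, model) -> (input price, output price) in USD per million tokens.
# These numbers are made up for illustration.
PRICING_PER_MILLION: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("example-provider", "example-model"): (1.00, 2.00),
}


def estimate_cost(input_tokens: int, output_tokens: int, provider: str, model: str) -> float:
    """Estimated cost in USD; raises KeyError when no pricing entry exists."""
    input_price, output_price = PRICING_PER_MILLION[(provider, model)]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


cost = estimate_cost(1_000_000, 500_000, "example-provider", "example-model")
```

With the placeholder prices, one million input tokens and half a million output tokens cost 1.00 + 1.00 = 2.00 USD.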

## Dependencies

- `httpx`
- `pydantic`
- Optional: `google-genai`, `tiktoken`

## Performance and Scalability

- Request handling is async, suitable for FastAPI.
- Concurrency limits and backoff reduce overload on provider APIs.
- Shared HTTP clients reduce connection churn.
- Token counting is O(L) where L = text length.

## Edge Cases

- Missing provider credentials raise explicit configuration errors.
- Circuit breaker prevents repeated calls to failing providers.
- `generate_stream()` preserves each provider's streaming semantics and guards against retries while a stream is being consumed.