llm-api-proxy / DOCUMENTATION.md

Mirrowel

feat: Add build workflow, proxy application, and update documentation for executable usage

5bfdc95 8 months ago


	# Technical Documentation: `rotating-api-key-client`

	This document explains the `rotating-api-key-client` library, its components, and its internal workings. The library is an asynchronous client for managing LLM API keys, with a focus on concurrency, resilience, and state management.

	## 1. `client.py` - The `RotatingClient`

	The `RotatingClient` is the central component, orchestrating API calls, key management, and error handling. It is designed as a long-lived, async-native object.

	### Core Responsibilities
	- Managing an `httpx.AsyncClient` for non-blocking HTTP requests.
	- Interfacing with the `UsageManager` to acquire and release API keys.
	- Handling provider-specific request modifications.
	- Executing API calls via `litellm` with a robust retry and rotation strategy.
	- Providing a safe wrapper for streaming responses.
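
The last responsibility, the safe streaming wrapper, can be illustrated with a minimal sketch. It assumes the wrapper is an async generator that formats each chunk as a server-sent event and runs a cleanup callback in a `finally` block. The names `safe_streaming_wrapper` and `on_finish` are hypothetical; the library's own `_safe_streaming_wrapper` may differ in signature and detail.

```python
import json
from typing import Any, AsyncIterator, Awaitable, Callable


async def safe_streaming_wrapper(
    stream: AsyncIterator[dict[str, Any]],
    on_finish: Callable[[], Awaitable[None]],
) -> AsyncIterator[str]:
    """Yield SSE-formatted chunks and always run on_finish when the stream ends."""
    try:
        async for chunk in stream:
            # One server-sent event per chunk: "data: <json>" followed by a blank line
            yield f"data: {json.dumps(chunk)}\n\n"
    finally:
        # Runs on normal completion, on error, and when the consumer closes
        # the generator early, so the key is never left locked.
        # In the real client this is where usage would be recorded and the key released.
        await on_finish()
```

A consumer that stops early should call `aclose()` on the generator (or let the framework do so) so the `finally` block runs promptly.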

	### Request Lifecycle (`acompletion`)

	When `acompletion` is called, it follows these steps:

	1. Provider and Key Validation: It extracts the provider from the `model` name and ensures keys are configured for it.

	2. Key Acquisition Loop: The client enters a loop to find a valid key and complete the request. It iterates through all keys for the provider until one succeeds or all have been tried.
	    a. Acquire Best Key: It calls `self.usage_manager.acquire_key()`. This call awaits (suspending the calling coroutine, not the event loop) until a suitable key is available, based on the manager's tiered locking strategy (see `UsageManager` section).
	    b. Prepare Request: It prepares the `litellm` keyword arguments. This includes:
	        - Request Sanitization: Calling `sanitize_request_payload()` to remove parameters that the target model may not support, preventing errors.
	        - Provider-Specific Logic: Applying special handling for providers like Gemini (safety settings), Gemma (system prompts), and Chutes.ai (`api_base` and model name remapping).

	3. Retry Loop: Once a key is acquired, it enters an inner retry loop (`for attempt in range(self.max_retries)`):
	    a. API Call: It calls `litellm.acompletion` with the acquired key.
	    b. Success (Non-Streaming):
	        - It calls `self.usage_manager.record_success()` to update usage stats and clear any cooldowns for the key-model pair.
	        - It calls `self.usage_manager.release_key()` to release the lock on the key for this model.
	        - It returns the response, and the process ends.
	    c. Success (Streaming):
	        - It returns a `_safe_streaming_wrapper` async generator. This wrapper is critical:
	            - It yields SSE-formatted chunks to the consumer.
	            - When the stream ends, its `finally` block ensures that `record_success()` and `release_key()` are called. This guarantees that the key lock is held for the entire duration of the stream and released correctly, even if the consumer abandons the stream before it is fully consumed.
	    d. Failure: If an exception occurs:
	        - The failure is logged in detail by `log_failure()`.
	        - The exception is passed to `classify_error()` to get a structured `ClassifiedError` object.
	        - Server Error: If the error type is `server_error`, it waits with exponential backoff and retries the request with the same key.
	        - Rotation Error (Rate Limit, Auth, etc.): Any other error is treated as a rotation trigger. `self.usage_manager.record_failure()` is called to apply an escalating cooldown, and `self.usage_manager.release_key()` releases the lock. The inner `attempt` loop is broken, and the outer `while` loop continues, acquiring a new key.
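
The interplay between the outer key loop and the inner retry loop can be sketched as follows. This is a simplified illustration, not the library's code: `ServerError`, `RotationError`, `complete_with_rotation`, and `call_api` are hypothetical names, the `UsageManager` calls are omitted, and moving on to the next key once server-error retries are exhausted is an assumption.

```python
import asyncio
from typing import Any, Awaitable, Callable


class ServerError(Exception):
    """Hypothetical stand-in for an error classified as `server_error`."""


class RotationError(Exception):
    """Hypothetical stand-in for rate-limit, auth, and other rotation triggers."""


async def complete_with_rotation(
    keys: list[str],
    call_api: Callable[[str], Awaitable[Any]],
    max_retries: int = 2,
    base_delay: float = 0.01,
) -> Any:
    """Try each key in turn; retry server errors on the same key with backoff."""
    for key in keys:  # outer loop: one iteration per acquired key
        for attempt in range(max_retries):  # inner retry loop
            try:
                return await call_api(key)
            except ServerError:
                # Transient server error: back off exponentially, keep the same key
                await asyncio.sleep(base_delay * (2 ** attempt))
            except RotationError:
                # Any other error: stop retrying this key and rotate to the next
                break
    raise RuntimeError("all keys failed")
```

In the real client, the rotation branch is also where the failure would be recorded and the key lock released before the next key is acquired.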

	## 2. `usage_manager.py` - Stateful Concurrency & Usage Management

	This class is the heart of the library's state management and concurrency control. It is a stateful, async-native service that ensures keys are used efficiently and safely across multiple concurrent requests.

	### Key Concepts

	- Asynchronous Design & Lazy Loading: The entire class is asynchronous, using `aiofiles` for non-blocking file I/O and a `_lazy_init` pattern. The usage data from the JSON file is loaded only when the first request is made.
	- Concurrency Primitives:
	    - `filelock`: A file-level lock (`.json.lock`) prevents race conditions if multiple processes are running and sharing the same usage file.
	    - `asyncio.Lock` & `asyncio.Condition`: Each key has its own `asyncio.Lock` and `asyncio.Condition` object. This enables the fine-grained, model-aware locking strategy.
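
The lazy-loading idea can be sketched as below. This is an assumption-laden illustration: the class name `LazyUsageStore` and its methods are hypothetical, and `asyncio.to_thread` stands in for `aiofiles` so the example needs only the standard library. File-level locking with `filelock` is not shown.

```python
import asyncio
import json
from pathlib import Path
from typing import Any, Optional


class LazyUsageStore:
    """Loads usage data from disk on first access, not at construction time."""

    def __init__(self, path: str) -> None:
        self._path = Path(path)
        self._data: Optional[dict[str, Any]] = None
        self._init_lock = asyncio.Lock()

    async def _lazy_init(self) -> None:
        if self._data is not None:
            return
        async with self._init_lock:
            # Re-check after waiting: another task may have loaded the file already
            if self._data is None:
                self._data = await asyncio.to_thread(self._read_file)

    def _read_file(self) -> dict[str, Any]:
        # A missing file is treated as "no usage recorded yet"
        if not self._path.exists():
            return {}
        return json.loads(self._path.read_text(encoding="utf-8"))

    async def get(self, key_hash: str) -> dict[str, Any]:
        await self._lazy_init()
        return self._data.get(key_hash, {})
```

The lock plus the second `None` check means concurrent first requests trigger only one file read.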

	### Tiered Key Acquisition (`acquire_key`)

	This method implements the core logic for selecting a key. Rather than failing when every key is busy, it suspends the caller until a suitable key can be locked.

	1. Filtering: It first filters out any keys that are on a global or model-specific cooldown.
	2. Tiering: It categorizes the remaining, valid keys into two tiers:
	    - Tier 1 (Ideal): Keys that are completely free (not being used by any model).
	    - Tier 2 (Acceptable): Keys that are currently in use, but for different models than the one being requested.
	3. Selection: It attempts to acquire a lock on a key, prioritizing Tier 1 over Tier 2. Within each tier, it prioritizes the least-used key.
	4. Waiting: If no keys in Tier 1 or Tier 2 can be locked, it means all eligible keys are currently handling requests for the same model. The method then `await`s on the `asyncio.Condition` of the best available key, waiting until it is notified that the key has been released.
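
The four steps above can be sketched as follows. This is a simplified model under stated assumptions: `KeyState`, `select_key`, and `KeyPool` are hypothetical names, and a single pool-wide `asyncio.Condition` replaces the per-key locks and conditions the library uses.

```python
import asyncio
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KeyState:
    name: str
    usage_count: int = 0
    cooldown_until: float = 0.0
    models_in_use: set[str] = field(default_factory=set)


def select_key(keys: list[KeyState], model: str, now: float) -> Optional[KeyState]:
    """Return the best key for `model`, or None if every eligible key is busy with it."""
    eligible = [k for k in keys if k.cooldown_until <= now]  # 1. filtering
    tier1 = [k for k in eligible if not k.models_in_use]     # 2. completely free
    tier2 = [k for k in eligible
             if k.models_in_use and model not in k.models_in_use]
    for tier in (tier1, tier2):                              # 3. Tier 1 before Tier 2
        if tier:
            return min(tier, key=lambda k: k.usage_count)    # least-used key wins
    return None


class KeyPool:
    def __init__(self, keys: list[KeyState]) -> None:
        self.keys = keys
        self.released = asyncio.Condition()

    async def acquire(self, model: str) -> KeyState:
        async with self.released:
            while True:
                key = select_key(self.keys, model, time.time())
                if key is not None:
                    key.models_in_use.add(model)
                    return key
                # 4. waiting: sleep until some key is released, then re-evaluate
                await self.released.wait()

    async def release(self, key: KeyState, model: str) -> None:
        async with self.released:
            key.models_in_use.discard(model)
            key.usage_count += 1
            self.released.notify_all()
```

Note that this sketch only wakes waiters on release, so a pool whose keys are all on cooldown would need an additional timed wake-up.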

	### Failure Handling & Cooldowns (`record_failure`)

	- Escalating Backoff: When a failure is recorded, it applies a cooldown that increases with the number of consecutive failures for a specific key-model pair (e.g., 10s, 30s, 60s, up to 2 hours).
	- Authentication Errors: These are treated more severely, applying an immediate 5-minute key-level lockout.
	- Key-Level Lockouts: If a single key accumulates 3 or more long-term (2-hour) cooldowns across different models, the manager assumes the key is compromised or disabled and applies a 5-minute global lockout on the key.
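
A rough sketch of these cooldown rules is shown below. The constants mirror the figures quoted above, but the exact schedule between 60 seconds and 2 hours is not documented here, so jumping straight to the cap after the third failure is an assumption. The function names are hypothetical; the dictionary fields follow the `key_usage.json` layout described in this document.

```python
COOLDOWN_SCHEDULE = [10, 30, 60]  # seconds for the 1st, 2nd, 3rd consecutive failure
MAX_COOLDOWN = 2 * 60 * 60        # long-term cooldown cap: 2 hours
KEY_LOCKOUT = 5 * 60              # key-level lockout: 5 minutes
LONG_COOLDOWN_LIMIT = 3           # models on long-term cooldown before locking the key


def cooldown_seconds(consecutive_failures: int) -> int:
    """Map a consecutive-failure count to a cooldown (assumed schedule)."""
    if consecutive_failures <= 0:
        return 0
    if consecutive_failures <= len(COOLDOWN_SCHEDULE):
        return COOLDOWN_SCHEDULE[consecutive_failures - 1]
    return MAX_COOLDOWN


def record_failure(entry: dict, model: str, error_type: str, now: float) -> None:
    """Update one key's usage entry in place after a failed request."""
    if error_type == "authentication":
        # Auth errors lock the whole key immediately
        entry["key_cooldown_until"] = now + KEY_LOCKOUT
        return

    failures = entry.setdefault("failures", {})
    stats = failures.setdefault(model, {"consecutive_failures": 0})
    stats["consecutive_failures"] += 1
    entry.setdefault("model_cooldowns", {})[model] = now + cooldown_seconds(
        stats["consecutive_failures"]
    )

    # Count models that have escalated to the long-term cooldown
    long_term = sum(
        1 for f in failures.values()
        if cooldown_seconds(f["consecutive_failures"]) >= MAX_COOLDOWN
    )
    if long_term >= LONG_COOLDOWN_LIMIT:
        entry["key_cooldown_until"] = now + KEY_LOCKOUT
```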

	### Data Structure

	The `key_usage.json` file stores this state per key, keyed by the API key's hash:
	```json
	{
	  "api_key_hash": {
	    "daily": {
	      "date": "YYYY-MM-DD",
	      "models": {
	        "gemini/gemini-1.5-pro": {
	          "success_count": 10,
	          "prompt_tokens": 5000,
	          "completion_tokens": 10000,
	          "approx_cost": 0.075
	        }
	      }
	    },
	    "global": { /* ... similar to daily, but accumulates over time ... */ },
	    "model_cooldowns": {
	      "gemini/gemini-1.5-flash": 1719987600.0
	    },
	    "failures": {
	      "gemini/gemini-1.5-flash": {
	        "consecutive_failures": 2
	      }
	    },
	    "key_cooldown_until": null,
	    "last_daily_reset": "YYYY-MM-DD"
	  }
	}
	```
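
As an illustration of how this structure might be read, the helpers below check whether a key is usable for a model and look up a daily counter. Both function names are hypothetical, and treating the cooldown values as Unix timestamps is an assumption based on the sample value above.

```python
def key_is_available(entry: dict, model: str, now: float) -> bool:
    """True if neither a key-level nor a model-level cooldown is active at `now`."""
    key_until = entry.get("key_cooldown_until")
    if key_until is not None and key_until > now:
        return False
    model_until = entry.get("model_cooldowns", {}).get(model, 0.0)
    return model_until <= now


def daily_success_count(entry: dict, model: str) -> int:
    """Number of successful calls recorded today for `model` (0 if none)."""
    models = entry.get("daily", {}).get("models", {})
    return models.get(model, {}).get("success_count", 0)
```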

	## 3. `error_handler.py`

	This module provides a centralized function, `classify_error`, which replaces the previous boolean checks with a structured result.

	- It takes a raw exception from `litellm` and returns a `ClassifiedError` data object.
	- This object contains the `error_type` (e.g., `'rate_limit'`, `'authentication'`, `'server_error'`), the original exception, the status code, and any `retry_after` information extracted from the error message.
	- This structured classification allows the `RotatingClient` to make more intelligent decisions about whether to retry with the same key or rotate to a new one.
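
A minimal sketch of such a classifier is shown below. The field names follow the description above, but the status-code mapping, the `"unknown"` fallback type, and the `retry_after` regular expression are assumptions for illustration. The sketch also assumes the exception exposes a `status_code` attribute.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClassifiedError:
    error_type: str
    original_exception: Exception
    status_code: Optional[int] = None
    retry_after: Optional[int] = None


# Matches e.g. "Retry-After: 30" or "retry after 30 seconds" in an error message
_RETRY_AFTER = re.compile(r"retry[-_ ]?after\D*?(\d+)", re.IGNORECASE)


def classify_error(exc: Exception) -> ClassifiedError:
    """Turn a raw exception into a structured error (assumed mapping)."""
    status = getattr(exc, "status_code", None)
    match = _RETRY_AFTER.search(str(exc))
    retry_after = int(match.group(1)) if match else None

    if status == 429:
        error_type = "rate_limit"
    elif status in (401, 403):
        error_type = "authentication"
    elif status is not None and status >= 500:
        error_type = "server_error"
    else:
        error_type = "unknown"  # hypothetical fallback, not named in this document
    return ClassifiedError(error_type, exc, status, retry_after)
```

A caller can then branch on `error_type`: retry on `"server_error"`, rotate on anything else.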

	## 4. `request_sanitizer.py` (New Module)

	- This module's purpose is to prevent `InvalidRequestError` exceptions from `litellm` that occur when a payload contains parameters not supported by the target model (e.g., sending a `thinking` parameter to a model that doesn't support it).
	- The `sanitize_request_payload` function is called just before `litellm.acompletion` to strip out any such unsupported parameters, making the system more robust.
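
One way such a sanitizer could work is sketched below. The real function's signature and rules are not shown in this document, so the `(payload, model)` signature, the `UNSUPPORTED_PARAMS` table, and the model name in it are all hypothetical.

```python
# Hypothetical table: parameters to strip for specific models
UNSUPPORTED_PARAMS: dict[str, set[str]] = {
    "example-provider/basic-model": {"thinking"},
}


def sanitize_request_payload(payload: dict, model: str) -> dict:
    """Return a copy of `payload` without parameters the model does not support."""
    blocked = UNSUPPORTED_PARAMS.get(model, set())
    return {name: value for name, value in payload.items() if name not in blocked}
```

Returning a new dictionary leaves the caller's payload untouched, so the same request can be retried against a different model.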

	## 5. `providers/` - Provider Plugins

	The provider plugin system is used for fetching model lists. The interface specifies that the `get_models` method receives an `httpx.AsyncClient` instance, which it should use to make its API calls. This ensures all HTTP traffic goes through the client's managed session.
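
A plugin following this interface might look like the sketch below. Only the `get_models` name and the `httpx.AsyncClient` argument come from this document; the base class name, the `api_key` parameter, the URL, and the response shape are hypothetical. `httpx` is imported only for type checking, so the sketch runs without it installed.

```python
from abc import ABC, abstractmethod
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    import httpx


class ProviderPlugin(ABC):
    """Hypothetical base class for provider plugins."""

    @abstractmethod
    async def get_models(self, api_key: str, client: "httpx.AsyncClient") -> list[str]:
        """Return the model names available to `api_key`."""


class ExampleProvider(ProviderPlugin):
    MODELS_URL = "https://api.example.com/v1/models"  # placeholder endpoint

    async def get_models(self, api_key: str, client: "httpx.AsyncClient") -> list[str]:
        # Use the shared client passed in rather than creating a new session
        response = await client.get(
            self.MODELS_URL, headers={"Authorization": f"Bearer {api_key}"}
        )
        response.raise_for_status()
        # Assumed response shape: {"data": [{"id": "..."}, ...]}
        return [f"example/{model['id']}" for model in response.json()["data"]]
```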

	## 6. `proxy_app/` - The Proxy Application

	The `proxy_app` directory contains the FastAPI application that serves the rotating client.

	### `main.py` - The FastAPI App

	This file contains the FastAPI application that exposes the `RotatingClient` through an OpenAI-compatible API.

	#### Command-Line Arguments

	- `--enable-request-logging`: Logs all incoming requests and outgoing responses to the `logs/` directory, which is useful for debugging and monitoring the proxy's activity. Disabled by default.
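
A flag like this is typically parsed with `argparse`. The sketch below shows one way to do so; only the flag name comes from this document, and `parse_args` and the description text are illustrative.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="LLM API proxy")
    parser.add_argument(
        "--enable-request-logging",
        action="store_true",  # defaults to False when the flag is absent
        help="Log incoming requests and outgoing responses to the logs/ directory.",
    )
    return parser.parse_args(argv)
```

`argparse` exposes the value as `args.enable_request_logging`, converting the dashes to underscores.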