
Technical Documentation: rotating-api-key-client

This document provides a detailed technical explanation of the rotating-api-key-client library, its components, and its internal workings. The library has evolved into a sophisticated, asynchronous client for managing LLM API keys with a strong focus on concurrency, resilience, and state management.

1. client.py - The RotatingClient

The RotatingClient is the central component, orchestrating API calls, key management, and error handling. It is designed as a long-lived, async-native object.

Core Responsibilities

  • Managing an httpx.AsyncClient for non-blocking HTTP requests.
  • Interfacing with the UsageManager to acquire and release API keys.
  • Handling provider-specific request modifications.
  • Executing API calls via litellm with a robust retry and rotation strategy.
  • Providing a safe wrapper for streaming responses.

Request Lifecycle (acompletion)

When acompletion is called, it follows these steps:

  1. Provider and Key Validation: It extracts the provider from the model name and ensures keys are configured for it.

  2. Key Acquisition Loop: The client enters a loop to find a valid key and complete the request. It iterates through all keys for the provider until one succeeds or all have been tried.
    a. Acquire Best Key: It calls self.usage_manager.acquire_key(). This is a blocking call that waits until a suitable key is available, based on the manager's tiered locking strategy (see the UsageManager section).
    b. Prepare Request: It prepares the litellm keyword arguments. This includes:
      • Request Sanitization: Calling sanitize_request_payload() to remove parameters that might be unsupported by the target model, preventing errors.
      • Provider-Specific Logic: Applying special handling for providers like Gemini (safety settings), Gemma (system prompts), and Chutes.ai (api_base and model name remapping).

  3. Retry Loop: Once a key is acquired, it enters an inner retry loop (for attempt in range(self.max_retries)):
    a. API Call: It calls litellm.acompletion with the acquired key.
    b. Success (Non-Streaming):
      • It calls self.usage_manager.record_success() to update usage stats and clear any cooldowns for the key-model pair.
      • It calls self.usage_manager.release_key() to release the lock on the key for this model.
      • It returns the response, and the process ends.
    c. Success (Streaming):
      • It returns a _safe_streaming_wrapper async generator. This wrapper is critical:
        • It yields SSE-formatted chunks to the consumer.
        • After the stream is fully consumed, its finally block ensures that record_success() and release_key() are called. This guarantees that the key lock is held for the entire duration of the stream and released correctly, even if the consumer abandons the stream.
    d. Failure: If an exception occurs:
      • The failure is logged in detail by log_failure().
      • The exception is passed to classify_error() to get a structured ClassifiedError object.
      • Server Error: If the error type is server_error, it waits with exponential backoff and retries the request with the same key.
      • Rotation Error (Rate Limit, Auth, etc.): Any other error is treated as a rotation trigger. self.usage_manager.record_failure() is called to apply an escalating cooldown, and self.usage_manager.release_key() releases the lock. The inner attempt loop is broken, and the outer while loop continues, acquiring a new key.
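The lifecycle above can be condensed into a sketch. This is not the library's actual code: `call_provider`, the `exclude` parameter, and the stub error shape are assumptions for illustration, and streaming, logging, and timing details are omitted.

```python
import asyncio

MAX_RETRIES = 3  # stand-in for self.max_retries

async def acompletion_sketch(client, model: str, **kwargs):
    """Condensed sketch of the acquire -> retry -> rotate lifecycle."""
    tried = set()
    while True:
        # Blocks until a suitable key is free (tiered locking, see UsageManager).
        key = await client.usage_manager.acquire_key(model, exclude=tried)
        if key is None:
            raise RuntimeError("all keys for this provider have been tried")
        tried.add(key)
        for attempt in range(MAX_RETRIES):
            try:
                response = await client.call_provider(model, api_key=key, **kwargs)
                await client.usage_manager.record_success(key, model)
                await client.usage_manager.release_key(key, model)
                return response
            except Exception as exc:
                error = client.classify_error(exc)
                if error.error_type == "server_error":
                    # Transient server fault: retry the same key with backoff.
                    await asyncio.sleep(2 ** attempt)
                    continue
                # Rate limit, auth, etc.: cool the key down and rotate.
                await client.usage_manager.record_failure(key, model, error)
                await client.usage_manager.release_key(key, model)
                break  # back to the outer loop to acquire a different key
        else:
            # Retries exhausted on server errors: release the key and rotate.
            await client.usage_manager.release_key(key, model)
```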

2. usage_manager.py - Stateful Concurrency & Usage Management

This class is the heart of the library's state management and concurrency control. It is a stateful, async-native service that ensures keys are used efficiently and safely across multiple concurrent requests.

Key Concepts

  • Asynchronous Design & Lazy Loading: The entire class is asynchronous, using aiofiles for non-blocking file I/O and a _lazy_init pattern. The usage data from the JSON file is loaded only when the first request is made.
  • Concurrency Primitives:
    • filelock: A file-level lock (.json.lock) prevents race conditions if multiple processes are running and sharing the same usage file.
    • asyncio.Lock & asyncio.Condition: Each key has its own asyncio.Lock and asyncio.Condition object. This enables the fine-grained, model-aware locking strategy.
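A minimal sketch of how a per-key asyncio.Condition supports this kind of model-aware blocking: a waiter sleeps until that specific key is released rather than polling. The class and method names here are illustrative, not the library's.

```python
import asyncio

class KeyState:
    """Per-key state: one Condition guards which models currently use the key."""

    def __init__(self) -> None:
        self.condition = asyncio.Condition()
        self.in_use_models: set = set()

    async def acquire_for(self, model: str) -> None:
        # Block until no in-flight request is using this key for `model`.
        async with self.condition:
            await self.condition.wait_for(lambda: model not in self.in_use_models)
            self.in_use_models.add(model)

    async def release(self, model: str) -> None:
        async with self.condition:
            self.in_use_models.discard(model)
            # Wake every waiter so each can re-check its own predicate.
            self.condition.notify_all()
```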

Tiered Key Acquisition (acquire_key)

This method implements the core logic for selecting a key. It is a "smart" blocking call.

  1. Filtering: It first filters out any keys that are on a global or model-specific cooldown.
  2. Tiering: It categorizes the remaining, valid keys into two tiers:
    • Tier 1 (Ideal): Keys that are completely free (not being used by any model).
    • Tier 2 (Acceptable): Keys that are currently in use, but for different models than the one being requested.
  3. Selection: It attempts to acquire a lock on a key, prioritizing Tier 1 over Tier 2. Within each tier, it prioritizes the least-used key.
  4. Waiting: If no keys in Tier 1 or Tier 2 can be locked, it means all eligible keys are currently handling requests for the same model. The method then awaits on the asyncio.Condition of the best available key, waiting until it is notified that the key has been released.
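The filter/tier/select steps can be sketched as a pure function over per-key state. The `usage` schema used here (`in_use_models`, `model_cooldowns`, `key_cooldown_until`, `request_count`) is an assumption for illustration, not the library's real data layout, and the waiting step is only indicated by the `None` return.

```python
import time
from typing import Optional

def select_key(usage: dict, model: str, now: Optional[float] = None) -> Optional[str]:
    """Sketch of acquire_key's selection logic: filter, tier, then pick."""
    now = time.time() if now is None else now
    candidates = []
    for key, state in usage.items():
        # 1. Filter: skip keys on a global or model-specific cooldown.
        if (state.get("key_cooldown_until") or 0) > now:
            continue
        if state.get("model_cooldowns", {}).get(model, 0) > now:
            continue
        in_use = state.get("in_use_models", set())
        if model in in_use:
            continue  # busy with *this* model; caller must wait on its Condition
        # 2. Tier: rank 0 = completely idle (Tier 1),
        #          rank 1 = busy with other models (Tier 2).
        rank = 0 if not in_use else 1
        candidates.append((rank, state.get("request_count", 0), key))
    if not candidates:
        return None  # acquire_key would now await the best key's Condition
    # 3. Select: best tier first, then least-used within the tier.
    return min(candidates)[2]
```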

Failure Handling & Cooldowns (record_failure)

  • Escalating Backoff: When a failure is recorded, it applies a cooldown that increases with the number of consecutive failures for a specific key-model pair (e.g., 10s, 30s, 60s, up to 2 hours).
  • Authentication Errors: These are treated more severely, applying an immediate 5-minute key-level lockout.
  • Key-Level Lockouts: If a single key accumulates 3 or more long-term (2-hour) cooldowns across different models, the manager assumes the key is compromised or disabled and applies a 5-minute global lockout on the key.
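The escalation can be pictured as a simple lookup. The schedule below is illustrative, built from the examples in the text (10s, 30s, 60s, up to 2 hours; 5-minute auth lockout); the real constants live inside usage_manager.py and may differ.

```python
# Illustrative escalation schedule -- not the library's actual constants.
COOLDOWN_STEPS = (10, 30, 60, 300, 900, 3600, 7200)  # seconds, capped at 2 h
AUTH_LOCKOUT = 300  # immediate 5-minute key-level lockout

def cooldown_seconds(consecutive_failures: int, is_auth_error: bool = False) -> int:
    """Cooldown to apply after the Nth consecutive failure for a key-model pair."""
    if is_auth_error:
        return AUTH_LOCKOUT
    index = min(max(consecutive_failures, 1) - 1, len(COOLDOWN_STEPS) - 1)
    return COOLDOWN_STEPS[index]
```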

Data Structure

The key_usage.json file stores this detailed per-key state:

{
  "api_key_hash": {
    "daily": {
      "date": "YYYY-MM-DD",
      "models": {
        "gemini/gemini-1.5-pro": {
          "success_count": 10,
          "prompt_tokens": 5000,
          "completion_tokens": 10000,
          "approx_cost": 0.075
        }
      }
    },
    "global": { /* ... similar to daily, but accumulates over time ... */ },
    "model_cooldowns": {
      "gemini/gemini-1.5-flash": 1719987600.0
    },
    "failures": {
      "gemini/gemini-1.5-flash": {
        "consecutive_failures": 2
      }
    },
    "key_cooldown_until": null,
    "last_daily_reset": "YYYY-MM-DD"
  }
}

3. error_handler.py

This module provides a centralized function, classify_error, which is a significant improvement over the previous boolean checks.

  • It takes a raw exception from litellm and returns a ClassifiedError data object.
  • This object contains the error_type (e.g., 'rate_limit', 'authentication', 'server_error'), the original exception, the status code, and any retry_after information extracted from the error message.
  • This structured classification allows the RotatingClient to make more intelligent decisions about whether to retry with the same key or rotate to a new one.
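The shape described above might look like the following dataclass. Only the fields named in the text (error type, original exception, status code, retry_after) are grounded; the exact field names and the helper are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClassifiedError:
    """Sketch of the structured error object returned by classify_error."""
    error_type: str                      # e.g. 'rate_limit', 'authentication', 'server_error'
    original_exception: Exception
    status_code: Optional[int] = None
    retry_after: Optional[float] = None  # parsed from the error message, if present

def should_rotate(error: ClassifiedError) -> bool:
    """Server errors are retried on the same key; everything else rotates."""
    return error.error_type != "server_error"
```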

4. request_sanitizer.py (New Module)

  • This module's purpose is to prevent InvalidRequestError exceptions from litellm that occur when a payload contains parameters not supported by the target model (e.g., sending a thinking parameter to a model that doesn't support it).
  • The sanitize_request_payload function is called just before litellm.acompletion to strip out any such unsupported parameters, making the system more robust.
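Conceptually, the sanitizer is a filter over the request dict, as in the sketch below. The deny-list, the model name, and the function name are all invented for illustration; the real module's rules are not shown in this document.

```python
# Hypothetical per-model deny-list -- not the real module's rules.
UNSUPPORTED_PARAMS = {
    "example/basic-model": {"thinking"},  # hypothetical model that rejects `thinking`
}

def sanitize_payload(payload: dict, model: str) -> dict:
    """Return a copy of the payload without parameters the model rejects."""
    blocked = UNSUPPORTED_PARAMS.get(model, set())
    return {k: v for k, v in payload.items() if k not in blocked}
```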

5. providers/ - Provider Plugins

The provider plugin system remains for fetching model lists. The interface now correctly specifies that the get_models method receives an httpx.AsyncClient instance, which it should use to make its API calls. This ensures all HTTP traffic goes through the client's managed session.
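The interface described above could be modeled as a Protocol. This is a sketch of the contract, not the library's real base class; `ExampleProvider` and its model names are hypothetical.

```python
import asyncio
from typing import List, Protocol, runtime_checkable

@runtime_checkable
class ProviderPlugin(Protocol):
    """Sketch of the provider interface: get_models receives the client's
    shared httpx.AsyncClient so all HTTP traffic uses one managed session."""
    name: str

    async def get_models(self, http: "httpx.AsyncClient") -> List[str]:
        ...

class ExampleProvider:
    """Hypothetical plugin for illustration."""
    name = "example"

    async def get_models(self, http) -> List[str]:
        # A real plugin would call, e.g., resp = await http.get(models_url)
        # and parse resp.json(); here we return a fixed list.
        return ["example/model-a", "example/model-b"]
```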

6. proxy_app/ - The Proxy Application

The proxy_app directory contains the FastAPI application that serves the rotating client.

main.py - The FastAPI App

This file contains the FastAPI application that exposes the RotatingClient through an OpenAI-compatible API.

Command-Line Arguments

  • --enable-request-logging: This flag enables logging of all incoming requests and outgoing responses to the logs/ directory. This is useful for debugging and monitoring the proxy's activity. By default, this is disabled.