Upload folder using huggingface_hub
- .python-version +1 -0
- README.md +5 -3
- litellm-config-auto.yaml +1 -1
- trim_messages_hook.py +31 -1
.python-version
ADDED
@@ -0,0 +1 @@
+3.9
README.md
CHANGED
@@ -47,15 +47,17 @@ If the env vars are set on the server and you send the master key in the Authori
 
 **Hugging Face: "does not have sufficient permissions to call Inference Providers"** → Your `HF_TOKEN` must be allowed to use the Inference API (serverless). Create or edit the token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) and enable the scope that allows Inference / Inference Providers. Without it, `my-hf-models` fallbacks will fail; the proxy will still use other model groups.
 
-**503 "The model is overloaded"** → Returned by the provider (e.g. Gemini). The proxy retries and then follows fallbacks. Retrying the request later usually works.
+**503 "The model is overloaded"** → Returned by the provider (e.g. Gemini). The proxy retries and then follows fallbacks. Clients receive a short error message only (no traceback). Retrying the request later usually works.
 
-**Context length exceeded (e.g. 139k > 128k)** → The config uses a large-context model group (`my-large-context`
+**Context length exceeded (e.g. 139k > 128k)** → The config uses a large-context model group (`my-large-context`) as the first fallback when context is exceeded. Ensure `OPENROUTER_API_KEY` is set so that fallback can be used. If the request is very long (e.g. 140k+ tokens including tool input), consider trimming history or using a client that supports prompt compression so fallbacks have a chance to succeed.
 
 **Hugging Face: "property 'prefix' is unsupported"** → Some clients (e.g. agent mode) send assistant messages with a `prefix` field. The proxy strips that field from every message in a pre-call hook before any provider sees it, so all providers (including HF on fallback) receive messages without `prefix`. If you still see this error, ensure the trim-messages hook is loaded in the config (`callbacks`) and the proxy was restarted after config changes.
 
 **Hugging Face: "Credit balance is depleted"** → Your HF Inference credits are used up. Add pre-paid credits or subscribe to PRO; until then, `my-hf-models` fallbacks will fail and the proxy will try other groups (e.g. `my-large-context`, `my-free-models`).
 
-**Request fails after many fallbacks / "tokens lost"** → If the primary model and every fallback fail (context too long, HF credits depleted, Gemini 503, etc.), the request returns an error and no completion is streamed. To avoid wasting tokens: (1) keep conversations or tool payloads within context limits where possible, (2) ensure at least one fallback provider (OpenRouter, HF, or Gemini) has quota and supports your message format, (3) for peak-overload errors (e.g. Gemini 503), retry later.
+**Request fails after many fallbacks / "tokens lost"** → If the primary model and every fallback fail (context too long, HF credits depleted, Gemini 503, etc.), the request returns an error and no completion is streamed. Clients see a plain error message only. To avoid wasting tokens: (1) keep conversations or tool payloads within context limits where possible, (2) ensure at least one fallback provider (OpenRouter, HF, or Gemini) has quota and supports your message format, (3) for peak-overload errors (e.g. Gemini 503), retry later.
+
+**Client cancels or disconnects** → If the client closes the connection or cancels the request, the proxy may still continue calling the provider until the request finishes. That can use extra tokens until LiteLLM supports cancelling upstream on disconnect.
 
 **Agent mode (tool calls)** → The proxy strips `prefix` from every message and trims input only when the request has no `tool_calls` or tool-role messages, so agent flows keep valid tool sequences. If agent mode still fails, check context length (set `LITELLM_TRIM_MAX_INPUT_TOKENS` or keep tool payloads smaller), cooldown ("No deployments available"), and that at least one fallback has quota.
 
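The agent-mode behaviour described in the README (trim input only when the request has no `tool_calls` or tool-role messages) can be sketched as a standalone predicate. This is a hypothetical illustration, not the actual hook code; the real check lives in `trim_messages_hook.py` and may differ in detail:

```python
from typing import Any


def should_trim(messages: list[dict[str, Any]]) -> bool:
    """Return True only when the request carries no tool traffic,
    so agent flows keep their tool-call/tool-result sequences intact."""
    for msg in messages:
        if msg.get("role") == "tool":  # a tool-result message
            return False
        if msg.get("tool_calls"):  # an assistant message invoking tools
            return False
    return True
```

A plain chat history is eligible for trimming; any message with a `tool` role or a non-empty `tool_calls` list disables it for the whole request.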
litellm-config-auto.yaml
CHANGED
@@ -5,7 +5,7 @@ litellm_settings:
   drop_params: True
   modify_params: True
   additional_drop_params: ["messages[*].prefix"]
-  callbacks: ["trim_messages_hook.proxy_handler_instance"]
+  callbacks: ["trim_messages_hook.proxy_handler_instance", "trim_messages_hook.error_handler_instance"]
   set_verbose: False
   request_timeout: 300
 
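Each `callbacks` entry names a module-level instance as a `module.attribute` string. As an illustration of how such a spec can be resolved (an assumption about the mechanism, not LiteLLM's actual loader code), the general pattern is:

```python
import importlib
from typing import Any


def resolve_callback(spec: str) -> Any:
    """Resolve a 'module.attribute' callback spec: import the module,
    then fetch the named attribute (e.g. a handler instance)."""
    module_name, attr = spec.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, attr)
```

This is why the hook module must be importable from the proxy's working directory and why the instance names in the YAML must match the module-level variables exactly.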
trim_messages_hook.py
CHANGED
@@ -1,6 +1,6 @@
 """
 Pre-call hook that trims messages to stay under a token limit.
-
+Post-call failure hook that returns a clean error message to clients.
 """
 from __future__ import annotations
 
@@ -8,6 +8,7 @@ import logging
 import os
 from typing import TYPE_CHECKING, Any, Literal, Optional
 
+from fastapi import HTTPException
 from litellm.integrations.custom_logger import CustomLogger
 
 if TYPE_CHECKING:
@@ -82,4 +83,33 @@ class TrimMessagesHandler(CustomLogger):
         return data
 
 
+class ErrorHandler(CustomLogger):
+    """Returns a clean error message to clients; logs the full exception server-side only."""
+
+    async def async_post_call_failure_hook(
+        self,
+        request_data: dict[str, Any],
+        original_exception: Exception,
+        user_api_key_dict: Any,
+        traceback_str: Optional[str] = None,
+    ) -> Optional[HTTPException]:
+        _log = logging.getLogger(__name__)
+        if traceback_str:
+            _log.error("LLM call failed: %s\n%s", original_exception, traceback_str)
+        else:
+            _log.error("LLM call failed: %s", original_exception, exc_info=True)
+        status_code = 500
+        message = "An error occurred while processing your request. Please retry."
+        response = getattr(original_exception, "response", None)
+        if response is not None:
+            if getattr(response, "status_code", None) == 503:
+                status_code = 503
+                message = "Service temporarily unavailable. Please retry."
+        elif "503" in str(original_exception):
+            status_code = 503
+            message = "Service temporarily unavailable. Please retry."
+        return HTTPException(status_code=status_code, detail=message)
+
+
 proxy_handler_instance = TrimMessagesHandler()
+error_handler_instance = ErrorHandler()