Upload folder using huggingface_hub
- .python-version +1 -0
- README.md +5 -3
- litellm-config-auto.yaml +1 -1
- trim_messages_hook.py +31 -1
.python-version
ADDED
@@ -0,0 +1 @@
+3.9
README.md
CHANGED
@@ -47,15 +47,17 @@ If the env vars are set on the server and you send the master key in the Authori
 
 **Hugging Face: "does not have sufficient permissions to call Inference Providers"** → Your `HF_TOKEN` must be allowed to use the Inference API (serverless). Create or edit the token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) and enable the scope that allows Inference / Inference Providers. Without it, `my-hf-models` fallbacks will fail; the proxy will still use other model groups.
 
-**503 "The model is overloaded"** → Returned by the provider (e.g. Gemini). The proxy retries and then follows fallbacks. Retrying the request later usually works.
+**503 "The model is overloaded"** → Returned by the provider (e.g. Gemini). The proxy retries and then follows fallbacks. Clients receive a short error message only (no traceback). Retrying the request later usually works.
 
-**Context length exceeded (e.g. 139k > 128k)** → The config uses a large-context model group (`my-large-context`
+**Context length exceeded (e.g. 139k > 128k)** → The config uses a large-context model group (`my-large-context`) as the first fallback when context is exceeded. Ensure `OPENROUTER_API_KEY` is set so that fallback can be used. If the request is very long (e.g. 140k+ tokens including tool input), consider trimming history or using a client that supports prompt compression so fallbacks have a chance to succeed.
 
 **Hugging Face: "property 'prefix' is unsupported"** → Some clients (e.g. agent mode) send assistant messages with a `prefix` field. The proxy strips that field from every message in a pre-call hook before any provider sees it, so all providers (including HF on fallback) receive messages without `prefix`. If you still see this error, ensure the trim-messages hook is loaded in the config (`callbacks`) and the proxy was restarted after config changes.
 
 **Hugging Face: "Credit balance is depleted"** → Your HF Inference credits are used up. Add pre-paid credits or subscribe to PRO; until then, `my-hf-models` fallbacks will fail and the proxy will try other groups (e.g. `my-large-context`, `my-free-models`).
 
-**Request fails after many fallbacks / "tokens lost"** → If the primary model and every fallback fail (context too long, HF credits depleted, Gemini 503, etc.), the request returns an error and no completion is streamed. To avoid wasting tokens: (1) keep conversations or tool payloads within context limits where possible, (2) ensure at least one fallback provider (OpenRouter, HF, or Gemini) has quota and supports your message format, (3) for peak-overload errors (e.g. Gemini 503), retry later.
+**Request fails after many fallbacks / "tokens lost"** → If the primary model and every fallback fail (context too long, HF credits depleted, Gemini 503, etc.), the request returns an error and no completion is streamed. Clients see a plain error message only. To avoid wasting tokens: (1) keep conversations or tool payloads within context limits where possible, (2) ensure at least one fallback provider (OpenRouter, HF, or Gemini) has quota and supports your message format, (3) for peak-overload errors (e.g. Gemini 503), retry later.
+
+**Client cancels or disconnects** → If the client closes the connection or cancels the request, the proxy may still continue calling the provider until the request finishes. That can use extra tokens until LiteLLM supports cancelling upstream on disconnect.
 
 **Agent mode (tool calls)** → The proxy strips `prefix` from every message and trims input only when the request has no `tool_calls` or tool-role messages, so agent flows keep valid tool sequences. If agent mode still fails, check context length (set `LITELLM_TRIM_MAX_INPUT_TOKENS` or keep tool payloads smaller), cooldown ("No deployments available"), and that at least one fallback has quota.
 
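The agent-mode behaviour described in the README (trim input only when the request has no `tool_calls` or tool-role messages) can be sketched as a standalone predicate. This is a hypothetical illustration, not the actual hook code; the real check lives in `trim_messages_hook.py` and may differ in detail:

```python
from typing import Any


def should_trim(messages: list[dict[str, Any]]) -> bool:
    """Return True only when the request carries no tool traffic,
    so agent flows keep their tool-call/tool-result sequences intact."""
    for msg in messages:
        if msg.get("role") == "tool":  # a tool-result message
            return False
        if msg.get("tool_calls"):  # an assistant message invoking tools
            return False
    return True
```

A plain chat history is eligible for trimming; any message with a `tool` role or a non-empty `tool_calls` list disables it for the whole request.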
litellm-config-auto.yaml
CHANGED
@@ -5,7 +5,7 @@ litellm_settings:
   drop_params: True
   modify_params: True
   additional_drop_params: ["messages[*].prefix"]
-  callbacks: ["trim_messages_hook.proxy_handler_instance"]
+  callbacks: ["trim_messages_hook.proxy_handler_instance", "trim_messages_hook.error_handler_instance"]
   set_verbose: False
   request_timeout: 300
 
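Each `callbacks` entry names a module-level instance as a `module.attribute` string. As an illustration of how such a spec can be resolved (an assumption about the mechanism, not LiteLLM's actual loader code), the general pattern is:

```python
import importlib
from typing import Any


def resolve_callback(spec: str) -> Any:
    """Resolve a 'module.attribute' callback spec: import the module,
    then fetch the named attribute (e.g. a handler instance)."""
    module_name, attr = spec.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, attr)
```

This is why the hook module must be importable from the proxy's working directory and why the instance names in the YAML must match the module-level variables exactly.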
trim_messages_hook.py
CHANGED
@@ -1,6 +1,6 @@
 """
 Pre-call hook that trims messages to stay under a token limit.
-
+Post-call failure hook that returns a clean error message to clients.
 """
 from __future__ import annotations
 
@@ -8,6 +8,7 @@ import logging
 import os
 from typing import TYPE_CHECKING, Any, Literal, Optional
 
+from fastapi import HTTPException
 from litellm.integrations.custom_logger import CustomLogger
 
 if TYPE_CHECKING:
@@ -82,4 +83,33 @@ class TrimMessagesHandler(CustomLogger):
         return data
 
 
+class ErrorHandler(CustomLogger):
+    """Returns a clean error message to clients; logs the full exception server-side only."""
+
+    async def async_post_call_failure_hook(
+        self,
+        request_data: dict[str, Any],
+        original_exception: Exception,
+        user_api_key_dict: Any,
+        traceback_str: Optional[str] = None,
+    ) -> Optional[HTTPException]:
+        _log = logging.getLogger(__name__)
+        if traceback_str:
+            _log.error("LLM call failed: %s\n%s", original_exception, traceback_str)
+        else:
+            _log.error("LLM call failed: %s", original_exception, exc_info=True)
+        status_code = 500
+        message = "An error occurred while processing your request. Please retry."
+        response = getattr(original_exception, "response", None)
+        if response is not None:
+            if getattr(response, "status_code", None) == 503:
+                status_code = 503
+                message = "Service temporarily unavailable. Please retry."
+        elif "503" in str(original_exception):
+            status_code = 503
+            message = "Service temporarily unavailable. Please retry."
+        return HTTPException(status_code=status_code, detail=message)
+
+
 proxy_handler_instance = TrimMessagesHandler()
+error_handler_instance = ErrorHandler()