raheem786 committed
Commit c7d2b40 · verified · Parent(s): 5e8a4fd

Upload folder using huggingface_hub
.python-version ADDED
@@ -0,0 +1 @@
+3.9
README.md CHANGED
@@ -47,15 +47,17 @@ If the env vars are set on the server and you send the master key in the Authori
 
 **Hugging Face: "does not have sufficient permissions to call Inference Providers"** – Your `HF_TOKEN` must be allowed to use the Inference API (serverless). Create or edit the token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) and enable the scope that allows Inference / Inference Providers. Without it, `my-hf-models` fallbacks will fail; the proxy will still use other model groups.
 
-**503 "The model is overloaded"** – Returned by the provider (e.g. Gemini). The proxy retries and then follows fallbacks. Retrying the request later usually works.
+**503 "The model is overloaded"** – Returned by the provider (e.g. Gemini). The proxy retries and then follows fallbacks. Clients receive a short error message only (no traceback). Retrying the request later usually works.
 
-**Context length exceeded (e.g. 139k > 128k)** – The config uses a large-context model group (`my-large-context`, 1.8M tokens) as the first fallback when context is exceeded. Ensure `OPENROUTER_API_KEY` is set so that fallback can be used. If the request is very long (e.g. 140k+ tokens including tool input), consider trimming history or using a client that supports prompt compression so fallbacks have a chance to succeed.
+**Context length exceeded (e.g. 139k > 128k)** – The config uses a large-context model group (`my-large-context`) as the first fallback when context is exceeded. Ensure `OPENROUTER_API_KEY` is set so that fallback can be used. If the request is very long (e.g. 140k+ tokens including tool input), consider trimming history or using a client that supports prompt compression so fallbacks have a chance to succeed.
 
 **Hugging Face: "property 'prefix' is unsupported"** – Some clients (e.g. agent mode) send assistant messages with a `prefix` field. The proxy strips that field from every message in a pre-call hook before any provider sees it, so all providers (including HF on fallback) receive messages without prefix. If you still see this error, ensure the trim-messages hook is loaded in config (callbacks) and the proxy was restarted after config changes.
 
 **Hugging Face: "Credit balance is depleted"** – Your HF Inference credits are used up. Add pre-paid credits or subscribe to PRO; until then, `my-hf-models` fallbacks will fail and the proxy will try other groups (e.g. `my-large-context`, `my-free-models`).
 
-**Request fails after many fallbacks / "tokens lost"** – If the primary model and every fallback fail (context too long, HF credits depleted, Gemini 503, etc.), the request returns an error and no completion is streamed. To avoid wasting tokens: (1) keep conversations or tool payloads within context limits where possible, (2) ensure at least one fallback provider (OpenRouter, HF, or Gemini) has quota and supports your message format, (3) retry during peak overload (e.g. Gemini 503).
+**Request fails after many fallbacks / "tokens lost"** – If the primary model and every fallback fail (context too long, HF credits depleted, Gemini 503, etc.), the request returns an error and no completion is streamed. Clients see a plain error message only. To avoid wasting tokens: (1) keep conversations or tool payloads within context limits where possible, (2) ensure at least one fallback provider (OpenRouter, HF, or Gemini) has quota and supports your message format, (3) during peak overload (e.g. Gemini 503), retry later.
+
+**Client cancels or disconnects** – If the client closes the connection or cancels the request, the proxy may still continue calling the provider until the request finishes. That can use extra tokens until LiteLLM supports cancelling upstream on disconnect.
 
 **Agent mode (tool calls)** – The proxy strips prefix from every message and trims input only when the request has no tool_calls or tool-role messages, so agent flows keep valid tool sequences. If agent mode still fails, check context length (set LITELLM_TRIM_MAX_INPUT_TOKENS or keep tool payloads smaller), cooldown (No deployments available), and that at least one fallback has quota.
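The README's advice to trim history when the context limit is exceeded can be sketched as a newest-first budget loop. This is a hypothetical illustration (the `rough_tokens` estimate and `trim_history` helper are invented for the example, not the proxy's actual trimmer):

```python
from typing import Any


def rough_tokens(message: dict[str, Any]) -> int:
    # Crude estimate: roughly 4 characters per token of message content.
    return max(1, len(str(message.get("content", ""))) // 4)


def trim_history(messages: list[dict[str, Any]], max_tokens: int) -> list[dict[str, Any]]:
    """Keep a leading system message (if any) plus the newest messages under budget."""
    head = messages[:1] if messages and messages[0].get("role") == "system" else []
    tail = messages[len(head):]
    budget = max_tokens - sum(rough_tokens(m) for m in head)
    kept: list[dict[str, Any]] = []
    for m in reversed(tail):  # walk newest-first, keep while budget remains
        cost = rough_tokens(m)
        if cost > budget:
            break
        kept.append(m)
        budget -= cost
    return head + list(reversed(kept))


msgs = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "x" * 400},  # ~100 "tokens", too big for the budget
    {"role": "user", "content": "y" * 40},   # ~10 "tokens", fits
]
print(len(trim_history(msgs, 20)))  # -> 2 (system message + the last short message)
```

Dropping whole old messages like this keeps tool-call sequences intact more reliably than truncating message bodies mid-string.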
litellm-config-auto.yaml CHANGED
@@ -5,7 +5,7 @@ litellm_settings:
   drop_params: True
   modify_params: True
   additional_drop_params: ["messages[*].prefix"]
-  callbacks: ["trim_messages_hook.proxy_handler_instance"]
+  callbacks: ["trim_messages_hook.proxy_handler_instance", "trim_messages_hook.error_handler_instance"]
   set_verbose: False
   request_timeout: 300
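Each callback entry is a dotted `module.attribute` string naming an instance exported by the hook module. Resolving such a string boils down to an import plus `getattr`; the sketch below shows the general pattern (the `resolve_callback` helper is hypothetical, not LiteLLM's actual loader):

```python
import importlib


def resolve_callback(spec: str):
    """Split 'module.attribute' and return that attribute from the imported module."""
    module_name, _, attr = spec.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


# e.g. resolve_callback("trim_messages_hook.error_handler_instance") would import
# trim_messages_hook.py and return its ErrorHandler instance.
print(resolve_callback("math.pi"))  # stdlib demo -> 3.141592653589793
```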
trim_messages_hook.py CHANGED
@@ -1,6 +1,6 @@
 """
 Pre-call hook that trims messages to stay under a token limit.
-Reduces token usage and avoids context-length errors / failed fallbacks.
+Post-call failure hook that returns a clean error message to clients.
 """
 from __future__ import annotations
 
@@ -8,6 +8,7 @@ import logging
 import os
 from typing import TYPE_CHECKING, Any, Literal, Optional
 
+from fastapi import HTTPException
 from litellm.integrations.custom_logger import CustomLogger
 
 if TYPE_CHECKING:
@@ -82,4 +83,33 @@ class TrimMessagesHandler(CustomLogger):
         return data
 
 
+class ErrorHandler(CustomLogger):
+    """Returns a clean error message to clients; logs full exception server-side only."""
+
+    async def async_post_call_failure_hook(
+        self,
+        request_data: dict[str, Any],
+        original_exception: Exception,
+        user_api_key_dict: Any,
+        traceback_str: Optional[str] = None,
+    ) -> Optional[HTTPException]:
+        _log = logging.getLogger(__name__)
+        if traceback_str:
+            _log.error("LLM call failed: %s\n%s", original_exception, traceback_str)
+        else:
+            _log.error("LLM call failed: %s", original_exception, exc_info=True)
+        status_code = 500
+        message = "An error occurred while processing your request. Please retry."
+        if getattr(original_exception, "response", None) is not None:
+            resp = original_exception.response
+            if getattr(resp, "status_code", None) == 503:
+                status_code = 503
+                message = "Service temporarily unavailable. Please retry."
+        elif "503" in str(original_exception):
+            status_code = 503
+            message = "Service temporarily unavailable. Please retry."
+        return HTTPException(status_code=status_code, detail=message)
+
+
 proxy_handler_instance = TrimMessagesHandler()
+error_handler_instance = ErrorHandler()
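The status mapping inside `async_post_call_failure_hook` can be exercised on its own. The sketch below mirrors the diff's logic in a plain function, with no LiteLLM or FastAPI dependency (`map_error` and `FakeResponse` are hypothetical names introduced for illustration):

```python
from typing import Tuple


class FakeResponse:
    """Minimal stand-in for a provider HTTP response carrying a status code."""
    def __init__(self, status_code: int) -> None:
        self.status_code = status_code


def map_error(exc: Exception) -> Tuple[int, str]:
    """Mirror the hook's mapping: provider 503 -> 503, anything else -> 500."""
    status_code = 500
    message = "An error occurred while processing your request. Please retry."
    resp = getattr(exc, "response", None)
    if resp is not None:
        if getattr(resp, "status_code", None) == 503:
            status_code = 503
            message = "Service temporarily unavailable. Please retry."
    elif "503" in str(exc):
        status_code = 503
        message = "Service temporarily unavailable. Please retry."
    return status_code, message


print(map_error(Exception("503 The model is overloaded"))[0])  # -> 503

err = RuntimeError("upstream failed")
err.response = FakeResponse(503)  # exception carrying a 503 response object
print(map_error(err)[0])  # -> 503
```

Note the precedence: an attached `response` object is checked first, and the string match on `"503"` is only a fallback for exceptions without one.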