raheem786 committed
Commit 177d781 · verified · Parent: c7d2b40

Upload folder using huggingface_hub

Files changed (2):
  1. README.md +3 -1
  2. trim_messages_hook.py +14 -8
README.md CHANGED
@@ -49,13 +49,15 @@ If the env vars are set on the server and you send the master key in the Authori
 
 **503 "The model is overloaded"** – Returned by the provider (e.g. Gemini). The proxy retries and then follows fallbacks. Clients receive a short error message only (no traceback). Retrying the request later usually works.
 
+**429 Rate limit exceeded** – Returned when a provider (e.g. OpenRouter) throttles requests. The proxy retries and uses fallbacks. Clients receive a short rate-limit message only. If all fallbacks hit rate limits, retry after a short delay.
+
 **Context length exceeded (e.g. 139k > 128k)** – The config uses a large-context model group (`my-large-context`) as the first fallback when context is exceeded. Ensure `OPENROUTER_API_KEY` is set so that fallback can be used. If the request is very long (e.g. 140k+ tokens including tool input), consider trimming history or using a client that supports prompt compression so fallbacks have a chance to succeed.
 
 **Hugging Face: "property 'prefix' is unsupported"** – Some clients (e.g. agent mode) send assistant messages with a `prefix` field. The proxy strips that field from every message in a pre-call hook before any provider sees it, so all providers (including HF on fallback) receive messages without prefix. If you still see this error, ensure the trim-messages hook is loaded in config (callbacks) and the proxy was restarted after config changes.
 
 **Hugging Face: "Credit balance is depleted"** – Your HF Inference credits are used up. Add pre-paid credits or subscribe to PRO; until then, `my-hf-models` fallbacks will fail and the proxy will try other groups (e.g. `my-large-context`, `my-free-models`).
 
-**Request fails after many fallbacks / "tokens lost"** – If the primary model and every fallback fail (context too long, HF credits depleted, Gemini 503, etc.), the request returns an error and no completion is streamed. Clients see a plain error message only. To avoid wasting tokens: (1) keep conversations or tool payloads within context limits where possible, (2) ensure at least one fallback provider (OpenRouter, HF, or Gemini) has quota and supports your message format, (3) retry during peak overload (e.g. Gemini 503).
+**Request fails after many fallbacks / "tokens lost"** – If the primary model and every fallback fail (context too long, HF credits depleted, Gemini 503, rate limits, etc.), the request returns an error and no completion is streamed. Clients see a plain error message only. Server logs may still show full tracebacks for debugging, especially when the failure happens during streaming. To avoid wasting tokens: (1) keep conversations or tool payloads within context limits where possible, (2) ensure at least one fallback provider (OpenRouter, HF, or Gemini) has quota and supports your message format, (3) retry during peak overload (e.g. Gemini 503) or after rate limits reset.
 
 **Client cancels or disconnects** – If the client closes the connection or cancels the request, the proxy may still continue calling the provider until the request finishes. That can use extra tokens until LiteLLM supports cancelling upstream on disconnect.
 
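The `prefix`-stripping behavior described in the README entry above can be sketched as a small message-sanitizing helper. This is an illustrative minimal version under stated assumptions, not the actual hook shipped in `trim_messages_hook.py`; only the `prefix` field name comes from the README.

```python
from typing import Any


def strip_prefix_fields(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return copies of the chat messages with any 'prefix' key removed.

    Some clients attach a 'prefix' key to assistant messages; providers such
    as Hugging Face reject it, so a pre-call hook drops it before dispatch.
    """
    return [{k: v for k, v in m.items() if k != "prefix"} for m in messages]


# The assistant message loses its 'prefix' key; other messages pass through.
msgs = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello", "prefix": True},
]
print(strip_prefix_fields(msgs))
```

Doing the strip in a pre-call hook (rather than per provider) means every fallback target, including HF, sees already-sanitized messages.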
trim_messages_hook.py CHANGED
@@ -9,6 +9,7 @@ import os
 from typing import TYPE_CHECKING, Any, Literal, Optional
 
 from fastapi import HTTPException
+from litellm import RateLimitError
 from litellm.integrations.custom_logger import CustomLogger
 
 if TYPE_CHECKING:
@@ -98,16 +99,21 @@ class ErrorHandler(CustomLogger):
             _log.error("LLM call failed: %s\n%s", original_exception, traceback_str)
         else:
             _log.error("LLM call failed: %s", original_exception, exc_info=True)
-        status_code = 500
-        message = "An error occurred while processing your request. Please retry."
-        if getattr(original_exception, "response", None) is not None:
-            resp = getattr(original_exception, "response")
-            if getattr(resp, "status_code", None) == 503:
-                status_code = 503
-                message = "Service temporarily unavailable. Please retry."
-        elif "503" in str(original_exception):
+        status_code = getattr(original_exception, "status_code", None)
+        if status_code is not None and 400 <= status_code < 600:
+            pass
+        else:
+            status_code = 500
+        resp = getattr(original_exception, "response", None)
+        if resp is not None and getattr(resp, "status_code", None) == 503:
             status_code = 503
+        if status_code == 503:
             message = "Service temporarily unavailable. Please retry."
+        elif status_code == 429 or isinstance(original_exception, RateLimitError):
+            status_code = 429
+            message = "Rate limit exceeded. Please retry shortly."
+        else:
+            message = "An error occurred while processing your request. Please retry."
         return HTTPException(status_code=status_code, detail=message)
 
 