Upload folder using huggingface_hub
- README.md +3 -1
- trim_messages_hook.py +14 -8
README.md
CHANGED
@@ -49,13 +49,15 @@ If the env vars are set on the server and you send the master key in the Authori

 **503 "The model is overloaded"** - Returned by the provider (e.g. Gemini). The proxy retries and then follows fallbacks. Clients receive a short error message only (no traceback). Retrying the request later usually works.

+**429 Rate limit exceeded** - Returned when a provider (e.g. OpenRouter) throttles requests. The proxy retries and uses fallbacks. Clients receive a short rate-limit message only. If all fallbacks hit rate limits, retry after a short delay.
+
 **Context length exceeded (e.g. 139k > 128k)** - The config uses a large-context model group (`my-large-context`) as the first fallback when context is exceeded. Ensure `OPENROUTER_API_KEY` is set so that fallback can be used. If the request is very long (e.g. 140k+ tokens including tool input), consider trimming history or using a client that supports prompt compression so fallbacks have a chance to succeed.

 **Hugging Face: "property 'prefix' is unsupported"** - Some clients (e.g. agent mode) send assistant messages with a `prefix` field. The proxy strips that field from every message in a pre-call hook before any provider sees it, so all providers (including HF on fallback) receive messages without prefix. If you still see this error, ensure the trim-messages hook is loaded in config (callbacks) and the proxy was restarted after config changes.

 **Hugging Face: "Credit balance is depleted"** - Your HF Inference credits are used up. Add pre-paid credits or subscribe to PRO; until then, `my-hf-models` fallbacks will fail and the proxy will try other groups (e.g. `my-large-context`, `my-free-models`).

-**Request fails after many fallbacks / "tokens lost"** - If the primary model and every fallback fail (context too long, HF credits depleted, Gemini 503, etc.), the request returns an error and no completion is streamed. Clients see a plain error message only. To avoid wasting tokens: (1) keep conversations or tool payloads within context limits where possible, (2) ensure at least one fallback provider (OpenRouter, HF, or Gemini) has quota and supports your message format, (3) retry during peak overload (e.g. Gemini 503).
+**Request fails after many fallbacks / "tokens lost"** - If the primary model and every fallback fail (context too long, HF credits depleted, Gemini 503, rate limits, etc.), the request returns an error and no completion is streamed. Clients see a plain error message only. Server logs may still show full tracebacks for debugging, especially when the failure happens during streaming. To avoid wasting tokens: (1) keep conversations or tool payloads within context limits where possible, (2) ensure at least one fallback provider (OpenRouter, HF, or Gemini) has quota and supports your message format, (3) retry during peak overload (e.g. Gemini 503) or after rate limits reset.

 **Client cancels or disconnects** - If the client closes the connection or cancels the request, the proxy may still continue calling the provider until the request finishes. That can use extra tokens until LiteLLM supports cancelling upstream on disconnect.

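The README advice to retry after a 429 or 503 can be followed on the client side with exponential backoff. The sketch below is only an illustration: `call_with_backoff`, `RETRYABLE_STATUSES`, and the `send` callable (which returns a status code and a body) are hypothetical names, not part of the proxy or of LiteLLM.

```python
import time
from typing import Callable, Tuple

# Statuses the README describes as transient (rate limit, provider overload).
RETRYABLE_STATUSES = {429, 503}


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-retryable status or attempts run out.

    `send` is a hypothetical callable that performs one request against the
    proxy and returns (status_code, body). Delays double after each retry.
    """
    status, body = send()
    for attempt in range(1, max_attempts):
        if status not in RETRYABLE_STATUSES:
            break
        # Wait base_delay, 2 * base_delay, 4 * base_delay, ... between tries.
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = send()
    return status, body
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting. A real client would probably also honour a `Retry-After` header if the provider sends one.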
trim_messages_hook.py
CHANGED
@@ -9,6 +9,7 @@ import os
 from typing import TYPE_CHECKING, Any, Literal, Optional

 from fastapi import HTTPException
+from litellm import RateLimitError
 from litellm.integrations.custom_logger import CustomLogger

 if TYPE_CHECKING:
@@ -98,16 +99,21 @@ class ErrorHandler(CustomLogger):
             _log.error("LLM call failed: %s\n%s", original_exception, traceback_str)
         else:
             _log.error("LLM call failed: %s", original_exception, exc_info=True)
-        status_code =
-
-
-
-
-
-
-        elif "503" in str(original_exception):
+        status_code = getattr(original_exception, "status_code", None)
+        if status_code is not None and 400 <= status_code < 600:
+            pass
+        else:
+            status_code = 500
+        resp = getattr(original_exception, "response", None)
+        if resp is not None and getattr(resp, "status_code", None) == 503:
             status_code = 503
+        if status_code == 503:
             message = "Service temporarily unavailable. Please retry."
+        elif status_code == 429 or isinstance(original_exception, RateLimitError):
+            status_code = 429
+            message = "Rate limit exceeded. Please retry shortly."
+        else:
+            message = "An error occurred while processing your request. Please retry."
         return HTTPException(status_code=status_code, detail=message)

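The status mapping added in this hunk can be exercised in isolation, without FastAPI or LiteLLM installed. The following is a minimal sketch under stated assumptions: `map_exception` and `FakeRateLimitError` are hypothetical names (the latter stands in for `litellm.RateLimitError`), the function returns a `(status_code, message)` tuple instead of an `HTTPException`, and it reads the response-level 503 check as running after the range check, which is one plausible reading of the branch order.

```python
from typing import Optional, Tuple


class FakeRateLimitError(Exception):
    """Hypothetical stand-in for litellm.RateLimitError (assumption)."""

    status_code = 429


def map_exception(exc: Exception) -> Tuple[int, str]:
    """Map an exception to a (status_code, short client message) pair."""
    status_code: Optional[int] = getattr(exc, "status_code", None)
    # Keep only integer HTTP error codes; anything else becomes 500.
    # The isinstance guard is an extra safety check not present in the diff.
    if not (isinstance(status_code, int) and 400 <= status_code < 600):
        status_code = 500
    # Some exceptions carry the provider's HTTP response; prefer its 503.
    resp = getattr(exc, "response", None)
    if getattr(resp, "status_code", None) == 503:
        status_code = 503
    if status_code == 503:
        return 503, "Service temporarily unavailable. Please retry."
    if status_code == 429 or isinstance(exc, FakeRateLimitError):
        return 429, "Rate limit exceeded. Please retry shortly."
    return status_code, "An error occurred while processing your request. Please retry."
```

Reading `status_code` from the exception rather than searching for `"503"` in its string form avoids false matches on messages that merely contain those digits, which appears to be the motivation for the change.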