fix(vllm): reduce num_predict 512→350 to stay under max_model_len=2352 51e6b76 verified msradam committed 19 days ago
fix(mellea): restore vLLM probe with auth header + non-5xx check c92bd29 verified msradam committed 19 days ago
fix: vLLM probe with auth header + accept 401 as ready 7859ba2 verified msradam committed 19 days ago
fix: vLLM readiness probe + fast Ollama fallback timeout (app/mellea_validator.py) 4858f1e verified msradam committed 19 days ago
fix(mellea): raise first_token_timeout to 400s to match new 360s LiteLLM timeout d2b160d verified msradam committed 19 days ago
fix(mellea): set _timed_out on LiteLLM exception before first token to prevent concurrent retries d8b5a19 verified msradam committed 19 days ago
fix(mellea): fallback to best_paragraph when later attempts time out 384daac verified msradam committed 19 days ago
fix(mellea): two-phase timeout: 250s first token, 45s inter-token 9e0c117 verified msradam committed 20 days ago
fix(warmup): fire 1-token LLM ping at request start to warm RunPod GPU 3d8b58e verified msradam committed 20 days ago
fix(mellea): 120s cold-start timeout + drop run_reconcile fallback 57f5889 verified msradam committed 20 days ago
fix(mellea): abort retry loop after streaming hang to prevent concurrent vLLM requests + fix closure late-binding ae9cb16 verified msradam committed 20 days ago
fix(mellea): increase num_predict 350→512 to avoid citations_dense truncation 745d7fb verified msradam committed 20 days ago
fix(mellea): reduce num_predict to 350 for vLLM context headroom; reject empty paragraphs 1ddb69a verified msradam committed 20 days ago
fix(mellea): per-token timeout with queue-based streaming to prevent hang a3447d8 verified msradam committed 20 days ago