Revert to Llama-3.2-1B Q8_0 (proven 10-24s responses), fastest working config 89f8315 verified Rofati committed on May 25
Switch to SmolLM2-360M Q8_0 (386MB = 3.4x smaller = 3x faster inference) f2f20cb verified Rofati committed on May 25
Use python:3.10-slim + compile llama-cpp 0.2.90 (proven working version) bf941fe verified Rofati committed on May 22
Revert main.py to exact original working code + n_batch=512 speedup 65eb6d3 verified Rofati committed on May 22
Use python:3.11-slim + compile llama-cpp with OpenBLAS for CPU speed a4a6838 verified Rofati committed on May 22
Revert main.py to working config with speed tweaks (n_batch=512, max_tokens=100, only last 2 msgs) 3300482 verified Rofati committed on May 22
Revert to working config: Q8_0 (was working before), pre-download for fast startup d01a9fa verified Rofati committed on May 22
Ultra-minimal: max_tokens=40, n_ctx=256 for fastest possible inference 13ac85e verified Rofati committed on May 22
Switch to ghcr.io/abetlen/llama-cpp-python (has CPU optimizations) + Q4_K_M + pre-download 27585df verified Rofati committed on May 22
Add non-streaming /api/chat/sync endpoint + keep SSE with heartbeat 63473f8 verified Rofati committed on May 22
Fix: send immediate SSE heartbeat to prevent proxy timeout d0cd401 verified Rofati committed on May 22
main.py: Llama-3.2-1B Q4_K_M, optimized for speed (n_batch=512, max_tokens=100, streaming) 6be260f verified Rofati committed on May 22
Revert to Llama-3.2-1B Q4_K_M (proven working, no thinking issues) with speed optimizations 5d02182 verified Rofati committed on May 22
Fix: pre-fill think block in prompt so model starts answering immediately add811b verified Rofati committed on May 22
Fix timeout: max_tokens=80, ensure response completes within proxy timeout 7bb0a62 verified Rofati committed on May 22
Fix: handle Qwen3 think mode (skip think tokens, emit only real content) 5376cea verified Rofati committed on May 22
Fix: remove think block, use direct prompt without thinking mode ffefba6 verified Rofati committed on May 22
Optimized main.py: Qwen3-0.6B, /no_think, n_batch=512, max_tokens=100 6ed7384 verified Rofati committed on May 22
Speed overhaul: Qwen3-0.6B Q4_K_M (397MB, 3x faster), pre-built wheel, optimized config b76d2ed verified Rofati committed on May 22