Commit History

Update main.py
f0edfa1
verified

Rofati committed on

Update main.py
ce81f75
verified

Rofati committed on

Update Dockerfile
02232d6
verified

Rofati committed on

Update main.py
c8a9818
verified

Rofati committed on

Update main.py
ce782c7
verified

Rofati committed on

Update main.py
53f615a
verified

Rofati committed on

Update main.py
468632b
verified

Rofati committed on

Update Dockerfile
59fa144
verified

Rofati committed on

Update Dockerfile
69714e0
verified

Rofati committed on

Update Dockerfile
4c822f8
verified

Rofati committed on

Update main.py
d3accc5
verified

Rofati committed on

Revert to Llama-3.2-1B Q8_0 (proven 10-24s responses), fastest working config
89f8315
verified

Rofati committed on

Switch to SmolLM2-360M Q8_0 (386MB = 3.4x smaller = 3x faster inference)
f2f20cb
verified

Rofati committed on

Use python:3.10-slim + compile llama-cpp 0.2.90 (proven working version)
bf941fe
verified

Rofati committed on

Revert main.py to exact original working code + n_batch=512 speedup
65eb6d3
verified

Rofati committed on

Full revert to original working configuration
34fa7cd
verified

Rofati committed on

main.py: use Q4_K_M path, optimized for OpenBLAS build
5214ad9
verified

Rofati committed on

Use python:3.11-slim + compile llama-cpp with OpenBLAS for CPU speed
a4a6838
verified

Rofati committed on

Revert main.py to working config with speed tweaks (n_batch=512, max_tokens=100, only last 2 msgs)
3300482
verified

Rofati committed on

Revert to working config: Q8_0 (was working before), pre-download for fast startup
d01a9fa
verified

Rofati committed on

Ultra-minimal: max_tokens=40, n_ctx=256 for fastest possible inference
13ac85e
verified

Rofati committed on

Force restart
51742f2
verified

Rofati committed on

Switch to ghcr.io/abetlen/llama-cpp-python (has CPU optimizations) + Q4_K_M + pre-download
27585df
verified

Rofati committed on

Add non-streaming /api/chat/sync endpoint + keep SSE with heartbeat
63473f8
verified

Rofati committed on

Fix: send immediate SSE heartbeat to prevent proxy timeout
d0cd401
verified

Rofati committed on

main.py: Llama-3.2-1B Q4_K_M, optimized for speed (n_batch=512, max_tokens=100, streaming)
6be260f
verified

Rofati committed on

Revert to Llama-3.2-1B Q4_K_M (proven working, no thinking issues) with speed optimizations
5d02182
verified

Rofati committed on

Fix: pre-fill think block in prompt so model starts answering immediately
add811b
verified

Rofati committed on

Fix timeout: max_tokens=80, ensure response completes within proxy timeout
7bb0a62
verified

Rofati committed on

Fix: handle Qwen3 think mode (skip think tokens, emit only real content)
5376cea
verified

Rofati committed on

Fix: remove think block, use direct prompt without thinking mode
ffefba6
verified

Rofati committed on

Optimized main.py: Qwen3-0.6B, /no_think, n_batch=512, max_tokens=100
6ed7384
verified

Rofati committed on

Speed overhaul: Qwen3-0.6B Q4_K_M (397MB, 3x faster), pre-built wheel, optimized config
b76d2ed
verified

Rofati committed on

Update Dockerfile
2a245d0
verified

Rofati committed on

Update Dockerfile
c5f20a4
verified

Rofati committed on

Update main.py
86cfd84
verified

Rofati committed on

Update main.py
c613ab4
verified

Rofati committed on

Update main.py
010322d
verified

Rofati committed on

Update main.py
dee3ef1
verified

Rofati committed on

Update Dockerfile
1bfe300
verified

Rofati committed on

Update main.py
2f3ca05
verified

Rofati committed on

Update Dockerfile
5cd3cb5
verified

Rofati committed on

Update main.py
e9fc6d5
verified

Rofati committed on

Update main.py
fbcf15c
verified

Rofati committed on

Update main.py
5968bfd
verified

Rofati committed on

Update main.py
bf8daad
verified

Rofati committed on

Update main.py
d64a731
verified

Rofati committed on

Update main.py
49782c6
verified

Rofati committed on

Update main.py
317c1b6
verified

Rofati committed on

Create main.py
4e80d7e
verified

Rofati committed on