Restore max_new_tokens to 512 (4-bit gen is fast: ~25 tok/s on GPU) 6246295 vivekchakraverty committed on 1 day ago
Load Qwen2.5-Coder-7B in 4-bit (nf4) inside the GPU worker 2709f63 vivekchakraverty Claude Opus 4.8 committed on 1 day ago
ZeroGPU: load model on GPU inside @spaces.GPU (canonical), not at import cccb7d5 vivekchakraverty Claude Opus 4.8 committed on 1 day ago
ZeroGPU: raise GPU budget 120->180s, cap max_new_tokens 512->256 5fa56c1 vivekchakraverty committed on 1 day ago
ZeroGPU: force model.to(cuda) in fn (ignore stale is_available); no cuda at import 743e3d3 vivekchakraverty committed on 2 days ago
diag: log cuda availability + model device + gen timing; force model.to(cuda) in fn 5ff14e5 vivekchakraverty committed on 2 days ago
ZeroGPU: keep model GPU-resident (canonical pattern) 8df32ec vivekchakraverty Claude Opus 4.8 committed on 2 days ago
Load the LLM once at startup instead of per ZeroGPU call 043484b vivekchakraverty Claude Opus 4.8 committed on 2 days ago
GDScript RAG assistant: app + corpus (index added later via Colab) 777ea0e vivekchakraverty committed on 2 days ago
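
The figures quoted in the commit messages above can be sanity-checked with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical and not taken from the app, and it assumes that the ~25 tok/s rate from the newest commit and the 180 s GPU budget from commit 5fa56c1 both still apply. It ignores model load time and prompt processing, so real calls take longer than these estimates.

```python
def generation_seconds(max_new_tokens: int, tokens_per_second: float) -> float:
    """Estimate decode time for a full-length completion (hypothetical helper)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return max_new_tokens / tokens_per_second


def fits_budget(max_new_tokens: int, tokens_per_second: float,
                budget_seconds: float) -> bool:
    """True if the estimated decode time stays within the GPU time budget."""
    return generation_seconds(max_new_tokens, tokens_per_second) <= budget_seconds


if __name__ == "__main__":
    RATE = 25.0     # tok/s, the approximate figure quoted in commit 6246295
    BUDGET = 180.0  # seconds, the budget set in commit 5fa56c1 (assumed unchanged)
    for cap in (256, 512):
        secs = generation_seconds(cap, RATE)
        print(f"max_new_tokens={cap}: ~{secs:.1f}s, "
              f"fits budget: {fits_budget(cap, RATE, BUDGET)}")
```

Under these assumptions, decoding 512 tokens takes roughly 20 s, which is consistent with the cap being restored from 256 to 512 once 4-bit generation proved fast enough.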