Strip MiniCPM <think>...</think> reasoning tags from generated text 5f5ab47 verified unity4ar committed 17 days ago
Pin transformers<5.0 so KV cache works with MiniCPM bundled code; re-enable use_cache 148a99f verified unity4ar committed 17 days ago
Revert to use_cache=False (eager attn also broken); hide all audio UI df99cf8 verified unity4ar committed 17 days ago
Speed up: eager attn + KV cache; drop chat retries to 1; remove MiniCPM-o voice UI artifact 3326f32 verified unity4ar committed 17 days ago
Cap zerogpu max_new_tokens at 256 (use_cache=False makes long generations O(n^2)) c788ee1 verified unity4ar committed 17 days ago
Disable KV cache: openbmb modeling_minicpm.py has a cache_utils API drift bug e06599a verified unity4ar committed 17 days ago
Move .to('cuda') inside @spaces.GPU; background thread keeps model on CPU to avoid emulation bypass 5a8fab5 verified unity4ar committed 17 days ago
Shim is_torch_fx_available so MiniCPM trust_remote_code import works on transformers >= 5.0 90e360d verified unity4ar committed 17 days ago
Use canonical .to('cuda') pattern + progress logs so container log shows what loader is doing 031ce2d verified unity4ar committed 17 days ago
Load model in background thread so health/status endpoints don't block on 16GB download fbd952d verified unity4ar committed 17 days ago
Load model on cuda at module level (canonical ZeroGPU pattern) a475083 verified unity4ar committed 17 days ago
Refactor: Docker+llama.cpp -> Gradio SDK + ZeroGPU transformers backend 7036a02 verified unity4ar committed 17 days ago
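
The most recent commit in this list (5f5ab47) strips <think>...</think> reasoning tags from generated text. The log does not show how the repository does this, so the following is only a minimal sketch of one way such post-processing could look. The helper name `strip_think_tags` and both regex patterns are assumptions made for illustration, not the author's code.

```python
import re

# Assumed tag format: a literal <think> ... </think> pair, possibly spanning lines.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL | re.IGNORECASE)
# Assumed fallback: generation was cut off before the closing tag was emitted.
_UNCLOSED_THINK = re.compile(r"<think>.*\Z", flags=re.DOTALL | re.IGNORECASE)


def strip_think_tags(text: str) -> str:
    """Hypothetical helper: return `text` without <think> reasoning blocks."""
    # Remove every complete block (non-greedy, so separate blocks stay separate).
    cleaned = _THINK_BLOCK.sub("", text)
    # Drop a trailing unclosed block, e.g. when max_new_tokens was reached mid-reasoning.
    cleaned = _UNCLOSED_THINK.sub("", cleaned)
    return cleaned.strip()
```

Handling the unclosed case separately seems worthwhile here because the same history caps max_new_tokens at 256, which makes truncated reasoning blocks plausible.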