build-small-hackathon / phantom-grid / llm

Commit History

Strip MiniCPM <think>...</think> reasoning tags from generated text

5f5ab47
verified

unity4ar committed 17 days ago

Pin transformers<5.0 so KV cache works with MiniCPM bundled code; re-enable use_cache

148a99f
verified

unity4ar committed 17 days ago

Revert to use_cache=False (eager attn also broken); hide all audio UI

df99cf8
verified

unity4ar committed 17 days ago

Speed up: eager attn + KV cache; drop chat retries to 1; remove MiniCPM-o voice UI artifact

3326f32
verified

unity4ar committed 17 days ago

Cap zerogpu max_new_tokens at 256 (use_cache=False makes long generations O(n^2))

c788ee1
verified

unity4ar committed 17 days ago

Disable KV cache: openbmb modeling_minicpm.py has a cache_utils API drift bug

e06599a
verified

unity4ar committed 17 days ago

Move .to('cuda') inside @spaces.GPU; background thread keeps model on CPU to avoid emulation bypass

5a8fab5
verified

unity4ar committed 17 days ago

Shim is_torch_fx_available so MiniCPM trust_remote_code import works on transformers >= 5.0

90e360d
verified

unity4ar committed 17 days ago

Use canonical .to('cuda') pattern + progress logs so container log shows what loader is doing

031ce2d
verified

unity4ar committed 17 days ago

Load model in background thread so health/status endpoints don't block on 16GB download

fbd952d
verified

unity4ar committed 17 days ago

Load model on cuda at module level (canonical ZeroGPU pattern)

a475083
verified

unity4ar committed 17 days ago

Refactor: Docker+llama.cpp -> Gradio SDK + ZeroGPU transformers backend

7036a02
verified

unity4ar committed 17 days ago

Ship Phantom Grid Docker Space

d2e6f94
verified

unity4ar committed 17 days ago
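
The most recent commit (5f5ab47) strips MiniCPM `<think>...</think>` reasoning tags from generated text. The commit diff is not shown here, so the following is only a minimal sketch of how such post-processing might look, not the Space's actual code. The function name `strip_think_tags` is hypothetical, and the handling of an unterminated `<think>` (for example when generation stops at the `max_new_tokens` cap) is an assumption.

```python
import re

# Matches a complete <think>...</think> block, including newlines inside it.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)

# Assumption: an unterminated <think> that runs to the end of the text is
# also reasoning residue (e.g. generation was cut off) and should be dropped.
_THINK_OPEN = re.compile(r"<think>.*\Z", re.DOTALL | re.IGNORECASE)


def strip_think_tags(text: str) -> str:
    """Return `text` with <think> reasoning spans removed (hypothetical helper)."""
    text = _THINK_BLOCK.sub("", text)  # remove every closed reasoning block
    text = _THINK_OPEN.sub("", text)   # remove a trailing unclosed block, if any
    return text.strip()
```

The non-greedy `.*?` keeps two separate reasoning blocks from being merged into one match, which would otherwise delete the visible text between them.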
