polats Claude Opus 4.8 (1M context) committed on
Commit
37982be
·
1 Parent(s): b408d69

Add BLS Mini-Code 1.0 (Cohere 30B MoE) coding sidecar


ZeroGPU sidecar serving CohereLabs/BLS-Mini-Code-1.0 via the same /generate
+ /generate_stream contract as Mellum2. Source under spaces/bls-code-zerogpu.
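
The sidecar README added in this commit describes `/generate_stream` as streaming cumulative text that the main app diffs into deltas. A minimal sketch of that diffing step follows. The helper name `cumulative_to_deltas` and the "re-emit the whole frame when it is not an extension" rule are assumptions for illustration; the app's real `_space_text_stream` is not shown in this diff.

```python
from typing import Iterable, Iterator


def cumulative_to_deltas(frames: Iterable[str]) -> Iterator[str]:
    """Turn cumulative text frames into incremental deltas (hypothetical helper)."""
    seen = ""
    for frame in frames:
        if frame.startswith(seen):
            # Normal case: the new frame extends what was already emitted.
            delta = frame[len(seen):]
        else:
            # Assumption: if the frame was rewritten rather than extended,
            # emit it whole instead of guessing at a partial diff.
            delta = frame
        seen = frame
        if delta:
            yield delta


frames = ["def", "def add(a, b):", "def add(a, b):\n    return a + b"]
deltas = list(cumulative_to_deltas(frames))
```

Joining the deltas reproduces the last cumulative frame, which is the property the streaming endpoint relies on.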

- No FP8 weight upstream (BF16 only), so the Space quantizes to 4-bit at load.
- cohere2_moe is new, so transformers is pulled from git.
- It's a reasoning model: the chat template force-opens <|START_RESPONSE|>
(non-reasoning), which made it ramble reasoning as prose. The Space instead
opens a <|START_THINKING|> block (discarded) and streams only the clean
<|START_RESPONSE|> code, stripping <|START_TEXT|>/<|END_TEXT|> markers, with a
thinking-token budget so requested max_tokens applies to the visible code.
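
The marker handling described in the list above can be sketched on plain strings, without loading the model. This is a simplified illustration, not the Space's code: the function names are hypothetical, the prompt is a placeholder rather than real chat-template output, and the default budget of 1024 mirrors the `TINY_BLS_THINK_BUDGET` default in this commit.

```python
START_THINK, END_THINK = "<|START_THINKING|>", "<|END_THINKING|>"
START_RESP, END_RESP = "<|START_RESPONSE|>", "<|END_RESPONSE|>"
TEXT_MARKS = ("<|START_TEXT|>", "<|END_TEXT|>")


def open_thinking(prompt: str) -> str:
    """Swap a trailing response opener for a thinking opener (hypothetical helper)."""
    trimmed = prompt.rstrip()
    if trimmed.endswith(START_RESP):
        return trimmed[: -len(START_RESP)] + START_THINK
    return prompt


def visible_code(raw: str) -> str:
    """Keep only the response section and drop the text markers."""
    start = raw.find(START_RESP)
    if start == -1:
        return ""  # still inside the thinking block: nothing to show yet
    body = raw[start + len(START_RESP):]
    end = body.find(END_RESP)
    if end != -1:
        body = body[:end]
    for mark in TEXT_MARKS:
        body = body.replace(mark, "")
    return body.strip()


def generation_budget(max_tokens: int, think_budget: int = 1024) -> int:
    """Total tokens to generate so the caller's max_tokens covers only visible code."""
    return max_tokens + think_budget


# Placeholder prompt; the real chat template output looks different.
prompt = "[rendered chat template]" + START_RESP
raw = (
    "plan the function" + END_THINK + START_RESP
    + "<|START_TEXT|>def add(a, b):\n    return a + b<|END_TEXT|>" + END_RESP
)
code = visible_code(raw)
```

Unlike the Space's `_extract_response`, this sketch has no fallback to the text after `<|END_THINKING|>` when no response block is opened.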

App side mirrors Mellum2: TINY_BLS_CODE_SPACE env var, _bls_code_stream +
_bls_code_stream_with_fallback (Nemotron NIM fallback when the sidecar is
asleep/over quota), bls-mini-code-zerogpu routing branch, and a codingModel.js
dropdown entry. Verified end-to-end through /text/generate/stream.
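
The fallback rule (switch to the backup only if nothing has been streamed yet) can be exercised with stub generators. This is a sketch of the pattern only: `Unavailable`, `stream_with_fallback`, and the three stub streams are hypothetical stand-ins for `llm.LlmUnavailable`, `_bls_code_stream_with_fallback`, the sidecar, and the NIM stream.

```python
from typing import Callable, Iterator


class Unavailable(RuntimeError):
    """Hypothetical stand-in for the app's llm.LlmUnavailable."""


def stream_with_fallback(
    primary: Callable[[], Iterator[str]],
    fallback: Callable[[], Iterator[str]],
    fallback_enabled: bool = True,
) -> Iterator[str]:
    emitted = False
    try:
        for chunk in primary():
            emitted = True
            yield chunk
    except Exception:
        # Once a chunk has reached the caller, switching models would splice
        # two different answers together, so re-raise instead.
        if emitted or not fallback_enabled:
            raise
        yield from fallback()


def asleep() -> Iterator[str]:
    raise Unavailable("sidecar asleep")
    yield  # unreachable; makes this function a generator


def dies_midway() -> Iterator[str]:
    yield "def f"
    raise Unavailable("quota exceeded")


def backup() -> Iterator[str]:
    yield "print('hi')"


clean = list(stream_with_fallback(asleep, backup))
```

With `asleep` as the primary, the caller sees only the backup's output. With `dies_midway`, the first chunk is delivered and the error then propagates.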

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

app.py CHANGED
@@ -272,6 +272,9 @@ MINICPM5_SPACE = os.environ.get("TINY_MINICPM5_SPACE", "").strip()
 # Coding model (Skill Forge): Mellum2 is a ZeroGPU sidecar (same /generate contract as
 # Aya); Nemotron-30B is too big to self-host, so it runs via hosted NVIDIA NIM (below).
 MELLUM_SPACE = os.environ.get("TINY_MELLUM_SPACE", "").strip()
+# BLS Mini-Code 1.0 (Cohere, 30B MoE): another ZeroGPU sidecar (same /generate contract).
+# The sidecar suppresses the model's reasoning and streams clean code; see spaces/bls-code-zerogpu.
+BLS_CODE_SPACE = os.environ.get("TINY_BLS_CODE_SPACE", "").strip()
 _local_tts = None  # VoiceDesign model
 _local_clone = None  # Base model (voice clone) — lazy, only if a clone is requested
 _local_tts_lock = threading.Lock()
@@ -578,6 +581,26 @@ def _mellum_stream_with_fallback(system, user, max_tokens, temperature):
         yield from _nim_text_stream(system, user, max_tokens, temperature)
 
 
+def _bls_code_stream(system, user, max_tokens, temperature):
+    yield from _space_text_stream(BLS_CODE_SPACE, system, user, max_tokens, temperature)
+
+
+def _bls_code_stream_with_fallback(system, user, max_tokens, temperature):
+    """BLS Mini-Code ZeroGPU sidecar, falling back to Nemotron (NVIDIA NIM) if the sidecar is
+    unavailable BEFORE any token streams (same constraint as Mellum2: can't switch mid-stream)."""
+    emitted = False
+    try:
+        if not BLS_CODE_SPACE:
+            raise llm.LlmUnavailable("TINY_BLS_CODE_SPACE not set")
+        for chunk in _bls_code_stream(system, user, max_tokens, temperature):
+            emitted = True
+            yield chunk
+    except Exception:  # noqa: BLE001
+        if emitted or not NIM_KEY:
+            raise
+        yield from _nim_text_stream(system, user, max_tokens, temperature)
+
+
 @fastapi_app.post("/voxcpm-tts")
 async def voxcpm_tts(request: Request):
     body = await request.json()
@@ -896,6 +919,14 @@ async def text_generate_stream(request: Request):
                     if stop.is_set():
                         break
                     loop.call_soon_threadsafe(q.put_nowait, ("delta", chunk))
+            elif model == "bls-mini-code-zerogpu":
+                # BLS Mini-Code sidecar, with Nemotron NIM as fallback if it's unavailable.
+                if not BLS_CODE_SPACE and not NIM_KEY:
+                    raise llm.LlmUnavailable("TINY_BLS_CODE_SPACE not set")
+                for chunk in _bls_code_stream_with_fallback(system, user, max_tokens, temperature):
+                    if stop.is_set():
+                        break
+                    loop.call_soon_threadsafe(q.put_nowait, ("delta", chunk))
             elif model == "nemotron-3-nano-30b-nim":
                 if not NIM_KEY:
                     raise llm.LlmUnavailable("NVIDIA_NIM_API_KEY not set")
spaces/bls-code-zerogpu/.gitignore ADDED
@@ -0,0 +1,2 @@
+__pycache__/
+*.pyc
spaces/bls-code-zerogpu/README.md ADDED
@@ -0,0 +1,40 @@
+---
+title: Tiny Army BLS Mini-Code ZeroGPU
+emoji: 🪖
+colorFrom: indigo
+colorTo: green
+sdk: gradio
+sdk_version: 6.15.2
+app_file: app.py
+pinned: false
+suggested_hardware: zero-a10g
+---
+
+# Tiny Army — BLS Mini-Code 1.0 (ZeroGPU coding sidecar)
+
+A ZeroGPU sidecar that serves [`CohereLabs/BLS-Mini-Code-1.0`](https://huggingface.co/CohereLabs/BLS-Mini-Code-1.0)
+(30B MoE coding model) to the Tiny Army app via the same Gradio API the Mellum2 / Tiny Aya
+sidecars expose.
+
+## API contract (consumed by the main app's `gradio_client`)
+
+- `POST /generate_stream` — args `(system, user, max_tokens:int, temperature:float)`, streams
+  **cumulative** decoded text (the app diffs successive frames into deltas).
+- `POST /generate` — same args, returns the final text in one shot.
+
+## Config (Space → Settings → Variables)
+
+| Var | Default | Notes |
+|-----|---------|-------|
+| `TINY_BLS_MODEL` | `CohereLabs/BLS-Mini-Code-1.0` | source repo |
+| `TINY_BLS_QUANT` | `4bit` | `4bit` (~18GB) / `8bit` (~32GB) / `bf16` (~60GB, tight) — no FP8 weight exists upstream, so we quantize at load |
+| `TINY_BLS_GPU_DURATION` | `120` | ZeroGPU seconds per call |
+
+> **Hardware:** set the Space to a ZeroGPU tier with enough VRAM. 30B at 4-bit fits an A10G/H200
+> ZeroGPU slice; `bf16`/`8bit` need the larger H200 slice. Adjust the `hardware:` field above to
+> the ZeroGPU flavor you provision.
+
+## Wiring into the main app (later step)
+
+Once this Space is live and the two endpoints respond, set `TINY_BLS_CODE_SPACE=<owner>/<space>`
+in the main app and add the routing branch + `web/codingModel.js` entry (mirrors Mellum2).
spaces/bls-code-zerogpu/app.py ADDED
@@ -0,0 +1,180 @@
+# Tiny Army — BLS Mini-Code 1.0 ZeroGPU coding sidecar.
+#
+# Exposes the SAME Gradio contract as the Mellum2 / Tiny Aya sidecars so the main app's
+# gradio_client can talk to it unchanged (see app.py:_space_text_stream / _space_text_generate):
+# /generate_stream(system, user, max_tokens:int, temperature:float) -> str # CUMULATIVE text, streamed
+# /generate(system, user, max_tokens:int, temperature:float) -> str # final text, one shot
+#
+# Model: CohereLabs/BLS-Mini-Code-1.0 — 30B MoE (cohere2_moe), BF16 only upstream (no FP8
+# weight published as of 2026-06), so we quantize AT LOAD via bitsandbytes to fit the ZeroGPU
+# H200 slice. TINY_BLS_QUANT selects 4bit (default, ~18GB) / 8bit (~32GB) / bf16 (~60GB, tight).
+#
+# REASONING: BLS-Mini-Code is a Cohere reasoning model. Its chat template, with
+# add_generation_prompt=True, force-opens <|START_RESPONSE|> (non-reasoning mode) — which makes
+# the model dump its reasoning as prose into the answer. Instead we open a <|START_THINKING|>
+# block so it reasons in a dedicated section we DISCARD, and we stream only the clean code from
+# <|START_RESPONSE|>…<|END_RESPONSE|>. TINY_BLS_THINK_BUDGET extra tokens are reserved for the
+# (discarded) thinking so the requested max_tokens still applies to the visible code.
+import os
+import threading
+
+import gradio as gr
+import spaces
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
+
+MODEL_ID = os.environ.get("TINY_BLS_MODEL", "CohereLabs/BLS-Mini-Code-1.0")
+QUANT = os.environ.get("TINY_BLS_QUANT", "4bit").strip().lower()
+GPU_DURATION = int(os.environ.get("TINY_BLS_GPU_DURATION", "120"))
+THINK_BUDGET = int(os.environ.get("TINY_BLS_THINK_BUDGET", "1024"))
+
+START_THINK, END_THINK = "<|START_THINKING|>", "<|END_THINKING|>"
+START_RESP, END_RESP = "<|START_RESPONSE|>", "<|END_RESPONSE|>"
+_STRIP = (START_THINK, END_THINK, START_RESP, END_RESP,
+          "<|START_TEXT|>", "<|END_TEXT|>", "<|END_OF_TURN_TOKEN|>")
+
+print(f"[bls-code] loading {MODEL_ID} quant={QUANT}", flush=True)
+
+_tok = AutoTokenizer.from_pretrained(MODEL_ID)
+
+
+def _load_kwargs():
+    kw = {"torch_dtype": torch.bfloat16, "device_map": "cuda"}
+    if QUANT == "bf16":
+        return kw
+    from transformers import BitsAndBytesConfig
+
+    if QUANT == "8bit":
+        kw["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
+    else:  # 4bit (default)
+        kw["quantization_config"] = BitsAndBytesConfig(
+            load_in_4bit=True,
+            bnb_4bit_quant_type="nf4",
+            bnb_4bit_compute_dtype=torch.bfloat16,
+            bnb_4bit_use_double_quant=True,
+        )
+    return kw
+
+
+_model = AutoModelForCausalLM.from_pretrained(MODEL_ID, **_load_kwargs())
+_model.eval()
+print("[bls-code] model ready", flush=True)
+
+
+def _build_inputs(system, user):
+    messages = []
+    if system and system.strip():
+        messages.append({"role": "system", "content": system.strip()})
+    messages.append({"role": "user", "content": (user or "").strip()})
+    text = _tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+    # The template force-opens <|START_RESPONSE|> (non-reasoning). Swap it for a thinking block
+    # so the model reasons where we can discard it, leaving clean code in the response section.
+    t = text.rstrip()
+    if t.endswith(START_RESP):
+        text = t[: -len(START_RESP)] + START_THINK
+    enc = _tok(text, return_tensors="pt", add_special_tokens=False)
+    return {k: v.to(_model.device) for k, v in enc.items()}
+
+
+def _extract_response(raw):
+    """Pull just the answer out of a (possibly partial) raw decode: content after
+    <|START_RESPONSE|> (or after <|END_THINKING|> as a fallback), up to <|END_RESPONSE|>."""
+    i = raw.find(START_RESP)
+    if i != -1:
+        body = raw[i + len(START_RESP):]
+    else:
+        j = raw.find(END_THINK)
+        body = raw[j + len(END_THINK):] if j != -1 else ""
+    k = body.find(END_RESP)
+    if k != -1:
+        body = body[:k]
+    for mark in _STRIP:
+        body = body.replace(mark, "")
+    return body.strip()
+
+
+def _gen_kwargs(inputs, max_tokens, temperature):
+    temp = float(temperature if temperature is not None else 0.6)
+    kw = dict(
+        **inputs,
+        # Reserve THINK_BUDGET on top so the discarded reasoning doesn't eat the code budget.
+        max_new_tokens=int(max_tokens or 512) + THINK_BUDGET,
+        do_sample=temp > 0,
+        pad_token_id=_tok.pad_token_id or _tok.eos_token_id,
+    )
+    if temp > 0:
+        kw.update(temperature=temp, top_p=0.95)
+    return kw
+
+
+@spaces.GPU(duration=GPU_DURATION)
+def generate_stream(system, user, max_tokens, temperature):
+    """Stream CUMULATIVE response text (thinking suppressed). The main app diffs successive
+    yields into deltas. On failure, yield the traceback so it isn't a silent empty stream."""
+    try:
+        inputs = _build_inputs(system, user)
+        # skip_special_tokens=False so we can SEE the thinking/response markers and split on them.
+        streamer = TextIteratorStreamer(_tok, skip_prompt=True, skip_special_tokens=False)
+        kw = _gen_kwargs(inputs, max_tokens, temperature)
+        kw["streamer"] = streamer
+        err = {}
+
+        def _run():
+            try:
+                _model.generate(**kw)
+            except Exception:  # noqa: BLE001
+                import traceback
+                err["tb"] = traceback.format_exc()
+                streamer.end()
+
+        thread = threading.Thread(target=_run)
+        thread.start()
+        acc, started = "", False
+        for piece in streamer:
+            acc += piece
+            if not started:
+                if START_RESP not in acc:
+                    continue  # still in the thinking block — emit nothing yet
+                started = True
+            yield _extract_response(acc)
+        thread.join()
+        if err:
+            yield (_extract_response(acc) + "\n[GENERATE ERROR]\n" + err["tb"])
+        elif not started:
+            # Model never opened a response block — fall back to whatever's after thinking.
+            yield _extract_response(acc) or "[EMPTY OUTPUT — no response block produced]"
+    except Exception:  # noqa: BLE001
+        import traceback
+        yield "[SETUP ERROR]\n" + traceback.format_exc()
+
+
+@spaces.GPU(duration=GPU_DURATION)
+def generate(system, user, max_tokens, temperature):
+    try:
+        inputs = _build_inputs(system, user)
+        out = _model.generate(**_gen_kwargs(inputs, max_tokens, temperature))
+        raw = _tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False)
+        return _extract_response(raw) or "[EMPTY OUTPUT]"
+    except Exception:  # noqa: BLE001
+        import traceback
+        return "[ERROR]\n" + traceback.format_exc()
+
+
+# Minimal UI; the named API endpoints are what the main app consumes.
+with gr.Blocks(title="BLS Mini-Code 1.0 — Tiny Army sidecar") as demo:
+    gr.Markdown("## BLS Mini-Code 1.0 — ZeroGPU coding sidecar")
+    sys_in = gr.Textbox(label="system", lines=2)
+    usr_in = gr.Textbox(label="user", lines=6)
+    mt_in = gr.Slider(16, 2048, value=512, step=16, label="max_tokens")
+    temp_in = gr.Slider(0.0, 1.5, value=0.6, step=0.05, label="temperature")
+    out = gr.Textbox(label="output", lines=12)
+    with gr.Row():
+        stream_btn = gr.Button("Stream", variant="primary")
+        once_btn = gr.Button("Generate")
+    stream_btn.click(
+        generate_stream, [sys_in, usr_in, mt_in, temp_in], out, api_name="generate_stream"
+    )
+    once_btn.click(generate, [sys_in, usr_in, mt_in, temp_in], out, api_name="generate")
+
+if __name__ == "__main__":
+    demo.queue().launch()
spaces/bls-code-zerogpu/requirements.txt ADDED
@@ -0,0 +1,10 @@
+# cohere2_moe is a very new architecture — install transformers from git until it lands
+# in a tagged release. If a release >= the one that adds cohere2_moe exists, pin it instead.
+git+https://github.com/huggingface/transformers.git
+accelerate
+bitsandbytes
+sentencepiece
+# Match the main app's Gradio so the gradio_client (2.5.0) contract stays identical.
+gradio==6.15.2
+spaces
+torch
web/codingModel.js CHANGED
@@ -1,16 +1,17 @@
 // Coding-model store for the Skill Forge. SEPARATE from runtime.js (the persona/diary
 // "Text Generation Model") so picking a coding model never clobbers the writer model.
-// Both candidates are large (Mellum2 ~8GB, Nemotron-30B ~24GB) with no browser-viable
-// build, so this is ZeroGPU-only: every choice routes through the same server endpoint
-// (/text/generate/stream) the `server` engine uses, by model id. Mellum2 is a ZeroGPU
-// sidecar (TINY_MELLUM_SPACE); Nemotron-30B routes through hosted NVIDIA NIM
-// (NVIDIA_NIM_API_KEY) since it's too big to self-host.
+// All candidates are large (Mellum2 ~8GB, BLS Mini-Code 30B MoE, Nemotron-30B ~24GB) with no
+// browser-viable build, so this is ZeroGPU-only: every choice routes through the same server
+// endpoint (/text/generate/stream) the `server` engine uses, by model id. Mellum2
+// (TINY_MELLUM_SPACE) and BLS Mini-Code (TINY_BLS_CODE_SPACE) are ZeroGPU sidecars; Nemotron-30B
+// routes through hosted NVIDIA NIM (NVIDIA_NIM_API_KEY) since it's too big to self-host.
 import { statsTracker } from '/web/genStats.js'
 import { streamSse } from '/web/sseText.js'
 
 const MODELS = [
   { id: 'nemotron-3-nano-30b-nim', label: 'Nemotron 3 Nano 30B-A3B', params: '30B (3B active)', backend: 'NVIDIA NIM', note: 'reasoning + agentic code (NVIDIA)' },
   { id: 'mellum2-zerogpu', label: 'Mellum2 12B-A2.5B', params: '12B (2.5B active)', backend: 'ZeroGPU sidecar', note: 'code model (JetBrains)' },
+  { id: 'bls-mini-code-zerogpu', label: 'BLS Mini-Code 1.0', params: '30B MoE', backend: 'ZeroGPU sidecar', note: 'code model (Cohere); reasoning suppressed' },
 ]
 const DEFAULT = 'nemotron-3-nano-30b-nim'
 const KEY = 'tinyarmy.codingModel'