Spaces: build-small-hackathon/tiny-army
polats Claude Opus 4.8 (1M context) committed on Jun 6

Commit

37982be

1 Parent(s): b408d69

Add BLS Mini-Code 1.0 (Cohere 30B MoE) coding sidecar

ZeroGPU sidecar serving CohereLabs/BLS-Mini-Code-1.0 via the same /generate
+ /generate_stream contract as Mellum2. Source under spaces/bls-code-zerogpu.

- No FP8 weight upstream (BF16 only), so the Space quantizes to 4-bit at load.
- cohere2_moe is new, so transformers is pulled from git.
- It's a reasoning model: the chat template force-opens <|START_RESPONSE|>
(non-reasoning), which made it ramble reasoning as prose. The Space instead
opens a <|START_THINKING|> block (discarded) and streams only the clean
<|START_RESPONSE|> code, stripping <|START_TEXT|>/<|END_TEXT|> markers, with a
thinking-token budget so requested max_tokens applies to the visible code.
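
The prompt swap and the token budget described in the bullet above can be sketched without loading the model. This is a minimal illustration, not the Space's actual code path: the placeholder prompt string and the helper names `open_thinking_block` and `total_new_tokens` are hypothetical, and the 1024-token default only mirrors the `TINY_BLS_THINK_BUDGET` default shown later in this commit.

```python
START_THINK = "<|START_THINKING|>"
START_RESP = "<|START_RESPONSE|>"


def open_thinking_block(prompt: str) -> str:
    """Swap a trailing <|START_RESPONSE|> for <|START_THINKING|>.

    Prompts that do not end with the response marker are returned unchanged.
    """
    trimmed = prompt.rstrip()
    if trimmed.endswith(START_RESP):
        return trimmed[: -len(START_RESP)] + START_THINK
    return prompt


def total_new_tokens(max_tokens: int, think_budget: int = 1024) -> int:
    """Tokens to request from generate(): visible code plus discarded thinking."""
    return max_tokens + think_budget


# Hypothetical rendered prompt; the real one comes from the chat template.
rendered = "[system and user turns]" + START_RESP + "\n"
print(open_thinking_block(rendered))
print(total_new_tokens(512))
```

With a caller-requested `max_tokens` of 512, the sketch asks for 1536 new tokens, so the discarded thinking does not eat into the visible code budget.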

App side mirrors Mellum2: TINY_BLS_CODE_SPACE env var, _bls_code_stream +
_bls_code_stream_with_fallback (Nemotron NIM fallback when the sidecar is
asleep/over quota), bls-mini-code-zerogpu routing branch, and a codingModel.js
dropdown entry. Verified end-to-end through /text/generate/stream.
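
The fallback rule (switch to the backup only if nothing has streamed yet) is easier to see in isolation. The sketch below uses stand-in generators instead of the real sidecar and NIM clients; `stream_with_fallback`, `asleep_sidecar`, `flaky_sidecar`, and `backup` are hypothetical names, and the `NIM_KEY` check from the real function is left out for brevity.

```python
from typing import Callable, Iterator


def stream_with_fallback(
    primary: Callable[[], Iterator[str]],
    fallback: Callable[[], Iterator[str]],
) -> Iterator[str]:
    """Yield from primary; use fallback only if primary fails before its first chunk."""
    emitted = False
    try:
        for chunk in primary():
            emitted = True
            yield chunk
    except Exception:
        if emitted:
            # Output already reached the caller, so switching models mid-stream
            # would splice two different answers together.
            raise
        yield from fallback()


def asleep_sidecar() -> Iterator[str]:
    raise RuntimeError("sidecar asleep")
    yield  # unreachable; makes this function a generator


def flaky_sidecar() -> Iterator[str]:
    yield "partial"
    raise RuntimeError("quota exceeded mid-stream")


def backup() -> Iterator[str]:
    yield "nim-1"
    yield "nim-2"


print(list(stream_with_fallback(asleep_sidecar, backup)))
```

A failure before the first chunk is invisible to the caller, while a failure after it propagates as an error.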

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Files changed (6)

app.py +31 -0
spaces/bls-code-zerogpu/.gitignore +2 -0
spaces/bls-code-zerogpu/README.md +40 -0
spaces/bls-code-zerogpu/app.py +180 -0
spaces/bls-code-zerogpu/requirements.txt +10 -0
web/codingModel.js +6 -5

app.py CHANGED

@@ -272,6 +272,9 @@ MINICPM5_SPACE = os.environ.get("TINY_MINICPM5_SPACE", "").strip()
 # Coding model (Skill Forge): Mellum2 is a ZeroGPU sidecar (same /generate contract as
 # Aya); Nemotron-30B is too big to self-host, so it runs via hosted NVIDIA NIM (below).
 MELLUM_SPACE = os.environ.get("TINY_MELLUM_SPACE", "").strip()
+# BLS Mini-Code 1.0 (Cohere, 30B MoE): another ZeroGPU sidecar (same /generate contract).
+# The sidecar suppresses the model's reasoning and streams clean code; see spaces/bls-code-zerogpu.
+BLS_CODE_SPACE = os.environ.get("TINY_BLS_CODE_SPACE", "").strip()
 _local_tts = None       # VoiceDesign model
 _local_clone = None     # Base model (voice clone) — lazy, only if a clone is requested
 _local_tts_lock = threading.Lock()
@@ -578,6 +581,26 @@ def _mellum_stream_with_fallback(system, user, max_tokens, temperature):
         yield from _nim_text_stream(system, user, max_tokens, temperature)
+def _bls_code_stream(system, user, max_tokens, temperature):
+    yield from _space_text_stream(BLS_CODE_SPACE, system, user, max_tokens, temperature)
+def _bls_code_stream_with_fallback(system, user, max_tokens, temperature):
+    """BLS Mini-Code ZeroGPU sidecar, falling back to Nemotron (NVIDIA NIM) if the sidecar is
+    unavailable BEFORE any token streams (same constraint as Mellum2: can't switch mid-stream)."""
+    emitted = False
+    try:
+        if not BLS_CODE_SPACE:
+            raise llm.LlmUnavailable("TINY_BLS_CODE_SPACE not set")
+        for chunk in _bls_code_stream(system, user, max_tokens, temperature):
+            emitted = True
+            yield chunk
+    except Exception:  # noqa: BLE001
+        if emitted or not NIM_KEY:
+            raise
+        yield from _nim_text_stream(system, user, max_tokens, temperature)
 @fastapi_app.post("/voxcpm-tts")
 async def voxcpm_tts(request: Request):
     body = await request.json()
@@ -896,6 +919,14 @@ async def text_generate_stream(request: Request):
                         if stop.is_set():
                             break
                         loop.call_soon_threadsafe(q.put_nowait, ("delta", chunk))
+                elif model == "bls-mini-code-zerogpu":
+                    # BLS Mini-Code sidecar, with Nemotron NIM as fallback if it's unavailable.
+                    if not BLS_CODE_SPACE and not NIM_KEY:
+                        raise llm.LlmUnavailable("TINY_BLS_CODE_SPACE not set")
+                    for chunk in _bls_code_stream_with_fallback(system, user, max_tokens, temperature):
+                        if stop.is_set():
+                            break
+                        loop.call_soon_threadsafe(q.put_nowait, ("delta", chunk))
                 elif model == "nemotron-3-nano-30b-nim":
                     if not NIM_KEY:
                         raise llm.LlmUnavailable("NVIDIA_NIM_API_KEY not set")

spaces/bls-code-zerogpu/.gitignore ADDED

@@ -0,0 +1,2 @@
+__pycache__/
+*.pyc

spaces/bls-code-zerogpu/README.md ADDED

@@ -0,0 +1,40 @@

+---
+title: Tiny Army BLS Mini-Code ZeroGPU
+emoji: 🪖
+colorFrom: indigo
+colorTo: green
+sdk: gradio
+sdk_version: 6.15.2
+app_file: app.py
+pinned: false
+suggested_hardware: zero-a10g
+---
+# Tiny Army — BLS Mini-Code 1.0 (ZeroGPU coding sidecar)
+A ZeroGPU sidecar that serves [`CohereLabs/BLS-Mini-Code-1.0`](https://huggingface.co/CohereLabs/BLS-Mini-Code-1.0)
+(30B MoE coding model) to the Tiny Army app via the same Gradio API the Mellum2 / Tiny Aya
+sidecars expose.
+## API contract (consumed by the main app's `gradio_client`)
+- `POST /generate_stream` — args `(system, user, max_tokens:int, temperature:float)`, streams
+  **cumulative** decoded text (the app diffs successive frames into deltas).
+- `POST /generate` — same args, returns the final text in one shot.
+## Config (Space → Settings → Variables)
+| Var | Default | Notes |
+|-----|---------|-------|
+| `TINY_BLS_MODEL` | `CohereLabs/BLS-Mini-Code-1.0` | source repo |
+| `TINY_BLS_QUANT` | `4bit` | `4bit` (~18GB) / `8bit` (~32GB) / `bf16` (~60GB, tight) — no FP8 weight exists upstream, so we quantize at load |
+| `TINY_BLS_GPU_DURATION` | `120` | ZeroGPU seconds per call |
+> **Hardware:** set the Space to a ZeroGPU tier with enough VRAM. 30B at 4-bit fits an A10G/H200
+> ZeroGPU slice; `bf16`/`8bit` need the larger H200 slice. Adjust the `suggested_hardware:` field
+> above to the ZeroGPU flavor you provision.
+## Wiring into the main app (later step)
+Once this Space is live and the two endpoints respond, set `TINY_BLS_CODE_SPACE=<owner>/<space>`
+in the main app and add the routing branch + `web/codingModel.js` entry (mirrors Mellum2).
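
The README's API contract says `/generate_stream` emits cumulative text and that the app diffs successive frames into deltas. The main app's diffing code is not part of this commit, so the following is only a sketch of one way to do it; `frames_to_deltas` is a hypothetical helper, and emitting everything after the longest common prefix when a frame is not a clean extension of the previous one is an assumption.

```python
from typing import Iterable, Iterator


def frames_to_deltas(frames: Iterable[str]) -> Iterator[str]:
    """Turn cumulative frames ("a", "ab", "abc") into deltas ("a", "b", "c")."""
    prev = ""
    for frame in frames:
        # Length of the longest common prefix of the previous and current frame.
        common = 0
        for a, b in zip(prev, frame):
            if a != b:
                break
            common += 1
        delta = frame[common:]
        prev = frame
        if delta:  # skip repeated frames that add nothing new
            yield delta


frames = ["def", "def f", "def f():", "def f():"]
print(list(frames_to_deltas(frames)))  # ['def', ' f', '():']
```

When every frame extends the previous one, concatenating the deltas reproduces the final frame.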

spaces/bls-code-zerogpu/app.py ADDED

@@ -0,0 +1,180 @@

+# Tiny Army — BLS Mini-Code 1.0 ZeroGPU coding sidecar.
+#
+# Exposes the SAME Gradio contract as the Mellum2 / Tiny Aya sidecars so the main app's
+# gradio_client can talk to it unchanged (see app.py:_space_text_stream / _space_text_generate):
+#   /generate_stream(system, user, max_tokens:int, temperature:float) -> str   # CUMULATIVE text, streamed
+#   /generate(system, user, max_tokens:int, temperature:float)        -> str   # final text, one shot
+#
+# Model: CohereLabs/BLS-Mini-Code-1.0 — 30B MoE (cohere2_moe), BF16 only upstream (no FP8
+# weight published as of 2026-06), so we quantize AT LOAD via bitsandbytes to fit the ZeroGPU
+# H200 slice. TINY_BLS_QUANT selects 4bit (default, ~18GB) / 8bit (~32GB) / bf16 (~60GB, tight).
+#
+# REASONING: BLS-Mini-Code is a Cohere reasoning model. Its chat template, with
+# add_generation_prompt=True, force-opens <|START_RESPONSE|> (non-reasoning mode) — which makes
+# the model dump its reasoning as prose into the answer. Instead we open a <|START_THINKING|>
+# block so it reasons in a dedicated section we DISCARD, and we stream only the clean code from
+# <|START_RESPONSE|>…<|END_RESPONSE|>. TINY_BLS_THINK_BUDGET extra tokens are reserved for the
+# (discarded) thinking so the requested max_tokens still applies to the visible code.
+import os
+import threading
+import gradio as gr
+import spaces
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
+MODEL_ID = os.environ.get("TINY_BLS_MODEL", "CohereLabs/BLS-Mini-Code-1.0")
+QUANT = os.environ.get("TINY_BLS_QUANT", "4bit").strip().lower()
+GPU_DURATION = int(os.environ.get("TINY_BLS_GPU_DURATION", "120"))
+THINK_BUDGET = int(os.environ.get("TINY_BLS_THINK_BUDGET", "1024"))
+START_THINK, END_THINK = "<|START_THINKING|>", "<|END_THINKING|>"
+START_RESP, END_RESP = "<|START_RESPONSE|>", "<|END_RESPONSE|>"
+_STRIP = (START_THINK, END_THINK, START_RESP, END_RESP,
+          "<|START_TEXT|>", "<|END_TEXT|>", "<|END_OF_TURN_TOKEN|>")
+print(f"[bls-code] loading {MODEL_ID} quant={QUANT}", flush=True)
+_tok = AutoTokenizer.from_pretrained(MODEL_ID)
+def _load_kwargs():
+    kw = {"torch_dtype": torch.bfloat16, "device_map": "cuda"}
+    if QUANT == "bf16":
+        return kw
+    from transformers import BitsAndBytesConfig
+    if QUANT == "8bit":
+        kw["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
+    else:  # 4bit (default)
+        kw["quantization_config"] = BitsAndBytesConfig(
+            load_in_4bit=True,
+            bnb_4bit_quant_type="nf4",
+            bnb_4bit_compute_dtype=torch.bfloat16,
+            bnb_4bit_use_double_quant=True,
+        )
+    return kw
+_model = AutoModelForCausalLM.from_pretrained(MODEL_ID, **_load_kwargs())
+_model.eval()
+print("[bls-code] model ready", flush=True)
+def _build_inputs(system, user):
+    messages = []
+    if system and system.strip():
+        messages.append({"role": "system", "content": system.strip()})
+    messages.append({"role": "user", "content": (user or "").strip()})
+    text = _tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+    # The template force-opens <|START_RESPONSE|> (non-reasoning). Swap it for a thinking block
+    # so the model reasons where we can discard it, leaving clean code in the response section.
+    t = text.rstrip()
+    if t.endswith(START_RESP):
+        text = t[: -len(START_RESP)] + START_THINK
+    enc = _tok(text, return_tensors="pt", add_special_tokens=False)
+    return {k: v.to(_model.device) for k, v in enc.items()}
+def _extract_response(raw):
+    """Pull just the answer out of a (possibly partial) raw decode: content after
+    <|START_RESPONSE|> (or after <|END_THINKING|> as a fallback), up to <|END_RESPONSE|>."""
+    i = raw.find(START_RESP)
+    if i != -1:
+        body = raw[i + len(START_RESP):]
+    else:
+        j = raw.find(END_THINK)
+        body = raw[j + len(END_THINK):] if j != -1 else ""
+    k = body.find(END_RESP)
+    if k != -1:
+        body = body[:k]
+    for mark in _STRIP:
+        body = body.replace(mark, "")
+    return body.strip()
+def _gen_kwargs(inputs, max_tokens, temperature):
+    temp = float(temperature if temperature is not None else 0.6)
+    kw = dict(
+        **inputs,
+        # Reserve THINK_BUDGET on top so the discarded reasoning doesn't eat the code budget.
+        max_new_tokens=int(max_tokens or 512) + THINK_BUDGET,
+        do_sample=temp > 0,
+        pad_token_id=_tok.pad_token_id or _tok.eos_token_id,
+    )
+    if temp > 0:
+        kw.update(temperature=temp, top_p=0.95)
+    return kw
+@spaces.GPU(duration=GPU_DURATION)
+def generate_stream(system, user, max_tokens, temperature):
+    """Stream CUMULATIVE response text (thinking suppressed). The main app diffs successive
+    yields into deltas. On failure, yield the traceback so it isn't a silent empty stream."""
+    try:
+        inputs = _build_inputs(system, user)
+        # skip_special_tokens=False so we can SEE the thinking/response markers and split on them.
+        streamer = TextIteratorStreamer(_tok, skip_prompt=True, skip_special_tokens=False)
+        kw = _gen_kwargs(inputs, max_tokens, temperature)
+        kw["streamer"] = streamer
+        err = {}
+        def _run():
+            try:
+                _model.generate(**kw)
+            except Exception:  # noqa: BLE001
+                import traceback
+                err["tb"] = traceback.format_exc()
+                streamer.end()
+        thread = threading.Thread(target=_run)
+        thread.start()
+        acc, started = "", False
+        for piece in streamer:
+            acc += piece
+            if not started:
+                if START_RESP not in acc:
+                    continue  # still in the thinking block — emit nothing yet
+                started = True
+            yield _extract_response(acc)
+        thread.join()
+        if err:
+            yield (_extract_response(acc) + "\n[GENERATE ERROR]\n" + err["tb"])
+        elif not started:
+            # Model never opened a response block — fall back to whatever's after thinking.
+            yield _extract_response(acc) or "[EMPTY OUTPUT — no response block produced]"
+    except Exception:  # noqa: BLE001
+        import traceback
+        yield "[SETUP ERROR]\n" + traceback.format_exc()
+@spaces.GPU(duration=GPU_DURATION)
+def generate(system, user, max_tokens, temperature):
+    try:
+        inputs = _build_inputs(system, user)
+        out = _model.generate(**_gen_kwargs(inputs, max_tokens, temperature))
+        raw = _tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False)
+        return _extract_response(raw) or "[EMPTY OUTPUT]"
+    except Exception:  # noqa: BLE001
+        import traceback
+        return "[ERROR]\n" + traceback.format_exc()
+# Minimal UI; the named API endpoints are what the main app consumes.
+with gr.Blocks(title="BLS Mini-Code 1.0 — Tiny Army sidecar") as demo:
+    gr.Markdown("## BLS Mini-Code 1.0 — ZeroGPU coding sidecar")
+    sys_in = gr.Textbox(label="system", lines=2)
+    usr_in = gr.Textbox(label="user", lines=6)
+    mt_in = gr.Slider(16, 2048, value=512, step=16, label="max_tokens")
+    temp_in = gr.Slider(0.0, 1.5, value=0.6, step=0.05, label="temperature")
+    out = gr.Textbox(label="output", lines=12)
+    with gr.Row():
+        stream_btn = gr.Button("Stream", variant="primary")
+        once_btn = gr.Button("Generate")
+    stream_btn.click(
+        generate_stream, [sys_in, usr_in, mt_in, temp_in], out, api_name="generate_stream"
+    )
+    once_btn.click(generate, [sys_in, usr_in, mt_in, temp_in], out, api_name="generate")
+if __name__ == "__main__":
+    demo.queue().launch()
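
The gating loop in `generate_stream` can be traced with a fake streamer instead of a GPU. The sketch below is a simplified stand-in, assuming the pieces arrive as plain strings: `visible_frames` is a hypothetical name, the piece list is made up, and the extra marker stripping done by `_extract_response` is omitted.

```python
from typing import Iterable, Iterator

START_RESP, END_RESP = "<|START_RESPONSE|>", "<|END_RESPONSE|>"


def visible_frames(pieces: Iterable[str]) -> Iterator[str]:
    """Yield cumulative response text, emitting nothing while the model is thinking."""
    acc, started = "", False
    for piece in pieces:
        acc += piece
        if not started:
            if START_RESP not in acc:
                continue  # still inside the discarded thinking block
            started = True
        body = acc.split(START_RESP, 1)[1]   # text after the response marker
        body = body.split(END_RESP, 1)[0]    # cut at the end marker if present
        yield body.strip()


# Made-up decode pieces standing in for TextIteratorStreamer output.
pieces = [
    "let me think",
    "<|END_THINKING|>",
    "<|START_RESPONSE|>",
    "print(",
    "'hi')",
    "<|END_RESPONSE|>",
]
print(list(visible_frames(pieces)))
# ['', "print(", "print('hi')", "print('hi')"]
```

The first two pieces produce no frame at all, which is why the caller sees silence until the response block opens.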

spaces/bls-code-zerogpu/requirements.txt ADDED

@@ -0,0 +1,10 @@

+# cohere2_moe is a very new architecture — install transformers from git until it lands
+# in a tagged release. If a release >= the one that adds cohere2_moe exists, pin it instead.
+git+https://github.com/huggingface/transformers.git
+accelerate
+bitsandbytes
+sentencepiece
+# Match the main app's Gradio so the gradio_client (2.5.0) contract stays identical.
+gradio==6.15.2
+spaces
+torch
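
`bitsandbytes` is listed because the Space quantizes at load. A rough weights-only estimate shows where the sizes quoted for `TINY_BLS_QUANT` come from. The sketch assumes exactly 30 billion parameters and counts only weight storage. The ~18GB and ~32GB figures in the README are somewhat higher than the 4-bit and 8-bit results here, presumably because of runtime overhead. `weight_gb` is a hypothetical helper.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"4bit": 0.5, "8bit": 1.0, "bf16": 2.0}


def weight_gb(params_billion: float, quant: str) -> float:
    """Weights-only size in GB (1 GB = 1e9 bytes) for a model of the given size."""
    return params_billion * BYTES_PER_PARAM[quant]


for quant in ("4bit", "8bit", "bf16"):
    print(quant, weight_gb(30, quant))
```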

web/codingModel.js CHANGED

@@ -1,16 +1,17 @@
 // Coding-model store for the Skill Forge. SEPARATE from runtime.js (the persona/diary
 // "Text Generation Model") so picking a coding model never clobbers the writer model.
-// Both candidates are large (Mellum2 ~8GB, Nemotron-30B ~24GB) with no browser-viable
-// build, so this is ZeroGPU-only: every choice routes through the same server endpoint
-// (/text/generate/stream) the `server` engine uses, by model id. Mellum2 is a ZeroGPU
-// sidecar (TINY_MELLUM_SPACE); Nemotron-30B routes through hosted NVIDIA NIM
-// (NVIDIA_NIM_API_KEY) since it's too big to self-host.
+// All candidates are large (Mellum2 ~8GB, BLS Mini-Code 30B MoE, Nemotron-30B ~24GB) with no
+// browser-viable build, so this is ZeroGPU-only: every choice routes through the same server
+// endpoint (/text/generate/stream) the `server` engine uses, by model id. Mellum2
+// (TINY_MELLUM_SPACE) and BLS Mini-Code (TINY_BLS_CODE_SPACE) are ZeroGPU sidecars; Nemotron-30B
+// routes through hosted NVIDIA NIM (NVIDIA_NIM_API_KEY) since it's too big to self-host.
 import { statsTracker } from '/web/genStats.js'
 import { streamSse } from '/web/sseText.js'
 const MODELS = [
   { id: 'nemotron-3-nano-30b-nim', label: 'Nemotron 3 Nano 30B-A3B', params: '30B (3B active)', backend: 'NVIDIA NIM', note: 'reasoning + agentic code (NVIDIA)' },
   { id: 'mellum2-zerogpu', label: 'Mellum2 12B-A2.5B', params: '12B (2.5B active)', backend: 'ZeroGPU sidecar', note: 'code model (JetBrains)' },
+  { id: 'bls-mini-code-zerogpu', label: 'BLS Mini-Code 1.0', params: '30B MoE', backend: 'ZeroGPU sidecar', note: 'code model (Cohere); reasoning suppressed' },
 ]
 const DEFAULT = 'nemotron-3-nano-30b-nim'
 const KEY = 'tinyarmy.codingModel'