
metadata
title: Tiny Army BLS Mini-Code ZeroGPU
emoji: 🪖
colorFrom: indigo
colorTo: green
sdk: gradio
sdk_version: 6.15.2
app_file: app.py
pinned: false
suggested_hardware: zero-a10g

Tiny Army: BLS Mini-Code 1.0 (ZeroGPU coding sidecar)

A ZeroGPU sidecar that serves CohereLabs/BLS-Mini-Code-1.0 (a 30B MoE coding model) to the Tiny Army app through the same Gradio API that the Mellum2 and Tiny Aya sidecars expose.

API contract (consumed by the main app's gradio_client)

  • POST /generate_stream β€” args (system, user, max_tokens:int, temperature:float), streams cumulative decoded text (the app diffs successive frames into deltas).
  • POST /generate β€” same args, returns the final text in one shot.

Config (Space → Settings → Variables)

| Var | Default | Notes |
|---|---|---|
| TINY_BLS_MODEL | CohereLabs/BLS-Mini-Code-1.0 | source repo |
| TINY_BLS_QUANT | 4bit | 4bit (18GB) / 8bit (32GB) / bf16 (~60GB, tight). No FP8 weight exists upstream, so we quantize at load |
| TINY_BLS_GPU_DURATION | 120 | ZeroGPU seconds per call |
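
One way app.py could read these variables is sketched below. This illustrates the defaults in the table and is not the Space's actual loader: `SidecarConfig`, `load_config`, and the validation rules (lower-casing the quant value, rejecting non-positive durations) are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Quantization modes listed in the config table.
VALID_QUANTS = ("4bit", "8bit", "bf16")


@dataclass(frozen=True)
class SidecarConfig:
    """Hypothetical container for the three Space variables."""
    model: str
    quant: str
    gpu_duration: int


def load_config(env: Optional[Mapping[str, str]] = None) -> SidecarConfig:
    """Read the TINY_BLS_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env

    quant = env.get("TINY_BLS_QUANT", "4bit").strip().lower()
    if quant not in VALID_QUANTS:
        raise ValueError(f"TINY_BLS_QUANT must be one of {VALID_QUANTS}, got {quant!r}")

    gpu_duration = int(env.get("TINY_BLS_GPU_DURATION", "120"))
    if gpu_duration <= 0:
        raise ValueError("TINY_BLS_GPU_DURATION must be a positive number of seconds")

    return SidecarConfig(
        model=env.get("TINY_BLS_MODEL", "CohereLabs/BLS-Mini-Code-1.0"),
        quant=quant,
        gpu_duration=gpu_duration,
    )


if __name__ == "__main__":
    print(load_config())
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise locally before the variables are set on the Space.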

Hardware: set the Space to a ZeroGPU tier with enough VRAM. 30B at 4-bit fits an A10G or H200 ZeroGPU slice; bf16 and 8bit need the larger H200 slice. Adjust the suggested_hardware: field in the metadata above to match the ZeroGPU flavor you provision.
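
The sizes quoted in the config notes follow roughly from parameter count times bits per weight. The sketch below does that arithmetic for the weights only. The 3 GB overhead default is an assumption, picked so that the 4-bit result lands near the quoted 18GB, and the 24 GB figure in the example is the A10G's nominal memory, not a measured ZeroGPU slice size.

```python
# Bits stored per weight for each TINY_BLS_QUANT mode.
BITS_PER_WEIGHT = {"4bit": 4, "8bit": 8, "bf16": 16}


def weight_gb(params_billions: float, quant: str) -> float:
    """Approximate weight size in GB: params * bits / 8, ignoring overhead."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


def fits(params_billions: float, quant: str, vram_gb: float,
         overhead_gb: float = 3.0) -> bool:
    """Rough check that weights plus an assumed fixed overhead fit in VRAM.

    overhead_gb is a placeholder for activations, KV cache and
    quantization metadata; real overhead depends on the workload.
    """
    return weight_gb(params_billions, quant) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(quant, weight_gb(30, quant), "GB weights,",
              "fits in 24 GB" if fits(30, quant, 24) else "does not fit in 24 GB")
```

For a 30B model this gives 15 GB, 30 GB and 60 GB of raw weights for 4bit, 8bit and bf16, which is consistent with the 18GB, 32GB and ~60GB figures once runtime overhead is added.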

Wiring into the main app (later step)

Once this Space is live and both endpoints respond, set TINY_BLS_CODE_SPACE=<owner>/<space> in the main app, then add the routing branch and the web/codingModel.js entry, mirroring the Mellum2 setup.
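
As a rough illustration of the first half of that step, the sketch below reads TINY_BLS_CODE_SPACE and checks that it looks like `<owner>/<space>` before any routing uses it. The main app's language and routing code are not described here, so this is only a Python sketch: the function name `coding_space` and the identifier pattern are assumptions, not the app's real validation.

```python
import re
from typing import Mapping, Optional

# Assumed shape of a Space id: two non-empty segments separated by one slash,
# each made of letters, digits, '_', '-' or '.'.
_SPACE_ID = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_.-]*/[A-Za-z0-9][A-Za-z0-9_.-]*$")


def coding_space(env: Mapping[str, str]) -> Optional[str]:
    """Return the configured Space id, or None when the variable is unset."""
    value = env.get("TINY_BLS_CODE_SPACE", "").strip()
    if not value:
        return None  # sidecar not wired in yet; caller should skip the route
    if not _SPACE_ID.match(value):
        raise ValueError(f"TINY_BLS_CODE_SPACE should look like owner/space, got {value!r}")
    return value


if __name__ == "__main__":
    print(coding_space({"TINY_BLS_CODE_SPACE": "some-owner/some-space"}))
```

Returning None for an unset variable lets the routing branch stay disabled until the Space is live, rather than failing at startup.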