---
title: Tiny Army BLS Mini-Code ZeroGPU
emoji: 🪖
colorFrom: indigo
colorTo: green
sdk: gradio
sdk_version: 6.15.2
app_file: app.py
pinned: false
suggested_hardware: zero-a10g
---
# Tiny Army: BLS Mini-Code 1.0 (ZeroGPU coding sidecar)
A ZeroGPU sidecar that serves [`CohereLabs/BLS-Mini-Code-1.0`](https://huggingface.co/CohereLabs/BLS-Mini-Code-1.0)
(a 30B-parameter MoE coding model) to the Tiny Army app through the same Gradio API that the Mellum2 and
Tiny Aya sidecars expose.
## API contract (consumed by the main app's `gradio_client`)
- `POST /generate_stream`: args `(system, user, max_tokens:int, temperature:float)`; streams the
  **cumulative** decoded text (the app diffs successive frames into deltas).
- `POST /generate`: same args; returns the final text in one shot.
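A minimal sketch of the cumulative-frame convention, assuming the server accumulates decoded text pieces as it generates. `to_cumulative` and `to_deltas` are hypothetical helper names, not functions from `app.py` or `gradio_client`; they only illustrate how cumulative frames and client-side deltas relate.

```python
import os
from typing import Iterable, Iterator


def to_cumulative(pieces: Iterable[str]) -> Iterator[str]:
    """Server side (hypothetical): yield the full text generated so far after
    each new piece, the shape `/generate_stream` is documented to stream."""
    text = ""
    for piece in pieces:
        text += piece
        yield text


def to_deltas(frames: Iterable[str]) -> Iterator[str]:
    """Client side (hypothetical): diff successive cumulative frames into
    non-empty deltas."""
    prev = ""
    for frame in frames:
        # Normally `frame` extends `prev`, so the shared prefix is all of `prev`.
        # If a frame ever rewrites earlier text, only the part after the shared
        # prefix is emitted; already-yielded text cannot be taken back.
        shared = os.path.commonprefix([prev, frame])
        delta = frame[len(shared):]
        if delta:
            yield delta
        prev = frame


if __name__ == "__main__":
    frames = list(to_cumulative(["def add", "(a, b):", "", "\n    return a + b"]))
    print(frames)
    print(list(to_deltas(frames)))
```

In the main app the frames would presumably come from iterating a `gradio_client` job, for example one created with `Client(space).submit(system, user, max_tokens, temperature, api_name="/generate_stream")`. Diffing on the client keeps the server stateless, and a dropped frame is harmless because the next one still carries the full text.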
## Config (Space → Settings → Variables)
| Var | Default | Notes |
|-----|---------|-------|
| `TINY_BLS_MODEL` | `CohereLabs/BLS-Mini-Code-1.0` | source repo |
| `TINY_BLS_QUANT` | `4bit` | `4bit` (~18GB) / `8bit` (~32GB) / `bf16` (~60GB, tight). No FP8 weights exist upstream, so the model is quantized at load time |
| `TINY_BLS_GPU_DURATION` | `120` | ZeroGPU time budget per call, in seconds |
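A hedged sketch of how these variables could be read with the documented defaults. `SidecarSettings`, `load_settings`, and `quant_kwargs` are hypothetical names; the mapping onto `BitsAndBytesConfig` keyword arguments (`load_in_4bit`, `load_in_8bit`) is an assumption about how `app.py` quantizes at load time, not a description of it.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Accepted values for TINY_BLS_QUANT, per the table above.
QUANT_CHOICES = ("4bit", "8bit", "bf16")


@dataclass(frozen=True)
class SidecarSettings:
    model: str
    quant: str
    gpu_duration: int


def load_settings(env: Mapping[str, str] = os.environ) -> SidecarSettings:
    """Read the sidecar's config vars, falling back to the documented defaults."""
    quant = env.get("TINY_BLS_QUANT", "4bit").strip().lower()
    if quant not in QUANT_CHOICES:
        raise ValueError(f"TINY_BLS_QUANT must be one of {QUANT_CHOICES}, got {quant!r}")
    duration = int(env.get("TINY_BLS_GPU_DURATION", "120"))
    if duration <= 0:
        raise ValueError("TINY_BLS_GPU_DURATION must be a positive number of seconds")
    return SidecarSettings(
        model=env.get("TINY_BLS_MODEL", "CohereLabs/BLS-Mini-Code-1.0"),
        quant=quant,
        gpu_duration=duration,
    )


def quant_kwargs(quant: str) -> dict:
    """Hypothetical mapping onto transformers' BitsAndBytesConfig keyword
    arguments; `bf16` means no bitsandbytes quantization at all."""
    if quant == "4bit":
        return {"load_in_4bit": True}
    if quant == "8bit":
        return {"load_in_8bit": True}
    return {}
```

The `gpu_duration` value would presumably be passed to the `spaces.GPU(duration=...)` decorator on the generation functions. Rejecting an unknown quantization value at startup fails fast, instead of silently falling back to a mode that may not fit the provisioned slice.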
> **Hardware:** set the Space to a ZeroGPU tier with enough VRAM. At 4-bit the 30B model fits an A10G or H200
> ZeroGPU slice; `8bit` and `bf16` need the larger H200 slice. Adjust the `suggested_hardware:` field in the
> front matter above to the ZeroGPU flavor you provision.
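The table's VRAM figures are consistent with a back-of-envelope weight-memory estimate. This sketch assumes 1 GB = 10^9 bytes and ignores KV cache, activations, and quantization metadata, which plausibly account for the gap between these numbers and the table's. For an MoE model, $N$ counts every expert, since all expert weights stay resident even though only a few are active per token.

$$
\text{weight memory} \approx N \cdot \frac{b}{8}\ \text{bytes}, \qquad N = 30 \times 10^{9},\ b = \text{bits per weight}
$$

$$
\begin{aligned}
\texttt{4bit}: &\quad 30 \times 10^{9} \cdot \tfrac{4}{8} = 15\ \text{GB} \\
\texttt{8bit}: &\quad 30 \times 10^{9} \cdot \tfrac{8}{8} = 30\ \text{GB} \\
\texttt{bf16}: &\quad 30 \times 10^{9} \cdot \tfrac{16}{8} = 60\ \text{GB}
\end{aligned}
$$

Against an A10G's 24 GB, only the 4-bit estimate leaves room for the runtime overhead. At `bf16` the weights alone already reach the table's ~60GB, which is why that row is marked tight.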
## Wiring into the main app (later step)
Once this Space is live and both endpoints respond, set `TINY_BLS_CODE_SPACE=<owner>/<space>`
in the main app, then add the routing branch and a `web/codingModel.js` entry, mirroring the Mellum2 integration.
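The routing branch lives in the main app and is not shown in this README. As a hypothetical sketch only: `resolve_coding_space` and the `"bls-mini-code"` key are invented names. The sketch assumes the branch checks `TINY_BLS_CODE_SPACE` and falls through to the app's other routes when the variable is unset or not of the form `<owner>/<space>`.

```python
import os
from typing import Mapping, Optional

# Hypothetical model key the main app might use for this sidecar.
BLS_MODEL_KEY = "bls-mini-code"


def resolve_coding_space(model_key: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the Space id to call for `model_key`, or None so the caller
    falls through to its other routing branches."""
    if model_key != BLS_MODEL_KEY:
        return None
    space = env.get("TINY_BLS_CODE_SPACE", "").strip()
    # Expect exactly "<owner>/<space>"; treat anything else as not configured.
    owner, sep, name = space.partition("/")
    if not sep or not owner or not name or "/" in name:
        return None
    return space
```

The returned id could then be passed to `gradio_client.Client(space_id)` and called with `.predict(system, user, max_tokens, temperature, api_name="/generate")`, or with `.submit(...)` and `api_name="/generate_stream"` for streaming. Falling through when the variable is unset lets the main app ship the branch before this Space is live.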