---
title: Tiny Army BLS Mini-Code ZeroGPU
emoji: 🪖
colorFrom: indigo
colorTo: green
sdk: gradio
sdk_version: 6.15.2
app_file: app.py
pinned: false
suggested_hardware: zero-a10g
---

# Tiny Army: BLS Mini-Code 1.0 (ZeroGPU coding sidecar)

A ZeroGPU sidecar that serves [`CohereLabs/BLS-Mini-Code-1.0`](https://huggingface.co/CohereLabs/BLS-Mini-Code-1.0)
(30B MoE coding model) to the Tiny Army app via the same Gradio API the Mellum2 / Tiny Aya
sidecars expose.

## API contract (consumed by the main app's `gradio_client`)

- `POST /generate_stream`: args `(system, user, max_tokens: int, temperature: float)`, streams
  **cumulative** decoded text (the app diffs successive frames into deltas).
- `POST /generate`: same args, returns the final text in one shot.
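
Because `/generate_stream` sends cumulative text, the consumer has to turn successive frames into deltas. The sketch below shows one way that diffing could work; `frames_to_deltas` is a hypothetical helper name, not taken from the main app, and the fallback for a frame that does not extend the previous one is an assumption.

```python
from typing import Iterable, Iterator


def frames_to_deltas(frames: Iterable[str]) -> Iterator[str]:
    """Turn cumulative text frames into incremental deltas."""
    previous = ""
    for frame in frames:
        if frame.startswith(previous):
            # Normal case: the new frame extends the previous one.
            delta = frame[len(previous):]
        else:
            # Assumption: if earlier text was revised, re-emit the whole frame.
            delta = frame
        previous = frame
        if delta:
            yield delta


# Example with three cumulative frames, as the streaming endpoint would send them.
frames = ["def", "def add(", "def add(a, b):"]
deltas = list(frames_to_deltas(frames))
print(deltas)
```

Joining the deltas reproduces the last frame as long as every frame extends the one before it.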

## Config (Space → Settings → Variables)

| Var | Default | Notes |
|-----|---------|-------|
| `TINY_BLS_MODEL` | `CohereLabs/BLS-Mini-Code-1.0` | source repo |
| `TINY_BLS_QUANT` | `4bit` | `4bit` (~18GB) / `8bit` (~32GB) / `bf16` (~60GB, tight); no FP8 weights exist upstream, so we quantize at load |
| `TINY_BLS_GPU_DURATION` | `120` | ZeroGPU seconds per call |
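
As an illustration of how these variables might be read with their defaults, here is a minimal sketch. The variable names, defaults, and approximate VRAM figures come from the table above; `SidecarConfig`, `load_config`, and the validation rules are assumptions and may not match what `app.py` actually does.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Approximate VRAM needs in GB, copied from the config table.
APPROX_VRAM_GB = {"4bit": 18, "8bit": 32, "bf16": 60}


@dataclass(frozen=True)
class SidecarConfig:
    model: str
    quant: str
    gpu_duration: int  # ZeroGPU seconds per call


def load_config(env: Mapping[str, str] = os.environ) -> SidecarConfig:
    """Read the Space variables, falling back to the documented defaults."""
    quant = env.get("TINY_BLS_QUANT", "4bit").strip().lower()
    if quant not in APPROX_VRAM_GB:
        raise ValueError(
            f"TINY_BLS_QUANT must be one of {sorted(APPROX_VRAM_GB)}, got {quant!r}"
        )
    duration = int(env.get("TINY_BLS_GPU_DURATION", "120"))
    if duration <= 0:
        raise ValueError("TINY_BLS_GPU_DURATION must be a positive integer")
    return SidecarConfig(
        model=env.get("TINY_BLS_MODEL", "CohereLabs/BLS-Mini-Code-1.0"),
        quant=quant,
        gpu_duration=duration,
    )


# With no variables set, the documented defaults apply.
cfg = load_config({})
print(cfg, f"(~{APPROX_VRAM_GB[cfg.quant]}GB VRAM)")
```

Passing the environment in as a mapping keeps the function easy to exercise without touching the real Space settings.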

> **Hardware:** set the Space to a ZeroGPU tier with enough VRAM. The 30B model at 4-bit fits an A10G/H200
> ZeroGPU slice; `bf16`/`8bit` need the larger H200 slice. Adjust the `suggested_hardware:` field in the
> front matter above to the ZeroGPU flavor you provision.

## Wiring into the main app (later step)

Once this Space is live and the two endpoints respond, set `TINY_BLS_CODE_SPACE=<owner>/<space>`
in the main app, then add the routing branch and the `web/codingModel.js` entry (mirroring the Mellum2 setup).
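
Before wiring, a quick smoke test can confirm that both endpoints respond. The sketch below assumes the `gradio_client` Python package is installed and that the endpoints take the four positional arguments listed in the API contract; `resolve_space` and `smoke_test` are hypothetical helper names, and the prompts are placeholders.

```python
import os
from typing import Mapping, Optional


def resolve_space(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the configured `<owner>/<space>` id, or None if unset."""
    value = env.get("TINY_BLS_CODE_SPACE", "").strip()
    return value or None


def smoke_test(space: str) -> str:
    """Call both endpoints once and return the one-shot result."""
    # Imported lazily so this file loads even without gradio_client installed.
    from gradio_client import Client

    client = Client(space)
    # (system, user, max_tokens, temperature), per the API contract.
    args = ("You are a coding assistant.", "Write hello world in Python.", 64, 0.2)

    # /generate returns the final text in one shot.
    final_text = client.predict(*args, api_name="/generate")

    # /generate_stream yields cumulative frames; the last one is the full text.
    last_frame = ""
    for frame in client.submit(*args, api_name="/generate_stream"):
        last_frame = frame

    print(f"/generate: {len(final_text)} chars, /generate_stream: {len(last_frame)} chars")
    return final_text


if __name__ == "__main__":
    space = resolve_space()
    if space is None:
        print("TINY_BLS_CODE_SPACE is not set; skipping smoke test.")
    else:
        smoke_test(space)
```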