	---
	title: Tiny Army BLS Mini-Code ZeroGPU
	emoji: 🪖
	colorFrom: indigo
	colorTo: green
	sdk: gradio
	sdk_version: 6.15.2
	app_file: app.py
	pinned: false
	suggested_hardware: zero-a10g
	---

	# Tiny Army — BLS Mini-Code 1.0 (ZeroGPU coding sidecar)

	A ZeroGPU sidecar that serves [`CohereLabs/BLS-Mini-Code-1.0`](https://huggingface.co/CohereLabs/BLS-Mini-Code-1.0)
	(a 30B MoE coding model) to the Tiny Army app through the same Gradio API that the Mellum2 and
	Tiny Aya sidecars expose.

	## API contract (consumed by the main app's `gradio_client`)

	- `POST /generate_stream` — args `(system, user, max_tokens:int, temperature:float)`, streams
	cumulative decoded text (the app diffs successive frames into deltas).
	- `POST /generate` — same args, returns the final text in one shot.
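
Because `/generate_stream` emits cumulative text rather than increments, the consumer has to turn successive frames into deltas. A minimal sketch of that diffing step follows; the function name `frames_to_deltas` and the fallback for a frame that does not extend the previous one are assumptions for illustration, not the main app's actual code.

```python
from typing import Iterable, Iterator


def frames_to_deltas(frames: Iterable[str]) -> Iterator[str]:
    """Turn cumulative text frames into incremental deltas.

    Each frame is assumed to be the full text decoded so far.
    """
    previous = ""
    for frame in frames:
        if frame.startswith(previous):
            # Normal case: the new frame extends the previous one.
            delta = frame[len(previous):]
        else:
            # Assumption: a frame that rewrites earlier text is treated
            # as a reset and passed through whole.
            delta = frame
        previous = frame
        if delta:  # skip repeated frames that add nothing
            yield delta


if __name__ == "__main__":
    # Stand-in for frames received from the streaming endpoint.
    for piece in frames_to_deltas(["def", "def add", "def add(a, b):"]):
        print(repr(piece))
```

Any iterable of cumulative strings works as input, so the same helper can be exercised offline with a plain list before it is pointed at the live stream.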

	## Config (Space → Settings → Variables)

	| Var | Default | Notes |
	|-----|---------|-------|
	| `TINY_BLS_MODEL` | `CohereLabs/BLS-Mini-Code-1.0` | source repo |
	| `TINY_BLS_QUANT` | `4bit` | `4bit` (~18GB) / `8bit` (~32GB) / `bf16` (~60GB, tight); no FP8 weight exists upstream, so we quantize at load |
	| `TINY_BLS_GPU_DURATION` | `120` | ZeroGPU seconds per call |
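
As an illustration of how these variables might be read at startup, here is a small sketch. The variable names and defaults come from the table; the `load_config` helper, its validation rules, and the returned dictionary keys are hypothetical and may differ from what `app.py` actually does.

```python
import os
from typing import Mapping, Optional

VALID_QUANTS = ("4bit", "8bit", "bf16")


def load_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the sidecar settings, falling back to the documented defaults."""
    env = os.environ if env is None else env

    quant = env.get("TINY_BLS_QUANT", "4bit")
    if quant not in VALID_QUANTS:
        raise ValueError(f"TINY_BLS_QUANT must be one of {VALID_QUANTS}, got {quant!r}")

    # int() raises ValueError on non-numeric input, which is the desired failure.
    gpu_duration = int(env.get("TINY_BLS_GPU_DURATION", "120"))
    if gpu_duration <= 0:
        raise ValueError("TINY_BLS_GPU_DURATION must be a positive number of seconds")

    return {
        "model": env.get("TINY_BLS_MODEL", "CohereLabs/BLS-Mini-Code-1.0"),
        "quant": quant,
        "gpu_duration": gpu_duration,
    }


if __name__ == "__main__":
    print(load_config())
```

Passing a plain dictionary instead of relying on `os.environ` keeps the helper easy to test without touching the Space settings.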

	> Hardware: set the Space to a ZeroGPU tier with enough VRAM. 30B at 4-bit fits an A10G/H200
	> ZeroGPU slice; `bf16`/`8bit` need the larger H200 slice. Adjust the `suggested_hardware:` field
	> in the front matter above to the ZeroGPU flavor you provision.
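
The sizes in the config table can be sanity-checked with a weights-only estimate: parameters times bits per weight, divided by eight. The sketch below assumes exactly 30e9 parameters and counts all of them, since every expert of an MoE model has to be resident in memory. It gives 15, 30, and 60 GB for `4bit`, `8bit`, and `bf16`; the table's somewhat higher `4bit` and `8bit` figures presumably also cover quantization metadata, layers left unquantized, and runtime overhead.

```python
def weight_gb(n_params: float, bits_per_weight: int) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes); ignores KV cache and activations."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 30e9  # assumption: "30B" taken as exactly 30 billion parameters

if __name__ == "__main__":
    for name, bits in (("4bit", 4), ("8bit", 8), ("bf16", 16)):
        print(f"{name}: ~{weight_gb(N_PARAMS, bits):.0f} GB of weights")
```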

	## Wiring into the main app (later step)

	Once this Space is live and both endpoints respond, set `TINY_BLS_CODE_SPACE=<owner>/<space>`
	in the main app, then add the routing branch and the `web/codingModel.js` entry, mirroring the
	Mellum2 setup.