Build Small Hackathon With Cohere Models
The hackathon asks you to keep the total model size at or below 32 billion parameters and to ship a Gradio app on Hugging Face Spaces. Cohere's small open models fit that constraint:
- Tiny Aya: a 3.35B multilingual text generation family covering 70+ languages.
- Cohere Transcribe: a 2B automatic speech recognition model covering 14 languages.
Together, they are a good fit for local multilingual assistants, voice interfaces, accessibility tools, offline translation helpers, and small apps for real people.
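The size constraint is simple arithmetic. Here is a minimal sketch, assuming the parameter counts quoted above; the helper names `total_params_b` and `within_budget` are hypothetical and exist only for illustration:

```python
# Parameter counts in billions, taken from the model descriptions above.
MODEL_PARAMS_B = {
    "tiny-aya": 3.35,
    "cohere-transcribe": 2.0,
}

BUDGET_B = 32.0  # hackathon limit: total model size at or below 32B parameters


def total_params_b(models):
    """Sum the parameter counts (in billions) of the chosen models."""
    return sum(MODEL_PARAMS_B[name] for name in models)


def within_budget(models, budget_b=BUDGET_B):
    """Return True when the combined models fit the parameter budget."""
    return total_params_b(models) <= budget_b


stack = ["tiny-aya", "cohere-transcribe"]
print(f"{total_params_b(stack):.2f}B used, fits: {within_budget(stack)}")
```

Running both models together uses only a small part of the budget, which leaves room for other small models in the same app.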
Quick Start
If you only try one path, start with Tiny Aya GGUF through llama.cpp:
llama-server -hf CohereLabs/tiny-aya-global-GGUF:Q4_K_M
That starts a local OpenAI-compatible server and web UI. You can then point a small Gradio app, script, or frontend at http://localhost:8080/v1.
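If you want to call that endpoint without installing a client library, the following sketch uses only the Python standard library. It assumes the server exposes the usual OpenAI-style `/chat/completions` route under the base URL above and accepts a request without a `model` field; the function names `build_chat_request` and `ask` are illustrative, not part of any library:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # llama-server address from the text above


def build_chat_request(prompt, base_url=BASE_URL):
    """Build (but do not send) a chat completion request for the local server."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, base_url=BASE_URL, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("Say hello in Swahili.")` should return the model's reply as a string. If your server requires a `model` field, add it to the payload.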
For speech transcription, start with the native transformers path:
pip install "transformers>=5.4.0" torch huggingface_hub soundfile librosa sentencepiece protobuf
Then load CohereLabs/cohere-transcribe-03-2026 with AutoProcessor and CohereAsrForConditionalGeneration as shown below.
Pick A Tiny Aya Variant
Tiny Aya is a family of 3.35B multilingual language models. Use the region-specialized variants when you already know your app's audience, or use Global when you want the safest default.
| Model | Best For | Local GGUF Repo |
|---|---|---|
| tiny-aya-global | Best balance across languages and regions | tiny-aya-global-GGUF |
| tiny-aya-water | European and Asia-Pacific languages | tiny-aya-water-GGUF |
| tiny-aya-fire | South Asian languages | tiny-aya-fire-GGUF |
| tiny-aya-earth | West Asian and African languages | tiny-aya-earth-GGUF |
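If your app lets users pick a region, a small lookup keeps the repo ids in one place. This is a sketch: the variant names come from the table above and the `CohereLabs/` prefix from the commands in this guide, while the helper name `tiny_aya_repo` is hypothetical:

```python
# Variant names come from the table above; the "CohereLabs/" prefix matches
# the repo ids used in the commands elsewhere in this guide.
VARIANTS = ("global", "water", "fire", "earth")


def tiny_aya_repo(variant="global", gguf=True):
    """Return the Hugging Face repo id for a Tiny Aya variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    suffix = "-GGUF" if gguf else ""
    return f"CohereLabs/tiny-aya-{variant}{suffix}"


print(tiny_aya_repo("fire"))
print(tiny_aya_repo("global", gguf=False))
```

Defaulting to `global` mirrors the advice above: it is the safest choice when you do not know the audience yet.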
Tiny Aya Locally
llama.cpp
Install llama.cpp from your package manager or build it from source. Then run a local server:
llama-server -hf CohereLabs/tiny-aya-global-GGUF:Q4_K_M
Or run directly in the terminal:
llama-cli -hf CohereLabs/tiny-aya-global-GGUF:Q4_K_M \
-p "Write a friendly welcome message in Spanish, Arabic, and Swahili for a neighborhood garden app."
Swap global for water, fire, or earth if your project focuses on one region:
llama-server -hf CohereLabs/tiny-aya-fire-GGUF:Q4_K_M
Ollama
If you use Ollama, you can pull directly from Hugging Face:
ollama run hf.co/CohereLabs/tiny-aya-global-GGUF:Q4_K_M
Python With transformers
Use the non-GGUF repos when you want native PyTorch or Hugging Face transformers workflows:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereLabs/tiny-aya-global"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [
    {
        "role": "user",
        "content": "Explica en español qué significa la palabra japonesa 'ikigai' y da un ejemplo práctico.",
    }
]

# return_dict=True yields a BatchEncoding (input_ids and attention_mask)
# that can be unpacked straight into generate().
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.1,
    top_p=0.95,
)

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))
Python With llama-cpp-python
This is useful if you want Python control while still using the smaller GGUF files:
from llama_cpp import Llama

# Downloads the matching GGUF file from the Hub on first use.
llm = Llama.from_pretrained(
    repo_id="CohereLabs/tiny-aya-global-GGUF",
    filename="*Q4_K_M.gguf",
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "Give me three local-first app ideas for helping a multilingual family.",
        }
    ]
)
print(response["choices"][0]["message"]["content"])
Which Quantization Should I Use?
| File Type | Approx Size | Good For |
|---|---|---|
| Q4_0 | 2.03 GB | Lowest memory demos and phone or edge experiments |
| Q4_K_M | 2.14 GB | Best first choice for laptops and llama.cpp demos |
| Q8_0 | 3.57 GB | Better quality if you have more RAM |
| BF16 / F16 | 6.71 GB | Highest fidelity local runs, more memory required |
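The table can be turned into a rough selection rule. The sketch below uses the file sizes listed above; the 1.5x headroom factor is a hypothetical allowance for the KV cache and runtime buffers, not a measured figure, and `pick_quantization` is an illustrative name:

```python
# Approximate file sizes in GB, copied from the table above.
QUANT_SIZES_GB = {
    "Q4_0": 2.03,
    "Q4_K_M": 2.14,
    "Q8_0": 3.57,
    "BF16": 6.71,
}

# Hypothetical safety margin: runtime memory exceeds file size because of the
# KV cache and buffers. 1.5x is an illustrative guess, not a measured figure.
HEADROOM = 1.5


def pick_quantization(free_memory_gb, headroom=HEADROOM):
    """Return the largest quantization whose estimated footprint fits, or None."""
    fitting = [
        name
        for name, size_gb in QUANT_SIZES_GB.items()
        if size_gb * headroom <= free_memory_gb
    ]
    # Larger files generally preserve more quality, so prefer the biggest that fits.
    return max(fitting, key=QUANT_SIZES_GB.get) if fitting else None
```

Treat the result as a starting point and measure actual memory use on your target device, since context length changes the footprint.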
Cohere Transcribe Locally
Cohere Transcribe is a 2B dedicated audio-in, text-out ASR model. It supports Arabic, Chinese, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Spanish, and Vietnamese. It is Apache 2.0 licensed and is a strong fit for local voice interfaces.
Install the basics:
pip install "transformers>=5.4.0" torch huggingface_hub soundfile librosa sentencepiece protobuf
Transcribe a local audio file:
from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from transformers.audio_utils import load_audio
model_id = "CohereLabs/cohere-transcribe-03-2026"
audio_path = "voice_note.wav"
processor = AutoProcessor.from_pretrained(model_id)
model = CohereAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")
audio = load_audio(audio_path, sampling_rate=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="en")
inputs.to(model.device, dtype=model.dtype)
outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True)
print(text)
Use the language argument for non-English transcription:
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="ja")
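Because the model covers a fixed set of languages, it can help to validate the code before calling the processor. In this sketch the language names come from the list above, and the two-letter codes are ISO 639-1; that the processor accepts every one of these codes is an assumption, since this guide only shows some of them in use. The helper name `validate_language` is hypothetical:

```python
# Language names are from the supported-language list above. The codes are
# ISO 639-1; whether the processor accepts each one is an assumption.
SUPPORTED_LANGUAGES = {
    "ar": "Arabic",
    "zh": "Chinese",
    "nl": "Dutch",
    "en": "English",
    "fr": "French",
    "de": "German",
    "el": "Greek",
    "it": "Italian",
    "ja": "Japanese",
    "ko": "Korean",
    "pl": "Polish",
    "pt": "Portuguese",
    "es": "Spanish",
    "vi": "Vietnamese",
}


def validate_language(code):
    """Normalize a language code and reject ones outside the supported set."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(f"unsupported language {code!r}; expected one of: {supported}")
    return normalized
```

Calling `validate_language(user_choice)` before `processor(...)` turns a confusing downstream failure into a clear error message.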
Control punctuation:
inputs = processor(
    audio,
    sampling_rate=16000,
    return_tensors="pt",
    language="en",
    punctuation=False,
)
For long-form audio, the processor automatically chunks audio longer than the feature extractor's maximum clip length. Keep the returned audio_chunk_index and pass it back to processor.decode(...):
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="en")
# Save the chunk index before moving tensors; decode() needs it to reassemble chunks.
audio_chunk_index = inputs.get("audio_chunk_index")
inputs.to(model.device, dtype=model.dtype)
outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(
    outputs,
    skip_special_tokens=True,
    audio_chunk_index=audio_chunk_index,
    language="en",
)[0]
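To see why the chunk index has to travel with the outputs, here is a toy illustration in plain Python. It is not the processor's actual implementation: the index layout, the function names `chunk_batch` and `merge_transcripts`, and the chunk length are all hypothetical. It only shows the general idea of splitting clips into chunks and stitching per-chunk text back together:

```python
def chunk_batch(clips, max_len):
    """Split each clip into pieces of at most max_len samples.

    Returns the flat list of chunks plus a parallel index that records, for
    every chunk, which clip it came from and its position within that clip.
    """
    chunks, index = [], []
    for clip_id, samples in enumerate(clips):
        for position, start in enumerate(range(0, len(samples), max_len)):
            chunks.append(samples[start:start + max_len])
            index.append((clip_id, position))
    return chunks, index


def merge_transcripts(chunk_texts, index, num_clips):
    """Join per-chunk texts back into one transcript per original clip."""
    ordered = sorted(zip(index, chunk_texts))
    merged = [[] for _ in range(num_clips)]
    for (clip_id, _position), text in ordered:
        merged[clip_id].append(text)
    return [" ".join(parts) for parts in merged]
```

Without the index, the decoder would see a flat batch of chunk transcripts and have no way to tell which clip each one belongs to.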
For production-style local serving, use vLLM:
vllm serve CohereLabs/cohere-transcribe-03-2026 --trust-remote-code
Then call the local OpenAI-compatible audio endpoint:
curl -X POST http://localhost:8000/v1/audio/transcriptions \
-H "Authorization: Bearer $VLLM_API_KEY" \
-F "file=@$(realpath voice_note.wav)" \
-F "model=CohereLabs/cohere-transcribe-03-2026"
On Apple Silicon, also check the mlx-audio integration linked from the model card if you want a more device-native local path.
Gradio Snippets
Install the small UI dependencies:
pip install gradio openai
Chat UI For A Local llama.cpp Server
Start Tiny Aya locally first:
llama-server -hf CohereLabs/tiny-aya-global-GGUF:Q4_K_M
Then connect a tiny Gradio chat UI to the local OpenAI-compatible endpoint:
import gradio as gr
from openai import OpenAI

# llama-server does not check the key, but the client requires a value.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
model = "CohereLabs/tiny-aya-global-GGUF:Q4_K_M"

def chat(message, history):
    # history arrives as a list of {"role": ..., "content": ...} dicts.
    messages = list(history)
    messages.append({"role": "user", "content": message})
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.2,
    )
    return response.choices[0].message.content

demo = gr.ChatInterface(
    fn=chat,
    type="messages",
    title="Tiny Aya Local Chat",
    description="A multilingual chat UI backed by a local llama.cpp server.",
)
demo.launch()
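Gradio history entries may carry extra keys (for example message metadata) that the chat completions endpoint does not need. The sketch below shows one way to reduce each turn to `role` and `content` before sending it; `to_openai_messages` and its optional `system_prompt` parameter are hypothetical additions, not part of Gradio or the OpenAI client:

```python
def to_openai_messages(history, message, system_prompt=None):
    """Convert Gradio-style history plus the new user turn into chat messages."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    for turn in history:
        # Keep only the two fields the chat completions API needs.
        messages.append({"role": turn["role"], "content": turn["content"]})
    messages.append({"role": "user", "content": message})
    return messages
```

Inside `chat`, you could replace the two lines that build `messages` with `messages = to_openai_messages(history, message)`. This assumes each turn's `content` is plain text.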
Audio Input For Cohere Transcribe
This is a small sketch you can adapt into your app:
import gradio as gr
from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from transformers.audio_utils import load_audio

# Load once at startup so every request reuses the same model.
model_id = "CohereLabs/cohere-transcribe-03-2026"
processor = AutoProcessor.from_pretrained(model_id)
model = CohereAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")

def transcribe(audio_path, language):
    audio = load_audio(audio_path, sampling_rate=16000)
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language=language)
    inputs.to(model.device, dtype=model.dtype)
    outputs = model.generate(**inputs, max_new_tokens=256)
    return processor.decode(outputs, skip_special_tokens=True)

demo = gr.Interface(
    fn=transcribe,
    inputs=[
        gr.Audio(type="filepath", label="Voice note"),
        gr.Dropdown(["en", "es", "fr", "de", "ja", "ko", "zh", "ar"], value="en", label="Language"),
    ],
    outputs=gr.Textbox(label="Transcript"),
    title="Cohere Transcribe Local Demo",
)
demo.launch()
You can combine both snippets into a voice app: record audio, transcribe locally, pass the transcript to Tiny Aya, then show a multilingual response.
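One way to wire that pipeline together is to compose the two functions rather than merge their code. This sketch assumes a `transcribe(audio_path, language)` function that returns a plain string and a `chat(message, history)` function, matching the signatures used in the snippets above; `make_voice_assistant` is a hypothetical helper name:

```python
def make_voice_assistant(transcribe, chat):
    """Compose a transcription function and a chat function into one handler."""

    def respond(audio_path, language, history=None):
        transcript = transcribe(audio_path, language)
        if not transcript.strip():
            # Nothing intelligible was heard, so skip the model call.
            return transcript, ""
        reply = chat(transcript, history or [])
        return transcript, reply

    return respond
```

The returned `respond` function produces two values, so it can back a Gradio interface with two text outputs: one for the transcript and one for the reply. Passing the functions in as arguments also lets you test the flow with stubs before loading any model.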
Beyond gr.Interface
gr.Interface is the fastest way to a demo, but Gradio gives you a lot more for the same hackathon project:
- Streaming responses. Make your function a generator and yield partial output. For chat, gr.ChatInterface renders tokens as they arrive, which is much nicer than waiting for the full reply. See Streaming Outputs.
- Rich chatbots. gr.ChatInterface supports multimodal input, message metadata, retry/undo, and example prompts out of the box. See Creating a Chatbot Fast.
- Custom layouts. gr.Blocks lets you arrange rows, columns, tabs, and accordions, and wire one component's output to another's input, e.g. transcribe audio in one panel and feed it to Tiny Aya in the next. See Blocks and Event Listeners.
- Themes and examples. Ship a polished look with Theming and give judges one-click Examples.
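The streaming pattern can be sketched without any model. Each value a Gradio generator yields replaces what is currently displayed, so the function should yield the accumulated text rather than the latest token alone. The function name `stream_reply` and the `delay` parameter below are illustrative:

```python
import time


def stream_reply(tokens, delay=0.0):
    """Yield the accumulated text after each token, as a streaming handler would."""
    text = ""
    for token in tokens:
        text += token
        if delay:
            time.sleep(delay)  # simulate model latency in a demo
        yield text


for partial in stream_reply(["Ho", "la", "!"]):
    print(partial)
```

In a real app, the loop over `tokens` would iterate over whatever your model backend streams, and the generator itself would be the function you hand to Gradio.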
Custom Frontends With gr.Server
When you want a fully custom HTML/JS frontend — your own branding, layout, and chat UI — but still want Gradio's request queue, ZeroGPU lifecycle, and a Python/JS SDK, reach for gr.Server. The idea: bring any UI you like, and let gr.Server be the engine behind it. It splits cleanly into two route types — @server.api for anything that touches the model (queued, GPU-aware, discoverable by the clients), and plain @server.get/@server.post for static content like your homepage HTML.
Here is a Tiny Aya backend that serves its own page and exposes one streaming chat route. Streaming is the modern-chatbot must-have, and it's just a generator that yields the accumulated text:
import threading

import gradio as gr
from fastapi.responses import HTMLResponse
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

try:
    import spaces
    _HAS_SPACES = True
except ImportError:
    _HAS_SPACES = False

# Placeholder page so the example runs; replace with your custom single-page app.
FRONTEND_HTML = "<!doctype html><title>Tiny Aya</title><div id='chat'></div>"

model_id = "CohereLabs/tiny-aya-global"  # gated model -> set HF_TOKEN as a Space secret
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
model.to("cuda")  # module level, so ZeroGPU can fast-restore the placement

def _stream(messages: list):
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",  # return_dict=True: transformers 5.x
    ).to(model.device)  # returns a BatchEncoding, unpack it
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # generate() blocks, so run it in a thread and read tokens from the streamer.
    threading.Thread(
        target=model.generate,
        kwargs=dict(**inputs, max_new_tokens=512, do_sample=True,
                    temperature=0.3, streamer=streamer),
    ).start()
    acc = ""
    for token in streamer:
        acc += token
        yield acc  # each yield streams to the browser

if _HAS_SPACES:
    _stream = spaces.GPU(duration=120)(_stream)  # @spaces.GPU on the inner function

server = gr.Server()

@server.get("/", response_class=HTMLResponse)
async def homepage() -> str:
    return FRONTEND_HTML

@server.api(name="chat")
def chat_api(messages: list) -> str:  # generator -> annotate with the YIELDED type
    yield from _stream(messages)

if __name__ == "__main__":
    server.launch(server_name="0.0.0.0", server_port=7860)
On the frontend, use the JS client's submit() (not predict()) to consume the stream — each data event carries the latest accumulated text, which is what produces the token-by-token typing effect:
<script type="module">
import { Client } from "https://cdn.jsdelivr.net/npm/@gradio/client/dist/index.min.js";
const client = await Client.connect(window.location.origin);
const history = [{ role: "user", content: "Hola, ¿qué tal?" }];
const job = client.submit("/chat", { messages: history });
for await (const msg of job) {
if (msg.type === "data") render(msg.data[0]); // accumulated text so far
}
</script>
A few things that bite people (all covered in the Server Mode guide):
- Return-type annotation is load-bearing. A generator must be annotated with its yielded type (-> str here). Without it Gradio registers zero outputs and silently drops every chunk, so the UI just hangs.
- @spaces.GPU goes on the inner model function, not the @server.api route.
- Gated models need a token. tiny-aya-global is gated, so add HF_TOKEN as a Space secret (transformers picks it up automatically) and accept the model's terms on its page first.
- apply_chat_template(..., tokenize=True) returns a BatchEncoding on transformers 5.x, so pass return_dict=True and unpack with **inputs; model.generate(inputs, ...) on the raw object raises AttributeError: shape.
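The first pitfall is easy to catch before launch with a small pre-flight check. This is a sketch using the standard `inspect` module; it does not reproduce how Gradio inspects routes internally, and the helper names `missing_return_annotation` and `unannotated_generators` are hypothetical:

```python
import inspect


def missing_return_annotation(fn):
    """Return True when fn declares no return annotation."""
    return inspect.signature(fn).return_annotation is inspect.Signature.empty


def unannotated_generators(*functions):
    """Return the names of generator functions that lack a return annotation."""
    return [
        fn.__name__
        for fn in functions
        if inspect.isgeneratorfunction(fn) and missing_return_annotation(fn)
    ]


def good_route(messages: list) -> str:
    yield "partial text"


def bad_route(messages: list):
    yield "partial text"


print(unannotated_generators(good_route, bad_route))
```

Running a check like this on your route functions before registering them turns a silent hang into a visible list of functions to fix.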
Full implementation — the complete custom frontend (streaming chat UI, conversation history, Cohere Labs + Build Small branding) lives in the companion Space. Read its app.py for the whole thing, or try it live: tiny-aya-build-small-sample. For file uploads, custom URLs, and ZeroGPU duration tuning, see the JS client guide.
Resources
- Hackathon: Build Small Hackathon
- Tiny Aya demo: CohereLabs/tiny-aya Space
- Tiny Aya model cards: global, water, fire, earth
- Tiny Aya GGUF repos: global-GGUF, water-GGUF, fire-GGUF, earth-GGUF
- Cohere Transcribe model: CohereLabs/cohere-transcribe-03-2026
- Cohere Transcribe browser demo: Cohere-Transcribe-WebGPU
- Cohere Transcribe release blog: Introducing Cohere-transcribe
- Tiny Aya + gr.Server sample app: tiny-aya-build-small-sample (app.py)
- Gradio docs: gradio.app/docs
- Gradio Server intro: Introducing gr.Server
- Gradio Server mode guide: gradio.app/guides/server-mode
- Gradio JS client: gradio.app/guides/getting-started-with-the-js-client
- Gradio streaming: gradio.app/guides/streaming-outputs
- llama.cpp: github.com/ggml-org/llama.cpp