Build Small Hackathon With Cohere Models

Community Article Published June 4, 2026

This guide is for builders joining the Build Small Hackathon.

The hackathon asks you to keep the total model size at or below 32 billion parameters and to ship a Gradio app on Hugging Face Spaces. Cohere's small open models fit that constraint:

  • Tiny Aya: a 3.35B multilingual text generation family covering 70+ languages.
  • Cohere Transcribe: a 2B automatic speech recognition model covering 14 languages.

Together, they are a good fit for local multilingual assistants, voice interfaces, accessibility tools, offline translation helpers, and small apps for real people.
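
As a quick sanity check on the size limit, the sketch below sums the parameter counts quoted above and compares them with the 32B cap. The dictionary and helper names are illustrative only and are not part of any hackathon tooling.

```python
# Parameter counts in billions, as quoted in this guide.
MODEL_PARAMS_B = {
    "tiny-aya": 3.35,
    "cohere-transcribe": 2.0,
}

PARAM_BUDGET_B = 32.0  # hackathon cap on total model size


def total_params_b(models: dict) -> float:
    """Sum the parameter counts (in billions) of every model in the app."""
    return sum(models.values())


def fits_budget(models: dict, budget_b: float = PARAM_BUDGET_B) -> bool:
    """Return True when the combined model size is at or below the budget."""
    return total_params_b(models) <= budget_b


if fits_budget(MODEL_PARAMS_B):
    print(f"{total_params_b(MODEL_PARAMS_B):.2f}B of {PARAM_BUDGET_B:.0f}B used")
```

Running both models side by side leaves most of the budget free, so there is room to add another small model if your idea needs one.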

Quick Start

If you only try one path, start with Tiny Aya GGUF through llama.cpp:

llama-server -hf CohereLabs/tiny-aya-global-GGUF:Q4_K_M

That starts a local OpenAI-compatible server and web UI. You can then point a small Gradio app, script, or frontend at http://localhost:8080/v1.
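
If you would rather not install a client library, a script can talk to that endpoint with the standard library alone. This is a minimal sketch: the helper names are made up for illustration, and it assumes the server accepts a chat completion request without a model field (add one if your setup requires it).

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # llama-server address used in this guide


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, base_url: str = BASE_URL) -> str:
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# With llama-server running locally:
#   print(ask("Say hello in Swahili."))
```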

For speech transcription, start with the native transformers path:

pip install "transformers>=5.4.0" torch huggingface_hub soundfile librosa sentencepiece protobuf

Then load CohereLabs/cohere-transcribe-03-2026 with AutoProcessor and CohereAsrForConditionalGeneration as shown below.

Pick A Tiny Aya Variant

Tiny Aya is a family of 3.35B multilingual language models. Use the region-specialized variants when you already know your app's audience, or use Global when you want the safest default.

Model | Best For | Local GGUF Repo
tiny-aya-global | Best balance across languages and regions | tiny-aya-global-GGUF
tiny-aya-water | European and Asia-Pacific languages | tiny-aya-water-GGUF
tiny-aya-fire | South Asian languages | tiny-aya-fire-GGUF
tiny-aya-earth | West Asian and African languages | tiny-aya-earth-GGUF
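
If your app picks a variant at startup, the table can be encoded as a small lookup. The region keys below are hypothetical labels chosen for this sketch; only the variant and repo names come from the table.

```python
from typing import Optional

# Hypothetical region labels mapped to the variant names from the table.
VARIANT_BY_REGION = {
    "europe": "water",
    "asia_pacific": "water",
    "south_asia": "fire",
    "west_asia": "earth",
    "africa": "earth",
}


def gguf_repo(region: Optional[str] = None) -> str:
    """Return the GGUF repo for a region, falling back to the Global variant."""
    variant = VARIANT_BY_REGION.get(region, "global")
    return f"CohereLabs/tiny-aya-{variant}-GGUF"


def hf_model_arg(region: Optional[str] = None, quant: str = "Q4_K_M") -> str:
    """Build the repo:quantization string in the form used after -hf in this guide."""
    return f"{gguf_repo(region)}:{quant}"


print(hf_model_arg("south_asia"))
```

Unknown or missing regions fall back to Global, which matches the advice to treat it as the safest default.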

Tiny Aya Locally

llama.cpp

Install llama.cpp from your package manager or build it from source. Then run a local server:

llama-server -hf CohereLabs/tiny-aya-global-GGUF:Q4_K_M

Or run directly in the terminal:

llama-cli -hf CohereLabs/tiny-aya-global-GGUF:Q4_K_M \
  -p "Write a friendly welcome message in Spanish, Arabic, and Swahili for a neighborhood garden app."

Swap global for water, fire, or earth if your project focuses on one region:

llama-server -hf CohereLabs/tiny-aya-fire-GGUF:Q4_K_M

Ollama

If you use Ollama, you can pull directly from Hugging Face:

ollama run hf.co/CohereLabs/tiny-aya-global-GGUF:Q4_K_M

Python With transformers

Use the non-GGUF repos when you want native PyTorch or Hugging Face transformers workflows:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereLabs/tiny-aya-global"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [
    {
        "role": "user",
        "content": "Explica en español qué significa la palabra japonesa 'ikigai' y da un ejemplo práctico.",
    }
]

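inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,  # return a dict-like BatchEncoding that can be unpacked below
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.1,
    top_p=0.95,
)

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))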

Python With llama-cpp-python

This is useful if you want Python control while still using the smaller GGUF files:

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="CohereLabs/tiny-aya-global-GGUF",
    filename="*Q4_K_M.gguf",
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "Give me three local-first app ideas for helping a multilingual family.",
        }
    ]
)

print(response["choices"][0]["message"]["content"])

Which Quantization Should I Use?

File Type | Approx Size | Good For
Q4_0 | 2.03 GB | Lowest memory demos and phone or edge experiments
Q4_K_M | 2.14 GB | Best first choice for laptops and llama.cpp demos
Q8_0 | 3.57 GB | Better quality if you have more RAM
BF16 / F16 | 6.71 GB | Highest fidelity local runs, more memory required
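
One way to turn the table into a decision is to pick the largest file that fits in the memory you have free. The sketch below does that with the sizes listed above. The 1.5x headroom multiplier is an assumption to leave room for the context cache and runtime overhead, not a measured figure, so tune it for your machine.

```python
from typing import Optional

# Approximate file sizes in GB, copied from the table above, smallest first.
QUANT_SIZES_GB = [
    ("Q4_0", 2.03),
    ("Q4_K_M", 2.14),
    ("Q8_0", 3.57),
    ("BF16", 6.71),
]

HEADROOM = 1.5  # assumed multiplier for context cache and runtime overhead


def pick_quant(free_memory_gb: float) -> Optional[str]:
    """Return the largest quantization whose padded size fits, or None if none do."""
    best = None
    for name, size_gb in QUANT_SIZES_GB:
        if size_gb * HEADROOM <= free_memory_gb:
            best = name  # list is sorted by size, so later matches are larger
    return best


print(pick_quant(8.0))
```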

Cohere Transcribe Locally

Cohere Transcribe is a 2B dedicated audio-in, text-out ASR model. It supports Arabic, Chinese, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Spanish, and Vietnamese. It is Apache 2.0 licensed and is a strong fit for local voice interfaces.

Install the basics:

pip install "transformers>=5.4.0" torch huggingface_hub soundfile librosa sentencepiece protobuf

Transcribe a local audio file:

from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from transformers.audio_utils import load_audio

model_id = "CohereLabs/cohere-transcribe-03-2026"
audio_path = "voice_note.wav"

processor = AutoProcessor.from_pretrained(model_id)
model = CohereAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")

audio = load_audio(audio_path, sampling_rate=16000)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="en")
inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True)

print(text)

Use the language argument for non-English transcription:

inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="ja")

Control punctuation:

inputs = processor(
    audio,
    sampling_rate=16000,
    return_tensors="pt",
    language="en",
    punctuation=False,
)

For long-form audio, the processor automatically chunks audio longer than the feature extractor's maximum clip length. Keep the returned audio_chunk_index and pass it back to processor.decode(...):

inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="en")
audio_chunk_index = inputs.get("audio_chunk_index")
inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(
    outputs,
    skip_special_tokens=True,
    audio_chunk_index=audio_chunk_index,
    language="en",
)[0]
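
To see why a chunk index has to travel with the outputs, here is a conceptual sketch in plain Python. It is not the processor's implementation: the 30-second clip length, the function names, and the (chunk_index, text) pair format are all assumptions made for illustration.

```python
def split_into_chunks(num_samples, sampling_rate=16000, max_clip_seconds=30.0):
    """Return (start, end) sample ranges that cover the audio in order."""
    chunk_len = int(sampling_rate * max_clip_seconds)
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


def stitch_transcripts(pieces):
    """Join per-chunk transcripts in chunk order.

    pieces is a list of (chunk_index, text) pairs that may arrive out of order.
    """
    ordered = sorted(pieces)
    return " ".join(text.strip() for _, text in ordered if text.strip())


# 70 seconds of 16 kHz audio splits into three ranges.
print(split_into_chunks(16000 * 70))
print(stitch_transcripts([(1, "world"), (0, "hello ")]))
```

Without the index, the decoder would have no way to tell which pieces belong to the same recording or in what order to join them.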

For production-style local serving, use vLLM:

vllm serve CohereLabs/cohere-transcribe-03-2026 --trust-remote-code

Then call the local OpenAI-compatible audio endpoint:

curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -F "file=@$(realpath voice_note.wav)" \
  -F "model=CohereLabs/cohere-transcribe-03-2026"
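
The same request can be made from Python. This sketch assumes the openai package's audio transcription call, with the client passed in so the helper itself has no network dependency. The function name is made up for illustration.

```python
MODEL_ID = "CohereLabs/cohere-transcribe-03-2026"


def transcribe_file(client, audio_path, model=MODEL_ID):
    """Upload one audio file to the transcription endpoint and return its text."""
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text


# Usage with a running vLLM server (requires `pip install openai`):
#   import os
#   from openai import OpenAI
#   client = OpenAI(
#       base_url="http://localhost:8000/v1",
#       api_key=os.environ.get("VLLM_API_KEY", "not-needed"),
#   )
#   print(transcribe_file(client, "voice_note.wav"))
```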

On Apple Silicon, also check the mlx-audio integration linked from the model card if you want a more device-native local path.

Gradio Snippets

Install the small UI dependencies:

pip install gradio openai

Chat UI For A Local llama.cpp Server

Start Tiny Aya locally first:

llama-server -hf CohereLabs/tiny-aya-global-GGUF:Q4_K_M

Then connect a tiny Gradio chat UI to the local OpenAI-compatible endpoint:

import gradio as gr
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
model = "CohereLabs/tiny-aya-global-GGUF:Q4_K_M"


def chat(message, history):
    messages = list(history)
    messages.append({"role": "user", "content": message})

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.2,
    )
    return response.choices[0].message.content


demo = gr.ChatInterface(
    fn=chat,
    type="messages",
    title="Tiny Aya Local Chat",
    description="A multilingual chat UI backed by a local llama.cpp server.",
)

demo.launch()
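
Gradio's history entries may carry extra keys beyond role and content (for example metadata), and some servers are stricter than others about unknown fields. As a defensive variation, the hypothetical helper below keeps only the two keys the chat completions format needs. It assumes each history entry is a dict whose content is plain text.

```python
def to_openai_messages(history, message):
    """Strip history entries down to role/content and append the new user turn."""
    messages = [
        {"role": turn["role"], "content": turn["content"]}
        for turn in history
    ]
    messages.append({"role": "user", "content": message})
    return messages


history = [
    {"role": "user", "content": "Hola", "metadata": None},
    {"role": "assistant", "content": "¡Hola! ¿En qué te ayudo?", "metadata": None},
]
print(to_openai_messages(history, "Translate that to Swahili."))
```

Inside chat(), you could build the request with messages = to_openai_messages(history, message) in place of copying the history list directly.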

Audio Input For Cohere Transcribe

This is a small sketch you can adapt into your app:

import gradio as gr
from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from transformers.audio_utils import load_audio

model_id = "CohereLabs/cohere-transcribe-03-2026"
processor = AutoProcessor.from_pretrained(model_id)
model = CohereAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")


def transcribe(audio_path, language):
    audio = load_audio(audio_path, sampling_rate=16000)
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language=language)
    inputs.to(model.device, dtype=model.dtype)
    outputs = model.generate(**inputs, max_new_tokens=256)
    return processor.decode(outputs, skip_special_tokens=True)


demo = gr.Interface(
    fn=transcribe,
    inputs=[
        gr.Audio(type="filepath", label="Voice note"),
        gr.Dropdown(["en", "es", "fr", "de", "ja", "ko", "zh", "ar"], value="en", label="Language"),
    ],
    outputs=gr.Textbox(label="Transcript"),
    title="Cohere Transcribe Local Demo",
)

demo.launch()

You can combine both snippets into a voice app: record audio, transcribe locally, pass the transcript to Tiny Aya, then show a multilingual response.
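
A minimal sketch of that pipeline follows. It takes the two steps as function arguments so it can be exercised with stubs. In a real app you would pass the transcribe and chat functions from the snippets above, and this sketch assumes the transcription step returns a plain string.

```python
def voice_reply(audio_path, language, transcribe_fn, chat_fn):
    """Transcribe a recording, then ask the chat model to respond to it.

    Returns a (transcript, reply) pair so the UI can show both.
    """
    transcript = transcribe_fn(audio_path, language)
    if not transcript.strip():
        # Nothing was recognized, so skip the chat call.
        return "", ""
    reply = chat_fn(transcript, [])  # empty history: each voice note is one turn
    return transcript, reply


# Stub functions stand in for the real models here.
def fake_transcribe(audio_path, language):
    return "hello there"


def fake_chat(message, history):
    return f"You said: {message}"


print(voice_reply("voice_note.wav", "en", fake_transcribe, fake_chat))
```

Returning both strings makes it easy to wire the function to two output components, one for the transcript and one for the reply.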

Beyond gr.Interface

gr.Interface is the fastest way to a demo, but Gradio gives you a lot more for the same hackathon project:

  • Streaming responses. Make your function a generator and yield partial output. For chat, gr.ChatInterface renders tokens as they arrive — much nicer than waiting for the full reply. See Streaming Outputs.
  • Rich chatbots. gr.ChatInterface supports multimodal input, message metadata, retry/undo, and example prompts out of the box. See Creating a Chatbot Fast.
  • Custom layouts. gr.Blocks lets you arrange rows, columns, tabs, and accordions, and wire one component's output to another's input — e.g. transcribe audio in one panel and feed it to Tiny Aya in the next. See Blocks and Event Listeners.
  • Themes and examples. Ship a polished look with Theming and give judges one-click Examples.
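
The streaming pattern from the first bullet can be sketched without any model at all. The generator below yields the accumulated text after each piece, which is the shape a streaming chat function returns. The echo_chat function is a stand-in that streams the user's own words back and is purely illustrative.

```python
from typing import Iterable, Iterator


def accumulate(pieces: Iterable[str]) -> Iterator[str]:
    """Yield the running concatenation of the pieces seen so far."""
    text = ""
    for piece in pieces:
        text += piece
        yield text


def echo_chat(message, history):
    """Stand-in chat function: streams the message back one word at a time."""
    words = message.split()
    pieces = (word if i == 0 else " " + word for i, word in enumerate(words))
    yield from accumulate(pieces)


for partial in echo_chat("hola mundo", []):
    print(partial)
```

To stream from a real model, replace the word list with whatever yields tokens in your app and keep the accumulate step unchanged.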

Custom Frontends With gr.Server

When you want a fully custom HTML/JS frontend — your own branding, layout, and chat UI — but still want Gradio's request queue, ZeroGPU lifecycle, and a Python/JS SDK, reach for gr.Server. The idea: bring any UI you like, and let gr.Server be the engine behind it. It splits cleanly into two route types — @server.api for anything that touches the model (queued, GPU-aware, discoverable by the clients), and plain @server.get/@server.post for static content like your homepage HTML.

Here is a Tiny Aya backend that serves its own page and exposes one streaming chat route. Streaming is the modern-chatbot must-have, and it's just a generator that yields the accumulated text:

import threading
import gradio as gr
from fastapi.responses import HTMLResponse
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

try:
    import spaces
    _HAS_SPACES = True
except ImportError:
    _HAS_SPACES = False

model_id = "CohereLabs/tiny-aya-global"   # gated model -> set HF_TOKEN as a Space secret
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
model.to("cuda")  # module level, so ZeroGPU can fast-restore the placement


def _stream(messages: list):
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",   # return_dict=True: transformers 5.x
    ).to(model.device)                            # returns a BatchEncoding, unpack it

    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    threading.Thread(
        target=model.generate,
        kwargs=dict(**inputs, max_new_tokens=512, do_sample=True,
                    temperature=0.3, streamer=streamer),
    ).start()

    acc = ""
    for token in streamer:
        acc += token
        yield acc                                # each yield streams to the browser


if _HAS_SPACES:
    _stream = spaces.GPU(duration=120)(_stream)  # @spaces.GPU on the inner function

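# Placeholder page so the snippet runs as-is; replace with your custom single-page app.
FRONTEND_HTML = "<!doctype html><title>Tiny Aya</title><div id='app'></div>"

server = gr.Server()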


@server.get("/", response_class=HTMLResponse)
async def homepage() -> str:
    return FRONTEND_HTML            # your custom single-page app


@server.api(name="chat")
def chat_api(messages: list) -> str:   # generator -> annotate with the YIELDED type
    yield from _stream(messages)


if __name__ == "__main__":
    server.launch(server_name="0.0.0.0", server_port=7860)

On the frontend, use the JS client's submit() (not predict()) to consume the stream — each data event carries the latest accumulated text, which is what produces the token-by-token typing effect:

<script type="module">
import { Client } from "https://cdn.jsdelivr.net/npm/@gradio/client/dist/index.min.js";
const client = await Client.connect(window.location.origin);

const history = [{ role: "user", content: "Hola, ¿qué tal?" }];
const job = client.submit("/chat", { messages: history });
for await (const msg of job) {
  if (msg.type === "data") render(msg.data[0]);   // accumulated text so far
}
</script>

A few things that bite people (all covered in the Server Mode guide):

  • Return-type annotation is load-bearing. A generator must be annotated with its yielded type (-> str here). Without it Gradio registers zero outputs and silently drops every chunk — the UI just hangs.
  • @spaces.GPU goes on the inner model function, not the @server.api route.
  • Gated models need a token. tiny-aya-global is gated, so add HF_TOKEN as a Space secret (transformers picks it up automatically) and accept the model's terms on its page first.
  • apply_chat_template(..., tokenize=True) returns a BatchEncoding on transformers 5.x, so pass return_dict=True and unpack with **inputs. Calling model.generate(inputs, ...) on the raw object raises AttributeError: shape.
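
For the gated-model point, it can help to fail fast with a readable message instead of a download error deep inside model loading. The helper below is a hypothetical sketch that only checks the HF_TOKEN environment variable. It does not verify that the token is valid or that the model's terms were accepted.

```python
import os
from typing import Mapping


def require_hf_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the HF_TOKEN value, or raise with a hint on how to set it."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Add it as a Space secret (or export it locally) "
            "and accept the model's terms on its page before loading."
        )
    return token


# Call this before from_pretrained() in your app:
#   require_hf_token()
```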

Full implementation — the complete custom frontend (streaming chat UI, conversation history, Cohere Labs + Build Small branding) lives in the companion Space. Read its app.py for the whole thing, or try it live: tiny-aya-build-small-sample. For file uploads, custom URLs, and ZeroGPU duration tuning, see the JS client guide.
