Build Small Hackathon With Cohere Models

Community Article

Published June 4, 2026

Alejandro Rodriguez

alexrs

CohereLabs

This guide is for builders joining the Build Small Hackathon.

The hackathon asks you to keep the total model size at or below 32 billion parameters and to ship a Gradio app on Hugging Face Spaces. Cohere's small open models fit that constraint:

Tiny Aya: a 3.35B multilingual text generation family covering 70+ languages.
Cohere Transcribe: a 2B automatic speech recognition model covering 14 languages.

Together, they are a good fit for local multilingual assistants, voice interfaces, accessibility tools, offline translation helpers, and small apps for real people.
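To see how much of the 32B budget this pair uses, you can add up the parameter counts. The sketch below assumes the cap applies to the sum of every model your app loads, which is one reading of the rule, so check the hackathon page for the exact wording. The dictionary keys are illustrative labels, not repo names.

```python
# Parameter counts in billions, taken from the model descriptions above.
MODEL_PARAMS_B = {
    "tiny-aya": 3.35,
    "cohere-transcribe": 2.0,
}

BUDGET_B = 32.0  # hackathon cap on total model size (assumed to be a sum)


def total_params_b(models):
    """Sum the parameter counts (in billions) for the given model labels."""
    return sum(MODEL_PARAMS_B[name] for name in models)


stack = ["tiny-aya", "cohere-transcribe"]
total = total_params_b(stack)
print(f"{total:.2f}B of {BUDGET_B:.0f}B used, {BUDGET_B - total:.2f}B to spare")
```

Running both models together leaves most of the budget free, so there is room for a reranker, an embedding model, or a second Tiny Aya variant.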

Quick Start

If you only try one path, start with Tiny Aya GGUF through llama.cpp:

llama-server -hf CohereLabs/tiny-aya-global-GGUF:Q4_K_M

That starts a local OpenAI-compatible server and web UI, listening on port 8080 by default. You can then point a small Gradio app, script, or frontend at http://localhost:8080/v1.
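If you want to test the endpoint before adding any dependencies, here is a minimal sketch that uses only the Python standard library. It assumes llama-server is running on its default port and that the server accepts a request without a model field because it serves a single loaded model. The helper names build_chat_request and ask are made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumes llama-server's default port


def build_chat_request(prompt, base_url=BASE_URL):
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, print(ask("Say hello in Spanish and Swahili.")) should print the model's reply. Building the request separately from sending it makes the payload easy to inspect while you debug.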

For speech transcription, start with the native transformers path:

pip install "transformers>=5.4.0" torch huggingface_hub soundfile librosa sentencepiece protobuf

Then load CohereLabs/cohere-transcribe-03-2026 with AutoProcessor and CohereAsrForConditionalGeneration as shown below.

Pick A Tiny Aya Variant

Tiny Aya is a family of 3.35B multilingual language models. Use the region-specialized variants when you already know your app's audience, or use Global when you want the safest default.

Model	Best For	Local GGUF Repo
tiny-aya-global	Best balance across languages and regions	tiny-aya-global-GGUF
tiny-aya-water	European and Asia-Pacific languages	tiny-aya-water-GGUF
tiny-aya-fire	South Asian languages	tiny-aya-fire-GGUF
tiny-aya-earth	West Asian and African languages	tiny-aya-earth-GGUF
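If your app lets users or a config file choose the variant, a small helper can keep the repo names consistent. This is a sketch that assumes every variant follows the CohereLabs/tiny-aya-NAME and CohereLabs/tiny-aya-NAME-GGUF naming shown in the table. The function name tiny_aya_repos and the dictionary keys are hypothetical.

```python
VARIANTS = ("global", "water", "fire", "earth")


def tiny_aya_repos(variant="global", quant="Q4_K_M"):
    """Return the repo identifiers for one Tiny Aya variant.

    Assumes all variants share the naming pattern used for tiny-aya-global.
    """
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    base = f"CohereLabs/tiny-aya-{variant}"
    return {
        "transformers": base,                    # native PyTorch weights
        "gguf": f"{base}-GGUF",                  # quantized files for llama.cpp
        "llama_cpp_hf": f"{base}-GGUF:{quant}",  # value for the -hf flag
    }


print(tiny_aya_repos("fire")["llama_cpp_hf"])
```

Falling back to "global" as the default matches the advice above: use it unless you already know your audience.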

Tiny Aya Locally

llama.cpp

Install llama.cpp from your package manager or build it from source. Then run a local server:

llama-server -hf CohereLabs/tiny-aya-global-GGUF:Q4_K_M

Or run directly in the terminal:

llama-cli -hf CohereLabs/tiny-aya-global-GGUF:Q4_K_M \
  -p "Write a friendly welcome message in Spanish, Arabic, and Swahili for a neighborhood garden app."

Swap global for water, fire, or earth if your project focuses on one region:

llama-server -hf CohereLabs/tiny-aya-fire-GGUF:Q4_K_M

Ollama

If you use Ollama, you can pull directly from Hugging Face:

ollama run hf.co/CohereLabs/tiny-aya-global-GGUF:Q4_K_M

Python With transformers

Use the non-GGUF repos when you want native PyTorch or Hugging Face transformers workflows:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereLabs/tiny-aya-global"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [
    {
        "role": "user",
        "content": "Explica en español qué significa la palabra japonesa 'ikigai' y da un ejemplo práctico.",
    }
]

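inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,  # transformers 5.x returns a BatchEncoding, so unpack it below
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.1,
    top_p=0.95,
)

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))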

Python With llama-cpp-python

This is useful if you want Python control while still using the smaller GGUF files:

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="CohereLabs/tiny-aya-global-GGUF",
    filename="*Q4_K_M.gguf",
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "Give me three local-first app ideas for helping a multilingual family.",
        }
    ]
)

print(response["choices"][0]["message"]["content"])

Which Quantization Should I Use?

File Type	Approx. Size	Good For
Q4_0	2.03 GB	Lowest-memory demos and phone or edge experiments
Q4_K_M	2.14 GB	Best first choice for laptops and llama.cpp demos
Q8_0	3.57 GB	Better quality if you have more RAM
BF16 / F16	6.71 GB	Highest-fidelity local runs, more memory required
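One rough way to choose is to take the largest file that fits in free memory with some room left over. The sketch below uses the approximate sizes from the table. The 1.5 GB headroom for the context cache and runtime is an assumption, not a measured figure, and pick_quant is a hypothetical helper.

```python
# (name, approximate file size in GB) from the table above, smallest first.
QUANTS = [
    ("Q4_0", 2.03),
    ("Q4_K_M", 2.14),
    ("Q8_0", 3.57),
    ("BF16", 6.71),
]


def pick_quant(free_ram_gb, headroom_gb=1.5):
    """Return the largest quant that fits with headroom, or None if none fit.

    headroom_gb is a guess at what the runtime and context cache need.
    """
    budget = free_ram_gb - headroom_gb
    fitting = [name for name, size in QUANTS if size <= budget]
    return fitting[-1] if fitting else None


for ram in (3, 4, 8, 16):
    print(f"{ram} GB free -> {pick_quant(ram)}")
```

Treat the result as a starting point. Longer contexts need more headroom, and Q4_K_M remains a sensible default when you are unsure.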

Cohere Transcribe Locally

Cohere Transcribe is a 2B dedicated audio-in, text-out ASR model. It supports Arabic, Chinese, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Spanish, and Vietnamese. It is Apache 2.0 licensed and is a strong fit for local voice interfaces.

Install the basics:

pip install "transformers>=5.4.0" torch huggingface_hub soundfile librosa sentencepiece protobuf

Transcribe a local audio file:

from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from transformers.audio_utils import load_audio

model_id = "CohereLabs/cohere-transcribe-03-2026"
audio_path = "voice_note.wav"

processor = AutoProcessor.from_pretrained(model_id)
model = CohereAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")

audio = load_audio(audio_path, sampling_rate=16000)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="en")
inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True)

print(text)

Use the language argument for non-English transcription:

inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="ja")

Control punctuation:

inputs = processor(
    audio,
    sampling_rate=16000,
    return_tensors="pt",
    language="en",
    punctuation=False,
)

For long-form audio, the processor automatically chunks audio longer than the feature extractor's maximum clip length. Keep the returned audio_chunk_index and pass it back to processor.decode(...):

inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="en")
audio_chunk_index = inputs.get("audio_chunk_index")
inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(
    outputs,
    skip_special_tokens=True,
    audio_chunk_index=audio_chunk_index,
    language="en",
)[0]
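To build intuition for why the chunk index has to travel with the outputs, here is a conceptual illustration in plain Python. It is not the processor's actual implementation, and the real audio_chunk_index may be laid out differently. The functions chunk_batch and stitch, the (source, position) index format, and the window length are all assumptions made for this sketch.

```python
def chunk_batch(waveforms, max_samples):
    """Split each waveform into windows of at most max_samples samples.

    Returns (chunks, index), where index[i] is (source_idx, position)
    for chunks[i]. This bookkeeping is illustrative only.
    """
    chunks, index = [], []
    for src, wave in enumerate(waveforms):
        for pos, start in enumerate(range(0, len(wave), max_samples)):
            chunks.append(wave[start:start + max_samples])
            index.append((src, pos))
    return chunks, index


def stitch(texts, index, n_sources):
    """Rejoin per-chunk transcripts into one string per source, in order."""
    grouped = [[] for _ in range(n_sources)]
    for text, (src, _pos) in sorted(zip(texts, index), key=lambda pair: pair[1]):
        grouped[src].append(text)
    return [" ".join(parts) for parts in grouped]


# Two fake "recordings": one long enough for three windows, one short.
chunks, index = chunk_batch([[0.0] * 5, [0.0] * 2], max_samples=2)
print(index)
print(stitch(["good", "morning", "everyone", "hello"], index, n_sources=2))
```

Without the index, the decoder would see a flat batch of chunk transcripts and could not tell which ones belong to the same recording.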

For production-style local serving, use vLLM:

vllm serve CohereLabs/cohere-transcribe-03-2026 --trust-remote-code

Then call the local OpenAI-compatible audio endpoint:

curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -F "file=@$(realpath voice_note.wav)" \
  -F "model=CohereLabs/cohere-transcribe-03-2026"

On Apple Silicon, also check the mlx-audio integration linked from the model card if you want a path that runs natively on the device.

Gradio Snippets

Install the small UI dependencies:

pip install gradio openai

Chat UI For A Local llama.cpp Server

Start Tiny Aya locally first:

llama-server -hf CohereLabs/tiny-aya-global-GGUF:Q4_K_M

Then connect a tiny Gradio chat UI to the local OpenAI-compatible endpoint:

import gradio as gr
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
model = "CohereLabs/tiny-aya-global-GGUF:Q4_K_M"


def chat(message, history):
    messages = list(history)
    messages.append({"role": "user", "content": message})

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.2,
    )
    return response.choices[0].message.content


demo = gr.ChatInterface(
    fn=chat,
    type="messages",
    title="Tiny Aya Local Chat",
    description="A multilingual chat UI backed by a local llama.cpp server.",
)

demo.launch()

Audio Input For Cohere Transcribe

This is a small sketch you can adapt into your app:

import gradio as gr
from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from transformers.audio_utils import load_audio

model_id = "CohereLabs/cohere-transcribe-03-2026"
processor = AutoProcessor.from_pretrained(model_id)
model = CohereAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")


def transcribe(audio_path, language):
    audio = load_audio(audio_path, sampling_rate=16000)
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language=language)
    inputs.to(model.device, dtype=model.dtype)
    outputs = model.generate(**inputs, max_new_tokens=256)
    return processor.decode(outputs, skip_special_tokens=True)


demo = gr.Interface(
    fn=transcribe,
    inputs=[
        gr.Audio(type="filepath", label="Voice note"),
        gr.Dropdown(["en", "es", "fr", "de", "ja", "ko", "zh", "ar"], value="en", label="Language"),
    ],
    outputs=gr.Textbox(label="Transcript"),
    title="Cohere Transcribe Local Demo",
)

demo.launch()

You can combine both snippets into a voice app: record audio, transcribe locally, pass the transcript to Tiny Aya, then show a multilingual response.
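The combined flow can be sketched as a small composition of the two functions above. This assumes transcribe(audio_path, language) returns a plain string and chat(message, history) returns the reply text, as in the earlier snippets. The wrapper name make_voice_pipeline and the fallback message are made up for illustration.

```python
def make_voice_pipeline(transcribe, chat):
    """Compose a speech-to-text callable and a chat callable into one handler."""

    def handle(audio_path, language):
        transcript = (transcribe(audio_path, language) or "").strip()
        if not transcript:
            # Nothing usable was recognized, so skip the language model call.
            return "", "No speech detected. Try recording again."
        reply = chat(transcript, [])  # start each voice note with empty history
        return transcript, reply

    return handle


# Stub callables stand in for the real models so the flow can be checked quickly.
demo_handle = make_voice_pipeline(
    transcribe=lambda path, lang: "  hola, necesito ayuda  ",
    chat=lambda message, history: f"Respuesta a: {message}",
)
print(demo_handle("voice_note.wav", "es"))
```

In a real app you would pass the transcribe and chat functions defined earlier, then use the returned handler as the fn of a gr.Interface with two Textbox outputs, one for the transcript and one for the reply.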

Beyond gr.Interface

gr.Interface is the fastest way to a demo, but Gradio gives you a lot more for the same hackathon project:

Streaming responses. Make your function a generator and yield partial output. For chat, gr.ChatInterface renders tokens as they arrive — much nicer than waiting for the full reply. See Streaming Outputs.
Rich chatbots. gr.ChatInterface supports multimodal input, message metadata, retry/undo, and example prompts out of the box. See Creating a Chatbot Fast.
Custom layouts. gr.Blocks lets you arrange rows, columns, tabs, and accordions, and wire one component's output to another's input — e.g. transcribe audio in one panel and feed it to Tiny Aya in the next. See Blocks and Event Listeners.
Themes and examples. Ship a polished look with Theming and give judges one-click Examples.
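The streaming pattern from the first item is small enough to show on its own. With a generator chat function, each yielded value replaces the reply shown so far, so you yield the accumulated text rather than individual tokens. The helper name accumulate is hypothetical, and the list of pieces stands in for a real token stream.

```python
def accumulate(pieces):
    """Yield the text so far after each new piece arrives."""
    text = ""
    for piece in pieces:
        if piece:  # streams can contain empty or None deltas; skip them
            text += piece
            yield text


# A fake token stream, including an empty delta like real streams may send.
for partial in accumulate(["Ho", "la", None, ", mundo"]):
    print(partial)
```

In a chat function backed by the OpenAI client, the pieces could come from client.chat.completions.create(..., stream=True) by reading chunk.choices[0].delta.content from each chunk. Writing yield from accumulate(...) inside the function you pass to gr.ChatInterface would then update the reply as tokens arrive.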

Custom Frontends With gr.Server

When you want a fully custom HTML/JS frontend — your own branding, layout, and chat UI — but still want Gradio's request queue, ZeroGPU lifecycle, and a Python/JS SDK, reach for gr.Server. The idea: bring any UI you like, and let gr.Server be the engine behind it. It splits cleanly into two route types — @server.api for anything that touches the model (queued, GPU-aware, discoverable by the clients), and plain @server.get/@server.post for static content like your homepage HTML.

Here is a Tiny Aya backend that serves its own page and exposes one streaming chat route. Streaming is the modern-chatbot must-have, and it's just a generator that yields the accumulated text:

import threading
import gradio as gr
from fastapi.responses import HTMLResponse
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

try:
    import spaces
    _HAS_SPACES = True
except ImportError:
    _HAS_SPACES = False

model_id = "CohereLabs/tiny-aya-global"   # gated model -> set HF_TOKEN as a Space secret
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
model.to("cuda")  # module level, so ZeroGPU can fast-restore the placement


def _stream(messages: list):
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",   # return_dict=True: transformers 5.x
    ).to(model.device)                            # returns a BatchEncoding, unpack it

    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    threading.Thread(
        target=model.generate,
        kwargs=dict(**inputs, max_new_tokens=512, do_sample=True,
                    temperature=0.3, streamer=streamer),
    ).start()

    acc = ""
    for token in streamer:
        acc += token
        yield acc                                # each yield streams to the browser


if _HAS_SPACES:
    _stream = spaces.GPU(duration=120)(_stream)  # @spaces.GPU on the inner function

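# Placeholder markup so the module runs as-is; swap in your own single-page app.
FRONTEND_HTML = '<!doctype html><title>Tiny Aya</title><div id="app"></div>'

server = gr.Server()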


@server.get("/", response_class=HTMLResponse)
async def homepage() -> str:
    return FRONTEND_HTML            # your custom single-page app


@server.api(name="chat")
def chat_api(messages: list) -> str:   # generator -> annotate with the YIELDED type
    yield from _stream(messages)


if __name__ == "__main__":
    server.launch(server_name="0.0.0.0", server_port=7860)

On the frontend, use the JS client's submit() (not predict()) to consume the stream — each data event carries the latest accumulated text, which is what produces the token-by-token typing effect:

<script type="module">
import { Client } from "https://cdn.jsdelivr.net/npm/@gradio/client/dist/index.min.js";
const client = await Client.connect(window.location.origin);

const history = [{ role: "user", content: "Hola, ¿qué tal?" }];
const job = client.submit("/chat", { messages: history });
for await (const msg of job) {
  if (msg.type === "data") render(msg.data[0]);   // accumulated text so far
}
</script>

A few things that bite people (all covered in the Server Mode guide):

Return-type annotation is load-bearing. A generator must be annotated with its yielded type (-> str here). Without it, Gradio registers zero outputs and silently drops every chunk, so the UI just hangs.
@spaces.GPU goes on the inner model function, not the @server.api route.
Gated models need a token. tiny-aya-global is gated, so add HF_TOKEN as a Space secret (transformers picks it up automatically) and accept the model's terms on its page first.
apply_chat_template(..., tokenize=True) returns a BatchEncoding on transformers 5.x, so pass return_dict=True and unpack with **inputs — model.generate(inputs, ...) on the raw object raises AttributeError: shape.

Full implementation — the complete custom frontend (streaming chat UI, conversation history, Cohere Labs + Build Small branding) lives in the companion Space. Read its app.py for the whole thing, or try it live: tiny-aya-build-small-sample. For file uploads, custom URLs, and ZeroGPU duration tuning, see the JS client guide.

Resources

Hackathon: Build Small Hackathon
Tiny Aya demo: CohereLabs/tiny-aya Space
Tiny Aya model cards: global, water, fire, earth
Tiny Aya GGUF repos: global-GGUF, water-GGUF, fire-GGUF, earth-GGUF
Cohere Transcribe model: CohereLabs/cohere-transcribe-03-2026
Cohere Transcribe browser demo: Cohere-Transcribe-WebGPU
Cohere Transcribe release blog: Introducing Cohere-transcribe
Tiny Aya + gr.Server sample app: tiny-aya-build-small-sample (app.py)
Gradio docs: gradio.app/docs
Gradio Server intro: Introducing gr.Server
Gradio Server mode guide: gradio.app/guides/server-mode
Gradio JS client: gradio.app/guides/getting-started-with-the-js-client
Gradio streaming: gradio.app/guides/streaming-outputs
llama.cpp: github.com/ggml-org/llama.cpp
