Cohere Transcribe: Diarize + Timestamps (English)

Built by syv.ai, a Danish AI company focused on shipping practical speech and language models. We release open-weights speech models so teams can build on top without leaving their own infrastructure.

This model is CohereLabs/cohere-transcribe-03-2026 fine-tuned to also emit speaker labels and word-aligned timestamps in a single decoder pass, while preserving the base model's transcription quality. It's a drop-in replacement when you need to know who said what and when on short-form audio (≤ 30 s), and pairs with our diarize_long_vllm helper for arbitrary-length recordings.

Recommended deployment: vLLM (see Serving with vLLM). We measured 44× real-time end-to-end on a 10-min clip with one RTX 3090 (decode 113× RTF, embed 16 seg/s), and 249× peak throughput under concurrent load. Transformers works too and is shown first for a minimal example, but the vLLM path is what we run in production.

WE ARE LOOKING FOR COMPUTE PARTNERS TO FURTHER IMPROVE OUR MODELS - REACH OUT IF YOU CAN HELP

Name: cohere-transcribe-diarize
Base model: CohereLabs/cohere-transcribe-03-2026 (Apache 2.0, 2 B params)
Architecture: conformer-based encoder–decoder, full fine-tune (no LoRA)
Input: audio waveform (16 kHz mono, resampled automatically). Maximum supported clip length: 30 s; longer audio should be processed with sliding windows (see below)
Output: special-token stream interleaving speaker IDs, timestamps, and transcribed text, e.g. <|spltoken0|><|t:0.0|> Welcome back to the show.<|t:2.4|><|spltoken1|><|t:2.4|> Thanks for having me.<|t:3.8|>
Vocabulary extensions: 8 speaker tokens (<|spltoken0|>…<|spltoken7|>) + 300 timestamp tokens at 100 ms resolution (<|t:0.0|>…<|t:29.9|>)
Languages:
  Primary: English (the diarization + timestamp fine-tune was done exclusively on English supervision).
  Likely usable (untested by us): the other 13 languages the Cohere Transcribe base supports, namely Arabic, German, Greek, Spanish, French, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Vietnamese, and Chinese (Mandarin). The base model's multilingual transcription weights are preserved, and the diarization head conditions on language-agnostic speaker acoustics, so segmentation and speaker IDs should transfer; word-level timestamp accuracy will be best on English. Pass the matching language code in the prompt (<|de|>, <|fr|>, …) to switch.
License: Apache 2.0 (inherited from base)
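
As a quick sanity check on the vocabulary extensions, the token strings can be enumerated from the description above. This is a minimal sketch: the helper names (timestamp_token, token_to_seconds) are hypothetical, and the one-decimal formatting is inferred from the examples in this card (<|t:0.0|>, <|t:29.9|>) rather than read from the tokenizer itself.

```python
import re

# 8 reusable speaker slots and 300 timestamp tokens at 100 ms resolution,
# formatted the way they appear in this card's examples (an assumption).
SPEAKER_TOKENS = [f"<|spltoken{i}|>" for i in range(8)]
TIMESTAMP_TOKENS = [f"<|t:{i / 10:.1f}|>" for i in range(300)]

def timestamp_token(seconds: float) -> str:
    """Snap a time in seconds to the nearest 100 ms timestamp token (hypothetical helper)."""
    index = min(max(round(seconds * 10), 0), 299)  # clamp to 0.0 .. 29.9 s
    return TIMESTAMP_TOKENS[index]

def token_to_seconds(token: str) -> float:
    """Recover the time in seconds from a timestamp token string."""
    match = re.fullmatch(r"<\|t:(\d+\.\d)\|>", token)
    if match is None:
        raise ValueError(f"not a timestamp token: {token!r}")
    return float(match.group(1))

print(len(SPEAKER_TOKENS), len(TIMESTAMP_TOKENS))  # 8 300
print(TIMESTAMP_TOKENS[0], TIMESTAMP_TOKENS[-1])   # <|t:0.0|> <|t:29.9|>
print(timestamp_token(2.44))                       # <|t:2.4|>
```

The 300 tokens cover 0.0 s through 29.9 s, which is why a single decoder pass cannot label anything beyond the 30 s window.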

Quick start

pip install transformers==4.57.6 torch huggingface_hub soundfile librosa sentencepiece protobuf

import re
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from transformers.audio_utils import load_audio

MODEL_ID = "syvai/cohere-transcribe-diarize"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL_ID, dtype=torch.bfloat16
).to("cuda").eval()

# Prompt that activates diarization + timestamps. The base Cohere model
# uses special control tokens to switch features on/off; we keep that contract.
# `<|en|><|en|>` is the canonical Cohere prompt: the two slots are
# audio-language + transcript-language; setting them to the same code means
# "transcribe" (different codes would be "translate"). To run on another
# Cohere language, swap BOTH tokens, e.g. `<|de|><|de|>`.
# Each `<|...|>` is a single special token in the tokenizer vocab. Resolve
# via convert_tokens_to_ids: running the prompt string through the tokenizer
# re-tokenizes each marker into 6-12 subword pieces, which weakens the
# control-token signal the model trained on.
PROMPT_TOKENS = [
    "<|startofcontext|>", "<|startoftranscript|>",
    "<|emo:undefined|>", "<|en|>", "<|en|>",
    "<|pnc|>", "<|noitn|>", "<|timestamp|>", "<|diarize|>",
]
prompt_ids = torch.tensor(
    [[processor.tokenizer.convert_tokens_to_ids(t) for t in PROMPT_TOKENS]]
).to(model.device)

# Load any ≤ 30 s audio clip.
audio = load_audio("clip.wav", sampling_rate=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
inputs = {k: v.to(model.device, dtype=model.dtype if v.is_floating_point() else None)
          for k, v in inputs.items()}

with torch.inference_mode():
    out = model.generate(
        input_features=inputs["input_features"],
        attention_mask=torch.ones(inputs["input_features"].shape[:2], device=model.device),
        decoder_input_ids=prompt_ids,
        max_new_tokens=400,
        do_sample=False,
        repetition_penalty=1.2,  # baked into generation_config but explicit here
    )

raw = processor.tokenizer.decode(out[0], skip_special_tokens=False)
print(raw)
# → <|spltoken0|><|t:0.0|> Welcome back. <|t:1.5|><|spltoken1|><|t:1.5|> Thanks. <|t:2.4|>...

Parsing the output into structured segments

SEG_RE = re.compile(r"<\|spltoken(\d+)\|><\|t:(\d+\.\d+)\|>(.*?)<\|t:(\d+\.\d+)\|>", re.DOTALL)

# Drop the prompt prefix; the diarized text follows <|diarize|>
text = raw.split("<|diarize|>", 1)[-1].replace("<|endoftext|>", "")

segments = [
    {
        "speaker": int(m.group(1)),
        "start":   float(m.group(2)),
        "end":     float(m.group(4)),
        "text":    re.sub(r"<\|[^|]+\|>", "", m.group(3)).strip(),
    }
    for m in SEG_RE.finditer(text)
]
for s in segments:
    print(f"[{s['start']:6.2f}–{s['end']:6.2f}] SPK{s['speaker']:02d}  {s['text']}")

Output:

[  0.00–  1.50] SPK00  Welcome back.
[  1.50–  2.40] SPK01  Thanks for having me.
[  2.40–  3.80] SPK00  Let's get into it.
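
Once the segments are in this shape, simple aggregates fall out directly. The sketch below sums talk time per speaker; the talk_time helper is hypothetical, and the segment values are copied from the sample output above rather than produced by the model.

```python
from collections import defaultdict

# Segments in the shape produced by the parser above. The values mirror the
# sample output and are illustrative, not real model output.
segments = [
    {"speaker": 0, "start": 0.0, "end": 1.5, "text": "Welcome back."},
    {"speaker": 1, "start": 1.5, "end": 2.4, "text": "Thanks for having me."},
    {"speaker": 0, "start": 2.4, "end": 3.8, "text": "Let's get into it."},
]

def talk_time(segments):
    """Sum segment durations per speaker ID, rounded to the 100 ms token grid."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return {spk: round(total, 1) for spk, total in sorted(totals.items())}

print(talk_time(segments))  # {0: 2.9, 1: 0.9}
```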

The model uses 8 reusable speaker slots per clip (<|spltoken0|>…<|spltoken7|>). IDs are local to the clip: there is no global identity across separately decoded clips. For long-form audio that's split into windows, re-link windows with the helper below.

Long-form audio (> 30 s)

Audio longer than 30 s exceeds the encoder's maximum window. Two helpers in this repo do the windowing + cross-chunk speaker matching for you:

  • diarize_long_vllm.py (recommended). Calls a local vLLM server concurrently (continuous batching) and reuses one GPU for both decode and embedding. ~44× RTF on a 10-min clip on a single 3090.
  • diarize_long.py (transformers-only fallback, no server needed). Slower (~7× RTF on the same clip) but minimal deps.

Both helpers:

  1. Slide 28 s windows with 2 s overlap over the full audio
  2. Decode each window with this model
  3. Embed each parsed segment with ReDimNet2 B6 (12 M params, 0.17 % EER, loaded automatically via torch.hub)
  4. Cluster embeddings globally with cosine-distance AHC so the same speaker keeps the same ID across windows

# Assumes vLLM is already serving (see next section)
python diarize_long_vllm.py podcast.wav \
    --vllm http://127.0.0.1:8000 \
    --model syvai/cohere-transcribe-diarize \
    --language en \
    --tau 0.45 \
    --concurrency 32 \
    --embed-batch 32

Or via the offline transformers helper (slower, no server):

from diarize_long import diarize_long_audio

segments = diarize_long_audio(
    audio="podcast.wav",
    diar_model_id="syvai/cohere-transcribe-diarize",
    language="en",
    chunk_s=28.0,
    overlap_s=2.0,
    cluster_threshold=0.45,
)

Additional dependencies for long-form inference: numpy, scipy, soundfile, torchaudio (required by ReDimNet2's feature extractor), plus aiohttp if using diarize_long_vllm.py.
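
The windowing step (28 s windows with 2 s overlap) can be sketched in a few lines. This is an illustration under stated assumptions, not the helpers' actual code: window_bounds and to_global are hypothetical names, and the real scripts may treat the final partial window differently.

```python
def window_bounds(duration_s, chunk_s=28.0, overlap_s=2.0):
    """Return (start, end) times in seconds for overlapping windows covering the audio."""
    if overlap_s >= chunk_s:
        raise ValueError("overlap_s must be smaller than chunk_s")
    hop_s = chunk_s - overlap_s  # 26 s with the defaults
    bounds, start = [], 0.0
    while True:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:
            break
        start += hop_s
    return bounds

def to_global(segment, window_start):
    """Shift a window-local segment onto the full recording's timeline."""
    return {**segment,
            "start": round(segment["start"] + window_start, 1),
            "end": round(segment["end"] + window_start, 1)}

print(window_bounds(120.0))
# [(0.0, 28.0), (26.0, 54.0), (52.0, 80.0), (78.0, 106.0), (104.0, 120.0)]
```

With the defaults, a 2-minute recording yields five windows. Timestamps decoded inside a window are relative to that window, so each parsed segment needs the window's start added before segments from different windows are compared.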

Tuning the clustering threshold. cluster_threshold is the cosine-distance ceiling for AHC merges over ReDimNet2 embeddings. Around 0.45 is a good default for podcast / panel-style audio: a 2-min Bernie Sanders town-hall clip cleanly resolves Bernie as one consistent ID across all 5 sliding windows and the host as a second ID, while short audience interjections get their own IDs. Drop to 0.30–0.35 if the audio has many similar-sounding speakers; raise to 0.50–0.55 for noisier conditions where you'd rather collapse near-duplicate IDs.
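
To make the role of the threshold concrete, here is a dependency-free sketch of agglomerative clustering with a cosine-distance cutoff. Several things here are assumptions: average linkage is chosen for illustration (the card does not say which linkage the helpers use), ahc_labels is a hypothetical name, and the 2-D toy vectors stand in for real ReDimNet2 embeddings. The helpers themselves most likely rely on scipy, which is listed among the dependencies.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def ahc_labels(embeddings, threshold=0.45):
    """Average-linkage AHC: keep merging the closest pair of clusters
    until the closest pair is farther apart than `threshold`."""
    clusters = [[i] for i in range(len(embeddings))]

    def avg_dist(c1, c2):
        total = sum(cosine_distance(embeddings[i], embeddings[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = avg_dist(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    # Number clusters by the first segment that belongs to them.
    labels = [0] * len(embeddings)
    for label, members in enumerate(sorted(clusters, key=min)):
        for i in members:
            labels[i] = label
    return labels

toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(ahc_labels(toy, threshold=0.45))  # [0, 0, 1, 1]
```

Lowering the threshold blocks more merges and so produces more speaker IDs; raising it collapses near-duplicates, which matches the tuning advice above.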

Serving with vLLM (recommended)

The transformers code path above works but is single-stream. For production we run this model on vLLM 0.19.0 (note: 0.19.1 is broken). It gives continuous batching, a custom OpenAI-compatible diarized_json response format, and ~25× higher peak throughput than calling model.generate() in a loop.

One-time setup

Two scripts ship with this repo to handle the setup (both idempotent):

# Download the model locally first, then patch it
hf download syvai/cohere-transcribe-diarize --local-dir cohere-transcribe-diarize

# 1. Reshape the checkpoint files for vLLM compatibility
python fix_for_vllm.py ./cohere-transcribe-diarize

fix_for_vllm.py makes three edits to your local copy:

  • tokenizer_config.json: drops the legacy extra_special_tokens list (transformers 4.57+ expects a dict; the actual tokens are still in tokenizer.json).
  • config.json: sets head.num_classes and transf_decoder.config_dict.vocab_size to 16684 (the resized vocab).
  • model.safetensors: strips the "model." weight-name prefix and drops the BatchNorm num_batches_tracked tensors that vLLM's CohereAsr model doesn't register.

# 2. Install vLLM 0.19.0 (NOT 0.19.1, which is broken)
uv pip install "vllm==0.19.0" --torch-backend=cu128
uv pip install librosa

# 3. Patch vLLM's speech_to_text endpoint to add diarized_json
python vllm_diarized_patch.py

vllm_diarized_patch.py applies five edits inside the installed vLLM (also idempotent):

  1. protocol.py: add "diarized_json" to the AudioResponseFormat enum
  2. protocol.py: force skip_special_tokens=False in to_sampling_params so <|spltoken*|> and <|t:*|> survive into the response text
  3. speech_to_text.py: let the validator accept response_format="diarized_json"
  4. speech_to_text.py: parse the raw token stream with the segment regex and return OpenAI-compatible {task, language, duration, text, segments:[{speaker, start, end, text}], speakers, usage} JSON
  5. api_router.py: pass JSONResponse returns through unchanged (otherwise the diarized branch's return value gets misinterpreted as a streaming generator and the response body comes out empty)

Launch the server

vllm serve ./cohere-transcribe-diarize \
    --served-model-name syvai/cohere-transcribe-diarize \
    --trust-remote-code \
    --host 127.0.0.1 --port 8000 \
    --gpu-memory-utilization 0.55     # leaves ~10 GB for ReDimNet2 batching

--gpu-memory-utilization 0.55 is the sweet spot on a 24 GB card when you also run ReDimNet2 on the same GPU for long-form. If you only need short-form decode (≤ 30 s, no cross-chunk linking), bump it to 0.85 for better KV cache headroom.

Call the API

The endpoint follows the OpenAI transcription API; request the diarized output with response_format=diarized_json:

curl -X POST http://127.0.0.1:8000/v1/audio/transcriptions \
    -F "file=@clip.wav" \
    -F "model=syvai/cohere-transcribe-diarize" \
    -F "language=en" \
    -F "response_format=diarized_json" \
    --form-string "prompt=<|startofcontext|><|startoftranscript|><|emo:undefined|><|en|><|en|><|pnc|><|noitn|><|timestamp|><|diarize|>"

Response shape (mirrors OpenAI's gpt-4o-transcribe-diarize):

{
  "task": "transcribe",
  "language": "en",
  "duration": 28.0,
  "text": "UM I REJECT THE IDEA I REALLY DO ...",
  "segments": [
    {"speaker": "SPEAKER_00", "start": 2.5,  "end": 3.8,  "text": "I REALLY DO"},
    {"speaker": "SPEAKER_01", "start": 3.6,  "end": 15.0, "text": "IT'S ONE OF THINGS THAT BOTHERS ME ..."},
    {"speaker": "SPEAKER_02", "start": 15.5, "end": 28.0, "text": "IS RAISING A STARVATION MINIMUM WAGE ..."}
  ],
  "speakers": ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"],
  "usage": {"type": "duration", "seconds": 28}
}
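
For readers who want to see how a raw token stream maps onto this shape, the sketch below builds a comparable dictionary with the segment regex from the parsing section. It is not the patch's actual code. The to_diarized_json name is hypothetical, and two details are assumptions: the top-level text is joined from segment texts, and usage.seconds is the rounded duration.

```python
import re

SEG_RE = re.compile(
    r"<\|spltoken(\d+)\|><\|t:(\d+\.\d+)\|>(.*?)<\|t:(\d+\.\d+)\|>", re.DOTALL
)

def to_diarized_json(raw, duration, language="en"):
    """Turn a raw diarized token stream into a diarized_json-style dict (sketch)."""
    segments = []
    for m in SEG_RE.finditer(raw):
        segments.append({
            "speaker": f"SPEAKER_{int(m.group(1)):02d}",
            "start": float(m.group(2)),
            "end": float(m.group(4)),
            # Remove any leftover control tokens from the segment text.
            "text": re.sub(r"<\|[^|]+\|>", "", m.group(3)).strip(),
        })
    return {
        "task": "transcribe",
        "language": language,
        "duration": duration,
        "text": " ".join(s["text"] for s in segments),  # assumption
        "segments": segments,
        "speakers": sorted({s["speaker"] for s in segments}),
        "usage": {"type": "duration", "seconds": round(duration)},  # assumption
    }

raw = ("<|spltoken0|><|t:0.0|> Welcome back to the show.<|t:2.4|>"
       "<|spltoken1|><|t:2.4|> Thanks for having me.<|t:3.8|>")
result = to_diarized_json(raw, duration=3.8)
print(result["speakers"])        # ['SPEAKER_00', 'SPEAKER_01']
print(result["segments"][1])
```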

The prompt field must be passed explicitly: vLLM's default prompt builder emits <|nodiarize|>, which suppresses the speaker tokens.
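
The same request can be issued from Python. The sketch below assumes the third-party requests package (not listed in this card's dependencies) and uses hypothetical helper names, build_form_fields and transcribe. The prompt template repeats the language token in both slots, as described in the Quick start comments.

```python
PROMPT_TEMPLATE = (
    "<|startofcontext|><|startoftranscript|><|emo:undefined|>"
    "<|{lang}|><|{lang}|><|pnc|><|noitn|><|timestamp|><|diarize|>"
)

def build_form_fields(language="en", model="syvai/cohere-transcribe-diarize"):
    """Form fields for /v1/audio/transcriptions, including the explicit prompt."""
    return {
        "model": model,
        "language": language,
        "response_format": "diarized_json",
        "prompt": PROMPT_TEMPLATE.format(lang=language),
    }

def transcribe(path, base_url="http://127.0.0.1:8000", language="en"):
    """POST one audio file to a running server and return the parsed JSON."""
    import requests  # third-party; imported lazily so the helpers above need no deps

    with open(path, "rb") as audio_file:
        response = requests.post(
            f"{base_url}/v1/audio/transcriptions",
            files={"file": audio_file},
            data=build_form_fields(language),
            timeout=120,
        )
    response.raise_for_status()
    return response.json()

print(build_form_fields("de")["prompt"])
```

Calling transcribe("clip.wav") requires the patched server from the previous section to be running; build_form_fields can be checked offline.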

Measured throughput (RTX 3090, 28 s clips)

| Concurrency | Throughput |
| --- | --- |
| 1 | 22× audio/wall |
| 8 | 117× |
| 32 | 171× |
| 128 | 249× (peak) |

vLLM does continuous (in-flight) batching automatically: fire concurrent requests at the endpoint and it batches them through one forward pass.
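
On the client side, taking advantage of this only requires keeping several requests in flight. The sketch below shows one common pattern, a semaphore-bounded asyncio.gather, using only the standard library. The run_bounded and fake_request names are hypothetical, and fake_request is a stand-in for a real HTTP call (for example with aiohttp, which the long-form helper depends on).

```python
import asyncio

async def run_bounded(items, worker, concurrency=32):
    """Run worker(item) for every item with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # gather preserves input order, so results line up with items.
    return await asyncio.gather(*(guarded(item) for item in items))

async def fake_request(window_index):
    """Stand-in for an HTTP call to /v1/audio/transcriptions (hypothetical)."""
    await asyncio.sleep(0.01)
    return {"window": window_index, "segments": []}

results = asyncio.run(run_bounded(range(8), fake_request, concurrency=4))
print([r["window"] for r in results])  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Because results come back in submission order, window indices can be matched to their start offsets without extra bookkeeping.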

Training

This model was produced by full fine-tuning of CohereLabs/cohere-transcribe-03-2026 on English diarization data. The base vocabulary was extended with 8 speaker tokens and 300 timestamp tokens (100 ms resolution); the new rows of the embedding and LM-head matrices were initialised from the existing token embedding statistics.

| Dataset | Rows | Description |
| --- | --- | --- |
| AMI SDM (train split) | 19,928 | Single-distant-microphone meeting recordings, sliding 28 s windows with 14 s hop, up to 4 simultaneous speakers per window. Provides realistic multi-speaker conversation with overlap, hesitations, and turn-taking. |
| LibriSpeech synthetic mix | 11,813 | Synthetic K-speaker mixtures (K weighted 0.2 / 0.3 / 0.3 / 0.2 for K=1…4) constructed from LibriSpeech utterances, with realistic gap silences. Provides clean cross-talk-free speaker examples to anchor the diarization head. |
| Total | 31,741 | All segments are ≤ 30 s and capped at K ≤ 4 speakers. |

Training ran for 2 epochs at peak LR 3e-4 (linear warmup over 100 optimizer steps, then linear decay to 0). Effective batch size 128 (per-device batch 2 × 64 gradient-accumulation), bf16, gradient checkpointing, AdamW8bit optimizer. The full fine-tune updates all 2 B parameters. repetition_penalty=1.2 is baked into the generation config and is required at inference: without it, K=4 outputs occasionally loop on a single speaker token.
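
The schedule described above can be written out explicitly. This is a sketch under stated assumptions, not the training script: the step count assumes a single device and that the final partial batch of each epoch still counts as an optimizer step, and lr_at is a hypothetical name.

```python
import math

PEAK_LR = 3e-4
WARMUP_STEPS = 100
ROWS, EPOCHS = 31_741, 2
EFFECTIVE_BATCH = 2 * 64  # per-device batch x gradient-accumulation steps = 128

# Assumption: one device, and a final partial batch still counts as a step.
TOTAL_STEPS = math.ceil(ROWS / EFFECTIVE_BATCH) * EPOCHS  # 248 * 2 = 496

def lr_at(step, total_steps=TOTAL_STEPS):
    """Linear warmup to PEAK_LR, then linear decay to 0 at total_steps."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(total_steps - step, 0)
    return PEAK_LR * remaining / (total_steps - WARMUP_STEPS)

for step in (0, 50, 100, 298, 496):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the warmup covers roughly the first fifth of training, and the learning rate is back at half its peak by step 298.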

Limitations

  • 30 s hard cap per decoder pass: use diarize_long for longer audio. The Cohere feature extractor batches longer clips into multiple chunks, which the diarization decoder is not trained to consume.
  • K ≤ 4 speakers is well supported; K = 5–8 still emits output, but accuracy degrades on dense overlapping speech.
  • Real-time factor ≈ 14× on RTX 3090 at bf16; the 2 B autoregressive decoder is the bottleneck. For >100× RTF on long audio, pair with a smaller segmenter (e.g. DiariZen-base) or use this model only on the highlight regions.
  • Speaker IDs are local to each generate call. Always cluster embeddings across windows when working with audio that crosses the 30 s boundary.

Citation

If you use this model, please cite Cohere Labs' base release alongside this fine-tune:

@misc{cohere-transcribe-diarize-2026,
  author       = {{syv.ai}},
  title        = {Cohere Transcribe: Diarize + Timestamps (English)},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/syvai/cohere-transcribe-diarize}},
}

License

Apache 2.0, inherited from the base model.
