---
license: apache-2.0
datasets:
- edinburghcstr/ami
language:
- en
metrics:
- wer
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: automatic-speech-recognition
tags:
- tagspeech
- diarization
- multi-speaker
---
# TagSpeech
**TagSpeech** is a **fully end-to-end multi-speaker ASR and diarization model**.
Given a raw waveform of a multi-speaker conversation, the model directly outputs **speaker-attributed transcriptions with timestamps and gender labels**, without requiring a separate diarization or clustering stage.
**Paper**: [TagSpeech: Unified E2E Multi-Speaker ASR and Diarization Model](https://arxiv.org/abs/2601.06896)
Available checkpoints:
- **English (AMI)**: `AudenAI/TagSpeech-AMI`
- **Mandarin (AliMeeting)**: `AudenAI/TagSpeech-Alimeeting`
---
## What Can This Model Do?
- **Multi-speaker speech recognition**
- **Speaker diarization**
- **Timestamped utterances**
- **Gender prediction**
- **Single forward pass** (no external diarization model required)
The model is designed for **meeting-style conversational audio** with overlapping speakers.
## Quick Inference
```python
import torch
from model import TagSpeech
from utils.xml_utils import xml_to_json
device = "cuda" if torch.cuda.is_available() else "cpu"
model = TagSpeech.from_pretrained("AudenAI/TagSpeech-AMI").to(device)
wav_files = ["assets/test_example_AMI_EN2002c-12-0-35.wav"]
audio_token = model.config.audio_token
messages = [
    [{"role": "user", "content": f"{audio_token}\n{audio_token}"}]
    for _ in wav_files
]
outputs = model.generate(wav_files, messages, max_new_tokens=800, num_beams=1, do_sample=False)
# Print outputs in XML and JSON formats
for i, output in enumerate(outputs, 1):
    print(f"\n{'='*80}\nOutput {i}/{len(outputs)} - XML:\n{'='*80}\n{output}\n{'='*80}")
    json_output = xml_to_json(output)
    if json_output:
        print(f"\nOutput {i}/{len(outputs)} - JSON:\n{'='*80}\n{json_output}\n{'='*80}")
    else:
        print(f"\nWarning: Output {i} could not be parsed as valid XML\n{'='*80}")
```
**Example Output**
```json
{
  "segments": [
    {
      "start": 0.0,
      "end": 2.6,
      "text": "oh right so oh so that that is all st stored to",
      "speaker_id": "1",
      "speaker_gender": "female"
    },
    {
      "start": 2.15,
      "end": 5.88,
      "text": "the speaker class knows about all of that stuff and the meeting class knows about that stuff",
      "speaker_id": "2",
      "speaker_gender": "male"
    },
    {
      "start": 4.12,
      "end": 4.51,
      "text": "alright",
      "speaker_id": "1",
      "speaker_gender": "female"
    },
    {
      "start": 5.88,
      "end": 6.75,
      "text": "well",
      "speaker_id": "2",
      "speaker_gender": "male"
    }
  ]
}
```
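
The parsed output is plain JSON and can be consumed directly in Python. As a small illustrative example (not part of the released code), the snippet below tallies per-speaker speaking time from the `json_output` string produced in the Quick Inference snippet, assuming `xml_to_json` returns a JSON string with the schema shown above (if it returns an already-parsed dict, skip the `json.loads` call):

```python
import json

# `json_output` is the value returned by xml_to_json() in the Quick Inference snippet.
result = json.loads(json_output)

# Tally total speaking time per speaker; overlapping segments are counted independently.
speaking_time = {}
for seg in result["segments"]:
    key = (seg["speaker_id"], seg["speaker_gender"])
    speaking_time[key] = speaking_time.get(key, 0.0) + (seg["end"] - seg["start"])

for (speaker_id, gender), seconds in sorted(speaking_time.items()):
    print(f"Speaker {speaker_id} ({gender}): {seconds:.2f} s")
```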
## Model Characteristics
- Input: Raw audio waveform (16 kHz recommended; see the resampling sketch below)
- Output: Speaker-attributed transcription with timestamps in XML format, which can be parsed into JSON
- Backend LLM: Qwen2.5-7B-Instruct (frozen)
- Architecture: Dual encoders (semantic + [voice](https://huggingface.co/AudenAI/auden-encoder-voice)) with numeric time anchors
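
Since 16 kHz input is recommended, recordings at other sample rates (or with multiple channels) may need preprocessing before inference. Below is a minimal, illustrative sketch using `torchaudio`; the file names are placeholders and this helper is not part of the released code:

```python
import torchaudio

def prepare_wav(in_path: str, out_path: str, target_sr: int = 16000) -> str:
    """Downmix to mono and resample to 16 kHz, saving a WAV file for inference."""
    waveform, sr = torchaudio.load(in_path)
    if waveform.shape[0] > 1:        # downmix multi-channel audio to mono
        waveform = waveform.mean(dim=0, keepdim=True)
    if sr != target_sr:              # resample only when needed
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    torchaudio.save(out_path, waveform, target_sr)
    return out_path

wav_files = [prepare_wav("my_meeting.flac", "my_meeting_16k.wav")]
```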
## Limitations
- This checkpoint is trained on **approximately 65 hours of AMI meeting speech** only, and is primarily optimized for **noisy, far-field, multi-speaker meeting scenarios**. Performance may degrade on out-of-domain audio (e.g., clean close-talk speech or other acoustic conditions). For best results, we recommend fine-tuning on in-domain data.
- The model is recommended for **short inference (≤ 30 seconds)**. For long-form recordings, **chunk-based inference** is required; chunking and post-processing logic are not provided in this repository, but a minimal chunking sketch is shown after this list.
- This model is designed for **offline inference only** and does **not support real-time or streaming ASR**.
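
For long-form audio, one naive chunking strategy is to split the recording into fixed 30-second windows and run each chunk through `model.generate` as in the Quick Inference example. The sketch below is illustrative only and is not part of this repository; in particular, speaker IDs are assigned independently per chunk, so linking speaker identities across chunks (e.g., via speaker-embedding clustering) is left to the user:

```python
import torchaudio

CHUNK_SECONDS = 30   # recommended maximum inference length
TARGET_SR = 16000    # recommended sample rate

def split_into_chunks(in_path: str) -> list[str]:
    """Split a long recording into <=30 s, 16 kHz mono WAV chunks (hypothetical helper)."""
    waveform, sr = torchaudio.load(in_path)
    waveform = waveform.mean(dim=0, keepdim=True)   # downmix to mono
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    chunk_len = CHUNK_SECONDS * TARGET_SR
    paths = []
    for start in range(0, waveform.shape[1], chunk_len):
        path = f"chunk_{start // chunk_len:04d}.wav"
        torchaudio.save(path, waveform[:, start:start + chunk_len], TARGET_SR)
        paths.append(path)
    return paths

# Each chunk is then transcribed as in Quick Inference; speaker IDs remain local to each chunk.
chunked_wav_files = split_into_chunks("long_meeting.wav")
```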
## Citation
If you use TagSpeech in your research, please cite:
```
@article{huo2026tagspeech,
title={TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding},
author={Huo, Mingyue and Shao, Yiwen and Zhang, Yuheng},
journal={arXiv preprint arXiv:2601.06896},
year={2026}
}
```