---
license: apache-2.0
datasets:
- edinburghcstr/ami
language:
- en
metrics:
- wer
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: automatic-speech-recognition
tags:
- tagspeech
- diarization
- multi-speaker
---
# TagSpeech
**TagSpeech** is a **fully end-to-end multi-speaker ASR and diarization model**.
Given a raw waveform of a multi-speaker conversation, the model directly outputs **speaker-attributed transcriptions with timestamps and gender labels**, without requiring a separate diarization or clustering stage.
**Paper**: [TagSpeech: Unified E2E Multi-Speaker ASR and Diarization Model](https://arxiv.org/abs/2601.06896)
Available checkpoints:
- **English (AMI)**: `AudenAI/TagSpeech-AMI`
- **Mandarin (AliMeeting)**: `AudenAI/TagSpeech-Alimeeting`
---
## What Can This Model Do?
- **Multi-speaker speech recognition**
- **Speaker diarization**
- **Timestamped utterances**
- **Gender prediction**
- **Single forward pass** (no external diarization model required)
The model is designed for **meeting-style conversational audio** with overlapping speakers.
## Quick Inference
```python
import torch
from model import TagSpeech
from utils.xml_utils import xml_to_json
device = "cuda" if torch.cuda.is_available() else "cpu"
model = TagSpeech.from_pretrained("AudenAI/TagSpeech-AMI").to(device)
wav_files = ["assets/test_example_AMI_EN2002c-12-0-35.wav"]
audio_token = model.config.audio_token

# One chat-style prompt per audio file; the audio token is a placeholder
# that the model fills with audio features at inference time.
messages = [
    [{"role": "user", "content": f"<text>{audio_token}</text>\n<speaker>{audio_token}</speaker>"}]
    for _ in wav_files
]
outputs = model.generate(wav_files, messages, max_new_tokens=800, num_beams=1, do_sample=False)
# Print outputs in XML and JSON formats
for i, output in enumerate(outputs, 1):
print(f"\n{'='*80}\nOutput {i}/{len(outputs)} - XML:\n{'='*80}\n{output}\n{'='*80}")
json_output = xml_to_json(output)
if json_output:
print(f"\nOutput {i}/{len(outputs)} - JSON:\n{'='*80}\n{json_output}\n{'='*80}")
else:
        print(f"\nWarning: Output {i} could not be parsed as valid XML\n{'='*80}")
```
**Example Output**
```json
{
"segments": [
{
"start": 0.0,
"end": 2.6,
"text": "oh right so oh so that that is all st stored to",
"speaker_id": "1",
"speaker_gender": "female"
},
{
"start": 2.15,
"end": 5.88,
"text": "the speaker class knows about all of that stuff and the meeting class knows about that stuff",
"speaker_id": "2",
"speaker_gender": "male"
},
{
"start": 4.12,
"end": 4.51,
"text": "alright",
"speaker_id": "1",
"speaker_gender": "female"
},
{
"start": 5.88,
"end": 6.75,
"text": "well",
"speaker_id": "2",
"speaker_gender": "male"
}
]
}
```
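The JSON output can be post-processed directly. As an illustration (the helper below is not part of this repository), here is a minimal sketch that groups segments by speaker and totals each speaker's talk time, using the example output above:

```python
# Parsed JSON output in the format shown above
result = {
    "segments": [
        {"start": 0.0, "end": 2.6, "text": "oh right so oh so that that is all st stored to",
         "speaker_id": "1", "speaker_gender": "female"},
        {"start": 2.15, "end": 5.88, "text": "the speaker class knows about all of that stuff and the meeting class knows about that stuff",
         "speaker_id": "2", "speaker_gender": "male"},
        {"start": 4.12, "end": 4.51, "text": "alright",
         "speaker_id": "1", "speaker_gender": "female"},
        {"start": 5.88, "end": 6.75, "text": "well",
         "speaker_id": "2", "speaker_gender": "male"},
    ]
}

def group_by_speaker(segments):
    """Collect each speaker's segments in temporal order."""
    by_speaker = {}
    for seg in sorted(segments, key=lambda s: s["start"]):
        by_speaker.setdefault(seg["speaker_id"], []).append(seg)
    return by_speaker

for spk, segs in group_by_speaker(result["segments"]).items():
    total = sum(s["end"] - s["start"] for s in segs)
    print(f"Speaker {spk} ({segs[0]['speaker_gender']}): "
          f"{len(segs)} segments, {total:.2f}s of speech")
```

Note that segments from different speakers may overlap in time (e.g. speaker 1's "alright" falls inside speaker 2's turn), so per-speaker totals can exceed the audio duration divided by the number of speakers.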
## Model Characteristics
- Input: Raw audio waveform (16 kHz recommended)
- Output: Speaker-attributed transcriptions with timestamps, emitted as XML that can be parsed into JSON
- Backend LLM: Qwen2.5-7B-Instruct (frozen)
- Architecture: Dual encoders (semantic + [voice](https://huggingface.co/AudenAI/auden-encoder-voice)) with numeric time anchors
## Limitations
- This checkpoint is trained on **approximately 65 hours of AMI meeting speech** only, and is primarily optimized for **noisy, far-field, multi-speaker meeting scenarios**. Performance may degrade on out-of-domain audio (e.g., clean close-talk speech or other acoustic conditions). For best results, we recommend fine-tuning on in-domain data.
- The model is intended for **short inputs (≤ 30 seconds)**. For long-form recordings, **chunk-based inference** is required; chunking and post-processing logic are not provided in this repository.
- This model is designed for **offline inference only** and does **not support real-time or streaming ASR**.
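Since chunking logic is left to the user, here is a minimal sketch of one possible approach: splitting a long waveform into fixed 30 s windows with overlap, so segments that straddle a boundary can be stitched in post-processing. The window and hop sizes below are illustrative assumptions, not values recommended by the authors:

```python
import numpy as np

def chunk_audio(waveform, sr=16000, window_s=30.0, hop_s=25.0):
    """Split a mono waveform into fixed-length windows with overlap.

    The overlap (window_s - hop_s seconds) gives downstream post-processing
    context to merge utterances cut at a chunk boundary.
    Returns a list of (start_time_seconds, chunk_array) pairs.
    """
    window = int(window_s * sr)
    hop = int(hop_s * sr)
    chunks = []
    for start in range(0, max(len(waveform), 1), hop):
        chunk = waveform[start:start + window]
        if len(chunk) == 0:
            break
        chunks.append((start / sr, chunk))
        if start + window >= len(waveform):
            break  # last window already covers the end of the audio
    return chunks

# Example: 70 s of audio at 16 kHz -> three windows with 5 s of overlap
audio = np.zeros(70 * 16000, dtype=np.float32)
pieces = chunk_audio(audio)
print([(t, len(c) / 16000) for t, c in pieces])
```

Each chunk would then be written out (or passed in-memory, if supported) and run through `model.generate` as in the Quick Inference example, with the per-chunk timestamps offset by the chunk's start time before merging.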
## Citation
If you use TagSpeech in your research, please cite:
```bibtex
@article{huo2026tagspeech,
title={TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding},
author={Huo, Mingyue and Shao, Yiwen and Zhang, Yuheng},
journal={arXiv preprint arXiv:2601.06896},
year={2026}
}
```