---
license: apache-2.0
datasets:
- edinburghcstr/ami
language:
- en
metrics:
- wer
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: automatic-speech-recognition
tags:
- tagspeech
- diarization
- multi-speaker
---

# TagSpeech

**TagSpeech** is a **fully end-to-end multi-speaker ASR and diarization model**.  
Given a raw waveform of a multi-speaker conversation, the model directly outputs **speaker-attributed transcriptions with timestamps and gender labels**, without requiring a separate diarization or clustering stage.

πŸ”— **Paper**: [TagSpeech: Unified E2E Multi-Speaker ASR and Diarization Model](https://arxiv.org/abs/2601.06896)

Available checkpoints:
- **English (AMI)**: `AudenAI/TagSpeech-AMI`
- **Mandarin (AliMeeting)**: `AudenAI/TagSpeech-Alimeeting`

---

## πŸ” What Can This Model Do?

- πŸŽ™οΈ **Multi-speaker speech recognition**
- πŸ§‘β€πŸ€β€πŸ§‘ **Speaker diarization**
- ⏱️ **Timestamped utterances**
- 🚻 **Gender prediction**
- 🧩 **Single forward pass** (no external diarization model required)

The model is designed for **meeting-style conversational audio** with overlapping speakers.


## Quick Inference

```python
import torch
from model import TagSpeech
from utils.xml_utils import xml_to_json

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TagSpeech.from_pretrained("AudenAI/TagSpeech-AMI").to(device) 

wav_files = ["assets/test_example_AMI_EN2002c-12-0-35.wav"]

audio_token = model.config.audio_token
messages = [
    [{"role": "user", "content": f"<text>{audio_token}</text>\n<speaker>{audio_token}</speaker>"}]
    for _ in wav_files
]

outputs = model.generate(wav_files, messages, max_new_tokens=800, num_beams=1, do_sample=False)

# Print outputs in XML and JSON formats
for i, output in enumerate(outputs, 1):
    print(f"\n{'='*80}\nOutput {i}/{len(outputs)} - XML:\n{'='*80}\n{output}\n{'='*80}")
    
    json_output = xml_to_json(output)
    if json_output:
        print(f"\nOutput {i}/{len(outputs)} - JSON:\n{'='*80}\n{json_output}\n{'='*80}")
    else:
        print(f"\n⚠️  Warning: Output {i} could not be parsed as valid XML\n{'='*80}")
```

**Example Output (JSON, after `xml_to_json`)**
```
{
  "segments": [
    {
      "start": 0.0,
      "end": 2.6,
      "text": "oh right so oh so that that is all st stored to",
      "speaker_id": "1",
      "speaker_gender": "female"
    },
    {
      "start": 2.15,
      "end": 5.88,
      "text": "the speaker class knows about all of that stuff and the meeting class knows about that stuff",
      "speaker_id": "2",
      "speaker_gender": "male"
    },
    {
      "start": 4.12,
      "end": 4.51,
      "text": "alright",
      "speaker_id": "1",
      "speaker_gender": "female"
    },
    {
      "start": 5.88,
      "end": 6.75,
      "text": "well",
      "speaker_id": "2",
      "speaker_gender": "male"
    }
  ]
}
```

## πŸ“Œ Model Characteristics

- Input: Raw audio waveform (16 kHz recommended)
- Output: Speaker-attributed transcription with timestamps in XML format, which can be parsed into JSON
- Backend LLM: Qwen2.5-7B-Instruct (frozen)
- Architecture: Dual encoders (semantic + [voice](https://huggingface.co/AudenAI/auden-encoder-voice)) with numeric time anchors

## ⚠️ Limitations

- This checkpoint is trained on **approximately 65 hours of AMI meeting speech** only, and is primarily optimized for **noisy, far-field, multi-speaker meeting scenarios**. Performance may degrade on out-of-domain audio (e.g., clean close-talk speech or other acoustic conditions). For best results, we recommend fine-tuning on in-domain data.

- The model is recommended for **short inputs (≀ 30 seconds)**. For long-form recordings, **chunk-based inference** is required; chunking and post-processing logic are not provided in this repository.

- This model is designed for **offline inference only** and does **not support real-time or streaming ASR**.
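Since chunking logic is not shipped with this repository, the following is a minimal, hypothetical sketch of a fixed-window chunker with a small overlap, so utterances cut at a chunk boundary are likely to appear whole in at least one window. The function name `chunk_waveform` and the window/overlap sizes are illustrative assumptions, not part of TagSpeech; merging overlapping transcripts back together is likewise left to the user.

```python
# Illustrative only: fixed-window chunking for long-form audio.
# TagSpeech does not ship chunking logic; names and sizes here are assumptions.

def chunk_waveform(num_samples, sample_rate=16000, window_s=30.0, overlap_s=2.0):
    """Yield (start_sample, end_sample) windows covering the whole recording.

    Consecutive windows overlap by `overlap_s` seconds so that utterances
    split at a boundary are fully contained in at least one window.
    """
    window = int(window_s * sample_rate)
    hop = int((window_s - overlap_s) * sample_rate)
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        yield start, end
        if end == num_samples:
            break
        start += hop

# Example: a 70-second recording at 16 kHz splits into three 30 s windows
spans = list(chunk_waveform(70 * 16000))
# β†’ [(0, 480000), (448000, 928000), (896000, 1120000)]
```

Each `(start, end)` span can be sliced out of the waveform, written to a temporary file, and passed through `model.generate` as in the example above; per-chunk timestamps then need to be offset by `start / sample_rate` before merging.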


## Citation
If you use TagSpeech in your research, please cite:

```
@article{huo2026tagspeech,
  title={TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding},
  author={Huo, Mingyue and Shao, Yiwen and Zhang, Yuheng},
  journal={arXiv preprint arXiv:2601.06896},
  year={2026}
}
```