---
license: apache-2.0
datasets:
- edinburghcstr/ami
language:
- en
metrics:
- wer
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: automatic-speech-recognition
tags:
- tagspeech
- diarization
- multi-speaker
---

# TagSpeech

**TagSpeech** is a **fully end-to-end multi-speaker ASR and diarization model**.  
Given a raw waveform of a multi-speaker conversation, the model directly outputs **speaker-attributed transcriptions with timestamps and gender labels**, without requiring a separate diarization or clustering stage.

πŸ”— **Paper**: [TagSpeech: Unified E2E Multi-Speaker ASR and Diarization Model](https://arxiv.org/abs/2601.06896)

Available checkpoints:
- **English (AMI)**: `AudenAI/TagSpeech-AMI`
- **Mandarin (AliMeeting)**: `AudenAI/TagSpeech-Alimeeting`

---

## πŸ” What Can This Model Do?

- πŸŽ™οΈ **Multi-speaker speech recognition**
- πŸ§‘β€πŸ€β€πŸ§‘ **Speaker diarization**
- ⏱️ **Timestamped utterances**
- 🚻 **Gender prediction**
- 🧩 **Single forward pass** (no external diarization model required)

The model is designed for **meeting-style conversational audio** with overlapping speakers.


## Quick Inference

```python
import torch
from model import TagSpeech
from utils.xml_utils import xml_to_json

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TagSpeech.from_pretrained("AudenAI/TagSpeech-AMI").to(device) 

wav_files = ["assets/test_example_AMI_EN2002c-12-0-35.wav"]

audio_token = model.config.audio_token
messages = [
    [{"role": "user", "content": f"<text>{audio_token}</text>\n<speaker>{audio_token}</speaker>"}]
    for _ in wav_files
]

outputs = model.generate(wav_files, messages, max_new_tokens=800, num_beams=1, do_sample=False)

# Print outputs in XML and JSON formats
for i, output in enumerate(outputs, 1):
    print(f"\n{'='*80}\nOutput {i}/{len(outputs)} - XML:\n{'='*80}\n{output}\n{'='*80}")
    
    json_output = xml_to_json(output)
    if json_output:
        print(f"\nOutput {i}/{len(outputs)} - JSON:\n{'='*80}\n{json_output}\n{'='*80}")
    else:
        print(f"\n⚠️  Warning: Output {i} could not be parsed as valid XML\n{'='*80}")
```

**Example Output (JSON, after `xml_to_json`)**
```
{
  "segments": [
    {
      "start": 0.0,
      "end": 2.6,
      "text": "oh right so oh so that that is all st stored to",
      "speaker_id": "1",
      "speaker_gender": "female"
    },
    {
      "start": 2.15,
      "end": 5.88,
      "text": "the speaker class knows about all of that stuff and the meeting class knows about that stuff",
      "speaker_id": "2",
      "speaker_gender": "male"
    },
    {
      "start": 4.12,
      "end": 4.51,
      "text": "alright",
      "speaker_id": "1",
      "speaker_gender": "female"
    },
    {
      "start": 5.88,
      "end": 6.75,
      "text": "well",
      "speaker_id": "2",
      "speaker_gender": "male"
    }
  ]
}
```

## πŸ“Œ Model Characteristics

- Input: Raw audio waveform (16 kHz recommended)
- Output: Speaker-attributed transcription with timestamps in XML format, which can be parsed into JSON
- Backend LLM: Qwen2.5-7B-Instruct (frozen)
- Architecture: Dual encoders (semantic + [voice](https://huggingface.co/AudenAI/auden-encoder-voice)) with numeric time anchors

## ⚠️ Limitations

- This checkpoint is trained on **approximately 65 hours of AMI meeting speech** only, and is primarily optimized for **noisy, far-field, multi-speaker meeting scenarios**. Performance may degrade on out-of-domain audio (e.g., clean close-talk speech or other acoustic conditions). For best results, we recommend fine-tuning on in-domain data.

- The model is recommended for **short inputs (≀ 30 seconds)**. For long-form recordings, **chunk-based inference** is required; chunking and post-processing logic are not provided in this repository.

- This model is designed for **offline inference only** and does **not support real-time or streaming ASR**.
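Since chunking logic is not shipped with this repository, the following is a minimal, hypothetical sketch of a fixed-window chunker with a small overlap, so utterances cut at a chunk boundary are likely to appear whole in at least one window. The function name `chunk_waveform` and the window/overlap sizes are illustrative assumptions, not part of TagSpeech; merging overlapping transcripts back together is likewise left to the user.

```python
# Illustrative only: fixed-window chunking for long-form audio.
# TagSpeech does not ship chunking logic; names and sizes here are assumptions.

def chunk_waveform(num_samples, sample_rate=16000, window_s=30.0, overlap_s=2.0):
    """Yield (start_sample, end_sample) windows covering the whole recording.

    Consecutive windows overlap by `overlap_s` seconds so that utterances
    split at a boundary are fully contained in at least one window.
    """
    window = int(window_s * sample_rate)
    hop = int((window_s - overlap_s) * sample_rate)
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        yield start, end
        if end == num_samples:
            break
        start += hop

# Example: a 70-second recording at 16 kHz splits into three 30 s windows
spans = list(chunk_waveform(70 * 16000))
# β†’ [(0, 480000), (448000, 928000), (896000, 1120000)]
```

Each `(start, end)` span can be sliced out of the waveform, written to a temporary file, and passed through `model.generate` as in the example above; per-chunk timestamps then need to be offset by `start / sample_rate` before merging.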


## Citation
If you use TagSpeech in your research, please cite:

```
@article{huo2026tagspeech,
  title={TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding},
  author={Huo, Mingyue and Shao, Yiwen and Zhang, Yuheng},
  journal={arXiv preprint arXiv:2601.06896},
  year={2026}
}
```