---
license: apache-2.0
datasets:
- edinburghcstr/ami
language:
- en
metrics:
- wer
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: automatic-speech-recognition
tags:
- tagspeech
- diarization
- multi-speaker
---

# TagSpeech

**TagSpeech** is a **fully end-to-end multi-speaker ASR and diarization model**. Given a raw waveform of a multi-speaker conversation, the model directly outputs **speaker-attributed transcriptions with timestamps and gender labels**, without requiring a separate diarization or clustering stage.

šŸ”— **Paper**: [TagSpeech: Unified E2E Multi-Speaker ASR and Diarization Model](https://arxiv.org/abs/2601.06896)

Available checkpoints:

- **English (AMI)**: `AudenAI/TagSpeech-AMI`
- **Mandarin (AliMeeting)**: `AudenAI/TagSpeech-Alimeeting`

---

## šŸ” What Can This Model Do?

- šŸŽ™ļø **Multi-speaker speech recognition**
- šŸ§‘ā€šŸ¤ā€šŸ§‘ **Speaker diarization**
- ā±ļø **Timestamped utterances**
- 🚻 **Gender prediction**
- 🧩 **Single forward pass** (no external diarization model required)

The model is designed for **meeting-style conversational audio** with overlapping speakers.

## Quick Inference

```python
import torch

from model import TagSpeech
from utils.xml_utils import xml_to_json

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TagSpeech.from_pretrained("AudenAI/TagSpeech-AMI").to(device)

wav_files = ["assets/test_example_AMI_EN2002c-12-0-35.wav"]

# Build one chat message per audio file, each carrying the model's audio placeholder tokens
audio_token = model.config.audio_token
messages = [
    [{"role": "user", "content": f"{audio_token}\n{audio_token}"}]
    for _ in wav_files
]

outputs = model.generate(wav_files, messages, max_new_tokens=800, num_beams=1, do_sample=False)

# Print outputs in XML and JSON formats
for i, output in enumerate(outputs, 1):
    print(f"\n{'='*80}\nOutput {i}/{len(outputs)} - XML:\n{'='*80}\n{output}\n{'='*80}")
    json_output = xml_to_json(output)
    if json_output:
        print(f"\nOutput {i}/{len(outputs)} - JSON:\n{'='*80}\n{json_output}\n{'='*80}")
    else:
        print(f"\nāš ļø Warning: Output {i} could not be parsed as valid XML\n{'='*80}")
```

**Example Output**

```
{
  "segments": [
    {
      "start": 0.0,
      "end": 2.6,
      "text": "oh right so oh so that that is all st stored to",
      "speaker_id": "1",
      "speaker_gender": "female"
    },
    {
      "start": 2.15,
      "end": 5.88,
      "text": "the speaker class knows about all of that stuff and the meeting class knows about that stuff",
      "speaker_id": "2",
      "speaker_gender": "male"
    },
    {
      "start": 4.12,
      "end": 4.51,
      "text": "alright",
      "speaker_id": "1",
      "speaker_gender": "female"
    },
    {
      "start": 5.88,
      "end": 6.75,
      "text": "well",
      "speaker_id": "2",
      "speaker_gender": "male"
    }
  ]
}
```

## šŸ“Œ Model Characteristics

- Input: Raw audio waveform (16 kHz recommended)
- Output: Speaker-attributed ASR with timestamps, emitted as XML that can be parsed into JSON
- Backend LLM: Qwen2.5-7B-Instruct (frozen)
- Architecture: Dual encoders (semantic + [voice](https://huggingface.co/AudenAI/auden-encoder-voice)) with numeric time anchors

## āš ļø Limitations

- This checkpoint is trained on **approximately 65 hours of AMI meeting speech** only and is primarily optimized for **noisy, far-field, multi-speaker meeting scenarios**. Performance may degrade on out-of-domain audio (e.g., clean close-talk speech or other acoustic conditions). For best results, we recommend fine-tuning on in-domain data.
- The model is recommended for **short inputs (≤ 30 seconds)**. For long-form recordings, **chunk-based inference** is required; chunking and post-processing logic are not provided in this repository (a minimal sketch is shown after this list).
- This model is designed for **offline inference only** and does **not support real-time or streaming ASR**.
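The snippet below is a minimal sketch of fixed-window chunked inference for long recordings, not an official utility from this repository. It assumes `torchaudio` is available for loading and resampling, uses non-overlapping 30-second windows written to temporary WAV files, and calls `model.generate` exactly as in the Quick Inference example. Merging segments across chunk boundaries and re-linking per-chunk speaker IDs are left out and would need additional post-processing.

```python
# Hedged sketch: naive fixed-window chunking for long recordings.
# Assumptions: torchaudio is installed, audio is resampled to 16 kHz,
# and `model` is a loaded TagSpeech instance as in the Quick Inference example.
import tempfile
from pathlib import Path

import torchaudio

TARGET_SR = 16_000
CHUNK_SECONDS = 30


def transcribe_long_recording(model, wav_path: str):
    waveform, sr = torchaudio.load(wav_path)        # (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)   # downmix to mono
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)

    chunk_len = CHUNK_SECONDS * TARGET_SR
    audio_token = model.config.audio_token
    results = []

    with tempfile.TemporaryDirectory() as tmpdir:
        for idx, start in enumerate(range(0, waveform.shape[1], chunk_len)):
            chunk = waveform[:, start:start + chunk_len]
            chunk_path = str(Path(tmpdir) / f"chunk_{idx:04d}.wav")
            torchaudio.save(chunk_path, chunk, TARGET_SR)

            messages = [[{"role": "user", "content": f"{audio_token}\n{audio_token}"}]]
            output = model.generate([chunk_path], messages,
                                    max_new_tokens=800, num_beams=1, do_sample=False)[0]
            # Timestamps in the XML are relative to the chunk; add the chunk
            # offset when merging. Speaker IDs are also per-chunk and would
            # need re-linking across chunks.
            results.append({"offset_sec": start / TARGET_SR, "xml": output})

    return results
```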
## Citation

If you use TagSpeech in your research, please cite:

```
@article{huo2026tagspeech,
  title={TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding},
  author={Huo, Mingyue and Shao, Yiwen and Zhang, Yuheng},
  journal={arXiv preprint arXiv:2601.06896},
  year={2026}
}
```