# TtT-3B (S2S)

TtT (Text-to-Talk) is a unified audio-language model for speech-to-speech (S2S) interaction. It combines autoregressive text modeling with non-autoregressive audio generation (absorbing discrete diffusion) in a single Transformer to produce conversational spoken responses.
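To give an intuition for the non-autoregressive side, here is a toy sketch of absorbing-diffusion-style decoding: start from an all-mask sequence and iteratively unmask positions in parallel. This is a conceptual illustration only, not TtT's actual decoder; the `predict` callable stands in for the Transformer's per-position prediction.

```python
import random

MASK = "<mask>"

def toy_unmask_step(tokens, predict, frac=0.5):
    """Fill a fraction of the remaining masked positions with predictions.

    `predict` is a stand-in for the model's per-position token prediction:
    any callable mapping a position index to a token string.
    """
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    if not masked:
        return tokens
    k = max(1, int(len(masked) * frac))
    for i in random.sample(masked, k):
        tokens[i] = predict(i)
    return tokens

def toy_nar_decode(length, predict, frac=0.5):
    # Absorbing diffusion starts from an all-mask sequence and repeatedly
    # unmasks subsets of positions in parallel until none remain.
    tokens = [MASK] * length
    while MASK in tokens:
        toy_unmask_step(tokens, predict, frac)
    return tokens

print(toy_nar_decode(8, lambda i: f"<audio_{i}>"))
```

The real model interleaves this kind of parallel audio-token refinement with autoregressive text decoding in one network; see the paper for the actual training and sampling procedure.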
- Code (training / inference): https://github.com/ai4ed/TtT
- Model repo: Stephen-Lee/TtT-3B
- Paper: *From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training* (ICLR 2026)
## Quickstart

### Installation
For full environment setup and end-to-end S2S examples, please follow the GitHub repository.
### Basic Inference

```python
from transformers import AutoTokenizer

# Qwen2ForARDiffLM is defined in the model repo and loaded via trust_remote_code
from modeling_qwen_TtT import Qwen2ForARDiffLM

# Load model and tokenizer
model = Qwen2ForARDiffLM.from_pretrained("Stephen-Lee/TtT-3B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Stephen-Lee/TtT-3B", trust_remote_code=True)

# Set up the mask token used by the diffusion-based audio generation
if tokenizer.mask_token_id is None:
    tokenizer.mask_token_id = tokenizer.convert_tokens_to_ids("<|mask_token|>")

# Build a chat-style prompt containing discrete audio tokens
prompt = (
    "<|im_start|>user\n"
    "<|begin_of_audio|><|audio_1234|>...<|end_of_audio|><|im_end|>\n"
    "<|im_start|>assistant\n"
)

# `generate` is the helper provided in the repo's inference script
output = generate(model, tokenizer, prompt, max_gen_len=2048)
```
### Advanced Generation

```bash
python inference_TtT.py
```
## Notes

- Input audio is represented as discrete audio tokens (e.g., `<|audio_1234|>`). Please refer to the GitHub repo for tokenization and end-to-end S2S usage.
- If you encounter OOM, reduce max_gen_len and adjust generation settings in the inference script.
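As a small illustration of how a sequence of discrete audio-codec token IDs could be wrapped into the prompt format used above, here is a minimal helper. The `<|audio_{id}|>` naming is an assumption based on the Basic Inference example; confirm the exact tokenization against the GitHub repo.

```python
def audio_ids_to_prompt(ids):
    """Wrap discrete audio-codec token IDs in the chat prompt format from
    the Basic Inference example. The <|audio_{id}|> pattern is assumed
    from that example, not confirmed against the repo's tokenizer."""
    body = "".join(f"<|audio_{i}|>" for i in ids)
    return (
        "<|im_start|>user\n"
        f"<|begin_of_audio|>{body}<|end_of_audio|><|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(audio_ids_to_prompt([1234, 56, 789]))
```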
## Citation

If you find this model useful, please cite:
```bibtex
@inproceedings{liu2026ttt,
  title     = {From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training},
  author    = {Liu, Tianqiao and Li, Xueyi and Wang, Hao and Li, Haoxuan and Chen, Zhichao and Luo, Weiqi and Liu, Zitao},
  booktitle = {Proceedings of the 14th International Conference on Learning Representations},
  month     = {April},
  year      = {2026},
  address   = {Rio de Janeiro, Brazil}
}
```
## License

Apache-2.0