TtT-3B (S2S)

TtT (Text-to-Talk) is a unified audio-language model for speech-to-speech (S2S) interaction. It combines autoregressive text modeling with non-autoregressive audio generation (via absorbing-state discrete diffusion) in a single Transformer to produce conversational spoken responses.

  • Code (training / inference): https://github.com/ai4ed/TtT
  • Model repo: Stephen-Lee/Pretrain-TtT-3B
  • Paper: From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training (ICLR 2026)

Quickstart

Installation

For full environment setup and end-to-end S2S examples, see the GitHub repository.

Basic Inference

from modeling_qwen_TtT import Qwen2ForARDiffLM
from transformers import AutoTokenizer

# Load model and tokenizer
model = Qwen2ForARDiffLM.from_pretrained("Stephen-Lee/TtT-3B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Stephen-Lee/TtT-3B", trust_remote_code=True)

# Ensure the mask token is set; the diffusion-based audio decoder
# fills masked positions during generation
if tokenizer.mask_token_id is None:
    tokenizer.mask_token_id = tokenizer.convert_tokens_to_ids("<|mask_token|>")

# Build a chat-format prompt; input audio is passed as discrete audio tokens
prompt = (
    "<|im_start|>user\n"
    "<|begin_of_audio|><|audio_1234|>...<|end_of_audio|><|im_end|>\n"
    "<|im_start|>assistant\n"
)

# `generate` is the helper provided by the repository's inference code
output = generate(model, tokenizer, prompt, max_gen_len=2048)

Advanced Generation

python inference_TtT.py

Notes

  • Input audio is represented as discrete audio tokens (e.g., <|audio_1234|>). Please refer to the GitHub repo for tokenization and end-to-end S2S usage.
  • If you encounter out-of-memory (OOM) errors, reduce max_gen_len or adjust the generation settings in the inference script.
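As a concrete illustration of the prompt format above, the snippet below builds a chat-style S2S prompt from a list of discrete audio token ids. This helper is not part of the TtT repository; it only assumes the token naming shown in the Basic Inference example (`<|audio_N|>`, `<|begin_of_audio|>`, `<|end_of_audio|>`).

```python
def build_s2s_prompt(audio_token_ids):
    """Hypothetical helper: wrap discrete audio token ids in the chat-format
    prompt used by the Basic Inference example above."""
    # Render each id N as the literal token string <|audio_N|>
    audio_tokens = "".join(f"<|audio_{i}|>" for i in audio_token_ids)
    return (
        "<|im_start|>user\n"
        f"<|begin_of_audio|>{audio_tokens}<|end_of_audio|><|im_end|>\n"
        "<|im_start|>assistant\n"
    )

# Example: a single audio token with id 1234
prompt = build_s2s_prompt([1234])
```

In practice the ids would come from the audio tokenizer described in the GitHub repository; the resulting string can then be fed to the model as in the Basic Inference example.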

Citation

If you find this model useful, please cite:

@inproceedings{liu2026ttt,
  title={From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training},
  author={Liu, Tianqiao and Li, Xueyi and Wang, Hao and Li, Haoxuan and Chen, Zhichao and Luo, Weiqi and Liu, Zitao},
  booktitle={Proceedings of the 14th International Conference on Learning Representations},
  month={April},
  year={2026},
  address={Rio de Janeiro, Brazil}
}

License

Apache-2.0
