Chatterbox: Gujarati fine-tune
A fine-tuned Chatterbox TTS model for Gujarati text-to-speech with voice cloning and emotion control.
Model details
| Attribute | Value |
|---|---|
| Base model | ResembleAI/chatterbox (Multilingual) |
| Architecture | CosyVoice 2.0, LLaMA-based T3 (0.5B params) |
| Language | Gujarati (gu) |
| Training hardware | NVIDIA L4 (24 GB VRAM) |
| Fine-tuning repo | gokhaneraslan/chatterbox-finetuning |
| Vocab extension | 2454 → 2514 tokens (+60 Gujarati characters) |
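The vocabulary extension in the table can be pictured with a minimal sketch: new characters are appended to an existing character-to-ID map, each receiving the next free ID. The function name `extend_vocab` and the toy vocabulary are hypothetical, and the actual 60 Gujarati characters and the fine-tuning repo's own extension code are not reproduced here.

```python
def extend_vocab(vocab: dict[str, int], new_chars) -> dict[str, int]:
    """Return a copy of `vocab` with unseen characters appended at the next free IDs.

    Hypothetical helper for illustration; not the fine-tuning repo's API.
    """
    extended = dict(vocab)  # leave the caller's mapping untouched
    next_id = max(extended.values(), default=-1) + 1
    for ch in new_chars:
        if ch not in extended:  # skip characters the base vocab already covers
            extended[ch] = next_id
            next_id += 1
    return extended


# Toy example: a 2-entry base vocab extended with two Gujarati letters.
base = {"a": 0, "b": 1}
extended = extend_vocab(base, ["ક", "ખ", "a"])
```

In a real model, the token embedding table would also need one new row per added ID; that step is not shown in this sketch.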
Training data
Fine-tuned on Gujarati clips from Arjun4707/gu-hi-tts (~33,851 clips after characters-per-second (CPS) and diarization filtering).
Data source: audio clips scraped from publicly available YouTube videos. Clips were preprocessed with speaker diarization (pyannote-audio) to keep only single-speaker clips, then filtered by CPS (4-25 chars/sec) and by duration (2-20 s).
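The CPS and duration filters described above could look roughly like the following sketch. The `Clip` type and the function names are hypothetical. The sketch also assumes that the character count is the length of the stripped transcript in Unicode code points (spaces and combining marks included) and that both ranges are inclusive; the actual pipeline may count differently.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    """Hypothetical record for one audio clip and its transcript."""
    text: str
    duration_s: float


def chars_per_second(clip: Clip) -> float:
    # Assumption: count Unicode code points of the stripped transcript.
    return len(clip.text.strip()) / clip.duration_s


def keep_clip(
    clip: Clip,
    min_cps: float = 4.0,
    max_cps: float = 25.0,
    min_dur: float = 2.0,
    max_dur: float = 20.0,
) -> bool:
    """Apply the duration filter (2-20 s) and then the CPS filter (4-25 chars/sec)."""
    if clip.duration_s <= 0:
        return False  # guard against division by zero
    if not (min_dur <= clip.duration_s <= max_dur):
        return False
    return min_cps <= chars_per_second(clip) <= max_cps
```

A clip with 40 characters over 4 seconds (10 chars/sec) passes both checks, while the same text over 1 second is rejected by the duration filter.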
Known limitations
- Very short utterances (1-3 words) produce poor quality; the architecture needs a minimum of roughly 5 words
- Medium to long sentences (5-30 words) produce good quality with clear Gujarati pronunciation
- Not suitable for generating audio shorter than 2 s
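Given the limits above, callers may want to check input length before synthesis. The following is a minimal sketch, assuming words are separated by whitespace; `is_long_enough` is a hypothetical helper and not part of Chatterbox.

```python
def is_long_enough(text: str, min_words: int = 5) -> bool:
    """Return True when `text` has at least `min_words` whitespace-separated words."""
    return len(text.split()) >= min_words


short_ok = is_long_enough("નમસ્તે")                      # single word
long_ok = is_long_enough("હું આજે શાળાએ જઈ રહ્યો છું")    # six words
```

Inputs that fail the check could be padded with context or merged with a neighboring sentence before being passed to the model.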
Training code
Full training pipeline and troubleshooting: BhammarArjun/TTS_4_training
License
CC-BY-NC-4.0: non-commercial use only.
The base Chatterbox model is MIT-licensed, but this fine-tuned version uses YouTube-sourced training data. To be transparent and responsible about data provenance, we apply CC-BY-NC-4.0 to this fine-tuned version.
Citation
@misc{arjun2026chatterboxgu,
  title  = {Chatterbox fine-tuned for Gujarati},
  author = {Arjun Bhammar},
  year   = {2026},
  url    = {https://huggingface.co/Arjun4707/chatterbox-gujarati}
}
Acknowledgements
- Resemble AI for the Chatterbox TTS model
- gokhaneraslan for the fine-tuning framework