# ScrappyLabs Narrator TTS
A fine-tuned Qwen3-TTS 1.7B CustomVoice model producing a warm, expressive narrator voice optimized for storytelling, audiobooks, and conversational AI.
## What Is This?
This is one of the first public fine-tunes of Qwen3-TTS. We took the 1.7B CustomVoice base model and trained it on 461 curated narrator-style audio samples to produce a consistent, high-quality storytelling voice without needing a reference audio clip at inference time.
Unlike voice cloning (which requires a reference WAV on every request), this finetune bakes the voice identity into the model weights: just pass text in and get narrator audio out.
## Audio Samples

### Storytelling

"Once upon a time, in a forest older than memory, there lived a bear who collected stories the way other bears collected honey."

### Suspense

"The door creaked open. Inside, the room was empty, except for a single envelope on the table, addressed to no one."

### Educational

"The octopus has three hearts, blue blood, and the ability to change both the color and texture of its skin in less than a second."

### Conversational

"Hey, you know what? Today was a good day. Not perfect, but good. And sometimes that's more than enough."
## Key Features

- No reference audio needed: the narrator voice is embedded in the weights
- Instruct/emotion support: accepts style instructions (e.g., "warm and gentle", "excited", "mysterious whisper")
- Multilingual capable: inherits the Qwen3-TTS base model's multilingual support (en, zh, ja, ko, de, fr, ru, pt, es, it)
- Drop-in compatible: works with any Qwen3-TTS serving stack
- Apache 2.0: fully open, commercial use OK
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice |
| Training samples | 461 WAV clips (varied lengths, diverse content) |
| Epochs | 5 |
| Batch size | 64 |
| Learning rate | 2e-5 |
| Warmup steps | 200 |
| Precision | bf16 |
| Gradient checkpointing | Enabled |
| Block size | 10240 |
| Hardware | NVIDIA RTX PRO 6000 (Blackwell, 48GB VRAM) |
The training data consists of narrator-style speech samples covering a range of emotional expressions, pacing, and content types, from calm storytelling to dramatic narration.
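As a rough sanity check on the numbers in the table, the run length can be computed in a couple of lines (a sketch, assuming one optimizer step per 64-sample batch and no gradient accumulation; the card does not state either detail):

```python
import math

# Values taken from the training table above.
samples, batch_size, epochs = 461, 64, 5

steps_per_epoch = math.ceil(samples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 8 40
```

This is a very short run, which is typical for voice-identity finetunes: the goal is to imprint a speaker, not to retrain the base model's speech capabilities.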
## Model Size

| Component | Size |
|---|---|
| Model weights (model.safetensors) | 3.8 GB |
| Speech tokenizer | 682 MB |
| Total | ~4.5 GB |
## Usage

### With Qwen3-TTS Server
If you're running a Qwen3-TTS server (like qwen3-tts-server), register this as a finetuned speaker and request it by name:
```python
import requests

response = requests.post("http://localhost:8109/v1/audio/speech", json={
    "input": "Once upon a time, in a land far far away, there lived a brave little bear.",
    "voice": "narrator"
})

with open("output.wav", "wb") as f:
    f.write(response.content)
```
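For batch jobs (e.g., narrating a whole story segment by segment), it can be convenient to wrap the request in a small helper. This is an illustrative sketch: the endpoint URL and the `input`/`voice`/`instruct` fields come from the examples in this section, while `build_payload` and `synthesize` are hypothetical names, not part of any shipped client.

```python
BASE_URL = "http://localhost:8109/v1/audio/speech"  # same endpoint as above

def build_payload(text, voice="narrator", instruct=None):
    """Assemble the JSON body used by the examples in this section."""
    payload = {"input": text, "voice": voice}
    if instruct is not None:
        payload["instruct"] = instruct
    return payload

def synthesize(text, out_path, **kwargs):
    """POST one request and write the returned WAV bytes to disk."""
    import requests  # imported lazily so build_payload stays dependency-free

    response = requests.post(BASE_URL, json=build_payload(text, **kwargs))
    response.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(response.content)
```

Usage would look like `synthesize("The forest was quiet.", "part1.wav", instruct="calm, slow pace")`, one call per segment.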
### With Instruct (Emotion Control)
```python
response = requests.post("http://localhost:8109/v1/audio/speech", json={
    "input": "And then, from the shadows... something moved.",
    "voice": "narrator",
    "instruct": "mysterious and suspenseful, slow pace"
})
```
### Direct with Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "scrappylabs/narrator-tts",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("scrappylabs/narrator-tts")
```
See the Qwen3-TTS documentation for full inference code.
## Why Finetune vs. Voice Clone?

| | Voice Clone | Finetune |
|---|---|---|
| Reference audio needed | Yes (every request) | No |
| Voice consistency | Good | Excellent |
| Inference latency | Higher (processes reference) | Lower (voice is in weights) |
| Emotion/instruct control | Limited | Full support |
| Deployment size | Base model + WAV files | Single model checkpoint |
For production use cases where you need a consistent character voice across thousands of generations, finetuning wins.
## Use Cases
- Audiobook narration
- Children's story apps
- Podcast intros/outros
- AI assistant voice
- Game narration
- Any application needing a warm, consistent narrator
## About ScrappyLabs

ScrappyLabs builds open-source voice AI tools. This model powers our production TTS pipeline, providing the narrator voice across our apps, including Poo Bear (an AI-powered teddy bear for kids).
## License

Apache 2.0, the same license as the base Qwen3-TTS model. Use it for anything.
## Citation
If you use this model, a mention of ScrappyLabs is appreciated but not required.
```bibtex
@misc{scrappylabs-narrator-tts-2026,
  author = {ScrappyLabs},
  title  = {ScrappyLabs Narrator TTS: A Qwen3-TTS 1.7B Voice Finetune},
  year   = {2026},
  url    = {https://huggingface.co/scrappylabs/narrator-tts}
}
```