JetlinkTTS

This repository hosts an organization-managed copy of JetlinkTTS, which is based on openbmb/VoxCPM2, for multilingual text-to-speech, voice cloning, and controllable voice design workloads.

It is intended for teams that want to manage deployment, access, and internal distribution from their own namespace while preserving compatibility with the upstream model ecosystem.

Model Summary

JetlinkTTS is a tokenizer-free diffusion autoregressive text-to-speech model built for expressive multilingual speech generation. The upstream model card describes it as a 2B-parameter model supporting 30 languages, with 48kHz audio output, trained on over 2 million hours of multilingual speech data. It also supports voice cloning, voice design, streaming generation, and context-aware synthesis.

Key Features

Multilingual text-to-speech across 30 supported languages
Voice Design from natural-language voice descriptions
Controllable Voice Cloning from short reference audio
Ultimate Cloning using reference audio plus transcript for higher fidelity
48kHz studio-quality output
Streaming generation
Context-aware prosody and expressiveness
Commercial-friendly Apache-2.0 license

Supported Languages

According to the upstream model card, VoxCPM2 supports the following 30 languages:

Arabic
Burmese
Chinese
Danish
Dutch
English
Finnish
French
German
Greek
Hebrew
Hindi
Indonesian
Italian
Japanese
Khmer
Korean
Lao
Malay
Norwegian
Polish
Portuguese
Russian
Spanish
Swahili
Swedish
Tagalog
Thai
Turkish
Vietnamese

The upstream model card also lists support for several Chinese dialects, including:

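Sichuanese (四川话)
Cantonese (粤语)
Wu (吴语)
Northeastern Mandarin (东北话)
Henan dialect (河南话)
Shaanxi dialect (陕西话)
Shandong dialect (山东话)
Tianjin dialect (天津话)
Hokkien / Southern Min (闽南话)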

Intended Use

This model is suitable for:

multilingual speech synthesis
narration and audiobook generation
voice assistant backends
voice cloning workflows
creative voice design
subtitle dubbing and localization
conversational TTS pipelines
research and benchmarking

Model Details

Architecture

The upstream model card describes VoxCPM2 as:

Architecture: Tokenizer-free Diffusion Autoregressive (LocEnc → TSLM → RALM → LocDiT)
Backbone: Based on MiniCPM-4
Total parameters: 2B
Audio VAE: AudioVAE V2
Reference input: 16kHz
Output audio: 48kHz
Maximum sequence length: 8192 tokens
Default dtype: bfloat16
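
As a rough illustration of what the 16kHz reference rate and 48kHz output rate mean in practice, the sketch below computes sample counts and uncompressed payload sizes. The 16-bit mono PCM encoding and the helper names are assumptions made for illustration; the model details above do not specify an output encoding.

```python
REFERENCE_SAMPLE_RATE = 16_000  # Hz, reference input rate from the model details
OUTPUT_SAMPLE_RATE = 48_000     # Hz, output rate from the model details


def sample_count(duration_s: float, sample_rate: int) -> int:
    """Number of samples in a mono clip of the given duration."""
    return round(duration_s * sample_rate)


def pcm_size_bytes(duration_s: float, sample_rate: int, bytes_per_sample: int = 2) -> int:
    """Uncompressed mono PCM payload size (16-bit samples assumed by default)."""
    return sample_count(duration_s, sample_rate) * bytes_per_sample


# A 10-second output clip and a 5-second reference clip.
print(sample_count(10, OUTPUT_SAMPLE_RATE), "output samples")
print(pcm_size_bytes(10, OUTPUT_SAMPLE_RATE), "bytes as 16-bit mono PCM")
print(sample_count(5, REFERENCE_SAMPLE_RATE), "reference samples")
```

Under these assumptions, ten seconds of 48kHz output holds 480,000 samples, three times as many as the same duration at the 16kHz reference rate.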

Hardware Requirements

This model does not have a single universal minimum hardware requirement for all usage scenarios.

Actual requirements depend on:

inference backend
text length
streaming vs offline mode
voice cloning usage
concurrency
latency target
runtime configuration

Minimum System Requirements

The upstream model card explicitly reports ~8 GB VRAM in the model details section. It also lists the core software requirements as Python ≥ 3.10, PyTorch ≥ 2.5.0, and CUDA ≥ 12.0.

Practical memory guidance for JetlinkTTS:

Estimated practical minimum VRAM: ~8 GB
Recommended for smoother local development and testing: 12–16 GB VRAM
Recommended for production or higher concurrency: modern datacenter-class GPUs

Note: real memory usage can increase depending on text length, cloning mode, streaming usage, batch size, and backend overhead. The values above should be treated as practical guidance rather than hard universal limits.
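
To put the ~8 GB figure in context, the following back-of-the-envelope sketch estimates the memory taken by the weights alone, using the 2B parameter count and bfloat16 dtype from the model details. It is only an estimate under those two inputs: it does not model activations, caches, or runtime overhead, and the helper name is hypothetical.

```python
BF16_BYTES_PER_PARAM = 2   # bfloat16 stores each parameter in 2 bytes
TOTAL_PARAMS = 2e9         # "2B" from the model details


def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the raw weights, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


weights_gib = weight_memory_gib(TOTAL_PARAMS, BF16_BYTES_PER_PARAM)
print(f"weights only: {weights_gib:.2f} GiB")
```

The weights come to roughly 3.7 GiB, so the gap up to the ~8 GB reported upstream would have to cover everything this sketch leaves out.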

Reference Hardware

For practical deployment planning:

Development / light testing: a single modern GPU with around 8 GB VRAM or higher may be sufficient
Smoother local experimentation: 12–16 GB VRAM
Production-oriented serving: modern datacenter GPUs are recommended
Lower latency / higher throughput serving: optimized inference stacks should be considered

Software Requirements

Recommended environment:

Python 3.10 or newer
PyTorch 2.5.0 or newer
CUDA 12.0 or newer
Linux recommended for deployment
voxcpm package for upstream usage

Common dependencies may include:

torch
soundfile
voxcpm
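
A quick way to see whether an environment matches the list above is a small pre-flight check. The sketch below uses only the standard library and never imports the packages themselves; the function names are hypothetical, and it checks only the Python version and package presence, not the PyTorch or CUDA minimums.

```python
import sys
from importlib import metadata, util

MIN_PYTHON = (3, 10)                        # from the requirements above
PACKAGES = ("torch", "soundfile", "voxcpm")  # common dependencies listed above


def python_ok(version_info=sys.version_info) -> bool:
    """True if the interpreter meets the minimum Python version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def package_status(names=PACKAGES) -> dict:
    """Map each package name to its installed version, or None if it is missing."""
    status = {}
    for name in names:
        if util.find_spec(name) is None:
            status[name] = None
            continue
        try:
            status[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable, but no distribution metadata under this exact name.
            status[name] = "unknown"
    return status


print("Python OK:", python_ok())
for name, version in package_status().items():
    print(f"{name}: {version if version is not None else 'not installed'}")
```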

Quickstart

Install the upstream package:

pip install voxcpm

Basic usage:

import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("Jetlink/JetlinkTTS", load_denoiser=False)

wav = model.generate(
    text="JetlinkTTS delivers expressive multilingual speech generation.",
    cfg_value=2.0,
    inference_timesteps=10,
)

sf.write("output.wav", wav, model.tts_model.sample_rate)

Voice Design Example

You can guide the voice with a natural-language description placed in parentheses at the beginning of the text:

wav = model.generate(
    text="(A young woman, gentle and warm voice)Hello, welcome to JetlinkTTS!",
    cfg_value=2.0,
    inference_timesteps=10,
)
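
When descriptions come from user input or configuration, it can help to build the prefixed text in one place. The helper below is a hypothetical convenience, not part of the voxcpm package; rejecting parentheses inside the description is an assumption made here to keep the prefix unambiguous.

```python
def with_voice_description(description: str, text: str) -> str:
    """Prefix text with a parenthesized voice description, as in the example above."""
    description = description.strip()
    if not description:
        return text  # nothing to prepend
    if "(" in description or ")" in description:
        # Assumption: nested parentheses could make the prefix ambiguous.
        raise ValueError("description must not contain parentheses")
    return f"({description}){text}"


prompt = with_voice_description(
    "A young woman, gentle and warm voice",
    "Hello, welcome to JetlinkTTS!",
)
print(prompt)
```

The resulting string can be passed as the text argument of model.generate in the same way as the example above.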

Voice Cloning Example

Basic cloning with a short reference clip:

wav = model.generate(
    text="This is a cloned voice generated by JetlinkTTS.",
    reference_wav_path="speaker.wav",
)

Controllable cloning with style guidance:

wav = model.generate(
    text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
    reference_wav_path="speaker.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)

High-Fidelity / Ultimate Cloning

For maximum similarity, provide both the reference audio and its transcript:

wav = model.generate(
    text="This is a high-fidelity cloning demonstration using JetlinkTTS.",
    prompt_wav_path="speaker_reference.wav",
    prompt_text="The transcript of the reference audio.",
    reference_wav_path="speaker_reference.wav",
)
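
The cloning examples differ only in which keyword arguments they pass. As a sketch, a hypothetical helper (not part of the voxcpm package) can choose between basic cloning and the transcript-guided mode, reusing the same audio file for both path arguments as the example above does.

```python
from typing import Optional


def cloning_kwargs(reference_wav: str, transcript: Optional[str] = None) -> dict:
    """Build keyword arguments for the cloning calls shown above.

    Without a transcript, only reference_wav_path is set (basic cloning).
    With a transcript, prompt_wav_path and prompt_text are added as well.
    """
    kwargs = {"reference_wav_path": reference_wav}
    if transcript is not None:
        if not transcript.strip():
            raise ValueError("transcript must not be empty")
        kwargs["prompt_wav_path"] = reference_wav
        kwargs["prompt_text"] = transcript
    return kwargs


print(cloning_kwargs("speaker.wav"))
print(cloning_kwargs("speaker_reference.wav", "The transcript of the reference audio."))
```

A call would then look like model.generate(text="...", **cloning_kwargs("speaker_reference.wav", "...")), assuming the model object from the Quickstart.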

Streaming Example

JetlinkTTS also supports streaming generation in the upstream workflow:

import numpy as np

chunks = []
for chunk in model.generate_streaming(text="Streaming is easy with JetlinkTTS!"):
    chunks.append(chunk)

wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)
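
Concatenating all chunks keeps the whole utterance in memory before anything is written. For long outputs, a possible alternative is to append each chunk to the file as it arrives. The sketch below uses only the standard library and a stand-in generator instead of the model; it assumes chunks are one-dimensional sequences of floats in [-1, 1] and that 16-bit mono PCM is acceptable, neither of which is confirmed by the upstream documentation.

```python
import math
import struct
import wave

SAMPLE_RATE = 48_000  # assumed here; in practice use model.tts_model.sample_rate


def fake_chunks(num_chunks=5, chunk_len=4800, freq_hz=440.0):
    """Stand-in for a streaming generator: yields lists of floats in [-1, 1]."""
    for c in range(num_chunks):
        start = c * chunk_len
        yield [math.sin(2 * math.pi * freq_hz * (start + i) / SAMPLE_RATE)
               for i in range(chunk_len)]


def write_stream_to_wav(chunks, path, sample_rate=SAMPLE_RATE):
    """Append each chunk to a 16-bit mono WAV file; return the frames written."""
    frames = 0
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            # Clip to [-1, 1] before scaling to the int16 range.
            clipped = [max(-1.0, min(1.0, float(s))) for s in chunk]
            pcm = struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))
            wav_file.writeframes(pcm)
            frames += len(clipped)
    return frames


frames_written = write_stream_to_wav(fake_chunks(), "streaming_demo.wav")
print(frames_written, "frames written")
```

Replacing fake_chunks() with the real streaming generator should work if its chunks match the assumed format.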

Serving Notes

This model is suitable for:

real-time or near-real-time TTS
voice cloning services
multilingual TTS APIs
creative speech generation pipelines
enterprise speech applications

The upstream model card reports real-time factor values as low as approximately 0.30 on NVIDIA RTX 4090 and approximately 0.13 with Nano-VLLM acceleration, indicating that optimized serving is possible with the right runtime stack.
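
Real-time factor (RTF) is synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster than real time. The short sketch below turns the two reported values into expected synthesis times for a 10-second clip; the clip length is an arbitrary example, and actual timings depend on hardware and runtime configuration.

```python
def real_time_factor(synthesis_s: float, audio_s: float) -> float:
    """RTF = time spent synthesizing / duration of the generated audio."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_s / audio_s


def expected_synthesis_time(audio_s: float, rtf: float) -> float:
    """Time needed to synthesize audio_s seconds of speech at a given RTF."""
    return audio_s * rtf


# RTF values as reported by the upstream model card.
for label, rtf in (("RTX 4090", 0.30), ("Nano-VLLM acceleration", 0.13)):
    seconds = expected_synthesis_time(10.0, rtf)
    print(f"{label}: about {seconds:.1f} s for 10 s of audio")
```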

Strengths

strong multilingual TTS coverage
voice design without reference audio
controllable voice cloning
high-fidelity cloning with transcript guidance
48kHz output quality
streaming support
open-source and commercial-friendly licensing

Limitations

According to the upstream model card:

voice design and style control results may vary between runs
performance varies across languages depending on training data availability
occasional instability may appear with very long or highly expressive inputs
the model must not be used for impersonation, fraud, or disinformation
AI-generated content should be clearly labeled
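
One possible way to reduce exposure to the long-input instability noted above is to synthesize shorter segments separately. The sketch below splits text at sentence boundaries; it is an illustrative workaround rather than upstream guidance, the 200-character limit is an arbitrary assumption, and whether segmenting actually improves stability has not been verified here.

```python
import re


def split_into_segments(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into segments of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own segment.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    segments, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


print(split_into_segments("One. Two. Three.", max_chars=10))
```

Each segment could then be passed to model.generate separately and the resulting audio joined afterwards.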

Out-of-Scope / Cautionary Use

Outputs should not be used for:

impersonation
fraud
disinformation
deceptive identity simulation
unlabeled synthetic voice deployment in sensitive scenarios

Human review, clear disclosure, and policy controls are strongly recommended.

Fine-Tuning

The upstream model card states that VoxCPM2 supports both LoRA fine-tuning and full fine-tuning, with as little as 5–10 minutes of audio in some workflows. Refer to the upstream fine-tuning guide for exact procedures and configuration details.

License

This repository follows the same license as the upstream release.

License: Apache-2.0

If you redistribute, fine-tune, quantize, or otherwise modify this model, make sure your usage remains compliant with the upstream license and attribution requirements.

Attribution

Original upstream model:

openbmb/VoxCPM2

This repository is an organization-managed copy and is not the original upstream source.

Citation

Please cite the original VoxCPM2 release when using this model in research, evaluation, or production documentation.

@article{voxcpm2_2026,
  title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
  author  = {VoxCPM Team},
  journal = {GitHub},
  year    = {2026},
}

@article{voxcpm2025,
  title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
  author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
             Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
             Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
  journal = {arXiv preprint arXiv:2509.24650},
  year    = {2025},
}

Disclaimer

This repository may include packaging, naming, or deployment-oriented changes for organizational use.

For official updates, benchmark details, and upstream release notes, refer to the original upstream model card.

JetlinkTTS (Türkçe)

Bu depo, çok dilli metinden konuşmaya dönüştürme, ses klonlama ve kontrol edilebilir voice design iş yükleri için openbmb/VoxCPM2 tabanlı JetlinkTTS modelinin kurum tarafından yönetilen bir kopyasını barındırır.

Bu depo; modeli kendi namespace’i altında yönetmek, erişimi kontrol etmek ve dağıtımı kolaylaştırmak isteyen ekipler için hazırlanmıştır. Amaç, upstream model ekosistemiyle uyumluluğu koruyarak kurumsal kullanım sağlamaktır.

Model Özeti

JetlinkTTS, VoxCPM2 tabanlı tokenizer-free diffusion autoregressive bir metinden konuşmaya dönüştürme modelidir. Upstream model kartına göre model 2B parametreye sahiptir, 30 dil destekler, 48kHz ses çıktısı üretir ve 2 milyon saatten fazla çok dilli konuşma verisi üzerinde eğitilmiştir. Ayrıca voice cloning, voice design, streaming generation ve context-aware synthesis özelliklerini destekler.

Temel Özellikler

30 dilde çok dilli TTS
Doğal dil açıklamasından Voice Design
Kısa referans ses ile Controllable Voice Cloning
Referans ses + transcript ile Ultimate Cloning
48kHz stüdyo kalitesinde çıktı
Streaming generation
Bağlama duyarlı prosody ve ifade üretimi
Apache-2.0 ile ticari kullanıma uygun lisans

Desteklenen Diller

Upstream model kartına göre VoxCPM2 şu 30 dili destekler:

Arapça
Burma dili
Çince
Danca
Hollandaca
İngilizce
Fince
Fransızca
Almanca
Yunanca
İbranice
Hintçe
Endonezce
İtalyanca
Japonca
Kmerce
Korece
Lao dili
Malayca
Norveççe
Lehçe
Portekizce
Rusça
İspanyolca
Svahili
İsveççe
Tagalog
Tayca
Türkçe
Vietnamca

Ek olarak bazı Çince lehçeleri de listelenmiştir:

四川话
粤语
吴语
东北话
河南话
陕西话
山东话
天津话
闽南话

Kullanım Amacı

Bu model aşağıdaki senaryolar için uygundur:

çok dilli konuşma sentezi
anlatım ve seslendirme üretimi
voice assistant backend’leri
ses klonlama iş akışları
yaratıcı voice design
altyazı dublajı ve lokalizasyon
konuşma tabanlı TTS servisleri
araştırma ve benchmark çalışmaları

Model Detayları

Mimari

Upstream model kartı VoxCPM2’yi şu şekilde tanımlar:

Mimari: Tokenizer-free Diffusion Autoregressive (LocEnc → TSLM → RALM → LocDiT)
Backbone: MiniCPM-4 tabanlı
Toplam parametre: 2B
Audio VAE: AudioVAE V2
Referans giriş: 16kHz
Çıktı sesi: 48kHz
Maksimum sequence length: 8192 token
Varsayılan dtype: bfloat16

Donanım Gereksinimleri

Bu model için tüm kullanım senaryolarını kapsayan tek bir evrensel minimum donanım gereksinimi yoktur.

Gerçek ihtiyaçlar şunlara bağlıdır:

inference backend
metin uzunluğu
streaming veya offline kullanım
voice cloning kullanımı
concurrency
latency hedefi
runtime yapılandırması

Minimum Sistem Gereksinimleri

Upstream model kartı, model detayları bölümünde ~8 GB VRAM bilgisini açıkça verir. Ayrıca temel yazılım gereksinimleri olarak Python ≥ 3.10, PyTorch ≥ 2.5.0 ve CUDA ≥ 12.0 belirtilmiştir.

JetlinkTTS için pratik bellek rehberi:

Tahmini pratik minimum VRAM: ~8 GB
Daha rahat local geliştirme ve test için önerilen: 12–16 GB VRAM
Production veya daha yüksek concurrency için önerilen: modern datacenter sınıfı GPU’lar

Not: gerçek bellek kullanımı; metin uzunluğu, klonlama modu, streaming kullanımı, batch size ve backend kaynaklı ek yükler nedeniyle artabilir. Yukarıdaki değerler kesin sınırlar değil, pratik rehber olarak değerlendirilmelidir.

Referans Donanım

Pratik dağıtım planlaması için:

Geliştirme / hafif test: yaklaşık 8 GB VRAM veya üzeri tek modern GPU yeterli olabilir
Daha rahat local denemeler: 12–16 GB VRAM
Production odaklı serving: modern datacenter GPU’lar önerilir
Daha düşük latency / daha yüksek throughput: optimize inference stack’leri değerlendirilmelidir

Yazılım Gereksinimleri

Önerilen ortam:

Python 3.10 veya üzeri
PyTorch 2.5.0 veya üzeri
CUDA 12.0 veya üzeri
deployment için Linux önerilir
upstream kullanım için voxcpm paketi

Yaygın bağımlılıklar:

torch
soundfile
voxcpm

Hızlı Başlangıç

Upstream paketi kur:

pip install voxcpm

Temel kullanım:

import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("Jetlink/JetlinkTTS", load_denoiser=False)

wav = model.generate(
    text="JetlinkTTS delivers expressive multilingual speech generation.",
    cfg_value=2.0,
    inference_timesteps=10,
)

sf.write("output.wav", wav, model.tts_model.sample_rate)

Voice Design Örneği

Ses tarzını, metnin başında parantez içinde doğal dil ile yönlendirebilirsin:

wav = model.generate(
    text="(Genç bir kadın, yumuşak ve sıcak bir ses tonu)Merhaba, JetlinkTTS'e hoş geldiniz!",
    cfg_value=2.0,
    inference_timesteps=10,
)

Voice Cloning Örneği

Kısa bir referans ses ile temel klonlama:

wav = model.generate(
    text="Bu, JetlinkTTS tarafından üretilmiş klonlanmış bir sestir.",
    reference_wav_path="speaker.wav",
)

Stil kontrollü klonlama:

wav = model.generate(
    text="(Biraz daha hızlı, neşeli bir ton)Bu, stil kontrolü uygulanmış klonlanmış bir sestir.",
    reference_wav_path="speaker.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)

Yüksek Benzerlikli / Ultimate Cloning

En yüksek benzerlik için hem referans ses hem de transcript verilebilir:

wav = model.generate(
    text="Bu, JetlinkTTS ile yapılmış yüksek benzerlikli klonlama örneğidir.",
    prompt_wav_path="speaker_reference.wav",
    prompt_text="Referans sesin transcript metni.",
    reference_wav_path="speaker_reference.wav",
)

Streaming Örneği

Upstream akışta streaming üretim de desteklenir:

import numpy as np

chunks = []
for chunk in model.generate_streaming(text="JetlinkTTS ile streaming oldukça kolay!"):
    chunks.append(chunk)

wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)

Serving Notları

Bu model şu kullanım türleri için uygundur:

gerçek zamanlı veya gerçeğe yakın zamanlı TTS
voice cloning servisleri
çok dilli TTS API’leri
yaratıcı konuşma üretim akışları
kurumsal ses uygulamaları

Upstream model kartı, gerçek zaman faktörü için NVIDIA RTX 4090 üzerinde yaklaşık 0.30 ve Nano-VLLM hızlandırmasıyla yaklaşık 0.13 değerlerini raporlar. Bu da uygun runtime stack ile optimize serving yapılabildiğini gösterir.

Güçlü Yönler

güçlü çok dilli TTS kapsaması
referans ses olmadan voice design
kontrol edilebilir voice cloning
transcript destekli yüksek benzerlikli klonlama
48kHz çıktı kalitesi
streaming desteği
açık kaynak ve ticari kullanıma uygun lisans

Sınırlamalar

Upstream model kartına göre:

voice design ve style control sonuçları çalıştırmalar arasında değişebilir
performans, eğitim verisi kapsamına bağlı olarak dillere göre değişir
çok uzun veya aşırı ifadeli girdilerde zaman zaman kararsızlık görülebilir
impersonation, fraud veya disinformation için kullanımı kesinlikle yasaktır
AI ile üretilmiş içerikler açıkça etiketlenmelidir

Kapsam Dışı / Dikkat Gerektiren Kullanımlar

Çıktılar şu amaçlarla kullanılmamalıdır:

kimliğe bürünme
dolandırıcılık
dezenformasyon
aldatıcı kimlik simülasyonu
hassas senaryolarda etiketsiz sentetik ses kullanımı

İnsan denetimi, açık bilgilendirme ve politika kontrolleri güçlü şekilde önerilir.

Fine-Tuning

Upstream model kartı, VoxCPM2’nin hem LoRA fine-tuning hem de full fine-tuning desteklediğini ve bazı senaryolarda 5–10 dakika ses verisi ile ince ayar yapılabildiğini belirtir. Kesin prosedür ve konfigürasyon detayları için upstream fine-tuning rehberine bakılmalıdır.

Lisans

Bu depo, upstream sürümle aynı lisansı takip eder.

Lisans: Apache-2.0

Modeli yeniden dağıtıyor, fine-tune ediyor, quantize ediyor veya başka şekilde değiştiriyorsan; kullanımının upstream lisans ve attribution gereklilikleriyle uyumlu olduğundan emin olmalısın.

Atıf

Orijinal upstream model:

openbmb/VoxCPM2

Bu depo, kurum tarafından yönetilen bir kopyadır ve orijinal upstream kaynak değildir.

Atıf / Citation

Bu modeli araştırma, değerlendirme veya production dokümantasyonunda kullanıyorsan, lütfen orijinal VoxCPM2 sürümüne atıf yap.

@article{voxcpm2_2026,
  title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
  author  = {VoxCPM Team},
  journal = {GitHub},
  year    = {2026},
}

@article{voxcpm2025,
  title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
  author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
             Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
             Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
  journal = {arXiv preprint arXiv:2509.24650},
  year    = {2025},
}

Feragatname

Bu depo, kurumsal kullanım amacıyla paketleme, isimlendirme veya dağıtım odaklı bazı değişiklikler içerebilir.

Resmi güncellemeler, benchmark detayları ve upstream sürüm notları için orijinal upstream model kartına bakılmalıdır.
