---
license: apache-2.0
library_name: transformers
tags:
  - text-to-speech
  - tts
  - multilingual
  - voice-cloning
  - voice-design
  - audio
  - diffusion
  - transformers
pipeline_tag: text-to-speech
base_model: openbmb/VoxCPM2
---

# JetlinkTTS

This repository hosts an organization-managed copy of **JetlinkTTS** for multilingual text-to-speech, voice cloning, and controllable voice design workloads.

It is intended for teams that want to manage deployment, access, and internal distribution from their own namespace while preserving compatibility with the upstream model ecosystem.

## Model Summary

JetlinkTTS is a tokenizer-free diffusion autoregressive text-to-speech model built for expressive multilingual speech generation. The upstream model card describes it as a **2B-parameter** model supporting **30 languages**, with **48kHz audio output**, trained on **over 2 million hours of multilingual speech data**. It also supports **voice cloning**, **voice design**, **streaming generation**, and **context-aware synthesis**.

## Key Features

- **Multilingual text-to-speech** across 30 supported languages
- **Voice Design** from natural-language voice descriptions
- **Controllable Voice Cloning** from short reference audio
- **Ultimate Cloning** using reference audio plus transcript for higher fidelity
- **48kHz studio-quality output**
- **Streaming generation**
- **Context-aware prosody and expressiveness**
- **Commercial-friendly Apache-2.0 license**

## Supported Languages

According to the upstream model card, VoxCPM2 supports the following **30 languages**:

- Arabic
- Burmese
- Chinese
- Danish
- Dutch
- English
- Finnish
- French
- German
- Greek
- Hebrew
- Hindi
- Indonesian
- Italian
- Japanese
- Khmer
- Korean
- Lao
- Malay
- Norwegian
- Polish
- Portuguese
- Russian
- Spanish
- Swahili
- Swedish
- Tagalog
- Thai
- Turkish
- Vietnamese

The upstream model card also lists support for several Chinese dialects, including:

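- Sichuanese (四川话)
- Cantonese (粤语)
- Wu (吴语)
- Northeastern Mandarin (东北话)
- Henan dialect (河南话)
- Shaanxi dialect (陕西话)
- Shandong dialect (山东话)
- Tianjin dialect (天津话)
- Southern Min / Hokkien (闽南话)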

## Intended Use

This model is suitable for:

- multilingual speech synthesis
- narration and audiobook generation
- voice assistant backends
- voice cloning workflows
- creative voice design
- subtitle dubbing and localization
- conversational TTS pipelines
- research and benchmarking

## Model Details

### Architecture

The upstream model card describes VoxCPM2 as:

- **Architecture:** Tokenizer-free Diffusion Autoregressive (LocEnc → TSLM → RALM → LocDiT)
- **Backbone:** Based on MiniCPM-4
- **Total parameters:** 2B
- **Audio VAE:** AudioVAE V2
- **Reference input:** 16kHz
- **Output audio:** 48kHz
- **Maximum sequence length:** 8192 tokens
- **Default dtype:** bfloat16

## Hardware Requirements

> This model does not have a single universal minimum hardware requirement for all usage scenarios.

Actual requirements depend on:

- inference backend
- text length
- streaming vs offline mode
- voice cloning usage
- concurrency
- latency target
- runtime configuration

### Minimum System Requirements

The upstream model card explicitly reports **~8 GB VRAM** in the model details section. It also lists the core software requirements as **Python ≥ 3.10**, **PyTorch ≥ 2.5.0**, and **CUDA ≥ 12.0**.

Practical memory guidance for JetlinkTTS:

- **Estimated practical minimum VRAM:** **~8 GB**
- **Recommended for smoother local development and testing:** **12–16 GB VRAM**
- **Recommended for production or higher concurrency:** modern datacenter-class GPUs

> Note: real memory usage can increase depending on text length, cloning mode, streaming usage, batch size, and backend overhead. The values above should be treated as practical guidance rather than hard universal limits.
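
The tiers above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of the upstream package: `vram_tier` and `detect_vram_gb` are hypothetical helper names, and the 7.5 GB and 11.5 GB cut-offs are an assumption that leaves a small margin because cards marketed as 8 GB or 12 GB usually report slightly less total memory.

```python
from typing import Optional


def vram_tier(total_gb: float) -> str:
    """Map total VRAM (in GB) onto the guidance tiers from this section.

    The 7.5 / 11.5 thresholds are assumptions of this sketch: they leave a
    small margin below the nominal 8 GB and 12 GB figures.
    """
    if total_gb < 7.5:
        return "below minimum"
    if total_gb < 11.5:
        return "practical minimum"
    return "recommended"


def detect_vram_gb(device_index: int = 0) -> Optional[float]:
    """Return total VRAM in GB for a CUDA device, or None if unavailable."""
    try:
        import torch  # imported lazily so vram_tier() works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(device_index)
    return props.total_memory / 1024**3
```

A caller might run `gb = detect_vram_gb()` and, when the result is not `None`, log `vram_tier(gb)` before loading the model. The check only looks at total memory, so it says nothing about memory already used by other processes.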

### Reference Hardware

For practical deployment planning:

- **Development / light testing:** a single modern GPU with around **8 GB VRAM or higher** may be sufficient
- **Smoother local experimentation:** **12–16 GB VRAM**
- **Production-oriented serving:** modern datacenter GPUs are recommended
- **Lower latency / higher throughput serving:** optimized inference stacks should be considered

## Software Requirements

Recommended environment:

- **Python 3.10 or newer**
- **PyTorch 2.5.0 or newer**
- **CUDA 12.0 or newer**
- Linux recommended for deployment
- `voxcpm` package for upstream usage

Common dependencies may include:

- `torch`
- `soundfile`
- `voxcpm`
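
The version floors listed above can be verified before installing or loading the model. The sketch below uses only the standard library; `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helper names, and the check covers Python and PyTorch only (the CUDA version is not inspected here).

```python
import sys
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn a version string such as '2.5.1+cu121' into (2, 5, 1)."""
    numbers = []
    for piece in text.split("+")[0].split("."):
        leading = ""
        for ch in piece:
            if not ch.isdigit():
                break
            leading += ch
        if not leading:
            break  # stop at the first part with no leading digits
        numbers.append(int(leading))
    return tuple(numbers)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings numerically, padding with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment() -> dict:
    """Report whether Python and PyTorch satisfy the floors listed above."""
    report = {"python": sys.version_info >= (3, 10)}
    try:
        report["torch"] = meets_minimum(metadata.version("torch"), "2.5.0")
    except metadata.PackageNotFoundError:
        report["torch"] = False  # PyTorch is not installed
    return report
```

Calling `check_environment()` returns a dictionary such as `{"python": True, "torch": False}`, which can be logged or used to fail fast in a deployment script.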

## Quickstart

Install the upstream package:

    pip install voxcpm

Basic usage:

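    import soundfile as sf
    from voxcpm import VoxCPM

    # Load the weights from this repository; the optional denoiser is skipped.
    model = VoxCPM.from_pretrained("Jetlink/JetlinkTTS", load_denoiser=False)

    # Synthesize speech; generate() returns the waveform as an array of samples.
    wav = model.generate(
        text="JetlinkTTS delivers expressive multilingual speech generation.",
        cfg_value=2.0,           # guidance strength
        inference_timesteps=10,  # number of diffusion sampling steps
    )

    # Write the result at the model's native output sample rate.
    sf.write("output.wav", wav, model.tts_model.sample_rate)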

## Voice Design Example

You can guide the voice with a natural-language description placed in parentheses at the beginning of the text:

    wav = model.generate(
        text="(A young woman, gentle and warm voice)Hello, welcome to JetlinkTTS!",
        cfg_value=2.0,
        inference_timesteps=10,
    )
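
When descriptions come from user input or configuration, it can help to build the prefixed text in one place. This is a small sketch of the parenthesized-prefix convention shown above; `with_voice_description` is a hypothetical helper, and rejecting parentheses inside the description is an assumption made here to keep the prefix unambiguous, not a documented upstream rule.

```python
def with_voice_description(description: str, text: str) -> str:
    """Prefix `text` with a parenthesized voice description.

    An empty description returns the text unchanged. Parentheses inside the
    description are rejected (an assumption of this sketch).
    """
    description = description.strip()
    if not description:
        return text
    if "(" in description or ")" in description:
        raise ValueError("description must not contain parentheses")
    return f"({description}){text}"


prompt = with_voice_description(
    "A young woman, gentle and warm voice",
    "Hello, welcome to JetlinkTTS!",
)
print(prompt)
```

The resulting string can be passed as the `text` argument of `model.generate(...)` exactly as in the example above.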

## Voice Cloning Example

Basic cloning with a short reference clip:

    wav = model.generate(
        text="This is a cloned voice generated by JetlinkTTS.",
        reference_wav_path="speaker.wav",
    )

Controllable cloning with style guidance:

    wav = model.generate(
        text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
        reference_wav_path="speaker.wav",
        cfg_value=2.0,
        inference_timesteps=10,
    )

## High-Fidelity / Ultimate Cloning

For maximum similarity, provide both the reference audio and its transcript:

    wav = model.generate(
        text="This is a high-fidelity cloning demonstration using JetlinkTTS.",
        prompt_wav_path="speaker_reference.wav",
        prompt_text="The transcript of the reference audio.",
        reference_wav_path="speaker_reference.wav",
    )
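
The basic and high-fidelity cloning calls differ only in which keyword arguments are passed. The sketch below collects that choice in one hypothetical helper, `cloning_kwargs`; the argument names are taken from the examples in this card, and reusing the same file for `prompt_wav_path` and `reference_wav_path` mirrors the example above rather than a stated requirement.

```python
from typing import Optional


def cloning_kwargs(reference_wav: str, transcript: Optional[str] = None) -> dict:
    """Build keyword arguments for a cloning call.

    Without a transcript, only the reference clip is passed (basic cloning).
    With a transcript, the clip is also passed as the prompt audio together
    with its text (the high-fidelity mode shown above).
    """
    kwargs = {"reference_wav_path": reference_wav}
    if transcript:
        kwargs["prompt_wav_path"] = reference_wav
        kwargs["prompt_text"] = transcript
    return kwargs


basic = cloning_kwargs("speaker.wav")
high_fidelity = cloning_kwargs(
    "speaker_reference.wav",
    "The transcript of the reference audio.",
)
```

Either dictionary can then be unpacked into the call, for example `model.generate(text="...", **high_fidelity)`.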

## Streaming Example

JetlinkTTS also supports streaming generation in the upstream workflow:

    import numpy as np

    chunks = []
    for chunk in model.generate_streaming(text="Streaming is easy with JetlinkTTS!"):
        chunks.append(chunk)

    wav = np.concatenate(chunks)
    sf.write("streaming.wav", wav, model.tts_model.sample_rate)
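
The example above buffers every chunk before writing. For long outputs it may be preferable to write each chunk as it arrives. The following standard-library sketch assumes each chunk is a one-dimensional sequence of float samples in the range [-1, 1] and that mono 16-bit PCM is acceptable; `write_chunks_to_wav` is a hypothetical helper, not an upstream function.

```python
import struct
import wave


def write_chunks_to_wav(chunks, path, sample_rate=48000):
    """Write float chunks to a mono 16-bit WAV file as they arrive.

    Assumes samples lie in [-1, 1]; values outside that range are clipped.
    Returns the total number of samples written.
    """
    total = 0
    with wave.open(path, "wb") as out:
        out.setnchannels(1)   # mono
        out.setsampwidth(2)   # 16-bit PCM
        out.setframerate(sample_rate)
        for chunk in chunks:
            samples = [max(-1.0, min(1.0, float(s))) for s in chunk]
            pcm = struct.pack(
                f"<{len(samples)}h", *(int(s * 32767) for s in samples)
            )
            out.writeframes(pcm)
            total += len(samples)
    return total
```

With the model loaded as in the Quickstart, a call could look like `write_chunks_to_wav(model.generate_streaming(text="..."), "streaming.wav", model.tts_model.sample_rate)`. The per-sample loop keeps the sketch dependency-free; converting whole chunks with NumPy would be faster for production use.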

## Serving Notes

Typical serving scenarios include:

- real-time or near-real-time TTS
- voice cloning services
- multilingual TTS APIs
- creative speech generation pipelines
- enterprise speech applications

The upstream model card reports real-time factor values as low as approximately **0.30 on NVIDIA RTX 4090** and approximately **0.13 with Nano-VLLM acceleration**, indicating that optimized serving is possible with the right runtime stack.
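
To make these figures concrete: real-time factor (RTF) is conventionally the synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. The short sketch below applies that definition to the two reported values. The helper names are hypothetical, and the results are idealized estimates that ignore model loading, batching, and network overhead.

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock time needed to synthesize `audio_seconds` of speech."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds * rtf


def speedup_over_realtime(rtf: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


for label, rtf in [("RTX 4090", 0.30), ("Nano-VLLM", 0.13)]:
    print(
        f"{label}: 10 s of audio in about {synthesis_seconds(10, rtf):.1f} s "
        f"({speedup_over_realtime(rtf):.1f}x real time)"
    )
```

Under these assumptions, ten seconds of speech would take roughly 3 s at an RTF of 0.30 and roughly 1.3 s at 0.13.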

## Strengths

- strong multilingual TTS coverage
- voice design without reference audio
- controllable voice cloning
- high-fidelity cloning with transcript guidance
- 48kHz output quality
- streaming support
- open-source and commercial-friendly licensing

## Limitations

According to the upstream model card:

- voice design and style control results may vary between runs
- performance varies across languages depending on training data availability
- occasional instability may appear with very long or highly expressive inputs
- the model must not be used for impersonation, fraud, or disinformation
- AI-generated content should be clearly labeled
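
One common way to reduce instability on very long inputs is to split the text into shorter segments and synthesize them one at a time. The sketch below is an illustration of that general idea, not an upstream recommendation: `split_into_segments` is a hypothetical helper, the 200-character default is an arbitrary assumption, and the sentence splitting assumes sentences end with `.`, `!`, or `?` followed by whitespace.

```python
import re


def split_into_segments(text: str, max_chars: int = 200) -> list:
    """Group sentences into segments of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole rather than cut
    mid-sentence.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    segments, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            segments.append(current)  # close the current segment
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


print(split_into_segments("One. Two! Three?", max_chars=10))
```

Each segment could then be passed to `model.generate(...)` and the resulting arrays joined with `np.concatenate`, as in the streaming example. Prosody may differ across segment boundaries, so the segment length is worth tuning per language and use case.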

## Out-of-Scope / Cautionary Use

Outputs should not be used for:

- impersonation
- fraud
- disinformation
- deceptive identity simulation
- unlabeled synthetic voice deployment in sensitive scenarios

Human review, clear disclosure, and policy controls are strongly recommended.

## Fine-Tuning

The upstream model card states that VoxCPM2 supports both **LoRA fine-tuning** and **full fine-tuning**, with as little as **5–10 minutes of audio** in some workflows. Refer to the upstream fine-tuning guide for exact procedures and configuration details.

## License

This repository follows the same license as the upstream release.

- **License:** Apache-2.0

If you redistribute, fine-tune, quantize, or otherwise modify this model, make sure your usage remains compliant with the upstream license and attribution requirements.

## Attribution

Original upstream model:
- `openbmb/VoxCPM2`

This repository is an organization-managed copy and is **not the original upstream source**.

## Citation

Please cite the original VoxCPM2 release when using this model in research, evaluation, or production documentation.

    @article{voxcpm2_2026,
      title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
      author  = {VoxCPM Team},
      journal = {GitHub},
      year    = {2026},
    }

    @article{voxcpm2025,
      title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
      author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
                 Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
                 Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
      journal = {arXiv preprint arXiv:2509.24650},
      year    = {2025},
    }

## Disclaimer

This repository may include packaging, naming, or deployment-oriented changes for organizational use.

For official updates, benchmark details, and upstream release notes, refer to the original upstream model card.

---

# JetlinkTTS (Türkçe)

Bu depo, çok dilli metinden konuşmaya dönüştürme, ses klonlama ve kontrol edilebilir voice design iş yükleri için **openbmb/VoxCPM2** tabanlı **JetlinkTTS** modelinin kurum tarafından yönetilen bir kopyasını barındırır.

Bu depo; modeli kendi namespace’i altında yönetmek, erişimi kontrol etmek ve dağıtımı kolaylaştırmak isteyen ekipler için hazırlanmıştır. Amaç, upstream model ekosistemiyle uyumluluğu koruyarak kurumsal kullanım sağlamaktır.

## Model Özeti

JetlinkTTS, **VoxCPM2** tabanlı tokenizer-free diffusion autoregressive bir metinden konuşmaya dönüştürme modelidir. Upstream model kartına göre model **2B parametreye** sahiptir, **30 dil** destekler, **48kHz ses çıktısı** üretir ve **2 milyon saatten fazla çok dilli konuşma verisi** üzerinde eğitilmiştir. Ayrıca **voice cloning**, **voice design**, **streaming generation** ve **context-aware synthesis** özelliklerini destekler.

## Temel Özellikler

- **30 dilde çok dilli TTS**
- Doğal dil açıklamasından **Voice Design**
- Kısa referans ses ile **Controllable Voice Cloning**
- Referans ses + transcript ile **Ultimate Cloning**
- **48kHz stüdyo kalitesinde çıktı**
- **Streaming generation**
- **Bağlama duyarlı prosody ve ifade üretimi**
- **Apache-2.0** ile ticari kullanıma uygun lisans

## Desteklenen Diller

Upstream model kartına göre VoxCPM2 şu **30 dili** destekler:

- Arapça
- Burma dili
- Çince
- Danca
- Hollandaca
- İngilizce
- Fince
- Fransızca
- Almanca
- Yunanca
- İbranice
- Hintçe
- Endonezce
- İtalyanca
- Japonca
- Kmerce
- Korece
- Lao dili
- Malayca
- Norveççe
- Lehçe
- Portekizce
- Rusça
- İspanyolca
- Svahili
- İsveççe
- Tagalog
- Tayca
- Türkçe
- Vietnamca

Ek olarak bazı Çince lehçeleri de listelenmiştir:

- 四川话
- 粤语
- 吴语
- 东北话
- 河南话
- 陕西话
- 山东话
- 天津话
- 闽南话

## Kullanım Amacı

Bu model aşağıdaki senaryolar için uygundur:

- çok dilli konuşma sentezi
- anlatım ve seslendirme üretimi
- voice assistant backend’leri
- ses klonlama iş akışları
- yaratıcı voice design
- altyazı dublajı ve lokalizasyon
- konuşma tabanlı TTS servisleri
- araştırma ve benchmark çalışmaları

## Model Detayları

### Mimari

Upstream model kartı VoxCPM2’yi şu şekilde tanımlar:

- **Mimari:** Tokenizer-free Diffusion Autoregressive (LocEnc → TSLM → RALM → LocDiT)
- **Backbone:** MiniCPM-4 tabanlı
- **Toplam parametre:** 2B
- **Audio VAE:** AudioVAE V2
- **Referans giriş:** 16kHz
- **Çıktı sesi:** 48kHz
- **Maksimum sequence length:** 8192 token
- **Varsayılan dtype:** bfloat16

## Donanım Gereksinimleri

> Bu model için tüm kullanım senaryolarını kapsayan tek bir evrensel minimum donanım gereksinimi yoktur.

Gerçek ihtiyaçlar şunlara bağlıdır:

- inference backend
- metin uzunluğu
- streaming veya offline kullanım
- voice cloning kullanımı
- concurrency
- latency hedefi
- runtime yapılandırması

### Minimum Sistem Gereksinimleri

Upstream model kartı, model detaylarında doğrudan **~8 GB VRAM** bilgisi verir. Ayrıca temel yazılım gereksinimleri olarak **Python ≥ 3.10**, **PyTorch ≥ 2.5.0** ve **CUDA ≥ 12.0** belirtilmiştir.

JetlinkTTS için pratik bellek rehberi:

- **Tahmini pratik minimum VRAM:** **~8 GB**
- **Daha rahat local geliştirme ve test için önerilen:** **12–16 GB VRAM**
- **Production veya daha yüksek concurrency için önerilen:** modern datacenter sınıfı GPU’lar

> Not: gerçek bellek kullanımı; metin uzunluğu, klonlama modu, streaming kullanımı, batch size ve backend kaynaklı ek yükler nedeniyle artabilir. Yukarıdaki değerler kesin sınırlar değil, pratik rehber olarak değerlendirilmelidir.

### Referans Donanım

Pratik dağıtım planlaması için:

- **Geliştirme / hafif test:** yaklaşık **8 GB VRAM veya üzeri** tek modern GPU yeterli olabilir
- **Daha rahat local denemeler:** **12–16 GB VRAM**
- **Production odaklı serving:** modern datacenter GPU’lar önerilir
- **Daha düşük latency / daha yüksek throughput:** optimize inference stack’leri değerlendirilmelidir

## Yazılım Gereksinimleri

Önerilen ortam:

- **Python 3.10 veya üzeri**
- **PyTorch 2.5.0 veya üzeri**
- **CUDA 12.0 veya üzeri**
- deployment için Linux önerilir
- upstream kullanım için `voxcpm` paketi

Yaygın bağımlılıklar:

- `torch`
- `soundfile`
- `voxcpm`

## Hızlı Başlangıç

Upstream paketi kur:

    pip install voxcpm

Temel kullanım:

    import soundfile as sf
    from voxcpm import VoxCPM

    model = VoxCPM.from_pretrained("Jetlink/JetlinkTTS", load_denoiser=False)

    wav = model.generate(
        text="JetlinkTTS delivers expressive multilingual speech generation.",
        cfg_value=2.0,
        inference_timesteps=10,
    )

    sf.write("output.wav", wav, model.tts_model.sample_rate)

## Voice Design Örneği

Ses tarzını, metnin başında parantez içinde doğal dil ile yönlendirebilirsin:

    wav = model.generate(
        text="(Genç bir kadın, yumuşak ve sıcak bir ses tonu)Merhaba, JetlinkTTS'e hoş geldiniz!",
        cfg_value=2.0,
        inference_timesteps=10,
    )

## Voice Cloning Örneği

Kısa bir referans ses ile temel klonlama:

    wav = model.generate(
        text="Bu, JetlinkTTS tarafından üretilmiş klonlanmış bir sestir.",
        reference_wav_path="speaker.wav",
    )

Stil kontrollü klonlama:

    wav = model.generate(
        text="(Biraz daha hızlı, neşeli bir ton)Bu, stil kontrolü uygulanmış klonlanmış bir sestir.",
        reference_wav_path="speaker.wav",
        cfg_value=2.0,
        inference_timesteps=10,
    )

## Yüksek Benzerlikli / Ultimate Cloning

En yüksek benzerlik için hem referans ses hem de transcript verilebilir:

    wav = model.generate(
        text="Bu, JetlinkTTS ile yapılmış yüksek benzerlikli klonlama örneğidir.",
        prompt_wav_path="speaker_reference.wav",
        prompt_text="Referans sesin transcript metni.",
        reference_wav_path="speaker_reference.wav",
    )

## Streaming Örneği

Upstream akışta streaming üretim de desteklenir:

    import numpy as np

    chunks = []
    for chunk in model.generate_streaming(text="JetlinkTTS ile streaming oldukça kolay!"):
        chunks.append(chunk)

    wav = np.concatenate(chunks)
    sf.write("streaming.wav", wav, model.tts_model.sample_rate)

## Serving Notları

Bu model şu kullanım türleri için uygundur:

- gerçek zamanlı veya gerçeğe yakın zamanlı TTS
- voice cloning servisleri
- çok dilli TTS API’leri
- yaratıcı konuşma üretim akışları
- kurumsal ses uygulamaları

Upstream model kartı, gerçek zaman faktörü için **RTX 4090 üzerinde yaklaşık 0.30** ve **Nano-VLLM hızlandırmasıyla yaklaşık 0.13** seviyelerini raporlar. Bu da uygun runtime stack ile optimize serving yapılabildiğini gösterir.

## Güçlü Yönler

- güçlü çok dilli TTS kapsaması
- referans ses olmadan voice design
- kontrol edilebilir voice cloning
- transcript destekli yüksek benzerlikli klonlama
- 48kHz çıktı kalitesi
- streaming desteği
- açık kaynak ve ticari kullanıma uygun lisans

## Sınırlamalar

Upstream model kartına göre:

- voice design ve style control sonuçları çalıştırmalar arasında değişebilir
- performans, eğitim verisi kapsamına bağlı olarak dillere göre değişir
- çok uzun veya aşırı ifadeli girdilerde zaman zaman kararsızlık görülebilir
- impersonation, fraud veya disinformation için kullanımı kesinlikle yasaktır
- AI ile üretilmiş içerikler açıkça etiketlenmelidir

## Kapsam Dışı / Dikkat Gerektiren Kullanımlar

Çıktılar şu amaçlarla kullanılmamalıdır:

- kimliğe bürünme
- dolandırıcılık
- dezenformasyon
- aldatıcı kimlik simülasyonu
- hassas senaryolarda etiketsiz sentetik ses kullanımı

İnsan denetimi, açık bilgilendirme ve politika kontrolleri güçlü şekilde önerilir.

## Fine-Tuning

Upstream model kartı, VoxCPM2’nin hem **LoRA fine-tuning** hem de **full fine-tuning** desteklediğini ve bazı senaryolarda **5–10 dakika ses verisi** ile ince ayar yapılabildiğini belirtir. Kesin prosedür ve konfigürasyon detayları için upstream fine-tuning rehberine bakılmalıdır.

## Lisans

Bu depo, upstream sürümle aynı lisansı takip eder.

- **Lisans:** Apache-2.0

Modeli yeniden dağıtıyor, fine-tune ediyor, quantize ediyor veya başka şekilde değiştiriyorsan; kullanımının upstream lisans ve attribution gereklilikleriyle uyumlu olduğundan emin olmalısın.

## Atıf

Orijinal upstream model:
- `openbmb/VoxCPM2`

Bu depo, kurum tarafından yönetilen bir kopyadır ve **orijinal upstream kaynak değildir**.

## Atıf / Citation

Bu modeli araştırma, değerlendirme veya production dokümantasyonunda kullanıyorsan, lütfen orijinal VoxCPM2 sürümüne atıf yap.

    @article{voxcpm2_2026,
      title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
      author  = {VoxCPM Team},
      journal = {GitHub},
      year    = {2026},
    }

    @article{voxcpm2025,
      title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
      author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
                 Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
                 Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
      journal = {arXiv preprint arXiv:2509.24650},
      year    = {2025},
    }

## Feragatname

Bu depo, kurumsal kullanım amacıyla paketleme, isimlendirme veya dağıtım odaklı bazı değişiklikler içerebilir.

Resmi güncellemeler, benchmark detayları ve upstream sürüm notları için orijinal upstream model kartına bakılmalıdır.