
Multilingual-Expressive-TTS-1.7B

Continued pretraining of Scicom-intl/Multilingual-TTS-1.7B-Base on multilingual expressive TTS data.

  1. Uses neucodec as the speech detokenizer at 50 tokens per second (TPS), with output at a 24 kHz sample rate.
  2. Multi-speaker, multilingual expressive TTS, trained on up to 1.15B tokens.
  3. Flash Attention 3 with 10k context length and varlen multipacking.
  4. Mixed-precision FP32-BF16 training.
  5. MuonAdamW optimizer.
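Given the stated 50 TPS codec rate, a quick sketch of how speech-token budgets map to audio duration (the 20 ms-per-token figure below simply follows from 50 TPS and is an inference, not a documented spec):

```python
# Rough audio-length arithmetic for this model, assuming the stated
# 50 tokens/s codec rate and 24 kHz output sample rate.

TOKENS_PER_SECOND = 50
SAMPLE_RATE = 24_000

def audio_seconds(num_speech_tokens: int) -> float:
    """Approximate audio duration produced by a given number of speech tokens."""
    return num_speech_tokens / TOKENS_PER_SECOND

def samples_for_tokens(num_speech_tokens: int) -> int:
    """Approximate number of 24 kHz samples the decoder will emit."""
    return num_speech_tokens * (SAMPLE_RATE // TOKENS_PER_SECOND)

# The max_new_tokens=2048 budget used in the examples below therefore caps
# generations at roughly 41 seconds of audio.
print(audio_seconds(2048))       # 40.96
print(samples_for_tokens(2048))  # 983040
```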

How to

First, load NeuCodec,

from neucodec import NeuCodec

codec = NeuCodec.from_pretrained("neuphonic/neucodec")
_ = codec.eval().to('cuda')

TTS

You can use any speaker name available at https://huggingface.co/datasets/Scicom-intl/ExpressiveSpeech/viewer/default/train

import re
import soundfile as sf
import torch
import librosa
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('Scicom-intl/Multilingual-Expressive-TTS-1.7B')
tokenizer = AutoTokenizer.from_pretrained('Scicom-intl/Multilingual-Expressive-TTS-1.7B')

speaker = 'DisfluencySpeech'
text = "Hi nama saya Husein, I am so cute, 我喜欢吃鸡饭, boire du thé glacé, ולהירגע על החוף, وأحب أن أتعرض لبعض أشعة الشمس."
prompt = f"<|im_start|>{speaker}: {text}<|speech_start|>"

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,
        temperature=0.8,
        repetition_penalty=1.15,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
audio_tokens = re.findall(r'<\|s_(\d+)\|>', generated_text.split('<|speech_start|>')[1])
audio_tokens = [int(token) for token in audio_tokens]
audio_codes = torch.tensor(audio_tokens)[None, None]

with torch.no_grad():
    audio_waveform = codec.decode_code(audio_codes.cuda())

sf.write('DisfluencySpeech-ms-en-zh-fr-he-ar.mp3', audio_waveform[0, 0].cpu().numpy(), 24000)

You can check the audio at DisfluencySpeech-ms-en-zh-fr-he-ar.mp3.
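The regex post-processing above is repeated verbatim in every example below, so it can be factored into one helper. This is a sketch only; `extract_speech_ids` is not part of the released code.

```python
import re

def extract_speech_ids(generated_text: str) -> list[int]:
    """Pull the integer ids of <|s_k|> tokens emitted after <|speech_start|>."""
    speech_part = generated_text.split('<|speech_start|>')[1]
    return [int(t) for t in re.findall(r'<\|s_(\d+)\|>', speech_part)]

# Toy example (no model needed):
ids = extract_speech_ids("prefix<|speech_start|><|s_12|><|s_7|><|s_300|>")
print(ids)  # [12, 7, 300]
```

The resulting list can then be passed to the codec exactly as in the snippet above, i.e. `codec.decode_code(torch.tensor(ids)[None, None].cuda())`.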

Expressive TTS

speaker = 'genshin-voice_audio_Rahman'
text = "Hi nama saya Husein, I am so cute, 我喜欢吃鸡饭, boire du thé glacé, ולהירגע על החוף, وأحب أن أتعرض لبعض أشعة الشمس."
description = """
Vocal qualities: Very low pitch, clear and steady, with a neutral and composed demeanor.
Speaking style: Neutral and restrained, with a monotone delivery that lacks significant pitch variation or emotional expression. The speech is methodical and precise, suitable for instructional or educational content.
Pace: Very slow and deliberate, allowing ample time for each word to be fully articulated and understood.
Fluency: Consistently fluent throughout, with no pauses, hesitations, or stutters.
Acoustic environment: The recording has a very confined and enclosed sound, suggesting it was made in a small room or similar close space.
Audio quality: The audio is generally clear but has a slight background noise, which adds a subtle layer of ambient sound without detracting from the clarity of the speech.
Content style: Educational or instructional. The tone and pace are well-suited for teaching or explaining a step-by-step process, such as a math problem or a simple task.
""".strip()
prompt = f"<|im_start|>{speaker}: {text}<|description|>{description}<|speech_start|>"

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,
        temperature=0.8,
        repetition_penalty=1.15,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
audio_tokens = re.findall(r'<\|s_(\d+)\|>', generated_text.split('<|speech_start|>')[1])
audio_tokens = [int(token) for token in audio_tokens]
audio_codes = torch.tensor(audio_tokens)[None, None]

with torch.no_grad():
    audio_waveform = codec.decode_code(audio_codes.cuda())

sf.write('Rahman-ms-en-zh-fr-he-ar.mp3', audio_waveform[0, 0].cpu().numpy(), 24000)

You can check the audio at Rahman-ms-en-zh-fr-he-ar.mp3.

Now compare without a description, using the speaker's default attributes,

speaker = 'genshin-voice_audio_Rahman'
text = "Hi nama saya Husein, I am so cute, 我喜欢吃鸡饭, boire du thé glacé, ולהירגע על החוף, وأحب أن أتعرض لبعض أشعة الشمس."
prompt = f"<|im_start|>{speaker}: {text}<|speech_start|>"

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,
        temperature=0.8,
        repetition_penalty=1.15,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
audio_tokens = re.findall(r'<\|s_(\d+)\|>', generated_text.split('<|speech_start|>')[1])
audio_tokens = [int(token) for token in audio_tokens]
audio_codes = torch.tensor(audio_tokens)[None, None]

with torch.no_grad():
    audio_waveform = codec.decode_code(audio_codes.cuda())

sf.write('Rahman-ms-en-zh-fr-he-ar-nondescription.mp3', audio_waveform[0, 0].cpu().numpy(), 24000)

You can check the audio at Rahman-ms-en-zh-fr-he-ar-nondescription.mp3.

Generate description

Jenny from https://huggingface.co/datasets/reach-vb/jenny_tts_dataset

y, sr = librosa.load('jenny.wav', sr=16000)
with torch.no_grad():
    codes = codec.encode_code(torch.tensor(y)[None, None])
tokens = ''.join([f'<|s_{i}|>' for i in codes[0, 0]])
prompt = f"<|im_start|>{tokens}<|description|>"

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,
        temperature=0.8,
        repetition_penalty=1.15,
    )
tokenizer.decode(outputs[0], skip_special_tokens=False).split('<|description|>')[1]

Output,

The audio features a young adult female with an East Asian accent, speaking in a neutral tone at a slow pace. Her speech is consistently fluent and very clear, despite the moderate ambient sounds in the background. The environment has a slightly confined quality to it, giving the recording a subtle echo that adds character to her voice. Interestingly, despite the happy emotion she conveys, her speech remains quite monotone, lacking the usual variations in pitch and rhythm that typically accompany positive emotions. This contrast between her cheerful mood and the steady delivery creates an interesting dynamic.

The low pitch of her voice lends a soothing yet serious quality to the content. Given the poetic nature of the transcript, this audio could be categorized as part of a literary reading or audiobook, perhaps describing a serene morning scene from a poet's work.

Generate categorized description

Jenny from https://huggingface.co/datasets/reach-vb/jenny_tts_dataset

y, sr = librosa.load('jenny.wav', sr=16000)
with torch.no_grad():
    codes = codec.encode_code(torch.tensor(y)[None, None])
tokens = ''.join([f'<|s_{i}|>' for i in codes[0, 0]])
prompt = f"<|im_start|>{tokens}<|description_category|>"

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,
        temperature=0.8,
        repetition_penalty=1.15,
    )
tokenizer.decode(outputs[0], skip_special_tokens=False).split('<|description_category|>')[1]

Output,

Speaker profile: Young adult female with an East Asian accent.
Vocal qualities: Quite low pitch, clear and precise articulation. The voice is steady and controlled, reflecting a sense of calm and maturity.
Speaking style: Neutral and composed, with minimal emotional expression. The delivery is quite monotone, maintaining a consistent tone throughout without much variation in pitch or emphasis.
Pace: Slightly slow, allowing for a deliberate and measured flow of words. This pace enhances clarity and comprehension.
Fluency: Fluent and smooth, with no hesitations or interruptions. The speech is continuous and well-paced.
Acoustic environment: Simulates a very confined indoor space, producing a slightly boxed-in sound that adds to the intimacy of the recording.
Audio quality: Exceptionally clear and clean, with almost no background noise. The recording is crisp and professional-sounding, making it easy to follow along.
Content style: Informative and descriptive, similar to a travel guide or cultural segment. The tone is suitable for providing factual information about natural attractions, such as lakes in Europe.
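The examples above use four prompt layouts in total: plain TTS, description-conditioned TTS, free-form description generation, and categorized description generation. A small builder makes the formats explicit; this is a sketch, and these helper names are not part of the released code.

```python
from typing import Optional

def tts_prompt(speaker: str, text: str, description: Optional[str] = None) -> str:
    """Text -> speech. An optional description steers the delivery."""
    desc = f"<|description|>{description}" if description else ""
    return f"<|im_start|>{speaker}: {text}{desc}<|speech_start|>"

def description_prompt(speech_tokens: str, categorized: bool = False) -> str:
    """Speech tokens -> description, either free-form or categorized."""
    tag = "<|description_category|>" if categorized else "<|description|>"
    return f"<|im_start|>{speech_tokens}{tag}"

print(tts_prompt("DisfluencySpeech", "Hello"))
# <|im_start|>DisfluencySpeech: Hello<|speech_start|>
print(description_prompt("<|s_1|><|s_2|>", categorized=True))
# <|im_start|><|s_1|><|s_2|><|description_category|>
```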

Optimize Inference

For better concurrency, you can use https://github.com/Scicom-AI-Enterprise-Organization/TTS-API-Neucodec

Source code

All steps to reproduce at https://github.com/Scicom-AI-Enterprise-Organization/Multilingual-TTS

Acknowledgement

Special thanks to https://www.scitix.ai/ for the H100 node!
