Inline expression tags (e.g. [chuckle], [emphasis], [long pause]) not working

#2
by O-Q - opened

I tested the model with the sample text and reference audio from the fish.audio website (Sarah voice demo), including its inline expression tags. The tags have no audible effect on the generated audio; they appear to be ignored.

Text used (directly from the fish.audio homepage Sarah demo):

[chuckle] When you're creating something new, there's this [emphasis] beautiful mix of wonder and fear. [long pause] And it's overwhelming sometimes, and scary, but also incredibly magical. And even though the process isn't perfect, those moments of uncertainty make the whole [emphasis] journey more meaningful and real.

Code:

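from mlx_audio.tts.utils import load_model
from mlx_audio.tts.generate import generate_audio

# Sample text from above, including the inline expression tags
text = (
    "[chuckle] When you're creating something new, there's this [emphasis] "
    "beautiful mix of wonder and fear. [long pause] And it's overwhelming "
    "sometimes, and scary, but also incredibly magical. And even though the "
    "process isn't perfect, those moments of uncertainty make the whole "
    "[emphasis] journey more meaningful and real."
)

model = load_model("mlx-community/fish-audio-s2-pro-8bit")
generate_audio(
    model=model,
    text=text,
    lang_code="en",
    ref_audio="sarah.mp3",  # reference clip for the Sarah voice
    ref_text="When you're creating something new, there's this beautiful mix of wonder and fear...",
    file_prefix="test_audio",
    max_tokens=1024,
    temperature=0.7,
    repetition_penalty=1.2,
    top_p=0.7,
    top_k=30,
)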
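
One way to check whether the tags are really being ignored is an A/B comparison: generate once with the tagged text and once with the tags stripped, using the same settings. The sketch below only prepares the two text variants. The helper names find_tags and strip_tags are hypothetical and not part of mlx-audio, and the pattern assumes tags are plain square-bracketed words with no nesting.

```python
import re

# Matches bracketed inline tags such as [chuckle] or [long pause].
# Assumption: tags never contain nested brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_tags(text):
    """Return the inline tag names in order of appearance."""
    return TAG_PATTERN.findall(text)


def strip_tags(text):
    """Remove inline tags and collapse the whitespace they leave behind."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_tags).strip()


sample = (
    "[chuckle] When you're creating something new, there's this [emphasis] "
    "beautiful mix of wonder and fear. [long pause] And it's overwhelming sometimes."
)

print(find_tags(sample))   # tag names found in the text
print(strip_tags(sample))  # the same text with the tags removed
```

Passing text and strip_tags(text) to generate_audio with different file_prefix values gives two files to compare. Because sampling is random at temperature 0.7, the outputs will differ anyway, so the thing to listen for is whether the tagged version contains an actual chuckle or pause.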

Environment:

mlx-audio (latest from github)
macOS / Apple Silicon
Model: mlx-community/fish-audio-s2-pro-8bit

Is this a known limitation of the MLX conversion / 8-bit quantization, or is something additional needed to enable inline tag support?

O-Q changed discussion status to closed
