Model Card for khazarai/Medical-TTS

Model Details

Model Description

This model is a fine-tuned version of csm-1b for medical text-to-speech. It was trained on a curated dataset of ~2,000 medical text and speech pairs covering clinical terminology, healthcare instructions, and patient–doctor communication scenarios.

Fine-tuned for: Medical-domain text-to-speech synthesis
Language(s) (NLP): English
License: MIT
Finetuned from model: unsloth/csm-1b

Uses

Direct Use

Generating synthetic speech from medical text for research, prototyping, and educational purposes
Assisting in medical transcription-to-speech applications
Supporting voice-based healthcare assistants

Bias, Risks, and Limitations

The model is not a substitute for professional medical advice.
The model was trained on a relatively small dataset (~2K samples), so performance may be limited outside the fine-tuned domain.
Bias & hallucinations: The model may mispronounce rare terms or produce inaccurate speech in critical scenarios.
The model should not be used in real clinical decision-making without proper validation.
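Because rare terms may be mispronounced, one inexpensive precaution is to list the words in an input that nobody has listened to yet, so a person can check them in the generated audio. The following is a minimal sketch of that idea, not part of this model or its tooling: the function name `terms_needing_review` and the `reviewed` set are hypothetical, and the set's contents here are made up for illustration.

```python
import re

def terms_needing_review(text, reviewed):
    """Return words in `text` that are not in the `reviewed` set.

    `reviewed` is a hypothetical, user-maintained set of lowercase words
    whose synthesized pronunciation a listener has already checked.
    Order of first appearance is kept and duplicates are dropped.
    """
    words = re.findall(r"[a-z][a-z'-]*", text.lower())
    flagged = []
    seen = set()
    for word in words:
        if word not in reviewed and word not in seen:
            seen.add(word)
            flagged.append(word)
    return flagged

# Illustrative data only: pretend these words were already checked.
reviewed = {"mild", "of", "the", "radius", "reflective", "fracture"}
text = "Mild dorsal angulation of the distal radius reflective of the fracture."
to_check = terms_needing_review(text, reviewed)
print(to_check)
```

A list like this only tells a reviewer where to listen first; it does not replace the validation mentioned above.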

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
import soundfile as sf
from peft import PeftModel


model_id = "unsloth/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"


processor = AutoProcessor.from_pretrained(model_id)
base_model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

model = PeftModel.from_pretrained(base_model, "khazarai/Medical-TTS")

text = "Mild dorsal angulation of the distal radius reflective of the fracture."

speaker_id = 0

conversation = [
    {"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
]
audio_values = model.generate(
    **processor.apply_chat_template(
        conversation,
        tokenize=True,
        return_dict=True,
    ).to(device),  # use the device selected above instead of hard-coding "cuda"
    max_new_tokens=650,
    # play with these parameters to tweak results
    # depth_decoder_top_k=0,
    # depth_decoder_top_p=0.9,
    # depth_decoder_do_sample=True,
    # depth_decoder_temperature=0.9,
    # top_k=0,
    # top_p=1.0,
    # temperature=0.9,
    # do_sample=True,
    #########################################################
    output_audio=True
)
audio = audio_values[0].to(torch.float32).cpu().numpy()
sf.write("example.wav", audio, 24000)
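For longer inputs, it may help to synthesize one sentence at a time so that each call stays within `max_new_tokens`. The sketch below only prepares the inputs: it splits text into sentences and builds one `conversation` per sentence in the same structure used above. The helper names `split_sentences` and `build_conversation` are hypothetical, the splitter is deliberately naive (it will also split after abbreviations such as "Dr."), and the second sentence of the sample text is made up for illustration.

```python
import re

def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def build_conversation(text, speaker_id=0):
    """Build a single-turn conversation in the format used in the example above."""
    return [
        {"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
    ]

# Sample input; the second sentence is illustrative only.
report = (
    "Mild dorsal angulation of the distal radius reflective of the fracture. "
    "No additional fracture is identified."
)
sentences = split_sentences(report)
conversations = [build_conversation(sentence) for sentence in sentences]
print(len(conversations))
```

Each entry in `conversations` can then be passed through `processor.apply_chat_template(...)` and `model.generate(...)` exactly as in the example above, writing one audio file per sentence.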

Framework versions

PEFT 0.15.2

Model tree for khazarai/Medical-TTS

Base model: sesame/csm-1b
Finetuned: unsloth/csm-1b
Adapter: this model
