---
base_model: unsloth/csm-1b
library_name: peft
license: mit
datasets:
- Dev372/Medical_STT_Dataset_1.0
language:
- en
pipeline_tag: text-to-speech
tags:
- unsloth
- trl
- transformers
---
# Model Card for Medical-TTS

## Model Details

### Model Description
This model is a fine-tuned version of csm-1b for medical text-to-speech tasks. It was trained on a curated dataset of ~2,000 medical text-to-speech pairs, focusing on clinical terminology, healthcare instructions, and patient–doctor communication scenarios.

- Fine-tuned for: Medical-domain text-to-speech synthesis
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: unsloth/csm-1b
## Uses

### Direct Use
- Generating synthetic speech from medical text for research, prototyping, and educational purposes
- Assisting in medical transcription-to-speech applications
- Supporting voice-based healthcare assistants
## Bias, Risks, and Limitations
- The model is not a substitute for professional medical advice.
- Trained on a relatively small dataset (~2K samples), so performance may be limited outside the fine-tuned medical domain.
- Bias and hallucinations: the model may mispronounce rare terms or produce inaccurate speech in critical scenarios.
- Should not be used in real clinical decision-making without proper validation.
## How to Get Started with the Model

Use the code below to get started with the model.
```python
import torch
import soundfile as sf
from transformers import CsmForConditionalGeneration, AutoProcessor
from peft import PeftModel

model_id = "unsloth/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base model and attach the fine-tuned LoRA adapter
processor = AutoProcessor.from_pretrained(model_id)
base_model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
model = PeftModel.from_pretrained(base_model, "khazarai/Medical-TTS")

text = "Mild dorsal angulation of the distal radius reflective of the fracture."
speaker_id = 0

conversation = [
    {"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

audio_values = model.generate(
    **inputs,
    max_new_tokens=650,
    # play with these parameters to tweak results
    # depth_decoder_top_k=0,
    # depth_decoder_top_p=0.9,
    # depth_decoder_do_sample=True,
    # depth_decoder_temperature=0.9,
    # top_k=0,
    # top_p=1.0,
    # temperature=0.9,
    # do_sample=True,
    output_audio=True,
)

# Save the generated waveform at the model's 24 kHz sample rate
audio = audio_values[0].to(torch.float32).cpu().numpy()
sf.write("example.wav", audio, 24000)
```
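Sampled waveforms can occasionally peak outside the [-1, 1] range that WAV writers expect, which causes audible clipping. A small peak-normalization helper (an optional addition, not part of the model's or processor's API) keeps the saved audio clip-free:

```python
import numpy as np


def peak_normalize(audio: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Scale a waveform so its loudest sample sits at `peak`; no-op for silence."""
    max_abs = float(np.max(np.abs(audio)))
    if max_abs == 0.0:
        return audio
    return audio * (peak / max_abs)


# Example: a waveform peaking at 1.6 is scaled back below full scale
normalized = peak_normalize(np.array([0.2, -1.6, 0.8], dtype=np.float32))
```

Apply it to `audio` before saving, e.g. `sf.write("example.wav", peak_normalize(audio), 24000)`.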
### Framework versions

- PEFT 0.15.2