---
base_model: unsloth/csm-1b
library_name: peft
license: mit
datasets:
- Dev372/Medical_STT_Dataset_1.0
language:
- en
pipeline_tag: text-to-speech
tags:
- unsloth
- trl
- transformers
---

# Model Card for Medical-TTS

## Model Details

### Model Description

This model is a fine-tuned version of csm-1b for medical text-to-speech. It was trained on a curated dataset of ~2,000 medical text-audio pairs covering clinical terminology, healthcare instructions, and patient–doctor communication scenarios.

- **Fine-tuned for:** Medical-domain text-to-speech synthesis
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** unsloth/csm-1b

## Uses

### Direct Use

- Generating synthetic speech from medical text for research, prototyping, and educational purposes
- Assisting in medical transcription-to-speech applications
- Supporting voice-based healthcare assistants

## Bias, Risks, and Limitations

- The model is not a substitute for professional medical advice.
- It was trained on a relatively small dataset (~2K samples), so performance may be limited outside the fine-tuned domain.
- Bias and hallucinations: the model may mispronounce rare terms or produce inaccurate speech in critical scenarios.
- It should not be used in real clinical decision-making without proper validation.
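
One way to reduce the mispronunciation risk noted above is to normalize the input text before synthesis. The sketch below is a hypothetical pre-processing step and is not part of this model or its training pipeline: the `ABBREVIATIONS` table and the `expand_abbreviations` helper are illustrative names, and any real table would need to be curated and validated by a domain expert.

```python
import re

# Hypothetical, illustrative table; it is not shipped with the model.
ABBREVIATIONS = {
    "BP": "blood pressure",
    "HR": "heart rate",
    "IV": "intravenous",
}


def expand_abbreviations(text: str, table: dict = ABBREVIATIONS) -> str:
    """Replace whole-word abbreviations with their spoken form."""
    if not table:
        return text
    # Longest keys first so overlapping abbreviations resolve predictably.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keys)) + r")\b")
    return pattern.sub(lambda match: table[match.group(1)], text)


normalized = expand_abbreviations("Monitor BP and HR after IV administration.")
print(normalized)
```

The normalized string can then be passed as the text to synthesize. Matching is case-sensitive and limited to whole words, so a token such as `HRT` is left unchanged.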

## How to Get Started with the Model

Use the code below to load the base model, attach the PEFT adapter, and synthesize a sample utterance.

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
import soundfile as sf
from peft import PeftModel

model_id = "unsloth/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor and base model, then attach the fine-tuned adapter
processor = AutoProcessor.from_pretrained(model_id)
base_model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
model = PeftModel.from_pretrained(base_model, "khazarai/Medical-TTS")

text = "Mild dorsal angulation of the distal radius reflective of the fracture."
speaker_id = 0

# The speaker ID is passed as the chat role
conversation = [
    {"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
]

audio_values = model.generate(
    **processor.apply_chat_template(
        conversation,
        tokenize=True,
        return_dict=True,
    ).to(device),  # keep inputs on the same device as the model
    max_new_tokens=650,
    # play with these parameters to tweak results
    # depth_decoder_top_k=0,
    # depth_decoder_top_p=0.9,
    # depth_decoder_do_sample=True,
    # depth_decoder_temperature=0.9,
    # top_k=0,
    # top_p=1.0,
    # temperature=0.9,
    # do_sample=True,
    output_audio=True,
)

# Convert to a float32 NumPy array and save as 24 kHz audio
audio = audio_values[0].to(torch.float32).cpu().numpy()
sf.write("example.wav", audio, 24000)
```
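
The `max_new_tokens` value bounds how long the generated audio can be. As a rough sketch, assuming the model emits about 12.5 audio frames per second (an assumption about the underlying codec that this card does not state), a token budget can be estimated from a target duration. The helper names `tokens_for_seconds` and `seconds_for_tokens` and the `FRAMES_PER_SECOND` constant are illustrative.

```python
import math

# Assumed codec frame rate in frames per second; verify for your setup.
FRAMES_PER_SECOND = 12.5


def tokens_for_seconds(seconds: float, frames_per_second: float = FRAMES_PER_SECOND) -> int:
    """Estimate the max_new_tokens needed to cover `seconds` of audio."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return math.ceil(seconds * frames_per_second)


def seconds_for_tokens(max_new_tokens: int, frames_per_second: float = FRAMES_PER_SECOND) -> float:
    """Estimate the maximum audio duration allowed by a token budget."""
    return max_new_tokens / frames_per_second


print(tokens_for_seconds(10))
print(seconds_for_tokens(650))
```

Under that assumption, the `max_new_tokens=650` used above caps the output at roughly 52 seconds, which is far more than a single sentence needs. A smaller budget may shorten generation time for short inputs.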

### Framework versions

- PEFT 0.15.2
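
To check whether a local environment matches the version listed above, a small standard-library sketch such as the following may help. The `check_version` helper is a hypothetical name, and an exact match is not known to be required for the adapter to load.

```python
from importlib.metadata import PackageNotFoundError, version


def check_version(package: str, expected: str) -> str:
    """Return 'match', 'mismatch', or 'missing' for an installed package."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return "missing"
    return "match" if installed == expected else "mismatch"


print("peft:", check_version("peft", "0.15.2"))
```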