---
base_model: unsloth/csm-1b
library_name: peft
license: mit
datasets:
- Dev372/Medical_STT_Dataset_1.0
language:
- en
pipeline_tag: text-to-speech
tags:
- unsloth
- trl
- transformers
---

# Model Card for khazarai/Medical-TTS

## Model Details

### Model Description

This model is a fine-tuned version of csm-1b for medical text-to-speech tasks.
It was trained on a curated dataset of ~2,000 medical text–audio pairs covering clinical terminology, healthcare instructions, and patient–doctor communication scenarios.

- **Fine-tuned for:** Medical-domain text-to-speech synthesis
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** unsloth/csm-1b

## Uses

### Direct Use

- Generating synthetic speech from medical text for research, prototyping, and educational purposes
- Assisting in medical transcription-to-speech applications
- Supporting voice-based healthcare assistants

## Bias, Risks, and Limitations

- The model is not a substitute for professional medical advice.
- The model was trained on a relatively small dataset (~2K samples), so performance may be limited outside the fine-tuned domain.
- Pronunciation errors: the model may mispronounce rare terms or produce inaccurate speech in critical scenarios.
- The model should not be used in real clinical decision-making without proper validation.

## How to Get Started with the Model

Use the code below to get started with the model.
```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
import soundfile as sf
from peft import PeftModel


model_id = "unsloth/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"


processor = AutoProcessor.from_pretrained(model_id)
base_model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

model = PeftModel.from_pretrained(base_model, "khazarai/Medical-TTS")

text = "Mild dorsal angulation of the distal radius reflective of the fracture."

speaker_id = 0

conversation = [
    {"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
]
audio_values = model.generate(
    **processor.apply_chat_template(
        conversation,
        tokenize=True,
        return_dict=True,
    ).to(device),  # follow the device chosen above instead of hard-coding "cuda"
    max_new_tokens=650,
    # Optional sampling parameters; uncomment and adjust to tweak results
    # depth_decoder_top_k=0,
    # depth_decoder_top_p=0.9,
    # depth_decoder_do_sample=True,
    # depth_decoder_temperature=0.9,
    # top_k=0,
    # top_p=1.0,
    # temperature=0.9,
    # do_sample=True,
    output_audio=True,
)
audio = audio_values[0].to(torch.float32).cpu().numpy()
sf.write("example.wav", audio, 24000)

```
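
Long medical sentences can run into the `max_new_tokens` budget and end abruptly. The sketch below is one way to check for that after generation. It uses the 24 kHz sample rate from the `sf.write` call above. It also assumes that each generated token corresponds to one audio codec frame at 12.5 frames per second. That figure is an assumption, not something this card states, so verify it for your setup. The helper names are hypothetical.

```python
# 24 kHz matches the sample rate passed to sf.write in the example above.
SAMPLE_RATE_HZ = 24_000
# ASSUMPTION: one generated token per codec frame, 12.5 frames per second.
ASSUMED_FRAMES_PER_SECOND = 12.5


def clip_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Duration in seconds of a mono waveform with num_samples samples."""
    return num_samples / sample_rate


def max_audio_seconds(
    max_new_tokens: int,
    frames_per_second: float = ASSUMED_FRAMES_PER_SECOND,
) -> float:
    """Upper bound on clip length under the one-token-per-frame assumption."""
    return max_new_tokens / frames_per_second


def probably_truncated(
    num_samples: int,
    max_new_tokens: int,
    tolerance_s: float = 0.5,
) -> bool:
    """True when the clip length is within tolerance_s of the token-budget ceiling."""
    return clip_seconds(num_samples) >= max_audio_seconds(max_new_tokens) - tolerance_s
```

With the example above, `probably_truncated(len(audio), 650)` returning `True` suggests raising `max_new_tokens` or splitting the text into shorter sentences.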


### Framework versions

- PEFT 0.15.2