	---
	base_model: unsloth/csm-1b
	library_name: peft
	license: mit
	datasets:
	- Dev372/Medical_STT_Dataset_1.0
	language:
	- en
	pipeline_tag: text-to-speech
	tags:
	- unsloth
	- trl
	- transformers
	---

	# Model Card for Medical-TTS

	## Model Details

	### Model Description

	This model is a fine-tuned version of csm-1b for medical text-to-speech. The weights are published as a PEFT adapter that is loaded on top of the base model.
	It was trained on a curated dataset of about 2,000 medical text and speech pairs covering clinical terminology, healthcare instructions, and patient–doctor communication scenarios.

	- Fine-tuned for: Medical-domain text-to-speech synthesis
	- Language(s) (NLP): English
	- License: MIT
	- Fine-tuned from model: unsloth/csm-1b

	## Uses

	### Direct Use

	- Generating synthetic speech from medical text for research, prototyping, and educational purposes
	- Assisting in medical transcription-to-speech applications
	- Supporting voice-based healthcare assistants

	## Bias, Risks, and Limitations

	- The model is not a substitute for professional medical advice.
	- The model was trained on a relatively small dataset (~2K samples), so quality may degrade on text outside the medical domain it was fine-tuned on.
	- Pronunciation errors: the model may mispronounce rare terms or produce inaccurate speech, which matters most in critical scenarios.
	- The model should not be used in real clinical decision-making without proper validation.
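One possible way to look for mispronounced terms is a round-trip check: transcribe the generated audio with a speech recognizer and compare the transcript with the input text. The sketch below only covers the comparison step. It assumes the transcript comes from a separate ASR system that is not part of this model, and `word_error_rate` is a hypothetical helper name, not a function from any library used here.

```python
import re


def _normalize(text: str) -> list:
    """Lowercase, drop punctuation, and split into words."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = _normalize(reference)
    hyp = _normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    source_text = "Mild dorsal angulation of the distal radius."
    # Assumption: this string stands in for the output of an external ASR system.
    asr_transcript = "mild dorsal angulation of the distal radios"
    print(f"WER: {word_error_rate(source_text, asr_transcript):.3f}")
```

A high score only flags a sample for manual listening. The recognizer can itself mishear rare medical terms, so this check does not replace review by a qualified person.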

	## How to Get Started with the Model

	Use the code below to get started with the model.
	```python
	import torch
	from transformers import CsmForConditionalGeneration, AutoProcessor
	import soundfile as sf
	from peft import PeftModel


	model_id = "unsloth/csm-1b"
	device = "cuda" if torch.cuda.is_available() else "cpu"


	processor = AutoProcessor.from_pretrained(model_id)
	base_model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

	model = PeftModel.from_pretrained(base_model, "khazarai/Medical-TTS")

	text = "Mild dorsal angulation of the distal radius reflective of the fracture."

	speaker_id = 0

	# CSM expects a chat-style conversation; the role is the speaker ID as a string.
	conversation = [
	    {"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
	]

	# Tokenize and move the inputs to the detected device (not a hard-coded "cuda").
	inputs = processor.apply_chat_template(
	    conversation,
	    tokenize=True,
	    return_dict=True,
	).to(device)

	audio_values = model.generate(
	    **inputs,
	    max_new_tokens=650,
	    # play with these parameters to tweak results
	    # depth_decoder_top_k=0,
	    # depth_decoder_top_p=0.9,
	    # depth_decoder_do_sample=True,
	    # depth_decoder_temperature=0.9,
	    # top_k=0,
	    # top_p=1.0,
	    # temperature=0.9,
	    # do_sample=True,
	    output_audio=True,
	)

	# Save the first generated waveform as a 24 kHz WAV file.
	audio = audio_values[0].to(torch.float32).cpu().numpy()
	sf.write("example.wav", audio, 24000)

	```
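When synthesizing several sentences in a loop, the conversation structure from the example above can be wrapped in a small helper. This is a minimal sketch: `build_conversation` and `build_conversations` are hypothetical names, not part of transformers or PEFT, and the dictionary layout simply mirrors the example above.

```python
from typing import Dict, Iterable, List

Conversation = List[Dict[str, object]]


def build_conversation(text: str, speaker_id: int = 0) -> Conversation:
    """Wrap one utterance in the chat layout used in the example above."""
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("text must not be empty")
    return [
        {"role": str(speaker_id), "content": [{"type": "text", "text": cleaned}]},
    ]


def build_conversations(texts: Iterable[str], speaker_id: int = 0) -> List[Conversation]:
    """Build one single-turn conversation per non-empty input string."""
    return [build_conversation(t, speaker_id) for t in texts if t.strip()]


if __name__ == "__main__":
    sentences = [
        "Mild dorsal angulation of the distal radius reflective of the fracture.",
        "",  # blank entries are skipped
    ]
    for conversation in build_conversations(sentences):
        print(conversation)
```

Each returned conversation can be passed to `processor.apply_chat_template` in the same way as the `conversation` list in the example above, writing one WAV file per sentence.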


	### Framework versions

	- PEFT 0.15.2