---
license: other
license_name: health-ai-developer-foundations
license_link: https://developers.google.com/health-ai-developer-foundations/terms
language:
- en
pipeline_tag: automatic-speech-recognition
library_name: transformers
tags:
- medical-asr
- radiology
- medical
---
# MedASR Model Card
## **Model documentation:** [MedASR](https://developers.google.com/health-ai-developer-foundations/medasr)
**Resources:**
* Model on Google Cloud Model Garden: [MedASR](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/medasr)
* Model on Hugging Face: [MedASR](https://huggingface.co/google/medasr)
* GitHub repository (supporting code, Colab notebooks, discussions, and
issues): [MedASR](https://github.com/google-health/medasr)
* Quick start notebook: [GitHub](https://github.com/google-health/medasr/blob/main/notebooks/quick_start_with_hugging_face.ipynb)
* Fine-tuning notebook: [GitHub](https://github.com/google-health/medasr/blob/main/notebooks/fine_tune_with_hugging_face.ipynb)
* Support: See [Contact](https://developers.google.com/health-ai-developer-foundations/medasr/get-started.md#contact)
* License: The use of MedASR is governed by the [Health AI Developer
Foundations terms of
use](https://developers.google.com/health-ai-developer-foundations/terms).
**Author:** Google
## Model information
This section describes the MedASR (Medical Automated Speech Recognition) model
and how to use it.
### Description
MedASR is a speech-to-text model based on the [Conformer
architecture](https://arxiv.org/abs/2005.08100) pre-trained for medical
dictation. MedASR is intended as a starting point for developers, and is
well-suited for dictation tasks involving medical terminologies, such as
radiology dictation, and transcribing physician-patient conversations. While
MedASR has been extensively pre-trained on a corpus of medical audio data, its
performance may vary when it encounters terms outside of its pre-training data,
such as non-standard medication names, or when it handles temporal data (dates,
times, or durations).
### How to use
The following are some example code snippets to help you quickly get started
running the model locally. If you want to use the model at scale, we recommend
that you create a production version using [Model
Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/medasr).
First, install the Transformers library. MedASR is supported starting from
transformers 5.0.0. You may need to install transformers from GitHub.
```shell
$ uv pip install git+https://github.com/huggingface/transformers.git@65dc261512cbdb1ee72b88ae5b222f2605aad8e5
```
**Run model with the pipeline API**
```py
import huggingface_hub
from transformers import pipeline

# Download the sample recording that ships with the model repository.
audio = huggingface_hub.hf_hub_download("google/medasr", "test_audio.wav")

model_id = "google/medasr"
pipe = pipeline("automatic-speech-recognition", model=model_id)

# chunk_length_s is the length, in seconds, of each audio chunk MedASR processes;
# stride_length_s is the overlapping context, in seconds, on each side of a chunk.
result = pipe(audio, chunk_length_s=20, stride_length_s=2)
print(result)
```
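To make the two chunking parameters concrete, the following sketch lays out how a recording could be divided into overlapping chunks. It is a simplified illustration, not the pipeline's actual code: the function name `chunk_spans` is hypothetical, and it assumes the stride is applied on both sides of each chunk (except at the start and end of the recording), with the strided portion of each chunk discarded when the text is stitched together.

```python
def chunk_spans(total_s, chunk_length_s=20.0, stride_length_s=2.0):
    """Return (start, end, keep_start, keep_end) in seconds for each chunk.

    Hypothetical helper for illustration only. `start`/`end` delimit the audio
    fed to the model; `keep_start`/`keep_end` delimit the part whose
    transcription is kept after the overlapping context is discarded.
    """
    # Assumption: the stride applies to both sides, so chunks advance by
    # chunk_length_s minus twice the stride.
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk_length_s must exceed twice stride_length_s")

    spans = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        is_first = start == 0.0
        is_last = start + chunk_length_s >= total_s
        # No left context to drop on the first chunk, no right context on the last.
        keep_start = start if is_first else start + stride_length_s
        keep_end = end if is_last else end - stride_length_s
        spans.append((start, end, keep_start, keep_end))
        if is_last:
            break
        start += step
    return spans


for span in chunk_spans(50.0):
    print(span)
```

Under these assumptions the kept regions tile the recording without gaps, while each chunk still sees a little audio on either side of the region it is responsible for.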
**Run the model directly**
```py
import huggingface_hub
import librosa
import torch
from transformers import AutoModelForCTC, AutoProcessor

model_id = "google/medasr"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCTC.from_pretrained(model_id).to(device)

# Download the sample recording and load it as 16 kHz mono audio.
audio = huggingface_hub.hf_hub_download("google/medasr", "test_audio.wav")
speech, sample_rate = librosa.load(audio, sr=16000)

inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt", padding=True)
inputs = inputs.to(device)

# Transcribe, then decode the predicted token IDs to text.
outputs = model.generate(**inputs)
decoded_text = processor.batch_decode(outputs)[0]
print(f"result={decoded_text}")
```
### Examples
See the following tutorial notebooks for examples of how to use MedASR:
* To give the model a quick try, running it locally with weights from Hugging
Face, see [Quick start notebook in
Colab](https://colab.research.google.com/github/google-health/medasr/blob/main/notebooks/quick_start_with_hugging_face.ipynb).
* For an example of fine-tuning the model, see the [Fine-tuning notebook in
Colab](https://colab.research.google.com/github/google-health/medasr/blob/main/notebooks/fine_tune_with_hugging_face.ipynb).
### Model architecture overview
The MedASR model is built on the
[Conformer](https://arxiv.org/abs/2005.08100) architecture.
### Technical specifications
* **Model type**: Automated-speech-detector
* **Input Modalities**: Mono-channel audio 16kHz, int16 waveform
* **Output Modality:** Text only
* **Number of parameters:** 105M
* **Key publication**: [LAST: Scalable Lattice-Based Speech Modelling in JAX](https://arxiv.org/pdf/2304.13134)
* **Model created**: December 18, 2025
* **Model version**: 1.0.0
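As a quick sanity check against the input specification above, the following sketch inspects a WAV file header using only the Python standard library. The function name `check_wav_format` is hypothetical, and the check is an assumption about what is convenient to verify before inference; it is not part of MedASR or Transformers.

```python
import wave


def check_wav_format(path, expected_rate=16000):
    """Return a list of mismatches against mono, 16 kHz, 16-bit PCM audio.

    Hypothetical helper for illustration; an empty list means the file header
    matches the input format listed in the technical specifications.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"expected 1 channel, found {wav.getnchannels()}")
        if wav.getframerate() != expected_rate:
            problems.append(f"expected {expected_rate} Hz, found {wav.getframerate()} Hz")
        # getsampwidth() is in bytes, so int16 samples report a width of 2.
        if wav.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, found {8 * wav.getsampwidth()}-bit")
    return problems
```

Files that fail the check can be resampled and downmixed while loading, for example with `librosa.load(path, sr=16000)` as in the usage example earlier in this card.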
### Citation
When using this model, cite: \
@inproceedings{wu2023last, \
title={Last: Scalable Lattice-Based Speech Modelling in Jax}, \
author={Wu, Ke and Variani, Ehsan and Bagby, Tom and Riley, Michael}, \
booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP)}, \
pages={1--5}, \
year={2023}, \
organization={IEEE} \
}
### **Performance and Evaluations**
Our evaluation methods include evaluating word-error rate (WER) of MedASR
against held out medical audio examples. We also evaluate specifically medical
WER, where we only look at words that have a medical context. These audio
samples have been transcribed by human experts, but there is always some noise
in such transcriptions.
**Key performance metrics**
Word error rate of MedASR versus other models\*
Dataset name | Dataset description | MedASR with greedy decoding | MedASR \+ 6-gram language model | Gemini 2.5 Pro | Gemini 2.5 Flash | Whisper v3 Large
:------------------------------------------------------- | :---------------------------------------------------------- | :-------------------------- | :------------------------------ | :------------- | :--------------- | :---------------
RAD-DICT | Private radiologist dictation dataset | 6.6% | **4.6%** | 10.0% | 24.4% | 25.3%
GENERAL-DICT | Private general and internal medicine dataset | 9.3% | **6.9%** | 16.4% | 27.1% | 33.1%
FM-DICT | Private family medicine dataset | 8.1% | **5.8%** | 14.6% | 19.9% | 32.5%
[Eye Gaze](https://physionet.org/content/egd-cxr/1.0.0/) | Dictation of audio from 998 MIMIC cases (multiple speakers) | 6.6% | **5.2%** | 5.9% | 9.3% | 12.5%
\*All results except "MedASR \+ 6-gram language model" in the preceding table
use greedy decoding. "MedASR \+ 6-gram language model" uses beam search with
beam size 8.
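To read the table in relative terms, the short sketch below computes how much the 6-gram language model reduces MedASR's WER compared with greedy decoding. The numbers are copied from the table; the helper name `relative_reduction` is only illustrative.

```python
# WER (%) copied from the table above: (greedy decoding, + 6-gram language model).
wer_percent = {
    "RAD-DICT": (6.6, 4.6),
    "GENERAL-DICT": (9.3, 6.9),
    "FM-DICT": (8.1, 5.8),
    "Eye Gaze": (6.6, 5.2),
}


def relative_reduction(baseline, improved):
    """Relative WER reduction, in percent, of `improved` over `baseline`."""
    return 100.0 * (baseline - improved) / baseline


for name, (greedy, with_lm) in wer_percent.items():
    print(f"{name}: {relative_reduction(greedy, with_lm):.1f}% relative WER reduction")
```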
#### **Safety evaluation**
Our evaluation methods include structured evaluations and internal red-teaming
testing of relevant safety policies. This model was evaluated across various
dimensions to assess safety. Human evaluations were conducted on 100 example
outputs to assess for potential safety impact, specifically related to incorrect
transcriptions associated with medication names, dosages, diagnoses, semantic
changes, and medical terminology. The results of these evaluations were
determined to be acceptable with regard to internal policies for overall safety.
## Data card
### Dataset overview
#### Training
The MedASR model is specifically trained on a diverse set of de-identified
medical speech data. Its training utilizes approximately 5000 hours of physician
dictations across a range of specialities (proprietary dataset 1) and
de-identified medical conversations, primarily physician-patient dialogue
(proprietary dataset 2). The model is trained on audio segments paired with
corresponding transcripts and metadata, with subsets of the conversational data
also including extensive annotations for medical named entities such as
symptoms, medications, and conditions. MedASR therefore has a strong
understanding of vocabulary used in medical contexts.
#### Evaluation
MedASR has been evaluated using a mix of internal and public datasets as noted
in the Key performance metrics section. We took the argmax of the model's
posterior probabilities (greedy decoding) to obtain the hypothesis tokens. The
hypothesis is compared against the ground-truth transcript using the jiwer
library to calculate the word error rate.
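The evaluation procedure described above can be sketched in plain Python. This is an illustration under stated assumptions, not the evaluation code used for the reported numbers: it assumes CTC-style outputs in which repeated frame predictions are collapsed and a blank token (hypothetically ID 0 here) is dropped, and it computes WER with a small word-level edit distance instead of calling jiwer. The names `ctc_greedy_decode` and `word_error_rate` and the toy vocabulary are made up for the example.

```python
def ctc_greedy_decode(frame_scores, id_to_token, blank_id=0):
    """Greedy decoding: argmax per frame, collapse repeats, drop blanks.

    `blank_id=0` is an assumption for this sketch; a real tokenizer may use
    a different blank ID.
    """
    best_ids = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    tokens = []
    previous = None
    for token_id in best_ids:
        if token_id != previous and token_id != blank_id:
            tokens.append(id_to_token[token_id])
        previous = token_id
    return "".join(tokens)


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic-programming edit distance, keeping only the previous row.
    prev_row = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        row = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev_row[j - 1] + (ref_word != hyp_word)
            row.append(min(prev_row[j] + 1, row[j - 1] + 1, substitution))
        prev_row = row
    return prev_row[-1] / len(ref)


# Toy character vocabulary and one-hot "posteriors" for illustration only.
vocab = {0: "", 1: "n", 2: "o", 3: " ", 4: "m", 5: "a", 6: "s"}
frame_ids = [1, 1, 0, 2, 3, 4, 5, 6, 0, 6]
frames = [[1.0 if i == t else 0.0 for i in range(len(vocab))] for t in frame_ids]

hypothesis = ctc_greedy_decode(frames, vocab)
print(hypothesis)
print(word_error_rate("no focal mass", hypothesis))
```

The blank frame between the two `s` predictions is what lets the decoder emit a doubled letter; without it, the repeated frames would collapse into one.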
#### Source
The datasets used to train MedASR include a public dataset for pre-training and
a proprietary dataset that was licensed and incorporated (described in the
following section).
### Data ownership and documentation
Pre-training used the full [LibriHeavy training
set](https://arxiv.org/abs/2309.08105). Fine-tuning was conducted on the
de-identified, licensed dataset described below.
Private Medical Dict: Google internal dataset consisting of de-identified
dictations made by physicians of different specialities including radiology,
internal medicine, family medicine, and other subspecialties totaling more than
5000 hours of audio. This dataset was split to produce the test sets RAD-DICT,
FM-DICT, and GENERAL-DICT (general and internal medicine) referenced previously
in Performance and Evaluations.
### Data citation
Eye Gaze Data for Chest X-rays (evaluation set described previously in
Performance and Evaluations) was derived from:
MIMIC-CXR Database v1.0.0 and MIMIC-IV v0.4
### De-identification/anonymization:
Google and its partners utilize datasets that have been rigorously anonymized or
de-identified to ensure the protection of individual research participants and
patient privacy.
## **Implementation Information**
Details about the model internals.
### **Hardware**
[Tensor Processing Unit (TPU)](https://cloud.google.com/tpu/docs/intro-to-tpu)
hardware (TPUv4p, TPUv5p and TPUv5e). Training speech-to-text models requires
significant computational power. TPUs, designed specifically for matrix
operations common in machine learning, offer several advantages in this domain:
* Performance: TPUs are specifically designed to handle the massive
computations involved in training large speech models. They can speed up
training considerably compared to CPUs.
* Memory: TPUs often come with large amounts of high-bandwidth memory,
allowing for the handling of large models and batch sizes during training.
This can lead to better model quality.
* Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution
for handling the growing complexity of large foundation models. You can
distribute training across multiple TPU devices for faster and more
efficient processing.
* Cost-effectiveness: In many scenarios, TPUs can provide a more
cost-effective solution for training large models compared to CPU-based
infrastructure, especially when considering the time and resources saved due
to faster training.
* These advantages are aligned with [Google's commitments to operate
sustainably](https://sustainability.google/operating-sustainably/).
### **Software**
Training was done using [JAX](https://github.com/jax-ml/jax) and [ML
Pathways](https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/).
JAX allows researchers to take advantage of the latest generation of hardware,
including TPUs, for faster and more efficient training of large models. ML
Pathways is Google's latest effort to build artificially intelligent systems
capable of generalizing across multiple tasks. This is especially suitable for
foundation models, including speech models like this one.
Together, JAX and ML Pathways are used as described in the [paper about the
Gemini family of models](https://goo.gle/gemma2report); *"the 'single
controller' programming model of JAX and Pathways allows a single Python process
to orchestrate the entire training run, dramatically simplifying the development
workflow."*
## **Usage and Limitations**
The MedASR model has certain limitations that users should be aware of.
### **Intended Use**
MedASR is a speech-to-text model intended to be used as a starting point that
enables more efficient development of downstream healthcare applications
requiring speech as input. MedASR is intended for developers in the healthcare
and life sciences space. Developers are responsible for training, adapting, and
making meaningful changes to MedASR to accomplish their specific intended use.
The MedASR model can be fine-tuned by developers using their own proprietary
data for their specific tasks or solutions.
MedASR is trained on a large amount of medical audio, speech, and text data. It
enables further development, integration, or both, with generative models like
[MedGemma](https://developers.google.com/health-ai-developer-foundations/medgemma),
where MedASR converts speech to text, which can then be used as input for a
text-to-text response. Full details of all the tasks MedASR has been evaluated
and pre-trained on can be found in the MedASR model card.
MedASR is not intended to be used without appropriate validation, adaptation, or
meaningful modification by developers for their specific use case. The
outputs generated by MedASR may include transcription errors and are not
intended to directly inform clinical diagnosis, patient management decisions,
treatment recommendations, or any other direct clinical practice applications.
All outputs from MedASR should be considered preliminary and require independent
verification, clinical correlation, and further investigation through
established research and development methodologies.
### **Limitations**
* Training Data
    * English-only: All training data is in English.
    * Speaker diversity: Most training data comes from speakers whose first
      language is English and who were raised in the United States. The base
      model's performance may be lower for other types of speakers,
      necessitating fine-tuning.
    * Speaker Sex/Gender: Training data included both men and women but had a
      higher proportion of men.
    * Audio quality: Training data is mostly from high-quality microphones.
      The base model's performance may deteriorate on low-quality audio with
      background noise, necessitating fine-tuning.
    * Specialized medical terminology: Although MedASR has specialized medical
      audio training, its training may not include all medications, procedures,
      or terminology, especially ones that have come into usage in the past 10
      years.
    * Dates: MedASR has been trained on de-identified data, so its performance
      on different date formats may be lacking. This can be rectified with
      further fine-tuning or alternative decoding approaches such as language
      model decoding debiasing.
### Benefits
At the time of release, MedASR is a high-performing open speech-to-text model
with specific training for medical applications. Users can update its vocabulary
with few-shot fine-tuning or by decoding with external language models.
Based on the benchmark evaluation metrics in this document, MedASR represents a
significant leap forward in medical speech-to-text performance relative to other
comparably sized open model alternatives.