|
|
---
|
|
|
license: other
|
|
|
license_name: health-ai-developer-foundations
|
|
|
license_link: https://developers.google.com/health-ai-developer-foundations/terms
|
|
|
language:
|
|
|
- en
|
|
|
pipeline_tag: automatic-speech-recognition
|
|
|
library_name: transformers
|
|
|
tags:
|
|
|
- medical-asr
|
|
|
- radiology
|
|
|
- medical
|
|
|
---
|
|
|
|
|
|
# MedASR Model Card
|
|
|
|
|
|
## **Model documentation:** [MedASR](https://developers.google.com/health-ai-developer-foundations/medasr)
|
|
|
|
|
|
**Resources:**
|
|
|
|
|
|
* Model on Google Cloud Model Garden: [MedASR](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/medasr)
|
|
|
|
|
|
* Model on Hugging Face: [MedASR](https://huggingface.co/google/medasr)
|
|
|
|
|
|
* GitHub repository (supporting code, Colab notebooks, discussions, and
|
|
|
issues): [MedASR](https://github.com/google-health/medasr)
|
|
|
|
|
|
* Quick start notebook: [GitHub](https://github.com/google-health/medasr/blob/main/notebooks/quick_start_with_hugging_face.ipynb)
|
|
|
|
|
|
* Fine-tuning notebook: [GitHub](https://github.com/google-health/medasr/blob/main/notebooks/fine_tune_with_hugging_face.ipynb)
|
|
|
|
|
|
* Support: See [Contact](https://developers.google.com/health-ai-developer-foundations/medasr/get-started.md#contact)
|
|
|
|
|
|
* License: The use of MedASR is governed by the [Health AI Developer
|
|
|
Foundations terms of
|
|
|
use](https://developers.google.com/health-ai-developer-foundations/terms).
|
|
|
|
|
|
**Author:** Google
|
|
|
|
|
|
## Model information
|
|
|
|
|
|
This section describes the MedASR (Medical Automated Speech Recognition) model
|
|
|
and how to use it.
|
|
|
|
|
|
### Description
|
|
|
|
|
|
MedASR is a speech-to-text model based on the [Conformer
architecture](https://arxiv.org/abs/2005.08100), pre-trained for medical
dictation. MedASR is intended as a starting point for developers and is
well-suited for dictation tasks involving medical terminology, such as
radiology dictation and transcribing physician-patient conversations. While
MedASR has been extensively pre-trained on a corpus of medical audio data, it
may occasionally exhibit performance variability when encountering terms
outside of its pre-training data, such as non-standard medication names, and
may handle temporal data (dates, times, or durations) inconsistently.
|
|
|
|
|
|
### How to use
|
|
|
|
|
|
The following are some example code snippets to help you quickly get started
|
|
|
running the model locally. If you want to use the model at scale, we recommend
|
|
|
that you create a production version using [Model
|
|
|
Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/medasr).
|
|
|
|
|
|
First, install the Transformers library. MedASR is supported starting from
|
|
|
transformers 5.0.0. You may need to install transformers from GitHub.
|
|
|
|
|
|
```shell
|
|
|
$ uv pip install git+https://github.com/huggingface/transformers.git@65dc261512cbdb1ee72b88ae5b222f2605aad8e5
|
|
|
```
|
|
|
|
|
|
**Run the model with the pipeline API**
|
|
|
|
|
|
```py
|
|
|
from transformers import pipeline
import huggingface_hub

audio = huggingface_hub.hf_hub_download('google/medasr', 'test_audio.wav')
model_id = "google/medasr"
pipe = pipeline("automatic-speech-recognition", model=model_id)

# chunk_length_s is how many seconds of audio MedASR batches at a time;
# stride_length_s is the overlap in seconds between consecutive chunks.
result = pipe(audio, chunk_length_s=20, stride_length_s=2)
print(result)
|
|
|
```
|
|
|
|
|
|
**Run the model directly**
|
|
|
|
|
|
```py
|
|
|
from transformers import AutoModelForCTC, AutoProcessor
import huggingface_hub
import librosa
import torch

model_id = "google/medasr"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCTC.from_pretrained(model_id).to(device)

audio = huggingface_hub.hf_hub_download(model_id, 'test_audio.wav')
speech, sample_rate = librosa.load(audio, sr=16000)
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt", padding=True)
inputs = inputs.to(device)

# Greedy (argmax) CTC decoding over the frame-level posteriors.
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
decoded_text = processor.batch_decode(predicted_ids)[0]
print(f"result={decoded_text}")
|
|
|
```
|
|
|
|
|
|
### Examples
|
|
|
|
|
|
See the following tutorial notebooks for examples of how to use MedASR:
|
|
|
|
|
|
* To give the model a quick try by running it locally with weights from
  Hugging Face, see the [Quick start notebook in
  Colab](https://colab.research.google.com/github/google-health/medasr/blob/main/notebooks/quick_start_with_hugging_face.ipynb).
|
|
|
|
|
|
* For an example of fine-tuning the model, see the [Fine-tuning notebook in
|
|
|
Colab](https://colab.research.google.com/github/google-health/medasr/blob/main/notebooks/fine_tune_with_hugging_face.ipynb).
|
|
|
|
|
|
### Model architecture overview
|
|
|
|
|
|
The MedASR model is built on the
|
|
|
[Conformer](https://arxiv.org/abs/2005.08100) architecture.
|
|
|
|
|
|
### Technical specifications
|
|
|
|
|
|
* **Model type**: Automatic speech recognition (ASR)
|
|
|
|
|
|
* **Input Modalities**: Mono-channel 16 kHz audio, int16 waveform (see the
  conversion sketch after this list)
|
|
|
|
|
|
* **Output Modality:** Text only
|
|
|
|
|
|
* **Number of parameters:** 105M
|
|
|
|
|
|
* **Key publication**: [LAST: Scalable Lattice-Based Speech Modelling in JAX](https://arxiv.org/pdf/2304.13134)
|
|
|
|
|
|
* **Model created**: December 18, 2025
|
|
|
|
|
|
* **Model version**: 1.0.0
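
If your audio does not already match this format, a minimal conversion sketch
(assuming librosa and NumPy; `recording.wav` is a hypothetical placeholder):

```py
# Hypothetical sketch: converting an arbitrary recording to mono 16 kHz.
# librosa returns float32 in [-1, 1]; scale to int16 if your pipeline needs it.
import librosa
import numpy as np

speech, sr = librosa.load("recording.wav", sr=16000, mono=True)  # placeholder file
waveform_int16 = (np.clip(speech, -1.0, 1.0) * 32767).astype(np.int16)
```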
|
|
|
|
|
|
### Citation
|
|
|
|
|
|
When using this model, cite: \
@inproceedings{wu2023last, \
title={LAST: Scalable Lattice-Based Speech Modelling in JAX}, \
author={Wu, Ke and Variani, Ehsan and Bagby, Tom and Riley, Michael}, \
booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP)}, \
pages={1--5}, \
year={2023}, \
organization={IEEE} \
}
|
|
|
|
|
|
### **Performance and Evaluations**
|
|
|
|
|
|
Our evaluation methods include measuring the word error rate (WER) of MedASR
against held-out medical audio examples. We also evaluate a medical-specific
WER that considers only words with a medical context. These audio samples have
been transcribed by human experts, but such transcriptions always contain some
noise.
|
|
|
|
|
|
**Key performance metrics**
|
|
|
|
|
|
Word error rate of MedASR versus other models\*
|
|
|
|
|
|
Dataset name | Dataset description | MedASR with greedy decoding | MedASR \+ 6-gram language model | Gemini 2.5 Pro | Gemini 2.5 Flash | Whisper v3 Large
|
|
|
:------------------------------------------------------- | :---------------------------------------------------------- | :-------------------------- | :------------------------------ | :------------- | :--------------- | :---------------
|
|
|
RAD-DICT | Private radiologist dictation dataset | 6.6% | **4.6%** | 10.0% | 24.4% | 25.3%
|
|
|
GENERAL-DICT | Private general and internal medicine dataset | 9.3% | **6.9%** | 16.4% | 27.1% | 33.1%
|
|
|
FM-DICT | Private family medicine dataset | 8.1% | **5.8%** | 14.6% | 19.9% | 32.5%
|
|
|
[Eye Gaze](https://physionet.org/content/egd-cxr/1.0.0/) | Dictation of audio from 998 MIMIC cases (multiple speakers) | 6.6% | **5.2%** | 5.9% | 9.3% | 12.5%
|
|
|
|
|
|
\*All results except "MedASR \+ 6-gram language model" in the preceding table
|
|
|
use greedy decoding. "MedASR \+ 6-gram language model" uses beam search with
|
|
|
beam size 8.
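
The "MedASR \+ 6-gram language model" setup combines MedASR's frame-level
posteriors with an external n-gram language model during beam search. No
language model ships with MedASR, but as a hedged sketch, this style of
decoding can be done with the
[pyctcdecode](https://github.com/kensho-technologies/pyctcdecode) library and
a KenLM model; `medical_6gram.arpa` below is a hypothetical placeholder:

```py
# Hypothetical sketch: beam-search CTC decoding with an external n-gram LM via
# pyctcdecode (pip install pyctcdecode kenlm). "medical_6gram.arpa" is a
# placeholder; no language model is distributed with MedASR.
import huggingface_hub
import librosa
import torch
from pyctcdecode import build_ctcdecoder
from transformers import AutoModelForCTC, AutoProcessor

model_id = "google/medasr"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCTC.from_pretrained(model_id)

# pyctcdecode expects the vocabulary ordered by token id.
vocab = processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
decoder = build_ctcdecoder(labels, kenlm_model_path="medical_6gram.arpa")

audio = huggingface_hub.hf_hub_download(model_id, "test_audio.wav")
speech, _ = librosa.load(audio, sr=16000)
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(decoder.decode(logits[0].cpu().numpy(), beam_width=8))
```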
|
|
|
|
|
|
#### **Safety evaluation**
|
|
|
|
|
|
Our evaluation methods include structured evaluations and internal red-teaming
of relevant safety policies. The model was evaluated across various dimensions
to assess safety. Human evaluations were conducted on 100 example outputs to
assess potential safety impact, specifically incorrect transcriptions of
medication names, dosages, diagnoses, and medical terminology, as well as
semantic changes. The results of these evaluations were determined to be
acceptable with regard to internal policies for overall safety.
|
|
|
|
|
|
## Data card
|
|
|
|
|
|
### Dataset overview
|
|
|
|
|
|
#### Training
|
|
|
|
|
|
The MedASR model is specifically trained on a diverse set of de-identified
|
|
|
medical speech data. Its training utilizes approximately 5,000 hours of
physician dictations across a range of specialties (proprietary dataset 1) and
|
|
|
de-identified medical conversations, primarily physician-patient dialogue
|
|
|
(proprietary dataset 2). The model is trained on audio segments paired with
|
|
|
corresponding transcripts and metadata, with subsets of the conversational data
|
|
|
also including extensive annotations for medical named entities such as
|
|
|
symptoms, medications, and conditions. MedASR therefore has a strong
|
|
|
understanding of vocabulary used in medical contexts.
|
|
|
|
|
|
#### Evaluation
|
|
|
|
|
|
MedASR has been evaluated using a mix of internal and public datasets, as noted
in the Key Performance Metrics section. We take the argmax of the model's
posterior probabilities (greedy decoding) to obtain the model's hypothesis
tokens. Each hypothesis is compared against the ground-truth transcript using
the jiwer library to calculate the word error rate.
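
As an illustration of this metric, a minimal sketch using jiwer's `wer`
function (the transcripts below are made-up placeholders):

```py
# Minimal sketch of the WER computation with jiwer (pip install jiwer).
# The reference and hypothesis strings are made-up placeholders.
import jiwer

reference = "the chest radiograph shows no acute cardiopulmonary disease"
hypothesis = "the chest radiograph shows no acute cardio pulmonary disease"

print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
```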
|
|
|
|
|
|
#### Source
|
|
|
|
|
|
The datasets used to train MedASR include a public dataset for pre-training and
|
|
|
a proprietary dataset that was licensed and incorporated (described in the
|
|
|
following section).
|
|
|
|
|
|
### Data ownership and documentation
|
|
|
|
|
|
Pre-training used the full [LibriHeavy training
set](https://arxiv.org/abs/2309.08105). Fine-tuning was conducted on the
de-identified, licensed datasets described below.
|
|
|
|
Private Medical Dict: a Google-internal dataset consisting of de-identified
dictations made by physicians of different specialties, including radiology,
internal medicine, family medicine, and other subspecialties, totaling more
than 5,000 hours of audio. This dataset was split into the test sets that
constitute RAD-DICT, GENERAL-DICT, and FM-DICT, referenced previously in
Performance and Evaluations.
|
|
|
|
|
|
### Data citation
|
|
|
|
|
|
Eye Gaze Data for Chest X-rays (evaluation set described previously in
|
|
|
Performance and Evaluations) was derived from:
|
|
|
|
|
|
MIMIC-CXR Database v1.0.0 and MIMIC-IV v0.4
|
|
|
|
|
|
### De-identification/anonymization
|
|
|
|
|
|
Google and its partners utilize datasets that have been rigorously anonymized or
|
|
|
de-identified to ensure the protection of individual research participants and
|
|
|
patient privacy.
|
|
|
|
|
|
## **Implementation Information**
|
|
|
|
|
|
Details about the model internals.
|
|
|
|
|
|
### **Hardware**
|
|
|
|
|
|
[Tensor Processing Unit (TPU)](https://cloud.google.com/tpu/docs/intro-to-tpu)
|
|
|
hardware (TPUv4p, TPUv5p, and TPUv5e). Training speech-to-text models requires
|
|
|
significant computational power. TPUs, designed specifically for matrix
|
|
|
operations common in machine learning, offer several advantages in this domain:
|
|
|
|
|
|
* Performance: TPUs are specifically designed to handle the massive
|
|
|
  computations involved in training speech models. They can speed up training
|
|
|
considerably compared to CPUs.
|
|
|
* Memory: TPUs often come with large amounts of high-bandwidth memory,
|
|
|
allowing for the handling of large models and batch sizes during training.
|
|
|
This can lead to better model quality.
|
|
|
* Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution
|
|
|
for handling the growing complexity of large foundation models. You can
|
|
|
distribute training across multiple TPU devices for faster and more
|
|
|
efficient processing.
|
|
|
* Cost-effectiveness: In many scenarios, TPUs can provide a more
|
|
|
cost-effective solution for training large models compared to CPU-based
|
|
|
infrastructure, especially when considering the time and resources saved due
|
|
|
to faster training.
|
|
|
These advantages are aligned with [Google's commitments to operate
sustainably](https://sustainability.google/operating-sustainably/).
|
|
|
|
|
|
### **Software**
|
|
|
|
|
|
Training was done using [JAX](https://github.com/jax-ml/jax) and [ML
|
|
|
Pathways](https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/).
|
|
|
JAX allows researchers to take advantage of the latest generation of hardware,
|
|
|
including TPUs, for faster and more efficient training of large models. ML
|
|
|
Pathways is Google's latest effort to build artificially intelligent systems
|
|
|
capable of generalizing across multiple tasks. This is especially suitable for
foundation models, including speech models like MedASR.
|
|
|
|
|
|
Together, JAX and ML Pathways are used as described in the [paper about the
|
|
|
Gemini family of models](https://goo.gle/gemma2report); *"the 'single
|
|
|
controller' programming model of JAX and Pathways allows a single Python process
|
|
|
to orchestrate the entire training run, dramatically simplifying the development
|
|
|
workflow."*
|
|
|
|
|
|
## **Usage and Limitations**
|
|
|
|
|
|
The MedASR model has certain limitations that users should be aware of.
|
|
|
|
|
|
### **Intended Use**
|
|
|
|
|
|
MedASR is a speech-to-text model intended to be used as a starting point that
|
|
|
enables more efficient development of downstream healthcare applications
|
|
|
requiring speech as input. MedASR is intended for developers in the healthcare
|
|
|
and life sciences space. Developers are responsible for training, adapting, and
|
|
|
making meaningful changes to MedASR to accomplish their specific intended use.
|
|
|
The MedASR model can be fine-tuned by developers using their own proprietary
|
|
|
data for their specific tasks or solutions.
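
As a rough, hypothetical sketch of what a single fine-tuning step looks like
(the audio file and transcript are placeholders, and the assumption that the
processor tokenizes text into CTC label ids follows the usual transformers CTC
API; the full recipe is in the fine-tuning notebook):

```py
# Hypothetical sketch of one CTC training step; not a complete recipe.
import librosa
from transformers import AutoModelForCTC, AutoProcessor

model_id = "google/medasr"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCTC.from_pretrained(model_id)

# Load one (audio, transcript) pair; both are illustrative placeholders.
speech, sr = librosa.load("my_dictation.wav", sr=16000)
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
# Tokenize the reference transcript into CTC label ids (assumes the processor
# routes text input to its tokenizer, as transformers CTC processors usually do).
labels = processor(text="chest x ray shows no acute disease",
                   return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss  # CTC loss for this one example
loss.backward()  # from here, plug into an optimizer or the Trainer API
```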
|
|
|
|
|
|
MedASR is trained on a large volume of medical audio, speech, and text, and it
enables further development and integration with generative models like
[MedGemma](https://developers.google.com/health-ai-developer-foundations/medgemma),
where MedASR converts speech to text, which can then be used as input for a
text-to-text model. Full details of all the tasks MedASR has been pre-trained
and evaluated on can be found in the MedASR model card.
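
A hedged sketch of that integration pattern follows; the MedGemma checkpoint
name (`google/medgemma-4b-it`), the audio file, and the prompt are illustrative
assumptions rather than a prescribed recipe:

```py
# Hypothetical sketch: feeding a MedASR transcript into a generative model.
# The checkpoint "google/medgemma-4b-it", the audio file, and the prompt are
# illustrative assumptions, not a prescribed recipe.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="google/medasr")
transcript = asr("dictation.wav", chunk_length_s=20, stride_length_s=2)["text"]

llm = pipeline("text-generation", model="google/medgemma-4b-it")
prompt = f"Format the following dictation as a structured report:\n{transcript}"
print(llm(prompt, max_new_tokens=256)[0]["generated_text"])
```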
|
|
|
|
|
|
MedASR is not intended to be used without appropriate validation, adaptation,
or meaningful modification by developers for their specific use case. The
|
|
|
outputs generated by MedASR may include transcription errors and are not
|
|
|
intended to directly inform clinical diagnosis, patient management decisions,
|
|
|
treatment recommendations, or any other direct clinical practice applications.
|
|
|
All outputs from MedASR should be considered preliminary and require independent
|
|
|
verification, clinical correlation, and further investigation through
|
|
|
established research and development methodologies.
|
|
|
|
|
|
### **Limitations**
|
|
|
|
|
|
* Training Data
|
|
|
    * English-only: All training data is in English.
    * Speaker diversity: Most training data comes from speakers whose first
      language is English and who were raised in the United States. The base
      model's performance may be lower for other speakers, which may
      necessitate fine-tuning.
    * Speaker sex/gender: Training data included both men and women but had a
      higher proportion of men.
    * Audio quality: Training data is mostly from high-quality microphones.
      The base model's performance may deteriorate on low-quality audio with
      background noise, which may necessitate fine-tuning.
    * Specialized medical terminology: Although MedASR is trained on
      specialized medical audio, its training may not include all medications,
      procedures, or terminology, especially ones that have come into use in
      the past 10 years.
    * Dates: MedASR has been trained on de-identified data, so its performance
      on different date formats may be lacking. This can be rectified with
      further fine-tuning or with alternative decoding approaches, such as
      debiasing with an external language model during decoding.
|
|
|
|
|
|
### Benefits
|
|
|
|
|
|
At the time of release, MedASR is a high-performing open speech-to-text model
with specific training for medical applications. Users can update its
vocabulary with few-shot fine-tuning or by decoding with external language
models.
|
|
|
|
|
|
Based on the benchmark evaluation metrics in this document, MedASR represents a
|
|
|
significant leap forward in medical speech-to-text performance relative to other
|
|
|
comparably sized open model alternatives.
|
|
|
|