---
datasets:
- openslr/librispeech_asr
- facebook/multilingual_librispeech
language:
- en
- fr
- de
- pt
- es
metrics:
- wer
base_model:
- openai/whisper-large-v2
- openai/whisper-small
- openai/whisper-base
pipeline_tag: automatic-speech-recognition
tags:
- streaming
- asr
- Transformer
- encoder-decoder
- pytorch
- audio
- speech
- Whisper
model-index:
- name: CarelessWhisper-large-v2
  results:
  - task:
      type: streaming-transcription-chunk-300msec
    dataset:
      name: test-clean
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 5.29
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 6
  - task:
      type: streaming-transcription-chunk-300msec
    dataset:
      name: test-other
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 10.74
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 11.38
  - task:
      type: streaming-transcription-chunk-200msec
    dataset:
      name: test-clean
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 5.92
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 6.63
  - task:
      type: streaming-transcription-chunk-200msec
    dataset:
      name: test-other
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 11.41
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 12.6
  - task:
      type: streaming-transcription-chunk-100msec
    dataset:
      name: test-clean
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 6.33
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 7.76
  - task:
      type: streaming-transcription-chunk-100msec
    dataset:
      name: test-other
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 13.06
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 14.99
  - task:
      type: streaming-transcription-chunk-40msec
    dataset:
      name: test-clean
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 7.76
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 9.94
  - task:
      type: streaming-transcription-chunk-40msec
    dataset:
      name: test-other
      type: LibriSpeech
    metrics:
    - name: Word Error Rate (WER) [%]
      type: Word Error Rate (WER) [%]
      value: 16.73
    - name: Aligned-Relative Word Error Rate (ARWER) [%]
      type: Aligned-Relative Word Error Rate (ARWER) [%]
      value: 19.28
---

# CarelessWhisper - Causal Whisper Streaming Model

Causal Whisper Streaming is a fine-tuned version of OpenAI Whisper that can handle causal data and perform real-time transcription.

[![arXiv](https://img.shields.io/badge/arXiv-2508.12301-b31b1b.svg)](https://arxiv.org/abs/2508.12301)
[![Demo on Hugging Face](https://img.shields.io/badge/🤗%20Demo-Hugging%20Face-blueviolet?logo=huggingface&logoColor=white)](https://huggingface.co/spaces/MLSpeech/CarelessWhisper-causal-streaming)

## 📄 Paper

For more details, see our [paper](https://arxiv.org/abs/2508.12301).

## 🔧 Setup

We used Python 3.9.16, PyTorch 2.6.0, and PyTorch-Lightning 2.5.0 to train and test our models.
Portions of this code are adapted from [OpenAI's Whisper](https://github.com/openai/whisper).

To set up the project environment using `conda`, follow these steps:

1. **Clone the repository**

   ```bash
   git clone https://github.com/tomer9080/CarelessWhisper-streaming
   cd CarelessWhisper-streaming
   ```

   > 💡 Make sure you have [Miniconda](https://docs.conda.io/en/latest/miniconda.html) or [Anaconda](https://www.anaconda.com/products/distribution) installed before proceeding.

2. **Create the conda environment**

   ```bash
   conda env create -f environment.yml
   ```

3. **Activate the environment**

   ```bash
   conda activate careless_whisper
   ```

4. **Install the appropriate PyTorch version**

   Depending on your hardware and CUDA version, install PyTorch by following the instructions at [https://pytorch.org/get-started/locally](https://pytorch.org/get-started/locally). This project was tested with CUDA 12.4, but it should also work with compatible earlier or later versions.
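Once the environment is active, you can optionally verify that it matches the tested configuration. The snippet below is a convenience check, not part of the repository; it only prints interpreter and package versions and confirms CUDA visibility:

```python
# Optional sanity check (convenience snippet, not part of the repository):
# prints interpreter/package versions and whether a CUDA device is visible.
import sys

import torch
import pytorch_lightning as pl

print(f"Python:            {sys.version.split()[0]}")     # tested with 3.9.16
print(f"PyTorch:           {torch.__version__}")          # tested with 2.6.0
print(f"PyTorch-Lightning: {pl.__version__}")             # tested with 2.5.0
print(f"CUDA available:    {torch.cuda.is_available()}")  # tested with CUDA 12.4
```

If `CUDA available` prints `False` on a GPU machine, reinstall PyTorch with the matching CUDA build as described in step 4.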
After installing all of the dependencies, you can try to run inference.

## 🤖 Available Models

We fine-tuned three different sizes of Whisper, all of which support English-only transcription. A `large-v2` model that was fine-tuned on multilingual data is also available; it supports English, French, Spanish, German, and Portuguese with a chunk size of 300 milliseconds.

| Size     | Chunk Size [msec]       | Multilingual Chunk Size [msec] |
|:--------:|:-----------------------:|:------------------------------:|
| base     | 40, 100, 200, 300       | N/A                            |
| small    | 40, 100, 200, 300, 1000 | N/A                            |
| large-v2 | 40, 100, 200, 300, 1000 | 300                            |

## 🎤 Running Inference

To run inference, download the repository content and run the commands below from the repository root, as described in the following sections.

> **Note:** The models are hosted on the [Hugging Face Hub](https://huggingface.co/), which requires an access token.
> Make sure you are logged in with your token to access the models.

### How to Apply Your Hugging Face 🤗 Access Token

1. **Create a Hugging Face account** (if you don't have one) at [https://huggingface.co/join](https://huggingface.co/join).

2. **Generate an access token:**
   - Go to your Hugging Face account settings: [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
   - Click on **"New token"**, give it a name, select the appropriate scopes (usually `read` is enough), and create it.

3. **Login using the Hugging Face CLI:**

   Install the CLI if you don't have it:

   ```bash
   pip install huggingface_hub
   ```

   Then login:

   ```bash
   huggingface-cli login
   ```

   Paste your token when prompted.

### 🖥️ CLI Usage

Streaming transcription from a local microphone is launched with the following command:

```bash
# Using a local microphone for streaming transcription, dumping the recording to out.wav
python transcribe.py \
  --output_filename out.wav \
  --channels 2 \
  --model small \
  --chunk_size 300 \
  --device cuda \
  --beam_size 5 \
  --ca_kv_cache
```

A simulation of a stream over a WAV file is also available:

```bash
# Simulating a stream on a wav file
python transcribe.py \
  --model small \
  --chunk_size 300 \
  --device cuda \
  --beam_size 5 \
  --ca_kv_cache \
  --wav_file /path/to/audio.wav \
  --simulate_stream \
  --use_latency
```

### 🐍 Python Usage

If you prefer using Python, a code snippet utilizing a microphone or a WAV file is provided below:

```python
import torch

import careless_whisper_stream

model_size = "small"  # model size
chunk_size = 300      # chunk size in milliseconds
multilingual = False  # currently only large-v2_300msec supports languages other than English
device = "cuda" if torch.cuda.is_available() else "cpu"

model = careless_whisper_stream.load_streaming_model(name=model_size,
                                                     gran=chunk_size,
                                                     multilingual=multilingual,
                                                     device=device)

# Using a local microphone recording
texts_microphone = model.transcribe(output_filename="/path/to/dump/file.wav",
                                    channels=2,
                                    beam_size=5,
                                    ca_kv_cache=True)

# Simulating a stream on a wav file
texts_wav_simulation = model.transcribe(simulate_stream=True,
                                        wav_file="/path/to/file/you/want/to/transcribe.wav",
                                        beam_size=5,
                                        ca_kv_cache=True)
```

## 🦾 Training

To train using LoRA, you can use our existing code. Make sure all the requirements are installed.

### 📂 Dataset Structure

Before starting model training using the command-line interface provided below, you must first configure your dataset dictionary file located at `training_code/ds_dict.py`. This file defines a Python dictionary named `ds_paths`, where you should specify paths to the `train`, `val`, and `test` partitions of your dataset.

Each partition should be a CSV file with the following three columns:

1. `wav_path`: path to the WAV audio file.
2. `tg_path`: path to the corresponding `.TextGrid` file containing forced alignment.
3. `raw_text`: ground truth transcription.

> **Note:** The dictionary key (i.e., the name of the dataset) will be used by the training script to identify and load the dataset correctly. You can find an example entry in `training_code/ds_dict.py`; an illustrative sketch is also shown below.
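For illustration only, such an entry might look like the following sketch. The dictionary name `ds_paths` and the partition keys follow the description above, but the nested layout and all paths are placeholder assumptions; defer to the example entry shipped in `training_code/ds_dict.py` for the exact structure.

```python
# Illustrative sketch of a ds_paths entry in training_code/ds_dict.py.
# The nested layout and paths are placeholder assumptions; follow the
# example entry shipped with the repository for the exact structure.
ds_paths = {
    # Key = dataset name, passed to train.py via --dataset.
    "LIBRI-960-ALIGNED": {
        "train": "/data/librispeech/train.csv",  # CSV columns: wav_path, tg_path, raw_text
        "val": "/data/librispeech/val.csv",
        "test": "/data/librispeech/test.csv",
    },
}
```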
device = "cuda" if torch.cuda.is_available() else "cpu" model = careless_whisper_stream.load_streaming_model(name=model_size, gran=chunk_size, multilingual=multilingual, device=device) # using a local microphone recording texts_microphone = model.transcribe(output_filename="/path/to/dump/file.wav", channels=2, beam_size=5, ca_kv_cache=True) # Simulating on a wav file texts_wav_simulation = model.transcribe(simulate_stream=True, wav_file="/path/to/file/you/want/to/transcribe.wav", beam_size=5, ca_kv_cache=True) ``` ## 🦾 Training In order to train using LoRA, you can use our existing code. Make sure all the requirements are installed. ### πŸ“‚ Dataset Structure Before starting model training using the command-line interface provided below, you must first configure your dataset dictionary file located at `training_code/ds_dict.py`. This file defines a Python dictionary named `ds_paths`, where you should specify paths to the `train`, `val`, and `test` partitions of your dataset. Each partition should be a CSV file with the following three columns: 1. `wav_path` β€” Path to the WAV audio file. 2. `tg_path` β€” Path to the corresponding `.TextGrid` file containing forced alignment. 3. `raw_text` β€” Ground truth transcription. > **Note:** The dictionary key (i.e., the name of the dataset) will be used by the training script to identify and load the dataset correctly. You can find an example entry in `training_code/ds_dict.py`. ### πŸ–₯️ CLI Interface ```bash python training_code/train.py \ --lora \ --streaming_train \ --simulate_stream \ --dataset LIBRI-960-ALIGNED \ --name example_training_base_model \ --size base \ --batch_size 32 \ --epochs 10 \ --learning_rate 1e-5 \ --rank 32 \ --gran 15 \ --extra_gran_blocks 1 \ --streaming_fraction 0.25 \ --top_k 5 \ ``` For more options and training configurations, run: ```bash python training_code/train.py --help ``` ## πŸ“œ License This repository uses a dual license: [![MIT License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT) Portions derived from [OpenAI Whisper](https://github.com/openai/whisper) are licensed under the **MIT License**. [![CC BY-NC 4.0 License](https://img.shields.io/badge/License-CC--BY--NC%204.0-blue.svg)](https://creativecommons.org/licenses/by-nc/4.0/) All other original code in this repository is licensed under the **Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0)**. See the [LICENSE](./LICENSE) file for full details.