| | --- |
| | license: apache-2.0 |
| | pipeline_tag: automatic-speech-recognition |
| | tags: |
| | - pytorch |
| | - audio |
| | - speech |
| | - automatic-speech-recognition |
| | - whisper |
| | - wav2vec2 |
| |
|
| | model-index: |
| | - name: whisper_large_v2_fp16_transformers |
| | results: |
| | - task: |
| | type: automatic-speech-recognition |
| | name: Automatic Speech Recognition |
| | dataset: |
| | type: librispeech_asr |
| | name: LibriSpeech (clean) |
| | config: clean |
| | split: test |
| | args: |
| | language: en |
| | metrics: |
| | - type: wer |
| | value: 0 |
| | name: Test WER |
| | description: Word Error Rate |
| | - type: mer |
| | value: 0 |
| | name: Test MER |
| | description: Match Error Rate |
| | - type: wil |
| | value: 0 |
| | name: Test WIL |
| | description: Word Information Lost |
| | - type: wip |
| | value: 0 |
| | name: Test WIP |
| | description: Word Information Preserved |
| | - type: cer |
| | value: 0 |
| | name: Test CER |
| | description: Character Error Rate |
| |
|
| | - task: |
| | type: automatic-speech-recognition |
| | name: Automatic Speech Recognition |
| | dataset: |
| | type: librispeech_asr |
| | name: LibriSpeech (other) |
| | config: other |
| | split: test |
| | args: |
| | language: en |
| | metrics: |
| | - type: wer |
| | value: 0 |
| | name: Test WER |
| | description: Word Error Rate |
| | - type: mer |
| | value: 0 |
| | name: Test MER |
| | description: Match Error Rate |
| | - type: wil |
| | value: 0 |
| | name: Test WIL |
| | description: Word Information Lost |
| | - type: wip |
| | value: 0 |
| | name: Test WIP |
| | description: Word Information Preserved |
| | - type: cer |
| | value: 0 |
| | name: Test CER |
| | description: Character Error Rate |
| |
|
| | - task: |
| | type: automatic-speech-recognition |
| | name: Automatic Speech Recognition |
| | dataset: |
| | type: mozilla-foundation/common_voice_14_0 |
| | name: Common Voice (14.0) (Hindi) |
| | config: hi |
| | split: test |
| | args: |
| | language: hi |
| | metrics: |
| | - type: wer |
| | value: 44.64 |
| | name: Test WER |
| | description: Word Error Rate |
| | - type: mer |
| | value: 41.69 |
| | name: Test MER |
| | description: Match Error Rate |
| | - type: wil |
| | value: 59.53 |
| | name: Test WIL |
| | description: Word Information Lost |
| | - type: wip |
| | value: 40.46 |
| | name: Test WIP |
| | description: Word Information Preserved |
| | - type: cer |
| | value: 16.80 |
| | name: Test CER |
| | description: Character Error Rate |
| |
|
| | widget: |
| | - example_title: Hinglish Sample |
| | src: https://huggingface.co/devasheeshG/whisper_large_v2_fp16_transformers/resolve/main/test.wav |
| | - example_title: Librispeech sample 1 |
| | src: https://cdn-media.huggingface.co/speech_samples/sample1.flac |
| | - example_title: Librispeech sample 2 |
| | src: https://cdn-media.huggingface.co/speech_samples/sample2.flac |
| |
|
| | language: |
| | - en |
| | - zh |
| | - de |
| | - es |
| | - ru |
| | - ko |
| | - fr |
| | - ja |
| | - pt |
| | - tr |
| | - pl |
| | - ca |
| | - nl |
| | - ar |
| | - sv |
| | - it |
| | - id |
| | - hi |
| | - fi |
| | - vi |
| | - he |
| | - uk |
| | - el |
| | - ms |
| | - cs |
| | - ro |
| | - da |
| | - hu |
| | - ta |
| | - "no" |
| | - th |
| | - ur |
| | - hr |
| | - bg |
| | - lt |
| | - la |
| | - mi |
| | - ml |
| | - cy |
| | - sk |
| | - te |
| | - fa |
| | - lv |
| | - bn |
| | - sr |
| | - az |
| | - sl |
| | - kn |
| | - et |
| | - mk |
| | - br |
| | - eu |
| | - is |
| | - hy |
| | - ne |
| | - mn |
| | - bs |
| | - kk |
| | - sq |
| | - sw |
| | - gl |
| | - mr |
| | - pa |
| | - si |
| | - km |
| | - sn |
| | - yo |
| | - so |
| | - af |
| | - oc |
| | - ka |
| | - be |
| | - tg |
| | - sd |
| | - gu |
| | - am |
| | - yi |
| | - lo |
| | - uz |
| | - fo |
| | - ht |
| | - ps |
| | - tk |
| | - nn |
| | - mt |
| | - sa |
| | - lb |
| | - my |
| | - bo |
| | - tl |
| | - mg |
| | - as |
| | - tt |
| | - haw |
| | - ln |
| | - ha |
| | - ba |
| | - jw |
| | - su |
| | --- |
| | ## Versions: |
| |
|
| | - CUDA: 12.1 |
| | - cuDNN Version: 8.9.2.26_1.0-1_amd64 |
| |
|
| | * tensorflow Version: 2.12.0 |
| | * torch Version: 2.1.0.dev20230606+cu12135 |
| | * transformers Version: 4.30.2 |
| | * accelerate Version: 0.20.3 |
| |
|
| | ## Model Benchmarks: |
| |
|
| | - RAM: 3 GB (Original_Model: 6 GB) |
| | - VRAM: 3.7 GB (Original_Model: 11 GB) |
| | - test.wav: 23 s of multilingual speech (English + Hindi) |
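The RAM figures above are roughly what the raw weights alone would occupy. The following is a back-of-the-envelope sketch, not a measurement: it assumes about 1.55 billion parameters (the count OpenAI publishes for the large Whisper models), and `weight_memory_gb` is a hypothetical helper name. VRAM use is higher than the weight size because activations and the CUDA context also take space.

```python
# Approximate size of the raw weights for Whisper large-v2.
# Assumption: ~1.55 billion parameters (published figure for the large models).
PARAM_COUNT = 1_550_000_000

# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(param_count: int, dtype: str) -> float:
    """Return the size of the raw weights in GB (10**9 bytes)."""
    return param_count * BYTES_PER_PARAM[dtype] / 1e9


fp32_gb = weight_memory_gb(PARAM_COUNT, "float32")
fp16_gb = weight_memory_gb(PARAM_COUNT, "float16")
print(f"float32 weights: {fp32_gb:.1f} GB")
print(f"float16 weights: {fp16_gb:.1f} GB")
```

Halving the bytes per parameter halves the weight footprint, which is consistent with the 6 GB versus 3 GB RAM figures listed above.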
| |
|
| | - **Processing time in seconds for test.wav on each device** |
| |
|
| | | Device Name | float32 (Original) | float16 | CUDA cores | Tensor cores | |
| | | ----------------- | ------------------ | ------- | ---------- | ------------ | |
| | | RTX 3060 | 2.2 | 1.3 | 3,584 | 112 | |
| | | GTX 1660 Super | OOM | 6 | 1,408 | N/A | |
| | | Colab (Tesla T4) | - | - | 2,560 | 320 | |
| | | Colab (CPU) | - | N/A | N/A | N/A | |
| | | M1 (CPU) | - | - | N/A | N/A | |
| | | M1 (GPU -> 'mps') | - | - | N/A | N/A | |
| |
|
| |
|
| | - **NOTE: Tensor cores accelerate mixed-precision calculations** |
| | - **CPU: `torch.float16` inference is not supported on CPU (AMD Ryzen 5 3600 or Colab CPU)** |
| | - Punctuation: sometimes incorrect (the exact cause is not yet known) |
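Because half precision fails on CPU, a caller may want to pick the precision from the device string before loading anything. This is a minimal sketch of one possible policy, not code from this repo: `pick_dtype_name` is a hypothetical helper, and falling back to float32 for every non-CUDA device (including `'mps'`, which is unmeasured in the table above) is an assumption made to stay on the safe side.

```python
def pick_dtype_name(device: str) -> str:
    """Return the name of the dtype to run inference in for a device string."""
    # Assumed policy: half precision on CUDA devices only.
    if device.startswith("cuda"):
        return "float16"
    # CPU (and, conservatively, any other backend) stays in full precision.
    return "float32"


for device in ("cuda", "cuda:0", "cpu", "mps"):
    print(device, "->", pick_dtype_name(device))
```

With PyTorch installed, the returned name can be turned into a dtype object with `getattr(torch, pick_dtype_name(device))`.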
| |
|
| | ## Model Error Benchmarks: |
| |
|
| | - **WER: Word Error Rate** |
| | - **MER: Match Error Rate** |
| | - **WIL: Word Information Lost** |
| | - **WIP: Word Information Preserved** |
| | - **CER: Character Error Rate** |
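As a rough illustration of how the first and last of these metrics are defined, here is a minimal pure-Python sketch of WER and CER built on Levenshtein distance. It is not the code used for the numbers in this card (those come from `jiwer`), and the function names are illustrative. MER, WIL, and WIP additionally need the number of correctly matched words from the alignment, which this sketch does not compute; WIL is simply 1 - WIP, which matches the tables below, where the two columns sum to about 100.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
# One substituted character over 5 reference characters.
print(cer("hello", "hallo"))
```

Both functions return fractions, so multiply by 100 to compare with the percentages reported in this card. An empty reference raises `ZeroDivisionError` in this sketch.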
| |
|
| | ### Hindi to Hindi (test.tsv) [Common Voice 14.0](https://commonvoice.mozilla.org/en/datasets) |
| |
|
| | **Tested on an RTX 3060 with 1000 samples** |
| |
|
| | | | WER | MER | WIL | WIP | CER | |
| | | ----------------------- | ----- | ----- | ----- | ----- | ----- | |
| | | Original_Model (30 min) | 43.99 | 41.65 | 59.47 | 40.52 | 16.23 | |
| | | This_Model (20 min) | 44.64 | 41.69 | 59.53 | 40.46 | 16.80 | |
| |
|
| | ### Hindi to English (test.csv) [Custom Dataset](https://huggingface.co/datasets/devasheeshG/common_voices_14_0_hi2en_hi2hi) |
| |
|
| | **Tested on an RTX 3060 with 1000 samples** |
| |
|
| | | | WER | MER | WIL | WIP | CER | |
| | | ----------------------- | --- | --- | --- | --- | --- | |
| | | Original_Model (30 min) | - | - | - | - | - | |
| | | This_Model (20 min) | - | - | - | - | - | |
| |
|
| | ### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-clean) |
| |
|
| | **Tested on an RTX 3060 with \_\_\_ samples** |
| |
|
| | | | WER | MER | WIL | WIP | CER | |
| | | -------------- | --- | --- | --- | --- | --- | |
| | | Original_Model | - | - | - | - | - | |
| | | This_Model | - | - | - | - | - | |
| |
|
| | ### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-other) |
| |
|
| | **Tested on an RTX 3060 with \_\_\_ samples** |
| |
|
| | | | WER | MER | WIL | WIP | CER | |
| | | -------------- | --- | --- | --- | --- | --- | |
| | | Original_Model | - | - | - | - | - | |
| | | This_Model | - | - | - | - | - | |
| |
|
| | - **All metrics are computed with the `jiwer` library** |
| |
|
| | ## Code for conversion: |
| |
|
| | - ### [Will be uploaded to GitHub soon](https://github.com/devasheeshG) |
| |
|
| | ## Usage |
| |
|
| | This repo contains an `__init__.py` file with all the code needed to use this model. |
| |
|
| | First, clone this repo and keep all the files in one folder. |
| |
|
| | ### Make sure you have git-lfs installed (https://git-lfs.com) |
| |
|
| | ```bash |
| | git lfs install |
| | git clone https://huggingface.co/devasheeshG/whisper_large_v2_fp16_transformers |
| | ``` |
| |
|
| | **Please try this in a Jupyter notebook** |
| |
|
| | ```python |
| | # Import the Model |
| | from whisper_large_v2_fp16_transformers import Model, load_audio, pad_or_trim |
| | ``` |
| |
|
| | ```python |
| | # Initialise the model |
| | model = Model( |
| |     model_name_or_path='whisper_large_v2_fp16_transformers',  # folder cloned above |
| |     cuda_visible_device="0", |
| |     device='cuda', |
| | ) |
| | ``` |
| |
|
| | ```python |
| | # Load Audio |
| | audio = load_audio('whisper_large_v2_fp16_transformers/test.wav') |
| | audio = pad_or_trim(audio) |
| | ``` |
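Whisper models consume fixed 30-second windows of 16 kHz audio, so a helper like `pad_or_trim` is needed before transcription. The sketch below shows the idea on plain Python lists. It assumes this repo's `pad_or_trim` behaves like the helper of the same name in the openai/whisper package; the repo's actual implementation is not shown here, and `pad_or_trim_sketch` is a hypothetical stand-in.

```python
SAMPLE_RATE = 16_000                     # Hz, Whisper's input sample rate
CHUNK_SECONDS = 30                       # Whisper's fixed window length
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per window


def pad_or_trim_sketch(samples: list, length: int = N_SAMPLES) -> list:
    """Zero-pad or cut a list of samples to exactly `length` entries."""
    if len(samples) >= length:
        return samples[:length]          # longer clips are truncated
    return samples + [0.0] * (length - len(samples))  # shorter clips get silence


clip = [0.1] * (23 * SAMPLE_RATE)        # stand-in for the 23 s test.wav
padded = pad_or_trim_sketch(clip)
print(len(clip), len(padded))
```

Under this assumption, the 23 s `test.wav` is padded with 7 s of silence, and any audio beyond the first 30 s would be dropped, so longer recordings would need to be split into chunks first.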
| |
|
| | ```python |
| | # Transcribe (First transcription takes time) |
| | model.transcribe(audio) |
| | ``` |
| |
|
| | ## Credits |
| |
|
| | This is an fp16 version of ``openai/whisper-large-v2``. |
| |
|