devasheeshG/whisper_large_v2_fp16_transformers

+---
+license: apache-2.0
+pipeline_tag: automatic-speech-recognition
+tags:
+  - pytorch
+  - audio
+  - speech
+  - automatic-speech-recognition
+  - whisper
+  - wav2vec2
+model-index:
+  - name: whisper_large_v2_fp16_transformers
+    results:
+      - task:
+          type: automatic-speech-recognition
+          name: Automatic Speech Recognition
+        dataset:
+          type: librispeech_asr
+          name: LibriSpeech (clean)
+          config: clean
+          split: test
+          args:
+            language: en
+        metrics:
+          - type: wer
+            value: 0
+            name: Test WER
+            description: Word Error Rate
+          - type: mer
+            value: 0
+            name: Test MER
+            description: Match Error Rate
+          - type: wil
+            value: 0
+            name: Test WIL
+            description: Word Information Lost
+          - type: wip
+            value: 0
+            name: Test WIP
+            description: Word Information Preserved
+          - type: cer
+            value: 0
+            name: Test CER
+            description: Character Error Rate
+      - task:
+          type: automatic-speech-recognition
+          name: Automatic Speech Recognition
+        dataset:
+          type: librispeech_asr
+          name: LibriSpeech (other)
+          config: other
+          split: test
+          args:
+            language: en
+        metrics:
+          - type: wer
+            value: 0
+            name: Test WER
+            description: Word Error Rate
+          - type: mer
+            value: 0
+            name: Test MER
+            description: Match Error Rate
+          - type: wil
+            value: 0
+            name: Test WIL
+            description: Word Information Lost
+          - type: wip
+            value: 0
+            name: Test WIP
+            description: Word Information Preserved
+          - type: cer
+            value: 0
+            name: Test CER
+            description: Character Error Rate
+      - task:
+          type: automatic-speech-recognition
+          name: Automatic Speech Recognition
+        dataset:
+          type: mozilla-foundation/common_voice_14_0
+          name: Common Voice (14.0) (Hindi)
+          config: hi
+          split: test
+          args:
+            language: hi
+        metrics:
+          - type: wer
+            value: 44.64
+            name: Test WER
+            description: Word Error Rate
+          - type: mer
+            value: 41.69
+            name: Test MER
+            description: Match Error Rate
+          - type: wil
+            value: 59.53
+            name: Test WIL
+            description: Word Information Lost
+          - type: wip
+            value: 40.46
+            name: Test WIP
+            description: Word Information Preserved
+          - type: cer
+            value: 16.80
+            name: Test CER
+            description: Character Error Rate
+widget:
+  - example_title: Hinglish Sample
+    src: https://huggingface.co/devasheeshG/whisper_large_v2_fp16_transformers/resolve/main/test.wav
+  - example_title: Librispeech sample 1
+    src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
+  - example_title: Librispeech sample 2
+    src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
+language:
+  - en
+  - zh
+  - de
+  - es
+  - ru
+  - ko
+  - fr
+  - ja
+  - pt
+  - tr
+  - pl
+  - ca
+  - nl
+  - ar
+  - sv
+  - it
+  - id
+  - hi
+  - fi
+  - vi
+  - he
+  - uk
+  - el
+  - ms
+  - cs
+  - ro
+  - da
+  - hu
+  - ta
+  - "no"
+  - th
+  - ur
+  - hr
+  - bg
+  - lt
+  - la
+  - mi
+  - ml
+  - cy
+  - sk
+  - te
+  - fa
+  - lv
+  - bn
+  - sr
+  - az
+  - sl
+  - kn
+  - et
+  - mk
+  - br
+  - eu
+  - is
+  - hy
+  - ne
+  - mn
+  - bs
+  - kk
+  - sq
+  - sw
+  - gl
+  - mr
+  - pa
+  - si
+  - km
+  - sn
+  - yo
+  - so
+  - af
+  - oc
+  - ka
+  - be
+  - tg
+  - sd
+  - gu
+  - am
+  - yi
+  - lo
+  - uz
+  - fo
+  - ht
+  - ps
+  - tk
+  - nn
+  - mt
+  - sa
+  - lb
+  - my
+  - bo
+  - tl
+  - mg
+  - as
+  - tt
+  - haw
+  - ln
+  - ha
+  - ba
+  - jw
+  - su
+---
+## Versions:
+- CUDA: 12.1
+- cuDNN Version: 8.9.2.26_1.0-1_amd64
+- tensorflow Version: 2.12.0
+- torch Version: 2.1.0.dev20230606+cu12135
+- transformers Version: 4.30.2
+- accelerate Version: 0.20.3
+## Model Benchmarks:
+- RAM: 3 GB (Original_Model: 6 GB)
+- VRAM: 3.7 GB (Original_Model: 11 GB)
+- test.wav: 23 s (multilingual speech, i.e. English + Hindi)
+  - **Processing time in seconds on each device**
+  | Device Name       | float32 (Original) | float16 | CUDA Cores | Tensor Cores |
+  | ----------------- | ------------------ | ------- | ---------- | ------------ |
+  | RTX 3060          | 2.2                | 1.3     | 3,584      | 112          |
+  | GTX 1660 Super    | OOM                | 6       | 1,408      | N/A          |
+  | Colab (Tesla T4)  | -                  | -       | 2,560      | 320          |
+  | Colab (CPU)       | -                  | N/A     | N/A        | N/A          |
+  | M1 (CPU)          | -                  | -       | N/A        | N/A          |
+  | M1 (GPU -> 'mps') | -                  | -       | N/A        | N/A          |
+  - **NOTE: Tensor Cores speed up mixed-precision calculations**
+  - **CPU: torch.float16 is not supported on CPU (AMD Ryzen 5 3600 or Colab CPU)**
+- Punctuation: sometimes incorrect or missing in the output (the cause is not yet known)
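Because `torch.float16` did not work on CPU in the benchmarks above, a caller may want to choose the precision from the target device. The following is a minimal sketch, not part of this repo: `pick_dtype_name` is a hypothetical helper, and the rule it applies (half precision only on `cuda`, float32 everywhere else, including the untested `mps` back end) is an assumption drawn from the table.

```python
def pick_dtype_name(device: str) -> str:
    """Return the name of the torch dtype to load the model with.

    Assumption: half precision is only used on CUDA devices; everything else
    (CPU, and untested back ends such as 'mps') falls back to float32.
    """
    # Accept both 'cuda' and indexed forms such as 'cuda:0'.
    device_type = device.split(":", 1)[0].strip().lower()
    return "float16" if device_type == "cuda" else "float32"


for name in ("cuda", "cuda:0", "cpu", "mps"):
    print(name, "->", pick_dtype_name(name))
```

The returned name can be turned into a dtype with `getattr(torch, pick_dtype_name(device))` and passed as `torch_dtype` when loading the model.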
+## Model Error Benchmarks:
+- **WER: Word Error Rate**
+- **MER: Match Error Rate**
+- **WIL: Word Information Lost**
+- **WIP: Word Information Preserved**
+- **CER: Character Error Rate**
+### Hindi to Hindi (test.tsv) [Common Voice 14.0](https://commonvoice.mozilla.org/en/datasets)
+**Test done on RTX 3060 on 1000 Samples**
+|                         | WER   | MER   | WIL   | WIP   | CER   |
+| ----------------------- | ----- | ----- | ----- | ----- | ----- |
+| Original_Model (30 min) | 43.99 | 41.65 | 59.47 | 40.52 | 16.23 |
+| This_Model (20 min)     | 44.64 | 41.69 | 59.53 | 40.46 | 16.80 |
+### Hindi to English (test.csv) [Custom Dataset](https://huggingface.co/datasets/devasheeshG/common_voices_14_0_hi2en_hi2hi)
+**Test done on RTX 3060 on 1000 Samples**
+|                         | WER | MER | WIL | WIP | CER |
+| ----------------------- | --- | --- | --- | --- | --- |
+| Original_Model (30 min) | -   | -   | -   | -   | -   |
+| This_Model (20 min)     | -   | -   | -   | -   | -   |
+### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-clean)
+**Test done on RTX 3060 on \_\_\_ Samples**
+|                | WER | MER | WIL | WIP | CER |
+| -------------- | --- | --- | --- | --- | --- |
+| Original_Model | -   | -   | -   | -   | -   |
+| This_Model     | -   | -   | -   | -   | -   |
+### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-other)
+**Test done on RTX 3060 on \_\_\_ Samples**
+|                | WER | MER | WIL | WIP | CER |
+| -------------- | --- | --- | --- | --- | --- |
+| Original_Model | -   | -   | -   | -   | -   |
+| This_Model     | -   | -   | -   | -   | -   |
+- **The `jiwer` library is used for these calculations**
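To make the five metrics concrete, the sketch below derives them from the hit, substitution, deletion, and insertion counts of a minimum-edit alignment. It is a plain-Python illustration of the definitions, not the `jiwer` implementation: `align_counts` and `error_metrics` are hypothetical names, no text normalisation is applied, the reference is assumed to be non-empty, and scores are computed for a single sentence pair rather than pooled over a test set.

```python
def align_counts(ref, hyp):
    """Return (hits, substitutions, deletions, insertions) for a minimum-edit alignment."""
    n, m = len(ref), len(hyp)
    # cost[i][j] is the edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Walk back from the bottom-right corner, counting each kind of operation
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (not same):
            hits, subs = hits + same, subs + (not same)
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels, i = dels + 1, i - 1
        else:
            ins, j = ins + 1, j - 1
    return hits, subs, dels, ins


def error_metrics(reference: str, hypothesis: str) -> dict:
    """Compute WER, MER, WIL, WIP and CER as fractions (multiply by 100 for percent)."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    h, s, d, i = align_counts(ref_words, hyp_words)
    wip = (h / len(ref_words)) * (h / len(hyp_words)) if hyp_words else 0.0
    _, cs, cd, ci = align_counts(list(reference), list(hypothesis))
    return {
        "wer": (s + d + i) / len(ref_words),
        "mer": (s + d + i) / (h + s + d + i),
        "wil": 1.0 - wip,
        "wip": wip,
        "cer": (cs + cd + ci) / len(reference),
    }


scores = error_metrics("the cat sat on the mat", "the cat sit on mat")
print({name: round(value, 4) for name, value in scores.items()})
```

WIL and WIP are complements by definition, which is why each WIL/WIP pair in the tables above sums to roughly 100.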
+## Code for conversion:
+- ### [Will be uploaded to GitHub soon](https://github.com/devasheeshG)
+## Usage
+This repo includes an `__init__.py` file that contains all the code needed to use this model.
+First, clone this repo and keep all of its files in a single folder.
+### Make sure you have git-lfs installed (https://git-lfs.com)
+```bash
+git lfs install
+git clone https://huggingface.co/devasheeshG/whisper_large_v2_fp16_transformers
+```
+**Please try the following in a Jupyter notebook**
+```python
+# Import the Model
+from whisper_large_v2_fp16_transformers import Model, load_audio, pad_or_trim
+```
+```python
+# Initialize the model
+model = Model(
+    model_name_or_path='whisper_large_v2_fp16_transformers',
+    cuda_visible_device="0",
+    device='cuda',
+)
+```
+```python
+# Load Audio
+audio = load_audio('whisper_large_v2_fp16_transformers/test.wav')
+audio = pad_or_trim(audio)
+```
+```python
+# Transcribe (the first transcription takes longer than later ones)
+model.transcribe(audio)
+```
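Note that `pad_or_trim` cuts the input to a single 30-second window, so the call above transcribes at most the first 30 seconds of a file. A minimal sketch for longer recordings follows. `chunk_bounds` is a hypothetical helper, not part of this repo, and it splits at fixed offsets without regard to word boundaries, so words that straddle a cut may be transcribed poorly.

```python
SAMPLE_RATE = 16000
CHUNK_LENGTH = 30  # seconds, the same window that pad_or_trim uses
N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE


def chunk_bounds(total_samples: int, chunk_samples: int = N_SAMPLES):
    """Return (start, end) sample indices covering the audio in fixed-size windows."""
    if total_samples < 0 or chunk_samples <= 0:
        raise ValueError("total_samples must be >= 0 and chunk_samples > 0")
    return [
        (start, min(start + chunk_samples, total_samples))
        for start in range(0, total_samples, chunk_samples)
    ]


# A 70-second recording yields two full windows and a 10-second remainder.
print(chunk_bounds(70 * SAMPLE_RATE))
```

Each slice `audio[start:end]` can then be passed through `pad_or_trim` and `model.transcribe`, and the resulting strings joined with spaces.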
+## Credits
+This is an fp16 version of ``openai/whisper-large-v2``.
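The RAM figures in the benchmarks are roughly what the weights alone would predict. The check below is a back-of-the-envelope sketch: it assumes about 1.55 billion parameters (the size OpenAI lists for the large Whisper models) and ignores activations, buffers, and framework overhead. `weight_gb` is a hypothetical helper.

```python
PARAMS = 1.55e9  # assumed parameter count of the large model
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_gb(dtype: str, params: float = PARAMS) -> float:
    """Approximate size of the weights in gigabytes (10**9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float32", "float16"):
    print(f"{dtype}: {weight_gb(dtype):.1f} GB")
```

Halving the bytes per parameter halves the weight footprint, which is in line with the 6 GB and 3 GB RAM figures reported above.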

__init__.py ADDED

	@@ -0,0 +1,125 @@

+from transformers import (
+    WhisperForConditionalGeneration,
+    WhisperProcessor,
+    WhisperConfig,
+)
+import torch
+import ffmpeg
+import torch.nn.functional as F
+import numpy as np
+import os
+# load_audio and pad_or_trim functions
+SAMPLE_RATE = 16000
+CHUNK_LENGTH = 30  # 30-second chunks
+N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE  # 480000 samples in a 30-second chunk
+# audio = whisper.load_audio('test.wav')
+def load_audio(file: str, sr: int = SAMPLE_RATE, start_time: int = 0, dtype=np.float16):
+    """
+    Load an audio file into a numpy array at the specified sampling rate.
+    """
+    try:
+        # This launches a subprocess to decode audio while down-mixing and resampling as necessary.
+        # Requires the ffmpeg CLI and `ffmpeg-python` package to be installed.
+        out, _ = (
+            ffmpeg.input(file, ss=start_time, threads=0)
+            .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)
+            .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)
+        )
+    except ffmpeg.Error as e:
+        raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e
+    # return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0
+    return np.frombuffer(out, np.int16).flatten().astype(dtype) / 32768.0
+# audio = whisper.pad_or_trim(audio)
+def pad_or_trim(array, length: int = N_SAMPLES, *, axis: int = -1):
+    """
+    Pad or trim the audio array to N_SAMPLES, as expected by the encoder.
+    """
+    if torch.is_tensor(array):
+        if array.shape[axis] > length:
+            array = array.index_select(
+                dim=axis, index=torch.arange(length, device=array.device)
+            )
+        if array.shape[axis] < length:
+            pad_widths = [(0, 0)] * array.ndim
+            pad_widths[axis] = (0, length - array.shape[axis])
+            array = F.pad(array, [pad for sizes in pad_widths[::-1] for pad in sizes])
+    else:
+        if array.shape[axis] > length:
+            array = array.take(indices=range(length), axis=axis)
+        if array.shape[axis] < length:
+            pad_widths = [(0, 0)] * array.ndim
+            pad_widths[axis] = (0, length - array.shape[axis])
+            array = np.pad(array, pad_widths)
+    return array
+class Model:
+    def __init__(
+        self,
+        model_name_or_path: str,
+        cuda_visible_device: str = "0",
+        device: str = "cuda",  # torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    ):
+        os.environ["CUDA_VISIBLE_DEVICES"] = cuda_visible_device
+        self.DEVICE = device
+        self.processor = WhisperProcessor.from_pretrained(model_name_or_path)
+        self.tokenizer = self.processor.tokenizer
+        self.config = WhisperConfig.from_pretrained(model_name_or_path)
+        # Call from_pretrained on the class itself: building an instance first would
+        # allocate a full set of randomly initialised weights only to discard them.
+        self.model = WhisperForConditionalGeneration.from_pretrained(
+            pretrained_model_name_or_path=model_name_or_path,
+            torch_dtype=self.config.torch_dtype,
+            # device_map=DEVICE,      # 'balanced', 'balanced_low_0', 'sequential', 'cuda', 'cpu'
+            low_cpu_mem_usage=True,
+        )
+        # Move the model to the target device if it is not already there
+        if self.model.device.type != self.DEVICE:
+            print(f"Moving model to {self.DEVICE}")
+            self.model = self.model.to(self.DEVICE)
+        else:
+            print(f"Model is already on {self.DEVICE}")
+        self.model.eval()
+        print("dtype of model acc to config: ", self.config.torch_dtype)
+        print("dtype of loaded model: ", self.model.dtype)
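+    def transcribe(
+        self, audio, language: str = "english", skip_special_tokens: bool = True
+    ) -> str:
+        """Transcribe one clip of 16 kHz mono samples and return the decoded text."""
+        # Match the input dtype to the loaded model instead of hard-coding half precision
+        input_features = self.processor(
+            audio, sampling_rate=SAMPLE_RATE, return_tensors="pt"
+        ).input_features.to(self.DEVICE, dtype=self.model.dtype)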
+        with torch.no_grad():
+            predicted_ids = self.model.generate(
+                input_features,
+                num_beams=1,
+                language=language,
+                task="transcribe",
+                use_cache=True,
+                is_multilingual=True,
+                return_timestamps=True,
+            )
+        transcription = self.tokenizer.batch_decode(
+            predicted_ids, skip_special_tokens=skip_special_tokens
+        )[0]
+        return transcription.strip()