Commit · 013bf1c: edit ReadMe.md
Parent(s): c960644

- .gitignore +2 -1
- README.md +53 -30
- __init__.py +49 -49
.gitignore
CHANGED

@@ -1 +1,2 @@
-__pycache__
+__pycache__
+.vscode
README.md
CHANGED
@@ -75,7 +75,7 @@ model-index:
       value: 0
       name: Test CER
       description: Character Error Rate
-
+
   - task:
       type: automatic-speech-recognition
       name: Automatic Speech Recognition
@@ -216,7 +216,6 @@ language:
 - ba
 - jw
 - su
-
 ---
 ## Versions:
 
@@ -233,17 +232,18 @@ language:
 - RAM: 2.8 GB (Original_Model: 5.5GB)
 - VRAM: 1812 MB (Original_Model: 6GB)
 - test.wav: 23 s (Multilingual Speech i.e. English+Hindi)
+
 - **Time in seconds for Processing by each device**
 
-| Device Name | float32 (Original)
-| ----------------- |
-| 3060 | 1.7
-| 1660 Super | OOM
-| Collab (Tesla T4) | 2.8
-| Collab (CPU) | 35
-| M1 (CPU) | -
-| M1 (GPU -> 'mps') | -
-
+| Device Name       | float32 (Original) | float16 | CudaCores | TensorCores |
+| ----------------- | ------------------ | ------- | --------- | ----------- |
+| 3060              | 1.7                | 1.1     | 3,584     | 112         |
+| 1660 Super        | OOM                | 3.3     | 1,408     | N/A         |
+| Collab (Tesla T4) | 2.8                | 2.2     | 2,560     | 320         |
+| Collab (CPU)      | 35                 | N/A     | N/A       | N/A         |
+| M1 (CPU)          | -                  | -       | -         | -           |
+| M1 (GPU -> 'mps') | -                  | -       | -         | -           |
+
 
 - **NOTE: TensorCores are efficient in mixed-precision calculations**
 - **CPU -> torch.float16 not supported on CPU (AMD Ryzen 5 3600 or Collab CPU)**
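The timings in the new table are wall-clock seconds for transcribing `test.wav`, but the benchmark script itself is not part of this commit. A minimal timing sketch, assuming the `Model`, `load_audio`, and `pad_or_trim` helpers exported by this repo (constructor arguments beyond `model_name_or_path` are not visible in this diff and are omitted here):

```python
import time
import torch

from whisper_medium_fp16_transformers import Model, load_audio, pad_or_trim

# Hypothetical benchmark; Model() may require further arguments not shown in this diff.
model = Model(model_name_or_path="whisper_medium_fp16_transformers")
audio = pad_or_trim(load_audio("whisper_medium_fp16_transformers/test.wav"))

model.transcribe(audio)  # warm-up: the first call carries one-time initialization cost

if torch.cuda.is_available():
    torch.cuda.synchronize()  # wait for pending GPU work before starting the clock
start = time.perf_counter()
model.transcribe(audio)
if torch.cuda.is_available():
    torch.cuda.synchronize()  # make sure the GPU has finished before stopping the clock
print(f"Transcription took {time.perf_counter() - start:.2f} s")
```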
@@ -257,46 +257,66 @@ language:
 - **WIP: Word Information Preserved**
 - **CER: Character Error Rate**
 
-### Hindi (test.tsv) [Common Voice 14.0](https://commonvoice.mozilla.org/en/datasets)
+### Hindi to Hindi (test.tsv) [Common Voice 14.0](https://commonvoice.mozilla.org/en/datasets)
+
 **Test done on RTX 3060 on 2557 Samples**
-| | WER | MER | WIL | WIP | CER |
-| ----------------------- | -------------------- | ------- | --------- | ----------- | ----- |
-| Original_Model (54 min) | 52.02 | 47.86 | 66.82 | 33.17 | 23.76 |
-| This_Model (38 min) | 54.97 | 47.86 | 66.83 | 33.16 | 30.23 |
 
-
+|                         | WER   | MER   | WIL   | WIP   | CER   |
+| ----------------------- | ----- | ----- | ----- | ----- | ----- |
+| Original_Model (54 min) | 52.02 | 47.86 | 66.82 | 33.17 | 23.76 |
+| This_Model (38 min)     | 54.97 | 47.86 | 66.83 | 33.16 | 30.23 |
+
+### Hindi to English (test.tsv) [Common Voice 14.0](https://commonvoice.mozilla.org/en/datasets)
+
+**Test done on RTX 3060 on 1000 Samples**
+
+|                         | WER | MER | WIL | WIP | CER |
+| ----------------------- | --- | --- | --- | --- | --- |
+| Original_Model (30 min) | -   | -   | -   | -   | -   |
+| This_Model (20 min)     | -   | -   | -   | -   | -   |
+
+### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-clean)
+
 **Test done on RTX 3060 on __ Samples**
-| | WER | MER | WIL | WIP | CER |
-| ----------------- | -------------------- | ------- | --------- | ----------- | --- |
-| Original_Model | - | - | - | - | - |
-| This_Model | - | - | - | - | - |
 
-
+|                | WER | MER | WIL | WIP | CER |
+| -------------- | --- | --- | --- | --- | --- |
+| Original_Model | -   | -   | -   | -   | -   |
+| This_Model     | -   | -   | -   | -   | -   |
+
+### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-other)
+
 **Test done on RTX 3060 on __ Samples**
-
-
-
-
+
+|                | WER | MER | WIL | WIP | CER |
+| -------------- | --- | --- | --- | --- | --- |
+| Original_Model | -   | -   | -   | -   | -   |
+| This_Model     | -   | -   | -   | -   | -   |
 
 - **'jiwer' library is used for calculations**
 
 ## Code for conversion:
-
+
+- ### [Will be soon Uploaded on Github](https://github.com/devasheeshG)
 
 ## Usage
+
 A file ``__init__.py`` is contained inside this repo which contains all the code to use this model.
 
 Firstly, clone this repo and place all the files inside a folder.
+
 ### Make sure you have git-lfs installed (https://git-lfs.com)
+
 ```bash
 git lfs install
 git clone https://huggingface.co/devasheeshG/whisper_medium_fp16_transformers
 ```
+
 **Please try in jupyter notebook**
 
 ```python
 # Import the Model
-from whisper_medium_fp16_transformers import Model
+from whisper_medium_fp16_transformers import Model, load_audio, pad_or_trim
 ```
 
 ```python
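The WER/MER/WIL/WIP/CER columns above come from the `jiwer` library, as the README note says. A minimal sketch of that evaluation (the actual script is not in this commit; the reference and hypothesis lists are placeholders that would be filled from the test split and from `model.transcribe`):

```python
import jiwer

# Placeholder transcripts; in the real evaluation these come from the dataset
# references and from model.transcribe() on each audio sample.
references = ["this is a test sentence", "another reference transcript"]
hypotheses = ["this is test sentence", "another reference transcript"]

print("WER:", 100 * jiwer.wer(references, hypotheses))
print("MER:", 100 * jiwer.mer(references, hypotheses))
print("WIL:", 100 * jiwer.wil(references, hypotheses))
print("WIP:", 100 * jiwer.wip(references, hypotheses))
print("CER:", 100 * jiwer.cer(references, hypotheses))
```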
@@ -310,12 +330,15 @@ model = Model(
|
|
| 310 |
|
| 311 |
```python
|
| 312 |
# Load Audio
|
| 313 |
-
audio =
|
|
|
|
| 314 |
```
|
| 315 |
|
| 316 |
```python
|
| 317 |
# Transcribe (First transcription takes time)
|
| 318 |
model.transcribe(audio)
|
| 319 |
```
|
|
|
|
| 320 |
## Credits
|
| 321 |
-
|
|
|
|
|
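The `## Code for conversion` section currently only links to the author's GitHub profile. A plausible sketch of the conversion with `transformers` (an assumption, not the author's actual script): load `openai/whisper-medium` with fp16 weights and save the halved checkpoint plus processor files back out.

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Assumed conversion flow; the real script is promised for GitHub but not included here.
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-medium", torch_dtype=torch.float16
)
processor = WhisperProcessor.from_pretrained("openai/whisper-medium")

# Save the fp16 weights and processor/tokenizer files for upload to the Hub.
model.save_pretrained("whisper_medium_fp16_transformers")
processor.save_pretrained("whisper_medium_fp16_transformers")
```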
__init__.py
CHANGED
@@ -1,5 +1,5 @@
 from transformers import (
-    WhisperForConditionalGeneration, WhisperProcessor, WhisperConfig
+    WhisperForConditionalGeneration, WhisperProcessor, WhisperConfig,
 )
 import torch
 import ffmpeg
@@ -13,6 +13,52 @@ SAMPLE_RATE = 16000
 CHUNK_LENGTH = 30 # 30-second chunks
 N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE # 480000 samples in a 30-second chunk
 
+# audio = whisper.load_audio('test.wav')
+def load_audio(file: str, sr: int = SAMPLE_RATE, start_time: int = 0, dtype=np.float16):
+    """
+    Load an audio file into a numpy array at the specified sampling rate.
+    """
+    try:
+        # This launches a subprocess to decode audio while down-mixing and resampling as necessary.
+        # Requires the ffmpeg CLI and `ffmpeg-python` package to be installed.
+        out, _ = (
+            ffmpeg.input(file, ss=start_time, threads=0)
+            .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)
+            .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)
+        )
+    except ffmpeg.Error as e:
+        raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e
+
+    # return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0
+    return np.frombuffer(out, np.int16).flatten().astype(dtype) / 32768.0
+
+
+# audio = whisper.pad_or_trim(audio)
+def pad_or_trim(array, length: int = N_SAMPLES, *, axis: int = -1):
+    """
+    Pad or trim the audio array to N_SAMPLES, as expected by the encoder.
+    """
+    if torch.is_tensor(array):
+        if array.shape[axis] > length:
+            array = array.index_select(
+                dim=axis, index=torch.arange(length, device=array.device)
+            )
+
+        if array.shape[axis] < length:
+            pad_widths = [(0, 0)] * array.ndim
+            pad_widths[axis] = (0, length - array.shape[axis])
+            array = F.pad(array, [pad for sizes in pad_widths[::-1] for pad in sizes])
+    else:
+        if array.shape[axis] > length:
+            array = array.take(indices=range(length), axis=axis)
+
+        if array.shape[axis] < length:
+            pad_widths = [(0, 0)] * array.ndim
+            pad_widths[axis] = (0, length - array.shape[axis])
+            array = np.pad(array, pad_widths)
+
+    return array
+
 class Model:
     def __init__(self,
                  model_name_or_path: str,
@@ -49,54 +95,8 @@ class Model:
 
         print('dtype of model acc to config: ', self.config.torch_dtype)
         print('dtype of loaded model: ', self.model.dtype)
-
-
-    # audio = whisper.load_audio('test.wav')
-    def load_audio(self, file: str, sr: int = SAMPLE_RATE, start_time: int = 0, dtype=np.float16):
-        try:
-            # This launches a subprocess to decode audio while down-mixing and resampling as necessary.
-            # Requires the ffmpeg CLI and `ffmpeg-python` package to be installed.
-            out, _ = (
-                ffmpeg.input(file, ss=start_time, threads=0)
-                .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)
-                .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)
-            )
-        except ffmpeg.Error as e:
-            raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e
-
-        # return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0
-        return np.frombuffer(out, np.int16).flatten().astype(dtype) / 32768.0
-
-
-    # audio = whisper.pad_or_trim(audio)
-    def _pad_or_trim(self, array, length: int = N_SAMPLES, *, axis: int = -1):
-        """
-        Pad or trim the audio array to N_SAMPLES, as expected by the encoder.
-        """
-        if torch.is_tensor(array):
-            if array.shape[axis] > length:
-                array = array.index_select(
-                    dim=axis, index=torch.arange(length, device=array.device)
-                )
-
-            if array.shape[axis] < length:
-                pad_widths = [(0, 0)] * array.ndim
-                pad_widths[axis] = (0, length - array.shape[axis])
-                array = F.pad(array, [pad for sizes in pad_widths[::-1] for pad in sizes])
-        else:
-            if array.shape[axis] > length:
-                array = array.take(indices=range(length), axis=axis)
-
-            if array.shape[axis] < length:
-                pad_widths = [(0, 0)] * array.ndim
-                pad_widths[axis] = (0, length - array.shape[axis])
-                array = np.pad(array, pad_widths)
-
-        return array
 
-    def transcribe(self, audio
-        # audio = load_audio(audio)
-        audio = self._pad_or_trim(audio)
+    def transcribe(self, audio, language: str = "english", skip_special_tokens: bool = True) -> str:
         input_features = self.processor(audio, sampling_rate=SAMPLE_RATE, return_tensors="pt").input_features.half().to(self.DEVICE)
         with torch.no_grad():
             predicted_ids = self.model.generate(
@@ -109,5 +109,5 @@ class Model:
                 return_timestamps=True,
             )
 
-        transcription = self.tokenizer.batch_decode(predicted_ids, skip_special_tokens=
+        transcription = self.tokenizer.batch_decode(predicted_ids, skip_special_tokens=skip_special_tokens)[0]
         return transcription.strip()
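Since `load_audio` and `pad_or_trim` are now module-level functions, they can be exercised on their own. An illustrative check (not part of the commit) that `pad_or_trim` always returns exactly `N_SAMPLES` samples:

```python
import numpy as np

from whisper_medium_fp16_transformers import pad_or_trim

SAMPLE_RATE = 16000
N_SAMPLES = 30 * SAMPLE_RATE  # same constants as in __init__.py

short = np.zeros(5 * SAMPLE_RATE, dtype=np.float16)   # 5 s of audio, will be padded
long = np.zeros(45 * SAMPLE_RATE, dtype=np.float16)   # 45 s of audio, will be trimmed

assert pad_or_trim(short).shape[-1] == N_SAMPLES
assert pad_or_trim(long).shape[-1] == N_SAMPLES
```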
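The new `transcribe` signature takes a `language` argument, but the keyword arguments of the `self.model.generate(...)` call sit between the hunks (lines 103-108 are not shown). For reference, a standalone sketch of how language-forced generation typically looks with the raw `transformers` API; this is an assumption about the elided code, not a quote of it:

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

from whisper_medium_fp16_transformers import load_audio, pad_or_trim

# Assumes the cloned repo folder contains both the fp16 weights and processor files.
processor = WhisperProcessor.from_pretrained("whisper_medium_fp16_transformers")
model = WhisperForConditionalGeneration.from_pretrained(
    "whisper_medium_fp16_transformers", torch_dtype=torch.float16
).to("cuda")

audio = pad_or_trim(load_audio("whisper_medium_fp16_transformers/test.wav"))
input_features = (
    processor(audio, sampling_rate=16000, return_tensors="pt")
    .input_features.half()
    .to("cuda")
)

# Force the decoding language/task via the processor's decoder prompt ids.
forced_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")
with torch.no_grad():
    predicted_ids = model.generate(
        input_features, forced_decoder_ids=forced_ids, return_timestamps=True
    )
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```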