vhadzhykhanov/GigaAM-v3-fastapi-supported

Vladislav Gadzhikhanov committed on Dec 15, 2025

Commit 74e40ab (1 parent: ec1dc1f)

Added BinaryIO input for transcribe_longform

Files changed (2)

README.md +0 -65
modeling_gigaam.py +38 -7

README.md CHANGED

@@ -1,65 +0,0 @@
----
-license: mit
-language:
-- ru
-- en
-pipeline_tag: automatic-speech-recognition
----
-# GigaAM-v3
-GigaAM-v3 is a Conformer-based foundation model with 220–240M parameters, pretrained on diverse Russian speech data using the HuBERT-CTC objective.
-It is the third generation of the GigaAM family and provides state-of-the-art performance on Russian ASR across a wide range of domains.
-GigaAM-v3 includes the following model variants:
-- `ssl` — self-supervised HuBERT–CTC encoder pre-trained on 700,000 hours of Russian speech
-- `ctc` — ASR model fine-tuned with a CTC decoder
-- `rnnt` — ASR model fine-tuned with an RNN-T decoder
-- `e2e_ctc` — end-to-end CTC model with punctuation and text normalization
-- `e2e_rnnt` — end-to-end RNN-T model with punctuation and text normalization
-`GigaAM-v3` training incorporates new internal datasets: callcenter conversations, speech with background music, natural speech, and speech with atypical characteristics.
-the models perform on average **30%** better on these new domains, while maintaining the same quality as previous GigaAM generations on public benchmarks.
-The table below reports the Word Error Rate (%) for `GigaAM-v3` and other existing models over diverse domains.
-| Set Name          | V3_CTC | V3_RNNT | T-One + LM | Whisper |
-|:------------------|-------:|--------:|-----------:|--------:|
-| Open Datasets     |   3.0  |     2.6 |        5.7 |    12.0 |
-| Golos Farfield    |   4.5  |     3.9 |       12.2 |    16.7 |
-| Natural Speech    |   7.8  |     6.9 |       14.5 |    13.6 |
-| Disordered Speech |  20.6  |    19.2 |       51.0 |    59.3 |
-| Callcenter        |  10.3  |     9.5 |       13.5 |    23.9 |
-| **Average**       | **9.2**| **8.4** |       19.4 |    25.1 |
-The end-to-end ASR models (`e2e_ctc` and `e2e_rnnt`) produce punctuated, normalized text directly.
-In end-to-end ASR comparisons of `e2e_ctc` and `e2e_rnnt` against Whisper-large-v3, using Gemini 2.5 Pro as an LLM-as-a-judge, GigaAM-v3 models win by an average margin of **70:30**.
-For detailed results, see [metrics](https://github.com/salute-developers/GigaAM/blob/main/evaluation.md).
-## Usage
-```python
-from transformers import AutoModel
-revision = "e2e_rnnt"  # can be any v3 model: ssl, ctc, rnnt, e2e_ctc, e2e_rnnt
-model = AutoModel.from_pretrained(
-    "ai-sage/GigaAM-v3",
-    revision=revision,
-    trust_remote_code=True,
-)
-transcription = model.transcribe("example.wav")
-print(transcription)
-```
-Recommended versions:
-- `torch==2.8.0`, `torchaudio==2.8.0`
-- `transformers==4.57.1`
-- `pyannote-audio==4.0.0`, `torchcodec==0.7.0`
-- (any) `hydra-core`, `omegaconf`, `sentencepiece`
-Full usage guide can be found in the [example](https://github.com/salute-developers/GigaAM/blob/main/colab_example.ipynb).
-**License:** MIT
-**Paper:** [GigaAM: Efficient Self-Supervised Learner for Speech Recognition (InterSpeech 2025)](https://arxiv.org/abs/2506.01192)

modeling_gigaam.py CHANGED

@@ -21,6 +21,7 @@ from torch import Tensor, nn
 from torch.jit import TracerWarning
 from transformers import PretrainedConfig, PreTrainedModel
 from transformers.utils import cached_file
 DIR_NAME = os.path.dirname(os.path.abspath(__file__))
 sys.path.append(DIR_NAME)  # enable using modules through modeling_gigaam.<module_name>
@@ -66,6 +67,35 @@ def load_audio(audio_path: str, sample_rate: int = SAMPLE_RATE) -> Tensor:
         return torch.frombuffer(audio, dtype=torch.int16).float() / 32768.0
 class SpecScaler(nn.Module):
     """
     Module that applies logarithmic scaling to spectrogram values.
@@ -296,7 +326,7 @@ def get_pipeline(device: torch.device):
 def segment_audio_file(
-    wav_file: str,
     sr: int,
     max_duration: float = 22.0,
     min_duration: float = 15.0,
@@ -309,9 +339,10 @@ def segment_audio_file(
     The segmentation is performed using a PyAnnote voice activity detection pipeline.
     """
-    audio = load_audio(wav_file)
     pipeline = get_pipeline(device)
-    sad_segments = pipeline(wav_file)
     segments: List[torch.Tensor] = []
     curr_duration = 0.0
@@ -1296,7 +1327,7 @@ class GigaAMASR(GigaAM):
     @torch.inference_mode()
     def transcribe_longform(
-        self, wav_file: str, **kwargs
     ) -> List[Dict[str, Union[str, Tuple[float, float]]]]:
         """
         Transcribes a long audio file by splitting it into segments and
@@ -1304,7 +1335,7 @@ class GigaAMASR(GigaAM):
         """
         transcribed_segments = []
         segments, boundaries = segment_audio_file(
-            wav_file, SAMPLE_RATE, device=self._device, **kwargs
         )
         for segment, segment_boundaries in zip(segments, boundaries):
             wav = segment.to(self._device).unsqueeze(0).to(self._dtype)
@@ -1411,8 +1442,8 @@ class GigaAMModel(PreTrainedModel):
     def transcribe(self, wav_file: str) -> str:
         return self.model.transcribe(wav_file)
-    def transcribe_longform(self, wav_file: str) -> List[Dict[str, Union[str, Tuple[float, float]]]]:
-        return self.model.transcribe_longform(wav_file)
     def get_probs(self, wav_file: str) -> Dict[str, float]:
         return self.model.get_probs(wav_file)

 from torch.jit import TracerWarning
 from transformers import PretrainedConfig, PreTrainedModel
 from transformers.utils import cached_file
+from typing import BinaryIO
 DIR_NAME = os.path.dirname(os.path.abspath(__file__))
 sys.path.append(DIR_NAME)  # enable using modules through modeling_gigaam.<module_name>
         return torch.frombuffer(audio, dtype=torch.int16).float() / 32768.0
+def load_audio_binary(file: BinaryIO, sample_rate: int = SAMPLE_RATE) -> Tensor:
+    """
+    Load audio from binary stream using ffmpeg pipe.
+    Note: Requires ffmpeg compiled with proper stdin support.
+    """
+    cmd = [
+        "ffmpeg",
+        "-i", "pipe:0",  # read from stdin
+        "-f", "s16le",
+        "-ac", "1",
+        "-acodec", "pcm_s16le",
+        "-ar", str(sample_rate),
+        "pipe:1"  # write to stdout
+    ]
+    if hasattr(file, 'seek'):
+        file.seek(0)
+    try:
+        result = run(cmd, input=file.read(), capture_output=True, check=True)
+        audio_bytes = result.stdout
+    except CalledProcessError as exc:
+        raise RuntimeError(f"FFmpeg failed: {exc.stderr.decode()}") from exc
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore", category=UserWarning)
+        return torch.frombuffer(audio_bytes, dtype=torch.int16).float() / 32768.0
 class SpecScaler(nn.Module):
     """
     Module that applies logarithmic scaling to spectrogram values.
 def segment_audio_file(
+    file: BinaryIO,
     sr: int,
     max_duration: float = 22.0,
     min_duration: float = 15.0,
     The segmentation is performed using a PyAnnote voice activity detection pipeline.
     """
+    audio = load_audio_binary(file)
     pipeline = get_pipeline(device)
+    if hasattr(file, 'seek'): file.seek(0)
+    sad_segments = pipeline(file)
     segments: List[torch.Tensor] = []
     curr_duration = 0.0
     @torch.inference_mode()
     def transcribe_longform(
+        self, file: BinaryIO, **kwargs
     ) -> List[Dict[str, Union[str, Tuple[float, float]]]]:
         """
         Transcribes a long audio file by splitting it into segments and
         """
         transcribed_segments = []
         segments, boundaries = segment_audio_file(
+            file, SAMPLE_RATE, device=self._device, **kwargs
         )
         for segment, segment_boundaries in zip(segments, boundaries):
             wav = segment.to(self._device).unsqueeze(0).to(self._dtype)
     def transcribe(self, wav_file: str) -> str:
         return self.model.transcribe(wav_file)
+    def transcribe_longform(self, file: BinaryIO) -> List[Dict[str, Union[str, Tuple[float, float]]]]:
+        return self.model.transcribe_longform(file)
     def get_probs(self, wav_file: str) -> Dict[str, float]:
         return self.model.get_probs(wav_file)
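
The new `segment_audio_file` reads the same stream twice: once through `load_audio_binary` and once through the PyAnnote pipeline, calling `file.seek(0)` in between. The following is a minimal standard-library sketch of that rewind pattern, not code from this commit. `ensure_rewindable` and `read_for_each_consumer` are hypothetical names introduced only for illustration. The sketch checks `seekable()` rather than `hasattr(file, 'seek')`, on the assumption that callers may pass pipes or similar streams: standard `io` objects expose a `seek` attribute even when they cannot actually seek.

```python
import io
from typing import BinaryIO, List


def ensure_rewindable(stream: BinaryIO) -> BinaryIO:
    """Return a stream positioned at byte 0 that can be read more than once.

    Hypothetical helper, not part of modeling_gigaam.py.
    """
    seekable = getattr(stream, "seekable", None)
    if callable(seekable) and seekable():
        # Regular files and BytesIO objects: just rewind in place.
        stream.seek(0)
        return stream
    # Non-seekable input (for example a pipe): copy the remaining bytes
    # into memory so that later readers can rewind freely.
    return io.BytesIO(stream.read())


def read_for_each_consumer(stream: BinaryIO, consumers: int = 2) -> List[bytes]:
    """Simulate several readers (e.g. a decoder, then a VAD pipeline)."""
    stream = ensure_rewindable(stream)
    payloads = []
    for _ in range(consumers):
        stream.seek(0)  # each consumer must start from the first byte
        payloads.append(stream.read())
    return payloads


if __name__ == "__main__":
    upload = io.BytesIO(b"fake audio bytes")
    upload.read(4)  # pretend an earlier step already consumed part of it
    first, second = read_for_each_consumer(upload)
    print(first == second)
```

Buffering a non-seekable stream holds the whole payload in memory, which may matter for very long recordings; whether that trade-off is acceptable depends on the deployment.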