GigaAM-v3 Transformers

Local GigaAM-v3 implementation in Hugging Face Transformers format.

This project provides a standard Transformers interface:

  • AutoModel.from_pretrained(...)
  • pipeline(task="automatic-speech-recognition", ...)
  • custom AutoConfig / AutoFeatureExtractor / AutoTokenizer via trust_remote_code=True
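
For example, the custom configuration class loads through the standard Auto API (a minimal sketch; the local path matches the Quick Start below):

from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "./GigaAM-v3-transformers",
    trust_remote_code=True,  # required to load the custom config class
)
print(type(config).__name__)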

Branches

This repository contains multiple branches with different ASR model architectures:

  • main (this branch) - RNN-T End-to-End (rnnt_e2e)
  • rnnt - RNN-T model
  • rnnt_e2e - RNN-T End-to-End model
  • ctc - CTC model
  • ctc_e2e - CTC End-to-End model
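
When the model is pulled from the Hugging Face Hub rather than a local path, a branch can be selected with the standard revision argument (a sketch; the repo ID below is a placeholder):

from transformers import AutoModel

# "your-org/GigaAM-v3-transformers" is a placeholder repo ID
model = AutoModel.from_pretrained(
    "your-org/GigaAM-v3-transformers",
    revision="ctc",  # branch name from the list above
    trust_remote_code=True,
)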

What's Inside (main branch)

  • Conformer encoder
  • RNN-T End-to-End head (decoder + joint)
  • greedy decoding for rnnt_e2e (loop sketched below)
  • SentencePiece tokenizer
  • custom ASR pipeline (GigaAMPipeline)

Important: this repository does not provide model.transcribe(...). The recommended inference paths are the pipeline or a direct model(...) call followed by decoding.
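
For orientation, greedy RNN-T decoding works roughly as follows: at each encoder frame the joint network is queried with the current decoder state, non-blank tokens are emitted (advancing the decoder) until blank is predicted, then the next frame is consumed. The decoder_step/joint interfaces in this sketch are illustrative assumptions, not this repository's exact API:

def rnnt_greedy_decode(encoder_out, decoder_step, joint, blank_id, max_symbols=10):
    """Illustrative greedy RNN-T loop for one utterance.

    encoder_out: (T, D) tensor of encoder frames.
    decoder_step(token, state) -> (dec_out, new_state)   # assumed interface
    joint(enc_frame, dec_out)  -> logits over the vocab  # assumed interface
    """
    tokens, state = [], None
    dec_out, state = decoder_step(blank_id, state)  # start-of-sequence step
    for t in range(encoder_out.size(0)):
        emitted = 0
        while emitted < max_symbols:  # cap symbols per frame to avoid infinite loops
            k = int(joint(encoder_out[t], dec_out).argmax())
            if k == blank_id:  # blank: move on to the next encoder frame
                break
            tokens.append(k)
            dec_out, state = decoder_step(k, state)  # feed the emitted token back
            emitted += 1
    return tokens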

Installation

Minimal:

pip install torch transformers sentencepiece

Recommended versions:

  • torch==2.8.0
  • transformers==4.57.1
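
In one command (sentencepiece is left unpinned):

pip install torch==2.8.0 transformers==4.57.1 sentencepiece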

Quick Start

1) Via pipeline (recommended)

from transformers import pipeline

asr = pipeline(
    task="automatic-speech-recognition",
    model="./GigaAM-v3-transformers",
    trust_remote_code=True,
    device=-1,  # CPU; for CUDA use 0
)

result = asr("audio.wav")
print(result["text"])
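
The stock ASR pipeline also accepts in-memory audio as a dict of raw samples plus sampling rate; assuming GigaAMPipeline preserves this behavior, it looks like:

import numpy as np

samples = np.zeros(16_000, dtype=np.float32)  # 1 s of silence as a stand-in
result = asr({"raw": samples, "sampling_rate": 16_000})
print(result["text"])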

Long audio (chunked):

result = asr("long_audio.wav", chunk_length_s=30)
print(result["text"])

Long audio with overlap:

result = asr("long_audio.wav", chunk_length_s=30, stride_length_s=5)
print(result["text"])
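
In the stock pipeline, stride_length_s may also be a (left, right) pair for asymmetric overlap; assuming the custom pipeline passes it through:

result = asr("long_audio.wav", chunk_length_s=30, stride_length_s=(5, 2))
print(result["text"])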

2) Direct model usage

from transformers import AutoModel, AutoFeatureExtractor, AutoTokenizer
import torchaudio

model = AutoModel.from_pretrained(
    "./GigaAM-v3-transformers",
    trust_remote_code=True,
)
feature_extractor = AutoFeatureExtractor.from_pretrained(
    "./GigaAM-v3-transformers",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "./GigaAM-v3-transformers",
    trust_remote_code=True,
)

wav, sr = torchaudio.load("audio.wav")
wav = wav.mean(dim=0).numpy()  # downmix to mono

features = feature_extractor(
    wav,
    sampling_rate=sr,
    return_tensors="pt",
)

outputs = model(**features)
token_ids = model.model.decoding.decode(  # greedy RNN-T decoding
    model.model.head,
    outputs.encoded,
    outputs.encoded_lengths,
)[0]  # single-utterance batch: take the first hypothesis
text = tokenizer.decode(token_ids)
print(text)
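
GigaAM-style models typically expect 16 kHz input; if your file has a different rate and the feature extractor does not resample internally, normalizing up front is a safe pattern (16 kHz is an assumption here — check feature_extractor.sampling_rate):

import torchaudio

wav, sr = torchaudio.load("audio.wav")
target_sr = 16_000  # assumed; verify against feature_extractor.sampling_rate
if sr != target_sr:
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=target_sr)
    sr = target_sr
wav = wav.mean(dim=0).numpy()  # downmix to mono, as above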

Current Limitations

  • The pipeline does not support return_timestamps.
  • The pipeline supports chunk_length_s and stride_length_s (overlapping strides are merged in postprocess).
  • Only the rnnt_e2e ASR mode is supported on this branch.

License

MIT
