tags:
  - audio
  - automatic-speech-recognition
  - whisper
  - ctranslate2
  - faster-whisper
  - whisperx
license: apache-2.0
base_model: vinai/PhoWhisper-large
pipeline_tag: automatic-speech-recognition

PhoWhisper Large - CTranslate2 Version (Float32)

This repository contains the vinai/PhoWhisper-large model converted to the CTranslate2 format in full Float32 precision.

Because the weights are stored in Float32, you can choose the compute precision when you load the model (e.g., float16, bfloat16, or int8) to match your hardware (GPU or CPU). CTranslate2 converts the weights to the requested type at load time.
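
As a rough illustration of choosing a precision per device, the sketch below picks a compute type from a set of supported types. The helper name pick_compute_type and the preference order are assumptions made for this example, not part of the repository. In practice the supported set could be obtained from ctranslate2.get_supported_compute_types(device); a fixed set is used here so the snippet runs without a GPU.

```python
def pick_compute_type(device, supported):
    """Return the first preferred compute type that appears in `supported`."""
    # Illustrative preference order (an assumption): speed first on GPU,
    # memory footprint first on CPU.
    preferences = {
        "cuda": ["float16", "bfloat16", "int8_float16", "float32"],
        "cpu": ["int8", "float32"],
    }
    for candidate in preferences.get(device, ["float32"]):
        if candidate in supported:
            return candidate
    return "float32"  # fall back to the precision the weights are stored in


# Hypothetical supported sets, standing in for
# ctranslate2.get_supported_compute_types(device).
print(pick_compute_type("cuda", {"float32", "float16", "int8"}))
print(pick_compute_type("cpu", {"float32", "int8"}))
```

The returned string can then be passed as compute_type when loading the model.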

This version is fully compatible with libraries like faster-whisper and WhisperX.

Model Details

  • Original Model: vinai/PhoWhisper-large
  • Format: CTranslate2 (CT2)
  • Quantization: None (Full float32 precision)

How to Use

1. Using with WhisperX (Python API)

You can load this model directly into WhisperX and specify your preferred runtime precision using compute_type:

import whisperx

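device = "cuda"  # or "cpu"; on CPU use compute_type "int8" or "float32" instead of "float16"
batch_size = 16  # lower this if the GPU runs out of memory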

# Load the model in Float16 for fast GPU inference
model = whisperx.load_model(
    "qnaug/phowhisper-large-ctranslate2", 
    device=device, 
    compute_type="float16" # Choose: "float32", "float16", "int8"
)

# Transcribe audio
audio = whisperx.load_audio("sample_audio.mp3")
result = model.transcribe(audio, batch_size=batch_size, language="vi")

# Optional: Align timestamps
model_a, metadata = whisperx.load_align_model(language_code="vi", device=device)
result_aligned = whisperx.align(result["segments"], model_a, metadata, audio, device)

print(result_aligned["segments"])
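
If you want subtitles rather than a printed list, the segments can be turned into SRT text. This is a minimal sketch that assumes each segment is a dict with "start", "end", and "text" keys, as in the output above. The helper names and the sample segments are hypothetical.

```python
def to_srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT text from dicts with "start", "end", and "text" keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = to_srt_timestamp(seg["start"])
        end = to_srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


# Hypothetical segments standing in for result_aligned["segments"].
example = [
    {"start": 0.0, "end": 1.5, "text": " Xin chào"},
    {"start": 1.5, "end": 3.25, "text": " Cảm ơn"},
]
print(segments_to_srt(example))
```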

2. Using with WhisperX (CLI)

whisperx --model qnaug/phowhisper-large-ctranslate2 --language vi --device cuda --compute_type float16 sample_audio.mp3

3. Using with faster-whisper (Python API)

from faster_whisper import WhisperModel

# Load the model in Float16
model = WhisperModel(
    "qnaug/phowhisper-large-ctranslate2", 
    device="cuda", 
    compute_type="float16" # Choose: "float32", "float16", "int8"
)

# Transcribe
segments, info = model.transcribe("sample_audio.mp3", beam_size=5, language="vi")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
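
Note that faster-whisper returns segments as a generator, so decoding only runs while you iterate and the generator can be consumed just once. If you need both the individual segments and the full text, one option is to collect them first. The sketch below uses hypothetical stand-in objects with start, end, and text attributes instead of a real model, and the helper name collect_transcript is an assumption for this example.

```python
from types import SimpleNamespace


def collect_transcript(segments):
    """Consume the segment iterable once; return (segment list, joined text)."""
    collected = list(segments)  # with faster-whisper, this is where decoding happens
    text = " ".join(seg.text.strip() for seg in collected)
    return collected, text


# Hypothetical stand-ins for the objects yielded by model.transcribe(...).
fake_segments = (
    SimpleNamespace(start=s, end=e, text=t)
    for s, e, t in [(0.0, 1.5, " Xin chào"), (1.5, 3.0, " các bạn")]
)
collected, text = collect_transcript(fake_segments)
print(len(collected), text)
```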

How the Model Was Converted

This model was converted with the ct2-transformers-converter tool. No --quantization option was passed, so the weights were not quantized during conversion:

ct2-transformers-converter --model vinai/PhoWhisper-large \
    --output_dir ./phowhisper-large-ctranslate2 \
    --copy_files tokenizer.json preprocessor_config.json
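
If you reproduce the conversion, a quick sanity check is to confirm that the output directory contains the expected files. The sketch below is an assumption-laden example: model.bin and config.json are the files the converter normally writes, tokenizer.json and preprocessor_config.json come from --copy_files, and the vocabulary file name is assumed to be either vocabulary.json or vocabulary.txt depending on the CTranslate2 version. The helper name missing_files is hypothetical.

```python
import tempfile
from pathlib import Path

# Files expected after conversion (see the assumptions stated above).
REQUIRED = ["model.bin", "config.json", "tokenizer.json", "preprocessor_config.json"]
VOCAB_ALTERNATIVES = ["vocabulary.json", "vocabulary.txt"]


def missing_files(model_dir):
    """Return the names of expected files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED if not (root / name).is_file()]
    # Accept either vocabulary file name.
    if not any((root / name).is_file() for name in VOCAB_ALTERNATIVES):
        missing.append(" or ".join(VOCAB_ALTERNATIVES))
    return missing


# Demo on a throwaway directory that only holds part of a converted model.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["model.bin", "config.json"]:
        (Path(demo_dir) / name).touch()
    print(missing_files(demo_dir))
```

Pointing missing_files at ./phowhisper-large-ctranslate2 after running the command above should return an empty list if these assumptions hold.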

Credits

All credit goes to the authors of the original model, VinAI Research. If you use this model in your research, please cite the original PhoWhisper repository and paper.