---
tags:
- audio
- automatic-speech-recognition
- whisper
- ctranslate2
- faster-whisper
- whisperx
license: apache-2.0
base_model: vinai/PhoWhisper-large
pipeline_tag: automatic-speech-recognition
---

# PhoWhisper Large - CTranslate2 Version (Float32)

This repository contains the [vinai/PhoWhisper-large](https://huggingface.co/vinai/PhoWhisper-large) model converted to the **CTranslate2** format in full **Float32** precision.

Because the weights are stored in Float32, you can choose the compute precision at load time (e.g., `float16`, `bfloat16`, or `int8`) to match your hardware (GPU or CPU).

This version is compatible with libraries such as [faster-whisper](https://github.com/SYSTRAN/faster-whisper) and [WhisperX](https://github.com/m-bain/whisperX).

## Model Details
- **Original Model**: [vinai/PhoWhisper-large](https://huggingface.co/vinai/PhoWhisper-large)
- **Format**: CTranslate2 (CT2)
- **Quantization**: None (full `float32` precision)

---

## How to Use

### 1. Using with WhisperX (Python API)
Load the model in WhisperX and set the runtime precision with `compute_type`:

```python
import whisperx

device = "cuda"  # or "cpu"
batch_size = 16

# Load the model in float16 for fast GPU inference
model = whisperx.load_model(
    "qnaug/phowhisper-large-ctranslate2",
    device=device,
    compute_type="float16",  # Choose: "float32", "float16", "int8" (float16 generally needs a GPU)
)

# Transcribe audio
audio = whisperx.load_audio("sample_audio.mp3")
result = model.transcribe(audio, batch_size=batch_size, language="vi")

# Optional: align timestamps
model_a, metadata = whisperx.load_align_model(language_code="vi", device=device)
result_aligned = whisperx.align(result["segments"], model_a, metadata, audio, device)

print(result_aligned["segments"])
```

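Which `compute_type` works depends on the device. As a rough sketch, you could ask CTranslate2 what the device supports and pick the first match from a preference list. `ctranslate2.get_supported_compute_types` is part of the CTranslate2 Python API; the helper names `pick_compute_type` and `supported_compute_types` and the preference order are assumptions made for illustration, not part of this repository.

```python
def pick_compute_type(supported, preferred=("float16", "int8_float16", "int8", "float32")):
    """Return the first preferred compute type found in `supported`.

    The preference order is an illustrative assumption: half precision first,
    then int8 variants, then full precision.
    """
    for name in preferred:
        if name in supported:
            return name
    return "float32"  # conservative fallback when nothing matches


def supported_compute_types(device):
    """Ask CTranslate2 which compute types the given device ("cuda" or "cpu") supports."""
    # Imported lazily so pick_compute_type can be used without ctranslate2 installed.
    import ctranslate2

    return ctranslate2.get_supported_compute_types(device)
```

The result can then be passed as `compute_type`, for example `pick_compute_type(supported_compute_types("cuda"))`.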
### 2. Using with WhisperX (CLI)
```bash
whisperx --model qnaug/phowhisper-large-ctranslate2 --language vi --device cuda --compute_type float16 sample_audio.mp3
```

### 3. Using with faster-whisper (Python API)
```python
from faster_whisper import WhisperModel

# Load the model in float16
model = WhisperModel(
    "qnaug/phowhisper-large-ctranslate2",
    device="cuda",
    compute_type="float16",  # Choose: "float32", "float16", "int8"
)

# Transcribe
segments, info = model.transcribe("sample_audio.mp3", beam_size=5, language="vi")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

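In faster-whisper, `transcribe` returns `segments` as a lazy generator, so decoding only runs while you iterate and the generator can be consumed once. If you need both the full transcript and the individual segments, a minimal sketch is to materialize the generator first. The helper `join_segments` and the stand-in `demo` objects are hypothetical and only mimic the `.start`, `.end`, and `.text` attributes used above.

```python
from types import SimpleNamespace


def join_segments(segments):
    """Consume the segment iterable once and return (full_text, segment_list)."""
    collected = list(segments)  # iterating is what actually drives transcription
    text = " ".join(s.text.strip() for s in collected)
    return text, collected


# Stand-in objects for illustration only; real segments come from model.transcribe(...)
demo = (
    SimpleNamespace(start=0.0, end=1.5, text=" xin chào"),
    SimpleNamespace(start=1.5, end=3.0, text=" các bạn"),
)
full_text, kept = join_segments(iter(demo))
print(full_text)
```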

---

## How the Model Was Converted
This model was converted with the `ct2-transformers-converter` tool using the command below. No `--quantization` flag was passed, so the weights are left unquantized:

```bash
ct2-transformers-converter --model vinai/PhoWhisper-large \
    --output_dir ./phowhisper-large-ctranslate2 \
    --copy_files tokenizer.json preprocessor_config.json
```

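If you reproduce the conversion, a quick sanity check on the output directory can catch a missing `--copy_files` entry before loading. The sketch below assumes the converter writes `model.bin`, `config.json`, and a vocabulary file named either `vocabulary.json` or `vocabulary.txt` (the name varies across CTranslate2 versions); the helper `missing_files` and both file lists are illustrative assumptions, not part of this repository.

```python
from pathlib import Path

# Assumed layout of a converted Whisper model plus the files copied via --copy_files.
REQUIRED = ("model.bin", "config.json", "tokenizer.json", "preprocessor_config.json")
VOCAB_ALTERNATIVES = ("vocabulary.json", "vocabulary.txt")


def missing_files(model_dir):
    """Return the expected files that are absent from `model_dir` (empty list if none)."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED if not (root / name).is_file()]
    # Accept either vocabulary file name, since it depends on the converter version.
    if not any((root / name).is_file() for name in VOCAB_ALTERNATIVES):
        missing.append(" or ".join(VOCAB_ALTERNATIVES))
    return missing
```

For example, `missing_files("./phowhisper-large-ctranslate2")` should return an empty list after a successful conversion, under the assumptions above.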
## Credits
All credit goes to the authors of the original model: **VinAI Research**. If you use this model in your research, please cite the original PhoWhisper paper or repository.