	---
	tags:
	- audio
	- automatic-speech-recognition
	- whisper
	- ctranslate2
	- faster-whisper
	- whisperx
	license: apache-2.0
	base_model: vinai/PhoWhisper-large
	pipeline_tag: automatic-speech-recognition
	---

	# PhoWhisper Large - CTranslate2 Version (Float32)

	This repository contains the [vinai/PhoWhisper-large](https://huggingface.co/vinai/PhoWhisper-large) model converted to the CTranslate2 format in full Float32 precision.

	Because the weights are stored in Float32, you can choose the runtime precision when loading the model (e.g., `float16`, `bfloat16`, or `int8`) to match your hardware (GPU or CPU); CTranslate2 converts the weights to the requested type at load time.

	This version is fully compatible with libraries like [faster-whisper](https://github.com/SYSTRAN/faster-whisper) and [WhisperX](https://github.com/m-bain/whisperX).

	## Model Details
	- Original Model: [vinai/PhoWhisper-large](https://huggingface.co/vinai/PhoWhisper-large)
	- Format: CTranslate2 (CT2)
	- Quantization: None (Full `float32` precision)

	---

	## How to Use

	### 1. Using with WhisperX (Python API)
	You can load this model directly into WhisperX and specify your preferred runtime precision using `compute_type`:

	```python
	import whisperx

	device = "cuda"  # or "cpu"
	batch_size = 16  # lower this if GPU memory is limited

	# Load the model in Float16 for fast GPU inference
	# (on CPU, "int8" or "float32" is usually the appropriate choice)
	model = whisperx.load_model(
	    "qnaug/phowhisper-large-ctranslate2",
	    device=device,
	    compute_type="float16",  # Choose: "float32", "float16", "int8"
	)

	# Transcribe audio
	audio = whisperx.load_audio("sample_audio.mp3")
	result = model.transcribe(audio, batch_size=batch_size, language="vi")

	# Optional: Align timestamps
	model_a, metadata = whisperx.load_align_model(language_code="vi", device=device)
	result_aligned = whisperx.align(result["segments"], model_a, metadata, audio, device)

	print(result_aligned["segments"])
	```
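
The aligned segments printed above are plain dictionaries, so they are easy to post-process. As a minimal sketch, assuming each segment carries `start`, `end` (in seconds), and `text` keys, the helpers below render them as SRT subtitles. The function names `format_timestamp` and `segments_to_srt` are hypothetical and are not part of WhisperX.

```python
def format_timestamp(seconds: float) -> str:
    """Convert a time in seconds to an SRT timestamp (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Render an iterable of {"start", "end", "text"} dicts as SRT text.

    Assumption: every segment has numeric "start"/"end" values in seconds.
    """
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}")
    return "\n\n".join(blocks) + "\n" if blocks else ""
```

With the variables from the example above, `print(segments_to_srt(result_aligned["segments"]))` would write the subtitles to standard output.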

	### 2. Using with WhisperX (CLI)
	```bash
	whisperx --model qnaug/phowhisper-large-ctranslate2 --language vi --device cuda --compute_type float16 sample_audio.mp3
	```

	### 3. Using with faster-whisper (Python API)
	```python
	from faster_whisper import WhisperModel

	# Load the model in Float16
	model = WhisperModel(
	    "qnaug/phowhisper-large-ctranslate2",
	    device="cuda",
	    compute_type="float16",  # Choose: "float32", "float16", "int8"
	)

	# Transcribe
	segments, info = model.transcribe("sample_audio.mp3", beam_size=5, language="vi")

	for segment in segments:
	    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
	```
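
The examples above hard-code `compute_type="float16"`, which suits a GPU but not every machine. The following is a minimal sketch of picking a compute type from the set a device reports as supported. The preference order in `PREFERRED` and the helper name `choose_compute_type` are illustrative assumptions, not recommendations from the original model authors.

```python
# Preference order per device: an illustrative assumption, not a rule
# taken from this model card or from the CTranslate2 documentation.
PREFERRED = {
    "cuda": ("float16", "int8_float16", "int8", "float32"),
    "cpu": ("int8", "float32"),
}


def choose_compute_type(device: str, supported) -> str:
    """Return the first preferred compute type that the device supports.

    `supported` is any iterable of compute type names. Falls back to
    "default", which leaves the choice to CTranslate2.
    """
    supported = set(supported)
    for candidate in PREFERRED.get(device, ("float32",)):
        if candidate in supported:
            return candidate
    return "default"
```

The supported set can be obtained from `ctranslate2.get_supported_compute_types("cuda")` (or `"cpu"`), and the result passed as `compute_type` to `WhisperModel` or `whisperx.load_model`.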

	---

	## How the Model Was Converted
	This model was converted using the `ct2-transformers-converter` tool with the following command:

	```bash
	ct2-transformers-converter --model vinai/PhoWhisper-large \
	--output_dir ./phowhisper-large-ctranslate2 \
	--copy_files tokenizer.json preprocessor_config.json
	```
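
After running the conversion, it can be useful to confirm that the output directory is complete before loading or uploading it. The sketch below assumes the converter writes `model.bin`, `config.json`, and a vocabulary file (`vocabulary.json` or `vocabulary.txt`, depending on the CTranslate2 version), alongside the two files named in `--copy_files`. The helper name `missing_files` is hypothetical.

```python
from pathlib import Path

# Files expected in the converted directory. The first two are assumed
# converter outputs; the last two come from --copy_files in the command above.
REQUIRED_FILES = (
    "model.bin",
    "config.json",
    "tokenizer.json",
    "preprocessor_config.json",
)
# Assumption: the vocabulary file name varies with the CTranslate2 version.
VOCABULARY_FILES = ("vocabulary.json", "vocabulary.txt")


def missing_files(model_dir) -> list:
    """Return the names of expected files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if not any((root / name).is_file() for name in VOCABULARY_FILES):
        missing.append(" or ".join(VOCABULARY_FILES))
    return missing
```

For example, `missing_files("./phowhisper-large-ctranslate2")` returns an empty list when every expected file is present.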

	## Credits
	All credits go to the authors of the original model: VinAI Research. If you use this model in your research, please cite the original PhoWhisper repository/paper.