# 🇻🇳 Whisper Vietnamese CTranslate2
This repository contains a fine-tuned Vietnamese ASR (automatic speech recognition) model converted from openai/whisper-small using CTranslate2. It is optimized for fast inference on both CPU and GPU.
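For reference, a conversion like the one used for this repository can be done with CTranslate2's converter CLI. This is a minimal sketch, not the exact command used by the author; the output directory name is an assumption, and quantization is optional.

```shell
# Convert a Transformers Whisper checkpoint to CTranslate2 format.
# --quantization float16 is optional and shrinks the model for GPU inference.
ct2-transformers-converter --model openai/whisper-small \
    --output_dir whisper-small-ct2 \
    --copy_files tokenizer.json preprocessor_config.json \
    --quantization float16
```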
## 🚀 Try it Online

👉 Test this model directly on Hugging Face Space
## 🧪 Example Usage (Python)
```python
import ctranslate2
import librosa
import transformers
from huggingface_hub import snapshot_download

# Step 1: Download the CTranslate2 model from Hugging Face
model_repo = "duonguyen/whisper-vietnamese-ct2"
model_dir = snapshot_download(repo_id=model_repo)

# Step 2: Load and preprocess the audio (Whisper expects 16 kHz mono)
audio_path = "replace with your audio path"
audio, _ = librosa.load(audio_path, sr=16000, mono=True)

# Step 3: Use the original Whisper processor for feature extraction
processor = transformers.WhisperProcessor.from_pretrained("openai/whisper-small", chunk_length=12)
inputs = processor(audio, return_tensors="np", sampling_rate=16000, do_normalize=True)
features = ctranslate2.StorageView.from_array(inputs.input_features)

# Step 4: Load the CTranslate2 model
model = ctranslate2.models.Whisper(model_dir)

# Step 5: Prepare the decoder prompt (language and task tokens)
language = "vi"
prompt = processor.tokenizer.convert_tokens_to_ids(
    [
        "<|startoftranscript|>",
        f"<|{language}|>",
        "<|transcribe|>",
        "<|notimestamps|>",
    ]
)

# Step 6: Transcribe
results = model.generate(features, [prompt])
transcription = processor.decode(results[0].sequences_ids[0], skip_special_tokens=True)
print("Transcription:", transcription)
```
⚠️ **Important:**
This model is currently optimized for audio chunks shorter than 12 seconds.
For longer audio inputs, pre-segment the audio with a VAD (Voice Activity Detection) model such as Silero VAD.
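If a VAD model is not available, a simpler fallback is to cut the waveform into fixed-length windows under the 12-second limit and transcribe each one. The sketch below assumes the audio is a 16 kHz mono NumPy array, as produced by the `librosa.load` call above; the function name `split_audio` is illustrative, not part of any library.

```python
import numpy as np

def split_audio(audio: np.ndarray, sr: int = 16000, max_seconds: float = 12.0):
    """Split a mono waveform into consecutive chunks no longer than max_seconds."""
    max_len = int(sr * max_seconds)
    return [audio[i:i + max_len] for i in range(0, len(audio), max_len)]

# Example: a 30-second clip at 16 kHz splits into 12 s + 12 s + 6 s chunks.
audio = np.zeros(30 * 16000, dtype=np.float32)
chunks = split_audio(audio)
print([len(c) / 16000 for c in chunks])  # [12.0, 12.0, 6.0]
```

Each chunk can then be fed through the feature extraction and `model.generate` steps shown earlier, and the partial transcriptions concatenated. Note that naive fixed-length cuts can split words at chunk boundaries, which is why a VAD-based segmentation is preferred.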