metythorn committed · Commit b88889a · verified · 1 Parent(s): 83d2860

Update model card README
---
license: mit
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
---
# Whisper Small Khmer ASR

Fine-tuned variant of [`openai/whisper-small`](https://huggingface.co/openai/whisper-small) for Khmer automatic speech recognition. The model was trained with the utilities in `whisper` and is intended for transcription workloads that prioritize Khmer text normalization, including numerals, currency, and date expressions.
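
Numeral normalization is one of the rules mentioned above. As a rough illustration of that kind of rule (this helper is hypothetical and is not the actual `khmerspeech`/`khmercut` implementation), Arabic digits can be mapped onto the Khmer digit block (U+17E0 to U+17E9):

```python
# Hypothetical sketch of a single normalization rule: map Arabic digits
# to Khmer digits. This is NOT the repo's actual normalization code.
ARABIC_TO_KHMER = str.maketrans("0123456789", "០១២៣៤៥៦៧៨៩")

def to_khmer_digits(text: str) -> str:
    """Replace every Arabic digit in `text` with its Khmer counterpart."""
    return text.translate(ARABIC_TO_KHMER)

print(to_khmer_digits("ឆ្នាំ 2024"))  # ឆ្នាំ ២០២៤
```

The real pipeline also handles currency and date expressions, which need rule-based rewriting rather than a plain character map.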

## Model Card

| Attribute | Value |
| --- | --- |
| **Base model** | `openai/whisper-small` |
| **Language** | Khmer (`km-KH`) |
| **Task** | Automatic Speech Recognition (speech-to-text) |
| **Sample rate** | 16 kHz audio, automatically resampled |
| **Input length** | Up to 30 s clips (truncated during batching) |
| **Finetuning data** | `asr_mixed_dataset.txt` (internal manifests, normalized through `dataset_builder.segment_text`) |
| **Epochs** | 10 |
| **Batch size** | 2 (gradient accumulation 1) |
| **Optimizer** | AdamW (managed by `Seq2SeqTrainer`) |
| **Learning rate** | 1e-6 with cosine scheduler & 1k warmup steps |
| **Normalization** | Khmer-specific regex and rule-based normalization (`khmerspeech`, `khmercut`) |
| **Dataset** | Mixed Khmer & English audio, 199K samples (225 hours): all public Khmer datasets plus a human-labeled dataset |
| **Training time** | ~1 day of mixed-precision training on an RTX 5090 (32 GB VRAM) |

> **Limitations:** performance has been validated only on internal validation/test splits. Long-form audio, accents outside the training distribution, or noisy backgrounds may degrade accuracy.

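The `wer` and `cer` metrics listed in the front matter are standard edit-distance rates: word-level (or character-level) Levenshtein distance divided by the reference length. A minimal, dependency-free sketch of WER (the example strings are made up; the actual evaluation presumably used a metrics library such as `jiwer` or `evaluate`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / len(ref)

print(wer("this is a test", "this is the test"))  # 0.25
```

CER is the same computation applied to characters instead of words.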
## Inference Examples

```python
import torch
import torchaudio
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

AUDIO_PATH = "audio_path.wav"

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "metythorn/whisper-small"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    task="automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

speech_waveform, sr = torchaudio.load(AUDIO_PATH)

# Whisper expects 16 kHz mono audio
if sr != 16000:
    speech_waveform = torchaudio.functional.resample(
        speech_waveform, orig_freq=sr, new_freq=16000
    )
if speech_waveform.shape[0] > 1:
    # down-mix multi-channel audio to mono before squeezing
    speech_waveform = speech_waveform.mean(dim=0, keepdim=True)
speech_waveform = speech_waveform.squeeze().numpy()

result = pipe(speech_waveform)
print("Transcription:", result["text"])
```
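
The model card notes that inputs are truncated to 30 s during batching. For longer recordings, one option is to pre-split the waveform before transcription; a sketch (the helper below is illustrative, not part of the repo):

```python
def chunk_waveform(samples, sample_rate=16000, chunk_seconds=30):
    """Split a 1-D sample sequence into consecutive chunks of at most chunk_seconds."""
    step = sample_rate * chunk_seconds
    return [samples[i:i + step] for i in range(0, len(samples), step)]

# e.g. a 70 s recording at 16 kHz splits into 30 s + 30 s + 10 s chunks
```

Each chunk can then be passed to `pipe` and the texts concatenated, though words that straddle a chunk boundary may be cut; the transformers ASR pipeline's built-in `chunk_length_s` argument, which uses overlapping strides, is usually the better choice for long-form audio.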