Nguyen Thai Khanh committed
Commit · 7788774
Parent(s): dc70350
Upload fine-tuned Vietnamese wav2vec2 ASR model
- README.md +125 -0
- preprocessor_config.json +9 -0
- vocab.json +1 -0
README.md
ADDED
@@ -0,0 +1,125 @@
---
language: vi
license: apache-2.0
base_model: nguyenvulebinh/wav2vec2-base-vi
tags:
- wav2vec2
- automatic-speech-recognition
- speech
- audio
- vietnamese
- pytorch
- CTC
datasets:
- custom-vietnamese-speech
metrics:
- wer
model-index:
- name: khanusa/nd_asr_wav2vec2
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Custom Vietnamese Speech Dataset
      type: custom
    metrics:
    - name: WER
      type: wer
      value: "TBD"  # Update with your actual WER score
---

# khanusa/nd_asr_wav2vec2

This is a fine-tuned wav2vec2 model for Vietnamese Automatic Speech Recognition (ASR), based on `nguyenvulebinh/wav2vec2-base-vi`.

## Model Description

- **Language:** Vietnamese
- **Task:** Automatic Speech Recognition
- **Base Model:** nguyenvulebinh/wav2vec2-base-vi
- **Architecture:** Wav2Vec2 + CTC Head
- **Training Framework:** PyTorch
- **Fine-tuning:** Custom Vietnamese speech dataset

## Usage

```python
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load model and processor
processor = Wav2Vec2Processor.from_pretrained("khanusa/nd_asr_wav2vec2")
model = Wav2Vec2ForCTC.from_pretrained("khanusa/nd_asr_wav2vec2")

# Load audio as mono and resample it to the 16 kHz rate used in training
audio, sr = librosa.load("path_to_your_audio.wav", sr=16000)

# Extract input features and predict
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: pick the most likely token per frame, then decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```
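
Since the training clips listed under Training Details are 7-11 seconds long, longer recordings may transcribe better when split into windows of similar length before being passed to the processor. The helper below is a minimal sketch of that idea, not part of this repository: `chunk_bounds` and its 10-second default are hypothetical choices, and hard cuts like these can split a word across two windows.

```python
def chunk_bounds(num_samples, sampling_rate=16000, max_seconds=10.0):
    """Return (start, end) sample indices that cover the signal in fixed-size windows.

    Hypothetical helper: the 10 s default is an assumption chosen to sit
    inside the 7-11 s clip range mentioned in the training details.
    """
    if num_samples < 0 or sampling_rate <= 0 or max_seconds <= 0:
        raise ValueError("num_samples must be >= 0; rate and duration must be > 0")
    window = int(sampling_rate * max_seconds)
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


# A 25-second recording at 16 kHz is split into 10 s + 10 s + 5 s windows.
bounds = chunk_bounds(25 * 16000)
print(bounds)  # [(0, 160000), (160000, 320000), (320000, 400000)]
```

Each `(start, end)` pair can then be used to slice the loaded waveform (for example `audio[start:end]`) and the slices transcribed one at a time with the code above.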

## Training Details

### Training Data
Custom Vietnamese speech dataset

### Training Procedure
- **Optimizer:** AdamW
- **Learning Rate:** 5e-6
- **Batch Size:** 8 (with 4 gradient accumulation steps, for an effective batch size of 32)
- **Epochs:** 50
- **Audio Duration:** 7-11 second clips
- **Sampling Rate:** 16 kHz
- **Features:** 16-bit PCM audio
- **Label Smoothing:** 0.1

### Training Configuration
- Mixed Precision Training (AMP)
- Gradient Clipping: 1.0
- Warmup Steps: 2000
- Early Stopping Patience: 8 epochs
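
To make the interaction of these hyperparameters concrete, here is a minimal sketch of the effective batch size, a warmup schedule, and patience-based early stopping. It is not the training script used for this model. The linear warmup shape, the constant rate after warmup, counting warmup in optimizer steps, and the `EarlyStopping` class are all assumptions made for illustration; only the numeric values come from the lists above.

```python
BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 4
BASE_LR = 5e-6
WARMUP_STEPS = 2000
PATIENCE = 8

# Gradients from 4 batches of 8 are accumulated before each optimizer step.
effective_batch_size = BATCH_SIZE * GRAD_ACCUM_STEPS  # 32


def lr_at_step(step, base_lr=BASE_LR, warmup_steps=WARMUP_STEPS):
    """Linear warmup to base_lr, then constant (both shapes are assumptions)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr


class EarlyStopping:
    """Hypothetical tracker: stop after `patience` epochs without improvement."""

    def __init__(self, patience=PATIENCE):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_metric):
        """Record one epoch's validation metric (lower is better); return True to stop."""
        if val_metric < self.best:
            self.best = val_metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


print(effective_batch_size, lr_at_step(1000), lr_at_step(5000))
```

With patience 8 and a 50-epoch budget, training ends early once the validation metric has failed to improve for eight consecutive epochs.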

## Performance

| Metric | Value |
|--------|-------|
| WER | 0.2123 |

*Note: Please update the WER value with your actual evaluation results.*
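
For readers who want to reproduce or update the number in the table, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The function below is a minimal, dependency-free sketch of that definition. The name `word_error_rate` and the example sentences are illustrative, and the model card does not state which tool or text normalization produced the reported value.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of four reference words gives a WER of 0.25.
print(word_error_rate("xin chào các bạn", "xin chào bạn"))  # 0.25
```

A corpus-level score is usually computed by summing edit counts over all utterances and dividing by the total number of reference words, which can differ from the average of per-utterance values.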

## Limitations and Bias

This model was fine-tuned from the Vietnamese base model `nguyenvulebinh/wav2vec2-base-vi` on a specific Vietnamese speech dataset and may not generalize well to:
- Different Vietnamese dialects
- Noisy environments not represented in the training data
- Domain-specific vocabulary outside the training scope
- Non-Vietnamese or code-switched speech (the base model and the fine-tuning data both target Vietnamese)
- Audio quality that differs from the training conditions

## Citation

```bibtex
@article{baevski2020wav2vec,
  title={wav2vec 2.0: A framework for self-supervised learning of speech representations},
  author={Baevski, Alexei and Zhou, Yuhao and Mohamed, Abdelrahman and Auli, Michael},
  journal={Advances in neural information processing systems},
  volume={33},
  pages={12449--12460},
  year={2020}
}
```

## License

This model is released under the Apache 2.0 License.
preprocessor_config.json
ADDED
@@ -0,0 +1,9 @@
{
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "normalizer": {
    "do_lower_case": true,
    "strip_accents": null,
    "keep_accents": true
  },
  "tokenizer_type": "Wav2Vec2CTCTokenizer"
}
vocab.json
ADDED
@@ -0,0 +1 @@
{"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6, "g": 7, "h": 8, "i": 9, "j": 10, "k": 11, "l": 12, "m": 13, "n": 14, "o": 15, "p": 16, "q": 17, "r": 18, "s": 19, "t": 20, "u": 21, "v": 22, "w": 23, "x": 24, "y": 25, "z": 26, "\u00e0": 27, "\u00e1": 28, "\u00e2": 29, "\u00e3": 30, "\u00e8": 31, "\u00e9": 32, "\u00ea": 33, "\u00ec": 34, "\u00ed": 35, "\u00f2": 36, "\u00f3": 37, "\u00f4": 38, "\u00f5": 39, "\u00f9": 40, "\u00fa": 41, "\u00fd": 42, "\u0103": 43, "\u0111": 44, "\u0129": 45, "\u0169": 46, "\u01a1": 47, "\u01b0": 48, "\u1ea1": 49, "\u1ea3": 50, "\u1ea5": 51, "\u1ea7": 52, "\u1ea9": 53, "\u1eab": 54, "\u1ead": 55, "\u1eaf": 56, "\u1eb1": 57, "\u1eb3": 58, "\u1eb5": 59, "\u1eb7": 60, "\u1eb9": 61, "\u1ebb": 62, "\u1ebd": 63, "\u1ebf": 64, "\u1ec1": 65, "\u1ec3": 66, "\u1ec5": 67, "\u1ec7": 68, "\u1ec9": 69, "\u1ecb": 70, "\u1ecd": 71, "\u1ecf": 72, "\u1ed1": 73, "\u1ed3": 74, "\u1ed5": 75, "\u1ed7": 76, "\u1ed9": 77, "\u1edb": 78, "\u1edd": 79, "\u1edf": 80, "\u1ee1": 81, "\u1ee3": 82, "\u1ee5": 83, "\u1ee7": 84, "\u1ee9": 85, "\u1eeb": 86, "\u1eed": 87, "\u1eef": 88, "\u1ef1": 89, "\u1ef3": 90, "\u1ef5": 91, "\u1ef7": 92, "\u1ef9": 93, "|": 0, "<bos>": 94, "<eos>": 95, "<unk>": 96, "<pad>": 97}
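
As a rough illustration of how this vocabulary is used at decode time, the sketch below applies greedy CTC collapsing to a sequence of token ids: repeated ids are merged, the blank token is dropped, and `|` becomes a space. It assumes that `<pad>` (id 97) serves as the CTC blank and that `|` (id 0) is the word delimiter, which matches the usual `Wav2Vec2CTCTokenizer` convention but is not stated in this commit. The function `ctc_greedy_decode` and the id sequence are hypothetical, and only a small subset of the vocabulary is reproduced.

```python
# Subset of vocab.json needed for the example (ids copied from the file above).
VOCAB = {"|": 0, "c": 3, "h": 8, "i": 9, "n": 14, "o": 15, "x": 24, "à": 27, "<pad>": 97}
ID_TO_TOKEN = {idx: token for token, idx in VOCAB.items()}


def ctc_greedy_decode(ids, id_to_token, blank_id=97, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and turn word delimiters into spaces.

    Assumes blank_id is the <pad> token, as is conventional for wav2vec2 CTC heads.
    """
    tokens = []
    previous = None
    for idx in ids:
        # Emit a token only when it differs from the previous frame and is not blank.
        if idx != previous and idx != blank_id:
            tokens.append(id_to_token[idx])
        previous = idx
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Per-frame argmax ids for "xin chào", with repeats and blanks as a CTC head would emit.
frame_ids = [24, 24, 97, 9, 14, 14, 0, 3, 8, 97, 27, 15, 15]
print(ctc_greedy_decode(frame_ids, ID_TO_TOKEN))  # xin chào
```

In the Usage example, `processor.batch_decode` performs this step; the sketch only shows the mechanics behind it.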