Upload fine-tuned Vietnamese wav2vec2 ASR model

Files changed (3)

README.md +125 -0
preprocessor_config.json +9 -0
vocab.json +1 -0

README.md ADDED

	@@ -0,0 +1,125 @@

+---
+language: vi
+license: apache-2.0
+base_model: nguyenvulebinh/wav2vec2-base-vi
+tags:
+- wav2vec2
+- automatic-speech-recognition
+- speech
+- audio
+- vietnamese
+- pytorch
+- CTC
+datasets:
+- custom-vietnamese-speech
+metrics:
+- wer
+model-index:
+- name: khanusa/nd_asr_wav2vec2
+  results:
+  - task:
+      name: Automatic Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: Custom Vietnamese Speech Dataset
+      type: custom
+    metrics:
+    - name: WER
+      type: wer
+      value: "TBD"  # Update with your actual WER score
+---
+# khanusa/nd_asr_wav2vec2
+This is a fine-tuned wav2vec2 model for Vietnamese Automatic Speech Recognition (ASR), based on `nguyenvulebinh/wav2vec2-base-vi`.
+## Model Description
+- **Language:** Vietnamese
+- **Task:** Automatic Speech Recognition
+- **Base Model:** nguyenvulebinh/wav2vec2-base-vi
+- **Architecture:** Wav2Vec2 + CTC Head
+- **Training Framework:** PyTorch
+- **Fine-tuning:** Custom Vietnamese speech dataset
+## Usage
+```python
+import torch
+import librosa
+from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+# Load model and processor
+processor = Wav2Vec2Processor.from_pretrained("khanusa/nd_asr_wav2vec2")
+model = Wav2Vec2ForCTC.from_pretrained("khanusa/nd_asr_wav2vec2")
+# Load and preprocess audio
+audio, sr = librosa.load("path_to_your_audio.wav", sr=16000)
+# Tokenize and predict
+inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
+with torch.no_grad():
+    logits = model(inputs.input_values).logits
+# Decode predictions
+predicted_ids = torch.argmax(logits, dim=-1)
+transcription = processor.batch_decode(predicted_ids)[0]
+print(transcription)
+```
+## Training Details
+### Training Data
+Custom Vietnamese speech dataset
+### Training Procedure
+- **Optimizer:** AdamW
+- **Learning Rate:** 5e-6
+- **Batch Size:** 8 with 4 gradient accumulation steps (effective batch size of 32 on a single device)
+- **Epochs:** 50
+- **Audio Duration:** 7-11 second clips
+- **Sampling Rate:** 16 kHz
+- **Features:** 16-bit PCM audio
+- **Label Smoothing:** 0.1
+### Training Configuration
+- Mixed Precision Training (AMP)
+- Gradient Clipping: 1.0
+- Warmup Steps: 2000
+- Early Stopping Patience: 8 epochs
+## Performance
+| Metric | Value |
+|--------|-------|
+| WER    | 0.2123   |
+*Note: the `model-index` metadata in this card's front matter still lists WER as `TBD`; keep it in sync with the value reported here.*
+## Limitations and Bias
+This model was fine-tuned from the Vietnamese base model `nguyenvulebinh/wav2vec2-base-vi` on a specific Vietnamese speech dataset and may not generalize well to:
+- Different Vietnamese dialects
+- Noisy environments not represented in training data
+- Domain-specific vocabulary outside of training scope
+- Audio quality different from training conditions
+## Citation
+```bibtex
+@article{baevski2020wav2vec,
+  title={wav2vec 2.0: A framework for self-supervised learning of speech representations},
+  author={Baevski, Alexei and Zhou, Yuhao and Mohamed, Abdelrahman and Auli, Michael},
+  journal={Advances in neural information processing systems},
+  volume={33},
+  pages={12449--12460},
+  year={2020}
+}
+```
+## License
+This model is released under the Apache 2.0 License.
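
Reviewer note on the Performance table in the README: the card does not say how the WER figure was obtained. WER is conventionally the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a minimal pure-Python illustration of that definition, not the evaluation script used for this model; the function name `word_error_rate` and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref_words)


# Hypothetical sentence pair, not taken from the model's evaluation data.
ref = "tôi đang học tiếng việt"
hyp = "tôi đang học tiến việt"
print(word_error_rate(ref, hyp))  # one substitution over five reference words
```

A corpus-level WER is usually computed by summing edit distances and reference word counts over all utterances before dividing, rather than averaging per-utterance scores.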

preprocessor_config.json ADDED

	@@ -0,0 +1,9 @@

+{
+  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
+  "normalizer": {
+    "do_lower_case": true,
+    "strip_accents": null,
+    "keep_accents": true
+  },
+  "tokenizer_type": "Wav2Vec2CTCTokenizer"
+}

vocab.json ADDED

	@@ -0,0 +1 @@

+ {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6, "g": 7, "h": 8, "i": 9, "j": 10, "k": 11, "l": 12, "m": 13, "n": 14, "o": 15, "p": 16, "q": 17, "r": 18, "s": 19, "t": 20, "u": 21, "v": 22, "w": 23, "x": 24, "y": 25, "z": 26, "\u00e0": 27, "\u00e1": 28, "\u00e2": 29, "\u00e3": 30, "\u00e8": 31, "\u00e9": 32, "\u00ea": 33, "\u00ec": 34, "\u00ed": 35, "\u00f2": 36, "\u00f3": 37, "\u00f4": 38, "\u00f5": 39, "\u00f9": 40, "\u00fa": 41, "\u00fd": 42, "\u0103": 43, "\u0111": 44, "\u0129": 45, "\u0169": 46, "\u01a1": 47, "\u01b0": 48, "\u1ea1": 49, "\u1ea3": 50, "\u1ea5": 51, "\u1ea7": 52, "\u1ea9": 53, "\u1eab": 54, "\u1ead": 55, "\u1eaf": 56, "\u1eb1": 57, "\u1eb3": 58, "\u1eb5": 59, "\u1eb7": 60, "\u1eb9": 61, "\u1ebb": 62, "\u1ebd": 63, "\u1ebf": 64, "\u1ec1": 65, "\u1ec3": 66, "\u1ec5": 67, "\u1ec7": 68, "\u1ec9": 69, "\u1ecb": 70, "\u1ecd": 71, "\u1ecf": 72, "\u1ed1": 73, "\u1ed3": 74, "\u1ed5": 75, "\u1ed7": 76, "\u1ed9": 77, "\u1edb": 78, "\u1edd": 79, "\u1edf": 80, "\u1ee1": 81, "\u1ee3": 82, "\u1ee5": 83, "\u1ee7": 84, "\u1ee9": 85, "\u1eeb": 86, "\u1eed": 87, "\u1eef": 88, "\u1ef1": 89, "\u1ef3": 90, "\u1ef5": 91, "\u1ef7": 92, "\u1ef9": 93, "|": 0, "<bos>": 94, "<eos>": 95, "<unk>": 96, "<pad>": 97}
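
Reviewer note on `vocab.json`: the following sketch illustrates how the ids above would turn frame-level predictions into text under greedy CTC decoding. It assumes that `<pad>` (id 97) acts as the CTC blank and that `|` (id 0) is the word delimiter, which are the usual `Wav2Vec2CTCTokenizer` conventions; this commit does not include a tokenizer config that confirms either. The `greedy_ctc_decode` helper and the frame sequence are hypothetical, and only a small excerpt of the vocabulary is reproduced.

```python
# Excerpt of the vocabulary; ids are copied from the vocab.json in this commit.
vocab = {"|": 0, "a": 1, "c": 3, "h": 8, "i": 9, "n": 14, "o": 15,
         "x": 24, "à": 27, "<pad>": 97}


def greedy_ctc_decode(frame_ids, vocab, blank="<pad>", delimiter="|"):
    """Collapse repeated ids, drop blanks, and map the delimiter to a space."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    blank_id = vocab[blank]
    tokens = []
    previous = None
    for idx in frame_ids:
        # Emit a token only when the id changes and is not the blank.
        if idx != previous and idx != blank_id:
            tokens.append(id_to_token[idx])
        previous = idx
    return "".join(tokens).replace(delimiter, " ").strip()


# Hypothetical per-frame argmax ids containing repeats and blanks.
frames = [24, 24, 97, 9, 14, 14, 0, 0, 3, 97, 8, 27, 27, 97, 15, 15]
print(greedy_ctc_decode(frames, vocab))
```

Because the blank resets the "previous id" state, a genuinely doubled letter survives decoding only when a blank frame separates the two occurrences.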