| 1 |
+
---
|
| 2 |
+
language:
|
| 3 |
+
- vi
|
| 4 |
+
- en
|
| 5 |
+
license: apache-2.0
|
| 6 |
+
tags:
|
| 7 |
+
- asr
|
| 8 |
+
- automatic-speech-recognition
|
| 9 |
+
- transformer
|
| 10 |
+
- vietnamese
|
| 11 |
+
- english
|
| 12 |
+
- bilingual
|
| 13 |
+
datasets:
|
| 14 |
+
- Cong123779/AI2Text-Bilingual-ASR-Dataset
|
| 15 |
+
metrics:
|
| 16 |
+
- wer
|
| 17 |
+
- cer
|
| 18 |
+
---
|
| 19 |
+
|
| 20 |
+
# AI2Text – Bilingual ASR (Vietnamese + English)
|
| 21 |
+
|
| 22 |
+
A **~30M-parameter** Transformer Seq2Seq Automatic Speech Recognition model
|
| 23 |
+
trained on **~224k** bilingual (Vietnamese + English) audio samples.
|
| 24 |
+
|
| 25 |
+
## Model Description
|
| 26 |
+
|
| 27 |
+
| Attribute | Value |
|
| 28 |
+
|---|---|
|
| 29 |
+
| Architecture | Encoder-Decoder Transformer |
|
| 30 |
+
| Parameters | ~30,325,164 |
|
| 31 |
+
| d_model | 256 |
|
| 32 |
+
| Encoder layers | 14 (RoPE + Flash Attention) |
|
| 33 |
+
| Decoder layers | 6 (causal, cross-attention) |
|
| 34 |
+
| Vocabulary size | 3,500 (SentencePiece BPE) |
|
| 35 |
+
| Language embedding | Yes (Vietnamese=0, English=1) |
|
| 36 |
+
| Normalization | RMSNorm |
|
| 37 |
+
| Activation | SiLU (Swish) |
|
| 38 |
+
| Positional encoding | Rotary (RoPE) |
|
| 39 |
+
|
| 40 |
+
### Modern Components
|
| 41 |
+
- **RMSNorm** – more efficient than LayerNorm
|
| 42 |
+
- **SiLU (Swish)** activation
|
| 43 |
+
- **Rotary Positional Embedding (RoPE)** – better generalization
|
| 44 |
+
- **Flash Attention (SDPA)** – memory-efficient attention
|
| 45 |
+
- **Hybrid CTC / Attention loss** – helps encoder learn alignment
|
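
The encoder carries an auxiliary CTC head alongside the attention decoder. Below is a minimal sketch of how such a hybrid objective is typically combined; the 0.2 weight matches the CTC weight in the training configuration further down, but the function and variable names are illustrative placeholders, not this repository's actual training code.

```python
import torch.nn as nn

ctc_weight = 0.2   # matches the "CTC weight" row in Training Configuration
pad_token_id = 0   # placeholder; use the tokenizer's real pad id
blank_id = 0       # placeholder CTC blank id

ctc_loss_fn = nn.CTCLoss(blank=blank_id, zero_infinity=True)
ce_loss_fn = nn.CrossEntropyLoss(ignore_index=pad_token_id)

def hybrid_loss(encoder_log_probs,  # (time, batch, vocab) log-softmax from the CTC head
                decoder_logits,     # (batch, tgt_len, vocab) from the attention decoder
                targets,            # (batch, tgt_len) padded target token ids
                input_lengths,      # (batch,) encoder output lengths
                target_lengths):    # (batch,) target lengths
    # CTC branch: frame-level alignment on the encoder output
    ctc = ctc_loss_fn(encoder_log_probs, targets, input_lengths, target_lengths)
    # Attention branch: token-level cross-entropy on the decoder output
    ce = ce_loss_fn(decoder_logits.reshape(-1, decoder_logits.size(-1)),
                    targets.reshape(-1))
    # Weighted sum, as in standard hybrid CTC/attention training
    return ctc_weight * ctc + (1.0 - ctc_weight) * ce
```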

## Training Data

Trained on `Cong123779/AI2Text-Bilingual-ASR-Dataset`:
- **Train**: ~194,167 samples (77% Vietnamese, 23% English)
- **Validation**: ~30,123 samples

Audio format: 16 kHz mono WAV, 80-dim Mel-spectrogram features.
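
For reference, here is a minimal sketch of a compatible front-end using torchaudio. It is an assumption for illustration only: the repository's own `AudioProcessor` (shown in the Usage section below) is the authoritative implementation, and the `n_fft` / `hop_length` values are common defaults, not confirmed by this card.

```python
import torch
import torchaudio

# Load, resample to 16 kHz, and downmix to mono
waveform, sr = torchaudio.load("audio.wav")   # (channels, samples)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.mean(dim=0, keepdim=True)

# 80-bin Mel spectrogram; n_fft / hop_length are assumed, typical values
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
)
mel = mel_transform(waveform)                                # (1, 80, time)
log_mel = torch.log(mel + 1e-6).squeeze(0).transpose(0, 1)   # (time, 80)
```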

## Training Configuration

| Hyperparameter | Value |
|---|---|
| Batch size | 32 (effective 128 w/ grad-accum × 4) |
| Learning rate | 3e-4 |
| Epochs | 50 |
| Warmup | 3% of training steps |
| Mixed precision | bfloat16 (AMP) |
| Gradient clipping | 0.5 |
| CTC weight | 0.2 |
| Scheduled sampling | 1.0 → 0.5 (linear) |
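
As a rough illustration of the warmup and scheduled-sampling rows above, the helpers below compute a linear LR warmup over the first 3% of steps and a linear decay of the teacher-forcing probability from 1.0 to 0.5. These are illustrative only; the exact decay horizon and the training code's real function names are not given by this card.

```python
def warmup_scale(step: int, total_steps: int, warmup_frac: float = 0.03) -> float:
    """Linear LR warmup over the first 3% of training steps, then full scale."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    return min(1.0, step / warmup_steps)

def teacher_forcing_prob(step: int, total_steps: int,
                         start: float = 1.0, end: float = 0.5) -> float:
    """Scheduled sampling: teacher-forcing probability decays linearly 1.0 -> 0.5."""
    progress = min(1.0, step / max(1, total_steps))
    return start + (end - start) * progress
```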

## Usage

```python
import torch
import sys

# Clone the repo and add to path
sys.path.insert(0, "AI2Text")

from models.asr_base import ASRModel
from preprocessing.sentencepiece_tokenizer import SentencePieceTokenizer
from preprocessing.audio_processing import AudioProcessor

# Load tokenizer
tokenizer = SentencePieceTokenizer("models/tokenizer_vi_en_3500.model")

# Load model
checkpoint = torch.load("best_model.pt", map_location="cpu")
config = checkpoint.get("config", {})

model = ASRModel(
    input_dim=80,
    vocab_size=3500,
    d_model=256,
    num_encoder_layers=14,
    num_decoder_layers=6,
    num_heads=8,
    d_ff=2048,
    num_languages=2,
)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Transcribe
audio_processor = AudioProcessor(sample_rate=16000, n_mels=80)
features = audio_processor.process("audio.wav")  # (time, 80)
features = features.unsqueeze(0)                 # (1, time, 80)
lengths = torch.tensor([features.size(1)])

with torch.no_grad():
    tokens = model.generate(
        features, lengths=lengths,
        language_ids=torch.tensor([0]),  # 0=vi, 1=en
        max_len=128,
        sos_token_id=tokenizer.sos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
text = tokenizer.decode(tokens[0].tolist())
print(text)
```
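
To transcribe English audio, pass `language_ids=torch.tensor([1])` instead; per the Model Description table, the language embedding maps Vietnamese to 0 and English to 1.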

## Framework

Built with PyTorch. Optimized for an **RTX 5060 Ti 16 GB / Ryzen 9 9990X / 64 GB RAM** setup.

## License

Apache 2.0