---
library_name: transformers
language:
- hy
tags:
- asr
- audio
- speech
- whisper
- low-resource
- morpheme-tokenization
- armenian
- compact-model
- generated_from_trainer
- onnx
datasets:
- Chillarmo/common_voice_20_armenian
license: mit
metrics:
- wer
pipeline_tag: automatic-speech-recognition
model-index:
- name: ATOM (Armenian Tiny Optimized Model)
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 20.0 Armenian
      type: mozilla-foundation/common_voice_20_0
      config: hy
      split: test
    metrics:
    - type: wer
      value: 50.3
      name: Word Error Rate
    - type: exact_match
      value: 10.06
      name: Exact Match
---

# ATOM: Armenian Tiny Optimized Model

A compact, morpheme-aware Automatic Speech Recognition (ASR) model that **outperforms most of OpenAI's Whisper models** on Armenian speech recognition.

## Model Description

ATOM is a specialized ASR model for low-resource Armenian, achieving **57.6% lower WER** than vanilla Whisper-tiny **on Armenian** (50.3% vs. 118.6%) while using **about 64% fewer parameters** (14M vs. 39M).
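The relative reduction quoted above is the baseline WER minus the model WER, divided by the baseline WER. A minimal Python sketch, assuming the 50.3% test WER from this card's metadata and the 118.6% Whisper-tiny figure from the results table (the function name `relative_wer_reduction` is illustrative and not part of any library):

```python
def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    """Return the relative WER reduction of a model over a baseline, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - model_wer) / baseline_wer


# WER values in percent, as reported in this model card.
whisper_tiny_wer = 118.6
atom_wer = 50.3

reduction = relative_wer_reduction(whisper_tiny_wer, atom_wer)
print(f"Relative WER reduction vs. Whisper-tiny: {reduction:.1f}%")
```

A WER above 100%, as in the Whisper-tiny baseline, is possible because inserted words count as errors alongside substitutions and deletions.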
The model combines:

- **Frozen Whisper-tiny encoder** (pre-trained audio feature extraction)
- **Custom compact decoder** (2 layers, trained from scratch on Armenian)
- **Morpheme-level BPE tokenization** (5,000 tokens optimized for Armenian morphology vs. Whisper's 51k multilingual tokens)

### Architecture

```
Input: Audio (16kHz)
    ↓
Whisper Encoder (frozen, 4 layers, 384 hidden, 1536 FFN)
    ↓
Compact Decoder (trainable, 2 layers, 384 hidden, 1024 FFN)
    ↓
Morpheme Vocabulary (5,000 tokens)
    ↓
Output: Armenian Text
```

**Total Parameters:** ~14M (about 64% smaller than Whisper-tiny's 39M)

## Performance

Evaluated on the Common Voice 20.0 Armenian test set:

| Model | Parameters | WER (Armenian) | Relative WER Reduction vs. Whisper-tiny |
|-------|------------|----------------|-----------------------------------------|
| Whisper-tiny | 39M | 118.6%\* | Baseline |
| Whisper-base | 74M | 126.3%\* | -6.5% (worse) |
| Whisper-small | 244M | 86.6%\* | +27.0% |
| Whisper-medium | 769M | 60.1%\* | +49.3% |
| Whisper-large | 1550M | 53.7%\* | +54.7% |
| Whisper-large-v2 | 1550M | 44.6%\* | +62.4% |
| **ATOM** | **14M** | **50.3%** | **+57.6%** ✅ |

\*Whisper WER values for Armenian are taken from published benchmarks.

### Key Insights:

- **ATOM outperforms every Whisper model except large-v2 on Armenian**, including Whisper-large, which is about 110× larger
- **Word Error Rate (WER):** 50.3% vs. Whisper-tiny's 118.6% on Armenian
- **Model Size:** 14M parameters (about 64% smaller than Whisper-tiny, about 110× smaller than Whisper-large-v2)
- **Training Efficiency:** Trained on approximately 30 hours of Armenian speech vs. Whisper's 680k hours of multilingual audio

**Note:** Whisper models perform better on high-resource languages (e.g., Whisper-tiny: 79.0% average WER) than on low-resource Armenian (118.6% WER), which motivates a language-specific approach.

## Why ATOM Outperforms Whisper

1. **Morpheme-Aware Tokenization:** Armenian is a morphologically rich language with agglutinative features, where words combine multiple morphemes (e.g., "չէինք" = "չ" [negation] + "է" [to be] + "ինք" [we/past]). ATOM's morpheme-level vocabulary (5k tokens) captures this structure better than Whisper's multilingual byte-level BPE (51k tokens).
2. **Language-Specific Training:** While Whisper is trained on 99 languages (680k hours), ATOM's decoder is trained exclusively on Armenian, allowing it to specialize in Armenian phonology and morphology.
3. **Efficient Architecture:** The compact 2-layer decoder limits overfitting on the small training set, while the frozen pre-trained encoder provides robust audio feature extraction.
4. **Low-Resource Optimization:** Whisper's multilingual training spreads capacity across languages, which disadvantages low-resource Armenian. ATOM dedicates all decoder capacity to Armenian.

## Intended Uses

**Primary Uses:**

- Armenian speech-to-text transcription
- Real-time subtitling for Armenian content
- Accessibility tools for Armenian speakers
- Research on morpheme-aware ASR for agglutinative languages

**Best Performance:**

- Clear speech in quiet environments
- Native Armenian speakers
- Standard Eastern/Western Armenian dialects

## Limitations

- Trained on limited data (approximately 30 hours of speech)
- May struggle with heavy accents or noisy audio
- Optimized for Armenian only (not multilingual)
- The 10% exact match rate shows that fully correct transcriptions are still rare
- Performance may degrade on out-of-domain audio (non-Common Voice data)

## Training Details

### Training Data

- **Dataset:** Common Voice 20.0 Armenian
- **Splits Used:** Train + Other
- **Duration:** Approximately 30 hours of Armenian speech
- **Speakers:** 400+ unique speakers
- **Demographics:**
  - Gender: 55% Female, 25% Male, 20% Undefined
  - Age: Primarily 20s-30s (70%+)
- **Test Set:** Common Voice test split (separate, unseen data)

### Training Hyperparameters

```yaml
learning_rate: 1e-4
train_batch_size: 32
gradient_accumulation_steps: 1
warmup_steps: 500
max_steps: 12000
save_steps: 3000
fp16: true
optimizer: AdamW (torch)
lr_scheduler_type: cosine
max_grad_norm: 1.0
gradient_checkpointing: true
dataloader_num_workers: 8
```

### Training Infrastructure

- **GPU:** NVIDIA RTX 3060 Ti with FP16 mixed precision
- **Framework:**
  - Transformers 4.56.2
  - PyTorch 2.8.0+cu129
  - Datasets 3.5.0
  - Tokenizers 0.22.1
- **Final Checkpoint:** Step 9,000
- **Evaluation Loss:** 1.36

## Usage

### Installation

```bash
pip install transformers torch torchaudio
```

### Basic Inference

```python
import torch
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("Chillarmo/ATOM")
processor = WhisperProcessor.from_pretrained("Chillarmo/ATOM")
model.eval()

# Load audio, downmix to mono, and resample to 16 kHz if needed
audio, sr = torchaudio.load("audio.wav")  # shape: (channels, samples)
audio = audio.mean(dim=0)
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)

# Extract log-mel input features
input_features = processor(
    audio.numpy(),
    sampling_rate=16000,
    return_tensors="pt",
).input_features

# Generate token IDs with beam search
with torch.no_grad():
    predicted_ids = model.generate(
        input_features,
        max_length=448,
        num_beams=5,
        repetition_penalty=1.2,
        no_repeat_ngram_size=3,
    )

# Decode
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

### Advanced Usage with Pipeline

```python
import torch
from transformers import pipeline

# Create ASR pipeline (first GPU if available, otherwise CPU)
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="Chillarmo/ATOM",
    device=0 if torch.cuda.is_available() else -1,
)

# Transcribe
result = asr_pipeline(
    "audio.wav",
    generate_kwargs={
        "max_length": 448,
        "num_beams": 5,
        "repetition_penalty": 1.2,
    },
)
print(result["text"])
```

## Technical Details

### Morpheme Tokenization

The model uses a custom BPE tokenizer trained on Armenian text with morpheme-level granularity:

- **Vocabulary Size:** 5,000 tokens
- **Special Tokens:** ``, ``, ``, ``
- **Training Corpus:** Armenian Wikipedia + Common Voice transcriptions
- **Morpheme Segmentation:** Whitespace pre-tokenization optimized for Armenian word structure

Example tokenization:

```
Word: "չէինք" (we were not)
Morphemes: ["չ", "է", "ինք"]
Translation: [negation] + [to be] + [we/past]
```

### Model Architecture

**Encoder (Frozen):**

- 4 Transformer encoder layers
- 384 hidden dimensions
- 1536 feed-forward dimensions
- 6 attention heads
- Pre-trained on Whisper's 680k-hour multilingual dataset

**Decoder (Trained from Scratch):**

- 2 Transformer decoder layers (50% reduction)
- 384 hidden dimensions
- 1024 feed-forward dimensions (33% reduction)
- 6 attention heads
- Trained exclusively on Armenian

## Reproduction

To reproduce training:

```bash
# Install dependencies
pip install transformers datasets evaluate jiwer accelerate

# Train
python train.py \
  --model_name_or_path openai/whisper-tiny \
  --dataset Chillarmo/common_voice_20_armenian \
  --output_dir ./atom-model \
  --learning_rate 1e-4 \
  --per_device_train_batch_size 32 \
  --max_steps 12000 \
  --fp16 \
  --save_steps 3000
```

## Citation

```bibtex
@misc{movsesyan2025atom,
  title={ATOM: Morpheme-Aware Whisper for Low-Resource Armenian ASR},
  author={Movsesyan, Movses},
  year={2025},
  institution={California State University, Sacramento}
}
```

## References

Whisper Armenian benchmarks from published evaluations on Common Voice datasets.

## Acknowledgments

- Built on OpenAI's Whisper architecture ([Radford et al., 2022](https://arxiv.org/abs/2212.04356))
- Trained on Mozilla Common Voice data
- Morpheme tokenization inspired by Armenian linguistic structure
- California State University, Sacramento

## License

MIT

---

**Model Card Contact:** movsesmovsesyan@csus.edu
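As a sanity check on model size, the weight-matrix parameter count can be estimated from the dimensions listed under Model Architecture. This is a rough sketch, not an exact count. It assumes standard Transformer layers, tied input/output token embeddings, a Whisper-style encoder front end (80 mel bins, two kernel-3 convolutions), and 448 learned decoder positions. Biases, layer norms, and the encoder's fixed positional table are ignored. All function names are illustrative.

```python
def attention_params(d_model: int) -> int:
    # Query, key, value, and output projection matrices.
    return 4 * d_model * d_model


def ffn_params(d_model: int, d_ffn: int) -> int:
    # Up-projection and down-projection matrices.
    return 2 * d_model * d_ffn


def encoder_params(layers: int = 4, d_model: int = 384, d_ffn: int = 1536,
                   n_mels: int = 80, kernel: int = 3) -> int:
    # Assumed Whisper-style front end: two 1-D convolutions over mel features.
    conv = n_mels * d_model * kernel + d_model * d_model * kernel
    per_layer = attention_params(d_model) + ffn_params(d_model, d_ffn)
    return conv + layers * per_layer


def decoder_params(layers: int = 2, d_model: int = 384, d_ffn: int = 1024,
                   vocab: int = 5000, positions: int = 448) -> int:
    # Each decoder layer has self-attention and cross-attention.
    per_layer = 2 * attention_params(d_model) + ffn_params(d_model, d_ffn)
    # Token embeddings (assumed tied with the output layer) plus positions.
    embeddings = vocab * d_model + positions * d_model
    return layers * per_layer + embeddings


enc = encoder_params()
dec = decoder_params()
print(f"Encoder: {enc / 1e6:.1f}M, decoder: {dec / 1e6:.1f}M, total: {(enc + dec) / 1e6:.1f}M")
```

Under these assumptions the estimate comes to roughly 13.6M parameters, in line with the 14M reported in the results table.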