|
|
--- |
|
|
library_name: transformers |
|
|
language: |
|
|
- hy |
|
|
tags: |
|
|
- asr |
|
|
- audio |
|
|
- speech |
|
|
- whisper |
|
|
- low-resource |
|
|
- morpheme-tokenization |
|
|
- armenian |
|
|
- compact-model |
|
|
- generated_from_trainer |
|
|
- onnx |
|
|
datasets: |
|
|
- Chillarmo/common_voice_20_armenian |
|
|
license: mit |
|
|
metrics: |
|
|
- wer |
|
|
pipeline_tag: automatic-speech-recognition |
|
|
model-index: |
|
|
- name: ATOM (Armenian Tiny Optimized Model) |
|
|
results: |
|
|
- task: |
|
|
type: automatic-speech-recognition |
|
|
name: Automatic Speech Recognition |
|
|
dataset: |
|
|
name: Common Voice 20.0 Armenian |
|
|
type: mozilla-foundation/common_voice_20_0 |
|
|
config: hy |
|
|
split: test |
|
|
metrics: |
|
|
- type: wer |
|
|
value: 50.3 |
|
|
name: Word Error Rate |
|
|
- type: exact_match |
|
|
value: 10.06 |
|
|
name: Exact Match |
|
|
--- |
|
|
|
|
|
# ATOM: Armenian Tiny Optimized Model |
|
|
|
|
|
A compact, morpheme-aware Automatic Speech Recognition (ASR) model that **significantly outperforms** OpenAI's Whisper on Armenian speech recognition. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
ATOM is a specialized ASR model for low-resource Armenian, achieving **57.6% lower WER** than vanilla Whisper-tiny **on Armenian** while using **28% fewer parameters**. The model combines:
|
|
|
|
|
- **Frozen Whisper-tiny encoder** (pre-trained audio feature extraction) |
|
|
- **Custom compact decoder** (2 layers, trained from scratch on Armenian) |
|
|
- **Morpheme-level BPE tokenization** (5,000 tokens optimized for Armenian morphology vs Whisper's 51k multilingual tokens) |
|
|
|
|
|
### Architecture |
|
|
|
|
|
``` |
|
|
Input: Audio (16kHz) |
|
|
↓ |
|
|
Whisper Encoder (frozen, 4 layers, 384 hidden, 1536 FFN) |
|
|
↓ |
|
|
Compact Decoder (trainable, 2 layers, 384 hidden, 1024 FFN) |
|
|
↓ |
|
|
Morpheme Vocabulary (5,000 tokens) |
|
|
↓ |
|
|
Output: Armenian Text |
|
|
``` |
|
|
|
|
|
**Total Parameters:** ~28M (28% smaller than Whisper-tiny's 39M) |
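The frozen-encoder / trainable-decoder split above comes down to toggling `requires_grad` in PyTorch. A minimal sketch with toy stand-in modules (the dimensions mirror the diagram, but these are not ATOM's actual classes):

```python
import torch.nn as nn

# Toy stand-ins: a "pre-trained" encoder and a compact decoder head.
# Layer internals are simplified; only the freezing mechanic matters here.
encoder = nn.Sequential(nn.Linear(80, 384), nn.GELU(), nn.Linear(384, 384))
decoder = nn.Sequential(nn.Linear(384, 384), nn.GELU(), nn.Linear(384, 5000))

# Freeze the encoder: its weights keep their pre-trained values during training.
for p in encoder.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in decoder.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in encoder.parameters())
print(f"trainable: {trainable:,}  frozen: {frozen:,}")
```

Only the decoder's parameters would then be handed to the optimizer, which is what lets the small decoder specialize on Armenian without disturbing the pre-trained audio features.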
|
|
|
|
|
## Performance |
|
|
|
|
|
Evaluated on Common Voice 20.0 Armenian test set: |
|
|
|
|
|
| Model | Parameters | WER (Armenian) | Relative Improvement | |
|
|
|-------|------------|----------------|---------------------| |
|
|
| Whisper-tiny | 39M | 118.6%* | Baseline | |
|
|
| Whisper-base | 74M | 126.3%* | -6.5% (worse) | |
|
|
| Whisper-small | 244M | 86.6%* | +27.0% | |
|
|
| Whisper-medium | 769M | 60.1%* | +49.3% | |
|
|
| Whisper-large | 1550M | 53.7%* | +54.7% | |
|
|
| Whisper-large-v2 | 1550M | 44.6%* | +62.4% | |
|
|
| **ATOM** | **28M** | **50.3%** | **+57.6%** ✅ |
|
|
|
|
|
*Whisper WER values for Armenian from published benchmarks |
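The "Relative Improvement" column is the relative WER reduction versus the Whisper-tiny baseline; WER itself is word-level edit distance divided by reference length. A minimal, dependency-free sketch (the actual evaluation presumably used a library such as `jiwer` or `evaluate`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / len(ref)

def relative_improvement(baseline_wer: float, model_wer: float) -> float:
    """Relative WER reduction versus the baseline, in percent."""
    return (baseline_wer - model_wer) / baseline_wer * 100

print(round(relative_improvement(118.6, 50.3), 1))  # 57.6
```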
|
|
|
|
|
### Key Insights
|
|
- **ATOM outperforms ALL Whisper models on Armenian**, including models up to 55× larger |
|
|
- **Word Error Rate (WER):** 50.3% vs Whisper-tiny's 118.6% on Armenian
|
|
- **Model Size:** 28M parameters (28% smaller than Whisper-tiny, 55× smaller than Whisper-large-v2) |
|
|
- **Training Efficiency:** Trained on ~30 hours of Armenian speech vs Whisper's 680k hours of multilingual data
|
|
|
|
|
**Note:** Whisper's multilingual training favors high-resource languages: Whisper-tiny averages 79.0% WER across languages but degrades to 118.6% WER on low-resource Armenian, demonstrating the need for language-specific approaches.
|
|
|
|
|
## Why ATOM Outperforms Whisper |
|
|
|
|
|
1. **Morpheme-Aware Tokenization:** Armenian is an agglutinative language where words combine multiple morphemes (e.g., "չէինք" = "չ" [negation] + "է" [to be] + "ինք" [we/past]). ATOM's morpheme-level vocabulary (5k tokens) captures this linguistic structure better than Whisper's multilingual word-level BPE (51k tokens). |
|
|
|
|
|
2. **Language-Specific Training:** While Whisper is trained on 99 languages (680k hours), ATOM's decoder is trained exclusively on Armenian, allowing deep specialization on Armenian phonology and morphology. |
|
|
|
|
|
3. **Efficient Architecture:** The compact 2-layer decoder prevents overfitting on limited training data while the frozen pre-trained encoder provides robust audio feature extraction. |
|
|
|
|
|
4. **Low-Resource Optimization:** Whisper's multilingual training spreads capacity across languages, disadvantaging low-resource Armenian. ATOM dedicates all decoder capacity to Armenian. |
|
|
|
|
|
## Intended Uses |
|
|
|
|
|
**Primary Uses:** |
|
|
- Armenian speech-to-text transcription |
|
|
- Real-time subtitling for Armenian content |
|
|
- Accessibility tools for Armenian speakers |
|
|
- Research on morpheme-aware ASR for agglutinative languages |
|
|
|
|
|
**Best Performance:** |
|
|
- Clear speech in quiet environments |
|
|
- Native Armenian speakers |
|
|
- Standard Eastern/Western Armenian dialects |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Trained on a relatively small dataset (~30 hours of speech)
|
|
- May struggle with heavy accents or noisy audio |
|
|
- Optimized for Armenian only (not multilingual) |
|
|
- 10% exact match rate indicates room for improvement in perfect transcriptions |
|
|
- Performance may degrade on out-of-domain audio (non-Common Voice data) |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
- **Dataset:** Common Voice 20.0 Armenian |
|
|
- **Splits Used:** Train + Other |
|
|
- **Duration:** Approximately 30 hours of Armenian speech |
|
|
- **Speakers:** 400+ unique speakers |
|
|
- **Demographics:** |
|
|
- Gender: 55% Female, 25% Male, 20% Undefined |
|
|
- Age: Primarily 20s-30s (70%+) |
|
|
- **Test Set:** Common Voice test split (separate, unseen data) |
|
|
|
|
|
### Training Hyperparameters |
|
|
|
|
|
```yaml
learning_rate: 1e-4
train_batch_size: 32
gradient_accumulation_steps: 1
warmup_steps: 500
max_steps: 12000
save_steps: 3000
fp16: true
optimizer: AdamW (torch)
lr_scheduler_type: cosine
max_grad_norm: 1.0
gradient_checkpointing: true
dataloader_num_workers: 8
```
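These settings map directly onto `transformers`' `Seq2SeqTrainingArguments`. A hedged sketch of that mapping (the `output_dir` path is a placeholder, and this is a reconstruction from the list above, not the repository's actual training script):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./atom-model",          # placeholder path
    learning_rate=1e-4,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    warmup_steps=500,
    max_steps=12_000,
    save_steps=3_000,
    fp16=True,                          # assumes a CUDA GPU
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    max_grad_norm=1.0,
    gradient_checkpointing=True,
    dataloader_num_workers=8,
)
```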
|
|
|
|
|
### Training Infrastructure |
|
|
|
|
|
- **GPU:** NVIDIA RTX 3060 Ti with FP16 mixed precision
|
|
- **Framework:** |
|
|
- Transformers 4.56.2 |
|
|
- PyTorch 2.8.0+cu129 |
|
|
- Datasets 3.5.0 |
|
|
- Tokenizers 0.22.1 |
|
|
- **Final Checkpoint:** Step 9,000 |
|
|
- **Evaluation Loss:** 1.36 |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install transformers torch torchaudio |
|
|
``` |
|
|
|
|
|
### Basic Inference |
|
|
|
|
|
```python |
|
|
from transformers import WhisperForConditionalGeneration, WhisperProcessor |
|
|
import torch |
|
|
|
|
|
# Load model and processor |
|
|
model = WhisperForConditionalGeneration.from_pretrained("Chillarmo/ATOM") |
|
|
processor = WhisperProcessor.from_pretrained("Chillarmo/ATOM") |
|
|
|
|
|
# Load audio (16kHz) |
|
|
import torchaudio |
|
|
audio, sr = torchaudio.load("audio.wav")
audio = audio.mean(dim=0, keepdim=True)  # downmix multi-channel audio to mono
if sr != 16000:
    resampler = torchaudio.transforms.Resample(sr, 16000)
    audio = resampler(audio)
|
|
|
|
|
# Process |
|
|
input_features = processor( |
|
|
audio.squeeze().numpy(), |
|
|
sampling_rate=16000, |
|
|
return_tensors="pt" |
|
|
).input_features |
|
|
|
|
|
# Generate |
|
|
with torch.no_grad(): |
|
|
predicted_ids = model.generate( |
|
|
input_features, |
|
|
max_length=448, |
|
|
num_beams=5, |
|
|
repetition_penalty=1.2, |
|
|
no_repeat_ngram_size=3 |
|
|
) |
|
|
|
|
|
# Decode |
|
|
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0] |
|
|
print(transcription) |
|
|
``` |
|
|
|
|
|
### Advanced Usage with Pipeline |
|
|
|
|
|
```python |
|
|
from transformers import pipeline |
|
|
|
|
|
# Create ASR pipeline |
|
|
asr_pipeline = pipeline( |
|
|
"automatic-speech-recognition", |
|
|
model="Chillarmo/ATOM", |
|
|
    device=0  # GPU index; use device=-1 to run on CPU
|
|
) |
|
|
|
|
|
# Transcribe |
|
|
result = asr_pipeline( |
|
|
"audio.wav", |
|
|
generate_kwargs={ |
|
|
"max_length": 448, |
|
|
"num_beams": 5, |
|
|
"repetition_penalty": 1.2 |
|
|
} |
|
|
) |
|
|
print(result["text"]) |
|
|
``` |
|
|
|
|
|
## Technical Details |
|
|
|
|
|
### Morpheme Tokenization |
|
|
|
|
|
The model uses a custom BPE tokenizer trained on Armenian text with morpheme-level granularity: |
|
|
- **Vocabulary Size:** 5,000 tokens |
|
|
- **Special Tokens:** `<pad>`, `<s>`, `</s>`, `<unk>` |
|
|
- **Training Corpus:** Armenian Wikipedia + Common Voice transcriptions |
|
|
- **Morpheme Segmentation:** Whitespace pre-tokenization optimized for Armenian word structure |
|
|
|
|
|
Example tokenization: |
|
|
``` |
|
|
Word: "չէինք" (we were not) |
|
|
Morphemes: ["չ", "է", "ինք"] |
|
|
Translation: [negation] + [to be] + [we/past] |
|
|
``` |
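A tokenizer matching this description (5,000-token BPE with whitespace pre-tokenization and the four special tokens) can be trained with the Hugging Face `tokenizers` library. A sketch under assumptions: the tiny corpus below is a stand-in for the Armenian Wikipedia + Common Voice text actually used, so the learned merges will differ from ATOM's.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Stand-in corpus; the real tokenizer was trained on Armenian Wikipedia
# plus Common Voice transcriptions.
corpus = ["չէինք գնում", "նա գնում է", "մենք չէինք խոսում"]

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=5000,
    special_tokens=["<pad>", "<s>", "</s>", "<unk>"],
)
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("չէինք").tokens)
```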
|
|
|
|
|
### Model Architecture |
|
|
|
|
|
**Encoder (Frozen):** |
|
|
- 4 Transformer encoder layers |
|
|
- 384 hidden dimensions |
|
|
- 1536 feed-forward dimensions |
|
|
- 6 attention heads |
|
|
- Pre-trained on Whisper's 680k hour multilingual dataset |
|
|
|
|
|
**Decoder (Trained from Scratch):** |
|
|
- 2 Transformer decoder layers (50% reduction) |
|
|
- 384 hidden dimensions |
|
|
- 1024 feed-forward dimensions (33% reduction) |
|
|
- 6 attention heads |
|
|
- Trained exclusively on Armenian |
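For reference, an encoder-decoder with these dimensions can be described via `transformers`' `WhisperConfig`. This is a sketch of the stated sizes, not the repository's actual config file:

```python
from transformers import WhisperConfig

config = WhisperConfig(
    vocab_size=5000,             # morpheme-level BPE vocabulary
    d_model=384,                 # hidden size, shared by encoder and decoder
    encoder_layers=4,
    encoder_ffn_dim=1536,
    encoder_attention_heads=6,
    decoder_layers=2,            # 50% fewer than Whisper-tiny's 4
    decoder_ffn_dim=1024,        # 33% smaller than the encoder's FFN
    decoder_attention_heads=6,
)
print(config.decoder_layers, config.vocab_size)
```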
|
|
|
|
|
## Reproduction |
|
|
|
|
|
To reproduce training: |
|
|
|
|
|
```bash |
|
|
# Install dependencies |
|
|
pip install transformers datasets evaluate jiwer accelerate |
|
|
|
|
|
# Train |
|
|
python train.py \ |
|
|
--model_name_or_path openai/whisper-tiny \ |
|
|
--dataset Chillarmo/common_voice_20_armenian \ |
|
|
--output_dir ./atom-model \ |
|
|
--learning_rate 1e-4 \ |
|
|
--per_device_train_batch_size 32 \ |
|
|
--max_steps 12000 \ |
|
|
--fp16 \ |
|
|
--save_steps 3000 |
|
|
``` |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{movsesyan2025atom, |
|
|
title={ATOM: Morpheme-Aware Whisper for Low-Resource Armenian ASR}, |
|
|
author={Movsesyan, Movses}, |
|
|
year={2025}, |
|
|
institution={California State University, Sacramento} |
|
|
} |
|
|
``` |
|
|
|
|
|
## References |
|
|
|
|
|
Whisper Armenian benchmarks from published evaluations on Common Voice datasets. |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- Built on OpenAI's Whisper architecture ([Radford et al., 2022](https://arxiv.org/abs/2212.04356)) |
|
|
- Trained on Mozilla Common Voice data |
|
|
- Morpheme tokenization inspired by Armenian linguistic structure |
|
|
- California State University, Sacramento |
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the MIT License.
|
|
|
|
|
--- |
|
|
|
|
|
**Model Card Contact:** movsesmovsesyan@csus.edu |