---
library_name: transformers
language:
  - hy
tags:
  - asr
  - audio
  - speech
  - whisper
  - low-resource
  - morpheme-tokenization
  - armenian
  - compact-model
  - generated_from_trainer
  - onnx
datasets:
  - Chillarmo/common_voice_20_armenian
license: mit
metrics:
  - wer
pipeline_tag: automatic-speech-recognition
model-index:
  - name: ATOM (Armenian Tiny Optimized Model)
    results:
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Common Voice 20.0 Armenian
          type: mozilla-foundation/common_voice_20_0
          config: hy
          split: test
        metrics:
          - type: wer
            value: 50.3
            name: Word Error Rate
          - type: exact_match
            value: 10.06
            name: Exact Match

---

# ATOM: Armenian Tiny Optimized Model

A compact, morpheme-aware Automatic Speech Recognition (ASR) model that significantly outperforms OpenAI's Whisper on Armenian speech recognition.

## Model Description

ATOM is a specialized ASR model for low-resource Armenian, achieving 57.6% lower WER than vanilla Whisper-tiny on Armenian (50.3% vs 118.6%) while using 28% fewer parameters. The model combines:

  • Frozen Whisper-tiny encoder (pre-trained audio feature extraction)
  • Custom compact decoder (2 layers, trained from scratch on Armenian)
  • Morpheme-level BPE tokenization (5,000 tokens optimized for Armenian morphology vs Whisper's 51k multilingual tokens)

## Architecture

```text
Input: Audio (16 kHz)
  ↓
Whisper Encoder (frozen, 4 layers, 384 hidden, 1536 FFN)
  ↓
Compact Decoder (trainable, 2 layers, 384 hidden, 1024 FFN)
  ↓
Morpheme Vocabulary (5,000 tokens)
  ↓
Output: Armenian Text
```

Total Parameters: ~28M (28% smaller than Whisper-tiny's 39M)

## Performance

Evaluated on Common Voice 20.0 Armenian test set:

| Model | Parameters | WER (Armenian) | Relative Improvement |
|---|---|---|---|
| Whisper-tiny | 39M | 118.6%* | Baseline |
| Whisper-base | 74M | 126.3%* | -6.5% (worse) |
| Whisper-small | 244M | 86.6%* | +27.0% |
| Whisper-medium | 769M | 60.1%* | +49.3% |
| Whisper-large | 1550M | 53.7%* | +54.7% |
| Whisper-large-v2 | 1550M | 44.6%* | +62.4% |
| **ATOM** | **28M** | **50.3%** | **+57.6%** |

*Whisper WER values for Armenian from published benchmarks
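The relative-improvement column follows directly from the baseline. A minimal check, with the numbers taken from the table above:

```python
# Relative WER improvement over the Whisper-tiny baseline (118.6% WER),
# computed as (baseline_wer - model_wer) / baseline_wer.
def relative_improvement(baseline_wer: float, model_wer: float) -> float:
    """Return the relative WER reduction as a percentage of the baseline."""
    return (baseline_wer - model_wer) / baseline_wer * 100

print(round(relative_improvement(118.6, 50.3), 1))  # ATOM: 57.6
print(round(relative_improvement(118.6, 44.6), 1))  # Whisper-large-v2: 62.4
```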

Key Insights:

  • ATOM outperforms every Whisper model up to Whisper-large (a model 55× larger) and comes within six WER points of Whisper-large-v2
  • Word Error Rate (WER): 50.3% vs Whisper-tiny's 118.6% on Armenian
  • Model Size: 28M parameters (28% smaller than Whisper-tiny, 55× smaller than Whisper-large-v2)
  • Training Efficiency: trained on roughly 30 hours of Armenian speech vs Whisper's 680k hours of multilingual audio

Note: While Whisper models achieve strong performance on high-resource languages (e.g., Whisper-tiny: 79.0% average WER), they perform far worse on low-resource Armenian (118.6% WER; WER can exceed 100% because inserted words also count as errors), demonstrating the need for language-specific approaches.

## Why ATOM Outperforms Whisper

  1. Morpheme-Aware Tokenization: Armenian is an agglutinative language where words combine multiple morphemes (e.g., "չէինք" = "չ" [negation] + "է" [to be] + "ինք" [we/past]). ATOM's morpheme-level vocabulary (5k tokens) captures this linguistic structure better than Whisper's multilingual word-level BPE (51k tokens).

  2. Language-Specific Training: While Whisper is trained on 99 languages (680k hours), ATOM's decoder is trained exclusively on Armenian, allowing deep specialization on Armenian phonology and morphology.

  3. Efficient Architecture: The compact 2-layer decoder prevents overfitting on limited training data while the frozen pre-trained encoder provides robust audio feature extraction.

  4. Low-Resource Optimization: Whisper's multilingual training spreads capacity across languages, disadvantaging low-resource Armenian. ATOM dedicates all decoder capacity to Armenian.

## Intended Uses

Primary Uses:

  • Armenian speech-to-text transcription
  • Real-time subtitling for Armenian content
  • Accessibility tools for Armenian speakers
  • Research on morpheme-aware ASR for agglutinative languages

Best Performance:

  • Clear speech in quiet environments
  • Native Armenian speakers
  • Standard Eastern/Western Armenian dialects

## Limitations

  • Trained on limited data (relatively small dataset)
  • May struggle with heavy accents or noisy audio
  • Optimized for Armenian only (not multilingual)
  • 10% exact match rate indicates room for improvement in perfect transcriptions
  • Performance may degrade on out-of-domain audio (non-Common Voice data)

## Training Details

### Training Data

  • Dataset: Common Voice 20.0 Armenian
  • Splits Used: Train + Other
  • Duration: Approximately 30 hours of Armenian speech
  • Speakers: 400+ unique speakers
  • Demographics:
    • Gender: 55% Female, 25% Male, 20% Undefined
    • Age: Primarily 20s-30s (70%+)
  • Test Set: Common Voice test split (separate, unseen data)

### Training Hyperparameters

```yaml
learning_rate: 1e-4
train_batch_size: 32
gradient_accumulation_steps: 1
warmup_steps: 500
max_steps: 12000
save_steps: 3000
fp16: true
optimizer: AdamW (torch)
lr_scheduler_type: cosine
max_grad_norm: 1.0
gradient_checkpointing: true
dataloader_num_workers: 8
```
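Passed to Hugging Face's `Seq2SeqTrainer`, the hyperparameters above would look roughly like the sketch below; the `output_dir` value and the assumption that `Seq2SeqTrainer` was used are illustrative, not stated in the card:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: maps the hyperparameters listed above onto
# Seq2SeqTrainingArguments. output_dir is illustrative.
training_args = Seq2SeqTrainingArguments(
    output_dir="./atom-model",
    learning_rate=1e-4,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    warmup_steps=500,
    max_steps=12_000,
    save_steps=3_000,
    fp16=True,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    max_grad_norm=1.0,
    gradient_checkpointing=True,
    dataloader_num_workers=8,
)
```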

### Training Infrastructure

  • GPU: NVIDIA RTX 3060 Ti with FP16 mixed precision
  • Framework:
    • Transformers 4.56.2
    • PyTorch 2.8.0+cu129
    • Datasets 3.5.0
    • Tokenizers 0.22.1
  • Final Checkpoint: Step 9,000
  • Evaluation Loss: 1.36

## Usage

### Installation

```bash
pip install transformers torch torchaudio
```

### Basic Inference

```python
import torch
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("Chillarmo/ATOM")
processor = WhisperProcessor.from_pretrained("Chillarmo/ATOM")

# Load audio and resample to 16 kHz if needed
audio, sr = torchaudio.load("audio.wav")
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)

# Extract log-mel input features
input_features = processor(
    audio.squeeze().numpy(),
    sampling_rate=16000,
    return_tensors="pt",
).input_features

# Generate token IDs with beam search
with torch.no_grad():
    predicted_ids = model.generate(
        input_features,
        max_length=448,
        num_beams=5,
        repetition_penalty=1.2,
        no_repeat_ngram_size=3,
    )

# Decode to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

### Advanced Usage with Pipeline

```python
from transformers import pipeline

# Create ASR pipeline (device=0 selects the first GPU; use device=-1 for CPU)
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="Chillarmo/ATOM",
    device=0,
)

# Transcribe
result = asr_pipeline(
    "audio.wav",
    generate_kwargs={
        "max_length": 448,
        "num_beams": 5,
        "repetition_penalty": 1.2,
    },
)
print(result["text"])
```

## Technical Details

### Morpheme Tokenization

The model uses a custom BPE tokenizer trained on Armenian text with morpheme-level granularity:

  • Vocabulary Size: 5,000 tokens
  • Special Tokens: `<pad>`, `<s>`, `</s>`, `<unk>`
  • Training Corpus: Armenian Wikipedia + Common Voice transcriptions
  • Morpheme Segmentation: Whitespace pre-tokenization optimized for Armenian word structure

Example tokenization:

```text
Word: "չէինք" (we were not)
Morphemes: ["չ", "է", "ինք"]
Gloss: [negation] + [to be] + [we/past]
```
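For intuition only, a greedy longest-match segmenter over a tiny hand-written morpheme inventory reproduces the split above. The inventory and function are illustrative; the actual model segments via learned BPE merges, not a fixed lookup:

```python
# Toy morpheme inventory (hand-written for this example, not the real vocab).
MORPHEMES = {"չ", "է", "ինք", "գր", "ում", "եմ"}

def segment(word):
    """Split a word into known morphemes, preferring the longest match."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            if word[i:j] in MORPHEMES:
                pieces.append(word[i:j])
                i = j
                break
        else:  # no known morpheme starts here: fall back to a single character
            pieces.append(word[i])
            i += 1
    return pieces

print(segment("չէինք"))  # ['չ', 'է', 'ինք']
```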

### Model Architecture

Encoder (Frozen):

  • 4 Transformer encoder layers
  • 384 hidden dimensions
  • 1536 feed-forward dimensions
  • 6 attention heads
  • Pre-trained on Whisper's 680k hour multilingual dataset

Decoder (Trained from Scratch):

  • 2 Transformer decoder layers (50% reduction)
  • 384 hidden dimensions
  • 1024 feed-forward dimensions (33% reduction)
  • 6 attention heads
  • Trained exclusively on Armenian

## Reproduction

To reproduce training:

```bash
# Install dependencies
pip install transformers datasets evaluate jiwer accelerate

# Train
python train.py \
  --model_name_or_path openai/whisper-tiny \
  --dataset Chillarmo/common_voice_20_armenian \
  --output_dir ./atom-model \
  --learning_rate 1e-4 \
  --per_device_train_batch_size 32 \
  --max_steps 12000 \
  --fp16 \
  --save_steps 3000
```

## Citation

```bibtex
@misc{movsesyan2025atom,
  title={ATOM: Morpheme-Aware Whisper for Low-Resource Armenian ASR},
  author={Movsesyan, Movses},
  year={2025},
  institution={California State University, Sacramento}
}
```

## References

Whisper Armenian benchmarks from published evaluations on Common Voice datasets.

## Acknowledgments

  • Built on OpenAI's Whisper architecture (Radford et al., 2022)
  • Trained on Mozilla Common Voice data
  • Morpheme tokenization inspired by Armenian linguistic structure
  • California State University, Sacramento

## License

MIT (see the `license` field in the model card metadata).


Model Card Contact: movsesmovsesyan@csus.edu