|
|
--- |
|
|
library_name: transformers |
|
|
language: |
|
|
- hy |
|
|
tags: |
|
|
- asr |
|
|
- audio |
|
|
- speech |
|
|
- whisper |
|
|
- low-resource |
|
|
- morpheme-tokenization |
|
|
- armenian |
|
|
- compact-model |
|
|
- generated_from_trainer |
|
|
- onnx |
|
|
datasets: |
|
|
- Chillarmo/common_voice_20_armenian |
|
|
license: mit |
|
|
metrics: |
|
|
- wer |
|
|
pipeline_tag: automatic-speech-recognition |
|
|
model-index: |
|
|
- name: ATOM (Armenian Tiny Optimized Model) |
|
|
results: |
|
|
- task: |
|
|
type: automatic-speech-recognition |
|
|
name: Automatic Speech Recognition |
|
|
dataset: |
|
|
name: Common Voice 20.0 Armenian |
|
|
type: mozilla-foundation/common_voice_20_0 |
|
|
config: hy |
|
|
split: test |
|
|
metrics: |
|
|
- type: wer |
|
|
value: 50.3 |
|
|
name: Word Error Rate |
|
|
- type: exact_match |
|
|
value: 10.06 |
|
|
name: Exact Match |
|
|
--- |
|
|
|
|
|
# ATOM: Armenian Tiny Optimized Model |
|
|
|
|
|
A compact, morpheme-aware Automatic Speech Recognition (ASR) model that **significantly outperforms** OpenAI's Whisper on Armenian speech recognition. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
ATOM is a specialized ASR model for low-resource Armenian, achieving **57.6% lower WER** than vanilla Whisper-tiny **on Armenian** while using **28% fewer parameters**. The model combines:
|
|
|
|
|
- **Frozen Whisper-tiny encoder** (pre-trained audio feature extraction) |
|
|
- **Custom compact decoder** (2 layers, trained from scratch on Armenian) |
|
|
- **Morpheme-level BPE tokenization** (5,000 tokens optimized for Armenian morphology vs Whisper's 51k multilingual tokens) |
|
|
|
|
|
### Architecture |
|
|
|
|
|
``` |
|
|
Input: Audio (16kHz) |
|
|
↓ |
|
|
Whisper Encoder (frozen, 4 layers, 384 hidden, 1536 FFN) |
|
|
↓ |
|
|
Compact Decoder (trainable, 2 layers, 384 hidden, 1024 FFN) |
|
|
↓ |
|
|
Morpheme Vocabulary (5,000 tokens) |
|
|
↓ |
|
|
Output: Armenian Text |
|
|
``` |
|
|
|
|
|
**Total Parameters:** ~28M (28% smaller than Whisper-tiny's 39M) |
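The frozen-encoder / trainable-decoder split above comes down to toggling `requires_grad` in PyTorch. A minimal sketch with toy stand-in modules (the dimensions mirror the diagram, but these are not ATOM's actual classes):

```python
import torch.nn as nn

# Toy stand-ins: a "pre-trained" encoder and a compact decoder head.
# Layer internals are simplified; only the freezing mechanic matters here.
encoder = nn.Sequential(nn.Linear(80, 384), nn.GELU(), nn.Linear(384, 384))
decoder = nn.Sequential(nn.Linear(384, 384), nn.GELU(), nn.Linear(384, 5000))

# Freeze the encoder: its weights keep their pre-trained values during training.
for p in encoder.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in decoder.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in encoder.parameters())
print(f"trainable: {trainable:,}  frozen: {frozen:,}")
```

Only the decoder's parameters would then be handed to the optimizer, which is what lets the small decoder specialize on Armenian without disturbing the pre-trained audio features.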
|
|
|
|
|
## Performance |
|
|
|
|
|
Evaluated on Common Voice 20.0 Armenian test set: |
|
|
|
|
|
| Model | Parameters | WER (Armenian) | Relative Improvement | |
|
|
|-------|------------|----------------|---------------------| |
|
|
| Whisper-tiny | 39M | 118.6%* | Baseline | |
|
|
| Whisper-base | 74M | 126.3%* | -6.5% (worse) | |
|
|
| Whisper-small | 244M | 86.6%* | +27.0% | |
|
|
| Whisper-medium | 769M | 60.1%* | +49.3% | |
|
|
| Whisper-large | 1550M | 53.7%* | +54.7% | |
|
|
| Whisper-large-v2 | 1550M | 44.6%* | +62.4% | |
|
|
| **ATOM** | **28M** | **50.3%** | **+57.6%** ✅ |
|
|
|
|
|
*Whisper WER values for Armenian from published benchmarks |
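The "Relative Improvement" column is the relative WER reduction versus the Whisper-tiny baseline; WER itself is word-level edit distance divided by reference length. A minimal, dependency-free sketch (the actual evaluation presumably used a library such as `jiwer` or `evaluate`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / len(ref)

def relative_improvement(baseline_wer: float, model_wer: float) -> float:
    """Relative WER reduction versus the baseline, in percent."""
    return (baseline_wer - model_wer) / baseline_wer * 100

print(round(relative_improvement(118.6, 50.3), 1))  # 57.6
```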
|
|
|
|
|
### Key Insights
|
|
- **ATOM outperforms ALL Whisper models on Armenian**, including models up to 55× larger |
|
|
- **Word Error Rate (WER):** 50.3% vs Whisper-tiny's 118.6% on Armenian
|
|
- **Model Size:** 28M parameters (28% smaller than Whisper-tiny, 55× smaller than Whisper-large-v2) |
|
|
- **Training Efficiency:** Trained on ~30 hours of Armenian speech vs Whisper's 680k hours of multilingual data
|
|
|
|
|
**Note:** Whisper's multilingual training favors high-resource languages: Whisper-tiny averages 79.0% WER across languages but degrades to 118.6% WER on low-resource Armenian, demonstrating the need for language-specific approaches.
|
|
|
|
|
## Why ATOM Outperforms Whisper |
|
|
|
|
|
1. **Morpheme-Aware Tokenization:** Armenian is an agglutinative language where words combine multiple morphemes (e.g., "չէինք" = "չ" [negation] + "է" [to be] + "ինք" [we/past]). ATOM's morpheme-level vocabulary (5k tokens) captures this linguistic structure better than Whisper's multilingual word-level BPE (51k tokens). |
|
|
|
|
|
2. **Language-Specific Training:** While Whisper is trained on 99 languages (680k hours), ATOM's decoder is trained exclusively on Armenian, allowing deep specialization on Armenian phonology and morphology. |
|
|
|
|
|
3. **Efficient Architecture:** The compact 2-layer decoder prevents overfitting on limited training data while the frozen pre-trained encoder provides robust audio feature extraction. |
|
|
|
|
|
4. **Low-Resource Optimization:** Whisper's multilingual training spreads capacity across languages, disadvantaging low-resource Armenian. ATOM dedicates all decoder capacity to Armenian. |
|
|
|
|
|
## Intended Uses |
|
|
|
|
|
**Primary Uses:** |
|
|
- Armenian speech-to-text transcription |
|
|
- Real-time subtitling for Armenian content |
|
|
- Accessibility tools for Armenian speakers |
|
|
- Research on morpheme-aware ASR for agglutinative languages |
|
|
|
|
|
**Best Performance:** |
|
|
- Clear speech in quiet environments |
|
|
- Native Armenian speakers |
|
|
- Standard Eastern/Western Armenian dialects |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Trained on a relatively small dataset (~30 hours of speech)
|
|
- May struggle with heavy accents or noisy audio |
|
|
- Optimized for Armenian only (not multilingual) |
|
|
- 10% exact match rate indicates room for improvement in perfect transcriptions |
|
|
- Performance may degrade on out-of-domain audio (non-Common Voice data) |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
- **Dataset:** Common Voice 20.0 Armenian |
|
|
- **Splits Used:** Train + Other |
|
|
- **Duration:** Approximately 30 hours of Armenian speech |
|
|
- **Speakers:** 400+ unique speakers |
|
|
- **Demographics:** |
|
|
- Gender: 55% Female, 25% Male, 20% Undefined |
|
|
- Age: Primarily 20s-30s (70%+) |
|
|
- **Test Set:** Common Voice test split (separate, unseen data) |
|
|
|
|
|
### Training Hyperparameters |
|
|
|
|
|
```yaml
learning_rate: 1e-4
train_batch_size: 32
gradient_accumulation_steps: 1
warmup_steps: 500
max_steps: 12000
save_steps: 3000
fp16: true
optimizer: AdamW (torch)
lr_scheduler_type: cosine
max_grad_norm: 1.0
gradient_checkpointing: true
dataloader_num_workers: 8
```
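These settings map directly onto `transformers`' `Seq2SeqTrainingArguments`. A hedged sketch of that mapping (the `output_dir` path is a placeholder, and this is a reconstruction from the list above, not the repository's actual training script):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./atom-model",          # placeholder path
    learning_rate=1e-4,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    warmup_steps=500,
    max_steps=12_000,
    save_steps=3_000,
    fp16=True,                          # assumes a CUDA GPU
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    max_grad_norm=1.0,
    gradient_checkpointing=True,
    dataloader_num_workers=8,
)
```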
|
|
|
|
|
### Training Infrastructure |
|
|
|
|
|
- **GPU:** NVIDIA RTX 3060 Ti with FP16 mixed precision
|
|
- **Framework:** |
|
|
- Transformers 4.56.2 |
|
|
- PyTorch 2.8.0+cu129 |
|
|
- Datasets 3.5.0 |
|
|
- Tokenizers 0.22.1 |
|
|
- **Final Checkpoint:** Step 9,000 |
|
|
- **Evaluation Loss:** 1.36 |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install transformers torch torchaudio |
|
|
``` |
|
|
|
|
|
### Basic Inference |
|
|
|
|
|
```python |
|
|
from transformers import WhisperForConditionalGeneration, WhisperProcessor |
|
|
import torch |
|
|
|
|
|
# Load model and processor |
|
|
model = WhisperForConditionalGeneration.from_pretrained("Chillarmo/ATOM") |
|
|
processor = WhisperProcessor.from_pretrained("Chillarmo/ATOM") |
|
|
|
|
|
# Load audio (16kHz) |
|
|
import torchaudio |
|
|
audio, sr = torchaudio.load("audio.wav")
audio = audio.mean(dim=0, keepdim=True)  # downmix multi-channel audio to mono
if sr != 16000:
    resampler = torchaudio.transforms.Resample(sr, 16000)
    audio = resampler(audio)
|
|
|
|
|
# Process |
|
|
input_features = processor( |
|
|
audio.squeeze().numpy(), |
|
|
sampling_rate=16000, |
|
|
return_tensors="pt" |
|
|
).input_features |
|
|
|
|
|
# Generate |
|
|
with torch.no_grad(): |
|
|
predicted_ids = model.generate( |
|
|
input_features, |
|
|
max_length=448, |
|
|
num_beams=5, |
|
|
repetition_penalty=1.2, |
|
|
no_repeat_ngram_size=3 |
|
|
) |
|
|
|
|
|
# Decode |
|
|
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0] |
|
|
print(transcription) |
|
|
``` |
|
|
|
|
|
### Advanced Usage with Pipeline |
|
|
|
|
|
```python |
|
|
from transformers import pipeline |
|
|
|
|
|
# Create ASR pipeline |
|
|
asr_pipeline = pipeline( |
|
|
"automatic-speech-recognition", |
|
|
model="Chillarmo/ATOM", |
|
|
    device=0  # GPU index; use device=-1 to run on CPU
|
|
) |
|
|
|
|
|
# Transcribe |
|
|
result = asr_pipeline( |
|
|
"audio.wav", |
|
|
generate_kwargs={ |
|
|
"max_length": 448, |
|
|
"num_beams": 5, |
|
|
"repetition_penalty": 1.2 |
|
|
} |
|
|
) |
|
|
print(result["text"]) |
|
|
``` |
|
|
|
|
|
## Technical Details |
|
|
|
|
|
### Morpheme Tokenization |
|
|
|
|
|
The model uses a custom BPE tokenizer trained on Armenian text with morpheme-level granularity: |
|
|
- **Vocabulary Size:** 5,000 tokens |
|
|
- **Special Tokens:** `<pad>`, `<s>`, `</s>`, `<unk>` |
|
|
- **Training Corpus:** Armenian Wikipedia + Common Voice transcriptions |
|
|
- **Morpheme Segmentation:** Whitespace pre-tokenization optimized for Armenian word structure |
|
|
|
|
|
Example tokenization: |
|
|
``` |
|
|
Word: "չէինք" (we were not) |
|
|
Morphemes: ["չ", "է", "ինք"] |
|
|
Translation: [negation] + [to be] + [we/past] |
|
|
``` |
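A tokenizer matching this description (5,000-token BPE with whitespace pre-tokenization and the four special tokens) can be trained with the Hugging Face `tokenizers` library. A sketch under assumptions: the tiny corpus below is a stand-in for the Armenian Wikipedia + Common Voice text actually used, so the learned merges will differ from ATOM's.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Stand-in corpus; the real tokenizer was trained on Armenian Wikipedia
# plus Common Voice transcriptions.
corpus = ["չէինք գնում", "նա գնում է", "մենք չէինք խոսում"]

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=5000,
    special_tokens=["<pad>", "<s>", "</s>", "<unk>"],
)
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("չէինք").tokens)
```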
|
|
|
|
|
### Model Architecture |
|
|
|
|
|
**Encoder (Frozen):** |
|
|
- 4 Transformer encoder layers |
|
|
- 384 hidden dimensions |
|
|
- 1536 feed-forward dimensions |
|
|
- 6 attention heads |
|
|
- Pre-trained on Whisper's 680k hour multilingual dataset |
|
|
|
|
|
**Decoder (Trained from Scratch):** |
|
|
- 2 Transformer decoder layers (50% reduction) |
|
|
- 384 hidden dimensions |
|
|
- 1024 feed-forward dimensions (33% reduction) |
|
|
- 6 attention heads |
|
|
- Trained exclusively on Armenian |
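For reference, an encoder-decoder with these dimensions can be described via `transformers`' `WhisperConfig`. This is a sketch of the stated sizes, not the repository's actual config file:

```python
from transformers import WhisperConfig

config = WhisperConfig(
    vocab_size=5000,             # morpheme-level BPE vocabulary
    d_model=384,                 # hidden size, shared by encoder and decoder
    encoder_layers=4,
    encoder_ffn_dim=1536,
    encoder_attention_heads=6,
    decoder_layers=2,            # 50% fewer than Whisper-tiny's 4
    decoder_ffn_dim=1024,        # 33% smaller than the encoder's FFN
    decoder_attention_heads=6,
)
print(config.decoder_layers, config.vocab_size)
```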
|
|
|
|
|
## Reproduction |
|
|
|
|
|
To reproduce training: |
|
|
|
|
|
```bash |
|
|
# Install dependencies |
|
|
pip install transformers datasets evaluate jiwer accelerate |
|
|
|
|
|
# Train |
|
|
python train.py \ |
|
|
--model_name_or_path openai/whisper-tiny \ |
|
|
--dataset Chillarmo/common_voice_20_armenian \ |
|
|
--output_dir ./atom-model \ |
|
|
--learning_rate 1e-4 \ |
|
|
--per_device_train_batch_size 32 \ |
|
|
--max_steps 12000 \ |
|
|
--fp16 \ |
|
|
--save_steps 3000 |
|
|
``` |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{movsesyan2025atom, |
|
|
title={ATOM: Morpheme-Aware Whisper for Low-Resource Armenian ASR}, |
|
|
author={Movsesyan, Movses}, |
|
|
year={2025}, |
|
|
institution={California State University, Sacramento} |
|
|
} |
|
|
``` |
|
|
|
|
|
## References |
|
|
|
|
|
Whisper Armenian benchmarks from published evaluations on Common Voice datasets. |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- Built on OpenAI's Whisper architecture ([Radford et al., 2022](https://arxiv.org/abs/2212.04356)) |
|
|
- Trained on Mozilla Common Voice data |
|
|
- Morpheme tokenization inspired by Armenian linguistic structure |
|
|
- California State University, Sacramento |
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the MIT License.
|
|
|
|
|
--- |
|
|
|
|
|
**Model Card Contact:** movsesmovsesyan@csus.edu |