---
library_name: transformers
license: mit
language:
- en
- az
base_model: jhu-clsp/mmBERT-base
tags:
- modernbert
- multilingual
- vocabulary-truncation
- encoder
- fill-mask
- feature-extraction
- azerbaijani
- english
pipeline_tag: feature-extraction
---

# mmBERT-base-en-az

A vocabulary-truncated version of [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base), optimized for **English** and **Azerbaijani** by removing unused tokens from the 1800+ language vocabulary.

## What is this model?

mmBERT is a state-of-the-art multilingual encoder built on the ModernBERT architecture with a Gemma 2 tokenizer, trained on 3T+ tokens across 1800+ languages. While powerful, the full model carries a 256K-token vocabulary, most of which is unnecessary if you only need English and Azerbaijani.

This model keeps only the ~72K tokens that actually appear in English and Azerbaijani text, reducing the model size by **46%** while preserving identical output quality for these two languages.

## Key numbers

| Metric | Original | Truncated |
|---|---|---|
| Vocabulary size | 256,000 | 71,751 |
| Total parameters | 306.9M | 165.4M |
| Embedding parameters | 196.6M | 55.1M |
| Model size (fp32) | 1.14 GB | 0.62 GB |
| Hidden size | 768 | 768 |
| Layers | 22 | 22 |
| Max sequence length | 8,192 | 8,192 |

All transformer layers (110M non-embedding parameters) are completely unchanged; only the embedding matrix was trimmed.
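
The size reduction follows directly from vocabulary size × hidden size. A quick sanity check of the table's embedding figures:

```python
# Embedding parameters are just vocab_size * hidden_size; the 46% total
# reduction falls out of this arithmetic.
hidden = 768
orig_vocab, new_vocab = 256_000, 71_751
total_params = 306.9e6

orig_embed = orig_vocab * hidden   # 196,608,000 -> 196.6M
new_embed = new_vocab * hidden     # 55,104,768  -> 55.1M
saved = orig_embed - new_embed     # 141,503,232 parameters removed

print(f"Embedding params: {orig_embed/1e6:.1f}M -> {new_embed/1e6:.1f}M")
print(f"Total reduction: {saved/total_params:.0%}")
```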

## Quality verification

Cosine similarity between Azerbaijani–English sentence pairs is identical or near-identical to the original model:

| Sentence pair | Original | Truncated |
|---|---|---|
| "Bakı Azərbaycanın paytaxtıdır" ↔ "Baku is the capital of Azerbaijan" | 0.7718 | 0.7718 |
| "Süni intellekt texnologiyası sürətlə inkişaf edir" ↔ "Artificial intelligence technology is developing rapidly" | 0.7626 | 0.7792 |
| "Bu gün hava çox gözəldir" ↔ "The weather is very nice today" | 0.8285 | 0.8285 |

Tokenization output is identical for both languages.

## Usage

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("LocalDoc/mmBERT-base-en-az")
model = AutoModel.from_pretrained("LocalDoc/mmBERT-base-en-az")

inputs = tokenizer("Salam, bu gün necəsiniz?", return_tensors="pt")  # "Hello, how are you today?"
outputs = model(**inputs)
```

### Getting sentence embeddings (mean pooling)

```python
import torch

def get_embeddings(texts, model, tokenizer):
    encoded = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        output = model(**encoded)
    # Mean-pool the token embeddings, ignoring padding positions.
    mask = encoded["attention_mask"].unsqueeze(-1).expand(output.last_hidden_state.size()).float()
    embeddings = torch.sum(output.last_hidden_state * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)
    # L2-normalize so a dot product equals cosine similarity.
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings

embeddings = get_embeddings(
    ["Bakı Azərbaycanın paytaxtıdır", "Baku is the capital of Azerbaijan"],
    model, tokenizer
)
similarity = embeddings[0].dot(embeddings[1])
print(f"Similarity: {similarity:.4f}")
```

## How it was made

1. Tokenized 1M English and 1M Azerbaijani sentences with the original mmBERT tokenizer
2. Counted token frequencies across both corpora
3. Kept all special/control tokens (first 260 IDs) plus tokens appearing ≥10 times in English or ≥3 times in Azerbaijani
4. Filtered the BPE merges to keep only those where both parts and the merged result exist in the new vocabulary
5. Sliced the corresponding rows from the embedding matrix (`model.embeddings.tok_embeddings`)
6. Saved the truncated model and tokenizer

Method adapted from [vrashad/language_model_optimization](https://github.com/vrashad/language_model_optimization).
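
The frequency-filtering and slicing steps (3 and 5) can be sketched roughly as follows. The counts and tensor here are toy stand-ins for illustration, not the actual corpora or checkpoint:

```python
import torch

# Toy stand-in for the original embedding matrix
# (model.embeddings.tok_embeddings.weight in the real model).
vocab_size, hidden = 1000, 8
embeddings = torch.randn(vocab_size, hidden)

# Toy frequency counts; in the real pipeline these come from tokenizing
# 1M sentences per language.
en_counts = {300: 50, 301: 12, 400: 9}   # token ID -> count in English corpus
az_counts = {301: 4, 500: 3, 600: 2}     # token ID -> count in Azerbaijani corpus

# Step 3: keep special/control tokens plus sufficiently frequent tokens.
kept = set(range(260))                                # first 260 IDs
kept |= {t for t, c in en_counts.items() if c >= 10}  # >=10 times in English
kept |= {t for t, c in az_counts.items() if c >= 3}   # >=3 times in Azerbaijani
keep_ids = sorted(kept)

# Step 5: slice the kept rows; the old->new ID map is what the rewritten
# tokenizer vocabulary and merges must follow.
new_embeddings = embeddings[keep_ids]
old_to_new = {old: new for new, old in enumerate(keep_ids)}
print(new_embeddings.shape)  # 263 rows: 260 special IDs plus 300, 301, 500
```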

## Limitations

- This model is intended for **English and Azerbaijani only**. Text in other languages will produce degraded tokenization (excessive byte-level fallback) and poor embeddings.
- The MLM head (`decoder.weight`, `decoder.bias`) was not truncated. If you need masked language modeling, load the model with `AutoModelForMaskedLM` and be aware of the vocabulary mismatch in the output layer.
- Fine-tuning is recommended for downstream tasks, as the base model was not fine-tuned for any specific task.

## Citation

If you use this model, please cite the original mmBERT paper:

```bibtex
@misc{marone2025mmbertmodernmultilingualencoder,
  title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning},
  author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
  year={2025},
  eprint={2509.06888},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.06888},
}
```