---
library_name: transformers
license: mit
language:
- en
- az
base_model: jhu-clsp/mmBERT-base
tags:
- modernbert
- multilingual
- vocabulary-truncation
- encoder
- fill-mask
- feature-extraction
- azerbaijani
- english
pipeline_tag: feature-extraction
---

# mmBERT-base-en-az

A vocabulary-truncated version of [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base), optimized for **English** and **Azerbaijani** by removing unused tokens from its 1,800+-language vocabulary.

## What is this model?

mmBERT is a state-of-the-art multilingual encoder built on the ModernBERT architecture with a Gemma 2 tokenizer, trained on 3T+ tokens across 1,800+ languages. While powerful, the full model carries a 256K-token vocabulary, most of which is unnecessary if you only need English and Azerbaijani.

This model keeps only the ~72K tokens that actually appear in English and Azerbaijani text, reducing the model size by **46%** while preserving identical output quality for these two languages.

## Key numbers

| Metric | Original | Truncated |
|---|---|---|
| Vocabulary size | 256,000 | 71,751 |
| Total parameters | 306.9M | 165.4M |
| Embedding parameters | 196.6M | 55.1M |
| Model size (fp32) | 1.14 GB | 0.62 GB |
| Hidden size | 768 | 768 |
| Layers | 22 | 22 |
| Max sequence length | 8,192 | 8,192 |

All transformer layers (110M non-embedding parameters) are completely unchanged. Only the embedding matrix was trimmed.
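
Because embedding parameters are just `vocab_size × hidden_size` and nothing else changes, the table above can be reproduced with a quick back-of-the-envelope check (plain Python, using only the numbers already listed):

```python
# Sanity-check the "Key numbers" table: only the embedding matrix shrinks.
hidden_size = 768
orig_vocab, new_vocab = 256_000, 71_751

orig_embed = orig_vocab * hidden_size      # 196,608,000 ≈ 196.6M
new_embed = new_vocab * hidden_size        # 55,104,768 ≈ 55.1M

non_embedding = 306_900_000 - orig_embed   # transformer layers, unchanged (~110M)
new_total = non_embedding + new_embed      # ≈ 165.4M

reduction = 1 - new_total / 306_900_000
print(f"{new_total / 1e6:.1f}M params, {reduction:.0%} smaller")  # → 165.4M params, 46% smaller
```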

## Quality verification

Cosine similarity between Azerbaijani–English sentence pairs is identical or near-identical to the original model:

| Sentence pair | Original | Truncated |
|---|---|---|
| "Bakı Azərbaycanın paytaxtıdır" ↔ "Baku is the capital of Azerbaijan" | 0.7718 | 0.7718 |
| "Süni intellekt texnologiyası sürətlə inkişaf edir" ↔ "Artificial intelligence technology is developing rapidly" | 0.7626 | 0.7792 |
| "Bu gün hava çox gözəldir" ↔ "The weather is very nice today" | 0.8285 | 0.8285 |

Tokenization output is identical for both languages.

## Usage

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("LocalDoc/mmBERT-base-en-az")
model = AutoModel.from_pretrained("LocalDoc/mmBERT-base-en-az")

# "Salam, bu gün necəsiniz?" = "Hello, how are you today?"
inputs = tokenizer("Salam, bu gün necəsiniz?", return_tensors="pt")
outputs = model(**inputs)
```

### Getting sentence embeddings (mean pooling)

```python
import torch

def get_embeddings(texts, model, tokenizer):
    encoded = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        output = model(**encoded)
    # Mean-pool over tokens, weighted by the attention mask so padding is ignored
    mask = encoded["attention_mask"].unsqueeze(-1).expand(output.last_hidden_state.size()).float()
    embeddings = torch.sum(output.last_hidden_state * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)
    # L2-normalize so the dot product below equals cosine similarity
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings

embeddings = get_embeddings(
    ["Bakı Azərbaycanın paytaxtıdır", "Baku is the capital of Azerbaijan"],
    model, tokenizer
)
similarity = embeddings[0].dot(embeddings[1]).item()
print(f"Similarity: {similarity:.4f}")
```

## How it was made

1. Tokenized 1M English and 1M Azerbaijani sentences with the original mmBERT tokenizer
2. Counted token frequencies across both corpora
3. Kept all special/control tokens (the first 260 IDs) plus tokens appearing ≥10 times in English or ≥3 times in Azerbaijani
4. Filtered the BPE merges to keep only those where both parts and the merged result exist in the new vocabulary
5. Sliced the corresponding rows from the embedding matrix (`model.embeddings.tok_embeddings`)
6. Saved the truncated model and tokenizer
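
In miniature, steps 3–5 amount to the following (a toy sketch with made-up tokens, frequencies, thresholds scaled down, and a tiny random embedding matrix; not the actual conversion script):

```python
import torch

# Toy vocabulary: id -> token, plus per-language frequency counts (made up).
vocab = {0: "<pad>", 1: "<s>", 2: "</s>", 3: "he", 4: "llo", 5: "hello", 6: "zzz"}
freq_en = {3: 50, 4: 40, 5: 30, 6: 0}
freq_az = {3: 5, 4: 4, 5: 2, 6: 1}

NUM_SPECIAL = 3  # stand-in for mmBERT's first 260 control-token IDs

# Step 3: keep specials plus tokens frequent enough in either language.
keep_ids = sorted(
    set(range(NUM_SPECIAL))
    | {i for i in vocab if freq_en.get(i, 0) >= 10 or freq_az.get(i, 0) >= 3}
)

# Step 4: keep a BPE merge only if both parts and the merged result survive.
kept_tokens = {vocab[i] for i in keep_ids}
merges = [("he", "llo")]  # "he" + "llo" -> "hello"
kept_merges = [
    (a, b) for a, b in merges
    if a in kept_tokens and b in kept_tokens and a + b in kept_tokens
]

# Step 5: slice the matching rows out of the embedding matrix.
embeddings = torch.randn(len(vocab), 4)  # (vocab_size, hidden_size)
new_embeddings = embeddings[keep_ids]    # (len(keep_ids), hidden_size)

print(keep_ids, kept_merges, tuple(new_embeddings.shape))
```

Here `"zzz"` falls below both thresholds and is dropped, while all three merge participants survive, so the merge is kept. The real script does the same row-slicing on `model.embeddings.tok_embeddings` and remaps token IDs accordingly.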

Method adapted from [vrashad/language_model_optimization](https://github.com/vrashad/language_model_optimization).

## Limitations

- This model is intended for **English and Azerbaijani only**. Text in other languages will produce degraded tokenization (excessive byte-level fallback) and poor embeddings.
- The MLM head (`decoder.weight`, `decoder.bias`) was not truncated. If you need masked language modeling, load the model with `AutoModelForMaskedLM` and be aware of the vocabulary mismatch in the output layer.
- Fine-tuning is recommended for downstream tasks, as the base model was not fine-tuned for any specific task.

## Citation

If you use this model, please cite the original mmBERT paper:

```bibtex
@misc{marone2025mmbertmodernmultilingualencoder,
  title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning},
  author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
  year={2025},
  eprint={2509.06888},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.06888},
}
```