---
library_name: transformers
license: mit
language:
- en
- az
base_model: jhu-clsp/mmBERT-base
tags:
- modernbert
- multilingual
- vocabulary-truncation
- encoder
- fill-mask
- feature-extraction
- azerbaijani
- english
pipeline_tag: feature-extraction
---

# mmBERT-base-en-az

A vocabulary-truncated version of [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base), optimized for **English** and **Azerbaijani** by removing unused tokens from the 1800+ language vocabulary.

## What is this model?

mmBERT is a state-of-the-art multilingual encoder built on the ModernBERT architecture with a Gemma 2 tokenizer, trained on 3T+ tokens across 1800+ languages. While powerful, the full model carries a 256K-token vocabulary, most of which is unnecessary if you only need English and Azerbaijani.

This model keeps only the ~72K tokens that actually appear in English and Azerbaijani text, reducing the model size by **46%** while preserving identical output quality for these two languages.

## Key numbers

| Metric | Original | Truncated |
|---|---|---|
| Vocabulary size | 256,000 | 71,751 |
| Total parameters | 306.9M | 165.4M |
| Embedding parameters | 196.6M | 55.1M |
| Model size (fp32) | 1.14 GB | 0.62 GB |
| Hidden size | 768 | 768 |
| Layers | 22 | 22 |
| Max sequence length | 8,192 | 8,192 |

All transformer layers (110M non-embedding parameters) are completely unchanged; only the embedding matrix was trimmed.
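
The size reduction follows directly from vocabulary size × hidden size. A quick sanity check of the table's embedding figures:

```python
# Embedding parameters are just vocab_size * hidden_size; the 46% total
# reduction falls out of this arithmetic.
hidden = 768
orig_vocab, new_vocab = 256_000, 71_751
total_params = 306.9e6

orig_embed = orig_vocab * hidden   # 196,608,000 -> 196.6M
new_embed = new_vocab * hidden     # 55,104,768  -> 55.1M
saved = orig_embed - new_embed     # 141,503,232 parameters removed

print(f"Embedding params: {orig_embed/1e6:.1f}M -> {new_embed/1e6:.1f}M")
print(f"Total reduction: {saved/total_params:.0%}")
```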

## Quality verification

Cosine similarity between Azerbaijani–English sentence pairs is identical or near-identical to the original model:

| Sentence pair | Original | Truncated |
|---|---|---|
| "Bakı Azərbaycanın paytaxtıdır" ↔ "Baku is the capital of Azerbaijan" | 0.7718 | 0.7718 |
| "Süni intellekt texnologiyası sürətlə inkişaf edir" ↔ "Artificial intelligence technology is developing rapidly" | 0.7626 | 0.7792 |
| "Bu gün hava çox gözəldir" ↔ "The weather is very nice today" | 0.8285 | 0.8285 |

Tokenization output is identical for both languages.

## Usage

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("LocalDoc/mmBERT-base-en-az")
model = AutoModel.from_pretrained("LocalDoc/mmBERT-base-en-az")

inputs = tokenizer("Salam, bu gün necəsiniz?", return_tensors="pt")  # "Hello, how are you today?"
outputs = model(**inputs)
```

### Getting sentence embeddings (mean pooling)

```python
import torch

def get_embeddings(texts, model, tokenizer):
    encoded = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        output = model(**encoded)
    # Mean-pool the token embeddings, ignoring padding positions.
    mask = encoded["attention_mask"].unsqueeze(-1).expand(output.last_hidden_state.size()).float()
    embeddings = torch.sum(output.last_hidden_state * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)
    # L2-normalize so a dot product equals cosine similarity.
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings

embeddings = get_embeddings(
    ["Bakı Azərbaycanın paytaxtıdır", "Baku is the capital of Azerbaijan"],
    model, tokenizer
)
similarity = embeddings[0].dot(embeddings[1])
print(f"Similarity: {similarity:.4f}")
```

## How it was made

1. Tokenized 1M English and 1M Azerbaijani sentences with the original mmBERT tokenizer
2. Counted token frequencies across both corpora
3. Kept all special/control tokens (first 260 IDs) plus tokens appearing ≥10 times in English or ≥3 times in Azerbaijani
4. Filtered the BPE merges to keep only those where both parts and the merged result exist in the new vocabulary
5. Sliced the corresponding rows from the embedding matrix (`model.embeddings.tok_embeddings`)
6. Saved the truncated model and tokenizer

Method adapted from [vrashad/language_model_optimization](https://github.com/vrashad/language_model_optimization).
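
The frequency-filtering and slicing steps (3 and 5) can be sketched roughly as follows. The counts and tensor here are toy stand-ins for illustration, not the actual corpora or checkpoint:

```python
import torch

# Toy stand-in for the original embedding matrix
# (model.embeddings.tok_embeddings.weight in the real model).
vocab_size, hidden = 1000, 8
embeddings = torch.randn(vocab_size, hidden)

# Toy frequency counts; in the real pipeline these come from tokenizing
# 1M sentences per language.
en_counts = {300: 50, 301: 12, 400: 9}   # token ID -> count in English corpus
az_counts = {301: 4, 500: 3, 600: 2}     # token ID -> count in Azerbaijani corpus

# Step 3: keep special/control tokens plus sufficiently frequent tokens.
kept = set(range(260))                                # first 260 IDs
kept |= {t for t, c in en_counts.items() if c >= 10}  # >=10 times in English
kept |= {t for t, c in az_counts.items() if c >= 3}   # >=3 times in Azerbaijani
keep_ids = sorted(kept)

# Step 5: slice the kept rows; the old->new ID map is what the rewritten
# tokenizer vocabulary and merges must follow.
new_embeddings = embeddings[keep_ids]
old_to_new = {old: new for new, old in enumerate(keep_ids)}
print(new_embeddings.shape)  # 263 rows: 260 special IDs plus 300, 301, 500
```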

## Limitations

- This model is intended for **English and Azerbaijani only**. Text in other languages will produce degraded tokenization (excessive byte-level fallback) and poor embeddings.
- The MLM head (`decoder.weight`, `decoder.bias`) was not truncated. If you need masked language modeling, load the model with `AutoModelForMaskedLM` and be aware of the vocabulary mismatch in the output layer.
- Fine-tuning is recommended for downstream tasks, as the base model was not fine-tuned for any specific task.

## Citation

If you use this model, please cite the original mmBERT paper:

```bibtex
@misc{marone2025mmbertmodernmultilingualencoder,
  title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning},
  author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
  year={2025},
  eprint={2509.06888},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.06888},
}
```