LocalDoc
/

mmBERT-small-en-az

@@ -4,7 +4,7 @@ license: mit
 language:
   - en
   - az
-base_model: jhu-clsp/mmBERT-base
 tags:
   - modernbert
   - multilingual
@@ -17,9 +17,9 @@ tags:
 pipeline_tag: feature-extraction
 ---
-# mmBERT-base-en-az
-A vocabulary-truncated version of [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base), optimized for **English** and **Azerbaijani** by removing unused tokens from the 1800+ language vocabulary.
 ## What is this model?
@@ -32,10 +32,10 @@ This model keeps only the ~72K tokens that actually appear in English and Azerba
 | Metric | Original | Truncated |
 |---|---|---|
 | Vocabulary size | 256,000 | 71,751 |
-| Total parameters | 306.9M | 165.4M |
 | Embedding parameters | 196.6M | 55.1M |
-| Model size (fp32) | 1.14 GB | 0.62 GB |
-| Hidden size | 768 | 768 |
 | Layers | 22 | 22 |
 | Max sequence length | 8,192 | 8,192 |
@@ -47,9 +47,9 @@ Cosine similarity between Azerbaijani–English sentence pairs is identical or n
 | Sentence pair | Original | Truncated |
 |---|---|---|
-| "Bakı Azərbaycanın paytaxtıdır" ↔ "Baku is the capital of Azerbaijan" | 0.7718 | 0.7718 |
-| "Süni intellekt texnologiyası sürətlə inkişaf edir" ↔ "Artificial intelligence technology is developing rapidly" | 0.7626 | 0.7792 |
-| "Bu gün hava çox gözəldir" ↔ "The weather is very nice today" | 0.8285 | 0.8285 |
 Tokenization output is identical for both languages.
@@ -58,8 +58,8 @@ Tokenization output is identical for both languages.
 ```python
 from transformers import AutoTokenizer, AutoModel
-tokenizer = AutoTokenizer.from_pretrained("LocalDoc/mmBERT-base-en-az")
-model = AutoModel.from_pretrained("LocalDoc/mmBERT-base-en-az")
 inputs = tokenizer("Salam, bu gün necəsiniz?", return_tensors="pt")
 outputs = model(**inputs)

 language:
   - en
   - az
+base_model: jhu-clsp/mmBERT-small
 tags:
   - modernbert
   - multilingual
 pipeline_tag: feature-extraction
 ---
+# mmBERT-small-en-az
+A vocabulary-truncated version of [jhu-clsp/mmBERT-small](https://huggingface.co/jhu-clsp/mmBERT-small), optimized for **English** and **Azerbaijani** by removing unused tokens from the 1800+ language vocabulary.
 ## What is this model?
 | Metric | Original | Truncated |
 |---|---|---|
 | Vocabulary size | 256,000 | 71,751 |
+| Total parameters | 140.493M | 69.42M |
 | Embedding parameters | 196.6M | 55.1M |
+| Model size (fp32) | 0.52 GB | 0.26 GB |
+| Hidden size | 384 | 384 |
 | Layers | 22 | 22 |
 | Max sequence length | 8,192 | 8,192 |
 | Sentence pair | Original | Truncated |
 |---|---|---|
+| "Bakı Azərbaycanın paytaxtıdır" ↔ "Baku is the capital of Azerbaijan" | 0.927396 | 0.927396 |
+| "Süni intellekt texnologiyası sürətlə inkişaf edir" ↔ "Artificial intelligence technology is developing rapidly" | 0.926054 | 0.943118 |
+| "Bu gün hava çox gözəldir" ↔ "The weather is very nice today" | 0.937846 | 0.937846 |
 Tokenization output is identical for both languages.
 ```python
 from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained("LocalDoc/mmBERT-small-en-az")
+model = AutoModel.from_pretrained("LocalDoc/mmBERT-small-en-az")
 inputs = tokenizer("Salam, bu gün necəsiniz?", return_tensors="pt")
 outputs = model(**inputs)