Update README.md
README.md
@@ -4,7 +4,7 @@ license: mit
 language:
 - en
 - az
-base_model: jhu-clsp/mmBERT-
+base_model: jhu-clsp/mmBERT-small
 tags:
 - modernbert
 - multilingual
@@ -17,9 +17,9 @@ tags:
 pipeline_tag: feature-extraction
 ---
 
-# mmBERT-
+# mmBERT-small-en-az
 
-A vocabulary-truncated version of [jhu-clsp/mmBERT-
+A vocabulary-truncated version of [jhu-clsp/mmBERT-small](https://huggingface.co/jhu-clsp/mmBERT-small), optimized for **English** and **Azerbaijani** by removing unused tokens from the 1800+ language vocabulary.
 
 ## What is this model?
 
@@ -32,10 +32,10 @@ This model keeps only the ~72K tokens that actually appear in English and Azerba
 | Metric | Original | Truncated |
 |---|---|---|
 | Vocabulary size | 256,000 | 71,751 |
-| Total parameters |
+| Total parameters | 140.493M | 69.42M |
 | Embedding parameters | 196.6M | 55.1M |
-| Model size (fp32) |
-| Hidden size |
+| Model size (fp32) | 0.52 GB | 0.26 GB |
+| Hidden size | 384 | 384 |
 | Layers | 22 | 22 |
 | Max sequence length | 8,192 | 8,192 |
 
@@ -47,9 +47,9 @@ Cosine similarity between Azerbaijani–English sentence pairs is identical or n
 
 | Sentence pair | Original | Truncated |
 |---|---|---|
-| "Bakı Azərbaycanın paytaxtıdır" ↔ "Baku is the capital of Azerbaijan" | 0.
-| "Süni intellekt texnologiyası sürətlə inkişaf edir" ↔ "Artificial intelligence technology is developing rapidly" | 0.
-| "Bu gün hava çox gözəldir" ↔ "The weather is very nice today" | 0.
+| "Bakı Azərbaycanın paytaxtıdır" ↔ "Baku is the capital of Azerbaijan" | 0.927396 | 0.927396 |
+| "Süni intellekt texnologiyası sürətlə inkişaf edir" ↔ "Artificial intelligence technology is developing rapidly" | 0.926054 | 0.943118 |
+| "Bu gün hava çox gözəldir" ↔ "The weather is very nice today" | 0.937846 | 0.937846 |
 
 Tokenization output is identical for both languages.
 
@@ -58,8 +58,8 @@ Tokenization output is identical for both languages.
 ```python
 from transformers import AutoTokenizer, AutoModel
 
-tokenizer = AutoTokenizer.from_pretrained("LocalDoc/mmBERT-
-model = AutoModel.from_pretrained("LocalDoc/mmBERT-
+tokenizer = AutoTokenizer.from_pretrained("LocalDoc/mmBERT-small-en-az")
+model = AutoModel.from_pretrained("LocalDoc/mmBERT-small-en-az")
 
 inputs = tokenizer("Salam, bu gün necəsiniz?", return_tensors="pt")
 outputs = model(**inputs)
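The figures in the updated metrics table can be sanity-checked with a little arithmetic. The sketch below is not part of the commit. It assumes the token embedding is a single vocabulary × hidden-size lookup table and that "GB" in the table means 2^30 bytes. The README states neither, and the helper names `embedding_params` and `fp32_size_gib` are hypothetical. Under those assumptions the new "Model size (fp32)" values follow from the new parameter counts. The unchanged "Embedding parameters" row (196.6M | 55.1M), however, matches a hidden size of 768 rather than the 384 this commit introduces, so that row may need a follow-up edit.

```python
# Illustrative check of the metrics table; helper names are hypothetical.

def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in a plain vocab x hidden embedding table (assumed layout)."""
    return vocab_size * hidden_size


def fp32_size_gib(num_params: int) -> float:
    """Size in GiB when every parameter is stored as a 4-byte float."""
    return num_params * 4 / 2**30


ORIGINAL_VOCAB = 256_000   # "Vocabulary size", Original column
TRUNCATED_VOCAB = 71_751   # "Vocabulary size", Truncated column

for hidden in (384, 768):
    orig = embedding_params(ORIGINAL_VOCAB, hidden)
    trunc = embedding_params(TRUNCATED_VOCAB, hidden)
    print(f"hidden={hidden}: original={orig / 1e6:.1f}M, truncated={trunc / 1e6:.1f}M")

# Total parameter counts as written in the updated table.
for label, total in (("original", 140_493_000), ("truncated", 69_420_000)):
    print(f"{label}: {fp32_size_gib(total):.2f} GiB in fp32")
```

Running it prints the embedding counts for both candidate hidden sizes next to each other, which makes the mismatch with the table row easy to spot.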