vrashad committed (verified)
Commit 3d9f5be · 1 Parent(s): 57098a8

Update README.md

Files changed (1): README.md +11 -11
README.md CHANGED
@@ -4,7 +4,7 @@ license: mit
 language:
 - en
 - az
-base_model: jhu-clsp/mmBERT-base
+base_model: jhu-clsp/mmBERT-small
 tags:
 - modernbert
 - multilingual
@@ -17,9 +17,9 @@ tags:
 pipeline_tag: feature-extraction
 ---
 
-# mmBERT-base-en-az
+# mmBERT-small-en-az
 
-A vocabulary-truncated version of [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base), optimized for **English** and **Azerbaijani** by removing unused tokens from the 1800+ language vocabulary.
+A vocabulary-truncated version of [jhu-clsp/mmBERT-small](https://huggingface.co/jhu-clsp/mmBERT-small), optimized for **English** and **Azerbaijani** by removing unused tokens from the 1800+ language vocabulary.
 
 ## What is this model?
 
@@ -32,10 +32,10 @@ This model keeps only the ~72K tokens that actually appear in English and Azerba
 | Metric | Original | Truncated |
 |---|---|---|
 | Vocabulary size | 256,000 | 71,751 |
-| Total parameters | 306.9M | 165.4M |
-| Embedding parameters | 196.6M | 55.1M |
-| Model size (fp32) | 1.14 GB | 0.62 GB |
-| Hidden size | 768 | 768 |
+| Total parameters | 140.5M | 69.4M |
+| Embedding parameters | 98.3M | 27.6M |
+| Model size (fp32) | 0.52 GB | 0.26 GB |
+| Hidden size | 384 | 384 |
 | Layers | 22 | 22 |
 | Max sequence length | 8,192 | 8,192 |
 
@@ -47,9 +47,9 @@ Cosine similarity between Azerbaijani–English sentence pairs is identical or n
 
 | Sentence pair | Original | Truncated |
 |---|---|---|
-| "Bakı Azərbaycanın paytaxtıdır" ↔ "Baku is the capital of Azerbaijan" | 0.7718 | 0.7718 |
-| "Süni intellekt texnologiyası sürətlə inkişaf edir" ↔ "Artificial intelligence technology is developing rapidly" | 0.7626 | 0.7792 |
-| "Bu gün hava çox gözəldir" ↔ "The weather is very nice today" | 0.8285 | 0.8285 |
+| "Bakı Azərbaycanın paytaxtıdır" ↔ "Baku is the capital of Azerbaijan" | 0.927396 | 0.927396 |
+| "Süni intellekt texnologiyası sürətlə inkişaf edir" ↔ "Artificial intelligence technology is developing rapidly" | 0.926054 | 0.943118 |
+| "Bu gün hava çox gözəldir" ↔ "The weather is very nice today" | 0.937846 | 0.937846 |
 
 Tokenization output is identical for both languages.
 
@@ -58,8 +58,8 @@ Tokenization output is identical for both languages.
 ```python
 from transformers import AutoTokenizer, AutoModel
 
-tokenizer = AutoTokenizer.from_pretrained("LocalDoc/mmBERT-base-en-az")
-model = AutoModel.from_pretrained("LocalDoc/mmBERT-base-en-az")
+tokenizer = AutoTokenizer.from_pretrained("LocalDoc/mmBERT-small-en-az")
+model = AutoModel.from_pretrained("LocalDoc/mmBERT-small-en-az")
 
 inputs = tokenizer("Salam, bu gün necəsiniz?", return_tensors="pt")
 outputs = model(**inputs)
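The README describes vocabulary truncation: keep only the embedding rows for tokens that occur in the target languages, then re-index the tokenizer with the same old-to-new id mapping. A minimal sketch of that idea, with a tiny illustrative id list in place of the real ~72K kept tokens (the actual conversion script is not published here, so names and sizes below are assumptions except for the 256,000 × 384 embedding shape from the table):

```python
import torch

# Full embedding matrix: vocab_size x hidden_size, per the model card table.
full_embeddings = torch.randn(256_000, 384)

# Stand-in for the sorted ids of tokens kept for English + Azerbaijani.
kept_token_ids = torch.tensor([0, 1, 2, 5, 9])

# Row i of the truncated matrix is the embedding of new token id i,
# so the tokenizer vocabulary must be re-indexed with the same mapping.
truncated = full_embeddings[kept_token_ids]
old_to_new = {int(old): new for new, old in enumerate(kept_token_ids.tolist())}

print(truncated.shape)  # torch.Size([5, 384])
```

With the real kept set this shrinks the embedding table from 256,000 × 384 (~98.3M parameters) to 71,751 × 384 (~27.6M), which accounts for the entire parameter reduction; the transformer layers are untouched.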
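The similarity table was presumably produced by encoding each sentence, pooling the token embeddings into one vector, and taking cosine similarity; the card does not state the pooling method, so mean pooling here is an assumption. A self-contained sketch with random stand-in tensors in place of model outputs (hidden size 384 per the table):

```python
import torch
import torch.nn.functional as F

def mean_pool(last_hidden_state, attention_mask):
    # Average token embeddings, ignoring padding positions.
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# Stand-ins for model(**inputs).last_hidden_state of the two sentences:
# shape (batch, seq_len, hidden_size).
az_hidden = torch.randn(1, 7, 384)
en_hidden = torch.randn(1, 8, 384)

az_vec = mean_pool(az_hidden, torch.ones(1, 7))
en_vec = mean_pool(en_hidden, torch.ones(1, 8))

similarity = F.cosine_similarity(az_vec, en_vec).item()
```

Running the same computation on both the original and the truncated model is what the table compares; identical scores are expected whenever a sentence uses only kept tokens, since those embedding rows are copied unchanged.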