lbourdois committed · Commit 8cf58be · verified · 1 Parent(s): 2e8ad0f

Update model card for Samoan

Files changed (1)
  1. README.md +70 -48
README.md CHANGED
@@ -1,48 +1,70 @@
- ---
- pipeline_tag: sentence-similarity
- language: smo
- license: mit
- tags:
- - trimmed
- library_name: sentence-transformers
- base_model: intfloat/multilingual-e5-large
- base_model_relation: quantized
- datasets:
- - Lumberjackk/fineweb-2-trimming
- ---
-
- # multilingual-e5-large-smo-32768
-
- This model is a 39.7% smaller version of [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)
- optimized for 32768 language via vocabulary pruning.
-
- **Total vocabulary size**: 32768 tokens (reduced from 250002)
- **Tokenizer type**: Unigram
- **Training samples per language**: 200000 texts
- **Dataset**: [Lumberjackk/fineweb-2-trimming](https://huggingface.co/datasets/Lumberjackk/fineweb-2-trimming)
-
- ## Language Distribution
-
- - **smo**: 32768 tokens
-
- This pruned model should perform similarly to the original model for 32768 with a much smaller
- memory footprint. However, it may not perform well for other languages as tokens not commonly used in the selected
- languages were removed from the vocabulary.
-
- ## Usage
-
- You can use this model with the Transformers library:
- ```python
- from transformers import AutoModel, AutoTokenizer
-
- model_name = "Lumberjackk/multilingual-e5-large-smo-32768"
- model = AutoModel.from_pretrained(model_name)
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- ```
-
- ## Model Statistics
-
- - **Original model size**: 559.9M parameters
- - **Pruned model size**: 337.4M parameters
- - **Size reduction**: 39.7%
- - **Vocabulary reduction**: 86.9%
+ ---
+ pipeline_tag: sentence-similarity
+ language: smo
+ license: mit
+ tags:
+ - trimmed
+ library_name: sentence-transformers
+ base_model: intfloat/multilingual-e5-large
+ base_model_relation: quantized
+ datasets:
+ - lbourdois/fineweb-2-trimming
+ ---
+
+ # multilingual-e5-large-smo-32768
+ This model is a **39.73% smaller** version of [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) optimized for **Samoan** language via vocabulary size reduction using the [trimming](https://huggingface.co/blog/lbourdois/introduction-to-trimming) method.
+ This trimmed model should perform similarly to the original model with only 32,768 tokens and a much smaller memory footprint. However, it may not perform well for other languages as tokens not commonly used in the selected languages were removed from the vocabulary.
+
+ ## Model Statistics
+ | Metric | Original | Trimmed | Reduction |
+ |--------|----------|---------|-----------|
+ | **Vocabulary size** | 250,037 tokens | 32,768 tokens | **86.89%** |
+ | **Model size** | 559,890,432 params | 337,442,816 params | **39.73%** |
+
+ ![image](https://raw.githubusercontent.com/lbourdois/blog/refs/heads/master/assets/images/Trimming/me5-large-32768.png)
+
+ ## Mining Dataset Statistics
+ - **Number of texts used for mining**: 106,185 texts
+ - **Dataset**: [lbourdois/fineweb-2-trimming](https://huggingface.co/datasets/lbourdois/fineweb-2-trimming)
+
+ ## Usage
+ ```python
+ from sentence_transformers import SentenceTransformer
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("alphaedge-ai/multilingual-e5-large-smo-32768")
+ # Run inference with queries and documents
+ query = "My query in Samoan"
+ documents = [
+ "Chunk in Samoan",
+ "Chunk in Samoan",
+ "Chunk in Samoan",
+ ]
+ query_embeddings = model.encode_query(query)
+ document_embeddings = model.encode_document(documents)
+ print(query_embeddings.shape, document_embeddings.shape)
+ # Compute similarities to determine a ranking
+ similarities = model.similarity(query_embeddings, document_embeddings)
+ print(similarities)
+ ```
+
+ ## Citations
+
+ #### Multilingual E5
+ ```
+ @article{wang2024multilingual,
+ title={Multilingual E5 Text Embeddings: A Technical Report},
+ author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
+ journal={arXiv preprint arXiv:2402.05672},
+ year={2024}
+ }
+ ```
+
+ #### Trimming blog post
+ ```
+ @misc{hf_blogpost_trimming,
+ title={Introduction to Trimming},
+ author={Loïck BOURDOIS and Tom AARSEN and Bram VANROY and Christopher AKIKI and Woojun JUNG and Manuel ROMERO and Prithiv SAKTHI},
+ year={2026},
+ url={https://huggingface.co/blog/lbourdois/introduction-to-trimming},
+ }
+ ```
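
Reviewer note: the reduction percentages in the new Model Statistics table can be rechecked from the raw counts the card reports. The sketch below is not part of the commit. The helper name `reduction_pct` is hypothetical, and the hidden size of 1024 is an assumption about the base model's embedding width rather than something stated in this diff.

```python
def reduction_pct(original: int, trimmed: int) -> float:
    """Relative size reduction, expressed as a percentage of the original."""
    return 100.0 * (original - trimmed) / original


# Counts copied from the new Model Statistics table.
vocab_reduction = reduction_pct(250_037, 32_768)
param_reduction = reduction_pct(559_890_432, 337_442_816)
print(f"vocabulary: {vocab_reduction:.2f}%  parameters: {param_reduction:.2f}%")

# Assumption: the removed parameters are all embedding rows of width 1024.
HIDDEN_SIZE = 1024
removed_params = 559_890_432 - 337_442_816
removed_rows, remainder = divmod(removed_params, HIDDEN_SIZE)
print(f"removed embedding rows: {removed_rows} (remainder {remainder})")
```

Under that assumption the parameter difference corresponds to 217,234 removed embedding rows, which equals 250,002 − 32,768 (the original vocabulary size given in the old card) rather than 250,037 − 32,768. Both figures round to the same 86.89% vocabulary reduction, so the table's percentages hold either way, but the original vocabulary size may be worth a second look.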