netapy committed on
Commit 8e4ea66 · verified · 1 Parent(s): 1bb338b

Initial commit: Sentence-Transformers mean-pooling wrapper

Files changed (1)
  1. README.md +50 -114
README.md CHANGED
@@ -1,144 +1,80 @@
  ---
  tags:
  - sentence-transformers
- - sentence-similarity
  - feature-extraction
- - dense
- base_model: Mathlesage/euroBertV10
- pipeline_tag: sentence-similarity
- library_name: sentence-transformers
  ---
 
- # SentenceTransformer based on Mathlesage/euroBertV10
-
- This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Mathlesage/euroBertV10](https://huggingface.co/Mathlesage/euroBertV10). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
-
- ## Model Details
-
- ### Model Description
- - **Model Type:** Sentence Transformer
- - **Base model:** [Mathlesage/euroBertV10](https://huggingface.co/Mathlesage/euroBertV10) <!-- at revision 6056a08488ad1c7d39822e6306e086ce83b4a6f0 -->
- - **Maximum Sequence Length:** 8196 tokens
- - **Output Dimensionality:** 768 dimensions
- - **Similarity Function:** Cosine Similarity
- <!-- - **Training Dataset:** Unknown -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->
 
- ### Model Sources
 
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
 
- ### Full Model Architecture
 
- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 8196, 'do_lower_case': False, 'architecture': 'EuroBertModel'})
-   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
- )
- ```
 
- ## Usage
 
- ### Direct Usage (Sentence Transformers)
 
- First install the Sentence Transformers library:
-
- ```bash
  pip install -U sentence-transformers
  ```
 
- Then you can load this model and run inference.
  ```python
  from sentence_transformers import SentenceTransformer
 
- # Download from the 🤗 Hub
- model = SentenceTransformer("sentence_transformers_model_id")
- # Run inference
- sentences = [
-     'The weather is lovely today.',
-     "It's so sunny outside!",
-     'He drove to the stadium.',
- ]
- embeddings = model.encode(sentences)
- print(embeddings.shape)
- # [3, 768]
-
- # Get the similarity scores for the embeddings
- similarities = model.similarity(embeddings, embeddings)
- print(similarities)
- # tensor([[1.0000, 0.5651, 0.0976],
- #         [0.5651, 1.0000, 0.2057],
- #         [0.0976, 0.2057, 1.0000]])
  ```
 
- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>
-
- </details>
- -->
-
- <!--
- ### Downstream Usage (Sentence Transformers)
 
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- </details>
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
-
- ## Training Details
-
- ### Framework Versions
- - Python: 3.9.6
- - Sentence Transformers: 5.1.0
- - Transformers: 4.55.2
- - PyTorch: 2.8.0
- - Accelerate:
- - Datasets: 2.21.0
- - Tokenizers: 0.21.4
 
- ## Citation
 
- ### BibTeX
 
- <!--
- ## Glossary
 
- *Clearly define terms in order to be accessible across audiences.*
- -->
 
- <!--
- ## Model Card Authors
 
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->
 
- <!--
- ## Model Card Contact
 
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->
  ---
+ license: apache-2.0
+ language:
+ - fr
+ - en
+ library_name: sentence-transformers
+ pipeline_tag: sentence-similarity
  tags:
  - sentence-transformers
+ - embeddings
+ - eurobert
+ - multilingual
  - feature-extraction
+ base_model: EuroBERT/EuroBERT-210m
  ---
 
+ # OrdalieTech/Solon-embeddings-mini-beta-1.1
 
+ The original model was built from `EuroBERT/EuroBERT-210m`, then trained with the **InfoNCE** technique on **very high-quality LLM-generated pairs**.
 
 
+ ## Key points
 
+ - **Backbone**: `EuroBERT/EuroBERT-210m`
+ - **Pooling**: mean of token embeddings (CLS disabled, max disabled)
+ - **Dimensions**: 768
+ - **Languages**: multilingual, including French and English
 
+ ## Usage examples
 
+ ### With `sentence-transformers`
 
+ ```bash
  pip install -U sentence-transformers
  ```
 
  ```python
  from sentence_transformers import SentenceTransformer
 
+ model = SentenceTransformer("OrdalieTech/Solon-embeddings-mini-beta-1.1")
+ sentences = ["Ceci est une phrase d'exemple", "Chaque phrase est convertie en vecteur"]
+ embeddings = model.encode(sentences, convert_to_tensor=False, normalize_embeddings=True)
+ print(embeddings[0].shape)  # (768,)
  ```
 
+ ### With `transformers` (feature extraction)
 
+ ```bash
+ pip install -U transformers torch
+ ```
 
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch
 
+ tok = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m", trust_remote_code=True)
+ enc = AutoModel.from_pretrained("EuroBERT/EuroBERT-210m", trust_remote_code=True)
 
+ inputs = tok(["Ceci est une phrase d'exemple"], padding=True, truncation=True, return_tensors="pt")
+ with torch.no_grad():
+     out = enc(**inputs).last_hidden_state  # (batch, seq, 768)
 
+ mask = inputs["attention_mask"].unsqueeze(-1)  # (batch, seq, 1)
+ mean_emb = (out * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
+ ```
 
+ ## Use cases
 
+ - Semantic search
+ - Reranking
+ - Semantic textual similarity (STS)
+ - Content recommendation
+ - Embedding-based classification
 
+ ## Credits and license
 
+ - Base model: [`EuroBERT/EuroBERT-210m`](https://huggingface.co/EuroBERT/EuroBERT-210m), Apache-2.0 license
+ - This release keeps the Apache-2.0 license and complies with the base model's redistribution terms
+ - Thanks to the EuroBERT authors for their work and for releasing the model openly
+ - Created by: @matheoqtb
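
Note (not part of the commit): the masked mean pooling used in the `transformers` example of the new README can be illustrated without downloading any model. The sketch below is a dependency-free toy in plain Python, not the library's implementation. The helper names `masked_mean_pool`, `l2_normalize` and `dot` are hypothetical, and the 3-dimensional vectors stand in for the real 768-dimensional token embeddings.

```python
import math


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    count = max(count, 1)  # same role as clamp(min=1): avoids division by zero
    return [t / count for t in total]


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


# Toy "token embeddings" for one sentence; the last position is padding.
tokens = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)  # padding row is ignored
unit = l2_normalize(pooled)              # analogous to normalize_embeddings=True
print(pooled)
print(dot(unit, unit))
```

Once the vectors are L2-normalized, as `normalize_embeddings=True` requests in the `sentence-transformers` example, cosine similarity reduces to a plain dot product.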