vlhandfo committed on
Commit 2e5f8c6 · 1 Parent(s): 009a431

[README] Add usage, evaluation metrics, and expand.

Files changed (1): README.md (+75 −29)

README.md CHANGED
@@ -68,7 +68,9 @@ model-index:
 
 # SentenceTransformer based on NbAiLab/nb-bert-base
 
-This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [NbAiLab/nb-bert-base](https://huggingface.co/NbAiLab/nb-bert-base). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. The easiest way is to simply measure the cosine distance between two sentences. Sentences that are close to each other in meaning, will have a small cosine distance and a similarity close to 1. The model is trained in such a way that similar sentences in different languages should also be close to each other. Ideally, an English-Norwegian sentence pair should have high similarity.
 
 ## Model Details
 
@@ -81,9 +83,10 @@ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [N
 - **Training Dataset:** Subset of [NbAiLab/mnli-norwegian](https://huggingface.co/datasets/NbAiLab/mnli-norwegian)
 - **Language:** Norwegian and English
 - **License:** Apache 2.0
 ### EU AI Act
 
-This release is a non-generative encoder model whose outputs are vectors/scores rather than language or media. Its intended functionality is limited to representation, retrieval, ranking, or classification support. On that basis, the release is preliminarily assessed as not falling within the provider obligations for GPAI models under the EU AI Act definitions, subject to legal confirmation if capability scope or marketed generality changes.
 
 ### Model Sources
 
@@ -118,29 +121,67 @@ from sentence_transformers import SentenceTransformer
 model = SentenceTransformer("NbAiLab/nb-sbert-v2-base")
 # Run inference
 sentences = [
-    'While Queen may refer to both Queen regent (sovereign) or Queen consort, the King has always been the sovereign.',
-    'There is a very good reason not to refer to the Queen\'s spouse as "King" - because they aren\'t the King.',
-    'A man sitting on the floor in a room is strumming a guitar.',
-]
 embeddings = model.encode(sentences)
 print(embeddings.shape)
-# [3, 768]
 
 # Get the similarity scores for the embeddings
 similarities = model.similarity(embeddings, embeddings)
 print(similarities)
-# tensor([[ 1.0000,  0.5028, -0.1004],
-#         [ 0.5028,  1.0000, -0.0914],
-#         [-0.1004, -0.0914,  1.0000]])
 ```
 
-<!--
 ### Direct Usage (Transformers)
 
 <details><summary>Click to see the direct usage in Transformers</summary>
 
 </details>
--->
 
 <!--
 ### Downstream Usage (Sentence Transformers)
@@ -167,21 +208,22 @@ You can finetune this model on your own dataset.
 * Dataset: [STS Benchmark](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/stsbenchmark.tsv.gz)
 * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
 
-| Metric              | Value      |
-|:--------------------|:-----------|
-| pearson_cosine      | 0.8478     |
-| **spearman_cosine** | **0.8495** |
 
 #### [MTEB (Scandinavian)](https://embeddings-benchmark.github.io/mteb/)
 
-| Metric              | Value |
-|:--------------------|:------|
-| Mean (Task)         |       |
-| Mean (TaskType)     |       |
-| Bitext Mining       |       |
-| Classification      |       |
-| Clustering          |       |
-| Retrieval           |       |
 
 <!--
 ## Bias, Risks and Limitations
@@ -214,6 +256,8 @@ You can finetune this model on your own dataset.
 | <code>Det som følger er mindre en glid nedover en glatt skråning enn et profesjonelt skred som resulterer i enten en oppsigelse eller en smal flukt til neste drømmejobb, der, selvfølgelig, syklusen gjentas igjen.</code> | <code>Syklusen gjentar seg ved neste jobb.</code> | <code>Syklusen gjentar seg sjelden ved neste jobb.</code> |
 | <code>Syklusen gjentar seg ved neste jobb.</code> | <code>Det som følger er mindre en glid nedover en glatt skråning enn et profesjonelt skred som resulterer i enten en oppsigelse eller en smal flukt til neste drømmejobb, der, selvfølgelig, syklusen gjentas igjen.</code> | <code>Syklusen gjentar seg sjelden ved neste jobb.</code> |
 | <code>The public areas are spectacular, the rooms a bit less so, but a long-awaited renovation was carried out in 1998.</code> | <code>The rooms are nice, but the public area is in a league of it's own.</code> | <code>The public area was fine, but the rooms were really something else.</code> |
 
 * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
 ```json
 {
@@ -366,6 +410,8 @@ You can finetune this model on your own dataset.
 </details>
 
 ### Training Logs
 | Epoch  | Step | Training Loss | sts-dev_spearman_cosine |
 |:------:|:----:|:-------------:|:-----------------------:|
 | 0.0243 | 100  | 1.8923        | -                       |
@@ -418,7 +464,7 @@ You can finetune this model on your own dataset.
 | 0.9471 | 3900 | 0.3246        | -                       |
 | 0.9713 | 4000 | 0.3215        | -                       |
 | 0.9956 | 4100 | 0.3143        | -                       |
-
 
 ### Framework Versions
 - Python: 3.14.3
@@ -464,11 +510,11 @@ You can finetune this model on your own dataset.
 *Clearly define terms in order to be accessible across audiences.*
 -->
 
-<!--
-## Model Card Authors
 
-*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
--->
 
 <!--
 ## Model Card Contact
 
 
 # SentenceTransformer based on NbAiLab/nb-bert-base
 
+This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [NbAiLab/nb-bert-base](https://huggingface.co/NbAiLab/nb-bert-base).
+
+The model maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. The easiest way to compare two sentences is to measure the cosine distance between their embeddings: sentences that are close in meaning have a small cosine distance and a cosine similarity close to 1. The model is trained so that similar sentences in different languages are also close to each other; ideally, an English-Norwegian sentence pair with the same meaning should have high similarity.
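The distance-versus-similarity relationship described above can be sketched with plain numpy; the vectors here are illustrative stand-ins, not real model embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity of two vectors; cosine distance is 1 - similarity.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative stand-ins: a near-paraphrase pair and an unrelated sentence
en = np.array([0.9, 0.1, 0.4])
no = np.array([0.8, 0.2, 0.4])       # similar meaning, another language
other = np.array([-0.2, 0.9, -0.3])  # unrelated meaning

print(round(cosine_similarity(en, no), 4))     # close to 1 -> small distance
print(round(cosine_similarity(en, other), 4))  # far from 1 -> large distance
```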
 
 ## Model Details
 
 - **Training Dataset:** Subset of [NbAiLab/mnli-norwegian](https://huggingface.co/datasets/NbAiLab/mnli-norwegian)
 - **Language:** Norwegian and English
 - **License:** Apache 2.0
+
 ### EU AI Act
 
+This release is a **non-generative encoder model** whose outputs are vectors/scores rather than language or media. Its intended functionality is limited to representation, retrieval, ranking, or classification support. On that basis, the release is preliminarily assessed as not falling within the provider obligations for GPAI models under the EU AI Act definitions, subject to legal confirmation if capability scope or marketed generality changes.
 
 ### Model Sources
 
 model = SentenceTransformer("NbAiLab/nb-sbert-v2-base")
 # Run inference
 sentences = [
+    "This is a Norwegian boy",
+    "Dette er en norsk gutt"
+]
+
 embeddings = model.encode(sentences)
 print(embeddings.shape)
+# (2, 768)
 
 # Get the similarity scores for the embeddings
 similarities = model.similarity(embeddings, embeddings)
 print(similarities)
+# tensor([[1.0000, 0.8287],
+#         [0.8287, 1.0000]])
 ```
 
 
 ### Direct Usage (Transformers)
 
+Without [sentence-transformers](https://www.SBERT.net), you can still use the model. First, pass your input through the transformer model; then apply the right pooling operation on top of the contextualized word embeddings.
+
 <details><summary>Click to see the direct usage in Transformers</summary>
 
+```python
+import torch
+
+from sklearn.metrics.pairwise import cosine_similarity
+from transformers import AutoTokenizer, AutoModel
+
+# Mean pooling - take the attention mask into account for correct averaging
+def mean_pooling(model_output, attention_mask):
+    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
+    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+
+# Sentences we want sentence embeddings for
+sentences = ["This is a Norwegian boy", "Dette er en norsk gutt"]
+
+# Load model from HuggingFace Hub
+tokenizer = AutoTokenizer.from_pretrained('NbAiLab/nb-sbert-v2-base')
+model = AutoModel.from_pretrained('NbAiLab/nb-sbert-v2-base')
+
+# Tokenize sentences
+encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+
+# Compute token embeddings
+with torch.no_grad():
+    model_output = model(**encoded_input)
+
+# Perform pooling. In this case, mean pooling.
+embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+print(embeddings.shape)
+# torch.Size([2, 768])
+
+similarity = cosine_similarity(embeddings[0].reshape(1, -1), embeddings[1].reshape(1, -1))
+print(similarity)
+# This should give 0.8287 in the example above.
+```
+
 </details>
+
 
 <!--
 ### Downstream Usage (Sentence Transformers)
 
 * Dataset: [STS Benchmark](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/stsbenchmark.tsv.gz)
 * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
 
+| Metric          | nb-sbert-base | nb-sbert-v2-base |
+|:----------------|:--------------|:-----------------|
+| pearson_cosine  | 0.8275        | **0.8478**       |
+| spearman_cosine | 0.8245        | **0.8495**       |
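For context, spearman_cosine is the Spearman rank correlation between the model's cosine similarities and the gold STS labels (pearson_cosine is the linear analogue). A minimal sketch of how EmbeddingSimilarityEvaluator arrives at such numbers, using illustrative stand-in vectors and labels rather than real model outputs or benchmark data:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in embedding pairs and STS-style gold labels (0-5); both are
# illustrative assumptions, not real embeddings or STS Benchmark entries.
pairs = [
    (np.array([1.0, 0.0]), np.array([1.0, 0.1])),  # near-duplicates
    (np.array([1.0, 0.5]), np.array([0.5, 1.0])),  # loosely related
    (np.array([1.0, 0.0]), np.array([0.0, 1.0])),  # unrelated
]
gold = [5.0, 2.5, 0.0]

# Correlate predicted cosine similarities with the gold labels
pred = [cosine(a, b) for a, b in pairs]
print(pearsonr(gold, pred)[0], spearmanr(gold, pred)[0])
```

The table above then simply reports these two correlations for each model version on the real benchmark.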
 
 #### [MTEB (Scandinavian)](https://embeddings-benchmark.github.io/mteb/)
 
+| Metric              | nb-sbert-base | nb-sbert-v2-base |
+|:--------------------|:--------------|:-----------------|
+| **Mean (Task)**     | 0.5190        | **0.5496**       |
+| **Mean (TaskType)** | 0.5394        | **0.5690**       |
+| &emsp;              | &emsp;        | &emsp;           |
+| Bitext Mining       | 0.7228        | **0.7275**       |
+| Classification      | 0.5708        | **0.5841**       |
+| Clustering          | 0.3798        | **0.4105**       |
+| Retrieval           | 0.4840        | **0.5540**       |
 
 <!--
 ## Bias, Risks and Limitations
 
 | <code>Det som følger er mindre en glid nedover en glatt skråning enn et profesjonelt skred som resulterer i enten en oppsigelse eller en smal flukt til neste drømmejobb, der, selvfølgelig, syklusen gjentas igjen.</code> | <code>Syklusen gjentar seg ved neste jobb.</code> | <code>Syklusen gjentar seg sjelden ved neste jobb.</code> |
 | <code>Syklusen gjentar seg ved neste jobb.</code> | <code>Det som følger er mindre en glid nedover en glatt skråning enn et profesjonelt skred som resulterer i enten en oppsigelse eller en smal flukt til neste drømmejobb, der, selvfølgelig, syklusen gjentas igjen.</code> | <code>Syklusen gjentar seg sjelden ved neste jobb.</code> |
 | <code>The public areas are spectacular, the rooms a bit less so, but a long-awaited renovation was carried out in 1998.</code> | <code>The rooms are nice, but the public area is in a league of it's own.</code> | <code>The public area was fine, but the rooms were really something else.</code> |
+| <code>Ah, but he had no opportunity.</code> | <code>Han hadde ikke sjansen til å gjøre noe.</code> | <code>Han hadde mange muligheter.</code> |
+
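The anchor/positive/negative triplets above feed MultipleNegativesRankingLoss, which scores each anchor against every positive in the batch and treats the non-matching ones as negatives. A toy numpy sketch of the objective; the batch, vectors, and scale=20.0 are illustrative assumptions, not the actual training configuration:

```python
import numpy as np

def mnrl_loss(anchors, positives, scale=20.0):
    """Toy MultipleNegativesRankingLoss: each anchor should rank its own
    positive above every other positive in the batch (in-batch negatives)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)                            # (batch, batch) cosine scores
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerically stable softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # diagonal = matching pairs

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
noise = 0.01 * rng.normal(size=(4, 8))
aligned = mnrl_loss(anchors, anchors + noise)                 # matched pairs: low loss
mismatched = mnrl_loss(anchors, np.roll(anchors, 1, axis=0))  # shuffled pairs: high loss
print(aligned < mismatched)  # True
```

With explicit triplets like those above, the hard negatives in the third column are typically appended as additional candidates alongside the in-batch ones.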
 * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
 ```json
 {
 
 </details>
 
 ### Training Logs
+
+<details><summary>Click to expand</summary>
+
 | Epoch  | Step | Training Loss | sts-dev_spearman_cosine |
 |:------:|:----:|:-------------:|:-----------------------:|
 | 0.0243 | 100  | 1.8923        | -                       |
 
 | 0.9471 | 3900 | 0.3246        | -                       |
 | 0.9713 | 4000 | 0.3215        | -                       |
 | 0.9956 | 4100 | 0.3143        | -                       |
+</details>
 
 ### Framework Versions
 - Python: 3.14.3
 
 *Clearly define terms in order to be accessible across audiences.*
 -->
 
+## Citing & Authors
+
+The model was trained by Victoria Handford and Lucas Georges Gabriel Charpentier. The documentation was initially autogenerated by the SentenceTransformers library and then revised by Victoria Handford, Lucas Georges Gabriel Charpentier, and Javier de la Rosa.
+
 
 <!--
 ## Model Card Contact