[README] Add usage, evaluation metrics, and expand.
README.md
# SentenceTransformer based on NbAiLab/nb-bert-base

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [NbAiLab/nb-bert-base](https://huggingface.co/NbAiLab/nb-bert-base).

The model maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. The simplest approach is to measure the cosine distance between two sentences: sentences that are close in meaning have a small cosine distance and a similarity close to 1. The model is trained so that similar sentences in different languages are also close to each other; ideally, an English-Norwegian sentence pair should have high similarity.
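The cosine measure described above can be illustrated without the model itself. The following is a minimal sketch using made-up 3-dimensional vectors (the real model emits 768-dimensional embeddings); the numbers are illustrative only:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for embeddings of a close English/Norwegian pair vs. unrelated sentences
close = cosine_similarity(np.array([1.0, 0.9, 0.1]), np.array([1.0, 1.0, 0.0]))
far = cosine_similarity(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))

print(round(close, 3))  # close to 1 for similar sentences
print(far)              # 0.0 for orthogonal vectors
```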
## Model Details

- **Training Dataset:** Subset of [NbAiLab/mnli-norwegian](https://huggingface.co/datasets/NbAiLab/mnli-norwegian)
- **Language:** Norwegian and English
- **License:** Apache 2.0

### EU AI Act

This release is a **non-generative encoder model** whose outputs are vectors/scores rather than language or media. Its intended functionality is limited to representation, retrieval, ranking, or classification support. On that basis, the release is preliminarily assessed as not falling within the provider obligations for GPAI models under the EU AI Act definitions, subject to legal confirmation if capability scope or marketed generality changes.

### Model Sources
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("NbAiLab/nb-sbert-v2-base")
# Run inference
sentences = [
    "This is a Norwegian boy",
    "Dette er en norsk gutt",
]

embeddings = model.encode(sentences)
print(embeddings.shape)
# (2, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.8287],
#         [0.8287, 1.0000]])
```
### Direct Usage (Transformers)

Without [sentence-transformers](https://www.SBERT.net), you can still use the model: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

<details><summary>Click to see the direct usage in Transformers</summary>

```python
import torch

from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel

# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["This is a Norwegian boy", "Dette er en norsk gutt"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('NbAiLab/nb-sbert-v2-base')
model = AutoModel.from_pretrained('NbAiLab/nb-sbert-v2-base')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print(embeddings.shape)
# torch.Size([2, 768])

similarity = cosine_similarity(embeddings[0].reshape(1, -1), embeddings[1].reshape(1, -1))
print(similarity)
# This should give 0.8287 in the example above.
```

</details>

<!--
### Downstream Usage (Sentence Transformers)
* Dataset: [STS Benchmark](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/stsbenchmark.tsv.gz)
* Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

| Metric          | nb-sbert-base | nb-sbert-v2-base |
|:----------------|:--------------|:-----------------|
| pearson_cosine  | 0.8275        | **0.8478**       |
| spearman_cosine | 0.8245        | **0.8495**       |

#### [MTEB (Scandinavian)](https://embeddings-benchmark.github.io/mteb/)

| Metric              | nb-sbert-base | nb-sbert-v2-base |
|:--------------------|:--------------|:-----------------|
| **Mean (Task)**     | 0.5190        | **0.5496**       |
| **Mean (TaskType)** | 0.5394        | **0.5690**       |
|                     |               |                  |
| Bitext Mining       | 0.7228        | **0.7275**       |
| Classification      | 0.5708        | **0.5841**       |
| Clustering          | 0.3798        | **0.4105**       |
| Retrieval           | 0.4840        | **0.5540**       |
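For reference, the two STS metrics above are plain correlation coefficients between gold similarity judgments and the model's cosine scores. The following is a minimal sketch of that computation with made-up 2-dimensional embeddings standing in for real model outputs (it is not a reproduction of the benchmark run):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

gold = np.array([0.9, 0.5, 0.1])  # human similarity judgments for three pairs
emb1 = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])  # embeddings of first sentences
emb2 = np.array([[0.9, 0.1], [0.5, 0.5], [0.0, 1.0]])  # embeddings of second sentences

# Cosine similarity of each pair
cos = np.sum(emb1 * emb2, axis=1) / (
    np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
)

pearson = pearsonr(gold, cos)[0]
spearman = spearmanr(gold, cos)[0]
print(round(spearman, 4))  # 1.0: the cosine ranking matches the gold ranking exactly
print(round(pearson, 4))
```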
<!--
## Bias, Risks and Limitations
| <code>Det som følger er mindre en glid nedover en glatt skråning enn et profesjonelt skred som resulterer i enten en oppsigelse eller en smal flukt til neste drømmejobb, der, selvfølgelig, syklusen gjentas igjen.</code> | <code>Syklusen gjentar seg ved neste jobb.</code> | <code>Syklusen gjentar seg sjelden ved neste jobb.</code> |
| <code>Syklusen gjentar seg ved neste jobb.</code> | <code>Det som følger er mindre en glid nedover en glatt skråning enn et profesjonelt skred som resulterer i enten en oppsigelse eller en smal flukt til neste drømmejobb, der, selvfølgelig, syklusen gjentas igjen.</code> | <code>Syklusen gjentar seg sjelden ved neste jobb.</code> |
| <code>The public areas are spectacular, the rooms a bit less so, but a long-awaited renovation was carried out in 1998.</code> | <code>The rooms are nice, but the public area is in a league of it's own.</code> | <code>The public area was fine, but the rooms were really something else.</code> |
| <code>Ah, but he had no opportunity.</code> | <code>Han hadde ikke sjansen til å gjøre noe.</code> | <code>Han hadde mange muligheter.</code> |

* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
      ...
  }
  ```
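To make the loss concrete: MultipleNegativesRankingLoss treats the other sentences in a batch as negatives, scores each anchor against all candidates with scaled cosine similarity, and trains the matching positive to win a softmax. The following is a self-contained sketch of that objective with made-up 2-dimensional vectors, not the actual training script; the scale of 20 is the library's documented default:

```python
import torch
import torch.nn.functional as F

# Made-up unit embeddings: two anchors and their matching positives
anchors = F.normalize(torch.tensor([[1.0, 0.0], [0.0, 1.0]]), dim=1)
positives = F.normalize(torch.tensor([[0.9, 0.1], [0.1, 0.9]]), dim=1)

scale = 20.0                            # default similarity scale of the loss
logits = scale * anchors @ positives.T  # cosine scores: row i vs. every candidate
labels = torch.arange(len(anchors))     # the positive for anchor i sits at column i
loss = F.cross_entropy(logits, labels)  # in-batch softmax classification

print(round(loss.item(), 4))  # 0.0: well-separated pairs give near-zero loss
```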
</details>

### Training Logs

<details><summary>Click to expand</summary>

| Epoch  | Step | Training Loss | sts-dev_spearman_cosine |
|:------:|:----:|:-------------:|:-----------------------:|
| 0.0243 | 100  | 1.8923        | -                       |
| ...    | ...  | ...           | ...                     |
| 0.9471 | 3900 | 0.3246        | -                       |
| 0.9713 | 4000 | 0.3215        | -                       |
| 0.9956 | 4100 | 0.3143        | -                       |

</details>
### Framework Versions
- Python: 3.14.3
*Clearly define terms in order to be accessible across audiences.*
-->

## Citing & Authors

The model was trained by Victoria Handford and Lucas Georges Gabriel Charpentier. The documentation was initially autogenerated by the SentenceTransformers library and then revised by Victoria Handford, Lucas Georges Gabriel Charpentier, and Javier de la Rosa.

<!--
## Model Card Contact