Murhaf committed on
Commit 08031f8 · verified · 1 Parent(s): fc7bf8d

Update README.md

Files changed (1):
  1. README.md +38 -129
README.md CHANGED
@@ -1,13 +1,13 @@
  ---
  language:
  - 'no'
  tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - dense
- - generated_from_trainer
- - dataset_size:556367
  - loss:CachedMultipleNegativesRankingLoss
  base_model:
  - ltg/norbert4-base
@@ -49,7 +49,7 @@ library_name: sentence-transformers
  metrics:
  - cosine_accuracy
  model-index:
- - name: SentenceTransformer based on Murhaf/ltg-norbert4-base_ndla
  results:
  - task:
      type: triplet
@@ -64,32 +64,15 @@ model-index:
  license: apache-2.0
  ---

- # SentenceTransformer based on Murhaf/ltg-norbert4-base_ndla

- This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Murhaf/ltg-norbert4-base_ndla](https://huggingface.co/Murhaf/ltg-norbert4-base_ndla) on the [all-nli-norwegian](https://huggingface.co/datasets/Murhaf/all-nli-norwegian) dataset. It maps sentences & paragraphs to a 640-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

- ## Model Details

- ### Model Description
- - **Model Type:** Sentence Transformer
- - **Base model:** [ltg/norbert4-base](https://huggingface.co/ltg/norbert4-base) <!-- at revision 762fb095e1c571e52d8690bf07ec8b65d3551026 -->
- - **Maximum Sequence Length:** 75 tokens
- - **Output Dimensionality:** 640 dimensions
- - **Similarity Function:** Cosine Similarity
- - **Training Dataset:**
-   - [all-nli-norwegian](https://huggingface.co/datasets/Fremtind/all-nli-norwegian)
- - **Language:** no
- <!-- - **License:** Unknown -->

- ### Full Model Architecture

- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 75, 'do_lower_case': False, 'architecture': 'GptBertModel'})
-   (1): Pooling({'word_embedding_dimension': 640, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
- )
- ```

  ## Usage
@@ -101,136 +84,62 @@ First install the Sentence Transformers library:
  pip install -U sentence-transformers
  ```

- Then you can load this model and run inference.
  ```python
  from sentence_transformers import SentenceTransformer

  # Download from the 🤗 Hub
- model = SentenceTransformer("Murhaf/ltg-norbert4-base_ndla-all-nli")
  # Run inference
  sentences = [
-     'En mann lager et sandmaleri gulvet.',
-     'En mann lager kunst.',
-     'En kvinne ødelegger et sandmaleri.',
  ]
  embeddings = model.encode(sentences)
  print(embeddings.shape)
- # [3, 640]

  # Get the similarity scores for the embeddings
  similarities = model.similarity(embeddings, embeddings)
  print(similarities)
- # tensor([[1.0000, 0.5608, 0.3858],
- #         [0.5608, 1.0000, 0.2424],
- #         [0.3858, 0.2424, 1.0000]])
  ```

- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>
-
- </details>
- -->

- <!--
- ### Downstream Usage (Sentence Transformers)
-
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- </details>
- -->

- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->

  ## Evaluation

- ### Metrics

- #### Triplet

- * Dataset: `nob_all_nli_test`
- * Evaluated with [<code>TripletEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.TripletEvaluator)

- | Metric              | Value     |
- |:--------------------|:----------|
- | **cosine_accuracy** | **0.947** |

- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->

- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
 

- ## Training Details

- ### Training Dataset

- #### all-nli-norwegian

- * Dataset: [all-nli-norwegian](https://huggingface.co/datasets/Murhaf/all-nli-norwegian) at [98cabde](https://huggingface.co/datasets/Murhaf/all-nli-norwegian/tree/98cabded09bfe5f505757840026ecdf6a357a04c)
- * Size: 556,367 training samples
- * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
- * Approximate statistics based on the first 1000 samples:
-   |         | anchor | positive | negative |
-   |:--------|:-------|:---------|:---------|
-   | type    | string | string   | string   |
-   | details | <ul><li>min: 6 tokens</li><li>mean: 9.53 tokens</li><li>max: 47 tokens</li></ul> | <ul><li>min: 5 tokens</li><li>mean: 12.03 tokens</li><li>max: 40 tokens</li></ul> | <ul><li>min: 5 tokens</li><li>mean: 12.7 tokens</li><li>max: 49 tokens</li></ul> |
- * Samples:
-   | anchor | positive | negative |
-   |:-------|:---------|:---------|
-   | <code>En person på en hest hopper over et havarert fly.</code> | <code>En person er utendørs, på en hest.</code> | <code>En person er på en diner og bestiller en omelett.</code> |
-   | <code>Barn smiler og vinker til kameraet</code> | <code>Det er barn til stede</code> | <code>Barna rynker pannen</code> |
-   | <code>En gutt hopper på skateboard midt på en rød bro.</code> | <code>Gutten gjør et skateboardtriks.</code> | <code>Gutten skater nedover fortauet.</code> |
- * Loss: [<code>CachedMultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedmultiplenegativesrankingloss) with these parameters:
-   ```json
-   {
-       "scale": 20.0,
-       "similarity_fct": "cos_sim",
-       "mini_batch_size": 32,
-       "gather_across_devices": false
-   }
-   ```

- ### Evaluation Dataset

- #### all-nli-norwegian

- * Dataset: [all-nli-norwegian](https://huggingface.co/datasets/Murhaf/all-nli-norwegian) at [98cabde](https://huggingface.co/datasets/Murhaf/all-nli-norwegian/tree/98cabded09bfe5f505757840026ecdf6a357a04c)
- * Size: 6,561 evaluation samples
- * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
- * Approximate statistics based on the first 1000 samples:
-   |         | anchor | positive | negative |
-   |:--------|:-------|:---------|:---------|
-   | type    | string | string   | string   |
-   | details | <ul><li>min: 5 tokens</li><li>mean: 17.72 tokens</li><li>max: 74 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 8.98 tokens</li><li>max: 31 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 9.5 tokens</li><li>max: 29 tokens</li></ul> |
- * Samples:
-   | anchor | positive | negative |
-   |:-------|:---------|:---------|
-   | <code>To kvinner klemmer mens de holder take-away pakker.</code> | <code>To kvinner holder pakker.</code> | <code>Mennene slåss utenfor en deli.</code> |
-   | <code>To små barn i blå drakter, en med nummer 9 og en med nummer 2, står på trinn i et bad og vasker hendene i en vask.</code> | <code>To barn i nummererte drakter vasker hendene.</code> | <code>To barn i jakker går til skolen.</code> |
-   | <code>En mann selger donuts til en kunde under et verdensutstillingsarrangement holdt i byen Angeles</code> | <code>En mann selger donuts til en kunde.</code> | <code>En kvinne drikker kaffen sin på en liten kafé.</code> |
- * Loss: [<code>CachedMultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedmultiplenegativesrankingloss) with these parameters:
-   ```json
-   {
-       "scale": 20.0,
-       "similarity_fct": "cos_sim",
-       "mini_batch_size": 32,
-       "gather_across_devices": false
-   }
-   ```

  ### Training Hyperparameters
  #### Non-Default Hyperparameters
  ---
  language:
  - 'no'
+ - nn
+ - nb
  tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - dense
  - loss:CachedMultipleNegativesRankingLoss
  base_model:
  - ltg/norbert4-base
 
  metrics:
  - cosine_accuracy
  model-index:
+ - name: SentenceTransformer based on NorBERT4-base
  results:
  - task:
      type: triplet
 
  license: apache-2.0
  ---

+ # SentenceTransformer based on NorBERT4-base

+ NorSBERT4-base is a [Sentence Transformer](https://www.SBERT.net) model finetuned from [ltg/norbert4-base](https://huggingface.co/ltg/norbert4-base).
+ The model maps sentences (and paragraphs) to a 960-dimensional dense vector space and can be used for semantic textual similarity, semantic search,
+ text classification, clustering, and other tasks.

+ Note: the fine-tuned Sentence Transformer is configured with a `max_seq_length` of 75 tokens, but the base model is not limited to that; the sequence length can be increased up to 16384 tokens, the base model's maximum.
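For example, the limit can be raised after loading (a sketch: `512` is an arbitrary choice here, and longer inputs cost more memory at encode time):

```python
from sentence_transformers import SentenceTransformer

# trust_remote_code=True is needed for the custom NorBERT4 wrapper
model = SentenceTransformer("Fremtind/norsbert4-base", trust_remote_code=True)
print(model.max_seq_length)  # 75, the fine-tuning default

# Raise the limit; the base model supports sequences of up to 16384 tokens
model.max_seq_length = 512
```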

  ## Usage

 
  pip install -U sentence-transformers
  ```

+ Then you can load the model and run inference. Note that you must pass `trust_remote_code=True`, because the model needs a custom wrapper (see the [base model](https://huggingface.co/ltg/norbert4-base/blob/main/modeling_gptbert.py) for details).
  ```python
  from sentence_transformers import SentenceTransformer

  # Download from the 🤗 Hub
+ model = SentenceTransformer("Fremtind/norsbert4-base", trust_remote_code=True)
  # Run inference
  sentences = [
+     'To personer, en i lyse jeans og en stripete skjorte, spiller biljard.',
+     'Folk spiller biljard',
+     'folk løper',
  ]
  embeddings = model.encode(sentences)
  print(embeddings.shape)
+ # [3, 960]

  # Get the similarity scores for the embeddings
  similarities = model.similarity(embeddings, embeddings)
  print(similarities)
+ # tensor([[1.0000, 0.7294, 0.1690],
+ #         [0.7294, 1.0000, 0.2412],
+ #         [0.1690, 0.2412, 1.0000]])
  ```

  ## Evaluation

+ To verify the utility of our models, we evaluated them on a selection of classification and clustering tasks for Norwegian from [MTEB v2](https://embeddings-benchmark.github.io/mteb/).

+ The heatmap below shows the results of evaluating five sentence transformers on ten tasks:
+ three models we fine-tuned ([Fremtind/norsbert4-large](https://huggingface.co/Fremtind/norsbert4-large), [Fremtind/norsbert4-base](https://huggingface.co/Fremtind/norsbert4-base), and [Fremtind/mmBERT-base-norwegian](https://huggingface.co/Fremtind/mmBERT-base-norwegian)),
+ and two relatively popular (and comparable) sentence-similarity models ([FFI/SimCSE-NB-BERT-large](https://huggingface.co/FFI/SimCSE-NB-BERT-large) and [NbAiLab/nb-sbert-base](https://huggingface.co/NbAiLab/nb-sbert-base)).

+ ![Heatmap of MTEB v2 task scores for the five evaluated models](https://cdn-uploads.huggingface.co/production/uploads/6179579cf08e328ce6c12c26/5N4zMC8AeETTRRig_UWsh.png)

+ We ranked the models using the **Borda count** (as used in MTEB), where each model earns points based on its relative performance across all evaluated tasks.

+ | Rank | Model | Borda Points |
+ |:----:|:--------------------------------|:-------------:|
+ | 1 | **[Fremtind/norsbert4-large](https://huggingface.co/Fremtind/norsbert4-large)** | **44** |
+ | 2 | [FFI/SimCSE-NB-BERT-large](https://huggingface.co/FFI/SimCSE-NB-BERT-large) | 40 |
+ | 3 | [Fremtind/norsbert4-base](https://huggingface.co/Fremtind/norsbert4-base) | 24 |
+ | 4 | [NbAiLab/nb-sbert-base](https://huggingface.co/NbAiLab/nb-sbert-base) | 15 |
+ | 5 | [Fremtind/mmBERT-base-norwegian](https://huggingface.co/Fremtind/mmBERT-base-norwegian) | 7 |
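As an illustration of the ranking scheme (a self-contained sketch with hypothetical scores, not the actual task results; MTEB's exact tie-handling may differ), a Borda count awards each model one point per task for every competitor it outscores, then sums the points across tasks:

```python
def borda_count(scores: dict[str, list[float]]) -> dict[str, int]:
    """Borda count: on each task, a model earns one point for every
    competitor it outscores; points are then summed across all tasks."""
    models = list(scores)
    n_tasks = len(next(iter(scores.values())))
    points = {m: 0 for m in models}
    for t in range(n_tasks):
        for m in models:
            points[m] += sum(scores[m][t] > scores[o][t] for o in models if o != m)
    return points

# Hypothetical per-task scores for three models on three tasks
scores = {
    "model-a": [0.9, 0.8, 0.7],
    "model-b": [0.6, 0.9, 0.5],
    "model-c": [0.5, 0.4, 0.6],
}
print(sorted(borda_count(scores).items(), key=lambda kv: -kv[1]))
# [('model-a', 5), ('model-b', 3), ('model-c', 1)]
```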
+ ## Training Details

+ The model was fine-tuned in two stages.

+ In the **first stage**, it was trained in an unsupervised manner following the SimCSE method (Gao et al., 2021). In this setup, the same sentence is encoded twice; because dropout is active in training mode, the model produces two slightly different embeddings. The training objective is to minimize the distance between these two embeddings while maximizing the distance to the embeddings of the other sentences in the same batch.
+ For this stage, we created sentence pairs in three categories from the [NDLA Parallel Paragraphs dataset](https://huggingface.co/datasets/NbAiLab/ndla_parallel_paragraphs): (Bokmål, Bokmål), (Nynorsk, Nynorsk), and (Bokmål, Nynorsk). In the (Bokmål, Bokmål) and (Nynorsk, Nynorsk) categories, each sentence was paired with itself, relying on dropout to create embedding variation. In the (Bokmål, Nynorsk) category, cross-lingual sentence pairs were used to align the model's semantic representations across the two written varieties.
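The pair construction can be sketched in plain Python (a hypothetical helper, not the actual preprocessing script; it assumes the Bokmål and Nynorsk lists hold parallel sentences):

```python
def build_simcse_pairs(bokmal: list[str], nynorsk: list[str]) -> list[tuple[str, str]]:
    """Build the three pair categories used for the unsupervised stage.

    (Bokmål, Bokmål) and (Nynorsk, Nynorsk) pair each sentence with itself;
    dropout at training time then yields two slightly different embeddings.
    (Bokmål, Nynorsk) pairs parallel sentences to align the two varieties.
    """
    pairs = [(s, s) for s in bokmal]        # (Bokmål, Bokmål)
    pairs += [(s, s) for s in nynorsk]      # (Nynorsk, Nynorsk)
    pairs += list(zip(bokmal, nynorsk))     # (Bokmål, Nynorsk)
    return pairs

pairs = build_simcse_pairs(["Barna leker i hagen."], ["Borna leikar i hagen."])
print(pairs)
# [('Barna leker i hagen.', 'Barna leker i hagen.'),
#  ('Borna leikar i hagen.', 'Borna leikar i hagen.'),
#  ('Barna leker i hagen.', 'Borna leikar i hagen.')]
```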

+ In the **second stage**, the model was further fine-tuned on a natural language inference dataset, [Fremtind/all-nli-norwegian](https://huggingface.co/datasets/Fremtind/all-nli-norwegian). The dataset is formatted as (anchor, positive, negative) triplets, where the _anchor_ is the premise, the _positive_ is an entailed hypothesis, and the _negative_ is a contradictory hypothesis. The objective is to minimize the distance between the anchor and the positive while maximizing the distance between the anchor and the negative. This stage follows the standard supervised fine-tuning strategy introduced in Sentence-BERT.
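The triplet objective can be illustrated with a toy version of the loss (a sketch using 2-dimensional embeddings and one explicit negative; the actual training uses CachedMultipleNegativesRankingLoss over in-batch negatives with scale 20):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranking_loss(anchor, positive, negative, scale=20.0):
    """Softmax cross-entropy over scaled cosine similarities: the positive
    should score higher than the negative (a toy, single-triplet version
    of the multiple-negatives ranking objective)."""
    sims = [scale * cosine(anchor, positive), scale * cosine(anchor, negative)]
    log_norm = math.log(sum(math.exp(s) for s in sims))
    return log_norm - sims[0]  # negative log-probability of the positive

# Toy 2-d embeddings: a well-aligned positive yields a much lower loss
loss_good = ranking_loss([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])
loss_bad = ranking_loss([1.0, 0.0], [0.0, 1.0], [0.9, 0.1])
print(loss_good < loss_bad)  # True
```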

  ### Training Hyperparameters
  #### Non-Default Hyperparameters