Update README.md

README.md CHANGED

@@ -68,7 +68,7 @@ model-index:
 
 # SentenceTransformer based on NbAiLab/nb-bert-base
 
-This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [NbAiLab/nb-bert-base](https://huggingface.co/NbAiLab/nb-bert-base).
+This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [NbAiLab/nb-bert-base](https://huggingface.co/NbAiLab/nb-bert-base). It is the second version of the existing [NbAiLab/nb-sbert-base](https://huggingface.co/NbAiLab/nb-sbert-base) model, providing a larger max sequence length for inputs.
 
 The model maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. The easiest way is to simply measure the cosine distance between two sentences. Sentences that are close to each other in meaning will have a small cosine distance and a similarity close to 1. The model is trained in such a way that similar sentences in different languages should also be close to each other. Ideally, an English-Norwegian sentence pair should have high similarity.
 

@@ -516,13 +516,9 @@ You can finetune this model on your own dataset.
     year = {2021},
     address = {Reykjavik, Iceland (Online)},
     publisher = {Linköping University Electronic Press, Sweden},
-    url = {https://
+    url = {https://huggingface.co/papers/2104.09617},
     pages = {20--29},
-    abstract = {In this work, we show the process of building a large-scale training set from digital and digitized collections at a national library.
-    The resulting Bidirectional Encoder Representations from Transformers (BERT)-based language model for Norwegian outperforms multilingual BERT (mBERT) models
-    in several token and sequence classification tasks for both Norwegian Bokmål and Norwegian Nynorsk. Our model also improves the mBERT performance for other
-    languages present in the corpus such as English, Swedish, and Danish. For languages not included in the corpus, the weights degrade moderately while keeping strong multilingual properties. Therefore,
-    we show that building high-quality models within a memory institution using somewhat noisy optical character recognition (OCR) content is feasible, and we hope to pave the way for other memory institutions to follow.},
+    abstract = {In this work, we show the process of building a large-scale training set from digital and digitized collections at a national library. The resulting Bidirectional Encoder Representations from Transformers (BERT)-based language model for Norwegian outperforms multilingual BERT (mBERT) models in several token and sequence classification tasks for both Norwegian Bokmål and Norwegian Nynorsk. Our model also improves the mBERT performance for other languages present in the corpus such as English, Swedish, and Danish. For languages not included in the corpus, the weights degrade moderately while keeping strong multilingual properties. Therefore, we show that building high-quality models within a memory institution using somewhat noisy optical character recognition (OCR) content is feasible, and we hope to pave the way for other memory institutions to follow.},
 }
 ```
 
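The cosine-similarity measurement the card describes can be sketched in plain Python. The short vectors below are toy stand-ins for the 768-dimensional embeddings that the model's `encode` method (from the `sentence-transformers` library) would return for real sentences; the helper name `cosine_similarity` and the example vectors are illustrative, not part of the model card.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for sentence embeddings (real ones are 768-dimensional).
emb_norwegian = [0.8, 0.1, 0.3]   # e.g. an embedding of "Dette er en test."
emb_english = [0.7, 0.2, 0.3]     # e.g. an embedding of "This is a test."
emb_unrelated = [-0.5, 0.9, -0.1] # an embedding of an unrelated sentence

print(cosine_similarity(emb_norwegian, emb_english))    # close to 1
print(cosine_similarity(emb_norwegian, emb_unrelated))  # much lower
```

As the card notes, a cross-lingual pair with the same meaning should score close to 1, while unrelated sentences score much lower.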