fixed the datasets file
README.md CHANGED

```diff
@@ -9,7 +9,6 @@ license: apache-2.0
 datasets:
 - s2orc
 - flax-sentence-embeddings/stackexchange_xml
-- MS Marco
 - gooaq
 - yahoo_answers_topics
 - code_search_net
```

@@ -99,18 +98,18 @@ For an automated evaluation of this model, see the *Sentence Embeddings Benchmar
## Background

The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised
contrastive learning objective. We used the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model and fine-tuned it on a
dataset of 1B sentence pairs. We use a contrastive learning objective: given a sentence from a pair, the model should predict which of a set of randomly sampled other sentences was actually paired with it in our dataset.

We developed this model during the
[Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104)
organized by Hugging Face, as part of the project
[Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project (7 TPU v3-8s), as well as guidance from Google's Flax, JAX, and Cloud team members on efficient deep learning frameworks.

## Intended uses

Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures
the semantic information. The sentence vector may be used for information retrieval, clustering, or sentence similarity tasks.

By default, input text longer than 384 word pieces is truncated.
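Once text is encoded, tasks like sentence similarity reduce to comparing vectors, typically with cosine similarity. A minimal sketch, using small illustrative vectors in place of the model's 768-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors standing in for sentence embeddings.
query = np.array([0.9, 0.1, 0.0, 0.2])
doc_similar = np.array([0.8, 0.2, 0.1, 0.1])
doc_unrelated = np.array([0.0, 0.1, 0.9, 0.0])

print(cosine_similarity(query, doc_similar))    # high: semantically close
print(cosine_similarity(query, doc_unrelated))  # low: unrelated
```

For information retrieval or clustering, the same score is computed against every candidate vector and the results are ranked or grouped.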
@@ -118,11 +117,11 @@ By default, input text longer than 384 word pieces is truncated.
## Training procedure

### Pre-training

We use the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model. Please refer to the model card for more detailed information about the pre-training procedure.

### Fine-tuning

We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity between all possible sentence pairs from the batch.
We then apply the cross-entropy loss by comparing with the true pairs.
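The objective above corresponds to in-batch softmax cross-entropy: the score matrix holds cosine similarities between every anchor and every candidate in the batch, and the true pair sits on the diagonal. A minimal numpy sketch; the `scale` hyperparameter and the toy inputs are illustrative, not the project's actual training code:

```python
import numpy as np

def contrastive_loss(anchors: np.ndarray, positives: np.ndarray,
                     scale: float = 20.0) -> float:
    """In-batch contrastive loss: row i of `anchors` is paired with
    row i of `positives`; every other row serves as a negative."""
    # L2-normalize so the dot product below is cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sim = scale * (a @ p.T)  # (batch, batch) scaled cosine similarities
    # Softmax cross-entropy with the diagonal entries as the true classes.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - np.diag(sim)))

pairs = np.eye(4)  # four orthogonal toy "embeddings"
print(contrastive_loss(pairs, pairs))                # ~0: every pair matches
print(contrastive_loss(pairs, pairs[[1, 2, 3, 0]]))  # large: pairs misaligned
```

Minimizing this loss pulls each true pair together while pushing the anchor away from every other sentence in the batch, which is why large batches act as a source of many negatives.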
@@ -162,7 +161,7 @@ We sampled each dataset given a weighted probability which configuration is deta

| [Eli5](https://huggingface.co/datasets/eli5) | [paper](https://doi.org/10.18653/v1/p19-1346) | 325,475 |
| [Flickr 30k](https://shannon.cs.illinois.edu/DenotationGraph/) | [paper](https://transacl.org/ojs/index.php/tacl/article/view/229/33) | 317,695 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles) | | 304,525 |
| AllNLI ([SNLI](https://nlp.stanford.edu/projects/snli/) and [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/)) | [paper SNLI](https://doi.org/10.18653/v1/d15-1075), [paper MultiNLI](https://doi.org/10.18653/v1/n18-1101) | 277,230 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (bodies) | | 250,519 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles+bodies) | | 250,460 |
| [Sentence Compression](https://github.com/google-research-datasets/sentence-compression) | [paper](https://www.aclweb.org/anthology/D13-1155/) | 180,000 |
@@ -173,4 +172,4 @@ We sampled each dataset given a weighted probability which configuration is detailed

| [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) | [paper](https://transacl.org/ojs/index.php/tacl/article/view/1455) | 100,231 |
| [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) | [paper](https://aclanthology.org/P18-2124.pdf) | 87,599 |
| [TriviaQA](https://huggingface.co/datasets/trivia_qa) | - | 73,346 |
| **Total** | | **1,170,060,424** |
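The hunk headers note that each dataset was sampled with a weighted probability. A minimal sketch of one common way such a sampling schedule is implemented; the dataset subset and the size-proportional weights are illustrative, not the project's actual configuration:

```python
import random

# Illustrative subset of datasets with their training-tuple counts.
dataset_sizes = {
    "eli5": 325_475,
    "flickr30k": 317_695,
    "squad2.0": 87_599,
}

names = list(dataset_sizes)
total = sum(dataset_sizes.values())
# One plausible scheme: weight each dataset by its share of all tuples.
weights = [dataset_sizes[n] / total for n in names]

random.seed(0)
# Draw which dataset each of the next 10 training batches comes from.
schedule = random.choices(names, weights=weights, k=10)
print(schedule)
```

Larger datasets are drawn more often, so the mixture seen during training roughly mirrors the relative sizes in the table above.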