Update README.md
README.md
CHANGED
@@ -19,11 +19,16 @@ The model intended to be used for encoding sentences or short paragraphs. Given
 # Training data
 
 The model was trained on a random collection of **English** sentences from Wikipedia: [Training data file](https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/resolve/main/wiki1m_for_simcse.txt)
+Training data consists of data splits of different sizes (from 10% to 0.0064%) of the SimCSE training corpus. Each split size comprises 5 files, each created with a different seed.
+Data can be downloaded [here](https://huggingface.co/datasets/sap-ai-research/datasets-for-micse).
 
 # Model Training
 
 <mark>In order to make use of the **few-shot** capability of **miCSE**, the mode needs to be trained on your data. The source code and instructions to do so will be provided shortly. Stay tuned :). </mark>
 
+## Training Data
+
+
 # Model Usage
 ### Example 1) - Sentence Similarity
 
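The added README lines describe splits that each hold a fixed fraction of the SimCSE corpus, with several files per fraction drawn under different seeds. The following is a minimal sketch of how such splits could be produced, not the authors' actual procedure: the function names `sample_split` and `make_splits`, the use of uniform sampling without replacement, and the seed values are all assumptions made for illustration.

```python
import random


def sample_split(sentences, fraction, seed):
    """Return a reproducible random subset holding roughly `fraction` of the sentences.

    Assumption: uniform sampling without replacement; the README does not say
    how the published splits were actually drawn.
    """
    rng = random.Random(seed)  # private generator, so the global RNG state is untouched
    k = max(1, round(len(sentences) * fraction))  # keep at least one sentence
    return rng.sample(sentences, k)


def make_splits(sentences, fractions, seeds):
    """Build one subset per (fraction, seed) pair, keyed by that pair."""
    return {
        (fraction, seed): sample_split(sentences, fraction, seed)
        for fraction in fractions
        for seed in seeds
    }
```

With the two endpoint sizes named in the README (10% and 0.0064%, i.e. fractions `0.10` and `0.000064`) and five hypothetical seeds such as `range(5)`, `make_splits` would return ten subsets, five per size. Fixing the seed makes each subset reproducible across runs.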