Update README.md
Updated references to the training dataset.
README.md
CHANGED
@@ -1,7 +1,7 @@
 ---
 license: apache-2.0
 datasets:
-- deepvk/
+- deepvk/GeRaCl_synthethic_dataset
 language:
 - ru
 base_model:
@@ -91,7 +91,7 @@ Compared to the USER2-base model, there are two additional MLP layers. One is fo
 
 <img src="assets/architecture.png" alt="GeRaCl architecture" width="600"/>
 
-The training set is built entirely from splits of the [`deepvk/
+The training set is built entirely from splits of the [`deepvk/GeRaCl_synthethic_dataset`](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset) dataset. It is a concatenation of three sub-datasets:
 - **Synthetic classes part**. For every training example we randomly chose one of the five class lists (`classes_0`…`classes_4`) and paired it with the sample’s text. The validation and test splits were added unchanged.
 - **RU-MTEB part**. The entire `ru_mteb_classes` dataset was added to the mix.
 - **RU-MTEB extended part**. The entire `ru_mteb_extended_classes` dataset was added to the mix.
@@ -99,10 +99,10 @@ The training set is built entirely from splits of the [`deepvk/CLAZER`](https:/
 
 | Dataset | # Samples |
 |----------------------------:|:----:|
-| [
-| [
-| [
-| [
+| [GeRaCl_synthethic_dataset/synthetic_classes_train](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/synthetic_classes_train) | 93K |
+| [GeRaCl_synthethic_dataset/synthetic_classes](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/synthetic_classes) (val and test) | 6K |
+| [GeRaCl_synthethic_dataset/ru_mteb_classes](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/ru_mteb_classes/) | 52K |
+| [GeRaCl_synthethic_dataset/ru_mteb_extended_classes](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/ru_mteb_extended_classes) | 93K |
 | **Total** | 244K |
 
 ## Citations
@@ -114,4 +114,4 @@ The training set is built entirely from splits of the [`deepvk/CLAZER`](https:/
 publisher={Hugging Face}
 year={2025},
 }
-```
+```
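The "Synthetic classes part" bullet in the README describes picking one of the five class lists at random for each training example and pairing it with the sample's text. A minimal sketch of that step is given below. It is an illustration under stated assumptions, not the authors' training code: the `text` field name, the helper `pair_text_with_random_classes`, and the example record are hypothetical; only the column names `classes_0`…`classes_4` come from the README.

```python
import random

NUM_CLASS_LISTS = 5  # classes_0 ... classes_4, as named in the README


def pair_text_with_random_classes(example, rng):
    """Pick one of the five class lists at random and pair it with the text.

    Assumes (hypothetically) that each record stores its text under "text".
    """
    idx = rng.randrange(NUM_CLASS_LISTS)
    return {"text": example["text"], "classes": example[f"classes_{idx}"]}


# Hypothetical record; the real dataset schema may differ.
example = {
    "text": "sample text",
    "classes_0": ["a", "b"],
    "classes_1": ["c", "d"],
    "classes_2": ["e", "f"],
    "classes_3": ["g", "h"],
    "classes_4": ["i", "j"],
}

rng = random.Random(0)  # seeded only to make the sketch reproducible
paired = pair_text_with_random_classes(example, rng)
```

Passing an explicit `random.Random` instance keeps the choice reproducible without touching the global random state. According to the README, validation and test records skip this step and are added unchanged.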