Update README.md
README.md
CHANGED
|
@@ -1,6 +1,12 @@
|
|
| 1 |
---
|
| 2 |
license: mit
|
| 3 |
---
|
| 4 |
## Model Summary
|
| 5 |
This is a fastText-based binary classifier for identifying high-quality data in a pretraining corpus, introduced in the paper [Predictive Data Selection: The Data That Predicts Is the Data That Teaches
|
| 6 |
](https://arxiv.org/abs/2503.00808). It is also the classifier we used to build the [PreSelect-100B](https://huggingface.co/datasets/hkust-nlp/PreSelect-100B) dataset, with a selection threshold of 10%.
|
|
@@ -40,3 +46,15 @@ dist_executor.run()
|
|
| 40 |
## Training
|
| 41 |
For more training details, refer to the paper. The training code is available on GitHub:
|
| 42 |
[PreSelect](https://github.com/hkust-nlp/preselect).
|
| 1 |
---
|
| 2 |
license: mit
|
| 3 |
---
|
| 4 |
+
<p align="center">
|
| 5 |
+
📑 <a href="https://arxiv.org/abs/2503.00808" target="_blank">Paper</a>    |    🔨 <a href="https://huggingface.co/hkust-nlp/preselect-fasttext-classifier" target="_blank">fastText Classifier</a>    |    🤗 <a href="https://huggingface.co/datasets/hkust-nlp/PreSelect-100B" target="_blank">Released Dataset</a>    |    📦 <a href="https://github.com/hkust-nlp/PreSelect" target="_blank">Repo</a>
|
| 6 |
+
<br>
|
| 7 |
+
</p>
|
| 8 |
+
|
| 9 |
+
|
| 10 |
## Model Summary
|
| 11 |
This is a fastText-based binary classifier for identifying high-quality data in a pretraining corpus, introduced in the paper [Predictive Data Selection: The Data That Predicts Is the Data That Teaches
|
| 12 |
](https://arxiv.org/abs/2503.00808). It is also the classifier we used to build the [PreSelect-100B](https://huggingface.co/datasets/hkust-nlp/PreSelect-100B) dataset, with a selection threshold of 10%.
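
As a rough illustration of what a 10% selection threshold could look like in practice, the sketch below keeps the top-scoring 10% of a list of documents. It assumes the threshold means "retain the highest-scoring tenth by classifier score". The `select_top_fraction` helper and the toy scores are hypothetical and are not taken from the PreSelect codebase.

```python
import math


def select_top_fraction(scored_docs, fraction=0.10):
    """Keep the highest-scoring `fraction` of documents.

    scored_docs: list of (text, score) pairs. The scores are hypothetical
    stand-ins for the classifier's positive-label probability.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not scored_docs:
        return []
    # Number of documents to keep, rounded up so at least one survives.
    keep = max(1, math.ceil(len(scored_docs) * fraction))
    # Rank by descending score; sorted() is stable, so ties keep input order.
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return ranked[:keep]


# Toy data (assumption): 20 documents with evenly spaced scores.
docs = [(f"doc-{i}", i / 20) for i in range(20)]
selected = select_top_fraction(docs, fraction=0.10)
selected_names = [text for text, _ in selected]
```

On a real corpus the scores would come from the classifier itself, and a streaming or quantile-based cutoff is likely more practical than sorting everything in memory.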
|
|
|
|
| 46 |
## Training
|
| 47 |
For more training details, refer to the paper. The training code is available on GitHub:
|
| 48 |
[PreSelect](https://github.com/hkust-nlp/preselect).
|
| 49 |
+
|
| 50 |
+
## Citation
|
| 51 |
+
If you find this work helpful, please cite it as:
|
| 52 |
+
```
|
| 53 |
+
@article{shum2025predictivedataselectiondata,
|
| 54 |
+
title={Predictive Data Selection: The Data That Predicts Is the Data That Teaches},
|
| 55 |
+
author={Kashun Shum and Yuzhen Huang and Hongjian Zou and Ding Qi and Yixuan Liao and Xiaoxin Chen and Qian Liu and Junxian He},
|
| 56 |
+
journal={arXiv preprint arXiv:2503.00808},
|
| 57 |
+
year={2025},
|
| 58 |
+
eprint={2503.00808},
|
| 59 |
+
}
|
| 60 |
+
```
|