---
license: mit
pipeline_tag: text-classification
library_name: fasttext
---

<p align="center">
<a href="https://arxiv.org/abs/2503.00808" target="_blank">Paper</a> | <a href="https://huggingface.co/hkust-nlp/preselect-fasttext-classifier" target="_blank">fastText Classifier</a> | <a href="https://huggingface.co/datasets/hkust-nlp/PreSelect-100B" target="_blank">Released Dataset</a> | <a href="https://github.com/hkust-nlp/PreSelect" target="_blank">Repo</a>
<br>
</p>

## Model Summary
This is a fastText-based binary classifier for identifying high-quality data in pretraining corpora, introduced in the paper [Predictive Data Selection: The Data That Predicts Is the Data That Teaches](https://arxiv.org/abs/2503.00808). It is also the classifier we used to build the [PreSelect-100B](https://huggingface.co/datasets/hkust-nlp/PreSelect-100B) dataset, with a selection threshold of 10%.
The positive and negative label names are `__label__1` and `__label__0`, respectively.

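
To make the label convention concrete, here is a minimal sketch of scoring a single document with the `fasttext` Python package. The helper names (`positive_score`, `score_text`, `load_and_score`) are hypothetical and not part of the released code, and the model path is whatever local file you downloaded the classifier to.

```python
from typing import Sequence

POSITIVE_LABEL = "__label__1"  # positive label name stated in this model card


def positive_score(labels: Sequence[str], probs: Sequence[float]) -> float:
    """Return the probability assigned to the positive label, or 0.0 if absent."""
    for label, prob in zip(labels, probs):
        if label == POSITIVE_LABEL:
            return float(prob)
    return 0.0


def score_text(model, text: str) -> float:
    """Score one document with an already loaded fastText model."""
    # fastText's predict() processes one line at a time, so collapse all
    # whitespace (including newlines) into single spaces first.
    flat = " ".join(text.split())
    # k=2 asks for both labels of the binary classifier.
    labels, probs = model.predict(flat, k=2)
    return positive_score(labels, probs)


def load_and_score(model_path: str, text: str) -> float:
    """Load the classifier from `model_path` (an assumed local file) and score `text`."""
    import fasttext  # imported lazily; install with `pip install fasttext`

    model = fasttext.load_model(model_path)
    return score_text(model, text)
```

For example, `load_and_score("PreSelect-classifier.bin", "some document text")` would return the model's probability for `__label__1`, assuming the classifier file is present at that path.
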
## How to use
You can use the code repo of the paper to run the filtering directly with any fastText model, or simply run the following script:

```python
import argparse
from pathlib import Path

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import FastTextClassifierFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

parser = argparse.ArgumentParser("Filter")
parser.add_argument("--input_path", type=str, help="input path name")
parser.add_argument("--output_path", type=str, help="output name")
args = parser.parse_args()

# Create the output directory if it does not exist yet.
Path(args.output_path).mkdir(parents=True, exist_ok=True)

dist_executor = LocalPipelineExecutor(
    skip_completed=False,
    pipeline=[
        # Read JSONL documents, taking the document text from the "text" field.
        JsonlReader(args.input_path, text_key="text", default_metadata={}),
        # Keep documents labeled "__label__1" with a score threshold of 0.5.
        FastTextClassifierFilter("PreSelect-classifier.bin", keep_labels=[("1", 0.5)]),
        # Write the surviving documents as uncompressed JSONL.
        JsonlWriter(args.output_path, compression=None),
    ],
    tasks=100,
)
dist_executor.run()
```

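
The model summary mentions a selection threshold of 10%. Assuming this means keeping the highest-scoring 10% of documents (an interpretation; see the paper for the exact procedure), the sketch below shows one way to derive a score cutoff from a sample of classifier scores. The function names are hypothetical and not part of the released code.

```python
import math
from typing import List, Sequence


def cutoff_for_top_fraction(scores: Sequence[float], fraction: float) -> float:
    """Return the lowest score that still falls within the top `fraction` of `scores`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    # Round the kept count up and keep at least one document; the small
    # epsilon guards against floating-point fuzz in len * fraction.
    n_keep = max(1, math.ceil(len(ranked) * fraction - 1e-9))
    return ranked[n_keep - 1]


def select_top_fraction(docs: Sequence[str], scores: Sequence[float], fraction: float) -> List[str]:
    """Keep documents whose score reaches the cutoff (ties may keep slightly more)."""
    cutoff = cutoff_for_top_fraction(scores, fraction)
    return [doc for doc, score in zip(docs, scores) if score >= cutoff]
```

A cutoff estimated this way could then be used in place of `0.5` in `keep_labels=[("1", 0.5)]`, under the same assumption about what the 10% threshold means.
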
## Training
For training details, please refer to the paper. The training code is available on GitHub at [PreSelect](https://github.com/hkust-nlp/preselect).

## Citation
If you find this work helpful, please kindly cite as:
```
@article{shum2025predictivedataselectiondata,
      title={Predictive Data Selection: The Data That Predicts Is the Data That Teaches},
      author={Kashun Shum and Yuzhen Huang and Hongjian Zou and Ding Qi and Yixuan Liao and Xiaoxin Chen and Qian Liu and Junxian He},
      journal={arXiv preprint arXiv:2503.00808},
      year={2025},
      eprint={2503.00808},
}
```