How to use perplexity-correlations/fasttext-arc-easy-target with fastText:

```python
from huggingface_hub import hf_hub_download
import fasttext

model = fasttext.load_model(hf_hub_download("perplexity-correlations/fasttext-arc-easy-target", "model.bin"))
```
Improve model card
This PR improves the model card by adding metadata such as `pipeline_tag` and `library_name`, and provides a link to the associated GitHub repository. It also clarifies the model's purpose as a data filter for pretraining LLMs.
README.md CHANGED

```diff
@@ -1,7 +1,11 @@
 ---
 license: mit
+pipeline_tag: text-classification
+library_name: fasttext
 ---
 
-This
-
-
+This fastText model is a pretraining data filter, targeting the ARC Easy task. It's designed to select high-quality pretraining data using perplexity correlations, as described in [Improving Pretraining Data Using Perplexity Correlations](https://arxiv.org/abs/2409.05816). The model classifies text as either "include" or "exclude" for use in pretraining a language model. It does *not* itself represent a pretrained language model.
+
+The filter was created using a method that leverages correlations between LLM losses on various texts and downstream benchmark performance. By selecting texts with high correlation, this model aims to improve the efficiency of the data selection process for pretraining LLMs.
+
+Code: https://github.com/TristanThrush/perplexity-correlations
```
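The updated card says the model classifies text as "include" or "exclude". A minimal sketch of how such a classifier could be used to filter documents is shown below. It is not taken from the repository: the label name `__label__include`, the 0.5 threshold, and the helper names `include_score` and `filter_documents` are assumptions for illustration. Check `model.get_labels()` for the actual label strings before relying on it.

```python
from typing import Callable, Iterable, List

# Assumption: the positive label is named "__label__include".
# Verify against model.get_labels() on the real model.
INCLUDE_LABEL = "__label__include"


def include_score(predict: Callable, text: str,
                  include_label: str = INCLUDE_LABEL) -> float:
    """Return the probability assigned to the include label for one document."""
    # fastText's predict() processes one line at a time and rejects newlines,
    # so collapse all whitespace into single spaces first.
    line = " ".join(text.split())
    # k=-1 asks fastText for every label, not just the top one.
    labels, probs = predict(line, k=-1)
    scores = {label: float(prob) for label, prob in zip(labels, probs)}
    return scores.get(include_label, 0.0)


def filter_documents(predict: Callable, docs: Iterable[str],
                     threshold: float = 0.5) -> List[str]:
    """Keep documents whose include probability meets the (assumed) threshold."""
    return [doc for doc in docs if include_score(predict, doc) >= threshold]


# Usage with the real model (requires fasttext and huggingface_hub):
#   from huggingface_hub import hf_hub_download
#   import fasttext
#   model = fasttext.load_model(
#       hf_hub_download("perplexity-correlations/fasttext-arc-easy-target", "model.bin"))
#   kept = filter_documents(model.predict, ["first document ...", "second document ..."])
```

Passing `model.predict` as a callable keeps the filtering logic independent of how the model is loaded, so the threshold can be tuned on a sample of documents without touching the helper functions.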