---
license: mit
library_name: fasttext
pipeline_tag: text-classification
tags:
  - pretraining-data-selection
---

This fastText model is a filter for selecting high-quality pretraining data, as described in "Improving Pretraining Data Using Perplexity Correlations". This particular filter targets the LAMBADA IT task.

The model relies on perplexity correlations to identify text segments that are strongly associated with good performance on downstream benchmarks. It does not perform text classification directly; instead, it outputs a score indicating how suitable a text segment is for pretraining.
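As a rough illustration of how such a score might be read out with the standard `fasttext` Python package, here is a minimal sketch. The model path passed to `load_filter`, the positive label name `__label__include`, and the helper names are assumptions made for this example, not details confirmed by this model card; check the project's repository for the actual file and label names.

```python
def load_filter(model_path):
    """Load a fastText binary from a local path (requires `pip install fasttext`)."""
    import fasttext  # imported lazily so the helpers below can be tested without it
    return fasttext.load_model(model_path)


def pretraining_score(model, text, positive_label="__label__include"):
    """Return the probability the model assigns to `positive_label` for `text`.

    `positive_label` is an assumed label name; adjust it to match the model.
    """
    # fastText predicts one line at a time, so collapse newlines and extra spaces.
    line = " ".join(text.split())
    labels, probs = model.predict(line, k=-1)  # k=-1 asks for every label
    for label, prob in zip(labels, probs):
        if label == positive_label:
            return float(prob)
    raise KeyError(f"label {positive_label!r} not found in {list(labels)}")


def select_top_fraction(model, texts, keep_fraction=0.5):
    """Keep the highest-scoring fraction of `texts` (at least one if non-empty)."""
    ranked = sorted(texts, key=lambda t: pretraining_score(model, t), reverse=True)
    n_keep = max(1, int(len(texts) * keep_fraction))
    return ranked[:n_keep]
```

With a downloaded model file, usage would look like `model = load_filter("model.bin")` followed by `pretraining_score(model, "some text segment")`, where `model.bin` is a placeholder path. Ranking by score and keeping a top fraction is one simple selection strategy; the threshold or fraction to use is a choice this card does not specify.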

For complete usage instructions and the theoretical background, please refer to the project's GitHub repository.