---
license: mit
library_name: fasttext
pipeline_tag: data-filtering
tags:
  - pretraining-data-selection
---

This fastText model is a filter for selecting high-quality pretraining data, as described in the paper *Improving Pretraining Data Using Perplexity Correlations*. It targets the LAMBADA IT task (the Italian version of the LAMBADA benchmark).

The model uses perplexity correlations to identify text segments whose language-model perplexity correlates strongly with performance on downstream benchmarks. It is not a general-purpose text classifier; its output is a score indicating how suitable a text segment is for pretraining.
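As a rough illustration of how such a score might be obtained, here is a minimal sketch using the `fasttext` Python package. The file name `model.bin` and the label `__label__include` are assumptions made for illustration, not values confirmed by this card; check the GitHub repository for the actual file and label names and for the intended selection procedure.

```python
from typing import Iterable, List, Tuple

# Hypothetical values: neither the file name nor the label is stated in this card.
MODEL_PATH = "model.bin"
POSITIVE_LABEL = "__label__include"


def load_filter(path: str = MODEL_PATH):
    """Load the fastText model from disk (requires `pip install fasttext`)."""
    import fasttext  # imported lazily so the helpers below work without it

    return fasttext.load_model(path)


def score_text(model, text: str, positive_label: str = POSITIVE_LABEL) -> float:
    """Return the probability the model assigns to `positive_label`."""
    # fastText predicts on a single line, so collapse newlines and extra spaces.
    line = " ".join(text.split())
    labels, probs = model.predict(line, k=-1)  # k=-1 returns every label
    for label, prob in zip(labels, probs):
        if label == positive_label:
            return float(prob)
    return 0.0  # the assumed label is not among the model's labels


def select_top(
    model, docs: Iterable[str], keep_fraction: float = 0.5
) -> List[Tuple[float, str]]:
    """Keep the highest-scoring fraction of `docs` as (score, text) pairs."""
    scored = sorted(
        ((score_text(model, doc), doc) for doc in docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    n_keep = max(1, int(len(scored) * keep_fraction))
    return scored[:n_keep]
```

With a downloaded model file, usage would look like `model = load_filter()` followed by `select_top(model, documents, keep_fraction=0.1)`. Keeping a top fraction rather than applying a fixed threshold is only one possible design choice; the right cutoff depends on the size of the candidate pool and the token budget.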

For complete usage instructions and the theoretical background, please refer to the project's GitHub repository.