fasttext-lambada-it-target / README.md

metadata

license: mit
library_name: fasttext
pipeline_tag: text-classification
tags:
  - pretraining-data-selection

This fastText model is a filter for selecting high-quality pretraining data, as described in the paper *Improving Pretraining Data Using Perplexity Correlations*. It targets the LAMBADA IT (Italian) task.

The model uses perplexity correlations to identify text segments highly correlated with strong performance on downstream benchmarks. It doesn't perform text classification directly; instead, it outputs a score indicating the suitability of a text segment for pretraining.
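As a rough illustration of how such a score could be read out, here is a minimal sketch using the standard `fasttext` Python bindings. The helper names `load_filter` and `score_text`, the local file path, and the label string `__label__include` are assumptions made for this example and are not confirmed by this card; check `model.get_labels()` and the GitHub repository for the actual labels and the recommended workflow.

```python
def load_filter(model_path: str):
    """Load a fastText model from a local file (path is supplied by the caller)."""
    import fasttext  # imported lazily so score_text can be used with any predict()-style object
    return fasttext.load_model(model_path)


def score_text(model, text: str, positive_label: str) -> float:
    """Return the probability the model assigns to `positive_label` for `text`.

    `positive_label` is an assumption of this sketch: pass whichever label
    your copy of the model uses to mark text as suitable for pretraining.
    """
    # fastText's predict() rejects strings containing newlines, so collapse
    # all runs of whitespace (including newlines) into single spaces.
    flat = " ".join(text.split())
    # k=-1 asks fastText for every label together with its probability.
    labels, probs = model.predict(flat, k=-1)
    for label, prob in zip(labels, probs):
        if label == positive_label:
            return float(prob)
    raise KeyError(f"{positive_label!r} not among model labels: {list(labels)}")


# Hypothetical usage (path and label are placeholders, not taken from this card):
# model = load_filter("path/to/downloaded/model.bin")
# print(model.get_labels())
# print(score_text(model, "Un esempio di testo.\nSeconda riga.", "__label__include"))
```

Scoring through a single probability keeps the filter usable as a ranking signal: segments can be sorted by score and the top portion kept, with the cutoff chosen to fit the token budget.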

For complete usage instructions and the theoretical background, please refer to the project's GitHub repository.