fasttext-lambada-it-target / README.md

---
license: mit
library_name: fasttext
pipeline_tag: data-filtering
tags:
  - pretraining-data-selection
---

This fastText model is a filter for selecting high-quality pretraining data, as described in the paper "Improving Pretraining Data Using Perplexity Correlations". It targets the LAMBADA IT task.

The filter is built from perplexity correlations: it favors text segments on which language-model perplexity correlates strongly with performance on the target downstream benchmark. It does not perform text classification directly; instead, it outputs a score indicating how suitable a text segment is for pretraining.
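As a rough illustration of how such a score could be read out and used for filtering, here is a minimal sketch. It is not the project's official usage: the label name `__label__include`, the helper names, and the file path in the comments are assumptions made for this example, so check the GitHub repository for the actual labels and workflow. The only fastText calls used are `fasttext.load_model` and `model.predict`.

```python
def load_filter(path):
    """Load a fastText model from a local .bin file (path is supplied by the caller)."""
    import fasttext  # imported lazily so the scoring helpers work with any predict()-style object
    return fasttext.load_model(path)


def include_score(model, text, positive_label="__label__include"):
    """Return the probability the model assigns to `positive_label` for `text`.

    ASSUMPTION: the positive label is named "__label__include"; pass the
    real label name if the released model uses a different one.
    """
    # fastText's predict() handles one line at a time, so collapse newlines and extra spaces.
    flat = " ".join(text.split())
    labels, probs = model.predict(flat, k=-1)  # k=-1 returns every label with its probability
    for label, prob in zip(labels, probs):
        if label == positive_label:
            return float(prob)
    return 0.0  # label absent from the model's output


def select_top_fraction(model, documents, fraction=0.5, positive_label="__label__include"):
    """Keep the highest-scoring `fraction` of `documents` (at least one document)."""
    ranked = sorted(
        documents,
        key=lambda doc: include_score(model, doc, positive_label),
        reverse=True,
    )
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


# Example usage (hypothetical file name; requires the `fasttext` package):
# model = load_filter("model.bin")
# kept = select_top_fraction(model, ["first document ...", "second document ..."], fraction=0.5)
```

Ranking by score and keeping a top fraction is only one possible selection rule; a fixed score threshold would work equally well with `include_score`.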

For complete usage instructions and the theoretical background, please refer to the project's GitHub repository.