---
license: mit
pipeline_tag: text-classification
library_name: fasttext
---
This is the fastText pretraining data filter targeting the PIQA task, discussed in the main text of the Perplexity Correlations paper: https://arxiv.org/abs/2409.05816. The filter helps select high-quality pretraining data by exploiting correlations between LLM perplexity on a given text and downstream performance on a target benchmark (PIQA in this case). It is trained using perplexity correlations computed over a sample of 90 LLMs from the Open LLM Leaderboard on texts from tens of thousands of web domains.

The filter is implemented with the `fasttext` library and can be used to select or weight pretraining data samples based on their predicted likelihood of improving downstream performance.

For more information on the methodology and usage, please refer to the [Perplexity Correlations paper](https://arxiv.org/abs/2409.05816) and the [project repository](https://github.com/TristanThrush/perplexity-correlations).
```python
import fasttext

# Load the pretrained fastText model
model = fasttext.load_model('fasttext_filter.bin')

# fastText's predict() returns a (labels, probabilities) tuple;
# there is no separate predict_proba method
text = "Some text to filter."
labels, probs = model.predict(text)

# Probability of the 'include' label; for a binary classifier the two
# label probabilities sum to (approximately) 1
include_prob = probs[0] if labels[0] == '__label__include' else 1.0 - probs[0]

print(f"Prediction: {labels[0]}, P(include): {include_prob:.4f}")
```