---
license: mit
pipeline_tag: text-classification
library_name: fasttext
---
This is the fastText pretraining data filter targeting the PIQA task, discussed in the main text of the Perplexity Correlations paper: https://arxiv.org/abs/2409.05816. The filter helps select high-quality pretraining data by exploiting correlations between LLM perplexity on a given text and downstream performance on a target benchmark (PIQA in this case). It is trained using perplexity correlations computed over a sample of 90 LLMs from the Open LLM Leaderboard on texts from tens of thousands of web domains.

The filter is implemented with the `fasttext` library and can be used to select or weight pretraining data samples based on their predicted likelihood of improving downstream performance.

For more information on the methodology and usage, please refer to the [Perplexity Correlations paper](https://arxiv.org/abs/2409.05816) and the [project repository](https://github.com/TristanThrush/perplexity-correlations).
```python
import fasttext

# Load the pretrained fastText model
model = fasttext.load_model('fasttext_filter.bin')

# fastText's predict() returns a (labels, probabilities) tuple;
# there is no separate predict_proba method
text = "Some text to filter."
labels, probs = model.predict(text)

# Probability of the 'include' label; for a binary classifier the two
# label probabilities sum to (approximately) 1
include_prob = probs[0] if labels[0] == '__label__include' else 1.0 - probs[0]

print(f"Prediction: {labels[0]}, P(include): {include_prob:.4f}")
```