How to use agentlans/multilingual-e5-small-fineweb2hq-vs-c4-classifier with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-classification", model="agentlans/multilingual-e5-small-fineweb2hq-vs-c4-classifier")
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("agentlans/multilingual-e5-small-fineweb2hq-vs-c4-classifier")
model = AutoModelForSequenceClassification.from_pretrained("agentlans/multilingual-e5-small-fineweb2hq-vs-c4-classifier")
Note: This model is provided for reference and reproducibility, not for standalone use.
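Loading the model directly returns raw logits rather than labelled scores. The sketch below shows one way to turn them into a label and a probability. The helper names `softmax` and `classify` are hypothetical and not part of the model card. The code assumes `torch` is installed and that the checkpoint's `config.id2label` mapping holds the label names, which the card does not list.

```python
import math


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(text, tokenizer, model):
    """Return (label, probability) for one text.

    `tokenizer` and `model` are the objects loaded with
    AutoTokenizer / AutoModelForSequenceClassification.
    """
    import torch  # imported lazily so softmax() works without torch

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(**inputs).logits[0].tolist()
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    # id2label comes from the checkpoint config; the names are not stated here
    return model.config.id2label[best], probs[best]
```

With the `tokenizer` and `model` objects loaded as shown, `classify("Your text here.", tokenizer, model)` would return the predicted label and its probability.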
This model is a fine-tuned version of agentlans/multilingual-e5-small-aligned-v2 on the agentlans/fineweb2hq-vs-c4 dataset.
The aim is to classify text as higher-quality (FineWeb 2 HQ) or lower-quality (C4) AI training data.
On the validation set:
from transformers import pipeline
classifier = pipeline("text-classification", model="agentlans/multilingual-e5-small-fineweb2hq-vs-c4-classifier")
classifier("Your text here.")
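A text-classification pipeline returns one `{"label": ..., "score": ...}` dictionary per input text. As a minimal sketch of how that output could be used to filter documents, the hypothetical helper below keeps texts whose predicted label matches a chosen one with enough confidence. The label name `LABEL_1` and the 0.9 threshold are placeholders, not values from the model card. The real label names can be read from `classifier.model.config.id2label`.

```python
def keep_confident(texts, predictions, keep_label, min_score=0.9):
    """Return the texts predicted as `keep_label` with score >= min_score."""
    kept = []
    for text, pred in zip(texts, predictions):
        if pred["label"] == keep_label and pred["score"] >= min_score:
            kept.append(text)
    return kept


# Hand-written predictions in the pipeline's output shape, so this runs offline.
# "LABEL_1" is a placeholder label name, not one confirmed by the model card.
texts = ["first document", "second document", "third document"]
predictions = [
    {"label": "LABEL_1", "score": 0.98},
    {"label": "LABEL_0", "score": 0.91},
    {"label": "LABEL_1", "score": 0.62},
]
print(keep_confident(texts, predictions, keep_label="LABEL_1"))
```

In real use, `predictions` would come from `classifier(texts)` instead of the hand-written list.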
The following hyperparameters were used during training:
| Training Loss | Epoch | Step | Validation Loss | Accuracy | Combined Score | Input Tokens Seen |
|---|---|---|---|---|---|---|
| 0.1387 | 1.0 | 40000 | 0.1983 | 0.9515 | 1.3494 | 40960000 |
| 0.0682 | 2.0 | 80000 | 0.2264 | 0.9528 | 1.3270 | 81920000 |
| 0.0424 | 3.0 | 120000 | 0.2598 | 0.9552 | 1.2845 | 122880000 |
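
Two things can be read off the table with a little arithmetic: how many input tokens each step consumed, and which epoch looks best under each metric. The sketch below copies the table values and computes both. It only summarizes the numbers shown and does not say which checkpoint was published.

```python
# Rows copied from the table: (epoch, step, validation_loss, accuracy, input_tokens_seen)
rows = [
    (1.0, 40000, 0.1983, 0.9515, 40960000),
    (2.0, 80000, 0.2264, 0.9528, 81920000),
    (3.0, 120000, 0.2598, 0.9552, 122880000),
]

# Input tokens consumed per optimizer step at each logged epoch.
tokens_per_step = [tokens // step for _, step, _, _, tokens in rows]

# The "best" epoch depends on which metric is used to choose.
best_by_loss = min(rows, key=lambda r: r[2])[0]
best_by_accuracy = max(rows, key=lambda r: r[3])[0]

print(tokens_per_step, best_by_loss, best_by_accuracy)
```

Validation loss rises every epoch while accuracy improves slightly, so the two metrics pick different epochs.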
Base model: intfloat/multilingual-e5-small