How to use tartuNLP/mmBERT-small-m-edu-classifier with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="tartuNLP/mmBERT-small-m-edu-classifier")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("tartuNLP/mmBERT-small-m-edu-classifier")
model = AutoModelForSequenceClassification.from_pretrained("tartuNLP/mmBERT-small-m-edu-classifier")
```

Trained on full documents of up to 8192 tokens. The training split of tartuNLP/fineweb-c-combined-resample was used, which is itself a mix and resample of HuggingFaceFW/fineweb-edu-llama3-annotations and data-is-better-together/fineweb-c.
Label mapping:

```python
{0: '❗ Problematic Content ❗', 1: 'None', 2: 'Minimal', 3: 'Basic', 4: 'Good', 5: 'Excellent'}
```
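When loaded via AutoModelForSequenceClassification, the model returns one logit per class in label-id order. A minimal sketch of turning a logit vector into a label with the mapping above, using made-up illustration logits rather than real model output:

```python
import math

# Label mapping from the model card.
id2label = {0: '❗ Problematic Content ❗', 1: 'None', 2: 'Minimal',
            3: 'Basic', 4: 'Good', 5: 'Excellent'}

# Hypothetical logits, one per class in id order (not real model output).
logits = [-1.2, 0.3, 0.8, 2.1, 1.9, -0.5]

# Softmax to get class probabilities, then take the argmax.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]
pred = max(range(len(probs)), key=probs.__getitem__)
print(id2label[pred])  # → Basic
```

The pipeline helper performs this postprocessing for you and returns the label string directly.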
Evaluated on the development set of tartuNLP/fineweb-c-combined-resample, which is organized so that each language appears at least once.
| label | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.89 | 0.78 | 0.83 | 602 |
| 1 | 0.65 | 0.88 | 0.75 | 916 |
| 2 | 0.41 | 0.29 | 0.34 | 345 |
| 3 | 0.40 | 0.30 | 0.34 | 179 |
| 4 | 0.53 | 0.15 | 0.23 | 127 |
| 5 | 0.55 | 0.39 | 0.45 | 44 |
| accuracy | | | 0.66 | 2213 |
| macro avg | 0.57 | 0.46 | 0.49 | 2213 |
| weighted avg | 0.65 | 0.66 | 0.64 | 2213 |
Confusion matrix (rows: true label 0–5, columns: predicted label 0–5):

```
[[471 114  10   6   0   1]
 [ 33 806  59  13   5   0]
 [ 10 204 101  28   2   0]
 [  7  72  37  53   8   2]
 [  7  35  27  28  19  11]
 [  2   7  10   6   2  17]]
```
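The headline numbers in the report can be recomputed directly from the confusion matrix; a small pure-Python sanity check, with the matrix values copied from above:

```python
# Confusion matrix: rows = true label 0-5, columns = predicted label 0-5.
cm = [
    [471, 114,  10,   6,   0,   1],
    [ 33, 806,  59,  13,   5,   0],
    [ 10, 204, 101,  28,   2,   0],
    [  7,  72,  37,  53,   8,   2],
    [  7,  35,  27,  28,  19,  11],
    [  2,   7,  10,   6,   2,  17],
]

n = sum(sum(row) for row in cm)                 # total evaluated examples (2213)
accuracy = sum(cm[i][i] for i in range(6)) / n  # diagonal = correct predictions

# Per-class precision (column-wise) and recall (row-wise).
precision = [cm[i][i] / sum(row[i] for row in cm) for i in range(6)]
recall = [cm[i][i] / sum(cm[i]) for i in range(6)]

print(round(accuracy, 2))      # → 0.66
print(round(precision[0], 2))  # → 0.89
print(round(recall[0], 2))     # → 0.78
```

The row sums also reproduce the support column of the report (602, 916, 345, 179, 127, 44).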
Base model
jhu-clsp/mmBERT-small