
Model description

This model was created to make it easy to filter large synthetic datasets in the Dutch language. It was trained mainly to pick out strings with a lot of repetition, odd grammar, or refusals, and it returns one of three labels: "Correct", "Error", or "Refusal".
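As a quick illustration of the three labels, here is a minimal sketch that classifies a single string. The helper names `load_classifier` and `classify` are hypothetical and not part of the model card. The sketch assumes the pipeline returns a list with one `{"label", "score"}` dict per input, which is the default behavior of a text-classification pipeline.

```python
LABELS = ("Correct", "Error", "Refusal")  # label names as listed in the model description


def load_classifier():
    """Build the text-classification pipeline (downloads the model on first use)."""
    # Imported lazily so that classify() can be used or tested without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model="Kalamazooter/DutchDatasetCleaner_Bertje")


def classify(text, classifier):
    """Return (label, score) for one string, given a pipeline-style callable."""
    # Truncate long inputs to 512 tokens, matching the limit used elsewhere in this card.
    result = classifier(text, truncation=True, max_length=512)[0]
    if result["label"] not in LABELS:
        raise ValueError(f"unexpected label: {result['label']!r}")
    return result["label"], result["score"]


# Usage (requires network access to fetch the model):
#   clf = load_classifier()
#   label, score = classify("Dit is een normale Nederlandse zin.", clf)
```

Passing the classifier in as an argument keeps the helper independent of how the pipeline was built, so the same function works with a pipeline loaded from a local checkpoint.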

How to use

from transformers import AutoTokenizer, BertForSequenceClassification, pipeline
import json
model = BertForSequenceClassification.from_pretrained("Kalamazooter/DutchDatasetCleaner_Bertje")
tokenizer = AutoTokenizer.from_pretrained("Kalamazooter/DutchDatasetCleaner_Bertje", model_max_length=512)
text_classification = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
)

tokenizer_kwargs = {'padding':True,'truncation':True,'max_length':512}

ErrorThreshold = 0.8  # the model is slightly trigger-happy on the Error class; adjust this value to your needs
Dataset = "Base_Dataset"

with open(Dataset+".jsonl","r") as DirtyDataset:
    # Iterate over the file directly so a large dataset is not loaded into memory at once
    for line in DirtyDataset:
        DatasetDict = json.loads(line)
        output = text_classification(DatasetDict['text'],**tokenizer_kwargs)
        label = output[0]['label']
        score = output[0]['score']
        if label == 'Refusal':
            with open(Dataset+"_Refused.jsonl","a") as RefusalDataset:
                RefusalDataset.writelines([line])
        if label == 'Error' and score > ErrorThreshold:
            with open(Dataset+"_Error.jsonl","a") as ErrorDataset:
                ErrorDataset.writelines([line])
        # Keep 'Correct' lines, plus 'Error' predictions at or below the threshold
        if label == 'Correct' or (label == 'Error' and score <= ErrorThreshold):
            with open(Dataset+"_Clean.jsonl","a") as CorrectDataset:
                CorrectDataset.writelines([line])
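The routing rule in the loop above can be isolated into a small pure function, which makes it easier to test and to experiment with different thresholds. This is only a sketch: the function name `route` and the returned bucket names are hypothetical, and treating a score exactly equal to the threshold as clean is an assumption.

```python
def route(label, score, error_threshold=0.8):
    """Return the output bucket ("Refused", "Error" or "Clean") for one prediction."""
    if label == "Refusal":
        return "Refused"
    if label == "Error":
        # Only confident Error predictions are filtered out; the rest are kept as clean.
        return "Error" if score > error_threshold else "Clean"
    if label == "Correct":
        return "Clean"
    raise ValueError(f"unexpected label: {label!r}")


# Example: count how many predictions land in each bucket at a given threshold.
predictions = [("Correct", 0.99), ("Error", 0.75), ("Error", 0.93), ("Refusal", 0.88)]
counts = {}
for label, score in predictions:
    bucket = route(label, score, error_threshold=0.8)
    counts[bucket] = counts.get(bucket, 0) + 1
```

Raising the threshold moves borderline Error predictions into the clean set, and lowering it filters more aggressively.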
Model size: 0.1B params (Safetensors, tensor type F32)