Text Classification
Transformers
Safetensors
Dutch
bert
BERTje
Filtering
Data Cleaning
text-embeddings-inference
Model description
This model was created to make it easy to filter large synthetic datasets in Dutch. It was trained mainly to pick out strings with a lot of repetition, odd grammar, or refusals, and it assigns each input one of three labels: "Correct", "Error", or "Refusal".
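As a rough illustration of how the three labels might be consumed, the sketch below tallies the label of each prediction. It assumes predictions shaped like the per-input output of a Transformers text-classification pipeline (a dict with `label` and `score` keys). `tally_labels` is a hypothetical helper name, and the sample predictions are hand-written placeholders, not real model output.

```python
# The three class names the model card says the classifier returns.
LABELS = ("Correct", "Error", "Refusal")


def tally_labels(predictions):
    """Count how many predictions fall into each of the three classes.

    `predictions` is an iterable of dicts with a 'label' key, the shape a
    text-classification pipeline yields per input (assumed here).
    """
    counts = {name: 0 for name in LABELS}
    for prediction in predictions:
        label = prediction["label"]
        if label not in counts:
            raise ValueError(f"unexpected label: {label!r}")
        counts[label] += 1
    return counts


# Hand-written placeholder predictions, NOT real model output.
example_predictions = [
    {"label": "Correct", "score": 0.97},
    {"label": "Error", "score": 0.62},
    {"label": "Refusal", "score": 0.91},
    {"label": "Correct", "score": 0.88},
]

print(tally_labels(example_predictions))
```

A tally like this can give a quick sense of how much of a dataset would be filtered out before any files are written.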
How to use
from transformers import AutoTokenizer, BertForSequenceClassification, pipeline
import json

# Load the fine-tuned classifier and its tokenizer.
model = BertForSequenceClassification.from_pretrained("Kalamazooter/DutchDatasetCleaner_Bertje")
tokenizer = AutoTokenizer.from_pretrained("Kalamazooter/DutchDatasetCleaner_Bertje", model_max_length=512)

text_classification = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
)

# Pad and truncate every input to at most 512 tokens.
tokenizer_kwargs = {'padding': True, 'truncation': True, 'max_length': 512}
ErrorThreshold = 0.8  # model is slightly trigger happy on the error class, modify this value to your needs
Dataset = "Base_Dataset"

# Read the dataset: one JSON object per line, each with a 'text' field.
with open(Dataset + ".jsonl", "r") as DirtyDataset:
    lines = DirtyDataset.readlines()

for line in lines:
    DatasetDict = json.loads(line)
    output = text_classification(DatasetDict['text'], **tokenizer_kwargs)
    label = output[0]['label']
    score = output[0]['score']
    if label == 'Refusal':
        # Refusals are always set aside.
        with open(Dataset + "_Refused.jsonl", "a") as RefusalDataset:
            RefusalDataset.write(line)
    elif label == 'Error' and score > ErrorThreshold:
        # Only confident 'Error' predictions are set aside.
        with open(Dataset + "_Error.jsonl", "a") as ErrorDataset:
            ErrorDataset.write(line)
    else:
        # 'Correct', or 'Error' at or below the threshold, is kept as clean.
        with open(Dataset + "_Clean.jsonl", "a") as CorrectDataset:
            CorrectDataset.write(line)
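The routing rule in the script above can also be pulled out into a pure function, which makes it possible to check the threshold behaviour without loading the model. This is a minimal sketch under stated assumptions: `route` and `split_lines` are hypothetical names, `classify` is any caller-supplied callable that returns a `(label, score)` tuple, and `stub_classify` is a stand-in used only for demonstration. Sending a score exactly equal to the threshold to the clean bucket is also an assumption.

```python
import json


def route(label, score, error_threshold=0.8):
    """Return the bucket name ('refused', 'error' or 'clean') for one prediction."""
    if label == "Refusal":
        return "refused"
    if label == "Error" and score > error_threshold:
        return "error"
    # Everything else: 'Correct', plus 'Error' at or below the threshold.
    return "clean"


def split_lines(lines, classify, error_threshold=0.8):
    """Group raw JSONL lines by bucket.

    `classify` is a caller-supplied callable (hypothetical interface) that
    takes a text string and returns a (label, score) tuple.
    """
    buckets = {"refused": [], "error": [], "clean": []}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        label, score = classify(record["text"])
        buckets[route(label, score, error_threshold)].append(line)
    return buckets


def stub_classify(text):
    """Stand-in classifier for demonstration only; NOT the real model."""
    if "kan ik niet" in text:
        return ("Refusal", 0.95)
    if "de de de" in text:
        return ("Error", 0.90)
    return ("Correct", 0.99)


# Hand-written sample records, not taken from any real dataset.
sample_lines = [
    json.dumps({"text": "Amsterdam is de hoofdstad van Nederland."}),
    json.dumps({"text": "Sorry, dat kan ik niet doen."}),
    json.dumps({"text": "de de de de de"}),
]
result = split_lines(sample_lines, stub_classify)
print({name: len(items) for name, items in result.items()})
```

With the real model, `classify` could wrap the `text_classification` pipeline from the script above and return `output[0]['label'], output[0]['score']`. Each bucket can then be written to its file once, instead of reopening the file for every line.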