Kalamazooter/DutchDatasetCleaner_Bertje

Kalamazooter committed on Feb 20, 2024

Commit 3874964 · verified · 1 Parent(s): 20af765

Update Readme.md

Files changed (1)

README.md +46 -0

@@ -1,3 +1,49 @@
 ---
 license: apache-2.0
+language: nl
+tags:
+- BERTje
+- Filtering
+- Data Cleaning
 ---
+## Model description
+This model is intended to make it easy to filter large synthetic datasets in the Dutch language.
+It was trained mainly to pick out strings with a lot of repetition, odd grammar, or refusals, and returns one of three labels: "Correct", "Error", or "Refusal".
+THIS IS NOT THE FINAL VERSION, MORE ITERATIONS IN THE NEXT FEW WEEKS
+## How to use
+```python
+from transformers import AutoTokenizer, BertForSequenceClassification, pipeline
+import json
+model = BertForSequenceClassification.from_pretrained("Kalamazooter/DutchDatasetCleaner_Bertje")
+tokenizer = AutoTokenizer.from_pretrained("Kalamazooter/DutchDatasetCleaner_Bertje", model_max_length=512)
+text_classification = pipeline(
+    "text-classification",
+    model=model,
+    tokenizer=tokenizer,
+)
+tokenizer_kwargs = {'padding':True,'truncation':True,'max_length':512}
+ErrorThreshold = 0.8 # the model is slightly trigger-happy on the Error class; adjust this value to your needs
+Dataset = "Base_Dataset"
+with open(Dataset+".jsonl","r") as DirtyDataset:
+    for line in DirtyDataset:
+        if not line.strip():
+            continue # skip blank lines, which are not valid JSON
+        DatasetDict = json.loads(line)
+        output = text_classification(DatasetDict['text'],**tokenizer_kwargs)
+        label = output[0]['label']
+        score = output[0]['score']
+        # Route each line to exactly one output file
+        if label == 'Refusal':
+            target = Dataset+"_Refused.jsonl"
+        elif label == 'Error' and score > ErrorThreshold:
+            target = Dataset+"_Error.jsonl"
+        else: # 'Correct', or 'Error' with a score at or below the threshold
+            target = Dataset+"_Clean.jsonl"
+        with open(target,"a") as OutputDataset:
+            OutputDataset.write(line)
+```
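
The routing rule used in the snippet above can also be written as a small pure function, which makes the threshold behaviour easy to check without loading the model. This is only a sketch: the name `route_label` and the returned suffix strings are hypothetical helpers, not part of the model or of `transformers`, and treating a score exactly equal to the threshold as clean is an assumption.

```python
def route_label(label: str, score: float, error_threshold: float = 0.8) -> str:
    """Return the output-file suffix for one classifier prediction.

    `label` and `score` are assumed to come from the first entry of the
    text-classification pipeline output, e.g. output[0]['label'].
    """
    if label == "Refusal":
        return "_Refused"
    if label == "Error" and score > error_threshold:
        return "_Error"
    # "Correct", or an "Error" at or below the threshold (assumption),
    # is kept as clean data.
    return "_Clean"
```

A line would then be appended to `Dataset + route_label(label, score, ErrorThreshold) + ".jsonl"`, so every input line ends up in exactly one output file.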