---
license: cc-by-sa-4.0
datasets:
- SuccubusBot/incoherent-text-dataset
language:
- en
- es
- fr
- de
- zh
- ja
- ru
- ar
- hi
metrics:
- accuracy
base_model:
- distilbert/distilbert-base-multilingual-cased
pipeline_tag: text-classification
library_name: transformers
---

# DistilBERT Incoherence Classifier (Multilingual)

This is a fine-tuned DistilBERT-multilingual model that classifies text by its coherence and can identify several distinct types of incoherence.

## Model Details

- **Model:** DistilBERT (`distilbert-base-multilingual-cased`)
- **Task:** Text Classification (Coherence Detection)
- **Fine-tuning:** The model was fine-tuned on a synthetically generated dataset that features various types of incoherence.

## Training Metrics

| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall   | F1       |
| :---- | :------------ | :-------------- | :------- | :-------- | :------- | :------- |
| 1     | 0.343600      | 0.303963        | 0.880312 | 0.882746  | 0.880312 | 0.879637 |
| 2     | 0.245200      | 0.286482        | 0.900850 | 0.901156  | 0.900850 | 0.899612 |
| 3     | 0.149700      | 0.313061        | 0.906161 | 0.906049  | 0.906161 | 0.905103 |

## Evaluation Metrics

The following metrics were measured on the test set:

| Metric    | Value    |
| :-------- | :------- |
| Loss      | 0.316272 |
| Accuracy  | 0.903329 |
| Precision | 0.903704 |
| Recall    | 0.903329 |
| F1-Score  | 0.902359 |

## Classification Report

```
                    precision    recall  f1-score   support

          coherent       0.86      0.93      0.90      2051
grammatical_errors       0.88      0.76      0.81       599
      random_bytes       1.00      1.00      1.00       599
     random_tokens       1.00      1.00      1.00       600
      random_words       0.95      0.93      0.94       600
            run_on       0.85      0.79      0.82       600
         word_soup       0.89      0.83      0.86       599

          accuracy                           0.90      5648
         macro avg       0.92      0.89      0.90      5648
      weighted avg       0.90      0.90      0.90      5648
```

## Confusion Matrix

![Confusion Matrix](confusion_matrix.png)

The confusion matrix above shows the model's performance on each class.
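The macro and weighted averages in the report can be reproduced from the per-class scores: the weighted average weights each class's metric by its support (number of test examples), while the macro average counts every class equally. A minimal sketch using the F1 values from the report above:

```python
# Per-class (F1, support) pairs, copied from the classification report above
report = {
    "coherent":           (0.90, 2051),
    "grammatical_errors": (0.81, 599),
    "random_bytes":       (1.00, 599),
    "random_tokens":      (1.00, 600),
    "random_words":       (0.94, 600),
    "run_on":             (0.82, 600),
    "word_soup":          (0.86, 599),
}

total_support = sum(support for _, support in report.values())  # 5648

# Weighted average: each class contributes in proportion to its support
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

# Macro average: every class counts equally, regardless of support
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

print(f"weighted F1 = {weighted_f1:.2f}")  # 0.90, matching the report
print(f"macro F1 = {macro_f1:.2f}")        # 0.90, matching the report
```

Note how the weighted average is pulled toward the `coherent` class, which accounts for 2051 of the 5648 test examples.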
## Usage

This model can be used for text classification, specifically to detect and categorize different types of text incoherence. You can use the `inference_example` function provided in the notebook to test your own text.

```py
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("SuccubusBot/distilbert-multilingual-incoherence-classifier")
model = AutoModelForSequenceClassification.from_pretrained("SuccubusBot/distilbert-multilingual-incoherence-classifier")

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

while True:
    text = input("Enter text (or type 'exit' to quit): ")
    if text.lower() == "exit":
        break

    # top_k=None returns a score for every label instead of only the top one
    results = classifier(text, top_k=None)

    # Print the confidence scores for all labels
    for result in results:
        print(f"Label: {result['label']}, Confidence: {result['score']:.4f}")
```

## Limitations

The model was trained on a synthetically generated dataset, so its reported metrics may not transfer directly to real-world text. Additional real-world data may need to be collected before evaluating or deploying this model in a production setting.
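If a downstream application only needs a binary coherent/incoherent decision, the seven class scores can be collapsed into one. The helper below is a hypothetical sketch (not part of this repository); it assumes the per-label output format shown in the usage example above, and the `threshold` parameter is an illustrative choice:

```python
def is_incoherent(label_scores, threshold=0.5):
    """Collapse per-label classifier scores into a binary decision.

    label_scores: list of {"label": str, "score": float} dicts, as the
    pipeline returns for a single input when all label scores are requested.
    Returns True when the top-scoring label is any incoherence class and
    its confidence meets the threshold.
    """
    top = max(label_scores, key=lambda r: r["score"])
    return top["label"] != "coherent" and top["score"] >= threshold


# Example with mocked scores (not real model output):
mock = [
    {"label": "coherent", "score": 0.12},
    {"label": "word_soup", "score": 0.71},
    {"label": "run_on", "score": 0.17},
]
print(is_incoherent(mock))  # True
```

Raising the threshold trades recall for precision on the incoherent classes, which may be preferable when false alarms on coherent text are costly.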