sirunchained committed
Commit f0e23b4 · verified · 1 Parent(s): 59f1247

Update README.md

  1. README.md +91 -3
README.md CHANGED
@@ -1,3 +1,91 @@
- ---
- license: mit
- ---
+ ---
+ library_name: sklearn
+ tags:
+ - text-classification
+ - persian
+ - offensive-language-detection
+ - onnx
+ language: fa
+ metrics:
+ - accuracy
+ - f1
+ - precision
+ - recall
+ ---
+
+ # Offensive Text Classifier for Persian Language
+
+ This repository contains a classifier trained to detect offensive language in Persian text. The model is built with `scikit-learn` and exported to ONNX format for efficient inference.
+
+ ## Model Description
+
+ The model classifies Persian text into two categories: `Offensive` (label 1) and `Neutral` (label 0). It combines a `TfidfVectorizer` for text feature extraction with a Support Vector Machine (SVM) classifier.
+
+ ## Dataset
+
+ The model was trained on the [ParsOffensive dataset](https://github.com/golnaz76gh/pars-offensive-dataset). This dataset consists of Persian comments labeled as either 'Offensive' or 'Neutral'.
+
+ ## Preprocessing
+
+ The text data underwent the following preprocessing steps:
+ - **Normalization**: Using `hazm.Normalizer`.
+ - **Lemmatization**: Using `hazm.Lemmatizer`.
+ - **Stop-word Removal**: Common Persian stop words were removed.
+ - **Label Encoding**: The 'Neutral' and 'Offensive' labels were mapped to `0` and `1`, respectively.
+ - **Imbalance Handling**: ADASYN oversampling was applied during training to address class imbalance.
+
+ ## Model Architecture
+
+ The final pipeline consists of:
+ 1. **`TfidfVectorizer`**: Converts raw text into a matrix of TF-IDF features.
+ 2. **`SVC` (Support Vector Classifier)**: A Support Vector Machine classifier with a radial basis function (RBF) kernel.
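
A minimal sketch of how a two-step pipeline like the one listed above can be assembled in scikit-learn. The toy sentences, their labels, and the default hyperparameters are assumptions for illustration only; the actual training settings are not stated in this README.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for preprocessed Persian comments
texts = ["خوب عالی", "بد زشت", "عالی ممنون", "زشت احمق"]
labels = [0, 1, 0, 1]  # 0 = Neutral, 1 = Offensive

# TF-IDF features feed an SVM with an RBF kernel (default settings assumed)
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", SVC(kernel="rbf")),
])
pipeline.fit(texts, labels)

# Predict the label of one unseen (already preprocessed) sentence
predictions = pipeline.predict(["عالی خوب"])
```

Wrapping both steps in a `Pipeline` keeps vectorization and classification together, so the same object can be fitted, evaluated, and later converted as one unit.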
+
+ ## Performance
+
+ Below are the performance metrics on the test set:
+
+ ```
+               precision    recall  f1-score   support
+
+            0       0.80      0.96      0.88      1043
+            1       0.91      0.62      0.74       644
+
+     accuracy                           0.83      1687
+    macro avg       0.86      0.79      0.81      1687
+ weighted avg       0.84      0.83      0.82      1687
+ ```
+
+ Detailed Metrics:
+ - **Accuracy**: 0.830
+ - **Precision (Offensive)**: 0.909
+ - **Recall (Offensive)**: 0.618
+ - **F1-score (Offensive)**: 0.736
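
The F1-score is the harmonic mean of precision and recall, and accuracy equals the support-weighted average of the per-class recalls. The short check below recomputes both from the figures reported above; the helper name `f1_from` is only illustrative, and the accuracy check relies on the rounded class-0 recall, so it is compared at two decimals.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values for the Offensive class, taken from the metrics above
f1_offensive = f1_from(0.909, 0.618)

# Accuracy as the support-weighted mean of per-class recall
support_neutral, support_offensive = 1043, 644
accuracy = (0.96 * support_neutral + 0.618 * support_offensive) / (
    support_neutral + support_offensive
)

print(round(f1_offensive, 3))  # 0.736
print(round(accuracy, 2))      # 0.83
```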
+
+ ## How to Use
+
+ ### Load the model
+
+ You can load the ONNX model and use it for inference. Input text must go through the same preprocessing steps that were applied during training.
+
+ ```python
+ from hazm import Lemmatizer, Normalizer, stopwords_list
+
+ # Recreate the preprocessing components (or load them if saved)
+ stopwords = stopwords_list()
+ lemmatizer = Lemmatizer()
+ normalizer = Normalizer()
+
+ def clean_sentences(sentence: str) -> str:
+     # Normalize, drop stop words, then lemmatize the remaining tokens
+     words = normalizer.normalize(sentence).split(" ")
+     return " ".join(lemmatizer.lemmatize(word) for word in words if word not in stopwords)
+ ```
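
Once the text has been cleaned, inference with `onnxruntime` might look like the sketch below. It rests on several assumptions that this README does not confirm: the exported graph takes a single string input of shape `[N, 1]`, its first output holds the predicted labels, and the model file name (`model.onnx` here) is hypothetical. The helper names `load_session` and `predict_labels` are illustrative, not part of the repository.

```python
import numpy as np

def load_session(model_path: str):
    # Imported lazily so predict_labels can be used with any compatible session
    import onnxruntime as ort
    return ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

def predict_labels(session, cleaned_texts: list[str]) -> list[int]:
    # Assumption: the model has one string input of shape [N, 1]
    input_name = session.get_inputs()[0].name
    batch = np.array(cleaned_texts, dtype=object).reshape(-1, 1)
    outputs = session.run(None, {input_name: batch})
    # Assumption: the first output contains the predicted class labels
    return [int(label) for label in outputs[0]]

# Example usage (hypothetical file name):
# session = load_session("model.onnx")
# predict_labels(session, [clean_sentences("...")])
```

Inspecting `session.get_inputs()` and `session.get_outputs()` on the actual file is the safest way to confirm the input shape and output order before relying on this sketch.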
+
+ ## Dependencies
+
+ - `pandas`
+ - `numpy`
+ - `hazm`
+ - `scikit-learn`
+ - `imbalanced-learn` (imported as `imblearn`)
+ - `skl2onnx`
+ - `onnxruntime`
+ - `huggingface_hub`