---
license: mit
tags:
- text-classification
- ai-detection
- human-vs-ai
- binary-classification
- ensemble-methods
- bert
- lstm
- xgboost
- machine-learning
- deep-learning
datasets:
- hc3
language:
- en
metrics:
- accuracy
- precision
- recall
- f1
- roc-auc
library_name: scikit-learn
pipeline_tag: text-classification
---

# Human vs. AI Text Classifier

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![scikit-learn](https://img.shields.io/badge/scikit--learn-1.3+-orange.svg)](https://scikit-learn.org/)
[![PyTorch 2.0+](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c.svg)](https://pytorch.org/)
[![TensorFlow](https://img.shields.io/badge/TensorFlow-2.13+-FF6F00.svg)](https://tensorflow.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-black.svg)](https://github.com/huzaifanasir95/Human-vs-AI-Classifier)

## Model Description

A comprehensive ensemble-based text classification system that distinguishes between human-written and AI-generated text with high accuracy. This implementation combines **traditional machine learning** (Logistic Regression, Random Forest, SVM, XGBoost) and **deep learning** approaches (BiLSTM with Attention, BERT) using advanced ensemble techniques.

**Key Features:**
- 6 diverse classifiers (4 traditional ML + 2 deep learning)
- 5,015-dimensional hybrid feature space (5,000 TF-IDF + 15 linguistic features)
- 4 ensemble strategies (Hard/Soft Voting, Weighted Average, Stacking)
- 99.59% F1-score with the weighted ensemble
- Balanced performance (99.59% precision, recall, and accuracy)
- Trained on 52,452 samples from the HC3 dataset

## Model Architecture

```
Input Text
    ↓
[Feature Engineering]
 ├─→ TF-IDF Vectorization (5,000 features)
 │     - Unigrams & Bigrams
 │     - Max DF: 0.95, Min DF: 2
 │
 └─→ Linguistic Features (15 features)
       - Text length, word count, sentence count
       - Lexical diversity (TTR)
       - Stopword/punctuation ratios
       - Statistical text properties
    ↓
Multi-Modal Feature Vector (5,015 dimensions)
    ↓
┌──────────────────────────────────────────────┐
│         Base Classifiers (Parallel)          │
├──────────────────────────────────────────────┤
│ Traditional ML          │ Deep Learning      │
├─────────────────────────┼────────────────────┤
│ • Logistic Regression   │ • BERT             │
│ • Random Forest (200)   │   (bert-base)      │
│ • SVM (RBF kernel)      │ • BiLSTM+Attention │
│ • XGBoost (200 trees)   │   (64 units)       │
└─────────────────────────┴────────────────────┘
    ↓
[Ensemble Aggregation]
 ├─→ Hard Voting (Majority vote)
 ├─→ Soft Voting (Probability averaging)
 ├─→ Weighted Average (Optimized weights)
 └─→ Stacking (Meta-learner: Logistic Regression)
    ↓
Final Prediction: Human (0) or AI (1)
```
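The feature-engineering stage above can be sketched with scikit-learn. `TfidfVectorizer` and `np.hstack` are the real library APIs; the tiny corpus and the zero-filled linguistic block are placeholders (the full system uses `max_df=0.95, min_df=2` over 5,000 terms plus 15 real linguistic features):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "Humans often write with varied rhythm and the occasional typo.",
    "As an AI language model, I aim to provide balanced, structured answers.",
    "The weather was dreadful, so we stayed in and played cards.",
    "In conclusion, there are several factors to consider here.",
]

# TF-IDF branch: unigrams + bigrams, capped vocabulary
# (the full run uses max_features=5000, max_df=0.95, min_df=2;
#  min_df=1 here only because the demo corpus is tiny)
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), min_df=1)
tfidf = vectorizer.fit_transform(texts).toarray()

# Placeholder for the 15 handcrafted linguistic features (see Training Details)
linguistic = np.zeros((len(texts), 15))

# Concatenate both branches into the multi-modal feature vector
features = np.hstack([tfidf, linguistic])
print(features.shape)  # (n_texts, vocab_size + 15); 5,015 dims in the full system
```
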

**Individual Model Specifications:**

| Model | Type | Parameters | Key Configuration |
|-------|------|------------|-------------------|
| **Logistic Regression** | Linear | 5,015 | C=1.0, L2 regularization, LBFGS solver |
| **Random Forest** | Ensemble Trees | - | 200 estimators, unlimited depth |
| **SVM** | Kernel Method | - | RBF kernel, C=1.0, gamma=scale |
| **XGBoost** | Gradient Boosting | - | 200 trees, depth=7, LR=0.1 |
| **BiLSTM** | Recurrent NN | ~500K | 64 units/dir, attention, dropout=0.5 |
| **BERT** | Transformer | 110M | bert-base-uncased, max_len=128 |

## Performance

### Individual Models

| Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|-------|----------|-----------|--------|----------|---------|
| **XGBoost** | 0.9903 | 0.9838 | 0.9970 | 0.9904 | 0.9994 |
| **Logistic Regression** | 0.9897 | 0.9827 | 0.9970 | 0.9898 | 0.9996 |
| **SVM** | 0.9867 | 0.9807 | 0.9929 | 0.9867 | 0.9991 |
| **BERT** | 0.9727 | 0.9510 | 0.9967 | 0.9733 | 0.9975 |
| **BiLSTM** | 0.9710 | 0.9668 | 0.9756 | 0.9712 | 0.9963 |
| **Random Forest** | 0.9573 | 0.9571 | 0.9576 | 0.9573 | 0.9922 |

### Ensemble Methods

| Method | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|--------|----------|-----------|--------|----------|---------|
| **Weighted Average** ⭐ | **0.9959** | **0.9959** | **0.9959** | **0.9959** | **0.9998** |
| **Stacking** | 0.9956 | 0.9947 | 0.9964 | 0.9956 | 0.9998 |
| **Soft Voting** | 0.9945 | 0.9937 | 0.9954 | 0.9945 | 0.9998 |
| **Hard Voting** | 0.9921 | 0.9944 | 0.9898 | 0.9921 | 0.9998 |

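The stacking row corresponds to scikit-learn's `StackingClassifier` with a logistic-regression meta-learner trained on 5-fold out-of-fold probabilities. A minimal sketch on synthetic data, using only three sklearn base models (the real system also stacks XGBoost, BERT, and BiLSTM probabilities):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-in for the 5,015-dim feature matrix
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC(kernel="rbf", probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # the meta-learner
    cv=5,                                  # 5-fold out-of-fold predictions
    stack_method="predict_proba",          # stack probabilities, not labels
)
stack.fit(X, y)
print(stack.predict(X[:5]))
```
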
**Optimized Ensemble Weights:**
- XGBoost: 0.25
- Logistic Regression: 0.20
- BERT: 0.20
- SVM: 0.15
- Random Forest: 0.10
- BiLSTM: 0.10

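With those weights, the weighted-average prediction is a convex combination of each base model's P(AI). A minimal sketch with illustrative probabilities (the per-model values below are made up, not measured):

```python
# P(AI) from each base model for one input text (illustrative values)
probs = {
    "xgboost": 0.98, "logistic_regression": 0.97, "bert": 0.95,
    "svm": 0.96, "random_forest": 0.90, "bilstm": 0.92,
}
# Optimized weights from the list above
weights = {
    "xgboost": 0.25, "logistic_regression": 0.20, "bert": 0.20,
    "svm": 0.15, "random_forest": 0.10, "bilstm": 0.10,
}
assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights sum to 1

p_ai = sum(weights[m] * probs[m] for m in probs)
prediction = int(p_ai >= 0.5)  # 1 = AI, 0 = Human
print(f"P(AI) = {p_ai:.4f} -> {'AI' if prediction else 'Human'}")
```
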
**Confusion Matrix (Weighted Ensemble):**
```
                 Predicted
                 Human    AI
Actual  Human     3918    16
        AI          16  3918
```
- Total Errors: 32 / 7,868 (0.41%)
- False Positives: 16 (0.20%)
- False Negatives: 16 (0.20%)

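The headline metrics can be re-derived directly from this confusion matrix, treating AI as the positive class:

```python
# Counts from the confusion matrix above
tn, fp, fn, tp = 3918, 16, 16, 3918

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 7836 / 7868
precision = tp / (tp + fp)                    # 3918 / 3934
recall    = tp / (tp + fn)                    # 3918 / 3934
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

All four round to 0.9959, matching the Weighted Average row of the ensemble table.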
## Training Details

**Dataset:**
- **Name**: HC3 (Human-ChatGPT Comparison Corpus)
- **Total Samples**: 52,452 balanced pairs
- **Training**: 36,716 (70%)
- **Validation**: 7,868 (15%)
- **Test**: 7,868 (15%)
- **Domains**: Finance, Medicine, Open QA, Reddit ELI5, Wikipedia CS/AI
- **Minimum Length**: 50 characters
- **Balance**: 50-50 (Human-AI)

**Feature Engineering:**
- **TF-IDF**: 5,000 dimensions (unigrams + bigrams)
- **Linguistic**: 15 handcrafted features
  - Text statistics (length, word/sentence counts)
  - Lexical diversity (Type-Token Ratio)
  - Character ratios (stopwords, punctuation, digits, capitals)
  - Structural patterns (long/short words, question/exclamation marks)

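One plausible implementation of a subset of these 15 features (the exact definitions and stopword list here are assumptions for illustration, not the repository's code):

```python
import re
import string

# Tiny illustrative stopword list; the real system likely uses a full NLTK/sklearn list
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that", "as"}

def linguistic_features(text):
    """A plausible subset of the 15 handcrafted style features."""
    words = text.split()
    n_words = max(len(words), 1)
    n_chars = max(len(text), 1)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "text_length": len(text),
        "word_count": len(words),
        "sentence_count": len(sentences),
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
        "stopword_ratio": sum(w.lower() in STOPWORDS for w in words) / n_words,
        "punctuation_ratio": sum(c in string.punctuation for c in text) / n_chars,
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars,
        "capital_ratio": sum(c.isupper() for c in text) / n_chars,
        "long_word_ratio": sum(len(w) > 6 for w in words) / n_words,
        "question_marks": text.count("?"),
    }

feats = linguistic_features("Is this text human-written? It has 2 sentences, honestly.")
print(feats["sentence_count"], feats["word_count"])
```
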
**Training Configuration:**

*Traditional ML Models:*
- Framework: scikit-learn 1.3+
- Cross-validation: 5-fold (for stacking)
- Class balance: maintained via stratified splitting

*Deep Learning Models:*
- **BiLSTM**: 10 epochs (early stopped at 4), batch=64, Adam optimizer (LR=1e-3)
- **BERT**: 2 epochs, batch=16, AdamW optimizer (LR=2e-5), warmup=500 steps

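The 70/15/15 stratified split can be reproduced with two calls to scikit-learn's `train_test_split` (the `random_state` and the toy data here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.array([0, 1] * 500)  # balanced labels, as in HC3

# First split off 70% for training, stratified on the label
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
# Then split the remaining 30% evenly into validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```

Stratification keeps the 50-50 human/AI balance intact in all three partitions.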
**Hardware:**
- Training: CPU/GPU compatible
- BiLSTM training time: 3,406 seconds (4 epochs)
- BERT training time: variable (depends on GPU)

## Usage

### Installation

```bash
git clone https://github.com/huzaifanasir95/Human-vs-AI-Classifier.git
cd Human-vs-AI-Classifier
pip install -r requirements.txt
```
184
+
185
+ ### Download Models
186
+
187
+ ```python
188
+ from huggingface_hub import hf_hub_download
189
+ import pickle
190
+ import torch
191
+
192
+ # Download traditional ML models
193
+ models = ['logistic_regression', 'random_forest', 'svm', 'xgboost']
194
+ for model_name in models:
195
+ model_path = hf_hub_download(
196
+ repo_id="huzaifanasirrr/human-vs-ai-text-classifier",
197
+ filename=f"models/{model_name}.pkl"
198
+ )
199
+ with open(model_path, 'rb') as f:
200
+ model = pickle.load(f)
201
+
202
+ # Download deep learning models
203
+ bert_path = hf_hub_download(
204
+ repo_id="huzaifanasirrr/human-vs-ai-text-classifier",
205
+ filename="models/bert_best.pt"
206
+ )
207
+ bilstm_path = hf_hub_download(
208
+ repo_id="huzaifanasirrr/human-vs-ai-text-classifier",
209
+ filename="models/bilstm_best.h5"
210
+ )
211
+ ```

### Inference (Weighted Ensemble)

```python
from src.feature_extractor import FeatureExtractor
from src.models.ensemble import WeightedEnsemble
import numpy as np

# Initialize the feature extractor
feature_extractor = FeatureExtractor(
    max_features=5000,
    ngram_range=(1, 2)
)

# Extract features from text
text = "Your text to classify here..."
features = feature_extractor.extract(text)  # Shape: (5015,)

# Build the ensemble (base models loaded in the Download Models step)
ensemble = WeightedEnsemble(
    models=[lr_model, rf_model, svm_model, xgb_model, bert_model, bilstm_model],
    weights=[0.20, 0.10, 0.15, 0.25, 0.20, 0.10]
)

# Predict: 0 = Human, 1 = AI
prediction = ensemble.predict(features)
probability = ensemble.predict_proba(features)

if prediction == 0:
    print(f"Human-written (confidence: {probability[0]:.2%})")
else:
    print(f"AI-generated (confidence: {probability[1]:.2%})")
```

### Single Model Inference

```python
# Using XGBoost (the best individual model)
xgb_prediction = xgb_model.predict(features.reshape(1, -1))
xgb_proba = xgb_model.predict_proba(features.reshape(1, -1))

print(f"Prediction: {'AI' if xgb_prediction[0] else 'Human'}")
print(f"Confidence: {xgb_proba[0][xgb_prediction[0]]:.2%}")
```

## Key Innovations

1. **Hybrid Feature Engineering**: Combines vocabulary-based TF-IDF with linguistic style features
2. **Multi-Paradigm Ensemble**: Integrates linear models, tree ensembles, kernel methods, and neural networks
3. **Optimized Weighting**: Performance-based weight assignment for ensemble members
4. **Balanced Performance**: Equal precision and recall (99.59%) indicates no systematic bias toward either class
5. **Domain Diversity**: Trained across 5 different text domains for robust generalization

## Feature Importance

Based on XGBoost analysis:

| Feature Type | Importance |
|--------------|------------|
| TF-IDF Features | 89.2% |
| Average Sentence Length | 4.3% |
| Lexical Diversity (TTR) | 2.7% |
| Unique Words Ratio | 1.5% |
| Average Word Length | 1.1% |
| Others | 1.2% |

**Insight**: Vocabulary patterns dominate, but linguistic features provide crucial complementary information.

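Group-level percentages like those in the table can be derived by summing `feature_importances_` over each feature block. A sketch with a random stand-in vector, since the trained XGBoost model is not loaded here (in practice you would pass `xgb_model.feature_importances_`):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for xgb_model.feature_importances_: length 5,015, normalized to sum to 1
importances = rng.random(5015)
importances /= importances.sum()

tfidf_share = importances[:5000].sum()   # first 5,000 dims are TF-IDF terms
linguistic = importances[5000:]          # last 15 dims are linguistic features

print(f"TF-IDF total: {tfidf_share:.1%}")
# Rank the linguistic features by importance, highest first
for i in np.argsort(linguistic)[::-1][:3]:
    print(f"linguistic feature #{i}: {linguistic[i]:.2%}")
```
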
## Limitations

- **Dataset Specificity**: Trained on ChatGPT-generated text; may not generalize to other LLMs (GPT-4, Claude, Gemini)
- **Domain Dependency**: Best performance on domains similar to the training data
- **Temporal Drift**: As LLMs evolve, detection patterns may become obsolete
- **Adversarial Vulnerability**: Not evaluated against deliberate evasion attempts
- **Language**: English-only (no multilingual support)
- **Computational Cost**: Full ensemble requires running all 6 models at inference time

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{nasir2025humanaiclassifier,
  title={Human vs. AI Text Classification: A Comprehensive Study Using Machine Learning and Deep Learning Approaches},
  author={Nasir, Huzaifa},
  year={2025},
  note={National University of Computer and Emerging Sciences, Pakistan. Hugging Face: https://huggingface.co/huzaifanasirrr/human-vs-ai-text-classifier}
}
```

**HC3 Dataset:**
```bibtex
@article{guo2023hc3,
  title={How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection},
  author={Guo, Biyang and Zhang, Xin and Wang, Ziyuan and Jiang, Minqi and Nie, Jinran and Ding, Yuxuan and others},
  journal={arXiv preprint arXiv:2301.07597},
  year={2023}
}
```

## Model Files

- `models/*.pkl` - Traditional ML models (Logistic Regression, Random Forest, SVM, XGBoost)
- `models/bert_best.pt` - Fine-tuned BERT model checkpoint
- `models/bilstm_best.h5` - BiLSTM with Attention model
- `results/*.json` - Comprehensive performance metrics
- `data/feature_info.json` - Feature vocabulary and metadata
- `visualizations/*.png` - Training curves, confusion matrices, ROC curves, comparisons
- `config.yaml` - Configuration settings
- `research_paper.tex` - Full research paper (Springer LNCS format)

## Ethical Considerations

⚠️ **Important Notice:**

This model is designed for research and educational purposes. When deploying it in real-world applications:

- **Transparency**: Inform users when text is subject to AI detection
- **Fairness**: Evaluate for bias against non-native speakers or specific writing styles
- **Privacy**: Respect user privacy and data protection regulations
- **Accuracy**: Do not treat predictions as definitive proof; false positives (0.2% on the test set) do occur
- **Context**: Use detection as one signal among many, not as sole evidence
- **Appeals**: Provide mechanisms for users to contest decisions

Detection systems should support human judgment, not replace it.

## Author

**Huzaifa Nasir**
📧 nasirhuzaifa95@gmail.com
🎓 National University of Computer and Emerging Sciences (FAST-NUCES), Pakistan
🔗 [GitHub Repository](https://github.com/huzaifanasir95/Human-vs-AI-Classifier)
🆔 ORCID: [0009-0000-1482-3268](https://orcid.org/0009-0000-1482-3268)

## License

MIT License - see the LICENSE file for details.

## Acknowledgments

This project builds upon:
- **HC3 Dataset**: Human-ChatGPT comparison corpus ([Guo et al., 2023](https://arxiv.org/abs/2301.07597))
- **BERT**: Pre-trained language model ([Devlin et al., 2018](https://arxiv.org/abs/1810.04805))
- **XGBoost**: Gradient boosting framework ([Chen & Guestrin, 2016](https://arxiv.org/abs/1603.02754))
- **scikit-learn**: Machine learning library ([Pedregosa et al., 2011](https://jmlr.org/papers/v12/pedregosa11a.html))

Research conducted at FAST-NUCES Islamabad. Special thanks to the open-source community.

---

**Status**: ✅ Production-ready | Last updated: January 2025