Pujan-Dev committed on
Commit
7ce4837
·
1 Parent(s): dabc1e2

Docs :added the readme.md

notebook/ai_vs_human/final_archi.md ADDED
@@ -0,0 +1,426 @@
+ # AI vs Human Text Detector V3 - Final Architecture Summary
+
+ **Model Version**: V3
+ **Type**: Hybrid Feature Engineering + TF-IDF Classifier
+ **Output Directory**: `./v3_model/`
+ **Date**: March 2026
+
+ ---
+
+ ## 📊 Overview
+
+ The V3 model is a **non-transformer, feature-based ML classifier** that distinguishes AI-generated from human-written text by combining engineered linguistic features with TF-IDF text representations. The classifier itself is not a transformer; `distilgpt2` is used only to compute the perplexity feature.
+
+ ```
+ ┌─────────────┐
+ │ Input Text  │
+ └──────┬──────┘
+        │
+        ├──────────────────────────────────┐
+        │                                  │
+        ▼                                  ▼
+ ┌──────────────────┐             ┌─────────────────┐
+ │ Text Features    │             │ Engineered      │
+ │ (TF-IDF)         │             │ Features        │
+ │                  │             │ (16 features)   │
+ │ • Word (1-2gram) │             │                 │
+ │ • Char (3-5gram) │             │ • Perplexity    │
+ │                  │             │ • Burstiness    │
+ │ Max 200k features│             │ • Stylometry    │
+ └────────┬─────────┘             └─────────┬───────┘
+          │                                 │
+          │        ┌───────────────┐        │
+          └───────►│ StandardScaler│◄───────┘
+                   └───────┬───────┘
+                           │
+                   ┌───────▼──────────┐
+                   │ Sparse Matrix    │
+                   │ Concat (hstack)  │
+                   └───────┬──────────┘
+                           │
+                   ┌───────▼────────┐
+                   │ Logistic       │
+                   │ Regression     │
+                   │ (GridSearchCV) │
+                   └───────┬────────┘
+                           │
+                   ┌───────▼────────┐
+                   │ Prediction     │
+                   │ (Human vs AI)  │
+                   └────────────────┘
+ ```
+
+ ---
+
+ ## 🏗️ Architecture Components
+
+ ### 1. **Data Loading**
+
+ **Function**: `load_dataset_recursive(max_samples=20000)`
+
+ - **Source**: Recursively scans `./DATASET/` folder
+ - **Formats Supported**: `.jsonl`, `.json`, `.csv`
+ - **Schema Support**:
+   - Schema 1: `human_text` + `ai_text` columns
+   - Schema 2: `text` + `label`/`ai_gen` columns
+ - **Labels**:
+   - `0` = Human text
+   - `1` = AI-generated text
+ - **Preprocessing**: Text normalization (whitespace cleanup)
+ - **Max Samples**: 20,000 (configurable)
+ - **Random State**: 42
+
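The two schemas listed above can be flattened into one list of `(text, label)` pairs. The following is a minimal sketch, assuming each record has already been parsed into a `dict` and that Schema 2 labels are numeric `0`/`1`. The helper names `normalize_text` and `rows_from_record` are hypothetical and are not taken from the notebook.

```python
import re


def normalize_text(text):
    # Collapse runs of whitespace into single spaces and trim both ends.
    return re.sub(r"\s+", " ", str(text)).strip()


def rows_from_record(record):
    """Return a list of (text, label) pairs; 0 = human, 1 = AI."""
    rows = []
    if "human_text" in record or "ai_text" in record:
        # Schema 1: one record may carry both a human and an AI sample.
        if record.get("human_text"):
            rows.append((normalize_text(record["human_text"]), 0))
        if record.get("ai_text"):
            rows.append((normalize_text(record["ai_text"]), 1))
    elif "text" in record:
        # Schema 2: one text with a `label` or `ai_gen` column
        # (assumed here to hold a numeric 0/1 value).
        raw = record.get("label", record.get("ai_gen"))
        if raw is not None:
            rows.append((normalize_text(record["text"]), int(raw)))
    return rows


pairs = rows_from_record({"human_text": "Hi  there.\n", "ai_text": "Hello."})
single = rows_from_record({"text": " Some\ttext ", "ai_gen": 1})
```

Here `pairs` holds one human row and one AI row, while `single` holds a single AI row.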
+ ---
+
+ ### 2. **Feature Extraction Pipeline**
+
+ The model extracts **3 types of engineered features** (16 in total) in parallel:
+
+ #### 2.1 **Perplexity Features** (1 feature)
+
+ **Model**: `distilgpt2` (Hugging Face Transformers)
+
+ ```python
+ class PerplexityCalculator:
+     """Summary of the perplexity settings.
+
+     - Model: distilgpt2
+     - Max Length: 512 tokens
+     - Metric: exp(cross_entropy_loss)
+     - Cap: 10,000 (outlier protection)
+     - Fallback: 100.0 on error
+     """
+ ```
+
+ **What it measures**: Language model surprise/naturalness
+ - Lower perplexity → More predictable (often AI)
+ - Higher perplexity → Less predictable (often human)
+
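The metric `exp(cross_entropy_loss)` is the exponential of the mean negative log-likelihood per token. The sketch below shows only that arithmetic together with the cap and fallback values listed above. It assumes the per-token natural-log probabilities are already available, whereas the real pipeline obtains the loss from `distilgpt2`. The function name `perplexity_from_log_probs` is hypothetical.

```python
import math


def perplexity_from_log_probs(token_log_probs, cap=10_000.0, fallback=100.0):
    """Return exp(mean negative log-likelihood), capped at `cap`."""
    if not token_log_probs:
        # Nothing to score: mirror the documented fallback value.
        return fallback
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    try:
        return min(math.exp(mean_nll), cap)
    except OverflowError:
        # exp() overflowed, so the value is certainly above the cap.
        return cap


# Four tokens, each predicted with probability 0.25, give perplexity 4.
uniform = perplexity_from_log_probs([math.log(0.25)] * 4)
capped = perplexity_from_log_probs([-100.0])
empty = perplexity_from_log_probs([])
```

A model that assigns each token probability 0.25 is as uncertain as a uniform choice among four options, which is why the result is 4.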
+ ---
+
+ #### 2.2 **Burstiness Features** (5 features)
+
+ Measures sentence length variation patterns.
+
+ **Features**:
+ 1. `burst_mean` - Average sentence length (words)
+ 2. `burst_std` - Standard deviation of sentence lengths
+ 3. `burst_max` - Maximum sentence length
+ 4. `burst_min` - Minimum sentence length
+ 5. `burst_range` - Range (max - min)
+
+ **Theory**: Human writing has more variation in sentence length (high burstiness), while AI text tends to be more uniform.
+
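The five burstiness features can be sketched with the standard library alone. Two assumptions are made here: sentences are split with a naive regular expression instead of the `nltk` tokenizer the project lists as a dependency, and the standard deviation is the population form. The function name `burstiness_features` is hypothetical.

```python
import re
import statistics


def burstiness_features(text):
    # Naive split after ., ! or ? followed by whitespace (an assumption;
    # a proper sentence tokenizer handles abbreviations better).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    if not lengths:
        keys = ("burst_mean", "burst_std", "burst_max", "burst_min", "burst_range")
        return {key: 0.0 for key in keys}
    return {
        "burst_mean": statistics.fmean(lengths),
        "burst_std": statistics.pstdev(lengths),  # population std (assumption)
        "burst_max": float(max(lengths)),
        "burst_min": float(min(lengths)),
        "burst_range": float(max(lengths) - min(lengths)),
    }


feats = burstiness_features("One two three. Four five! Six?")
```

The sample text has sentences of 3, 2 and 1 words, so the mean is 2 and the range is 2.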
+ ---
+
+ #### 2.3 **Stylometry Features** (10 features)
+
+ Writing style and readability metrics.
+
+ **Features**:
+ 1. `num_words` - Total word count
+ 2. `num_chars` - Total character count
+ 3. `num_sentences` - Total sentence count
+ 4. `avg_word_len` - Average word length
+ 5. `avg_sent_len` - Average sentence length
+ 6. `lexical_diversity` - Unique words / total words
+ 7. `punct_ratio` - Punctuation density
+ 8. `caps_ratio` - Capitalization ratio
+ 9. `flesch_reading` - Flesch Reading Ease score
+ 10. `flesch_grade` - Flesch-Kincaid Grade Level
+
+ **Library**: `textstat` + `nltk`
+
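Several of the simpler stylometry features reduce to counting. The sketch below covers six of them using only the standard library. The exact definitions are assumptions: words are whitespace-separated and stripped of surrounding punctuation, and both ratios are taken over the total character count. The readability scores are left to `textstat` and are not reproduced. The function name `basic_stylometry` is hypothetical.

```python
import string


def basic_stylometry(text):
    words = text.split()
    n_chars = len(text)
    # Strip surrounding punctuation and lowercase before comparing words.
    tokens = [w.strip(string.punctuation).lower() for w in words]
    tokens = [t for t in tokens if t]
    n_tokens = len(tokens)
    return {
        "num_words": len(words),
        "num_chars": n_chars,
        "avg_word_len": sum(len(t) for t in tokens) / n_tokens if n_tokens else 0.0,
        "lexical_diversity": len(set(tokens)) / n_tokens if n_tokens else 0.0,
        "punct_ratio": sum(c in string.punctuation for c in text) / n_chars if n_chars else 0.0,
        "caps_ratio": sum(c.isupper() for c in text) / n_chars if n_chars else 0.0,
    }


style = basic_stylometry("The cat saw the Cat.")
```

In the sample sentence, three distinct words out of five give a lexical diversity of 0.6.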
+ ---
+
+ ### 3. **TF-IDF Vectorization**
+
+ #### 3.1 **Word-Level TF-IDF**
+
+ ```python
+ from sklearn.feature_extraction.text import TfidfVectorizer
+
+ word_vectorizer = TfidfVectorizer(
+     analyzer="word",
+     ngram_range=(1, 2),   # Unigrams + bigrams
+     min_df=3,             # Drop terms seen in fewer than 3 documents
+     max_df=0.98,          # Drop terms seen in more than 98% of documents
+     max_features=120000,  # Cap at 120k features
+     sublinear_tf=True,    # Replace tf with 1 + log(tf)
+ )
+ ```
+
+ **Output**: Sparse matrix of word/phrase importance scores
+
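Two of the settings above are easy to check by hand: the document-frequency filter (`min_df`, `max_df`) and sublinear term-frequency scaling. The sketch below reimplements just those two rules on pre-tokenized toy documents. It is an illustration and not scikit-learn's implementation, and the helper names `filter_vocabulary` and `sublinear_tf` are hypothetical.

```python
import math
from collections import Counter


def filter_vocabulary(tokenized_docs, min_df=3, max_df=0.98):
    """Keep terms found in at least `min_df` docs and at most `max_df` of all docs."""
    n_docs = len(tokenized_docs)
    # Count each term once per document it appears in.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {t for t, df in doc_freq.items() if min_df <= df <= max_df * n_docs}


def sublinear_tf(count):
    # A raw count tf becomes 1 + ln(tf); absent terms stay at 0.
    return 1.0 + math.log(count) if count > 0 else 0.0


docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "cat"], ["a", "cat"]]
vocab = filter_vocabulary(docs, min_df=3, max_df=0.98)
```

With four documents, only `the` and `cat` appear in at least three of them, so `dog` and `a` are dropped. Sublinear scaling means a term that occurs twice weighs about 1.69 times a term that occurs once, not twice as much.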
+ ---
+
+ #### 3.2 **Character-Level TF-IDF**
+
+ ```python
+ from sklearn.feature_extraction.text import TfidfVectorizer
+
+ char_vectorizer = TfidfVectorizer(
+     analyzer="char_wb",   # Character n-grams inside word boundaries
+     ngram_range=(3, 5),   # 3-char to 5-char sequences
+     min_df=3,
+     max_df=0.98,
+     max_features=80000,   # Cap at 80k features
+     sublinear_tf=True,
+ )
+ ```
+
+ **Purpose**: Captures sub-word patterns and stylistic signatures
+
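The `char_wb` analyzer builds character n-grams only from text inside word boundaries, padding each word with a space on both sides. The sketch below approximates that behaviour in plain Python so the resulting features are visible. It is written from the documented behaviour and is not scikit-learn's own code, and the function name `char_wb_ngrams` is hypothetical.

```python
def char_wb_ngrams(text, min_n=3, max_n=5):
    """Character n-grams taken inside each space-padded word."""
    ngrams = []
    for word in text.split():
        padded = " " + word + " "
        for n in range(min_n, max_n + 1):
            if n >= len(padded):
                # The padded word is no longer than n: emit it once and stop.
                ngrams.append(padded)
                break
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


trigrams = char_wb_ngrams("the cat", min_n=3, max_n=3)
short_word = char_wb_ngrams("ai", min_n=3, max_n=5)
```

Because n-grams never span two words, `"the cat"` yields `" th"`, `"the"`, `"he "` for the first word and `" ca"`, `"cat"`, `"at "` for the second, with no n-gram such as `"e c"`.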
+ ---
+
+ ### 4. **Feature Preprocessing**
+
+ **Engineered Features**:
+ - Scaled using `StandardScaler` (z-score normalization)
+ - Converted to sparse CSR matrix for memory efficiency
+
+ **Hybrid Feature Vector**:
+ ```python
+ from scipy.sparse import hstack
+ hybrid_vec = hstack([word_tfidf, char_tfidf, engineered_features_scaled])
+ ```
+
+ **Final Feature Dimensionality**:
+ - Word TF-IDF: Up to 120,000 features
+ - Char TF-IDF: Up to 80,000 features
+ - Engineered: 16 features
+ - **Total**: Up to 200,016 features (sparse)
+
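Z-score normalization replaces each engineered feature with its distance from the training mean, measured in standard deviations. The sketch below shows the arithmetic on plain lists, assuming the population standard deviation and a divisor of 1.0 for constant columns. The function name `standardize_columns` is hypothetical. At inference time the means and deviations learned on the training set are reused, which is why the scaler is saved alongside the classifier.

```python
import statistics


def standardize_columns(rows):
    """Z-score each column: (x - mean) / population std."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has std 0; use 1.0 so the division stays defined.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    scaled = [
        [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in rows
    ]
    return scaled, means, stds


scaled, means, stds = standardize_columns([[1.0, 10.0], [3.0, 10.0]])
```

The first column is centred on 2 and spread to -1 and 1, while the constant second column becomes all zeros.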
+ ---
+
+ ### 5. **Model Training**
+
+ #### 5.1 **Train-Test Split**
+ ```python
+ # train_size: 80% (16,000 samples)
+ # test_size: 20% (4,000 samples)
+ # stratified: Yes (class proportions preserved in both splits)
+ # random_state: 42
+ ```
+
+ #### 5.2 **Classifier**
+
+ **Algorithm**: Logistic Regression
+
+ **Hyperparameter Tuning**: GridSearchCV with 3-fold stratified cross-validation
+
+ **Search Space**:
+ ```python
+ {
+     "C": [0.5, 1.0, 2.0, 4.0],           # Searched: inverse regularization strength
+     "class_weight": [None, "balanced"],  # Searched: class balancing
+     "solver": "saga",                    # Fixed: SAGA, a variant of Stochastic Average Gradient
+     "penalty": "l2",                     # Fixed: L2 regularization
+     "max_iter": 2500,                    # Fixed
+     "n_jobs": -1                         # Fixed: parallel processing
+ }
+ ```
+
+ **Scoring Metric**: F1 score (harmonic mean of precision and recall)
+
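The size of the search follows directly from the grid: four values of `C` times two values of `class_weight`, each evaluated on three folds. The sketch below enumerates the candidates with `itertools.product`. The variable names are illustrative only.

```python
from itertools import product

# Only the two searched hyperparameters; the rest are held fixed.
grid = {"C": [0.5, 1.0, 2.0, 4.0], "class_weight": [None, "balanced"]}
cv_folds = 3

# One dict per combination, in the order the grid lists its values.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
n_cv_fits = len(candidates) * cv_folds
```

That gives 8 candidates and 24 cross-validation fits. If the search is left at scikit-learn's default of refitting the best candidate on the full training set, one more fit is added.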
+ ---
+
+ ### 6. **Model Evaluation**
+
+ **Metrics Tracked**:
+ - **Accuracy**: Overall correct predictions
+ - **F1 Score**: Harmonic mean of precision/recall
+ - **ROC-AUC**: Area under ROC curve
+ - **Confusion Matrix**: True/false positives/negatives
+ - **Classification Report**: Per-class precision/recall/F1
+
+ **Visualizations**:
+ 1. Confusion Matrix
+ 2. ROC Curve
+ 3. Feature Importance (top engineered features)
+ 4. Perplexity Distribution (Human vs AI)
+ 5. Lexical Diversity Distribution
+ 6. Burstiness STD Distribution
+
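Accuracy, precision, recall and F1 all derive from the four confusion-matrix counts. The sketch below computes them from label lists, treating `1` (AI) as the positive class. It is a stand-alone illustration and not the notebook's evaluation code, and the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Confusion counts and derived metrics, with 1 (AI) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


metrics = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
```

In this toy example one AI text is missed and one human text is flagged, so precision, recall, F1 and accuracy all come out at 2/3.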
+ ---
+
+ ### 7. **Model Persistence**
+
+ **Output Directory**: `./v3_model/`
+
+ **Saved Artifacts**:
+
+ | File | Description |
+ |------|-------------|
+ | `classifier.pkl` | Trained Logistic Regression model |
+ | `scaler.pkl` | StandardScaler for engineered features |
+ | `word_vectorizer.pkl` | Word-level TF-IDF vectorizer |
+ | `char_vectorizer.pkl` | Character-level TF-IDF vectorizer |
+ | `feature_names.json` | List of engineered feature names (16 features) |
+ | `metadata.json` | Model performance metrics & configuration |
+
+ **Metadata Contents**:
+ ```json
+ {
+   "selected_model": "hybrid_tfidf_logistic",
+   "cv_best_f1": 0.xxxx,
+   "num_engineered_features": 16,
+   "num_word_tfidf_features": 120000,
+   "num_char_tfidf_features": 80000,
+   "train_samples": 16000,
+   "test_samples": 4000,
+   "train_accuracy": 0.xxxx,
+   "train_f1": 0.xxxx,
+   "test_accuracy": 0.xxxx,
+   "test_f1": 0.xxxx
+ }
+ ```
+
+ ---
+
+ ### 8. **Inference Pipeline**
+
+ **Function**: `predict_v3(text: str) -> dict`
+
+ **Process**:
+ ```python
+ # 1. Normalize text (whitespace cleanup)
+ # 2. Extract engineered features (16 features)
+ # 3. Scale engineered features (StandardScaler)
+ # 4. Generate word TF-IDF vector
+ # 5. Generate char TF-IDF vector
+ # 6. Concatenate all features (sparse matrix)
+ # 7. Predict with Logistic Regression
+ # 8. Return prediction + probabilities + features
+ ```
+
+ **Output Schema**:
+ ```python
+ {
+     "text": str,                 # Truncated input (100 chars)
+     "word_count": int,           # Number of words
+     "predicted_label": int,      # 0=Human, 1=AI
+     "predicted_name": str,       # "human" or "ai"
+     "probability_human": float,  # P(Human) [0-1]
+     "probability_ai": float,     # P(AI) [0-1]
+     "features": dict             # All 16 engineered features
+ }
+ ```
+
+ **Batch Function**: `predict_v3_batch(texts: list[str]) -> list[dict]`
+
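The final step of the process, assembling the output dictionary, can be sketched without any model. The example assumes the classifier has already produced `p_ai`, the probability of the AI class, and that the label is assigned at a 0.5 threshold. The helper name `build_result` and the feature key used in the usage line are hypothetical.

```python
def build_result(text, p_ai, features, threshold=0.5):
    """Shape a prediction into the documented output schema."""
    # Assumption: class 1 (AI) is chosen when P(AI) reaches the threshold.
    label = int(p_ai >= threshold)
    return {
        "text": text[:100],                 # Truncated input (100 chars)
        "word_count": len(text.split()),
        "predicted_label": label,
        "predicted_name": "ai" if label == 1 else "human",
        "probability_human": 1.0 - p_ai,
        "probability_ai": p_ai,
        "features": dict(features),
    }


result = build_result("x " * 80, 0.8, {"perplexity": 42.0})
```

Keeping the threshold as a parameter also leaves room for the confidence-based flagging of uncertain predictions that the document lists as a future improvement.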
+ ---
+
+ ## 🔧 Configuration
+
+ ```python
+ from dataclasses import dataclass
+
+
+ @dataclass
+ class V3Config:
+     max_samples: int = 20000        # Max training samples
+     test_size: float = 0.2          # Test split ratio
+     output_dir: str = "./v3_model"  # Model save directory
+     random_state: int = 42          # Reproducibility seed
+     cv_folds: int = 3               # Cross-validation folds
+ ```
+
+ ---
+
+ ## 📦 Dependencies
+
+ **Core Libraries**:
+ - `scikit-learn` - ML algorithms, TF-IDF, metrics
+ - `pandas` - Data manipulation
+ - `numpy` - Numerical operations
+ - `scipy` - Sparse matrix operations
+
+ **Feature Extraction**:
+ - `transformers` - DistilGPT2 for perplexity
+ - `torch` - PyTorch backend for transformers
+ - `nltk` - Sentence tokenization (`punkt_tab`)
+ - `textstat` - Readability metrics
+
+ **Visualization**:
+ - `matplotlib` - Plotting
+ - `seaborn` - Statistical visualizations
+
+ ---
+
+ ## 🎯 Key Design Decisions
+
+ ### Why Not Transformers?
+ 1. **Speed**: No GPU required, fast inference
+ 2. **Interpretability**: Explainable features
+ 3. **Efficiency**: Smaller model size (~500MB vs 5GB+)
+ 4. **Robustness**: Works on any text length
+
+ ### Why Hybrid Features?
+ 1. **TF-IDF**: Captures content and vocabulary patterns
+ 2. **Perplexity**: Measures language model naturalness
+ 3. **Burstiness**: Detects sentence variation patterns
+ 4. **Stylometry**: Analyzes writing style signatures
+
+ ### Why Logistic Regression?
+ 1. **Scalability**: Handles 200k+ sparse features efficiently
+ 2. **Speed**: Fast training and inference
+ 3. **Interpretability**: Clear feature importance via coefficients
+ 4. **Robustness**: Well-suited for high-dimensional sparse data
+
+ ---
+
+ ## 📈 Expected Performance
+
+ **Typical Results** (20k samples):
+ - **Test Accuracy**: 85-95%
+ - **Test F1 Score**: 0.85-0.95
+ - **Inference Speed**: ~50-100 texts/second (CPU)
+ - **Model Size**: ~500 MB total
+
+ **Best For**:
+ - ✅ General English text classification
+ - ✅ Articles, essays, reviews
+ - ✅ Medium to long texts (50+ words)
+
+ **Limitations**:
+ - ⚠️ Very short texts (<10 words) may be unreliable
+ - ⚠️ Perplexity calculation is the bottleneck (uses GPU if available)
+ - ⚠️ Domain-specific jargon may affect performance
+ - ⚠️ Non-English text requires retraining
+
+ ---
+
+ ## 🔄 Model Loading Example
+
+ ```python
+ from pathlib import Path
+ import pickle
+ import json
+
+ model_dir = Path("./v3_model")
+
+
+ def load_pickle(name):
+     # Only unpickle files that come from a trusted source.
+     with open(model_dir / name, "rb") as f:
+         return pickle.load(f)
+
+
+ def load_json(name):
+     with open(model_dir / name, "r", encoding="utf-8") as f:
+         return json.load(f)
+
+
+ # Load all artifacts
+ classifier = load_pickle("classifier.pkl")
+ scaler = load_pickle("scaler.pkl")
+ word_vectorizer = load_pickle("word_vectorizer.pkl")
+ char_vectorizer = load_pickle("char_vectorizer.pkl")
+ feature_names = load_json("feature_names.json")
+ metadata = load_json("metadata.json")
+
+ # Use predict_v3() for inference; it must be defined or imported first
+ result = predict_v3("Your text here...")
+ ```
+
+ ---
+
+ ## 💡 Future Improvements
+
+ 1. **Model Versioning**: Add versioning system for model updates
+ 2. **Confidence Thresholds**: Flag uncertain predictions
+ 3. **Batch Optimization**: Vectorized batch inference
+ 4. **Model Wrapper Class**: Encapsulate all logic in `AIPredictorV3` class
+ 5. **Perplexity Caching**: Cache calculations for faster inference
+ 6. **Ensemble Methods**: Combine multiple models for better accuracy
+ 7. **Active Learning**: Iterative retraining with user feedback
+ 8. **Multi-language Support**: Train separate models per language
+
+ ---
+
+ ## 📝 Citation & Credits
+
+ **Framework**: scikit-learn + Hugging Face Transformers
+ **Perplexity Model**: DistilGPT2 (Hugging Face, distilled from OpenAI's GPT-2)
+ **Readability Metrics**: textstat library
+
+
+ **Architecture Type**: Hybrid Feature Engineering + Logistic Regression
notebook/ai_vs_human/mainv3.ipynb ADDED
The diff for this file is too large to render. See raw diff