Spaces: Pujan-Dev/AI_API

Pujan-Dev committed on Jun 13, 2025

Commit 7ce4837

1 Parent(s): dabc1e2

Docs :added the readme.md

Files changed (2)

notebook/ai_vs_human/final_archi.md +426 -0
notebook/ai_vs_human/mainv3.ipynb +0 -0

notebook/ai_vs_human/final_archi.md ADDED

	@@ -0,0 +1,426 @@

+# AI vs Human Text Detector V3 - Final Architecture Summary
+**Model Version**: V3
+**Type**: Hybrid Feature Engineering + TF-IDF Classifier
+**Output Directory**: `./v3_model/`
+**Date**: March 2026
+---
+## 📊 Overview
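+The V3 model is a **feature-based ML classifier** that distinguishes between AI-generated and human-written text using a hybrid approach combining engineered linguistic features with TF-IDF text representations. The classifier itself is not a transformer; a small pretrained language model (`distilgpt2`) is used only to compute the perplexity feature.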
+```
+┌─────────────┐
+│  Input Text │
+└──────┬──────┘
+       │
+       ├──────────────────────────────────┐
+       │                                  │
+       ▼                                  ▼
+┌──────────────────┐            ┌─────────────────┐
+│  Text Features   │            │   Engineered    │
+│    (TF-IDF)      │            │    Features     │
+│                  │            │  (16 features)  │
+│ • Word (1-2gram) │            │                 │
+│ • Char (3-5gram) │            │ • Perplexity    │
+│                  │            │ • Burstiness    │
+│ Max 200k features│            │ • Stylometry    │
+└────────┬─────────┘            └─────────┬───────┘
+         │                                │
+         │                      ┌─────────▼───────┐
+         │                      │ StandardScaler  │
+         │                      └─────────┬───────┘
+         │                                │
+         │        ┌──────────────────┐    │
+         └───────►│ Sparse Matrix    │◄───┘
+                  │   Concat (hstack)│
+                  └───────┬──────────┘
+                          │
+                  ┌───────▼────────┐
+                  │   Logistic     │
+                  │  Regression    │
+                  │  (GridSearchCV)│
+                  └───────┬────────┘
+                          │
+                  ┌───────▼────────┐
+                  │  Prediction    │
+                  │ (Human vs AI)  │
+                  └────────────────┘
+```
+---
+## 🏗️ Architecture Components
+### 1. **Data Loading**
+**Function**: `load_dataset_recursive(max_samples=20000)`
+- **Source**: Recursively scans `./DATASET/` folder
+- **Formats Supported**: `.jsonl`, `.json`, `.csv`
+- **Schema Support**:
+  - Schema 1: `human_text` + `ai_text` columns
+  - Schema 2: `text` + `label`/`ai_gen` columns
+- **Labels**:
+  - `0` = Human text
+  - `1` = AI-generated text
+- **Preprocessing**: Text normalization (whitespace cleanup)
+- **Max Samples**: 20,000 (configurable)
+- **Random State**: 42
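The two schemas above can be mapped onto a single `(text, label)` form. The following is a minimal sketch of that mapping, not the notebook's actual `load_dataset_recursive`; the helper names `normalize_text` and `record_to_examples` are hypothetical, and the whitespace rule (collapse runs, trim ends) is an assumption about what "whitespace cleanup" means here.

```python
import re


def normalize_text(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def record_to_examples(record: dict) -> list[tuple[str, int]]:
    """Map one raw record to (text, label) pairs; 0 = human, 1 = AI."""
    examples = []
    if "human_text" in record and "ai_text" in record:
        # Schema 1: one row carries both a human and an AI sample.
        examples.append((normalize_text(record["human_text"]), 0))
        examples.append((normalize_text(record["ai_text"]), 1))
    elif "text" in record:
        # Schema 2: a single text plus a `label` or `ai_gen` flag.
        raw_label = record.get("label", record.get("ai_gen"))
        if raw_label is not None:
            examples.append((normalize_text(record["text"]), int(raw_label)))
    return examples
```

A Schema 1 row therefore yields two examples, while a Schema 2 row yields one; rows matching neither schema are skipped in this sketch.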
+---
+### 2. **Feature Extraction Pipeline**
+The model extracts **3 groups of engineered features** (16 in total):
+#### 2.1 **Perplexity Features** (1 feature)
+**Model**: `distilgpt2` (Hugging Face Transformers)
+```python
+class PerplexityCalculator:
+    """Settings summary (not the full implementation)."""
+    # Model: distilgpt2
+    # Max Length: 512 tokens
+    # Metric: exp(cross_entropy_loss)
+    # Cap: 10,000 (outlier protection)
+    # Fallback: 100.0 on error
+**What it measures**: Language model surprise/naturalness
+- Lower perplexity → More predictable (often AI)
+- Higher perplexity → Less predictable (often human)
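To make the metric concrete, here is a minimal sketch of the arithmetic only, assuming the per-token log-probabilities have already been obtained from the language model. The function name `perplexity_from_logprobs` is hypothetical; the cap and fallback values are taken from the summary above, but how the notebook applies them is an assumption.

```python
import math


def perplexity_from_logprobs(token_logprobs, cap=10_000.0, fallback=100.0):
    """Return exp(mean negative log-likelihood), capped at `cap`.

    `token_logprobs` holds the natural-log probability the language model
    assigned to each observed token. Bad input yields `fallback`.
    """
    try:
        if not token_logprobs:
            raise ValueError("no tokens to score")
        mean_nll = -sum(token_logprobs) / len(token_logprobs)
    except (TypeError, ValueError):
        return fallback
    # Compare in log space first so a huge loss cannot overflow math.exp().
    if mean_nll >= math.log(cap):
        return cap
    return math.exp(mean_nll)
```

If every token is assigned probability 0.25, the mean negative log-likelihood is ln 4 and the perplexity is 4: the model is as uncertain as a uniform choice among four tokens.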
+---
+#### 2.2 **Burstiness Features** (5 features)
+Measures sentence length variation patterns.
+**Features**:
+1. `burst_mean` - Average sentence length (words)
+2. `burst_std` - Standard deviation of sentence lengths
+3. `burst_max` - Maximum sentence length
+4. `burst_min` - Minimum sentence length
+5. `burst_range` - Range (max - min)
+**Theory**: Human writing has more variation in sentence length (high burstiness), while AI text tends to be more uniform.
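The five burstiness features can be sketched with the standard library alone. This is an illustration under assumptions: the regex sentence splitter stands in for the `nltk` tokenizer the notebook uses, the population standard deviation is assumed for `burst_std`, and the function name `burstiness_features` is hypothetical.

```python
import re
import statistics


def burstiness_features(text: str) -> dict:
    """Summarize the distribution of sentence lengths, measured in words."""
    # Naive split after ., ! or ? followed by whitespace (stand-in for nltk).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Fall back to a single zero-length sentence for empty input.
    lengths = [len(s.split()) for s in sentences] or [0]
    return {
        "burst_mean": statistics.fmean(lengths),
        "burst_std": statistics.pstdev(lengths),
        "burst_max": max(lengths),
        "burst_min": min(lengths),
        "burst_range": max(lengths) - min(lengths),
    }
```

A text whose sentences all have the same length gets `burst_std` and `burst_range` of 0, which is the "uniform" pattern the theory associates with AI text.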
+---
+#### 2.3 **Stylometry Features** (10 features)
+Writing style and readability metrics.
+**Features**:
+1. `num_words` - Total word count
+2. `num_chars` - Total character count
+3. `num_sentences` - Total sentence count
+4. `avg_word_len` - Average word length
+5. `avg_sent_len` - Average sentence length
+6. `lexical_diversity` - Unique words / total words
+7. `punct_ratio` - Punctuation density
+8. `caps_ratio` - Capitalization ratio
+9. `flesch_reading` - Flesch Reading Ease score
+10. `flesch_grade` - Flesch-Kincaid Grade Level
+**Library**: `textstat` + `nltk`
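A few of these features can be sketched without `textstat` or `nltk`. The exact definitions are not given above, so the ones used here are assumptions: words are whitespace-separated, `punct_ratio` and `caps_ratio` are taken relative to the total character count, and lexical diversity is computed on lowercased words with surrounding punctuation stripped. The readability scores are left out because they need syllable counting.

```python
import string


def stylometry_subset(text: str) -> dict:
    """Compute a few surface-level style features from raw text."""

    def ratio(count, total):
        # Guard against empty input.
        return count / total if total else 0.0

    words = text.split()
    # Strip surrounding punctuation and lowercase before comparing words.
    tokens = [t for t in (w.strip(string.punctuation).lower() for w in words) if t]
    return {
        "num_words": len(words),
        "num_chars": len(text),
        "avg_word_len": ratio(sum(len(t) for t in tokens), len(tokens)),
        "lexical_diversity": ratio(len(set(tokens)), len(tokens)),
        "punct_ratio": ratio(sum(c in string.punctuation for c in text), len(text)),
        "caps_ratio": ratio(sum(c.isupper() for c in text), len(text)),
    }
```

For "The cat saw the Cat." this gives 3 unique words out of 5, so a lexical diversity of 0.6.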
+---
+### 3. **TF-IDF Vectorization**
+#### 3.1 **Word-Level TF-IDF**
+```python
+TfidfVectorizer(
+    analyzer="word",
+    ngram_range=(1, 2),        # Unigrams + bigrams
+    min_df=3,                  # Minimum document frequency
+    max_df=0.98,               # Maximum document frequency
+    max_features=120000,       # Cap at 120k features
+    sublinear_tf=True          # Replace tf with 1 + log(tf)
+)
+```
+**Output**: Sparse matrix of word/phrase importance scores
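The effect of `sublinear_tf=True` is easier to see in a small worked example. The sketch below follows scikit-learn's default weighting for one document (sublinear term frequency, smoothed IDF, L2 normalization); the function name `sublinear_tfidf` and its inputs are hypothetical, and it skips tokenization and the `min_df`/`max_df`/`max_features` filtering.

```python
import math


def sublinear_tfidf(term_counts: dict, doc_freq: dict, n_docs: int) -> dict:
    """Weight one document's terms as (1 + ln tf) * idf, then L2-normalize."""
    weights = {}
    for term, tf in term_counts.items():
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
        weights[term] = (1.0 + math.log(tf)) * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights
```

With equal document frequencies, a term that occurs 10 times is weighted about 3.3 times as heavily as a term that occurs once, rather than 10 times, which dampens the influence of repeated words.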
+---
+#### 3.2 **Character-Level TF-IDF**
+```python
+TfidfVectorizer(
+    analyzer="char_wb",        # Character n-grams (word boundaries)
+    ngram_range=(3, 5),        # 3-char to 5-char sequences
+    min_df=3,
+    max_df=0.98,
+    max_features=80000,        # Cap at 80k features
+    sublinear_tf=True
+)
+```
+**Purpose**: Captures sub-word patterns and stylistic signatures
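To show what `char_wb` n-grams look like, here is a simplified sketch: each whitespace-separated word is padded with one space on each side, and n-grams are taken inside the padded word only, never across word boundaries. The function name `char_wb_ngrams` is hypothetical, and this approximates rather than reproduces scikit-learn's analyzer (it does no lowercasing, and it handles words shorter than `n` slightly differently).

```python
def char_wb_ngrams(text: str, min_n: int = 3, max_n: int = 5) -> list:
    """List character n-grams taken within space-padded words."""
    ngrams = []
    for word in text.split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            if n > len(padded):
                break  # the padded word is too short for longer n-grams
            ngrams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return ngrams
```

For the word `cat` with `n = 3`, the n-grams are `" ca"`, `"cat"`, and `"at "`; the padded forms let the vectorizer tell word-initial and word-final patterns apart from word-internal ones.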
+---
+### 4. **Feature Preprocessing**
+**Engineered Features**:
+- Scaled using `StandardScaler` (z-score normalization)
+- Converted to sparse CSR matrix for memory efficiency
+**Hybrid Feature Vector**:
+```python
+hybrid_vec = hstack([word_tfidf, char_tfidf, engineered_features_scaled])
+```
+**Final Feature Dimensionality**:
+- Word TF-IDF: Up to 120,000 features
+- Char TF-IDF: Up to 80,000 features
+- Engineered: 16 features
+- **Total**: Up to 200,016 features (sparse)
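The scaling and concatenation steps can be illustrated on plain lists. This is a dense, standard-library sketch of what `StandardScaler` and `hstack` do conceptually, not the sparse implementation used in the notebook; `zscore_columns` and `concat_row` are hypothetical names.

```python
import statistics


def zscore_columns(rows):
    """Standardize each column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # Population std; constant columns get a scale of 1 so they map to 0.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def concat_row(word_tfidf, char_tfidf, engineered_scaled):
    """Join the three feature blocks of one sample, in the order used above."""
    return list(word_tfidf) + list(char_tfidf) + list(engineered_scaled)
```

Only the engineered block is standardized; the TF-IDF blocks are concatenated as they are, so the maximum width of one row is 120,000 + 80,000 + 16 = 200,016.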
+---
+### 5. **Model Training**
+#### 5.1 **Train-Test Split**
+```python
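+from sklearn.model_selection import train_test_split
+X_train, X_test, y_train, y_test = train_test_split(
+    texts, labels,        # Hypothetical names for the loaded data
+    test_size=0.2,        # 80% train (16,000) / 20% test (4,000)
+    stratify=labels,      # Preserve class proportions in both splits
+    random_state=42,
+)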
+```
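What "stratified" means for the split above can be sketched in a few lines: each class is shuffled and cut separately, so both splits keep the class proportions of the full dataset. This is an illustration only; `stratified_split` is a hypothetical helper, and its shuffling will not reproduce the exact indices scikit-learn selects for the same seed.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices), split separately per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)
```

With 20,000 samples and `test_size=0.2`, this yields 16,000 training and 4,000 test indices.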
+#### 5.2 **Classifier**
+**Algorithm**: Logistic Regression
+**Hyperparameter Tuning**: GridSearchCV with 3-fold stratified cross-validation
+**Search Space**:
+```python
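+# Parameters searched by GridSearchCV (4 x 2 = 8 combinations)
+param_grid = {
+    "C": [0.5, 1.0, 2.0, 4.0],           # Inverse regularization strength
+    "class_weight": [None, "balanced"],   # Class balancing
+}
+# Fixed settings (not searched)
+fixed_settings = {
+    "solver": "saga",                     # SAGA, a variant of Stochastic Average Gradient
+    "penalty": "l2",                      # L2 regularization
+    "max_iter": 2500,
+    "n_jobs": -1,                         # Parallel processing
+}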
+```
+**Scoring Metric**: F1 score (harmonic mean of precision and recall)
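The cost of the search follows directly from the grid size. The sketch below enumerates the candidates the same way a grid search does, using only the two searched parameters and the 3-fold setting stated above; the variable names are illustrative.

```python
from itertools import product

param_grid = {
    "C": [0.5, 1.0, 2.0, 4.0],
    "class_weight": [None, "balanced"],
}
cv_folds = 3

# Every combination of the searched values is one candidate model.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
# Each candidate is fitted once per cross-validation fold.
n_cv_fits = len(candidates) * cv_folds
```

That is 8 candidates and 24 cross-validation fits; with `GridSearchCV`'s default `refit=True`, one more fit of the best candidate on the full training set follows.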
+---
+### 6. **Model Evaluation**
+**Metrics Tracked**:
+- **Accuracy**: Overall correct predictions
+- **F1 Score**: Harmonic mean of precision/recall
+- **ROC-AUC**: Area under ROC curve
+- **Confusion Matrix**: True/false positives/negatives
+- **Classification Report**: Per-class precision/recall/F1
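The relationship between these metrics is clearest when they are all derived from the confusion matrix. The following is a minimal sketch for the binary case with AI (`1`) as the positive class; `binary_metrics` is a hypothetical helper, not the notebook's evaluation code.

```python
def binary_metrics(y_true, y_pred):
    """Derive accuracy, precision, recall and F1 from confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Rows are true labels, columns are predictions: [[TN, FP], [FN, TP]]
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }
```

ROC-AUC is not included because it needs the predicted probabilities rather than the hard labels.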
+**Visualizations**:
+1. Confusion Matrix
+2. ROC Curve
+3. Feature Importance (top engineered features)
+4. Perplexity Distribution (Human vs AI)
+5. Lexical Diversity Distribution
+6. Burstiness STD Distribution
+---
+### 7. **Model Persistence**
+**Output Directory**: `./v3_model/`
+**Saved Artifacts**:
+| File | Description |
+|------|-------------|
+| `classifier.pkl` | Trained Logistic Regression model |
+| `scaler.pkl` | StandardScaler for engineered features |
+| `word_vectorizer.pkl` | Word-level TF-IDF vectorizer |
+| `char_vectorizer.pkl` | Character-level TF-IDF vectorizer |
+| `feature_names.json` | List of engineered feature names (16 features) |
+| `metadata.json` | Model performance metrics & configuration |
+**Metadata Contents**:
+```json
+{
+  "selected_model": "hybrid_tfidf_logistic",
+  "cv_best_f1": 0.xxxx,
+  "num_engineered_features": 16,
+  "num_word_tfidf_features": 120000,
+  "num_char_tfidf_features": 80000,
+  "train_samples": 16000,
+  "test_samples": 4000,
+  "train_accuracy": 0.xxxx,
+  "train_f1": 0.xxxx,
+  "test_accuracy": 0.xxxx,
+  "test_f1": 0.xxxx
+}
+```
+---
+### 8. **Inference Pipeline**
+**Function**: `predict_v3(text: str) -> dict`
+**Process**:
+```text
+1. Normalize text (whitespace cleanup)
+2. Extract engineered features (16 features)
+3. Scale engineered features (StandardScaler)
+4. Generate word TF-IDF vector
+5. Generate char TF-IDF vector
+6. Concatenate all features (sparse matrix)
+7. Predict with Logistic Regression
+8. Return prediction + probabilities + features
+```
+**Output Schema**:
+```python
+{
+    "text": str,                    # Truncated input (100 chars)
+    "word_count": int,              # Number of words
+    "predicted_label": int,         # 0=Human, 1=AI
+    "predicted_name": str,          # "human" or "ai"
+    "probability_human": float,     # P(Human) [0-1]
+    "probability_ai": float,        # P(AI) [0-1]
+    "features": dict                # All 16 engineered features
+}
+```
+**Batch Function**: `predict_v3_batch(texts: list[str]) -> list[dict]`
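The output schema above can be assembled from a single probability. This sketch covers only that last packaging step and assumes a 0.5 decision threshold and whitespace-based word counting, neither of which is stated in the source; `format_prediction` is a hypothetical helper, not part of `predict_v3`.

```python
def format_prediction(text: str, prob_ai: float, features: dict) -> dict:
    """Package one prediction in the documented output schema."""
    label = int(prob_ai >= 0.5)  # assumed threshold: 0 = human, 1 = AI
    return {
        "text": text[:100],                  # truncated input
        "word_count": len(text.split()),
        "predicted_label": label,
        "predicted_name": "ai" if label == 1 else "human",
        "probability_human": 1.0 - prob_ai,
        "probability_ai": prob_ai,
        "features": features,                # engineered features by name
    }
```

A batch version under the same assumptions would simply apply this function to each text and collect the results in a list.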
+---
+## 🔧 Configuration
+```python
+from dataclasses import dataclass
+@dataclass
+class V3Config:
+    max_samples: int = 20000       # Max training samples
+    test_size: float = 0.2         # Test split ratio
+    output_dir: str = "./v3_model" # Model save directory
+    random_state: int = 42         # Reproducibility seed
+    cv_folds: int = 3              # Cross-validation folds
+```
+---
+## 📦 Dependencies
+**Core Libraries**:
+- `scikit-learn` - ML algorithms, TF-IDF, metrics
+- `pandas` - Data manipulation
+- `numpy` - Numerical operations
+- `scipy` - Sparse matrix operations
+**Feature Extraction**:
+- `transformers` - DistilGPT2 for perplexity
+- `torch` - PyTorch backend for transformers
+- `nltk` - Sentence tokenization (`punkt_tab`)
+- `textstat` - Readability metrics
+**Visualization**:
+- `matplotlib` - Plotting
+- `seaborn` - Statistical visualizations
+---
+## 🎯 Key Design Decisions
+### Why Not a Transformer Classifier?
+1. **Speed**: No GPU required, fast inference
+2. **Interpretability**: Explainable features
+3. **Efficiency**: Smaller model size (~500MB vs 5GB+)
+4. **Robustness**: Accepts text of any length (only the perplexity feature truncates input to 512 tokens)
+### Why Hybrid Features?
+1. **TF-IDF**: Captures content and vocabulary patterns
+2. **Perplexity**: Measures language model naturalness
+3. **Burstiness**: Detects sentence variation patterns
+4. **Stylometry**: Analyzes writing style signatures
+### Why Logistic Regression?
+1. **Scalability**: Handles 200k+ sparse features efficiently
+2. **Speed**: Fast training and inference
+3. **Interpretability**: Clear feature importance via coefficients
+4. **Robustness**: Well-suited for high-dimensional sparse data
+---
+## 📈 Expected Performance
+**Typical Results** (20k samples):
+- **Test Accuracy**: 85-95%
+- **Test F1 Score**: 0.85-0.95
+- **Inference Speed**: ~50-100 texts/second (CPU)
+- **Model Size**: ~500 MB total
+**Best For**:
+- ✅ General English text classification
+- ✅ Articles, essays, reviews
+- ✅ Medium to long texts (50+ words)
+**Limitations**:
+- ⚠️ Very short texts (<10 words) may be unreliable
+- ⚠️ Perplexity calculation is the bottleneck (uses GPU if available)
+- ⚠️ Domain-specific jargon may affect performance
+- ⚠️ Non-English text requires retraining
+---
+## 🔄 Model Loading Example
+```python
+from pathlib import Path
+import pickle
+import json
+model_dir = Path("./v3_model")
+def load_pickle(name):
+    # Only unpickle files you trust: pickle can execute arbitrary code.
+    with open(model_dir / name, "rb") as f:
+        return pickle.load(f)
+# Load all artifacts
+classifier = load_pickle("classifier.pkl")
+scaler = load_pickle("scaler.pkl")
+word_vectorizer = load_pickle("word_vectorizer.pkl")
+char_vectorizer = load_pickle("char_vectorizer.pkl")
+feature_names = json.loads((model_dir / "feature_names.json").read_text(encoding="utf-8"))
+metadata = json.loads((model_dir / "metadata.json").read_text(encoding="utf-8"))
+# predict_v3() must already be defined (see the Inference Pipeline section)
+result = predict_v3("Your text here...")
+```
+---
+## 💡 Future Improvements
+1. **Model Versioning**: Add versioning system for model updates
+2. **Confidence Thresholds**: Flag uncertain predictions
+3. **Batch Optimization**: Vectorized batch inference
+4. **Model Wrapper Class**: Encapsulate all logic in `AIPredictorV3` class
+5. **Perplexity Caching**: Cache calculations for faster inference
+6. **Ensemble Methods**: Combine multiple models for better accuracy
+7. **Active Learning**: Iterative retraining with user feedback
+8. **Multi-language Support**: Train separate models per language
+---
+## 📝 Citation & Credits
+**Framework**: scikit-learn + HuggingFace Transformers
+**Perplexity Model**: DistilGPT2 (Hugging Face; distilled from OpenAI's GPT-2)
+**Readability Metrics**: textstat library
+**Architecture Type**: Hybrid Feature Engineering + Logistic Regression

notebook/ai_vs_human/mainv3.ipynb ADDED

The diff for this file is too large to render.