srikanthgali commited on
Commit
611ef85
· verified ·
1 Parent(s): d96e8ae

Update README.md

Files changed (1)
  1. README.md +148 -48
README.md CHANGED
@@ -16,96 +16,196 @@ metrics:
  - f1
  - precision
  - recall
  ---
 
- # ParaDetect: AI vs Human Text Detection Model
 
  ## Model Description
 
- ParaDetect is a fine-tuned DeBERTa-v3-large model using LoRA (Low-Rank Adaptation) for detecting AI-generated vs human-written text. This model achieves high accuracy in distinguishing between human and AI-generated content.
 
  ## Model Details
 
- - **Base Model**: microsoft/deberta-v3-large
- - **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
  - **Task**: Binary text classification (Human: 0, AI: 1)
- - **Dataset**: AI Text Detection Pile (100K samples)
- - **Performance**: ~99% accuracy on validation set
 
  ## Usage
 
  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
  from peft import PeftModel
  import torch
 
- # Load base model and tokenizer
- base_model_name = "microsoft/deberta-v3-large"
- model = AutoModelForSequenceClassification.from_pretrained(base_model_name)
- tokenizer = AutoTokenizer.from_pretrained(base_model_name)
 
  # Load LoRA adapter
- model = PeftModel.from_pretrained(model, "srikanthgali/paradetect-deberta-v3-lora")
 
- # Inference
  def predict_text_origin(text):
-     inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
      with torch.no_grad():
          outputs = model(**inputs)
-     prediction = torch.nn.functional.softmax(outputs.logits, dim=-1)
-
-     human_prob = prediction[0][0].item()
-     ai_prob = prediction[0][1].item()
 
      return {
          "human_probability": human_prob,
-         "ai_probability": ai_prob,
-         "prediction": "AI" if ai_prob > human_prob else "Human"
      }
 
  # Example usage
  text = "Your text here..."
  result = predict_text_origin(text)
- print(result)
- ```
 
- ## Training Details
 
- - **Training Data**: 100,000 samples from AI Text Detection Pile
- - **Validation Split**: 20%
- - **Training Strategy**: LoRA fine-tuning with r=16, alpha=32
- - **Optimizer**: AdamW with learning rate 3e-4
- - **Epochs**: 3
- - **Batch Size**: 16
 
- ## Performance Metrics
 
- | Metric | Score |
- |--------|-------|
- | Accuracy | 99.2% |
- | Precision | 99.1% |
- | Recall | 99.3% |
- | F1-Score | 99.2% |
 
- ## Limitations
 
- - Optimized for English text
- - Performance may vary on very short texts (<50 words)
- - May not generalize to newer AI models not seen during training
 
- ## Citation
 
- If you use this model, please cite:
 
- ```bibtex
  @misc{paradetect2024,
-     title={ParaDetect: AI vs Human Text Detection},
      author={Srikanth Gali},
      year={2024},
-     url={https://github.com/srikanthgali/ParaDetect}
  }
- ```
-
- ## Repository
 
- For more details, training code, and demo: [ParaDetect GitHub](https://github.com/srikanthgali/ParaDetect)
 
  - f1
  - precision
  - recall
+ base_model: microsoft/deberta-v3-large
+ model_name: paradetect-deberta-v3-lora
  ---
 
+ # ParaDetect: DeBERTa-v3-Large Fine-tuned for AI vs Human Text Detection
 
  ## Model Description
 
+ ParaDetect is a fine-tuned DeBERTa-v3-large model using LoRA (Low-Rank Adaptation) for detecting AI-generated vs human-written text. This model achieves ~99% accuracy in distinguishing between human and AI-generated content, making it well suited to academic integrity, content verification, and research applications.
 
  ## Model Details
 
+ - **Base Model**: microsoft/deberta-v3-large (~435M parameters)
+ - **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
+ - **Trainable Parameters**: ~28M parameters (~6% of total)
+ - **Task**: Binary text classification (Human: 0, AI: 1)
+ - **Dataset**: AI Text Detection Pile (cleaned, 100K samples)
+ - **Training Framework**: Hugging Face Transformers + PEFT
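As a quick sanity check on the parameter counts in the list above (a back-of-envelope sketch; both figures are the approximate values quoted here, not exact counts):

```python
# Approximate figures from the model details above
base_params = 435_000_000   # DeBERTa-v3-large backbone
lora_params = 28_000_000    # trainable LoRA parameters

ratio = lora_params / base_params
print(f"Trainable fraction: {ratio:.1%}")  # ~6.4%, matching the "~6% of total" quoted above
```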
 
+ ## Performance Metrics
 
+ ### Test Set Results
+ - **Accuracy**: 99.31%
+ - **Precision (Weighted)**: 99.31%
+ - **Recall (Weighted)**: 99.31%
+ - **F1-Score (Weighted)**: 99.31%
 
+ ### Class-wise Performance
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | **Human (0)** | 99.72% | 98.89% | 99.30% | 7,500 |
+ | **AI (1)** | 98.91% | 99.72% | 99.31% | 7,500 |
 
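The weighted test-set metrics are support-weighted averages of the per-class scores in the table above; with equal supports they reduce to plain means, which a few lines of Python can verify (numbers taken from the table, rounded as shown there):

```python
# Per-class scores and supports from the class-wise table above
precision = {"Human": 0.9972, "AI": 0.9891}
recall    = {"Human": 0.9889, "AI": 0.9972}
support   = {"Human": 7500, "AI": 7500}

total = sum(support.values())
w_precision = sum(precision[c] * support[c] for c in support) / total
w_recall    = sum(recall[c] * support[c] for c in support) / total

print(f"weighted precision ~ {w_precision:.3f}")  # ~0.993
print(f"weighted recall    ~ {w_recall:.3f}")     # ~0.993
```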
+ ## Training Details
 
+ ### LoRA Configuration
+ - **Rank (r)**: 64
+ - **Alpha**: 128
+ - **Dropout**: 0.1
+ - **Target Modules**: query_proj, key_proj, value_proj, dense, output.dense
+ - **Bias**: all
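In PEFT, the configuration listed above would correspond to a `LoraConfig` roughly like the following. This is a hypothetical reconstruction, not the original training script; the exact target-module list and argument values may differ:

```python
from peft import LoraConfig, TaskType

# Hypothetical reconstruction of the LoRA setup described above
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # binary sequence classification
    r=64,                        # LoRA rank
    lora_alpha=128,              # scaling factor (alpha / r = 2)
    lora_dropout=0.1,
    bias="all",
    target_modules=["query_proj", "key_proj", "value_proj", "dense"],
)
```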
 
+ ### Training Parameters
+ - **Epochs**: 3 (with early stopping)
+ - **Batch Size**: 32 (train/eval)
+ - **Learning Rate**: 2e-4
+ - **Optimizer**: AdamW
+ - **Weight Decay**: 0.01
+ - **Warmup Ratio**: 0.1
+ - **Max Gradient Norm**: 1.0
 
+ ### Early Stopping
+ - **Patience**: 5 evaluation steps
+ - **Metric**: F1-score
+ - **Threshold**: 0.001
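Together, the training parameters and early-stopping settings above map onto Hugging Face `TrainingArguments` plus an `EarlyStoppingCallback`, roughly as sketched below. This is a hypothetical reconstruction (`output_dir` is a placeholder, and argument names vary slightly across transformers versions):

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Hypothetical reconstruction of the hyperparameters listed above
training_args = TrainingArguments(
    output_dir="paradetect-deberta-v3-lora",  # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_ratio=0.1,
    max_grad_norm=1.0,
    eval_strategy="steps",          # "evaluation_strategy" in older versions
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

# Stops training once eval F1 fails to improve by 0.001 for 5 evaluations
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=5,
    early_stopping_threshold=0.001,
)
```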
 
  ## Usage
 
+ ### Quick Start
 
  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
  from peft import PeftModel
  import torch
 
+ # Load tokenizer and base model
+ tokenizer = AutoTokenizer.from_pretrained("srikanthgali/paradetect-deberta-v3-lora")
+ base_model = AutoModelForSequenceClassification.from_pretrained(
+     "microsoft/deberta-v3-large",
+     num_labels=2
+ )
 
  # Load LoRA adapter
+ model = PeftModel.from_pretrained(base_model, "srikanthgali/paradetect-deberta-v3-lora")
 
+ # Prediction function
  def predict_text_origin(text):
+     inputs = tokenizer(
+         text,
+         return_tensors="pt",
+         truncation=True,
+         max_length=512,
+         padding=True
+     )
+
      with torch.no_grad():
          outputs = model(**inputs)
+     probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
+     prediction = torch.argmax(probabilities, dim=-1)
+
+     human_prob = probabilities[0][0].item()
+     ai_prob = probabilities[0][1].item()
 
      return {
+         "prediction": "AI" if prediction.item() == 1 else "Human",
+         "confidence": max(human_prob, ai_prob),
          "human_probability": human_prob,
+         "ai_probability": ai_prob
      }
 
  # Example usage
  text = "Your text here..."
  result = predict_text_origin(text)
+ print(f"Prediction: {result['prediction']} (Confidence: {result['confidence']:.1%})")
+ ```
 
+ ### Gradio Interface
 
+ ```python
+ import gradio as gr
 
+ # predict_text_origin returns a dict, while the interface declares two
+ # components; a small wrapper adapts the output (see full notebook for
+ # the complete implementation)
+ def gradio_predict(text):
+     result = predict_text_origin(text)
+     return result["prediction"], {
+         "Human": result["human_probability"],
+         "AI": result["ai_probability"],
+     }
 
+ # Create interface
+ demo = gr.Interface(
+     fn=gradio_predict,
+     inputs=gr.Textbox(lines=10, placeholder="Enter text to analyze..."),
+     outputs=[
+         gr.Textbox(label="Prediction"),
+         gr.Label(label="Confidence Scores")
+     ],
+     title="ParaDetect - AI vs Human Text Detection",
+     description="Detect whether text is written by humans or generated by AI"
+ )
 
+ demo.launch()
+ ```
 
+ ## Technical Specifications
 
+ - **Input**: Text (up to 512 tokens)
+ - **Output**: Binary classification with confidence scores
+ - **Inference Speed**: ~100 ms per text (hardware-dependent)
+ - **Memory Usage**: LoRA reduces trainable parameters by ~94% versus full fine-tuning
+ - **GPU Support**: CUDA-enabled for faster inference
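The confidence scores mentioned above are softmax probabilities over the model's two output logits. A minimal plain-Python illustration (the example logits are made up):

```python
import math

def softmax(logits):
    # numerically stable softmax over raw classifier scores
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (Human, AI); the two probabilities sum to 1
human_prob, ai_prob = softmax([-2.0, 3.5])
confidence = max(human_prob, ai_prob)
print(round(ai_prob, 4))  # 0.9959
```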
 
+ ## Training Dataset
 
+ - **Source**: artem9k/ai-text-detection-pile (cleaned)
+ - **Size**: 100,000 samples (subset for efficient training)
+ - **Split**: 70% train, 15% validation, 15% test
+ - **Balance**: Equal distribution of human vs AI text
+ - **Text Length**: 10-512 tokens, optimized for 50-500 words
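The split percentages above work out as follows (plain-Python arithmetic; with balanced classes, the 15,000-sample test split yields the 7,500-per-class support reported in the metrics table):

```python
# Split fractions from the dataset description above
total_samples = 100_000
splits = {"train": 0.70, "validation": 0.15, "test": 0.15}
sizes = {name: round(total_samples * frac) for name, frac in splits.items()}
print(sizes)  # {'train': 70000, 'validation': 15000, 'test': 15000}

# Balanced human/AI classes halve the test split per class
print(sizes["test"] // 2)  # 7500
```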
 
+ ## Limitations and Considerations
 
+ - **Language**: Optimized for English text
+ - **Text Length**: Best performance on 50-500 word texts
+ - **Domain**: May not generalize to very recent AI models
+ - **Context**: Performance may vary on highly technical or domain-specific content
+ - **Updates**: Regular retraining recommended as AI models evolve
 
+ ## Intended Use Cases
 
+ ### Primary Applications
+ - Academic integrity verification
+ - Content authenticity checking
+ - Research and analysis
+ - Educational demonstrations
+ - Journalism and fact-checking
 
+ ### Not Recommended For
+ - Legal evidence without human verification
+ - Automated content moderation decisions
+ - High-stakes authentication without additional validation
 
179
+ ## Ethical Considerations
180
 
181
+ - **Bias**: Model trained on specific dataset; may not represent all text types
182
+ - **Fairness**: Regular evaluation across different demographics recommended
183
+ - **Transparency**: Predictions are probabilistic, not definitive
184
+ - **Human Oversight**: Critical decisions should involve human judgment
185
+
186
+ ## Model Card Authors
187
+
188
+ - **Developer**: Srikanth Gali
189
+ - **Organization**: Independent Research
190
+ - **Contact**: [GitHub Repository](https://github.com/srikanthgali/ParaDetect)
191
+
192
+ ## Citation
193
  @misc{paradetect2024,
194
+ title={ParaDetect: AI vs Human Text Detection with DeBERTa-v3-Large},
195
  author={Srikanth Gali},
196
  year={2024},
197
+ url={https://github.com/srikanthgali/ParaDetect},
198
+ note={Fine-tuned using LoRA for efficient parameter adaptation}
199
  }
 
 
 
 
+ ## Additional Resources
 
+ - **📁 GitHub Repository**: ParaDetect
+ - **📊 Dataset**: AI Text Detection Pile - Cleaned
+ - **🎯 Demo**: Gradio Interface
+ - **📈 Training Notebook**: Fine-tuning Details
+ - **🔍 EDA**: Data Analysis
 
+ ## Version History
 
+ - **v1.0**: Initial release with DeBERTa-v3-Large + LoRA
+ - **Training Date**: 2025-10-06
+ - **Model Size**: ~28M trainable parameters
+ - **Performance**: 99.31% test accuracy