---
license: mit
language:
- en
metrics:
- accuracy
- f1
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
tags:
- text-classification
- ai-detection
- academic-text
- ai-generated-text-detection
model-index:
- name: bert-ai-text-detector
  results:
  - task:
      type: text-classification
      name: AI-Generated Text Detection
    dataset:
      name: Custom Academic Text Dataset
      type: custom
    metrics:
    - type: accuracy
      value: 0.9957
    - type: f1
      value: 0.9958
    - type: precision
      value: 0.9923
    - type: recall
      value: 0.9994
---

# BERT-based AI-Generated Academic Text Detector

A high-accuracy BERT model for detecting AI-generated academic text with **99.57% accuracy** on paragraph-level samples.

## Online Demo

🌐 **Try the model online**: [https://followsci.com/ai-detection](https://followsci.com/ai-detection)

Free web interface with real-time detection, no installation or API key required.

## Model Details

### Model Description

- **Model Type**: BERT-base-uncased fine-tuned for binary text classification
- **Architecture**: BERT-base-uncased (110M parameters)
- **Task**: Binary classification (human-written vs. AI-generated text)
- **Input**: Academic text paragraphs (up to 512 tokens)
- **Output**: Binary label (0 = human-written, 1 = AI-generated) with confidence scores

### Training Information

- **Training Samples**: 1,487,400 paragraph-level samples
- **Validation Samples**: 185,930 paragraph-level samples
- **Test Samples**: 185,930 paragraph-level samples
- **Total Dataset**: 1,859,260 paragraphs
- **Training Data**:
  - Human-written: academic papers from arXiv
  - AI-generated: text generated by various large language models (GPT, Claude, etc.)

## Performance

### Test Set Results

| Metric | Value |
|--------|-------|
| **Accuracy** | **99.57%** |
| **F1-Score** | **99.58%** |
| Precision | 99.23% |
| Recall | 99.94% |
| False Positive Rate | 0.82% |
| False Negative Rate | 0.06% |

### Confusion Matrix (Test Set)

| | Predicted: Human | Predicted: AI |
|---|---|---|
| **Actual: Human** | 89,740 (TN) | 740 (FP) |
| **Actual: AI** | 60 (FN) | 95,390 (TP) |

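The headline metrics in the table above follow directly from these four counts; a quick sanity check in Python:

```python
# Reproduce the reported metrics from the confusion-matrix counts above.
tn, fp, fn, tp = 89_740, 740, 60, 95_390

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.9957
precision = tp / (tp + fp)                          # 0.9923
recall = tp / (tp + fn)                             # 0.9994
f1 = 2 * precision * recall / (precision + recall)  # 0.9958
fpr = fp / (fp + tn)                                # 0.0082
fnr = fn / (fn + tp)                                # 0.0006

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} fpr={fpr:.4f} fnr={fnr:.4f}")
```
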
**Inference Speed:** ~20,900 samples/second on an RTX 3090 (batch size 64)

## Usage

### Quick Start

```python
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load model and tokenizer
model_name = "followsci/bert-ai-text-detector"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)
model.eval()

# Detect AI text
text = "Your academic paragraph here..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
    ai_prob = probs[0][1].item() * 100
    human_prob = probs[0][0].item() * 100

print(f"AI-generated probability: {ai_prob:.1f}%")
print(f"Human-written probability: {human_prob:.1f}%")

if ai_prob > 50:
    print("Prediction: AI-generated")
else:
    print("Prediction: Human-written")
```

### Batch Processing

```python
texts = [
    "First paragraph...",
    "Second paragraph...",
    # ... more texts
]

inputs = tokenizer(
    texts,
    return_tensors="pt",
    truncation=True,
    max_length=512,
    padding=True
)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

for i, prob in enumerate(probs):
    ai_prob = prob[1].item() * 100
    print(f"Text {i+1}: AI probability = {ai_prob:.1f}%")
```

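For larger workloads, the same logic can run on a GPU in fixed-size chunks (the throughput figure above was measured at batch size 64). A minimal sketch reusing `model` and `tokenizer` from Quick Start; the chunking helper and device handling here are illustrative, not part of the original recipe:

```python
import torch

# Illustrative sketch: move the model to GPU if available and
# score texts in fixed-size chunks to bound memory use.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def ai_probabilities(texts, batch_size=64):
    """Return the AI-generated probability for each text."""
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            truncation=True,
            max_length=512,
            padding=True,
        ).to(device)
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = torch.nn.functional.softmax(logits, dim=-1)
        results.extend(probs[:, 1].tolist())
    return results
```
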
### Using with Transformers Pipeline

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="followsci/bert-ai-text-detector",
    tokenizer="followsci/bert-ai-text-detector"
)

result = classifier("Your text here...")
print(result)
```

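By default the pipeline returns only the top label. To get scores for both classes and guard against over-length inputs, recent `transformers` versions accept the following call-time options (availability depends on your installed version):

```python
results = classifier(
    "Your text here...",
    top_k=None,        # return scores for both classes, not just the top label
    truncation=True,   # clip inputs beyond the model's 512-token limit
    max_length=512,
)
print(results)
# If the checkpoint defines no id2label mapping, labels appear as
# LABEL_0 (human-written) and LABEL_1 (AI-generated), per the mapping above.
```
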
## Training Details

### Training Configuration

- **Base Model**: `bert-base-uncased`
- **Batch Size**: 64
- **Learning Rate**: 5e-5 (with linear warmup)
- **Warmup Steps**: 5,000
- **Max Sequence Length**: 512
- **Optimizer**: AdamW
- **Epochs**: 3
- **Training Time**: ~11 hours (on an RTX 3090)

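The training script itself is not part of this repository. As a rough illustration, the configuration above maps onto Hugging Face `TrainingArguments` as sketched below; the output directory and optimizer name are assumptions:

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the hyperparameters listed above.
# The max sequence length (512) is applied at tokenization time,
# not through TrainingArguments.
args = TrainingArguments(
    output_dir="bert-ai-text-detector",  # assumed name
    per_device_train_batch_size=64,
    learning_rate=5e-5,
    lr_scheduler_type="linear",          # linear decay after warmup
    warmup_steps=5_000,
    num_train_epochs=3,
    optim="adamw_torch",
)
# Pass `args` to transformers.Trainer together with the (not included)
# tokenized train/validation datasets to reproduce a comparable run.
```
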
### Dataset Distribution

| Split | Total Samples | Human (Label 0) | AI (Label 1) |
|-------|--------------|-----------------|--------------|
| Train | 1,487,400 | 723,780 (48.7%) | 763,620 (51.3%) |
| Validation | 185,930 | 90,470 (48.7%) | 95,460 (51.3%) |
| Test | 185,930 | 90,480 (48.7%) | 95,450 (51.3%) |

## Limitations

1. **Domain Specificity**: The model is trained primarily on academic text. Performance may degrade on:
   - Casual text or social media content
   - Technical documentation
   - Creative writing

2. **Binary Classification**: The model only distinguishes between "human" and "AI" text, without:
   - Identifying which AI model generated the text
   - Providing confidence intervals
   - Detecting partially AI-assisted text

3. **Paragraph-Level Detection**: The model is optimized for paragraph-level samples:
   - Performance on sentence-level or full-document inputs may vary
   - Best results are achieved with structured academic paragraphs

4. **False Positives**: A false positive rate of roughly 0.82% means some human-written text will be flagged as AI-generated (one mitigation is sketched below).

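One way to mitigate the false-positive risk from point 4 is to flag text only above a stricter probability threshold, trading some recall for a lower false positive rate. A minimal sketch reusing `model` and `tokenizer` from Quick Start; the 0.9 cutoff is an illustrative assumption, not a validated operating point:

```python
import torch

# Flag text as AI-generated only above a stricter threshold.
# The 0.9 value is illustrative; calibrate on held-out data for your domain.
THRESHOLD = 0.9

def flag_as_ai(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        probs = torch.nn.functional.softmax(model(**inputs).logits, dim=-1)
    return probs[0, 1].item() >= THRESHOLD
```
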
## Ethical Considerations

- **Use Case**: This model is intended as a tool for academic integrity and research purposes
- **Bias**: The model may reflect biases present in the training data
- **Misuse**: The model should not be used as the sole criterion for academic misconduct decisions
- **Transparency**: Results should be interpreted with context and domain expertise

## License

This model is licensed under the MIT License.

## Contact

- **Email**: raffoduanedonnenfeld@gmail.com

---

<p align="center">
Made with ❤️ for Academic Integrity
</p>