DistilBERT — AI-Generated Text Detector

Fine-tuned distilbert-base-uncased for binary classification of human vs. AI-generated text on the HC3 corpus.

Headline metrics on the held-out test set (11,238 samples, 5 domains):

Metric	Value
F1-score	0.9847
Accuracy	0.9891
ROC-AUC	0.9998
Precision	0.9706
Recall	0.9992

The improvement over the strongest classical baseline (Linear SVM + TF-IDF) is statistically significant by McNemar's test: χ² = 144.0, p ≈ 3.55 × 10⁻³³.

⚠️ Important: Aggregate metrics hide a cross-domain fairness gap. See the Limitations section before deploying.

📄 Full technical report (PDF) · 🎓 Presentation slides (PDF) · 💻 GitHub repository (code & reproduction)

Intended use

Primary intended use. Research, education, and evaluation experiments studying AI-generated text detection. Useful as a baseline transformer for comparison to newer detection approaches, or as a starting point for further fine-tuning on more recent generators (GPT-4, Claude, Gemini, open-source LLMs).

Out-of-scope use cases.

❌ Academic discipline decisions (plagiarism cases, expulsion, grade penalties)
❌ Hiring decisions (filtering applicant essays or cover letters)
❌ Content moderation in production without domain-specific calibration and human review
❌ Any high-stakes decision without explicit fairness audit on the deployment population
❌ Non-English text — the model has not been evaluated on any non-English content

The reasons these are out of scope are documented in the Limitations section.

How to use

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model_id  = "Elia43/distilbert-ai-text-detector"
model     = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "Your text sample here."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    logits = model(**inputs).logits
    probs  = torch.softmax(logits, dim=-1)[0]

# Label mapping: 0 = human-written, 1 = AI-generated
print(f"P(human) = {probs[0]:.4f}")
print(f"P(AI)    = {probs[1]:.4f}")

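Always report the probability, not only the binary label: a confidence score is more honest than a hard prediction. Raw softmax outputs are not guaranteed to be calibrated, so calibrate on data from your deployment distribution before reading them as confidence.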

Training data

HC3 (Human ChatGPT Comparison Corpus) — Guo et al., 2023.

~75,000 samples after quality filtering (removing texts <20 or >1000 words, deduplication)
5 domains: finance, medicine, open_qa, reddit_eli5, wiki_csai
AI source: ChatGPT (gpt-3.5-turbo)
Stratified 70/15/15 split on the composite key label × domain (random_state=42)
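
The filtering and splitting steps above can be sketched without dependencies as follows. This is an illustration under stated assumptions, not the code used for this model: the record layout (`text`, `label`, `domain` keys), the helper names, whitespace-based word counting, and exact-match deduplication are all assumptions. The `random_state=42` in the list suggests scikit-learn's `train_test_split` was used in practice, whose shuffling would differ from the one below.

```python
import random
from collections import defaultdict

def quality_filter(samples, min_words=20, max_words=1000):
    """Drop texts with <min_words or >max_words words, then exact duplicates."""
    seen, kept = set(), []
    for s in samples:
        n_words = len(s["text"].split())  # assumption: whitespace tokenization
        if n_words < min_words or n_words > max_words:
            continue
        if s["text"] in seen:  # assumption: exact-match deduplication
            continue
        seen.add(s["text"])
        kept.append(s)
    return kept

def stratified_split(samples, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split into train/val/test, stratifying on the (label, domain) pair."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for s in samples:
        strata[(s["label"], s["domain"])].append(s)
    train, val, test = [], [], []
    for key in sorted(strata):  # sorted keys keep the split reproducible
        group = strata[key]
        rng.shuffle(group)
        n_train = round(fractions[0] * len(group))
        n_val = round(fractions[1] * len(group))
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test

# Toy data (made up for illustration): 2 labels x 2 domains x 100 unique texts.
toy_samples = [
    {"text": f"sample {label} {domain} {i} " + "word " * 25,
     "label": label, "domain": domain}
    for label in (0, 1) for domain in ("finance", "medicine") for i in range(100)
]
train_set, val_set, test_set = stratified_split(quality_filter(toy_samples))
print(len(train_set), len(val_set), len(test_set))  # 280 60 60
```

Stratifying on the composite key keeps each label and domain combination at the same proportion in all three splits, which is what makes the per-domain test metrics comparable.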

The training subset used for fine-tuning was a 15,000-sample stratified slice of the training set — DistilBERT saturates quickly on this task, and using the full set yields no measurable improvement.

Training procedure

Parameter	Value
Base model	`distilbert-base-uncased`
Task head	Sequence classification, 2 labels
Optimizer	AdamW
Learning rate	2e-5
LR schedule	Linear warmup (10% of steps), linear decay
Weight decay	0.01
Batch size	16
Max sequence length	256 tokens
Epochs	3
Gradient clipping	1.0 (max grad norm)
Seed	42
Mixed precision	FP16 (Colab T4 GPU)
Best-checkpoint criterion	Validation F1

Training took ~30 minutes on a single Google Colab T4 GPU.
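
To make the learning-rate schedule in the table concrete, here is a small sketch of linear warmup followed by linear decay. The function name is hypothetical, and the step counts assume the 15,000-sample training subset, batch size 16, no gradient accumulation, and a kept final partial batch; the actual training run may have counted steps differently.

```python
import math

def linear_warmup_decay_lr(step, total_steps, peak_lr=2e-5, warmup_frac=0.10):
    """Learning rate at a given optimizer step: linear ramp up, then linear decay to 0."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Assumed step counts: ceil(15000 / 16) steps per epoch, 3 epochs.
steps_per_epoch = math.ceil(15_000 / 16)
total_steps = 3 * steps_per_epoch

for step in (0, 140, 281, 1500, total_steps):
    print(step, f"{linear_warmup_decay_lr(step, total_steps):.2e}")
```

Under these assumptions the rate climbs from 0 to the 2e-5 peak over the first 281 of 2,814 steps, then falls linearly back to 0 by the end of the third epoch.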

Evaluation

Held-out test set: 11,238 samples, stratified across 5 domains.

Overall metrics

Metric	Value
Accuracy	0.9891
Precision	0.9706
Recall	0.9992
F1	0.9847
ROC-AUC	0.9998
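
As a quick consistency check, the reported F1 and the overall error rate can be recomputed from the other values in the table. Only the numbers above are used; nothing here is new data.

```python
precision, recall, accuracy = 0.9706, 0.9992, 0.9891

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
# Overall error rate is the complement of accuracy.
error_rate = 1 - accuracy

print(f"F1 = {f1:.4f}")                   # F1 = 0.9847
print(f"error rate = {error_rate:.2%}")   # error rate = 1.09%
```

Because precision is lower than recall, the model's mistakes are predominantly false positives, that is, human-written text flagged as AI-generated. This is the error type that matters most for the out-of-scope uses listed earlier.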

Per-domain F1

Domain	F1
medicine	1.000
finance	0.990
open_qa	0.988
reddit_eli5	0.985
wiki_csai (technical writing)	0.916

Error rates tell the same story as the F1 table: the 9.54% error rate on wiki_csai is roughly 9× the overall test-set error rate of 1.09%, compared with 0% on medical text.
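
A per-domain breakdown like this one can be computed with a few lines of Python. The sketch below is illustrative only: the function name and the `(group, y_true, y_pred)` record layout are assumptions, and the toy predictions are made up. The final ratio uses the error rates reported in this card.

```python
from collections import defaultdict

def error_rate_by_group(records):
    """records: iterable of (group, y_true, y_pred) -> {group: error rate}."""
    totals, errors = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        errors[group] += int(y_true != y_pred)
    return {group: errors[group] / totals[group] for group in totals}

# Made-up toy predictions, for illustration only.
toy_records = [
    ("medicine", 1, 1), ("medicine", 0, 0),
    ("wiki_csai", 0, 1), ("wiki_csai", 1, 1),
]
rates = error_rate_by_group(toy_records)
print(rates)

# Ratio of the reported wiki_csai error rate to the reported overall error rate.
ratio = 9.54 / 1.09
print(f"{ratio:.2f}x")
```

The same function works for any grouping key, for example length buckets or writer groups, which makes it a reasonable starting point for the fairness audits recommended later in this card.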

Comparison to classical baselines

Model	F1
Multinomial Naive Bayes	0.8731
Bi-LSTM (frozen GloVe-100d)	0.9338
Logistic Regression (TF-IDF)	0.9523
Linear SVM (TF-IDF)	0.9531
DistilBERT (this model)	0.9847

McNemar's test (DistilBERT vs. Linear SVM): χ² = 144.0, p ≈ 3.55 × 10⁻³³.
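
For readers who want to check the p-value, the following sketch converts a McNemar χ² statistic (1 degree of freedom) into a p-value using only the standard library. The function names are hypothetical, the discordant counts in the example are made up, and whether the reported statistic used a continuity correction is not stated in this card.

```python
import math

def mcnemar_chi2(b, c, continuity=False):
    """McNemar statistic from discordant counts.

    b: samples model A got right and model B got wrong
    c: samples model A got wrong and model B got right
    """
    diff = abs(b - c) - (1 if continuity else 0)
    return diff ** 2 / (b + c)

def chi2_sf_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical discordant counts, not taken from the evaluation.
print(mcnemar_chi2(30, 10))            # 10.0

# p-value for the reported statistic.
p_value = chi2_sf_1df(144.0)
print(f"{p_value:.2e}")                # about 3.55e-33
```

The identity used in `chi2_sf_1df` follows from a χ² variable with 1 degree of freedom being the square of a standard normal variable.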

Limitations and bias

This model has documented limitations. Read this section before using it for anything that affects people.

Domain bias

Error rate ranges from 0% on medical text to 9.54% on technical Wikipedia-style content. This pattern is consistent across every model architecture tested (classical, recurrent, transformer), suggesting the difficulty is intrinsic to the domain — not solvable by switching models. Technical writers face disproportionate misclassification risk.

Length bias

Short texts (under 50 words) have systematically higher error rates than longer documents. This matters because student short answers, social media posts, and email replies are short by nature.

Non-native English speaker bias (inherited from the literature)

Liang et al. (2023) showed that GPT detectors trained on similar data misclassify >50% of TOEFL essays by non-native English speakers as AI-generated, vs. <5% for native speakers. The mechanism — lower lexical diversity, more formal phrasing, more uniform syntax — exactly matches the features this model relies on. We have strong reason to believe this model exhibits the same bias. It was not directly tested in our evaluation, but the mechanism is structurally identical.

Generator coverage

The training data contains only ChatGPT (gpt-3.5-turbo) outputs. Performance on text from GPT-4, Claude, Gemini, Llama, Mistral, or other generators is untested and likely degraded, since detectors trained on a single generator often generalize poorly to others.

Other limitations

English only
5 domains is a narrow slice of real-world writing — news, fiction, academic papers, code, and informal chat are not represented
No adversarial evaluation against paraphrasing attacks or human editing
Static evaluation — performance degrades as language models evolve. Models like this need re-evaluation every 3–6 months.

Responsible deployment recommendations

If you do deploy this model (or any AI-text detector):

Never use as sole evidence. Detection should be an informational signal; final high-stakes decisions require human review.
Report calibrated probabilities, not binary labels. "P(AI) = 0.83 ± 0.12" is honest. "This is AI" is not.
Calibrate on the deployment distribution before going live. A model trained on HC3 will not have the right thresholds for student essays, journalism, or business writing.
Audit for fairness across demographic and linguistic groups, including non-native English writers, before any decision-making use.
Avoid punitive use cases. The harms of false positives in academic discipline or hiring outweigh the benefits.
Re-evaluate every 3–6 months. Generators evolve; detectors decay.
Be transparent. People being analyzed should know it's happening and have a way to challenge results.
Consider not deploying. For some use cases, the most ethical choice is no detector at all.
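
One simple way to act on the calibration recommendation is to pick the decision threshold from known human-written texts drawn from the deployment population, so that the false-positive rate stays under a chosen budget. The sketch below is one possible approach, not a procedure used in this project; the function names, the strict `>` decision rule, and the toy scores are assumptions.

```python
def threshold_for_target_fpr(human_scores, target_fpr=0.01):
    """Pick t so that at most target_fpr of known-human texts have P(AI) > t."""
    if not human_scores:
        raise ValueError("need at least one human-written sample")
    ranked = sorted(human_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # false positives we may tolerate
    if allowed >= len(ranked):
        return 0.0
    # Flagging only scores strictly above the (allowed + 1)-th largest score
    # flags at most `allowed` human texts.
    return ranked[allowed]

def empirical_fpr(human_scores, threshold):
    """Share of known-human texts that would be flagged as AI."""
    return sum(score > threshold for score in human_scores) / len(human_scores)

# Made-up P(AI) scores for 10 texts known to be human-written.
toy_human_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.92, 0.97]
t = threshold_for_target_fpr(toy_human_scores, target_fpr=0.2)
print(t, empirical_fpr(toy_human_scores, t))  # 0.7 0.2
```

Running the same procedure separately per group (for example native and non-native English writers) shows whether a single threshold yields comparable false-positive rates across groups. A real audit needs far more than 10 samples per group.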

Citation

@misc{khater2026aitextdetection,
  author       = {Elia Khater},
  title        = {DistilBERT for AI-Generated Text Detection: A Comparative Study with Cross-Domain Fairness Analysis},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/Elia43/distilbert-ai-text-detector}},
  note         = {Final project, Introduction to Natural Language Processing, Université Saint-Joseph (USJ), Spring 2026.
                  Companion repository: \url{https://github.com/Elia43/ai-text-detection}}
}

License

MIT License. The underlying HC3 dataset is licensed by its original authors (Guo et al., 2023) — please respect their terms when reusing this model or extending this work.

Author

Elia Khater, Mathematics & Data Science, Université Saint-Joseph (USJ), Beirut
GitHub · LinkedIn · eliakhater7@gmail.com

Model size: 67M params · Tensor type: F32 · Format: Safetensors

Referenced paper: Liang et al. (2023), "GPT detectors are biased against non-native English writers", arXiv:2304.02819, published Apr 6, 2023.
