---
library_name: transformers
tags:
- AI text detection
- human vs AI classification
- BERT fine-tuning
- Human vs AI text classification
- text-detection
license: mit
language:
- en
metrics:
- accuracy
base_model:
- google-bert/bert-base-uncased
---
# Model Card for BERT AI Detector
## Model Details
### Model Description
This model is a fine-tuned BERT (base-uncased) that classifies English text as either AI-generated or human-written. It was trained on data from the [Kaggle LLM Detect competition](https://www.kaggle.com/competitions/llm-detect-ai-generated-text/data) using variable-length inputs of 5 to 100 words, and it reaches 96.65% accuracy on the competition's validation data.
- **Developed by:** Pritam
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** BERT (base-uncased)
### Model Sources
- **Repository:** [Hugging Face Model Card](https://huggingface.co/pritam2014/BERTAIDetector)
- **Demo:** [Streamlit Interface](https://huggingface.co/spaces/pritam2014/BERTAIDetector)
## Uses
### Direct Use
The model detects whether a piece of text is AI-generated or human-written. Users can paste text snippets into the demo or integrate the model directly into their applications for automated content classification.
### Downstream Use
Potential downstream uses include:
- Moderating AI-generated content in online platforms.
- Academic and journalistic content verification.
- Detecting plagiarism or misuse of AI writing tools.
### Out-of-Scope Use
The model is not suitable for:
- Detecting deeply paraphrased AI-generated text.
- Analysis of languages other than English.
- Scenarios where fairness and bias considerations are critical, as those have not been explicitly addressed.
## Bias, Risks, and Limitations
### Recommendations
Users should be aware that:
- The model may not perform well on text heavily modified from AI-generated content.
- It may produce false positives or false negatives due to the inherent limitations of the dataset or model architecture.
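
To make the two error types above concrete, here is a minimal sketch of how false positive and false negative rates could be measured on a labelled sample. The helper name `error_rates` and the sample data are hypothetical, and the label convention (1 = AI-generated, 0 = human-written) is assumed to match the loading snippet in this card.

```python
def error_rates(labels, predictions):
    """Return (false_positive_rate, false_negative_rate) for binary labels.

    Assumed convention: 1 = AI-generated, 0 = human-written.
    A false positive is human-written text flagged as AI-generated.
    """
    pairs = list(zip(labels, predictions))
    false_pos = sum(1 for y, p in pairs if y == 0 and p == 1)
    false_neg = sum(1 for y, p in pairs if y == 1 and p == 0)
    human_total = sum(1 for y in labels if y == 0)
    ai_total = sum(1 for y in labels if y == 1)
    fpr = false_pos / human_total if human_total else 0.0
    fnr = false_neg / ai_total if ai_total else 0.0
    return fpr, fnr


# Made-up labels and predictions, for illustration only.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
predictions = [0, 0, 0, 1, 1, 1, 0, 0]
fpr, fnr = error_rates(labels, predictions)
print(f"false positive rate: {fpr:.2f}, false negative rate: {fnr:.2f}")
```

Reporting the two rates separately is useful because their costs usually differ: flagging a human author as AI is often more harmful than missing an AI-generated text.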
## How to Get Started with the Model
Use the following code snippet to load the model:
```python
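import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("pritamdeb68/BERTAIDetector")
model = AutoModelForSequenceClassification.from_pretrained("pritamdeb68/BERTAIDetector")

text = "Your text here"
# Truncate to BERT's 512-token limit so long inputs do not raise an error
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # gradients are not needed for inference
    outputs = model(**inputs)
prediction = outputs.logits.argmax(dim=1).item()
# Label 1 is treated as AI-generated, label 0 as human-written
print("AI-generated" if prediction == 1 else "Human-written")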
```
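
If a confidence score is more useful than a hard label, the two logits can be turned into a probability with a softmax. The sketch below works on a plain list of two floats (for example, the result of `outputs.logits[0].tolist()`); the helper names `ai_probability` and `classify` and the threshold values are hypothetical, and index 1 is assumed to be the AI-generated class.

```python
import math


def ai_probability(logits):
    """Softmax over two logits; returns the probability of class 1.

    Class 1 is assumed to mean AI-generated.
    """
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[1] / sum(exps)


def classify(logits, threshold=0.5):
    """Label text as AI-generated only if its probability reaches the threshold."""
    prob = ai_probability(logits)
    label = "AI-generated" if prob >= threshold else "Human-written"
    return label, prob


# Made-up logits, for illustration only.
print(classify([-1.2, 2.3]))
print(classify([0.4, 0.9], threshold=0.9))
```

Raising the threshold above 0.5 reduces false positives at the cost of more false negatives, which may suit applications where wrongly flagging a human author is costly.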
## Training Details
### Training Data
The training dataset was sourced from the [Kaggle LLM Detect competition](https://www.kaggle.com/competitions/llm-detect-ai-generated-text/data). The data includes examples of both AI-generated and human-written text, spanning various input lengths (5-100 words).
### Training Procedure
#### Preprocessing
- Text was tokenized using BERT's tokenizer.
- Input lengths ranged between 5 and 100 words, padded or truncated as necessary.
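
The card does not say how the 5 to 100 word limit was enforced, so the following is only a sketch of one plausible approach: drop texts shorter than the minimum and cut longer ones at the maximum. The function name `clip_words` and the whitespace-based word split are assumptions. Token-level padding and truncation would still be handled separately by the tokenizer.

```python
def clip_words(text, min_words=5, max_words=100):
    """Limit text to max_words words; return None if it has fewer than min_words.

    Words are split on whitespace, which is an assumption about how
    "words" were counted during training.
    """
    words = text.split()
    if len(words) < min_words:
        return None
    return " ".join(words[:max_words])


print(clip_words("too short"))
print(clip_words("this sentence has exactly six words"))
print(len(clip_words("word " * 250).split()))
```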
#### Training Hyperparameters
- **Batch Size:** 300
- **Optimizer:** AdamW
- **Learning Rate:** 1e-5
- **Epochs:** 1
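
As a rough illustration of how these hyperparameters fit together, here is a one-epoch PyTorch loop using AdamW with the listed batch size and learning rate. This is not the original training script: a small linear layer stands in for BERT and random tensors stand in for tokenized text, so only the optimizer, batch size, learning rate, and epoch count come from this card.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hyperparameters reported in this card.
BATCH_SIZE = 300
LEARNING_RATE = 1e-5
EPOCHS = 1

# Stand-ins (assumptions): random features and a linear layer instead of BERT.
features = torch.randn(900, 16)
labels = torch.randint(0, 2, (900,))
loader = DataLoader(TensorDataset(features, labels), batch_size=BATCH_SIZE, shuffle=True)
model = nn.Linear(16, 2)  # two output logits: human-written vs AI-generated

optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
loss_fn = nn.CrossEntropyLoss()

model.train()
step_losses = []
for _ in range(EPOCHS):
    for batch_features, batch_labels in loader:
        optimizer.zero_grad()
        logits = model(batch_features)
        loss = loss_fn(logits, batch_labels)
        loss.backward()
        optimizer.step()
        step_losses.append(loss.item())

print(f"{len(step_losses)} optimizer steps, last loss {step_losses[-1]:.4f}")
```

With a batch size of 300, one epoch makes one optimizer step per 300 examples, so a single epoch at a learning rate of 1e-5 amounts to a fairly light fine-tune.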
#### Speeds, Sizes, Times
- **Training Time:** 1 hour 10 minutes
- **Hardware Used:** GPU (Kaggle T4 x 2)
- **Training Loss:** 0.12
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
Validation data from the Kaggle competition was used for evaluation.
#### Metrics
- **Accuracy:** 96.65% on validation data.
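
For reference, accuracy here means the fraction of validation examples whose predicted label matches the true label. A minimal sketch follows, with a hypothetical helper name and made-up data; it does not reproduce the reported figure.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Made-up labels and predictions, for illustration only.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 1, 0, 0, 0, 1, 1]
print(f"accuracy: {accuracy(labels, predictions):.2%}")
```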
### Results
The model reached 96.65% accuracy on the validation data with low validation loss (training loss was 0.12 after one epoch), which indicates it is effective at AI text detection on data similar to the competition set.
## Environmental Impact
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute):
- **Hardware Type:** Kaggle T4 (x2) GPU
- **Training Duration:** 1 hour 10 minutes
- **Compute Region:** Not specified
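
A back-of-the-envelope estimate can be worked out from the figures above. The sketch below assumes both GPUs ran at the T4's rated 70 W board power for the full duration, which gives an upper bound for the GPUs alone and ignores CPU, memory, and data-center overhead. Because the compute region is not specified, the carbon intensity is a placeholder value, not a measurement.

```python
def training_energy_kwh(gpu_count, gpu_power_watts, hours):
    """Energy drawn by the GPUs, in kilowatt-hours."""
    return gpu_count * gpu_power_watts * hours / 1000.0


def emissions_kg(energy_kwh, carbon_intensity_kg_per_kwh):
    """CO2-equivalent emissions for a given grid carbon intensity."""
    return energy_kwh * carbon_intensity_kg_per_kwh


GPU_COUNT = 2                # Kaggle T4 x 2, from this card
T4_POWER_WATTS = 70          # rated maximum board power of an NVIDIA T4
TRAINING_HOURS = 1 + 10 / 60  # 1 hour 10 minutes, from this card

# Placeholder (assumption): the real value depends on the unspecified region.
ASSUMED_CARBON_INTENSITY = 0.4  # kg CO2eq per kWh

energy = training_energy_kwh(GPU_COUNT, T4_POWER_WATTS, TRAINING_HOURS)
co2 = emissions_kg(energy, ASSUMED_CARBON_INTENSITY)
print(f"energy: {energy:.3f} kWh, emissions: {co2:.3f} kg CO2eq")
```

Under these assumptions the run used at most about 0.16 kWh of GPU energy; the emissions figure scales linearly with whatever carbon intensity applies to the actual region.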
## Technical Specifications
### Model Architecture and Objective
- **Model Architecture:** BERT (base-uncased) fine-tuned for text classification.
- **Objective:** Binary classification of text into AI-generated or human-written categories.
### Compute Infrastructure
#### Hardware
- **Type:** Kaggle T4 (x2) GPU
#### Software
- **Framework:** PyTorch with Transformers library
## Citation
If you use this model, please cite the repository:
```bibtex
@misc{pritam2024bertaidetector,
  title={BERT AI Detector},
  author={Pritam},
  year={2024},
  url={https://huggingface.co/pritam2014/BERTAIDetector}
}
```