---
library_name: transformers
tags:
- AI text detection
- human vs AI classification
- BERT fine-tuning
- Human vs AI text classification
- text-detection
license: mit
language:
- en
metrics:
- accuracy
base_model:
- google-bert/bert-base-uncased
---

# Model Card for BERT AI Detector

## Model Details

### Model Description

This model is a fine-tuned BERT classifier that labels text as either AI-generated or human-written. It was trained on data from the [Kaggle LLM Detect competition](https://www.kaggle.com/competitions/llm-detect-ai-generated-text/data) using variable-length text inputs ranging from 5 to 100 words. The fine-tuned model achieves high accuracy in identifying the source of a text, making it a useful tool for detecting AI-generated content.

- **Developed by:** Pritam
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** BERT (base-uncased)

### Model Sources

- **Repository:** [Hugging Face Model Card](https://huggingface.co/pritam2014/BERTAIDetector)
- **Demo:** [Streamlit Interface](https://huggingface.co/spaces/pritam2014/BERTAIDetector)

## Uses

### Direct Use

The model is intended for detecting whether text is AI-generated or human-written. Users can input text snippets into the demo or integrate the model directly into their applications for automated content classification.

### Downstream Use

Potential downstream uses include:

- Moderating AI-generated content on online platforms.
- Academic and journalistic content verification.
- Detecting plagiarism or misuse of AI writing tools.

### Out-of-Scope Use

The model is not suitable for:

- Detecting heavily paraphrased AI-generated text.
- Analyzing languages other than English.
- Scenarios where fairness and bias considerations are critical, as these have not been explicitly addressed.

## Bias, Risks, and Limitations

### Recommendations

Users should be aware that:

- The model may not perform well on text that has been heavily modified after AI generation.
- It may produce false positives or false negatives due to limitations of the training dataset and model architecture.

## How to Get Started with the Model

Use the following snippet to load the model and classify a piece of text:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("pritamdeb68/BERTAIDetector")
model = AutoModelForSequenceClassification.from_pretrained("pritamdeb68/BERTAIDetector")

text = "Your text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

prediction = outputs.logits.argmax(dim=1).item()
print("AI-generated" if prediction == 1 else "Human-written")
```

## Training Details

### Training Data

The training dataset was sourced from the [Kaggle LLM Detect competition](https://www.kaggle.com/competitions/llm-detect-ai-generated-text/data). It includes examples of both AI-generated and human-written text, spanning input lengths of 5 to 100 words.

### Training Procedure

#### Preprocessing

- Text was tokenized with BERT's tokenizer.
- Inputs ranged between 5 and 100 words and were padded or truncated as necessary.

#### Training Hyperparameters

- **Batch size:** 300
- **Optimizer:** AdamW
- **Learning rate:** 1e-5
- **Epochs:** 1

#### Speeds, Sizes, Times

- **Training time:** 1 hour 10 minutes
- **Hardware used:** Kaggle T4 GPU (x2)
- **Training loss:** 0.12

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Validation data from the Kaggle competition was used for evaluation.

#### Metrics

- **Accuracy:** 96.65% on validation data.

### Results

The model achieved high accuracy and low loss on the validation data, demonstrating its effectiveness for AI text detection.
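The quick-start snippet above prints only the argmax class. For moderation or verification use cases, it can be helpful to also report how confident the classifier is. The following is a minimal, dependency-free sketch of that post-processing step, assuming the same label order as the quick-start code (index 1 = AI-generated); `softmax` and `label_with_confidence` are hypothetical helpers, not part of the released model:

```python
import math

def softmax(logits):
    """Convert raw classifier logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_confidence(logits, labels=("Human-written", "AI-generated")):
    """Map a two-way logit pair to a (label, confidence) tuple."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return labels[idx], probs[idx]

# Example: logits taken from a model output, e.g. outputs.logits[0].tolist()
label, confidence = label_with_confidence([0.0, 2.0])
print(f"{label} ({confidence:.1%})")
```

In practice you would pass `outputs.logits[0].tolist()` from the quick-start code into `label_with_confidence`; a downstream application can then apply its own confidence threshold rather than trusting every argmax decision.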
## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute):

- **Hardware type:** Kaggle T4 GPU (x2)
- **Training duration:** 1 hour 10 minutes
- **Compute region:** Not specified

## Technical Specifications

### Model Architecture and Objective

- **Model architecture:** BERT (base-uncased) fine-tuned for sequence classification.
- **Objective:** Binary classification of text as AI-generated or human-written.

### Compute Infrastructure

#### Hardware

- **Type:** Kaggle T4 GPU (x2)

#### Software

- **Framework:** PyTorch with the Transformers library

## Citation

If you use this model, please cite the repository:

```
@misc{pritam2024bertaidetector,
  title={BERT AI Detector},
  author={Pritam},
  year={2024},
  url={https://huggingface.co/pritam2014/BERTAIDetector}
}
```