	---
	library_name: transformers
	tags:
	- AI text detection
	- human vs AI classification
	- BERT fine-tuning
	- Human vs AI text classification
	- text-detection
	license: mit
	language:
	- en
	metrics:
	- accuracy
	base_model:
	- google-bert/bert-base-uncased
	---
	# Model Card for BERT AI Detector






	## Model Details

	### Model Description

	This model is a fine-tuned BERT (`bert-base-uncased`) that classifies English text as either AI-generated or human-written. It was trained on data from the [Kaggle LLM Detect competition](https://www.kaggle.com/competitions/llm-detect-ai-generated-text/data) using variable-length text inputs of 5 to 100 words. The fine-tuned model reaches 96.65% accuracy on the validation data, which makes it a useful tool for detecting AI-generated content.











	- Developed by: Pritam
	- Language(s) (NLP): English
	- License: MIT
	- Finetuned from model: BERT (base-uncased)

	### Model Sources

	- Repository: [Hugging Face Model Card](https://huggingface.co/pritam2014/BERTAIDetector)
	- Demo: [Streamlit Interface](https://huggingface.co/spaces/pritam2014/BERTAIDetector)


	## Uses



	### Direct Use

	The model is intended for use in detecting whether text is AI-generated or human-written. Users can input text snippets into the demo or directly integrate the model into their applications for automated content classification.



	### Downstream Use

	Potential downstream uses include:

	- Moderating AI-generated content on online platforms.
	- Verifying academic and journalistic content.
	- Detecting plagiarism or misuse of AI writing tools.

	### Out-of-Scope Use

	The model is not suitable for:

	- Detecting deeply paraphrased AI-generated text.
	- Analysis of languages other than English.
	- Scenarios where fairness and bias considerations are critical, as those have not been explicitly addressed.

	## Bias, Risks, and Limitations





	### Recommendations

	Users should be aware that:

	- The model may perform poorly on AI-generated text that has been heavily edited or paraphrased.
	- It can produce false positives and false negatives because of limitations in the dataset and the model architecture.

	## How to Get Started with the Model

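	Use the following code snippet to load the model and classify a piece of text:

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	tokenizer = AutoTokenizer.from_pretrained("pritamdeb68/BERTAIDetector")
	model = AutoModelForSequenceClassification.from_pretrained("pritamdeb68/BERTAIDetector")
	model.eval()  # make inference mode explicit (dropout disabled)

	text = "Your text here"
	# Truncate to BERT's 512-token limit so long inputs do not raise an error.
	inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
	with torch.no_grad():  # gradients are not needed for prediction
	    outputs = model(**inputs)
	prediction = outputs.logits.argmax(dim=1).item()
	# Label 1 is treated as AI-generated, label 0 as human-written.
	print("AI-generated" if prediction == 1 else "Human-written")
	```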
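
The snippet above reports only the top label. When a confidence score is useful, the two logits can be turned into probabilities with a softmax. The following is a minimal sketch, not part of the released code: the helper name `classify_from_logits` and the default threshold of 0.5 are assumptions, and it assumes that index 1 is the AI-generated class, as in the snippet above.

```python
import math


def classify_from_logits(logits, threshold=0.5):
    """Convert a pair of logits into a label and the AI-class probability.

    `logits` is a sequence of two floats, e.g. `outputs.logits[0].tolist()`.
    Assumes index 0 = human-written and index 1 = AI-generated.
    """
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    p_ai = exps[1] / sum(exps)
    label = "AI-generated" if p_ai >= threshold else "Human-written"
    return label, p_ai


if __name__ == "__main__":
    # Hypothetical logits, for illustration only.
    label, p_ai = classify_from_logits([-1.2, 2.3])
    print(f"{label} (p_ai={p_ai:.3f})")
```

Raising `threshold` above 0.5 trades fewer false positives for more false negatives, which may suit settings where wrongly flagging human writing is the costlier error.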

	## Training Details

	### Training Data

	The training dataset was sourced from the [Kaggle LLM Detect competition](https://www.kaggle.com/competitions/llm-detect-ai-generated-text/data). The data includes examples of both AI-generated and human-written text, spanning various input lengths (5-100 words).



	### Training Procedure

	#### Preprocessing





	- Text was tokenized using BERT's tokenizer.
	- Input lengths ranged between 5 and 100 words, padded or truncated as necessary.
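
The card does not say how the 5 to 100 word range was enforced. One possible reading, sketched below, is a word-level filter applied before tokenization. The function name `clip_to_word_range` and the use of whitespace splitting are assumptions for illustration only. Token-level padding and truncation are a separate step, handled by the tokenizer's `padding` and `truncation` arguments.

```python
def clip_to_word_range(text, min_words=5, max_words=100):
    """Limit `text` to at most `max_words` whitespace-separated words.

    Returns None when the text has fewer than `min_words` words, so the
    caller can drop it. This is a hypothetical helper, not the author's code.
    """
    words = text.split()
    if len(words) < min_words:
        return None
    return " ".join(words[:max_words])


if __name__ == "__main__":
    print(clip_to_word_range("too short"))  # None: only two words
    print(clip_to_word_range("this sentence has exactly six words"))
```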

	#### Training Hyperparameters

	- Batch Size: 300
	- Optimizer: AdamW
	- Learning Rate: 1e-5
	- Epochs: 1
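
The hyperparameters above can be collected in a small configuration object. This is a sketch under stated assumptions: the class name `TrainConfig` is hypothetical, 300 is treated as the effective batch size across both GPUs, and one optimizer step per batch is assumed (no gradient accumulation). The dataset size in the usage example is a placeholder, since the card does not state it.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters listed in this card (hypothetical container)."""

    batch_size: int = 300
    optimizer: str = "AdamW"
    learning_rate: float = 1e-5
    epochs: int = 1

    def total_steps(self, num_examples: int) -> int:
        """Optimizer steps, assuming one step per batch and no accumulation."""
        return math.ceil(num_examples / self.batch_size) * self.epochs


if __name__ == "__main__":
    config = TrainConfig()
    # 30,000 is a placeholder dataset size, not a figure from this card.
    print(config.total_steps(30_000))  # 100
```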

	#### Speeds, Sizes, Times

	- Training Time: 1 hour 10 minutes
	- Hardware Used: 2 x NVIDIA T4 GPUs (Kaggle)
	- Training Loss: 0.12

	## Evaluation



	### Testing Data, Factors & Metrics

	#### Testing Data

	Validation data from the Kaggle competition was used for evaluation.









	#### Metrics

	- Accuracy: 96.65% on validation data.
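
Accuracy alone does not separate false positives from false negatives. The sketch below shows how accuracy can be computed alongside confusion counts. The function names are illustrative, the labels follow the convention used elsewhere in this card (1 = AI-generated, 0 = human-written), and the example data is made up rather than taken from the validation set.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, with 1 as the positive class."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["tn" if truth == 0 else "fn"] += 1
    return counts


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    counts = confusion_counts(y_true, y_pred)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("accuracy is undefined for empty inputs")
    return (counts["tp"] + counts["tn"]) / total


if __name__ == "__main__":
    # Made-up labels for illustration only.
    y_true = [1, 0, 1, 0, 1]
    y_pred = [1, 0, 0, 0, 1]
    print(confusion_counts(y_true, y_pred))
    print(accuracy(y_true, y_pred))  # 0.8
```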



	### Results

	The model reached 96.65% accuracy on the validation data with a low validation loss after a single epoch of fine-tuning, which indicates that it is effective at AI text detection on this dataset.











	## Environmental Impact

	Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute):



	- Hardware Type: 2 x NVIDIA T4 GPUs (Kaggle)
	- Training Duration: 1 hour 10 minutes
	- Compute Region: Not specified
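
A rough GPU-only estimate can be worked out from the figures above. This is a back-of-the-envelope sketch, not the calculator's exact method: it assumes both GPUs ran at the T4's 70 W maximum board power for the whole run, and because the compute region is not specified, the carbon intensity used in the example is a placeholder value.

```python
def gpu_energy_kwh(hours, num_gpus, gpu_power_watts):
    """Energy drawn by the GPUs, assuming constant power for the full duration."""
    return hours * num_gpus * gpu_power_watts / 1000.0


def estimate_emissions_kg(energy_kwh, carbon_intensity_kg_per_kwh):
    """CO2-equivalent emissions for a given energy use and grid intensity."""
    return energy_kwh * carbon_intensity_kg_per_kwh


if __name__ == "__main__":
    # 1 hour 10 minutes on two T4 GPUs at an assumed 70 W each.
    energy = gpu_energy_kwh(hours=70 / 60, num_gpus=2, gpu_power_watts=70)
    # 0.4 kg CO2eq/kWh is a placeholder, since the region is not specified.
    emissions = estimate_emissions_kg(energy, carbon_intensity_kg_per_kwh=0.4)
    print(f"{energy:.3f} kWh, {emissions:.3f} kg CO2eq")
```

Under these assumptions the run uses about 0.163 kWh. The result is an upper bound for the GPUs alone and excludes CPU, memory, and data-center overhead.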



	## Technical Specifications

	### Model Architecture and Objective

	- Model Architecture: BERT (base-uncased) fine-tuned for text classification.
	- Objective: Binary classification of text into AI-generated or human-written categories.
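
The card does not name the loss function. The sketch below assumes standard cross-entropy over the two output logits, which is what the Transformers sequence-classification head applies for single-label classification with integer labels. The function name `cross_entropy` is illustrative.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy loss for one example: -log softmax(logits)[label]."""
    # Log-sum-exp with the max subtracted for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]


if __name__ == "__main__":
    # An uninformed prediction (equal logits) costs log(2), about 0.693.
    print(round(cross_entropy([0.0, 0.0], 1), 3))
    # A mean loss of L corresponds to a geometric-mean probability of
    # exp(-L) on the correct class; for L = 0.12 this is about 0.887.
    print(round(math.exp(-0.12), 3))
```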

	### Compute Infrastructure



	#### Hardware

	- Type: 2 x NVIDIA T4 GPUs (Kaggle)

	#### Software

	- Framework: PyTorch with Transformers library

























	## Citation

	If you use this model, please cite the repository:

	```
	@misc{pritam2024bertaidetector,
	title={BERT AI Detector},
	author={Pritam},
	year={2024},
	url={https://huggingface.co/pritam2014/BERTAIDetector}
	}
	```