title: FineTextTector - AI Text Detector
emoji: 🤖
colorFrom: indigo
colorTo: blue
sdk: streamlit
sdk_version: 1.30.0
app_file: app.py
pinned: false
license: mit
About the Project
This project aims to develop a robust machine learning model capable of distinguishing between human-written and AI-generated text. With the rapid advancement of large language models (LLMs) such as ChatGPT and Gemini, the ability to identify the origin of a text has become crucial in various domains, including academic integrity, content moderation, misinformation detection, and authorship verification. The project leverages state-of-the-art natural language processing (NLP) techniques and transfer learning to build a binary classifier that can accurately predict whether a given text was authored by a human or generated by an AI.
The workflow encompasses comprehensive exploratory data analysis (EDA), advanced text preprocessing, model selection and fine-tuning, and thorough evaluation. The final model is designed to be easily deployable and accessible for real-world applications.
About the Dataset
The dataset used in this project is sourced from Kaggle: AI vs Human Text Dataset. It contains a large collection of text samples, each labeled as either human-written or AI-generated. The dataset is well-suited for binary classification tasks and provides a diverse range of topics and writing styles, making it ideal for training and evaluating models that need to generalize across different types of content.
- Features:
  - text: The actual text sample.
  - generated: Label indicating the source (0 for human, 1 for AI).
The dataset is split into training, validation, and test sets to ensure unbiased evaluation and robust model performance.
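A stratified three-way split of this kind can be sketched as follows. This is a minimal standard-library illustration, not the notebook's code: the 80/10/10 fractions, the seed, and the function name `stratified_split` are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Return (train, val, test) index lists that keep class proportions.

    `labels` is a sequence of class labels (0 = human, 1 = AI).
    `fractions` gives the train and validation shares; whatever is left
    after those two goes to the test set.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = int(fractions[0] * len(indices))
        n_val = int(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test


# Toy labels: 60 human samples followed by 40 AI samples.
labels = [0] * 60 + [1] * 40
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting each class separately keeps the human/AI ratio the same in all three sets, so the evaluation metrics are not skewed by an unlucky draw.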
Notebook Summary
The main notebook, ai_vs_human_text_fine_tuned_classifier.ipynb, guides users through the entire process of building the classifier:
- Problem Definition: Outlines the motivation and objectives.
- Exploratory Data Analysis (EDA): Visualizes class distributions, text lengths, lexical richness, punctuation usage, and stopword ratios to uncover patterns and differences between human and AI texts.
- Text Preprocessing: Applies normalization, stopword removal, noise filtering (removing URLs, emails, hashtags, mentions, numbers), and filters out outlier texts based on length.
- Model Selection: Utilizes transfer learning with the distilbert/distilroberta-base model, enhanced with LoRA (Low-Rank Adaptation) for efficient fine-tuning.
- Training: Fine-tunes the model on a subset of the data, using stratified splits and advanced training arguments for optimal performance.
- Evaluation: Assesses the model using accuracy, precision, recall, and F1-score on a held-out test set.
- Deployment: Demonstrates how to push the trained model and tokenizer to Hugging Face Hub for sharing and reuse.
Model Results
Preprocessing
- Lowercasing and Stripping: All text is converted to lowercase and stripped of extra whitespace.
- Punctuation and Stopword Removal: Punctuation is removed, and stopwords are filtered out to focus on meaningful content.
- Noise Filtering: Regular expressions are used to remove URLs, emails, hashtags, mentions, and numbers.
- Outlier Filtering: Texts that are extremely short or long (based on quantiles) are removed to ensure consistent input lengths for the model.
- Deduplication: Duplicate texts are dropped to prevent data leakage.
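The steps above can be sketched with the standard library alone. This is an illustrative approximation rather than the notebook's implementation: the regular expressions, the quantile bounds, and the tiny `STOPWORDS` set (a stand-in for the NLTK stopword list) are all assumptions. Noise patterns are applied before punctuation removal so that URLs and emails are still recognizable when the patterns run.

```python
import re
import string

# Small illustrative stopword set; a stand-in for NLTK's English list.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "of", "to",
             "and", "or", "in", "at", "on", "for", "me"}

# Assumed noise patterns: URLs, emails, hashtags/mentions, numbers.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
TAG_RE = re.compile(r"[#@]\w+")
NUMBER_RE = re.compile(r"\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """Lowercase, strip noise, drop punctuation and stopwords."""
    text = text.lower().strip()
    for pattern in (URL_RE, EMAIL_RE, TAG_RE, NUMBER_RE):
        text = pattern.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


def filter_and_dedupe(texts, low_q=0.05, high_q=0.95):
    """Clean texts, drop duplicates, and keep lengths within two quantiles."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    cleaned = list(dict.fromkeys(clean_text(t) for t in texts))
    if not cleaned:
        return []
    lengths = sorted(len(t.split()) for t in cleaned)
    lo = lengths[int(low_q * (len(lengths) - 1))]
    hi = lengths[int(high_q * (len(lengths) - 1))]
    return [t for t in cleaned if lo <= len(t.split()) <= hi]


sample = "  Visit https://example.com or email me at a.b@mail.com! #AI @bot 2024 The model is GREAT.  "
print(clean_text(sample))  # visit email model great
```

Deduplicating after cleaning also catches texts that differ only in casing, punctuation, or removed noise.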
Training
- Model Architecture: The project uses distilbert/distilroberta-base, a distilled version of RoBERTa, known for its efficiency and strong performance on text classification tasks.
- LoRA Fine-Tuning: LoRA (Low-Rank Adaptation) is applied to reduce the number of trainable parameters, making the fine-tuning process more memory- and compute-efficient without sacrificing accuracy.
- Training Arguments: The model is trained for 2 epochs with early stopping, regular evaluation, and checkpointing. Batch sizes and learning rates are carefully chosen for stability and speed.
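To make the parameter savings from LoRA concrete, the arithmetic can be sketched as below. LoRA freezes a weight matrix of shape d_out x d_in and trains two low-rank factors of shapes d_out x r and r x d_in instead. The rank of 8 and the 768 x 768 projection (768 is the hidden size of distilroberta-base) are illustrative assumptions; the rank actually used in the notebook is not stated here.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int):
    """Compare trainable weights for full fine-tuning vs. a LoRA adapter.

    Full fine-tuning updates every entry of the d_out x d_in matrix.
    LoRA trains only B (d_out x rank) and A (rank x d_in).
    Biases are ignored in both counts.
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


# Assumed example: one 768 x 768 attention projection with rank 8.
full, lora = lora_param_counts(d_in=768, d_out=768, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# full: 589,824  lora: 12,288  ratio: 2.08%
```

Under these assumptions, each adapted matrix trains roughly 2% of the weights that full fine-tuning would update, which is where the memory and compute savings come from.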
Evaluation
- Metrics: The model is evaluated using accuracy, precision, recall, and F1-score. These metrics provide a comprehensive view of the classifier's performance, especially in distinguishing between the two classes.
- Results: The fine-tuned model demonstrates strong performance on the held-out test set, with high accuracy and balanced precision and recall.
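For reference, the four metrics can be computed from the confusion counts as in the sketch below. The labels and predictions are toy values chosen for illustration and are not results from this project; the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 with class 1 (AI) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data only: 0 = human, 1 = AI.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
```

Precision measures how many texts flagged as AI really are AI-generated, while recall measures how many AI-generated texts were caught; F1 is their harmonic mean.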
How to Install
Follow these steps to set up the environment using Python's built-in venv:
# Clone the repository
git clone https://github.com/DeepActionPotential/FineTextTector
cd FineTextTector
# Create a virtual environment
python -m venv venv
# Activate the virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
# Install required packages
pip install -r requirements.txt
How to Use the Software
Technologies Used
- Transformers (Hugging Face): Core library for model loading, tokenization, and training. Enables transfer learning with state-of-the-art NLP models.
- Datasets (Hugging Face): Efficient data handling, splitting, and preprocessing.
- PEFT (Parameter-Efficient Fine-Tuning): Implements LoRA for memory- and compute-efficient model adaptation.
- Optuna: Automated hyperparameter optimization to fine-tune model performance.
- Scikit-learn: Data splitting, metrics calculation, and utility functions.
- Seaborn & Matplotlib: Data visualization for EDA and result interpretation.
- NLTK: Stopword lists and basic NLP utilities.
- Python venv: Isolated environment management for reproducible installations.
These technologies collectively enable efficient, scalable, and reproducible development of advanced NLP models.
License
This project is licensed under the MIT License. See the LICENSE file for details.

