DeepActionPotential committed on
Commit bf9beb4 · verified · 1 Parent(s): e46d899

Update README.md

Files changed (1)
  1. README.md +111 -100
README.md CHANGED
@@ -1,100 +1,111 @@
- # AI vs Human Text Fine-Tuned Classifier
-
- ## About the Project
-
- This project aims to develop a robust machine learning model capable of distinguishing between human-written and AI-generated text. With the rapid advancement of large language models (LLMs) such as ChatGPT and Gemini, the ability to identify the origin of a text has become crucial in various domains, including academic integrity, content moderation, misinformation detection, and authorship verification. The project leverages state-of-the-art natural language processing (NLP) techniques and transfer learning to build a binary classifier that can accurately predict whether a given text was authored by a human or generated by an AI.
-
- The workflow encompasses comprehensive exploratory data analysis (EDA), advanced text preprocessing, model selection and fine-tuning, and thorough evaluation. The final model is designed to be easily deployable and accessible for real-world applications.
-
- ## About the Dataset
-
- The dataset used in this project is sourced from Kaggle: [AI vs Human Text Dataset](https://www.kaggle.com/datasets/shanegerami/ai-vs-human-text). It contains a large collection of text samples, each labeled as either human-written or AI-generated. The dataset is well-suited for binary classification tasks and provides a diverse range of topics and writing styles, making it ideal for training and evaluating models that need to generalize across different types of content.
-
- - **Features:**
- - `text`: The actual text sample.
- - `generated`: Label indicating the source (0 for human, 1 for AI).
-
- The dataset is split into training, validation, and test sets to ensure unbiased evaluation and robust model performance.
-
- ## Notebook Summary
-
- The main notebook, `ai_vs_human_text_fine_tuned_classifier.ipynb`, guides users through the entire process of building the classifier:
-
- 1. **Problem Definition:** Outlines the motivation and objectives.
- 2. **Exploratory Data Analysis (EDA):** Visualizes class distributions, text lengths, lexical richness, punctuation usage, and stopword ratios to uncover patterns and differences between human and AI texts.
- 3. **Text Preprocessing:** Applies normalization, stopword removal, noise filtering (removing URLs, emails, hashtags, mentions, numbers), and filters out outlier texts based on length.
- 4. **Model Selection:** Utilizes transfer learning with the `distilbert/distilroberta-base` model, enhanced with LoRA (Low-Rank Adaptation) for efficient fine-tuning.
- 5. **Training:** Fine-tunes the model on a subset of the data, using stratified splits and advanced training arguments for optimal performance.
- 6. **Evaluation:** Assesses the model using accuracy, precision, recall, and F1-score on a held-out test set.
- 7. **Deployment:** Demonstrates how to push the trained model and tokenizer to Hugging Face Hub for sharing and reuse.
-
- ## Model Results
-
- ### Preprocessing
-
- - **Lowercasing and Stripping:** All text is converted to lowercase and stripped of extra whitespace.
- - **Punctuation and Stopword Removal:** Punctuation is removed, and stopwords are filtered out to focus on meaningful content.
- - **Noise Filtering:** Regular expressions are used to remove URLs, emails, hashtags, mentions, and numbers.
- - **Outlier Filtering:** Texts that are extremely short or long (based on quantiles) are removed to ensure consistent input lengths for the model.
- - **Deduplication:** Duplicate texts are dropped to prevent data leakage.
-
- ### Training
-
- - **Model Architecture:** The project uses `distilbert/distilroberta-base`, a distilled version of RoBERTa, known for its efficiency and strong performance on text classification tasks.
- - **LoRA Fine-Tuning:** LoRA (Low-Rank Adaptation) is applied to reduce the number of trainable parameters, making the fine-tuning process more memory- and compute-efficient without sacrificing accuracy.
- - **Training Arguments:** The model is trained for 2 epochs with early stopping, regular evaluation, and checkpointing. Batch sizes and learning rates are carefully chosen for stability and speed.
-
- ### Evaluation
-
- - **Metrics:** The model is evaluated using accuracy, precision, recall, and F1-score. These metrics provide a comprehensive view of the classifier's performance, especially in distinguishing between the two classes.
- - **Results:** The fine-tuned model demonstrates strong performance, with high accuracy and balanced precision/recall, indicating its effectiveness in real-world scenarios.
-
- ## How to Install
-
- Follow these steps to set up the environment using Python's built-in `venv`:
-
- ```bash
- # Clone the repository
- git clone https://github.com/DeepActionPotential/FineTextTector
- cd FineTextTector
-
- # Create a virtual environment
- python -m venv venv
-
- # Activate the virtual environment
- # On Windows:
- venv\Scripts\activate
- # On macOS/Linux:
- source venv/bin/activate
-
-
- # Install required packages
- pip install -r requirements.txt
- ```
-
-
-
- ## How to Use the Software
-
- - ## [Demo-video](images/FineTextTector.mp4)
- - ![Demo-image](images/1.png)
- - ![Demo-image](images/2.png)
-
- ## Technologies Used
-
- - **Transformers (Hugging Face):** Core library for model loading, tokenization, and training. Enables transfer learning with state-of-the-art NLP models.
- - **Datasets (Hugging Face):** Efficient data handling, splitting, and preprocessing.
- - **PEFT (Parameter-Efficient Fine-Tuning):** Implements LoRA for memory- and compute-efficient model adaptation.
- - **Optuna:** Automated hyperparameter optimization to fine-tune model performance.
- - **Scikit-learn:** Data splitting, metrics calculation, and utility functions.
- - **Seaborn & Matplotlib:** Data visualization for EDA and result interpretation.
- - **NLTK:** Stopword lists and basic NLP utilities.
- - **Python venv:** Isolated environment management for reproducible installations.
-
- These technologies collectively enable efficient, scalable, and reproducible development of advanced NLP models.
-
- ## License
-
- This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
-
- ---
+
+ ---
+ title: FineTextTector - AI Text Detector
+ emoji: 🤖
+ colorFrom: indigo
+ colorTo: blue
+ sdk: streamlit
+ sdk_version: 1.30.0
+ app_file: app.py
+ pinned: false
+ license: mit
+ ---
+
+ ## About the Project
+
+ This project aims to develop a robust machine learning model capable of distinguishing between human-written and AI-generated text. With the rapid advancement of large language models (LLMs) such as ChatGPT and Gemini, the ability to identify the origin of a text has become crucial in various domains, including academic integrity, content moderation, misinformation detection, and authorship verification. The project leverages state-of-the-art natural language processing (NLP) techniques and transfer learning to build a binary classifier that can accurately predict whether a given text was authored by a human or generated by an AI.
+
+ The workflow encompasses comprehensive exploratory data analysis (EDA), advanced text preprocessing, model selection and fine-tuning, and thorough evaluation. The final model is designed to be easily deployable and accessible for real-world applications.
+
+ ## About the Dataset
+
+ The dataset used in this project is sourced from Kaggle: [AI vs Human Text Dataset](https://www.kaggle.com/datasets/shanegerami/ai-vs-human-text). It contains a large collection of text samples, each labeled as either human-written or AI-generated. The dataset is well-suited for binary classification tasks and provides a diverse range of topics and writing styles, making it ideal for training and evaluating models that need to generalize across different types of content.
+
+ - **Features:**
+ - `text`: The actual text sample.
+ - `generated`: Label indicating the source (0 for human, 1 for AI).
+
+ The dataset is split into training, validation, and test sets to ensure unbiased evaluation and robust model performance.
+
+ ## Notebook Summary
+
+ The main notebook, `ai_vs_human_text_fine_tuned_classifier.ipynb`, guides users through the entire process of building the classifier:
+
+ 1. **Problem Definition:** Outlines the motivation and objectives.
+ 2. **Exploratory Data Analysis (EDA):** Visualizes class distributions, text lengths, lexical richness, punctuation usage, and stopword ratios to uncover patterns and differences between human and AI texts.
+ 3. **Text Preprocessing:** Applies normalization, stopword removal, noise filtering (removing URLs, emails, hashtags, mentions, numbers), and filters out outlier texts based on length.
+ 4. **Model Selection:** Utilizes transfer learning with the `distilbert/distilroberta-base` model, enhanced with LoRA (Low-Rank Adaptation) for efficient fine-tuning.
+ 5. **Training:** Fine-tunes the model on a subset of the data, using stratified splits and advanced training arguments for optimal performance.
+ 6. **Evaluation:** Assesses the model using accuracy, precision, recall, and F1-score on a held-out test set.
+ 7. **Deployment:** Demonstrates how to push the trained model and tokenizer to Hugging Face Hub for sharing and reuse.
+
+ ## Model Results
+
+ ### Preprocessing
+
+ - **Lowercasing and Stripping:** All text is converted to lowercase and stripped of extra whitespace.
+ - **Punctuation and Stopword Removal:** Punctuation is removed, and stopwords are filtered out to focus on meaningful content.
+ - **Noise Filtering:** Regular expressions are used to remove URLs, emails, hashtags, mentions, and numbers.
+ - **Outlier Filtering:** Texts that are extremely short or long (based on quantiles) are removed to ensure consistent input lengths for the model.
+ - **Deduplication:** Duplicate texts are dropped to prevent data leakage.
+
+ ### Training
+
+ - **Model Architecture:** The project uses `distilbert/distilroberta-base`, a distilled version of RoBERTa, known for its efficiency and strong performance on text classification tasks.
+ - **LoRA Fine-Tuning:** LoRA (Low-Rank Adaptation) is applied to reduce the number of trainable parameters, making the fine-tuning process more memory- and compute-efficient without sacrificing accuracy.
+ - **Training Arguments:** The model is trained for 2 epochs with early stopping, regular evaluation, and checkpointing. Batch sizes and learning rates are carefully chosen for stability and speed.
+
+ ### Evaluation
+
+ - **Metrics:** The model is evaluated using accuracy, precision, recall, and F1-score. These metrics provide a comprehensive view of the classifier's performance, especially in distinguishing between the two classes.
+ - **Results:** The fine-tuned model demonstrates strong performance, with high accuracy and balanced precision/recall, indicating its effectiveness in real-world scenarios.
+
+ ## How to Install
+
+ Follow these steps to set up the environment using Python's built-in `venv`:
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/DeepActionPotential/FineTextTector
+ cd FineTextTector
+
+ # Create a virtual environment
+ python -m venv venv
+
+ # Activate the virtual environment
+ # On Windows:
+ venv\Scripts\activate
+ # On macOS/Linux:
+ source venv/bin/activate
+
+
+ # Install required packages
+ pip install -r requirements.txt
+ ```
+
+
+
+ ## How to Use the Software
+
+ - ## [Demo-video](images/FineTextTector.mp4)
+ - ![Demo-image](images/1.png)
+ - ![Demo-image](images/2.png)
+
+ ## Technologies Used
+
+ - **Transformers (Hugging Face):** Core library for model loading, tokenization, and training. Enables transfer learning with state-of-the-art NLP models.
+ - **Datasets (Hugging Face):** Efficient data handling, splitting, and preprocessing.
+ - **PEFT (Parameter-Efficient Fine-Tuning):** Implements LoRA for memory- and compute-efficient model adaptation.
+ - **Optuna:** Automated hyperparameter optimization to fine-tune model performance.
+ - **Scikit-learn:** Data splitting, metrics calculation, and utility functions.
+ - **Seaborn & Matplotlib:** Data visualization for EDA and result interpretation.
+ - **NLTK:** Stopword lists and basic NLP utilities.
+ - **Python venv:** Isolated environment management for reproducible installations.
+
+ These technologies collectively enable efficient, scalable, and reproducible development of advanced NLP models.
+
+ ## License
+
+ This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
+
+ ---
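
Note on the README's Preprocessing section: the cleaning steps it lists (lowercasing, noise filtering, punctuation and stopword removal, deduplication) can be sketched in plain Python. This is a minimal illustration, not the notebook's code. The function names `clean_text` and `drop_duplicates`, the exact regular expressions, and the tiny `STOPWORDS` set are all assumptions made for the example; the README names NLTK as the real stopword source, and quantile-based outlier filtering is left out here.

```python
import re
import string

# Tiny stand-in stopword list (assumption); the README names NLTK as the real source.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "or", "in", "at", "on", "me"}

# Patterns for the noise types the README lists; the exact regexes are assumptions.
# Emails are removed before mentions so "bob@mail.com" is not split at the "@".
NOISE_PATTERNS = [
    re.compile(r"https?://\S+|www\.\S+"),  # URLs
    re.compile(r"\S+@\S+\.\S+"),           # emails
    re.compile(r"[#@]\w+"),                # hashtags and mentions
    re.compile(r"\d+"),                    # numbers
]

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """Lowercase, strip noise and punctuation, and drop stopwords."""
    text = text.lower().strip()
    for pattern in NOISE_PATTERNS:
        text = pattern.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


def drop_duplicates(texts):
    """Keep the first occurrence of each text, preserving order."""
    seen = set()
    unique = []
    for t in texts:
        if t not in seen:
            seen.add(t)
            unique.append(t)
    return unique


if __name__ == "__main__":
    sample = "  Visit https://example.com NOW, or email me at bob@mail.com! #AI @user 42 times.  "
    print(clean_text(sample))
```

Deduplicating after cleaning, rather than before, also catches texts that differ only in case, punctuation, or removed noise; whether the notebook orders the steps this way is not stated in the README.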