Upload 5 files
- README.md +85 -14
- app.py +54 -0
- requirements.txt +7 -0
- styles.css +67 -0
- utils.py +50 -0
README.md
CHANGED
@@ -1,14 +1,85 @@
# AI Text Detector

A Streamlit-based application that helps identify whether a text was generated by AI or written by a human, using a machine learning classifier.

## Features

- Real-time text classification
- Minimum word count validation (100 words)
- User-friendly web interface
- Text preprocessing pipeline
- Clear visual feedback for results

## Demo

The application provides a simple interface for checking text. Here's how it works:

### 1. Input Text



The main interface features a large text area where you can paste or type the text you want to check. The application requires at least 100 words for accurate classification.

### 2. Results



After you submit the text, the application processes it and displays whether it appears to be human-written or AI-generated. The results are shown with clear visual indicators and informative messages.

## Setup

1. Create and activate a virtual environment:
   ```bash
   # Create virtual environment
   python -m venv venv

   # Activate virtual environment
   # Windows
   .\venv\Scripts\activate
   # Linux/macOS
   source venv/bin/activate
   ```

2. Install the required dependencies:
   ```bash
   pip install -r requirements.txt
   ```

3. Run the application:
   ```bash
   python run.py
   ```

4. Open your web browser and navigate to `http://localhost:8501`

## Technical Details

The application uses a machine learning model trained to distinguish between AI-generated and human-written text. The preprocessing pipeline includes:

- Lowercasing
- Punctuation removal
- Stopword removal
- URL and email removal
- Number removal
- Non-printable character removal

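The preprocessing steps listed above can be illustrated with a short standard-library sketch. This is not the project's implementation: `preprocess_sketch` and `SKETCH_STOPWORDS` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list, and the step order is an assumption chosen so that URLs and emails are removed while their punctuation is still present.

```python
import re
import string

# Hypothetical, deliberately tiny stopword set used only for this sketch;
# the application itself relies on NLTK's English stopword list.
SKETCH_STOPWORDS = {"a", "an", "and", "at", "is", "me", "the", "to"}


def preprocess_sketch(text: str) -> str:
    """Apply the listed preprocessing steps to a single string."""
    text = text.lower()                                   # lowercasing
    text = re.sub(r"http\S+|www\.\S+", " ", text)         # URL removal
    text = re.sub(r"\S+@\S+\.\S+", " ", text)             # email removal
    text = re.sub(r"\d+", " ", text)                      # number removal
    # Punctuation removal runs after the URL and email patterns have been applied
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Non-printable character removal (ordinary whitespace is kept for splitting)
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    # Stopword removal; split/join also collapses repeated whitespace
    words = [word for word in text.split() if word not in SKETCH_STOPWORDS]
    return " ".join(words)


sample = "Email me at jane@example.com or visit https://example.com, then read 3 pages!"
print(preprocess_sketch(sample))  # email or visit then read pages
```

Keeping each step on its own line makes it easy to reorder or drop steps when experimenting, but any change to preprocessing would need to be matched by the data the model was trained on.
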
## Model Training

The machine learning model used in this application was trained in the Jupyter notebook [generated-text-classification.ipynb](generated-text-classification.ipynb).

The trained model is saved as `models/best_model.joblib` and is loaded automatically when the application starts.

The model achieves 100% accuracy and an F1-score of 100%, but only on data similar to the training dataset, so it struggles to generalize to other kinds of text. Within that distribution, it distinguishes AI-generated from human-written text very reliably.

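For readers unfamiliar with the two metrics quoted above, the following sketch computes accuracy and F1 from confusion-matrix counts. The function name `accuracy_and_f1` is hypothetical, and the counts are made up for illustration; they are not this project's evaluation results.

```python
from typing import Tuple


def accuracy_and_f1(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float]:
    """Return (accuracy, F1) for the positive class from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # F1 is the harmonic mean of precision and recall, which simplifies to
    # 2*TP / (2*TP + FP + FN).
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Hypothetical counts: no errors at all gives 100% accuracy and an F1 of 1.0.
print(accuracy_and_f1(tp=50, fp=0, fn=0, tn=50))  # (1.0, 1.0)
# Hypothetical counts: a few errors lower both metrics.
print(accuracy_and_f1(tp=45, fp=5, fn=5, tn=45))  # (0.9, 0.9)
```

A perfect score on both metrics only says that no example in the evaluated set was misclassified; it says nothing about text drawn from a different distribution.
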
## Requirements

- Python 3.8+
- pip
- All dependencies listed in [requirements.txt](requirements.txt)

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.
app.py
ADDED
@@ -0,0 +1,54 @@
import nltk
import streamlit as st

# Download the stopword corpus before importing utils, which reads it at import time
nltk.download('stopwords')

from utils import load_model, preprocess_text

model = load_model('./models/best_model.joblib')

min_words_number = 100

def check_generated_text(text):
    """Return True if the model classifies the text as human-written."""
    filtered_text = preprocess_text(text)
    prediction = model.predict([filtered_text])
    # Label 0 is treated as human-written, any other label as AI-generated
    return not int(prediction[0])

# Load styles
with open("styles.css") as f:
    st.markdown(f"<style>{f.read()}</style>", unsafe_allow_html=True)

# Title
st.title("Generated Text Checker")

# Initialize session state
if "check_clicked" not in st.session_state:
    st.session_state.check_clicked = False

# Use a form to isolate the check action
with st.form("text_check_form"):
    user_input = st.text_area(
        "Enter text to check",
        height=400,
        placeholder=f"Paste your generated text here... it should be at least {min_words_number} words"
    )
    submitted = st.form_submit_button("Check text")

# Handle form submission
if submitted:
    st.session_state.check_clicked = True

# Only run check when button is clicked
if st.session_state.check_clicked:
    with st.spinner("Checking text..."):
        current_length = len(user_input.split())

        if current_length >= min_words_number:
            result = check_generated_text(user_input)
            if result:
                st.info("✅ The text appears to be human-written!")
            else:
                st.info("🤖 The text appears to be AI-generated.")
        else:
            st.warning(f"Please enter at least {min_words_number} words.")

    # Reset check state
    st.session_state.check_clicked = False
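The branching in app.py (word-count check first, then the model's label mapped to a message) could be pulled into a plain function so that it can be tested without Streamlit. The following is a minimal sketch under that assumption: `classify_text`, `StubModel`, `MIN_WORDS`, and the returned strings are hypothetical names, and the stub stands in for the joblib model only to keep the example self-contained.

```python
MIN_WORDS = 100  # same threshold as min_words_number in app.py


class StubModel:
    """Hypothetical stand-in for the trained model: flags texts containing 'robot'."""

    def predict(self, texts):
        return [1 if "robot" in text else 0 for text in texts]


def classify_text(text, model, preprocess=str.lower, min_words=MIN_WORDS):
    """Return 'too_short', 'human', or 'ai' for the given text."""
    if len(text.split()) < min_words:
        return "too_short"
    prediction = model.predict([preprocess(text)])
    # Mirrors app.py: label 0 means human-written, anything else AI-generated
    return "human" if int(prediction[0]) == 0 else "ai"


model = StubModel()
print(classify_text("too few words", model))   # too_short
print(classify_text("word " * 100, model))     # human
print(classify_text("robot " * 100, model))    # ai
```

With this split, the Streamlit code would only need to translate the three return values into `st.warning` or `st.info` calls.
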
requirements.txt
ADDED
@@ -0,0 +1,7 @@
streamlit==1.30.1
nltk==3.8.1
scikit-learn==1.3.2
joblib==1.3.2
pandas==2.1.4
numpy==1.26.2
python-dotenv==1.0.0
styles.css
ADDED
@@ -0,0 +1,67 @@
#MainMenu, header, footer {
    visibility: hidden;
}

.stApp {
    background-color: #343541 !important;
    color: #ECECEF !important;
}

.stTextArea>div>div>textarea {
    background-color: #40414F !important;
    color: #ECECEF !important;
    border-radius: 8px !important;
    padding: 16px !important;
    border: 1px solid #565869 !important;
    font-size: 16px !important;
    min-height: 300px !important;
}

.stTextArea>label {
    color: #ECECEF !important;
    font-size: 18px !important;
}

.stButton>button {
    background-color: #19C37D !important;
    color: white !important;
    border: none !important;
    border-radius: 8px !important;
    padding: 12px 24px !important;
    font-size: 16px !important;
    font-weight: 500 !important;
    transition: background-color 0.3s ease !important;
}

.stButton>button:hover {
    background-color: #15A46C !important;
    color: white !important;
}

.stAlert {
    border-radius: 8px !important;
    padding: 16px !important;
}

.stAlert [data-testid="stMarkdownContainer"] {
    color: #ECECEF !important;
}

.stAlert.st-emotion-cache-1hyeoxa {
    background-color: rgba(25, 195, 125, 0.1) !important;
    border: 1px solid #19C37D !important;
}

.stAlert.st-emotion-cache-1d3z3hw {
    background-color: rgba(239, 65, 70, 0.1) !important;
    border: 1px solid #EF4146 !important;
}

.stTitle {
    color: #ECECEF !important;
    text-align: center !important;
    margin-bottom: 32px !important;
}
utils.py
ADDED
@@ -0,0 +1,50 @@
import re
import string

import joblib
from nltk.corpus import stopwords


def load_model(model_path):
    """
    Load a joblib model

    Args:
    - model_path (str): path to the model

    Returns:
    - model: loaded model
    """
    model = joblib.load(model_path)
    return model


# Set of English stopwords
stop_words = set(stopwords.words('english'))

def preprocess_text(text: str):
    """Normalize raw text before it is passed to the model."""
    # Step 1: Lowercase
    text = text.lower()

    # Step 2: Strip extra whitespace
    text = re.sub(r'\s+', ' ', text.strip())

    # Step 3: Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Step 4: Remove stopwords
    text = ' '.join(word for word in text.split() if word not in stop_words)

    # Step 5: Remove noise (URLs, emails, hashtags, mentions, numbers, non-printables)
    # Note: step 3 has already stripped '.', '@', and '#', so the www., email,
    # hashtag, and mention patterns below can no longer match. The order is left
    # unchanged so that the function's output stays the same.
    text = re.sub(r'http\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'\S+@\S+\.\S+', '', text)  # Emails
    text = re.sub(r'#[A-Za-z0-9_]+', '', text)  # Hashtags
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)  # Mentions
    text = re.sub(r'\d+', '', text)  # Numbers
    text = ''.join(ch for ch in text if ch.isprintable())  # Non-printables

    return text