DeepActionPotential committed on
Commit 3c4cf68 · verified · 1 Parent(s): acfb97e

Upload 5 files

Files changed (5)
  1. README.md +85 -14
  2. app.py +54 -0
  3. requirements.txt +7 -0
  4. styles.css +67 -0
  5. utils.py +50 -0
README.md CHANGED
@@ -1,14 +1,85 @@
- ---
- title: Textector
- emoji: 📊
- colorFrom: yellow
- colorTo: red
- sdk: streamlit
- sdk_version: 1.44.1
- app_file: app.py
- pinned: false
- license: mit
- short_description: Textector Is a lightweight text calssifier.
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # AI Text Detector
+
+ A Streamlit-based web application that identifies whether text was generated by AI or written by a human, backed by a machine-learning classifier.
+
+ ## Features
+
+ - Real-time text classification
+ - Minimum word-count validation (100 words)
+ - User-friendly web interface
+ - Text preprocessing pipeline
+ - Clear visual feedback for results
+
+ ## Demo
+
+ The application provides a simple yet powerful interface for checking text. Here's how it works:
+
+ ### 1. Input Text
+
+ ![Input Interface](images/1.png)
+
+ The main interface features a large text area where you can paste or type the text you want to check. The application requires a minimum of 100 words for accurate classification.
+
+ ### 2. Results
+
+ ![Results](images/2.png)
+
+ After you submit the text, the application processes it and reports whether it appears to be human-written or AI-generated, with clear visual indicators and informative messages.
+
+ ## Setup
+
+ 1. Create and activate a virtual environment:
+ ```bash
+ # Create virtual environment
+ python -m venv venv
+
+ # Activate virtual environment
+ # Windows
+ .\venv\Scripts\activate
+ # Linux/macOS
+ source venv/bin/activate
+ ```
+
+ 2. Install the required dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 3. Run the application:
+ ```bash
+ streamlit run app.py
+ ```
+
+ 4. Open your web browser and navigate to `http://localhost:8501`
+
+ ## Technical Details
+
+ The application uses a machine learning model trained to distinguish between AI-generated and human-written text. The preprocessing pipeline includes:
+ - Lowercasing
+ - Punctuation removal
+ - Stopword removal
+ - URL and email removal
+ - Number removal
+ - Non-printable character removal
+
+ ## Model Training
+
+ The machine learning model used in this application was trained using the Jupyter notebook [generated-text-classification.ipynb](generated-text-classification.ipynb).
+
+ The trained model is saved as `models/best_model.joblib` and is loaded automatically when the application starts.
+
+ The model achieves 100% accuracy and an F1-score of 1.0 on the training dataset, but that performance is limited to data resembling the training distribution, so it struggles to generalize to other kinds of text. Within that domain, however, it distinguishes AI-generated from human-written text very reliably.
+
+ ## Requirements
+
+ - Python 3.8+
+ - pip
+ - All dependencies listed in [requirements.txt](requirements.txt)
+
+ ## Contributing
+
+ Contributions are welcome! Please feel free to submit a Pull Request.
+
+ ## License
+
+ This project is licensed under the MIT License - see the LICENSE file for details.
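The minimum word-count validation listed under Features reduces to a simple whitespace-split word count. A minimal sketch of that check (the helper name `is_long_enough` is hypothetical; the 100-word threshold comes from the app):

```python
MIN_WORDS = 100  # threshold described in the README


def is_long_enough(text: str) -> bool:
    # Same whitespace-split word count the app applies before classifying
    return len(text.split()) >= MIN_WORDS


print(is_long_enough("too short"))            # False
print(is_long_enough(" ".join(["w"] * 150)))  # True
```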
app.py ADDED
@@ -0,0 +1,54 @@
+ import streamlit as st
+ from utils import load_model, preprocess_text
+ import nltk
+
+ nltk.download('stopwords')
+ model = load_model('./models/best_model.joblib')
+
+ min_words_number = 100
+
+ def check_generated_text(text):
+     filtered_text = preprocess_text(text)
+     prediction = model.predict([filtered_text])
+     return not int(prediction[0])
+
+ # Load styles
+ with open("styles.css") as f:
+     st.markdown(f"<style>{f.read()}</style>", unsafe_allow_html=True)
+
+ # Title
+ st.title("Generated Text Checker")
+
+ # Initialize session state
+ if "check_clicked" not in st.session_state:
+     st.session_state.check_clicked = False
+
+ # Use a form to isolate the check action
+ with st.form("text_check_form"):
+     user_input = st.text_area(
+         "Enter text to check",
+         height=400,
+         placeholder=f"Paste your text here... it should be at least {min_words_number} words"
+     )
+     submitted = st.form_submit_button("Check text")
+
+ # Handle form submission
+ if submitted:
+     st.session_state.check_clicked = True
+
+ # Only run the check when the button was clicked
+ if st.session_state.check_clicked:
+     with st.spinner("Checking text..."):
+         current_length = len(user_input.split())
+
+         if current_length >= min_words_number:
+             result = check_generated_text(user_input)
+             if result:
+                 st.info("✅ The text appears to be human-written!")
+             else:
+                 st.info("🤖 The text appears to be AI-generated.")
+         else:
+             st.warning(f"Please enter at least {min_words_number} words.")
+
+     # Reset check state
+     st.session_state.check_clicked = False
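The `not int(prediction[0])` in `check_generated_text` inverts the model's label, which appears to assume the classifier outputs 1 for AI-generated and 0 for human-written text. A small standalone sketch of that mapping (the helper name `label_from_prediction` is hypothetical):

```python
def label_from_prediction(pred) -> str:
    # Mirrors `not int(prediction[0])`: 0 -> human-written, 1 -> AI-generated
    return "human-written" if not int(pred) else "AI-generated"


print(label_from_prediction(0))  # human-written
print(label_from_prediction(1))  # AI-generated
```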
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ streamlit==1.30.1
+ nltk==3.8.1
+ scikit-learn==1.3.2
+ joblib==1.3.2
+ pandas==2.1.4
+ numpy==1.26.2
+ python-dotenv==1.0.0
styles.css ADDED
@@ -0,0 +1,67 @@
+ #MainMenu, header, footer {
+     visibility: hidden;
+ }
+
+ .stApp {
+     background-color: #343541 !important;
+     color: #ECECEF !important;
+ }
+
+ .stTextArea>div>div>textarea {
+     background-color: #40414F !important;
+     color: #ECECEF !important;
+     border-radius: 8px !important;
+     padding: 16px !important;
+     border: 1px solid #565869 !important;
+     font-size: 16px !important;
+     min-height: 300px !important;
+ }
+
+ .stTextArea>label {
+     color: #ECECEF !important;
+     font-size: 18px !important;
+ }
+
+ .stButton>button {
+     background-color: #19C37D !important;
+     color: white !important;
+     border: none !important;
+     border-radius: 8px !important;
+     padding: 12px 24px !important;
+     font-size: 16px !important;
+     font-weight: 500 !important;
+     transition: background-color 0.3s ease !important;
+ }
+
+ .stButton>button:hover {
+     background-color: #15A46C !important;
+     color: white !important;
+ }
+
+ .stAlert {
+     border-radius: 8px !important;
+     padding: 16px !important;
+ }
+
+ .stAlert [data-testid="stMarkdownContainer"] {
+     color: #ECECEF !important;
+ }
+
+ .stAlert.st-emotion-cache-1hyeoxa {
+     background-color: rgba(25, 195, 125, 0.1) !important;
+     border: 1px solid #19C37D !important;
+ }
+
+ .stAlert.st-emotion-cache-1d3z3hw {
+     background-color: rgba(239, 65, 70, 0.1) !important;
+     border: 1px solid #EF4146 !important;
+ }
+
+ .stTitle {
+     color: #ECECEF !important;
+     text-align: center !important;
+     margin-bottom: 32px !important;
+ }
utils.py ADDED
@@ -0,0 +1,50 @@
+ import re
+ import string
+
+ import joblib
+ from nltk.corpus import stopwords
+
+
+ def load_model(model_path):
+     """
+     Load a joblib model.
+
+     Args:
+         model_path (str): path to the model
+
+     Returns:
+         model: loaded model
+     """
+     model = joblib.load(model_path)
+     return model
+
+
+ # Set of English stopwords
+ stop_words = set(stopwords.words('english'))
+
+ def preprocess_text(text: str):
+     # Step 1: Lowercase
+     text = text.lower()
+
+     # Step 2: Strip extra whitespace
+     text = re.sub(r'\s+', ' ', text.strip())
+
+     # Step 3: Remove noise before stripping punctuation, since the URL,
+     # email, hashtag, and mention patterns rely on ':', '@', '.', and '#'
+     text = re.sub(r'http\S+|www\.\S+', '', text)  # URLs
+     text = re.sub(r'\S+@\S+\.\S+', '', text)      # Emails
+     text = re.sub(r'#[A-Za-z0-9_]+', '', text)    # Hashtags
+     text = re.sub(r'@[A-Za-z0-9_]+', '', text)    # Mentions
+
+     # Step 4: Remove punctuation and numbers
+     text = text.translate(str.maketrans('', '', string.punctuation))
+     text = re.sub(r'\d+', '', text)
+
+     # Step 5: Remove stopwords
+     text = ' '.join(word for word in text.split() if word not in stop_words)
+
+     # Step 6: Remove non-printable characters
+     text = ''.join(ch for ch in text if ch.isprintable())
+
+     return text
+
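Ordering matters in a pipeline like `preprocess_text`: the URL, email, hashtag, and mention patterns depend on characters such as ':', '@', '.', and '#', so noise removal only works if it runs before punctuation stripping. A self-contained demonstration (the sample string is made up for illustration):

```python
import re
import string

SAMPLE = "Contact me at user@example.com or visit https://example.com #news"

def remove_noise(text):
    # Same noise patterns as preprocess_text: URLs, emails, hashtags, mentions
    text = re.sub(r'http\S+|www\.\S+', '', text)
    text = re.sub(r'\S+@\S+\.\S+', '', text)
    text = re.sub(r'#[A-Za-z0-9_]+', '', text)
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)
    return text

def strip_punct(text):
    return text.translate(str.maketrans('', '', string.punctuation))

# Noise removal first: the email, URL, and hashtag disappear as intended
clean = strip_punct(remove_noise(SAMPLE))

# Punctuation first: '@', ':' and '.' are already gone, so the email
# pattern can no longer match and "userexamplecom" leaks through
leaky = remove_noise(strip_punct(SAMPLE))

print(clean)
print(leaky)
```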