---
title: AI Slop Detector
emoji: 🔍
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
---

# 🔍 AI Slop Detector

A comprehensive Python API and web UI for detecting AI-generated content in PDFs, DOCX files, and raw text. Uses an ensemble of multiple state-of-the-art detection methods.

## Features

✨ **Multi-Detector Ensemble**
- **RoBERTa Classifier** - Fine-tuned RoBERTa model for AI text detection
- **Perplexity Analysis** - Detects statistical anomalies and repetitive patterns
- **LLMDet** - Entropy and log-probability based detection
- **HuggingFace Classifier** - Generic transformer-based classification
- **OUTFOX Statistical** - Word/sentence length and vocabulary analysis

✨ **Easy Feature Flags**
- Enable/disable each detector with a single config change
- Adjust detector weights for ensemble averaging
- Environment variable overrides

✨ **Multiple File Formats**
- PDF documents
- DOCX/DOC files
- Plain text files
- Raw text input

✨ **Persistent Storage**
- SQLite database (default, configurable)
- Upload history with timestamps
- Detailed result tracking and statistics

✨ **Web UI**
- Beautiful, responsive interface
- Drag-and-drop file upload
- Real-time analysis results
- History and statistics views

✨ **REST API**
- Analyze text and files via HTTP
- Get historical results
- Query statistics
- Full result management

## Installation

### Prerequisites
- Python 3.8+
- pip or conda

### Setup

1. **Clone/Navigate to the project:**
```bash
cd slop-detect
```

2. **Create a Python virtual environment:**
```bash
python -m venv venv

# On Windows:
venv\Scripts\activate

# On macOS/Linux:
source venv/bin/activate
```

3. **Install dependencies:**
```bash
pip install -r backend/requirements.txt
```
## Configuration

### Enable/Disable Detectors

Edit `backend/config/detectors_config.py`:

```python
ENABLED_DETECTORS: Dict[str, bool] = {
    "roberta": True,         # Enable RoBERTa
    "perplexity": True,      # Enable Perplexity
    "llmdet": True,          # Enable LLMDet
    "hf_classifier": True,   # Enable HF Classifier
    "outfox": False,         # Disable OUTFOX
}
```
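
The ensemble then skips anything switched off. A minimal sketch of how such flags gate execution (the `run_enabled` helper and the stub detectors are hypothetical, not the project's actual code):

```python
from typing import Callable, Dict

ENABLED_DETECTORS: Dict[str, bool] = {
    "roberta": True,
    "perplexity": True,
    "outfox": False,
}

def run_enabled(detectors: Dict[str, Callable[[str], float]], text: str) -> Dict[str, float]:
    """Run only the detectors whose feature flag is True."""
    return {
        name: fn(text)
        for name, fn in detectors.items()
        if ENABLED_DETECTORS.get(name, False)
    }

# Stub detectors standing in for the real models
detectors = {
    "roberta": lambda t: 0.8,
    "perplexity": lambda t: 0.6,
    "outfox": lambda t: 0.4,  # flag is False above, so this is never called
}
scores = run_enabled(detectors, "some input text")
```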

### Set Detector Weights

```python
DETECTOR_WEIGHTS: Dict[str, float] = {
    "roberta": 0.30,        # 30% weight
    "perplexity": 0.25,     # 25% weight
    "llmdet": 0.25,         # 25% weight
    "hf_classifier": 0.20,  # 20% weight
    "outfox": 0.00,         # 0% weight (not used)
}
```

### Environment-based Configuration

You can also use environment variables to override the config:

```bash
# Enable/disable detectors
export ENABLE_ROBERTA=true
export ENABLE_PERPLEXITY=true
export ENABLE_LLMDET=true
export ENABLE_HF_CLASSIFIER=true
export ENABLE_OUTFOX=false

# Database
export DATABASE_URL=sqlite:///slop_detect.db
export UPLOAD_FOLDER=./uploads

# Flask
export HOST=0.0.0.0
export PORT=5000
export DEBUG=False
```
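
The exact parsing lives in the app's config module; a minimal sketch of how boolean overrides like these are typically read (`env_flag` is a hypothetical helper, not necessarily the project's):

```python
import os

def env_flag(name: str, default: bool) -> bool:
    """Read a boolean feature flag from the environment, falling back to the config default."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")

os.environ["ENABLE_OUTFOX"] = "false"
os.environ.pop("ENABLE_ROBERTA", None)  # ensure unset to demonstrate the default fallback

enable_outfox = env_flag("ENABLE_OUTFOX", default=True)    # overridden by the environment
enable_roberta = env_flag("ENABLE_ROBERTA", default=True)  # unset, so the default is used
```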

## Running the Application

### Start the Flask Server

```bash
cd backend
python main.py
```

The API will be available at `http://localhost:5000`.

### API Endpoints

#### Health Check
```
GET /api/health
```

#### Analyze Text
```
POST /api/analyze/text
Content-Type: application/json

{
  "text": "Your text here...",
  "filename": "optional_name.txt",
  "user_id": "optional_user_id"
}

Response:
{
  "status": "success",
  "result_id": 1,
  "overall_ai_score": 0.78,
  "overall_ai_score_percentage": "78.0%",
  "overall_confidence": "high",
  "status_label": "Likely AI",
  "detector_results": {
    "roberta": {
      "detector_name": "roberta",
      "score": 0.85,
      "confidence": "high",
      "explanation": "Very strong indicators of AI-generated text..."
    },
    ...
  },
  "enabled_detectors": ["roberta", "perplexity", "llmdet", "hf_classifier"],
  "text_stats": {
    "character_count": 1500,
    "word_count": 250,
    "sentence_count": 15,
    "average_word_length": 4.8
  }
}
```
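
The `text_stats` fields are simple derived counts. A rough sketch of how they can be computed (the word and sentence splitting heuristics here are assumptions, not necessarily the project's exact rules):

```python
import re

def text_stats(text: str) -> dict:
    """Compute the summary statistics returned in the `text_stats` response field."""
    words = re.findall(r"[A-Za-z']+", text)
    # Naive sentence split on ., !, ? followed by whitespace or end of text
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    return {
        "character_count": len(text),
        "word_count": len(words),
        "sentence_count": len(sentences),
        "average_word_length": round(sum(len(w) for w in words) / max(len(words), 1), 1),
    }

stats = text_stats("The cat sat. It was happy!")
```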

#### Analyze File
```
POST /api/analyze/file
FormData:
- file: <file>
- user_id: <optional_user_id>

Response: (same as analyze/text)
```

#### Get All Results
```
GET /api/results?page=1&limit=10&sort=recent

Response:
{
  "status": "success",
  "page": 1,
  "limit": 10,
  "total_count": 42,
  "results": [...]
}
```

#### Get Specific Result
```
GET /api/results/{result_id}

Response:
{
  "status": "success",
  "result": {
    "id": 1,
    "filename": "document.pdf",
    "overall_ai_score": 0.78,
    "overall_ai_score_percentage": "78.0%",
    ...
  }
}
```

#### Delete Result
```
DELETE /api/results/{result_id}
```

#### Update Result
```
PUT /api/results/{result_id}
Content-Type: application/json

{
  "notes": "Manual review: likely AI",
  "is_flagged": true
}
```

#### Get Statistics
```
GET /api/statistics/summary

Response:
{
  "status": "success",
  "summary": {
    "total_analyses": 42,
    "average_ai_score": 0.65,
    "total_text_analyzed": 125000,
    "likely_human": 15,
    "suspicious": 12,
    "likely_ai": 15
  }
}
```

#### Get Configuration
```
GET /api/config

Response:
{
  "status": "success",
  "config": {
    "enabled_detectors": [
      "roberta", "perplexity", "llmdet", "hf_classifier"
    ],
    "aggregation_method": "weighted_average",
    "detector_weights": {...},
    "detector_info": {...}
  }
}
```

## Web Interface

Open `http://localhost:5000` in your browser to access the web UI.

### Features
- **Upload Section** - Drag-and-drop or click to upload files
- **Text Analysis** - Paste text directly
- **Results Dashboard** - View detailed analysis results
- **History Tab** - See all previous analyses
- **Statistics Tab** - View aggregate statistics

## How It Works

### Detection Process

1. **File Parsing** - Extracts text from PDF/DOCX/TXT files
2. **Text Cleaning** - Normalizes whitespace and formatting
3. **Detector Ensemble** - Runs each enabled detector (parallel execution is a planned enhancement)
4. **Score Aggregation** - Combines detector scores using weighted average, max, or voting
5. **Result Storage** - Saves to database with full metadata
6. **Response** - Returns overall score and per-detector breakdown

### Detector Details

#### RoBERTa Detector
- **Model**: roberta-base-openai-detector
- **Type**: Transformer-based classification
- **Output**: 0-1 probability score
- **Speed**: Medium

#### Perplexity Detector
- **Model**: GPT-2
- **Method**: Analyzes token probability distributions
- **Detects**: Repetitive patterns, unusual word choices
- **Output**: 0-1 score based on perplexity, repetition, AI phrases

#### LLMDet Detector
- **Model**: BERT
- **Method**: Entropy and log-probability analysis
- **Detects**: Predictable sequences, unusual statistical patterns
- **Output**: 0-1 score from combined metrics

#### HF Classifier
- **Model**: Configurable (default: BERT)
- **Type**: Generic sequence classification
- **Output**: 0-1 probability score

#### OUTFOX Statistical
- **Type**: Statistical signature analysis
- **Detects**: Unusual word length distributions, sentence structure patterns, vocabulary diversity
- **Output**: 0-1 score from multiple statistical metrics
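
To illustrate the kind of signal a statistical detector uses, here is a toy version of one such metric, vocabulary diversity (type-token ratio). This illustrates the general technique only; it is not the project's actual OUTFOX implementation:

```python
def type_token_ratio(text: str) -> float:
    """Vocabulary diversity: unique words / total words (1.0 = no repetition at all)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)

# Repetitive text scores low; varied text scores high
repetitive = type_token_ratio("very very very good good good")
varied = type_token_ratio("statistical detectors measure vocabulary diversity")
```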

### Scoring

Default aggregation: **Weighted Average**
```
Overall Score = Σ (normalized_detector_score × weight)
```

Each detector's score is normalized to the 0-1 range, then multiplied by its configured weight. The sum is clamped to [0, 1].
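
The formula translates directly to code; a minimal sketch (the function name and clamping details are assumptions, not the project's exact implementation):

```python
def aggregate_weighted(scores: dict, weights: dict) -> float:
    """Weighted-average aggregation: sum of normalized score * weight, clamped to [0, 1]."""
    total = 0.0
    for name, score in scores.items():
        normalized = min(max(score, 0.0), 1.0)  # clamp each detector score to 0-1
        total += normalized * weights.get(name, 0.0)
    return min(max(total, 0.0), 1.0)

# Using the default weights from the configuration section
overall = aggregate_weighted(
    {"roberta": 0.85, "perplexity": 0.60, "llmdet": 0.70, "hf_classifier": 0.90},
    {"roberta": 0.30, "perplexity": 0.25, "llmdet": 0.25, "hf_classifier": 0.20},
)
```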

### Confidence Levels

- **Very Low** (< 20%) - Almost certainly human-written
- **Low** (20-40%) - Probably human-written
- **Medium** (40-60%) - Uncertain
- **High** (60-80%) - Probably AI-generated
- **Very High** (> 80%) - Almost certainly AI-generated
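
These bands map to a simple threshold function; a sketch (the handling of exact boundaries like 20% or 80% is an assumption):

```python
def confidence_label(score: float) -> str:
    """Map a 0-1 AI score to the confidence bands listed above."""
    if score < 0.2:
        return "very_low"
    if score < 0.4:
        return "low"
    if score < 0.6:
        return "medium"
    if score < 0.8:
        return "high"
    return "very_high"

label = confidence_label(0.78)  # matches the 78% example in the API response
```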

## Project Structure

```
slop-detect/
├── backend/
│   ├── config/
│   │   ├── settings.py           # App settings
│   │   └── detectors_config.py   # Detector configuration (FEATURE FLAGS HERE)
│   ├── detectors/
│   │   ├── base.py               # Base detector class
│   │   ├── roberta.py            # RoBERTa detector
│   │   ├── perplexity.py         # Perplexity detector
│   │   ├── llmdet.py             # LLMDet detector
│   │   ├── hf_classifier.py      # HF classifier
│   │   ├── outfox.py             # OUTFOX detector
│   │   └── ensemble.py           # Ensemble manager
│   ├── database/
│   │   ├── models.py             # SQLAlchemy models
│   │   └── db.py                 # Database manager
│   ├── api/
│   │   ├── routes.py             # Flask API routes
│   │   └── models.py             # Pydantic request/response models
│   ├── utils/
│   │   ├── file_parser.py        # PDF/DOCX/TXT parsing
│   │   └── highlighter.py        # Text highlighting utilities
│   ├── main.py                   # Flask app entry point
│   └── requirements.txt          # Python dependencies
├── frontend/
│   └── index.html                # Web UI (HTML + CSS + JS)
└── README.md                     # This file
```

## Customization

### Change Detector Weights

In `backend/config/detectors_config.py`:

```python
DETECTOR_WEIGHTS: Dict[str, float] = {
    "roberta": 0.40,        # Increase weight
    "perplexity": 0.30,
    "llmdet": 0.20,
    "hf_classifier": 0.10,
}
```

### Change Aggregation Method

In `backend/config/detectors_config.py`:

```python
AGGREGATION_METHOD = "max"  # Options: weighted_average, max, voting
```
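
For intuition, the three methods behave roughly as follows. This is a sketch of the general idea only; in particular the voting threshold of 0.5 is an assumption, and `weighted_average` mirrors the Scoring section above:

```python
def aggregate(scores: dict, weights: dict, method: str = "weighted_average") -> float:
    """Combine per-detector scores using the configured aggregation method."""
    if method == "weighted_average":
        return min(sum(s * weights.get(n, 0.0) for n, s in scores.items()), 1.0)
    if method == "max":
        # Most suspicious detector wins: one confident detector drives the result
        return max(scores.values())
    if method == "voting":
        # Fraction of detectors that call the text AI (assumed cutoff: score > 0.5)
        return sum(1 for s in scores.values() if s > 0.5) / len(scores)
    raise ValueError(f"unknown aggregation method: {method}")

scores = {"roberta": 0.85, "perplexity": 0.40, "llmdet": 0.70}
max_score = aggregate(scores, {}, method="max")
vote_score = aggregate(scores, {}, method="voting")
```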

### Use Different Models

In `backend/config/detectors_config.py`:

```python
ROBERTA_MODEL = "distilbert-base-uncased"
PERPLEXITY_MODEL = "gpt2-medium"
HF_CLASSIFIER_MODEL = "your-custom-model"
```

### Add Custom Detectors

1. Create a new file in `backend/detectors/`
2. Inherit from `BaseDetector`
3. Implement the `detect()` method
4. Add it to the `ensemble.py` initialization
5. Add it to `ENABLED_DETECTORS` in the config

Example:

```python
from detectors.base import BaseDetector, DetectorResult

class CustomDetector(BaseDetector):
    def __init__(self):
        super().__init__(name="custom")

    def detect(self, text: str) -> DetectorResult:
        # Your detection logic here; replace this placeholder score
        score = 0.5
        return DetectorResult(
            detector_name=self.name,
            score=score,
            explanation="Custom detection result"
        )
```

## Performance Tips

1. **Model Caching** - Models are lazy-loaded and cached in memory
2. **Parallel Detection** - Detectors can run in parallel (future enhancement)
3. **Batch Processing** - Configure batch size for GPU processing
4. **Disable Unused Detectors** - Reduce load by disabling detectors you don't need

## Troubleshooting

### Slow First Run
- Models need to be downloaded from the Hugging Face Hub
- Subsequent runs will use cached models
- The first model download can take 1-5 minutes

### Out of Memory
- Reduce the batch size in the config
- Disable memory-intensive detectors
- Run on a machine with more RAM

### Model Not Found
```
transformers.utils.RepositoryNotFoundError: Model not found
```
- The model name in the config is incorrect
- Check the Hugging Face Hub for the correct model name

### Database Locked
```
sqlite3.OperationalError: database is locked
```
- Close other connections to the database
- Ensure only one Flask instance is running
- Delete the `.db-journal` file if present

## Future Enhancements

- [ ] Parallel detector execution
- [ ] GPU support optimization
- [ ] Custom model fine-tuning
- [ ] Batch analysis API
- [ ] User authentication/authorization
- [ ] Document highlighting with suspicious sections
- [ ] Advanced filtering and search
- [ ] Export results to PDF/Excel
- [ ] API rate limiting
- [ ] Webhook notifications

## License

MIT License - feel free to use and modify

## References

- [LLMDet](https://github.com/TrustedLLM/LLMDet)
- [RAID](https://github.com/liamdugan/raid)
- [OUTFOX](https://github.com/ryuryukke/OUTFOX)
- [AIGTD Survey](https://github.com/Nicozwy/AIGTD-Survey)
- [Plagiarism Detection](https://github.com/Kyle6012/plagiarism-detection)
- [Hugging Face Transformers](https://huggingface.co/transformers/)

## Support

For issues, questions, or suggestions, please open an issue on the project repository.