---
title: AI Slop Detector
emoji: 🔍
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
---

# 🔍 AI Slop Detector

A comprehensive Python API and web UI for detecting AI-generated content in PDFs, DOCX files, and raw text. Uses an ensemble of multiple state-of-the-art detection methods.

## Features

✨ **Multi-Detector Ensemble**
- **RoBERTa Classifier** - Fine-tuned RoBERTa model for AI text detection
- **Perplexity Analysis** - Detects statistical anomalies and repetitive patterns
- **LLMDet** - Entropy and log-probability based detection
- **HuggingFace Classifier** - Generic transformer-based classification
- **OUTFOX Statistical** - Word/sentence length and vocabulary analysis

✨ **Easy Feature Flags**
- Enable/disable each detector with a single config change
- Adjust detector weights for ensemble averaging
- Environment variable overrides

✨ **Multiple File Formats**
- PDF documents
- DOCX/DOC files
- Plain text files
- Raw text input

✨ **Persistent Storage**
- SQLite database (default, configurable)
- Upload history with timestamps
- Detailed result tracking and statistics

✨ **Web UI**
- Beautiful, responsive interface
- Drag-and-drop file upload
- Real-time analysis results
- History and statistics views

✨ **REST API**
- Analyze text and files via HTTP
- Get historical results
- Query statistics
- Full result management

## Installation

### Prerequisites

- Python 3.8+
- pip or conda

### Setup

1. **Clone/navigate to the project:**

   ```bash
   cd slop-detect
   ```

2. **Create and activate a Python virtual environment:**

   ```bash
   python -m venv venv

   # On Windows:
   venv\Scripts\activate

   # On macOS/Linux:
   source venv/bin/activate
   ```

3. **Install dependencies:**

   ```bash
   pip install -r backend/requirements.txt
   ```

## Configuration

### Enable/Disable Detectors

Edit `backend/config/detectors_config.py`:

```python
ENABLED_DETECTORS: Dict[str, bool] = {
    "roberta": True,        # Enable RoBERTa
    "perplexity": True,     # Enable Perplexity
    "llmdet": True,         # Enable LLMDet
    "hf_classifier": True,  # Enable HF Classifier
    "outfox": False,        # Disable OUTFOX
}
```

### Set Detector Weights

```python
DETECTOR_WEIGHTS: Dict[str, float] = {
    "roberta": 0.30,        # 30% weight
    "perplexity": 0.25,     # 25% weight
    "llmdet": 0.25,         # 25% weight
    "hf_classifier": 0.20,  # 20% weight
    "outfox": 0.00,         # 0% weight (not used)
}
```

### Environment-based Configuration

You can also override the config with environment variables:

```bash
# Enable/disable detectors
export ENABLE_ROBERTA=true
export ENABLE_PERPLEXITY=true
export ENABLE_LLMDET=true
export ENABLE_HF_CLASSIFIER=true
export ENABLE_OUTFOX=false

# Database
export DATABASE_URL=sqlite:///slop_detect.db
export UPLOAD_FOLDER=./uploads

# Flask
export HOST=0.0.0.0
export PORT=5000
export DEBUG=False
```

## Running the Application

### Start the Flask Server

```bash
cd backend
python main.py
```

The API will be available at `http://localhost:5000`.

### API Endpoints

#### Health Check

```
GET /api/health
```

#### Analyze Text

```
POST /api/analyze/text
Content-Type: application/json

{
  "text": "Your text here...",
  "filename": "optional_name.txt",
  "user_id": "optional_user_id"
}

Response:
{
  "status": "success",
  "result_id": 1,
  "overall_ai_score": 0.78,
  "overall_ai_score_percentage": "78.0%",
  "overall_confidence": "high",
  "status_label": "Likely AI",
  "detector_results": {
    "roberta": {
      "detector_name": "roberta",
      "score": 0.85,
      "confidence": "high",
      "explanation": "Very strong indicators of AI-generated text..."
    },
    ...
  },
  "enabled_detectors": ["roberta", "perplexity", "llmdet", "hf_classifier"],
  "text_stats": {
    "character_count": 1500,
    "word_count": 250,
    "sentence_count": 15,
    "average_word_length": 4.8
  }
}
```

#### Analyze File

```
POST /api/analyze/file

FormData:
- file: the document to analyze
- user_id: optional user ID

Response: (same as analyze/text)
```

#### Get All Results

```
GET /api/results?page=1&limit=10&sort=recent

Response:
{
  "status": "success",
  "page": 1,
  "limit": 10,
  "total_count": 42,
  "results": [...]
}
```

#### Get Specific Result

```
GET /api/results/{result_id}

Response:
{
  "status": "success",
  "result": {
    "id": 1,
    "filename": "document.pdf",
    "overall_ai_score": 0.78,
    "overall_ai_score_percentage": "78.0%",
    ...
  }
}
```

#### Delete Result

```
DELETE /api/results/{result_id}
```

#### Update Result

```
PUT /api/results/{result_id}
Content-Type: application/json

{
  "notes": "Manual review: likely AI",
  "is_flagged": true
}
```

#### Get Statistics

```
GET /api/statistics/summary

Response:
{
  "status": "success",
  "summary": {
    "total_analyses": 42,
    "average_ai_score": 0.65,
    "total_text_analyzed": 125000,
    "likely_human": 15,
    "suspicious": 12,
    "likely_ai": 15
  }
}
```

#### Get Configuration

```
GET /api/config

Response:
{
  "status": "success",
  "config": {
    "enabled_detectors": ["roberta", "perplexity", "llmdet", "hf_classifier"],
    "aggregation_method": "weighted_average",
    "detector_weights": {...},
    "detector_info": {...}
  }
}
```

## Web Interface

Open `http://localhost:5000` in your browser to access the web UI.

### Features:

- **Upload Section** - Drag-and-drop or click to upload files
- **Text Analysis** - Paste text directly
- **Results Dashboard** - View detailed analysis results
- **History Tab** - See all previous analyses
- **Statistics Tab** - View aggregate statistics

## How It Works

### Detection Process

1. **File Parsing** - Extracts text from PDF/DOCX/TXT files
2. **Text Cleaning** - Normalizes whitespace and formatting
3. **Detector Ensemble** - Runs enabled detectors in parallel
4. **Score Aggregation** - Combines detector scores using weighted average, max, or voting
5. **Result Storage** - Saves to the database with full metadata
6. **Response** - Returns the overall score and a per-detector breakdown

### Detector Details

#### RoBERTa Detector
- **Model**: roberta-base-openai-detector
- **Type**: Transformer-based classification
- **Output**: 0-1 probability score
- **Speed**: Medium

#### Perplexity Detector
- **Model**: GPT-2
- **Method**: Analyzes token probability distributions
- **Detects**: Repetitive patterns, unusual word choices
- **Output**: 0-1 score based on perplexity, repetition, and AI-typical phrases

#### LLMDet Detector
- **Model**: BERT
- **Method**: Entropy and log-probability analysis
- **Detects**: Predictable sequences, unusual statistical patterns
- **Output**: 0-1 score from combined metrics

#### HF Classifier
- **Model**: Configurable (default: BERT)
- **Type**: Generic sequence classification
- **Output**: 0-1 probability score

#### OUTFOX Statistical
- **Type**: Statistical signature analysis
- **Detects**: Unusual word-length distributions, sentence structure patterns, low vocabulary diversity
- **Output**: 0-1 score from multiple statistical metrics

### Scoring

Default aggregation: **Weighted Average**

```
Overall Score = Σ (normalized_detector_score × weight)
```

Each detector's score is normalized to the 0-1 range, then multiplied by its configured weight. The sum is clamped to [0, 1].
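The weighted-average step can be sketched in a few lines of Python. This is a minimal illustration of the formula above, not the project's actual `ensemble.py` code; the function name and the example scores are hypothetical:

```python
from typing import Dict

def aggregate_weighted_average(scores: Dict[str, float],
                               weights: Dict[str, float]) -> float:
    """Combine per-detector scores (each assumed normalized to [0, 1])
    into one overall score, clamping the weighted sum to [0, 1]."""
    total = sum(score * weights.get(name, 0.0) for name, score in scores.items())
    return max(0.0, min(1.0, total))

# Example with the default weights from detectors_config.py:
scores = {"roberta": 0.85, "perplexity": 0.70, "llmdet": 0.75, "hf_classifier": 0.80}
weights = {"roberta": 0.30, "perplexity": 0.25, "llmdet": 0.25, "hf_classifier": 0.20}
overall = aggregate_weighted_average(scores, weights)  # 0.255 + 0.175 + 0.1875 + 0.16 = 0.7775
```

Detectors missing from the weights dict (such as a disabled OUTFOX) contribute nothing, which matches assigning them a 0.00 weight.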
### Confidence Levels

- **Very Low** (< 20%) - Almost certainly human-written
- **Low** (20-40%) - Probably human-written
- **Medium** (40-60%) - Uncertain
- **High** (60-80%) - Probably AI-generated
- **Very High** (> 80%) - Almost certainly AI-generated

## Project Structure

```
slop-detect/
├── backend/
│   ├── config/
│   │   ├── settings.py            # App settings
│   │   └── detectors_config.py    # Detector configuration (FEATURE FLAGS HERE)
│   ├── detectors/
│   │   ├── base.py                # Base detector class
│   │   ├── roberta.py             # RoBERTa detector
│   │   ├── perplexity.py          # Perplexity detector
│   │   ├── llmdet.py              # LLMDet detector
│   │   ├── hf_classifier.py       # HF classifier
│   │   ├── outfox.py              # OUTFOX detector
│   │   └── ensemble.py            # Ensemble manager
│   ├── database/
│   │   ├── models.py              # SQLAlchemy models
│   │   └── db.py                  # Database manager
│   ├── api/
│   │   ├── routes.py              # Flask API routes
│   │   └── models.py              # Pydantic request/response models
│   ├── utils/
│   │   ├── file_parser.py         # PDF/DOCX/TXT parsing
│   │   └── highlighter.py         # Text highlighting utilities
│   ├── main.py                    # Flask app entry point
│   └── requirements.txt           # Python dependencies
├── frontend/
│   └── index.html                 # Web UI (HTML + CSS + JS)
└── README.md                      # This file
```

## Customization

### Change Detector Weights

In `backend/config/detectors_config.py`:

```python
DETECTOR_WEIGHTS: Dict[str, float] = {
    "roberta": 0.40,  # Increase weight
    "perplexity": 0.30,
    "llmdet": 0.20,
    "hf_classifier": 0.10,
}
```

### Change Aggregation Method

In `backend/config/detectors_config.py`:

```python
AGGREGATION_METHOD = "max"  # Options: weighted_average, max, voting
```

### Use Different Models

In `backend/config/detectors_config.py`:

```python
ROBERTA_MODEL = "distilbert-base-uncased"
PERPLEXITY_MODEL = "gpt2-medium"
HF_CLASSIFIER_MODEL = "your-custom-model"
```

### Add Custom Detectors

1. Create a new file in `backend/detectors/`
2. Inherit from `BaseDetector`
3. Implement the `detect()` method
4. Add it to the `ensemble.py` initialization
5. Add it to `ENABLED_DETECTORS` in the config

Example:

```python
from detectors.base import BaseDetector, DetectorResult

class CustomDetector(BaseDetector):
    def __init__(self):
        super().__init__(name="custom")

    def detect(self, text: str) -> DetectorResult:
        # Your detection logic here
        score = calculate_ai_score(text)
        return DetectorResult(
            detector_name=self.name,
            score=score,
            explanation="Custom detection result"
        )
```

## Performance Tips

1. **Model Caching** - Models are lazy-loaded and cached in memory
2. **Parallel Detection** - Detectors can run in parallel (future enhancement)
3. **Batch Processing** - Configure the batch size for GPU processing
4. **Disable Unused Detectors** - Reduce load by disabling detectors you don't need

## Troubleshooting

### Slow First Run

- Models need to be downloaded from the Hugging Face Hub
- Subsequent runs will use the cached models
- The first model download can take 1-5 minutes

### Out of Memory

- Reduce the batch size in the config
- Disable memory-intensive detectors
- Run on a machine with more RAM

### Model Not Found

```
transformers.utils.RepositoryNotFoundError: Model not found
```

- The model name in the config is incorrect
- Check the Hugging Face Hub for the correct model name

### Database Locked

```
sqlite3.OperationalError: database is locked
```

- Close other connections to the database
- Ensure only one Flask instance is running
- Delete the `.db-journal` file if present

## Future Enhancements

- [ ] Parallel detector execution
- [ ] GPU support optimization
- [ ] Custom model fine-tuning
- [ ] Batch analysis API
- [ ] User authentication/authorization
- [ ] Document highlighting of suspicious sections
- [ ] Advanced filtering and search
- [ ] Export results to PDF/Excel
- [ ] API rate limiting
- [ ] Webhook notifications

## License

MIT License - feel free to use and modify.

## References

- [LLMDet](https://github.com/TrustedLLM/LLMDet)
- [RAID](https://github.com/liamdugan/raid)
- [OUTFOX](https://github.com/ryuryukke/OUTFOX)
- [AIGTD Survey](https://github.com/Nicozwy/AIGTD-Survey)
- [Plagiarism Detection](https://github.com/Kyle6012/plagiarism-detection)
- [Hugging Face Transformers](https://huggingface.co/transformers/)

## Support

For issues, questions, or suggestions, please open an issue on the project repository.
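For scripted use, the `/api/analyze/text` endpoint documented above can be called with only the Python standard library. This is a sketch under the assumption that the Flask server from "Running the Application" is listening on `localhost:5000`; the `analyze_text` helper name is illustrative, not part of the project:

```python
import json
from urllib.request import Request, urlopen

def analyze_text(text: str, base_url: str = "http://localhost:5000") -> dict:
    """POST text to /api/analyze/text and return the parsed JSON response."""
    req = Request(
        f"{base_url}/api/analyze/text",
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Generous timeout: the first request may block while models download.
    with urlopen(req, timeout=300) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A call such as `analyze_text("Some sample paragraph...")` should return a dict matching the response schema above, e.g. with `"overall_ai_score_percentage"` and `"status_label"` keys.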