---
title: AI Slop Detector
emoji: 🔍
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
---

# 🔍 AI Slop Detector

A comprehensive Python API and web UI for detecting AI-generated content in PDFs, DOCX files, and raw text. Uses an ensemble of multiple state-of-the-art detection methods.

## Features

✨ **Multi-Detector Ensemble**
- **RoBERTa Classifier** - Fine-tuned RoBERTa model for AI text detection
- **Perplexity Analysis** - Detects statistical anomalies and repetitive patterns
- **LLMDet** - Entropy and log-probability based detection
- **HuggingFace Classifier** - Generic transformer-based classification
- **OUTFOX Statistical** - Word/sentence length and vocabulary analysis

✨ **Easy Feature Flags**
- Enable/disable each detector with a single config change
- Adjust detector weights for ensemble averaging
- Environment variable overrides

✨ **Multiple File Formats**
- PDF documents
- DOCX/DOC files
- Plain text files
- Raw text input

✨ **Persistent Storage**
- SQLite database (default, configurable)
- Upload history with timestamps
- Detailed result tracking and statistics

✨ **Web UI**
- Beautiful, responsive interface
- Drag-and-drop file upload
- Real-time analysis results
- History and statistics views

✨ **REST API**
- Analyze text and files via HTTP
- Get historical results
- Query statistics
- Full result management

## Installation

### Prerequisites

- Python 3.8+
- pip or conda

### Setup

1. **Clone/navigate to the project:**

   ```bash
   cd slop-detect
   ```

2. **Create and activate a Python virtual environment:**

   ```bash
   python -m venv venv

   # On Windows:
   venv\Scripts\activate

   # On macOS/Linux:
   source venv/bin/activate
   ```

3. **Install dependencies:**

   ```bash
   pip install -r backend/requirements.txt
   ```

## Configuration

### Enable/Disable Detectors

Edit `backend/config/detectors_config.py`:

```python
ENABLED_DETECTORS: Dict[str, bool] = {
    "roberta": True,        # Enable RoBERTa
    "perplexity": True,     # Enable Perplexity
    "llmdet": True,         # Enable LLMDet
    "hf_classifier": True,  # Enable HF Classifier
    "outfox": False,        # Disable OUTFOX
}
```

### Set Detector Weights

```python
DETECTOR_WEIGHTS: Dict[str, float] = {
    "roberta": 0.30,        # 30% weight
    "perplexity": 0.25,     # 25% weight
    "llmdet": 0.25,         # 25% weight
    "hf_classifier": 0.20,  # 20% weight
    "outfox": 0.00,         # 0% weight (not used)
}
```

### Environment-based Configuration

You can also override the config with environment variables:

```bash
# Enable/disable detectors
export ENABLE_ROBERTA=true
export ENABLE_PERPLEXITY=true
export ENABLE_LLMDET=true
export ENABLE_HF_CLASSIFIER=true
export ENABLE_OUTFOX=false

# Database
export DATABASE_URL=sqlite:///slop_detect.db
export UPLOAD_FOLDER=./uploads

# Flask
export HOST=0.0.0.0
export PORT=5000
export DEBUG=False
```

## Running the Application

### Start the Flask Server

```bash
cd backend
python main.py
```

The API will be available at `http://localhost:5000`.

### API Endpoints

#### Health Check

```
GET /api/health
```

#### Analyze Text

```
POST /api/analyze/text
Content-Type: application/json

{
  "text": "Your text here...",
  "filename": "optional_name.txt",
  "user_id": "optional_user_id"
}

Response:
{
  "status": "success",
  "result_id": 1,
  "overall_ai_score": 0.78,
  "overall_ai_score_percentage": "78.0%",
  "overall_confidence": "high",
  "status_label": "Likely AI",
  "detector_results": {
    "roberta": {
      "detector_name": "roberta",
      "score": 0.85,
      "confidence": "high",
      "explanation": "Very strong indicators of AI-generated text..."
    },
    ...
  },
  "enabled_detectors": ["roberta", "perplexity", "llmdet", "hf_classifier"],
  "text_stats": {
    "character_count": 1500,
    "word_count": 250,
    "sentence_count": 15,
    "average_word_length": 4.8
  }
}
```

#### Analyze File

```
POST /api/analyze/file

FormData:
- file: the document to analyze
- user_id: optional user ID

Response: (same as analyze/text)
```

#### Get All Results

```
GET /api/results?page=1&limit=10&sort=recent

Response:
{
  "status": "success",
  "page": 1,
  "limit": 10,
  "total_count": 42,
  "results": [...]
}
```

#### Get Specific Result

```
GET /api/results/{result_id}

Response:
{
  "status": "success",
  "result": {
    "id": 1,
    "filename": "document.pdf",
    "overall_ai_score": 0.78,
    "overall_ai_score_percentage": "78.0%",
    ...
  }
}
```

#### Delete Result

```
DELETE /api/results/{result_id}
```

#### Update Result

```
PUT /api/results/{result_id}
Content-Type: application/json

{
  "notes": "Manual review: likely AI",
  "is_flagged": true
}
```

#### Get Statistics

```
GET /api/statistics/summary

Response:
{
  "status": "success",
  "summary": {
    "total_analyses": 42,
    "average_ai_score": 0.65,
    "total_text_analyzed": 125000,
    "likely_human": 15,
    "suspicious": 12,
    "likely_ai": 15
  }
}
```

#### Get Configuration

```
GET /api/config

Response:
{
  "status": "success",
  "config": {
    "enabled_detectors": ["roberta", "perplexity", "llmdet", "hf_classifier"],
    "aggregation_method": "weighted_average",
    "detector_weights": {...},
    "detector_info": {...}
  }
}
```

## Web Interface

Open `http://localhost:5000` in your browser to access the web UI.

### Features:

- **Upload Section** - Drag-and-drop or click to upload files
- **Text Analysis** - Paste text directly
- **Results Dashboard** - View detailed analysis results
- **History Tab** - See all previous analyses
- **Statistics Tab** - View aggregate statistics

## How It Works

### Detection Process

1. **File Parsing** - Extracts text from PDF/DOCX/TXT files
2. **Text Cleaning** - Normalizes whitespace and formatting
3. **Detector Ensemble** - Runs enabled detectors in parallel
4. **Score Aggregation** - Combines detector scores using weighted average, max, or voting
5. **Result Storage** - Saves to the database with full metadata
6. **Response** - Returns the overall score and a per-detector breakdown

### Detector Details

#### RoBERTa Detector
- **Model**: roberta-base-openai-detector
- **Type**: Transformer-based classification
- **Output**: 0-1 probability score
- **Speed**: Medium

#### Perplexity Detector
- **Model**: GPT-2
- **Method**: Analyzes token probability distributions
- **Detects**: Repetitive patterns, unusual word choices
- **Output**: 0-1 score based on perplexity, repetition, and AI-typical phrases

#### LLMDet Detector
- **Model**: BERT
- **Method**: Entropy and log-probability analysis
- **Detects**: Predictable sequences, unusual statistical patterns
- **Output**: 0-1 score from combined metrics

#### HF Classifier
- **Model**: Configurable (default: BERT)
- **Type**: Generic sequence classification
- **Output**: 0-1 probability score

#### OUTFOX Statistical
- **Type**: Statistical signature analysis
- **Detects**: Unusual word-length distributions, sentence structure patterns, low vocabulary diversity
- **Output**: 0-1 score from multiple statistical metrics

### Scoring

Default aggregation: **Weighted Average**

```
Overall Score = Σ (normalized_detector_score × weight)
```

Each detector's score is normalized to the 0-1 range, then multiplied by its configured weight. The sum is clamped to [0, 1].
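The weighted-average step can be sketched in a few lines of Python. This is a minimal illustration of the formula above, not the project's actual `ensemble.py` code; the function name and the example scores are hypothetical:

```python
from typing import Dict

def aggregate_weighted_average(scores: Dict[str, float],
                               weights: Dict[str, float]) -> float:
    """Combine per-detector scores (each assumed normalized to [0, 1])
    into one overall score, clamping the weighted sum to [0, 1]."""
    total = sum(score * weights.get(name, 0.0) for name, score in scores.items())
    return max(0.0, min(1.0, total))

# Example with the default weights from detectors_config.py:
scores = {"roberta": 0.85, "perplexity": 0.70, "llmdet": 0.75, "hf_classifier": 0.80}
weights = {"roberta": 0.30, "perplexity": 0.25, "llmdet": 0.25, "hf_classifier": 0.20}
overall = aggregate_weighted_average(scores, weights)  # 0.255 + 0.175 + 0.1875 + 0.16 = 0.7775
```

Detectors missing from the weights dict (such as a disabled OUTFOX) contribute nothing, which matches assigning them a 0.00 weight.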
### Confidence Levels

- **Very Low** (< 20%) - Almost certainly human-written
- **Low** (20-40%) - Probably human-written
- **Medium** (40-60%) - Uncertain
- **High** (60-80%) - Probably AI-generated
- **Very High** (> 80%) - Almost certainly AI-generated

## Project Structure

```
slop-detect/
├── backend/
│   ├── config/
│   │   ├── settings.py            # App settings
│   │   └── detectors_config.py    # Detector configuration (FEATURE FLAGS HERE)
│   ├── detectors/
│   │   ├── base.py                # Base detector class
│   │   ├── roberta.py             # RoBERTa detector
│   │   ├── perplexity.py          # Perplexity detector
│   │   ├── llmdet.py              # LLMDet detector
│   │   ├── hf_classifier.py       # HF classifier
│   │   ├── outfox.py              # OUTFOX detector
│   │   └── ensemble.py            # Ensemble manager
│   ├── database/
│   │   ├── models.py              # SQLAlchemy models
│   │   └── db.py                  # Database manager
│   ├── api/
│   │   ├── routes.py              # Flask API routes
│   │   └── models.py              # Pydantic request/response models
│   ├── utils/
│   │   ├── file_parser.py         # PDF/DOCX/TXT parsing
│   │   └── highlighter.py         # Text highlighting utilities
│   ├── main.py                    # Flask app entry point
│   └── requirements.txt           # Python dependencies
├── frontend/
│   └── index.html                 # Web UI (HTML + CSS + JS)
└── README.md                      # This file
```

## Customization

### Change Detector Weights

In `backend/config/detectors_config.py`:

```python
DETECTOR_WEIGHTS: Dict[str, float] = {
    "roberta": 0.40,  # Increase weight
    "perplexity": 0.30,
    "llmdet": 0.20,
    "hf_classifier": 0.10,
}
```

### Change Aggregation Method

In `backend/config/detectors_config.py`:

```python
AGGREGATION_METHOD = "max"  # Options: weighted_average, max, voting
```

### Use Different Models

In `backend/config/detectors_config.py`:

```python
ROBERTA_MODEL = "distilbert-base-uncased"
PERPLEXITY_MODEL = "gpt2-medium"
HF_CLASSIFIER_MODEL = "your-custom-model"
```

### Add Custom Detectors

1. Create a new file in `backend/detectors/`
2. Inherit from `BaseDetector`
3. Implement the `detect()` method
4. Add it to the `ensemble.py` initialization
5. Add it to `ENABLED_DETECTORS` in the config

Example:

```python
from detectors.base import BaseDetector, DetectorResult

class CustomDetector(BaseDetector):
    def __init__(self):
        super().__init__(name="custom")

    def detect(self, text: str) -> DetectorResult:
        # Your detection logic here
        score = calculate_ai_score(text)
        return DetectorResult(
            detector_name=self.name,
            score=score,
            explanation="Custom detection result"
        )
```

## Performance Tips

1. **Model Caching** - Models are lazy-loaded and cached in memory
2. **Parallel Detection** - Detectors can run in parallel (future enhancement)
3. **Batch Processing** - Configure the batch size for GPU processing
4. **Disable Unused Detectors** - Reduce load by disabling detectors you don't need

## Troubleshooting

### Slow First Run

- Models need to be downloaded from the Hugging Face Hub
- Subsequent runs will use the cached models
- The first model download can take 1-5 minutes

### Out of Memory

- Reduce the batch size in the config
- Disable memory-intensive detectors
- Run on a machine with more RAM

### Model Not Found

```
transformers.utils.RepositoryNotFoundError: Model not found
```

- The model name in the config is incorrect
- Check the Hugging Face Hub for the correct model name

### Database Locked

```
sqlite3.OperationalError: database is locked
```

- Close other connections to the database
- Ensure only one Flask instance is running
- Delete the `.db-journal` file if present

## Future Enhancements

- [ ] Parallel detector execution
- [ ] GPU support optimization
- [ ] Custom model fine-tuning
- [ ] Batch analysis API
- [ ] User authentication/authorization
- [ ] Document highlighting of suspicious sections
- [ ] Advanced filtering and search
- [ ] Export results to PDF/Excel
- [ ] API rate limiting
- [ ] Webhook notifications

## License

MIT License - feel free to use and modify.

## References

- [LLMDet](https://github.com/TrustedLLM/LLMDet)
- [RAID](https://github.com/liamdugan/raid)
- [OUTFOX](https://github.com/ryuryukke/OUTFOX)
- [AIGTD Survey](https://github.com/Nicozwy/AIGTD-Survey)
- [Plagiarism Detection](https://github.com/Kyle6012/plagiarism-detection)
- [Hugging Face Transformers](https://huggingface.co/transformers/)

## Support

For issues, questions, or suggestions, please open an issue on the project repository.
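For scripted use, the `/api/analyze/text` endpoint documented above can be called with only the Python standard library. This is a sketch under the assumption that the Flask server from "Running the Application" is listening on `localhost:5000`; the `analyze_text` helper name is illustrative, not part of the project:

```python
import json
from urllib.request import Request, urlopen

def analyze_text(text: str, base_url: str = "http://localhost:5000") -> dict:
    """POST text to /api/analyze/text and return the parsed JSON response."""
    req = Request(
        f"{base_url}/api/analyze/text",
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Generous timeout: the first request may block while models download.
    with urlopen(req, timeout=300) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A call such as `analyze_text("Some sample paragraph...")` should return a dict matching the response schema above, e.g. with `"overall_ai_score_percentage"` and `"status_label"` keys.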