---
title: AI Slop Detector
emoji: πŸ”
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
---

πŸ” AI Slop Detector

A Python API and web UI for detecting AI-generated content in PDF, DOCX, and raw text. It combines several detection methods into a configurable weighted ensemble.

Features

✨ Multi-Detector Ensemble

  • RoBERTa Classifier - Fine-tuned RoBERTa model for AI text detection
  • Perplexity Analysis - Detects statistical anomalies and repetitive patterns
  • LLMDet - Entropy and log-probability based detection
  • HuggingFace Classifier - Generic transformer-based classification
  • OUTFOX Statistical - Word/sentence length and vocabulary analysis

✨ Easy Feature Flags

  • Enable/disable each detector with a single config change
  • Adjust detector weights for ensemble averaging
  • Environment variable overrides

✨ Multiple File Formats

  • PDF documents
  • DOCX/DOC files
  • Plain text files
  • Raw text input

✨ Persistent Storage

  • SQLite database (default, configurable)
  • Upload history with timestamps
  • Detailed result tracking and statistics

✨ Web UI

  • Beautiful, responsive interface
  • Drag-and-drop file upload
  • Real-time analysis results
  • History and statistics views

✨ REST API

  • Analyze text and files via HTTP
  • Get historical results
  • Query statistics
  • Full result management

Installation

Prerequisites

  • Python 3.8+
  • pip or conda

Setup

  1. Clone or navigate to the project:
cd slop-detect
  2. Create a Python virtual environment:
python -m venv venv

# On Windows:
venv\Scripts\activate

# On macOS/Linux:
source venv/bin/activate
  3. Install dependencies:
pip install -r backend/requirements.txt

Configuration

Enable/Disable Detectors

Edit backend/config/detectors_config.py:

ENABLED_DETECTORS: Dict[str, bool] = {
    "roberta": True,           # Enable RoBERTa
    "perplexity": True,        # Enable Perplexity
    "llmdet": True,            # Enable LLMDet
    "hf_classifier": True,     # Enable HF Classifier
    "outfox": False,           # Disable OUTFOX
}

Set Detector Weights

DETECTOR_WEIGHTS: Dict[str, float] = {
    "roberta": 0.30,           # 30% weight
    "perplexity": 0.25,        # 25% weight
    "llmdet": 0.25,            # 25% weight
    "hf_classifier": 0.20,     # 20% weight
    "outfox": 0.00,            # 0% weight (not used)
}
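One sanity check worth running after editing the weights: the overall score is a weighted sum clamped to [0, 1] (see the Scoring section), so if the enabled detectors' weights total less than 1.0, every overall score is silently deflated. The assumption that weights should sum to exactly 1.0 is ours; the defaults above do.

```python
# Default weights from above; weights of the enabled detectors are
# assumed to sum to 1.0 so the weighted average stays on the same
# 0-1 scale as each individual detector score.
DETECTOR_WEIGHTS = {
    "roberta": 0.30,
    "perplexity": 0.25,
    "llmdet": 0.25,
    "hf_classifier": 0.20,
    "outfox": 0.00,
}

assert abs(sum(DETECTOR_WEIGHTS.values()) - 1.0) < 1e-9
```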

Environment-based Configuration

You can also use environment variables to override config:

# Enable/disable detectors
export ENABLE_ROBERTA=true
export ENABLE_PERPLEXITY=true
export ENABLE_LLMDET=true
export ENABLE_HF_CLASSIFIER=true
export ENABLE_OUTFOX=false

# Database
export DATABASE_URL=sqlite:///slop_detect.db
export UPLOAD_FOLDER=./uploads

# Flask
export HOST=0.0.0.0
export PORT=5000
export DEBUG=False

Running the Application

Start the Flask Server

cd backend
python main.py

The API will be available at http://localhost:5000

API Endpoints

Health Check

GET /api/health

Analyze Text

POST /api/analyze/text
Content-Type: application/json

{
    "text": "Your text here...",
    "filename": "optional_name.txt",
    "user_id": "optional_user_id"
}

Response:
{
    "status": "success",
    "result_id": 1,
    "overall_ai_score": 0.78,
    "overall_ai_score_percentage": "78.0%",
    "overall_confidence": "high",
    "status_label": "Likely AI",
    "detector_results": {
        "roberta": {
            "detector_name": "roberta",
            "score": 0.85,
            "confidence": "high",
            "explanation": "Very strong indicators of AI-generated text..."
        },
        ...
    },
    "enabled_detectors": ["roberta", "perplexity", "llmdet", "hf_classifier"],
    "text_stats": {
        "character_count": 1500,
        "word_count": 250,
        "sentence_count": 15,
        "average_word_length": 4.8
    }
}
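The text endpoint can be called from Python with nothing beyond the standard library. A minimal client sketch (the base URL assumes the default local server from the Running section; the payload keys match the request shape shown above):

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # default host/port from the Running section


def build_payload(text, filename=None, user_id=None):
    """Build the JSON body for POST /api/analyze/text; optional keys are omitted."""
    payload = {"text": text}
    if filename is not None:
        payload["filename"] = filename
    if user_id is not None:
        payload["user_id"] = user_id
    return payload


def analyze_text(text, filename=None, user_id=None, base_url=BASE_URL):
    """POST raw text to /api/analyze/text and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/api/analyze/text",
        data=json.dumps(build_payload(text, filename, user_id)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Generous timeout: the very first request may trigger model downloads.
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)
```

With the server running, `analyze_text("Some paragraph...")["overall_ai_score"]` returns the 0-1 score from the response shown above.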

Analyze File

POST /api/analyze/file
FormData:
- file: <file>
- user_id: <optional_user_id>

Response: (same as analyze/text)
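With the third-party `requests` library installed, uploading is a one-liner (`requests.post(url, files={"file": open(path, "rb")})`). A standard-library-only sketch, which has to build the multipart body by hand, might look like this:

```python
import json
import urllib.request
import uuid


def encode_multipart(filename, content, field="file", extra_fields=None):
    """Encode one file plus optional form fields as multipart/form-data bytes."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in (extra_fields or {}).items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode("utf-8")
        )
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{field}"; filename="{filename}"\r\n'
            f"Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
    )
    parts.append(content)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def analyze_file(path, base_url="http://localhost:5000", user_id=None):
    """POST a PDF/DOCX/TXT file to /api/analyze/file and return the JSON result."""
    with open(path, "rb") as f:
        body, content_type = encode_multipart(
            path, f.read(), extra_fields={"user_id": user_id} if user_id else None
        )
    req = urllib.request.Request(
        f"{base_url}/api/analyze/file",
        data=body,
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)
```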

Get All Results

GET /api/results?page=1&limit=10&sort=recent

Response:
{
    "status": "success",
    "page": 1,
    "limit": 10,
    "total_count": 42,
    "results": [...]
}

Get Specific Result

GET /api/results/{result_id}

Response:
{
    "status": "success",
    "result": {
        "id": 1,
        "filename": "document.pdf",
        "overall_ai_score": 0.78,
        "overall_ai_score_percentage": "78.0%",
        ...
    }
}

Delete Result

DELETE /api/results/{result_id}

Update Result

PUT /api/results/{result_id}
Content-Type: application/json

{
    "notes": "Manual review: likely AI",
    "is_flagged": true
}

Get Statistics

GET /api/statistics/summary

Response:
{
    "status": "success",
    "summary": {
        "total_analyses": 42,
        "average_ai_score": 0.65,
        "total_text_analyzed": 125000,
        "likely_human": 15,
        "suspicious": 12,
        "likely_ai": 15
    }
}

Get Configuration

GET /api/config

Response:
{
    "status": "success",
    "config": {
        "enabled_detectors": [
            "roberta", "perplexity", "llmdet", "hf_classifier"
        ],
        "aggregation_method": "weighted_average",
        "detector_weights": {...},
        "detector_info": {...}
    }
}

Web Interface

Open http://localhost:5000 in your browser to access the web UI.

Features:

  • Upload Section - Drag-and-drop or click to upload files
  • Text Analysis - Paste text directly
  • Results Dashboard - View detailed analysis results
  • History Tab - See all previous analyses
  • Statistics Tab - View aggregate statistics

How It Works

Detection Process

  1. File Parsing - Extracts text from PDF/DOCX/TXT files
  2. Text Cleaning - Normalizes whitespace and formatting
  3. Detector Ensemble - Runs enabled detectors in parallel
  4. Score Aggregation - Combines detector scores using weighted average, max, or voting
  5. Result Storage - Saves to database with full metadata
  6. Response - Returns overall score and per-detector breakdown
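The aggregation step (4) supports three strategies. A simplified sketch of how they differ, for illustration only (this is not the project's actual `ensemble.py`):

```python
def aggregate(scores, weights, method="weighted_average", threshold=0.5):
    """Combine per-detector scores (each in [0, 1]) into one overall score.

    `scores` and `weights` are dicts keyed by detector name. Simplified
    illustration of the three methods named in this README.
    """
    if method == "weighted_average":
        total = sum(scores[name] * weights.get(name, 0.0) for name in scores)
        return min(max(total, 0.0), 1.0)  # clamp to [0, 1]
    if method == "max":
        # Most suspicious detector wins; useful for a high-recall setup.
        return max(scores.values())
    if method == "voting":
        # Fraction of detectors whose score crosses the AI threshold.
        votes = sum(1 for s in scores.values() if s >= threshold)
        return votes / len(scores)
    raise ValueError(f"unknown aggregation method: {method}")
```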

Detector Details

RoBERTa Detector

  • Model: roberta-base-openai-detector
  • Type: Transformer-based classification
  • Output: 0-1 probability score
  • Speed: Medium

Perplexity Detector

  • Model: GPT-2
  • Method: Analyzes token probability distributions
  • Detects: Repetitive patterns, unusual word choices
  • Output: 0-1 score based on perplexity, repetition, AI phrases

LLMDet Detector

  • Model: BERT
  • Method: Entropy and log-probability analysis
  • Detects: Predictable sequences, unusual statistical patterns
  • Output: 0-1 score from combined metrics

HF Classifier

  • Model: Configurable (default: BERT)
  • Type: Generic sequence classification
  • Output: 0-1 probability score

OUTFOX Statistical

  • Type: Statistical signature analysis
  • Detects: Unusual word length distributions, sentence structure patterns, vocabulary diversity
  • Output: 0-1 score from multiple statistical metrics

Scoring

Default aggregation: Weighted Average

Overall Score = Ξ£ (normalized_detector_score Γ— weight)

Each detector's score is normalized to 0-1 range, then multiplied by its configured weight. The sum is clamped to [0, 1].
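A worked example with the default weights. The detector scores here are hypothetical, chosen to reproduce the 0.78 overall score shown in the sample API response:

```python
# Hypothetical per-detector scores combined with the default weights.
scores  = {"roberta": 0.85, "perplexity": 0.70, "llmdet": 0.80, "hf_classifier": 0.75}
weights = {"roberta": 0.30, "perplexity": 0.25, "llmdet": 0.25, "hf_classifier": 0.20}

overall = min(max(sum(scores[d] * weights[d] for d in scores), 0.0), 1.0)
# 0.255 + 0.175 + 0.200 + 0.150 = 0.78  ->  "Likely AI"
print(round(overall, 2))  # 0.78
```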

Confidence Levels

  • Very Low (< 20%) - Almost certainly human-written
  • Low (20-40%) - Probably human-written
  • Medium (40-60%) - Uncertain
  • High (60-80%) - Probably AI-generated
  • Very High (> 80%) - Almost certainly AI-generated

Project Structure

slop-detect/
β”œβ”€β”€ backend/
β”‚   β”œβ”€β”€ config/
β”‚   β”‚   β”œβ”€β”€ settings.py          # App settings
β”‚   β”‚   └── detectors_config.py  # Detector configuration (FEATURE FLAGS HERE)
β”‚   β”œβ”€β”€ detectors/
β”‚   β”‚   β”œβ”€β”€ base.py              # Base detector class
β”‚   β”‚   β”œβ”€β”€ roberta.py           # RoBERTa detector
β”‚   β”‚   β”œβ”€β”€ perplexity.py        # Perplexity detector
β”‚   β”‚   β”œβ”€β”€ llmdet.py            # LLMDet detector
β”‚   β”‚   β”œβ”€β”€ hf_classifier.py     # HF classifier
β”‚   β”‚   β”œβ”€β”€ outfox.py            # OUTFOX detector
β”‚   β”‚   └── ensemble.py          # Ensemble manager
β”‚   β”œβ”€β”€ database/
β”‚   β”‚   β”œβ”€β”€ models.py            # SQLAlchemy models
β”‚   β”‚   └── db.py                # Database manager
β”‚   β”œβ”€β”€ api/
β”‚   β”‚   β”œβ”€β”€ routes.py            # Flask API routes
β”‚   β”‚   └── models.py            # Pydantic request/response models
β”‚   β”œβ”€β”€ utils/
β”‚   β”‚   β”œβ”€β”€ file_parser.py       # PDF/DOCX/TXT parsing
β”‚   β”‚   └── highlighter.py       # Text highlighting utilities
β”‚   β”œβ”€β”€ main.py                  # Flask app entry point
β”‚   └── requirements.txt          # Python dependencies
β”œβ”€β”€ frontend/
β”‚   └── index.html               # Web UI (HTML + CSS + JS)
└── README.md                    # This file

Customization

Change Detector Weights

In backend/config/detectors_config.py:

DETECTOR_WEIGHTS: Dict[str, float] = {
    "roberta": 0.40,           # Increase weight
    "perplexity": 0.30,
    "llmdet": 0.20,
    "hf_classifier": 0.10,
}

Change Aggregation Method

In backend/config/detectors_config.py:

AGGREGATION_METHOD = "max"  # Options: weighted_average, max, voting

Use Different Models

In backend/config/detectors_config.py:

ROBERTA_MODEL = "distilbert-base-uncased"
PERPLEXITY_MODEL = "gpt2-medium"
HF_CLASSIFIER_MODEL = "your-custom-model"

Add Custom Detectors

  1. Create a new file in backend/detectors/
  2. Inherit from BaseDetector
  3. Implement detect() method
  4. Add to ensemble.py initialization
  5. Add to ENABLED_DETECTORS in config

Example:

from detectors.base import BaseDetector, DetectorResult

class CustomDetector(BaseDetector):
    def __init__(self):
        super().__init__(name="custom")
    
    def detect(self, text: str) -> DetectorResult:
        # Your detection logic here; must return a float in the 0-1 range
        score = calculate_ai_score(text)  # placeholder: implement your own scoring
        return DetectorResult(
            detector_name=self.name,
            score=score,
            explanation="Custom detection result"
        )

Performance Tips

  1. Model Caching - Models are lazy-loaded and cached in memory
  2. Parallel Detection - Detectors can run in parallel (future enhancement)
  3. Batch Processing - Configure batch size for GPU processing
  4. Disable Unused Detectors - Reduce load by disabling detectors you don't need

Troubleshooting

Slow First Run

  • Models need to be downloaded from Hugging Face Hub
  • Subsequent runs will use cached models
  • First model download can take 1-5 minutes

Out of Memory

  • Reduce batch size in config
  • Disable memory-intensive detectors
  • Run on a machine with more RAM

Model Not Found

transformers.utils.RepositoryNotFoundError: Model not found

  β€’ Model name is incorrect in config
  β€’ Check Hugging Face Hub for the correct model name

Database Locked

sqlite3.OperationalError: database is locked

  β€’ Close other connections to the database
  β€’ Ensure only one Flask instance is running
  β€’ Delete the .db-journal file if present

Future Enhancements

  • Parallel detector execution
  • GPU support optimization
  • Custom model fine-tuning
  • Batch analysis API
  • User authentication/authorization
  • Document highlighting with suspicious sections
  • Advanced filtering and search
  • Export results to PDF/Excel
  • API rate limiting
  • Webhook notifications

License

MIT License - feel free to use and modify

Support

For issues, questions, or suggestions, please open an issue on the project repository.