---
title: AI Slop Detector
emoji: πŸ”
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
---

πŸ” AI Slop Detector

A Python API and web UI for detecting AI-generated content in PDF, DOCX, and raw text. It combines several detection methods into a configurable weighted ensemble.

Features

✨ Multi-Detector Ensemble

  • RoBERTa Classifier - Fine-tuned RoBERTa model for AI text detection
  • Perplexity Analysis - Detects statistical anomalies and repetitive patterns
  • LLMDet - Entropy and log-probability based detection
  • HuggingFace Classifier - Generic transformer-based classification
  • OUTFOX Statistical - Word/sentence length and vocabulary analysis

✨ Easy Feature Flags

  • Enable/disable each detector with a single config change
  • Adjust detector weights for ensemble averaging
  • Environment variable overrides

✨ Multiple File Formats

  • PDF documents
  • DOCX/DOC files
  • Plain text files
  • Raw text input

✨ Persistent Storage

  • SQLite database (default, configurable)
  • Upload history with timestamps
  • Detailed result tracking and statistics

✨ Web UI

  • Beautiful, responsive interface
  • Drag-and-drop file upload
  • Real-time analysis results
  • History and statistics views

✨ REST API

  • Analyze text and files via HTTP
  • Get historical results
  • Query statistics
  • Full result management

Installation

Prerequisites

  • Python 3.8+
  • pip or conda

Setup

  1. Clone or navigate to the project:
cd slop-detect
  2. Create a Python virtual environment:
python -m venv venv

# On Windows:
venv\Scripts\activate

# On macOS/Linux:
source venv/bin/activate
  3. Install dependencies:
pip install -r backend/requirements.txt

Configuration

Enable/Disable Detectors

Edit backend/config/detectors_config.py:

ENABLED_DETECTORS: Dict[str, bool] = {
    "roberta": True,           # Enable RoBERTa
    "perplexity": True,        # Enable Perplexity
    "llmdet": True,            # Enable LLMDet
    "hf_classifier": True,     # Enable HF Classifier
    "outfox": False,           # Disable OUTFOX
}

Set Detector Weights

DETECTOR_WEIGHTS: Dict[str, float] = {
    "roberta": 0.30,           # 30% weight
    "perplexity": 0.25,        # 25% weight
    "llmdet": 0.25,            # 25% weight
    "hf_classifier": 0.20,     # 20% weight
    "outfox": 0.00,            # 0% weight (not used)
}
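One sanity check worth running after editing the weights: the overall score is a weighted sum clamped to [0, 1] (see the Scoring section), so if the enabled detectors' weights total less than 1.0, every overall score is silently deflated. The assumption that weights should sum to exactly 1.0 is ours; the defaults above do.

```python
# Default weights from above; weights of the enabled detectors are
# assumed to sum to 1.0 so the weighted average stays on the same
# 0-1 scale as each individual detector score.
DETECTOR_WEIGHTS = {
    "roberta": 0.30,
    "perplexity": 0.25,
    "llmdet": 0.25,
    "hf_classifier": 0.20,
    "outfox": 0.00,
}

assert abs(sum(DETECTOR_WEIGHTS.values()) - 1.0) < 1e-9
```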

Environment-based Configuration

You can also use environment variables to override config:

# Enable/disable detectors
export ENABLE_ROBERTA=true
export ENABLE_PERPLEXITY=true
export ENABLE_LLMDET=true
export ENABLE_HF_CLASSIFIER=true
export ENABLE_OUTFOX=false

# Database
export DATABASE_URL=sqlite:///slop_detect.db
export UPLOAD_FOLDER=./uploads

# Flask
export HOST=0.0.0.0
export PORT=5000
export DEBUG=False

Running the Application

Start the Flask Server

cd backend
python main.py

The API will be available at http://localhost:5000

API Endpoints

Health Check

GET /api/health

Analyze Text

POST /api/analyze/text
Content-Type: application/json

{
    "text": "Your text here...",
    "filename": "optional_name.txt",
    "user_id": "optional_user_id"
}

Response:
{
    "status": "success",
    "result_id": 1,
    "overall_ai_score": 0.78,
    "overall_ai_score_percentage": "78.0%",
    "overall_confidence": "high",
    "status_label": "Likely AI",
    "detector_results": {
        "roberta": {
            "detector_name": "roberta",
            "score": 0.85,
            "confidence": "high",
            "explanation": "Very strong indicators of AI-generated text..."
        },
        ...
    },
    "enabled_detectors": ["roberta", "perplexity", "llmdet", "hf_classifier"],
    "text_stats": {
        "character_count": 1500,
        "word_count": 250,
        "sentence_count": 15,
        "average_word_length": 4.8
    }
}
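The text endpoint can be called from Python with nothing beyond the standard library. A minimal client sketch (the base URL assumes the default local server from the Running section; the payload keys match the request shape shown above):

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # default host/port from the Running section


def build_payload(text, filename=None, user_id=None):
    """Build the JSON body for POST /api/analyze/text; optional keys are omitted."""
    payload = {"text": text}
    if filename is not None:
        payload["filename"] = filename
    if user_id is not None:
        payload["user_id"] = user_id
    return payload


def analyze_text(text, filename=None, user_id=None, base_url=BASE_URL):
    """POST raw text to /api/analyze/text and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/api/analyze/text",
        data=json.dumps(build_payload(text, filename, user_id)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Generous timeout: the very first request may trigger model downloads.
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)
```

With the server running, `analyze_text("Some paragraph...")["overall_ai_score"]` returns the 0-1 score from the response shown above.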

Analyze File

POST /api/analyze/file
FormData:
- file: <file>
- user_id: <optional_user_id>

Response: (same as analyze/text)
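With the third-party `requests` library installed, uploading is a one-liner (`requests.post(url, files={"file": open(path, "rb")})`). A standard-library-only sketch, which has to build the multipart body by hand, might look like this:

```python
import json
import urllib.request
import uuid


def encode_multipart(filename, content, field="file", extra_fields=None):
    """Encode one file plus optional form fields as multipart/form-data bytes."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in (extra_fields or {}).items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode("utf-8")
        )
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{field}"; filename="{filename}"\r\n'
            f"Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
    )
    parts.append(content)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def analyze_file(path, base_url="http://localhost:5000", user_id=None):
    """POST a PDF/DOCX/TXT file to /api/analyze/file and return the JSON result."""
    with open(path, "rb") as f:
        body, content_type = encode_multipart(
            path, f.read(), extra_fields={"user_id": user_id} if user_id else None
        )
    req = urllib.request.Request(
        f"{base_url}/api/analyze/file",
        data=body,
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)
```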

Get All Results

GET /api/results?page=1&limit=10&sort=recent

Response:
{
    "status": "success",
    "page": 1,
    "limit": 10,
    "total_count": 42,
    "results": [...]
}

Get Specific Result

GET /api/results/{result_id}

Response:
{
    "status": "success",
    "result": {
        "id": 1,
        "filename": "document.pdf",
        "overall_ai_score": 0.78,
        "overall_ai_score_percentage": "78.0%",
        ...
    }
}

Delete Result

DELETE /api/results/{result_id}

Update Result

PUT /api/results/{result_id}
Content-Type: application/json

{
    "notes": "Manual review: likely AI",
    "is_flagged": true
}

Get Statistics

GET /api/statistics/summary

Response:
{
    "status": "success",
    "summary": {
        "total_analyses": 42,
        "average_ai_score": 0.65,
        "total_text_analyzed": 125000,
        "likely_human": 15,
        "suspicious": 12,
        "likely_ai": 15
    }
}

Get Configuration

GET /api/config

Response:
{
    "status": "success",
    "config": {
        "enabled_detectors": [
            "roberta", "perplexity", "llmdet", "hf_classifier"
        ],
        "aggregation_method": "weighted_average",
        "detector_weights": {...},
        "detector_info": {...}
    }
}

Web Interface

Open http://localhost:5000 in your browser to access the web UI.

Features:

  • Upload Section - Drag-and-drop or click to upload files
  • Text Analysis - Paste text directly
  • Results Dashboard - View detailed analysis results
  • History Tab - See all previous analyses
  • Statistics Tab - View aggregate statistics

How It Works

Detection Process

  1. File Parsing - Extracts text from PDF/DOCX/TXT files
  2. Text Cleaning - Normalizes whitespace and formatting
  3. Detector Ensemble - Runs enabled detectors in parallel
  4. Score Aggregation - Combines detector scores using weighted average, max, or voting
  5. Result Storage - Saves to database with full metadata
  6. Response - Returns overall score and per-detector breakdown
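The aggregation step (4) supports three strategies. A simplified sketch of how they differ, for illustration only (this is not the project's actual `ensemble.py`):

```python
def aggregate(scores, weights, method="weighted_average", threshold=0.5):
    """Combine per-detector scores (each in [0, 1]) into one overall score.

    `scores` and `weights` are dicts keyed by detector name. Simplified
    illustration of the three methods named in this README.
    """
    if method == "weighted_average":
        total = sum(scores[name] * weights.get(name, 0.0) for name in scores)
        return min(max(total, 0.0), 1.0)  # clamp to [0, 1]
    if method == "max":
        # Most suspicious detector wins; useful for a high-recall setup.
        return max(scores.values())
    if method == "voting":
        # Fraction of detectors whose score crosses the AI threshold.
        votes = sum(1 for s in scores.values() if s >= threshold)
        return votes / len(scores)
    raise ValueError(f"unknown aggregation method: {method}")
```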

Detector Details

RoBERTa Detector

  • Model: roberta-base-openai-detector
  • Type: Transformer-based classification
  • Output: 0-1 probability score
  • Speed: Medium

Perplexity Detector

  • Model: GPT-2
  • Method: Analyzes token probability distributions
  • Detects: Repetitive patterns, unusual word choices
  • Output: 0-1 score based on perplexity, repetition, AI phrases

LLMDet Detector

  • Model: BERT
  • Method: Entropy and log-probability analysis
  • Detects: Predictable sequences, unusual statistical patterns
  • Output: 0-1 score from combined metrics

HF Classifier

  • Model: Configurable (default: BERT)
  • Type: Generic sequence classification
  • Output: 0-1 probability score

OUTFOX Statistical

  • Type: Statistical signature analysis
  • Detects: Unusual word length distributions, sentence structure patterns, vocabulary diversity
  • Output: 0-1 score from multiple statistical metrics

Scoring

Default aggregation: Weighted Average

Overall Score = Ξ£ (normalized_detector_score Γ— weight)

Each detector's score is normalized to 0-1 range, then multiplied by its configured weight. The sum is clamped to [0, 1].
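A worked example with the default weights. The detector scores here are hypothetical, chosen to reproduce the 0.78 overall score shown in the sample API response:

```python
# Hypothetical per-detector scores combined with the default weights.
scores  = {"roberta": 0.85, "perplexity": 0.70, "llmdet": 0.80, "hf_classifier": 0.75}
weights = {"roberta": 0.30, "perplexity": 0.25, "llmdet": 0.25, "hf_classifier": 0.20}

overall = min(max(sum(scores[d] * weights[d] for d in scores), 0.0), 1.0)
# 0.255 + 0.175 + 0.200 + 0.150 = 0.78  ->  "Likely AI"
print(round(overall, 2))  # 0.78
```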

Confidence Levels

  • Very Low (< 20%) - Almost certainly human-written
  • Low (20-40%) - Probably human-written
  • Medium (40-60%) - Uncertain
  • High (60-80%) - Probably AI-generated
  • Very High (> 80%) - Almost certainly AI-generated

Project Structure

slop-detect/
β”œβ”€β”€ backend/
β”‚   β”œβ”€β”€ config/
β”‚   β”‚   β”œβ”€β”€ settings.py          # App settings
β”‚   β”‚   └── detectors_config.py  # Detector configuration (FEATURE FLAGS HERE)
β”‚   β”œβ”€β”€ detectors/
β”‚   β”‚   β”œβ”€β”€ base.py              # Base detector class
β”‚   β”‚   β”œβ”€β”€ roberta.py           # RoBERTa detector
β”‚   β”‚   β”œβ”€β”€ perplexity.py        # Perplexity detector
β”‚   β”‚   β”œβ”€β”€ llmdet.py            # LLMDet detector
β”‚   β”‚   β”œβ”€β”€ hf_classifier.py     # HF classifier
β”‚   β”‚   β”œβ”€β”€ outfox.py            # OUTFOX detector
β”‚   β”‚   └── ensemble.py          # Ensemble manager
β”‚   β”œβ”€β”€ database/
β”‚   β”‚   β”œβ”€β”€ models.py            # SQLAlchemy models
β”‚   β”‚   └── db.py                # Database manager
β”‚   β”œβ”€β”€ api/
β”‚   β”‚   β”œβ”€β”€ routes.py            # Flask API routes
β”‚   β”‚   └── models.py            # Pydantic request/response models
β”‚   β”œβ”€β”€ utils/
β”‚   β”‚   β”œβ”€β”€ file_parser.py       # PDF/DOCX/TXT parsing
β”‚   β”‚   └── highlighter.py       # Text highlighting utilities
β”‚   β”œβ”€β”€ main.py                  # Flask app entry point
β”‚   └── requirements.txt          # Python dependencies
β”œβ”€β”€ frontend/
β”‚   └── index.html               # Web UI (HTML + CSS + JS)
└── README.md                    # This file

Customization

Change Detector Weights

In backend/config/detectors_config.py:

DETECTOR_WEIGHTS: Dict[str, float] = {
    "roberta": 0.40,           # Increase weight
    "perplexity": 0.30,
    "llmdet": 0.20,
    "hf_classifier": 0.10,
}

Change Aggregation Method

In backend/config/detectors_config.py:

AGGREGATION_METHOD = "max"  # Options: weighted_average, max, voting

Use Different Models

In backend/config/detectors_config.py:

ROBERTA_MODEL = "distilbert-base-uncased"
PERPLEXITY_MODEL = "gpt2-medium"
HF_CLASSIFIER_MODEL = "your-custom-model"

Add Custom Detectors

  1. Create a new file in backend/detectors/
  2. Inherit from BaseDetector
  3. Implement detect() method
  4. Add to ensemble.py initialization
  5. Add to ENABLED_DETECTORS in config

Example:

from detectors.base import BaseDetector, DetectorResult

class CustomDetector(BaseDetector):
    def __init__(self):
        super().__init__(name="custom")
    
    def detect(self, text: str) -> DetectorResult:
        # Your detection logic here; must return a float in the 0-1 range
        score = calculate_ai_score(text)  # placeholder: implement your own scoring
        return DetectorResult(
            detector_name=self.name,
            score=score,
            explanation="Custom detection result"
        )

Performance Tips

  1. Model Caching - Models are lazy-loaded and cached in memory
  2. Parallel Detection - Detectors can run in parallel (future enhancement)
  3. Batch Processing - Configure batch size for GPU processing
  4. Disable Unused Detectors - Reduce load by disabling detectors you don't need

Troubleshooting

Slow First Run

  • Models need to be downloaded from Hugging Face Hub
  • Subsequent runs will use cached models
  • First model download can take 1-5 minutes

Out of Memory

  • Reduce batch size in config
  • Disable memory-intensive detectors
  • Run on a machine with more RAM

Model Not Found

transformers.utils.RepositoryNotFoundError: Model not found

  β€’ Model name is incorrect in config
  β€’ Check Hugging Face Hub for the correct model name

Database Locked

sqlite3.OperationalError: database is locked

  β€’ Close other connections to the database
  β€’ Ensure only one Flask instance is running
  β€’ Delete the .db-journal file if present

Future Enhancements

  • Parallel detector execution
  • GPU support optimization
  • Custom model fine-tuning
  • Batch analysis API
  • User authentication/authorization
  • Document highlighting with suspicious sections
  • Advanced filtering and search
  • Export results to PDF/Excel
  • API rate limiting
  • Webhook notifications

License

MIT License - feel free to use and modify

Support

For issues, questions, or suggestions, please open an issue on the project repository.