---
title: AI Slop Detector
emoji: πŸ”
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
---
# πŸ” AI Slop Detector
A comprehensive Python API and web UI for detecting AI-generated content in PDFs, DOCX files, and raw text, built on an ensemble of state-of-the-art detection methods.
## Features
✨ **Multi-Detector Ensemble**
- **RoBERTa Classifier** - Fine-tuned RoBERTa model for AI text detection
- **Perplexity Analysis** - Detects statistical anomalies and repetitive patterns
- **LLMDet** - Entropy and log-probability based detection
- **HuggingFace Classifier** - Generic transformer-based classification
- **OUTFOX Statistical** - Word/sentence length and vocabulary analysis
✨ **Easy Feature Flags**
- Enable/disable each detector with a single config change
- Adjust detector weights for ensemble averaging
- Environment variable overrides
✨ **Multiple File Formats**
- PDF documents
- DOCX/DOC files
- Plain text files
- Raw text input
✨ **Persistent Storage**
- SQLite database (default, configurable)
- Upload history with timestamps
- Detailed result tracking and statistics
✨ **Web UI**
- Beautiful, responsive interface
- Drag-and-drop file upload
- Real-time analysis results
- History and statistics views
✨ **REST API**
- Analyze text and files via HTTP
- Get historical results
- Query statistics
- Full result management
## Installation
### Prerequisites
- Python 3.8+
- pip or conda
### Setup
1. **Clone/Navigate to the project:**
```bash
cd slop-detect
```
2. **Create a Python virtual environment:**
```bash
python -m venv venv
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
```
3. **Install dependencies:**
```bash
pip install -r backend/requirements.txt
```
## Configuration
### Enable/Disable Detectors
Edit `backend/config/detectors_config.py`:
```python
ENABLED_DETECTORS: Dict[str, bool] = {
    "roberta": True,         # Enable RoBERTa
    "perplexity": True,      # Enable Perplexity
    "llmdet": True,          # Enable LLMDet
    "hf_classifier": True,   # Enable HF Classifier
    "outfox": False,         # Disable OUTFOX
}
```
### Set Detector Weights
```python
DETECTOR_WEIGHTS: Dict[str, float] = {
    "roberta": 0.30,         # 30% weight
    "perplexity": 0.25,      # 25% weight
    "llmdet": 0.25,          # 25% weight
    "hf_classifier": 0.20,   # 20% weight
    "outfox": 0.00,          # 0% weight (not used)
}
```
### Environment-based Configuration
You can also use environment variables to override config:
```bash
# Enable/disable detectors
export ENABLE_ROBERTA=true
export ENABLE_PERPLEXITY=true
export ENABLE_LLMDET=true
export ENABLE_HF_CLASSIFIER=true
export ENABLE_OUTFOX=false
# Database
export DATABASE_URL=sqlite:///slop_detect.db
export UPLOAD_FOLDER=./uploads
# Flask
export HOST=0.0.0.0
export PORT=5000
export DEBUG=False
```
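Inside the app, a small helper can translate these variables into booleans. The sketch below is illustrative only: the helper name `env_flag` and the set of accepted truthy values are assumptions, not necessarily what `settings.py` actually does.

```python
import os

def env_flag(name: str, default: bool) -> bool:
    """Read a boolean feature flag from the environment (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")

# Example: override the OUTFOX flag from the shell
ENABLE_OUTFOX = env_flag("ENABLE_OUTFOX", default=False)
```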
## Running the Application
### Start the Flask Server
```bash
cd backend
python main.py
```
The API will be available at `http://localhost:5000`.
### API Endpoints
#### Health Check
```
GET /api/health
```
#### Analyze Text
```
POST /api/analyze/text
Content-Type: application/json
{
  "text": "Your text here...",
  "filename": "optional_name.txt",
  "user_id": "optional_user_id"
}

Response:

{
  "status": "success",
  "result_id": 1,
  "overall_ai_score": 0.78,
  "overall_ai_score_percentage": "78.0%",
  "overall_confidence": "high",
  "status_label": "Likely AI",
  "detector_results": {
    "roberta": {
      "detector_name": "roberta",
      "score": 0.85,
      "confidence": "high",
      "explanation": "Very strong indicators of AI-generated text..."
    },
    ...
  },
  "enabled_detectors": ["roberta", "perplexity", "llmdet", "hf_classifier"],
  "text_stats": {
    "character_count": 1500,
    "word_count": 250,
    "sentence_count": 15,
    "average_word_length": 4.8
  }
}
```
#### Analyze File
```
POST /api/analyze/file
FormData:
- file: <file>
- user_id: <optional_user_id>
Response: (same as analyze/text)
```
#### Get All Results
```
GET /api/results?page=1&limit=10&sort=recent
Response:
{
  "status": "success",
  "page": 1,
  "limit": 10,
  "total_count": 42,
  "results": [...]
}
```
#### Get Specific Result
```
GET /api/results/{result_id}
Response:
{
  "status": "success",
  "result": {
    "id": 1,
    "filename": "document.pdf",
    "overall_ai_score": 0.78,
    "overall_ai_score_percentage": "78.0%",
    ...
  }
}
```
#### Delete Result
```
DELETE /api/results/{result_id}
```
#### Update Result
```
PUT /api/results/{result_id}
Content-Type: application/json
{
  "notes": "Manual review: likely AI",
  "is_flagged": true
}
```
#### Get Statistics
```
GET /api/statistics/summary
Response:
{
  "status": "success",
  "summary": {
    "total_analyses": 42,
    "average_ai_score": 0.65,
    "total_text_analyzed": 125000,
    "likely_human": 15,
    "suspicious": 12,
    "likely_ai": 15
  }
}
```
#### Get Configuration
```
GET /api/config
Response:
{
  "status": "success",
  "config": {
    "enabled_detectors": [
      "roberta", "perplexity", "llmdet", "hf_classifier"
    ],
    "aggregation_method": "weighted_average",
    "detector_weights": {...},
    "detector_info": {...}
  }
}
```
## Web Interface
Open `http://localhost:5000` in your browser to access the web UI.
### Features:
- **Upload Section** - Drag-and-drop or click to upload files
- **Text Analysis** - Paste text directly
- **Results Dashboard** - View detailed analysis results
- **History Tab** - See all previous analyses
- **Statistics Tab** - View aggregate statistics
## How It Works
### Detection Process
1. **File Parsing** - Extracts text from PDF/DOCX/TXT files
2. **Text Cleaning** - Normalizes whitespace and formatting
3. **Detector Ensemble** - Runs each enabled detector on the text
4. **Score Aggregation** - Combines detector scores using weighted average, max, or voting
5. **Result Storage** - Saves to database with full metadata
6. **Response** - Returns overall score and per-detector breakdown
### Detector Details
#### RoBERTa Detector
- **Model**: roberta-base-openai-detector
- **Type**: Transformer-based classification
- **Output**: 0-1 probability score
- **Speed**: Medium
#### Perplexity Detector
- **Model**: GPT-2
- **Method**: Analyzes token probability distributions
- **Detects**: Repetitive patterns, unusual word choices
- **Output**: 0-1 score based on perplexity, repetition, AI phrases
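As a rough illustration of the metric itself: perplexity is the exponential of the mean negative log-probability the model assigns to each token. The real detector computes token probabilities with GPT-2; the toy probabilities below are hand-picked for the example.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability) over the tokens.
    Unusually low perplexity (very predictable text) is a common AI signal."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Highly predictable tokens -> low perplexity
print(perplexity([0.9, 0.8, 0.95]))   # ~1.13
# Surprising tokens -> high perplexity
print(perplexity([0.05, 0.1, 0.02]))  # ~21.5
```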
#### LLMDet Detector
- **Model**: BERT
- **Method**: Entropy and log-probability analysis
- **Detects**: Predictable sequences, unusual statistical patterns
- **Output**: 0-1 score from combined metrics
#### HF Classifier
- **Model**: Configurable (default: BERT)
- **Type**: Generic sequence classification
- **Output**: 0-1 probability score
#### OUTFOX Statistical
- **Type**: Statistical signature analysis
- **Detects**: Unusual word length distributions, sentence structure patterns, vocabulary diversity
- **Output**: 0-1 score from multiple statistical metrics
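A minimal sketch of the kind of statistics this style of detection relies on. The feature names and the toy example are illustrative, not the detector's actual implementation.

```python
import re
from statistics import mean, pstdev

def text_signature(text: str) -> dict:
    """Toy statistical signature: average word length, sentence-length spread,
    and vocabulary diversity (type-token ratio). Very uniform sentence lengths
    and low diversity are weak AI signals."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    sent_lens = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return {
        "avg_word_length": mean(len(w) for w in words),
        "sentence_length_stdev": pstdev(sent_lens) if len(sent_lens) > 1 else 0.0,
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
    }

sig = text_signature("The cat sat. The cat sat. The cat sat.")
# type_token_ratio is 3/9: only 3 distinct words in 9 tokens, i.e. low diversity
```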
### Scoring
Default aggregation: **Weighted Average**
```
Overall Score = Ξ£ (normalized_detector_score Γ— weight)
```
Each detector's score is normalized to 0-1 range, then multiplied by its configured weight. The sum is clamped to [0, 1].
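That aggregation can be sketched in a few lines; the detector names and scores below are illustrative values, not output from the real ensemble.

```python
def aggregate_weighted(scores: dict, weights: dict) -> float:
    """Overall Score = sum(normalized_detector_score * weight), clamped to [0, 1].
    Scores are assumed already normalized to 0-1 by each detector."""
    total = sum(scores[name] * weights.get(name, 0.0) for name in scores)
    return min(max(total, 0.0), 1.0)

scores  = {"roberta": 0.85, "perplexity": 0.60, "llmdet": 0.70, "hf_classifier": 0.50}
weights = {"roberta": 0.30, "perplexity": 0.25, "llmdet": 0.25, "hf_classifier": 0.20}
print(aggregate_weighted(scores, weights))  # ~0.68
```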
### Confidence Levels
- **Very Low** (< 20%) - Almost certainly human-written
- **Low** (20-40%) - Probably human-written
- **Medium** (40-60%) - Uncertain
- **High** (60-80%) - Probably AI-generated
- **Very High** (> 80%) - Almost certainly AI-generated
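These bands map to a simple threshold function. The sketch below assumes each band's upper boundary is inclusive, which may differ from the actual implementation at the exact cutoffs.

```python
def confidence_label(score: float) -> str:
    """Map an overall 0-1 score to the confidence bands above."""
    if score < 0.20:
        return "Very Low"   # almost certainly human-written
    if score < 0.40:
        return "Low"
    if score < 0.60:
        return "Medium"
    if score <= 0.80:
        return "High"
    return "Very High"      # almost certainly AI-generated

print(confidence_label(0.78))  # High
```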
## Project Structure
```
slop-detect/
β”œβ”€β”€ backend/
β”‚   β”œβ”€β”€ config/
β”‚   β”‚   β”œβ”€β”€ settings.py           # App settings
β”‚   β”‚   └── detectors_config.py   # Detector configuration (FEATURE FLAGS HERE)
β”‚   β”œβ”€β”€ detectors/
β”‚   β”‚   β”œβ”€β”€ base.py               # Base detector class
β”‚   β”‚   β”œβ”€β”€ roberta.py            # RoBERTa detector
β”‚   β”‚   β”œβ”€β”€ perplexity.py         # Perplexity detector
β”‚   β”‚   β”œβ”€β”€ llmdet.py             # LLMDet detector
β”‚   β”‚   β”œβ”€β”€ hf_classifier.py      # HF classifier
β”‚   β”‚   β”œβ”€β”€ outfox.py             # OUTFOX detector
β”‚   β”‚   └── ensemble.py           # Ensemble manager
β”‚   β”œβ”€β”€ database/
β”‚   β”‚   β”œβ”€β”€ models.py             # SQLAlchemy models
β”‚   β”‚   └── db.py                 # Database manager
β”‚   β”œβ”€β”€ api/
β”‚   β”‚   β”œβ”€β”€ routes.py             # Flask API routes
β”‚   β”‚   └── models.py             # Pydantic request/response models
β”‚   β”œβ”€β”€ utils/
β”‚   β”‚   β”œβ”€β”€ file_parser.py        # PDF/DOCX/TXT parsing
β”‚   β”‚   └── highlighter.py        # Text highlighting utilities
β”‚   β”œβ”€β”€ main.py                   # Flask app entry point
β”‚   └── requirements.txt          # Python dependencies
β”œβ”€β”€ frontend/
β”‚   └── index.html                # Web UI (HTML + CSS + JS)
└── README.md                     # This file
```
## Customization
### Change Detector Weights
In `backend/config/detectors_config.py`:
```python
DETECTOR_WEIGHTS: Dict[str, float] = {
    "roberta": 0.40,         # Increase weight
    "perplexity": 0.30,
    "llmdet": 0.20,
    "hf_classifier": 0.10,
}
```
### Change Aggregation Method
In `backend/config/detectors_config.py`:
```python
AGGREGATION_METHOD = "max" # Options: weighted_average, max, voting
```
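For reference, plausible sketches of the `max` and `voting` strategies. The actual logic lives in `ensemble.py` and may differ, for example in the voting threshold.

```python
def aggregate_max(scores: dict) -> float:
    """'max': the most suspicious detector wins."""
    return max(scores.values())

def aggregate_voting(scores: dict, threshold: float = 0.5) -> float:
    """'voting': fraction of detectors that vote 'AI' (score above threshold)."""
    votes = sum(1 for s in scores.values() if s > threshold)
    return votes / len(scores)

scores = {"roberta": 0.85, "perplexity": 0.60, "llmdet": 0.40}
print(aggregate_max(scores))     # 0.85
print(aggregate_voting(scores))  # 2 of 3 voted AI
```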
### Use Different Models
In `backend/config/detectors_config.py`:
```python
ROBERTA_MODEL = "distilbert-base-uncased"
PERPLEXITY_MODEL = "gpt2-medium"
HF_CLASSIFIER_MODEL = "your-custom-model"
```
### Add Custom Detectors
1. Create a new file in `backend/detectors/`
2. Inherit from `BaseDetector`
3. Implement `detect()` method
4. Add to `ensemble.py` initialization
5. Add to `ENABLED_DETECTORS` in config
Example:
```python
from detectors.base import BaseDetector, DetectorResult
class CustomDetector(BaseDetector):
def __init__(self):
super().__init__(name="custom")
def detect(self, text: str) -> DetectorResult:
# Your detection logic here
score = calculate_ai_score(text)
return DetectorResult(
detector_name=self.name,
score=score,
explanation="Custom detection result"
)
```
## Performance Tips
1. **Model Caching** - Models are lazy-loaded and cached in memory
2. **Parallel Detection** - Detectors can run in parallel (future enhancement)
3. **Batch Processing** - Configure batch size for GPU processing
4. **Disable Unused Detectors** - Reduce load by disabling detectors you don't need
## Troubleshooting
### Slow First Run
- Models need to be downloaded from Hugging Face Hub
- Subsequent runs will use cached models
- First model download can take 1-5 minutes
### Out of Memory
- Reduce batch size in config
- Disable memory-intensive detectors
- Run on a machine with more RAM
### Model Not Found
```
transformers.utils.RepositoryNotFoundError: Model not found
```
- Model name is incorrect in config
- Check Hugging Face Hub for correct model name
### Database Locked
```
sqlite3.OperationalError: database is locked
```
- Close other connections to the database
- Ensure only one Flask instance is running
- Delete `.db-journal` file if present
## Future Enhancements
- [ ] Parallel detector execution
- [ ] GPU support optimization
- [ ] Custom model fine-tuning
- [ ] Batch analysis API
- [ ] User authentication/authorization
- [ ] Document highlighting with suspicious sections
- [ ] Advanced filtering and search
- [ ] Export results to PDF/Excel
- [ ] API rate limiting
- [ ] Webhook notifications
## License
MIT License - feel free to use and modify
## References
- [LLMDet](https://github.com/TrustedLLM/LLMDet)
- [RAID](https://github.com/liamdugan/raid)
- [OUTFOX](https://github.com/ryuryukke/OUTFOX)
- [AIGTD Survey](https://github.com/Nicozwy/AIGTD-Survey)
- [Plagiarism Detection](https://github.com/Kyle6012/plagiarism-detection)
- [Hugging Face Transformers](https://huggingface.co/transformers/)
## Support
For issues, questions, or suggestions, please open an issue on the project repository.