---
title: AI Slop Detector
emoji: πŸ”
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
---
# πŸ” AI Slop Detector
A comprehensive Python API and web UI for detecting AI-generated content in PDFs, DOCX files, and raw text, built on an ensemble of state-of-the-art detection methods.
## Features
✨ **Multi-Detector Ensemble**
- **RoBERTa Classifier** - Fine-tuned RoBERTa model for AI text detection
- **Perplexity Analysis** - Detects statistical anomalies and repetitive patterns
- **LLMDet** - Entropy and log-probability based detection
- **HuggingFace Classifier** - Generic transformer-based classification
- **OUTFOX Statistical** - Word/sentence length and vocabulary analysis
✨ **Easy Feature Flags**
- Enable/disable each detector with a single config change
- Adjust detector weights for ensemble averaging
- Environment variable overrides
✨ **Multiple File Formats**
- PDF documents
- DOCX/DOC files
- Plain text files
- Raw text input
✨ **Persistent Storage**
- SQLite database (default, configurable)
- Upload history with timestamps
- Detailed result tracking and statistics
✨ **Web UI**
- Beautiful, responsive interface
- Drag-and-drop file upload
- Real-time analysis results
- History and statistics views
✨ **REST API**
- Analyze text and files via HTTP
- Get historical results
- Query statistics
- Full result management
## Installation
### Prerequisites
- Python 3.8+
- pip or conda
### Setup
1. **Clone/Navigate to the project:**
```bash
cd slop-detect
```
2. **Create a Python virtual environment:**
```bash
python -m venv venv
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
```
3. **Install dependencies:**
```bash
pip install -r backend/requirements.txt
```
## Configuration
### Enable/Disable Detectors
Edit `backend/config/detectors_config.py`:
```python
ENABLED_DETECTORS: Dict[str, bool] = {
    "roberta": True,         # Enable RoBERTa
    "perplexity": True,      # Enable Perplexity
    "llmdet": True,          # Enable LLMDet
    "hf_classifier": True,   # Enable HF Classifier
    "outfox": False,         # Disable OUTFOX
}
```
### Set Detector Weights
```python
DETECTOR_WEIGHTS: Dict[str, float] = {
    "roberta": 0.30,         # 30% weight
    "perplexity": 0.25,      # 25% weight
    "llmdet": 0.25,          # 25% weight
    "hf_classifier": 0.20,   # 20% weight
    "outfox": 0.00,          # 0% weight (not used)
}
```
### Environment-based Configuration
You can also use environment variables to override config:
```bash
# Enable/disable detectors
export ENABLE_ROBERTA=true
export ENABLE_PERPLEXITY=true
export ENABLE_LLMDET=true
export ENABLE_HF_CLASSIFIER=true
export ENABLE_OUTFOX=false
# Database
export DATABASE_URL=sqlite:///slop_detect.db
export UPLOAD_FOLDER=./uploads
# Flask
export HOST=0.0.0.0
export PORT=5000
export DEBUG=False
```
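Inside the app, a small helper can translate these variables into booleans. The sketch below is illustrative only: the helper name `env_flag` and the set of accepted truthy values are assumptions, not necessarily what `settings.py` actually does.

```python
import os

def env_flag(name: str, default: bool) -> bool:
    """Read a boolean feature flag from the environment (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")

# Example: override the OUTFOX flag from the shell
ENABLE_OUTFOX = env_flag("ENABLE_OUTFOX", default=False)
```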
## Running the Application
### Start the Flask Server
```bash
cd backend
python main.py
```
The API will be available at `http://localhost:5000`.
### API Endpoints
#### Health Check
```
GET /api/health
```
#### Analyze Text
```
POST /api/analyze/text
Content-Type: application/json
{
  "text": "Your text here...",
  "filename": "optional_name.txt",
  "user_id": "optional_user_id"
}

Response:

{
  "status": "success",
  "result_id": 1,
  "overall_ai_score": 0.78,
  "overall_ai_score_percentage": "78.0%",
  "overall_confidence": "high",
  "status_label": "Likely AI",
  "detector_results": {
    "roberta": {
      "detector_name": "roberta",
      "score": 0.85,
      "confidence": "high",
      "explanation": "Very strong indicators of AI-generated text..."
    },
    ...
  },
  "enabled_detectors": ["roberta", "perplexity", "llmdet", "hf_classifier"],
  "text_stats": {
    "character_count": 1500,
    "word_count": 250,
    "sentence_count": 15,
    "average_word_length": 4.8
  }
}
```
#### Analyze File
```
POST /api/analyze/file
FormData:
- file: <file>
- user_id: <optional_user_id>
Response: (same as analyze/text)
```
#### Get All Results
```
GET /api/results?page=1&limit=10&sort=recent
Response:
{
  "status": "success",
  "page": 1,
  "limit": 10,
  "total_count": 42,
  "results": [...]
}
```
#### Get Specific Result
```
GET /api/results/{result_id}
Response:
{
  "status": "success",
  "result": {
    "id": 1,
    "filename": "document.pdf",
    "overall_ai_score": 0.78,
    "overall_ai_score_percentage": "78.0%",
    ...
  }
}
```
#### Delete Result
```
DELETE /api/results/{result_id}
```
#### Update Result
```
PUT /api/results/{result_id}
Content-Type: application/json
{
  "notes": "Manual review: likely AI",
  "is_flagged": true
}
```
#### Get Statistics
```
GET /api/statistics/summary
Response:
{
  "status": "success",
  "summary": {
    "total_analyses": 42,
    "average_ai_score": 0.65,
    "total_text_analyzed": 125000,
    "likely_human": 15,
    "suspicious": 12,
    "likely_ai": 15
  }
}
```
#### Get Configuration
```
GET /api/config
Response:
{
  "status": "success",
  "config": {
    "enabled_detectors": [
      "roberta", "perplexity", "llmdet", "hf_classifier"
    ],
    "aggregation_method": "weighted_average",
    "detector_weights": {...},
    "detector_info": {...}
  }
}
```
## Web Interface
Open `http://localhost:5000` in your browser to access the web UI.
### Features:
- **Upload Section** - Drag-and-drop or click to upload files
- **Text Analysis** - Paste text directly
- **Results Dashboard** - View detailed analysis results
- **History Tab** - See all previous analyses
- **Statistics Tab** - View aggregate statistics
## How It Works
### Detection Process
1. **File Parsing** - Extracts text from PDF/DOCX/TXT files
2. **Text Cleaning** - Normalizes whitespace and formatting
3. **Detector Ensemble** - Runs each enabled detector on the text
4. **Score Aggregation** - Combines detector scores using weighted average, max, or voting
5. **Result Storage** - Saves to database with full metadata
6. **Response** - Returns overall score and per-detector breakdown
### Detector Details
#### RoBERTa Detector
- **Model**: roberta-base-openai-detector
- **Type**: Transformer-based classification
- **Output**: 0-1 probability score
- **Speed**: Medium
#### Perplexity Detector
- **Model**: GPT-2
- **Method**: Analyzes token probability distributions
- **Detects**: Repetitive patterns, unusual word choices
- **Output**: 0-1 score based on perplexity, repetition, AI phrases
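As a rough illustration of the metric itself: perplexity is the exponential of the mean negative log-probability the model assigns to each token. The real detector computes token probabilities with GPT-2; the toy probabilities below are hand-picked for the example.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability) over the tokens.
    Unusually low perplexity (very predictable text) is a common AI signal."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Highly predictable tokens -> low perplexity
print(perplexity([0.9, 0.8, 0.95]))   # ~1.13
# Surprising tokens -> high perplexity
print(perplexity([0.05, 0.1, 0.02]))  # ~21.5
```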
#### LLMDet Detector
- **Model**: BERT
- **Method**: Entropy and log-probability analysis
- **Detects**: Predictable sequences, unusual statistical patterns
- **Output**: 0-1 score from combined metrics
#### HF Classifier
- **Model**: Configurable (default: BERT)
- **Type**: Generic sequence classification
- **Output**: 0-1 probability score
#### OUTFOX Statistical
- **Type**: Statistical signature analysis
- **Detects**: Unusual word length distributions, sentence structure patterns, vocabulary diversity
- **Output**: 0-1 score from multiple statistical metrics
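A minimal sketch of the kind of statistics this style of detection relies on. The feature names and the toy example are illustrative, not the detector's actual implementation.

```python
import re
from statistics import mean, pstdev

def text_signature(text: str) -> dict:
    """Toy statistical signature: average word length, sentence-length spread,
    and vocabulary diversity (type-token ratio). Very uniform sentence lengths
    and low diversity are weak AI signals."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    sent_lens = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return {
        "avg_word_length": mean(len(w) for w in words),
        "sentence_length_stdev": pstdev(sent_lens) if len(sent_lens) > 1 else 0.0,
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
    }

sig = text_signature("The cat sat. The cat sat. The cat sat.")
# type_token_ratio is 3/9: only 3 distinct words in 9 tokens, i.e. low diversity
```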
### Scoring
Default aggregation: **Weighted Average**
```
Overall Score = Ξ£ (normalized_detector_score Γ— weight)
```
Each detector's score is normalized to 0-1 range, then multiplied by its configured weight. The sum is clamped to [0, 1].
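That aggregation can be sketched in a few lines; the detector names and scores below are illustrative values, not output from the real ensemble.

```python
def aggregate_weighted(scores: dict, weights: dict) -> float:
    """Overall Score = sum(normalized_detector_score * weight), clamped to [0, 1].
    Scores are assumed already normalized to 0-1 by each detector."""
    total = sum(scores[name] * weights.get(name, 0.0) for name in scores)
    return min(max(total, 0.0), 1.0)

scores  = {"roberta": 0.85, "perplexity": 0.60, "llmdet": 0.70, "hf_classifier": 0.50}
weights = {"roberta": 0.30, "perplexity": 0.25, "llmdet": 0.25, "hf_classifier": 0.20}
print(aggregate_weighted(scores, weights))  # ~0.68
```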
### Confidence Levels
- **Very Low** (< 20%) - Almost certainly human-written
- **Low** (20-40%) - Probably human-written
- **Medium** (40-60%) - Uncertain
- **High** (60-80%) - Probably AI-generated
- **Very High** (> 80%) - Almost certainly AI-generated
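These bands map to a simple threshold function. The sketch below assumes each band's upper boundary is inclusive, which may differ from the actual implementation at the exact cutoffs.

```python
def confidence_label(score: float) -> str:
    """Map an overall 0-1 score to the confidence bands above."""
    if score < 0.20:
        return "Very Low"   # almost certainly human-written
    if score < 0.40:
        return "Low"
    if score < 0.60:
        return "Medium"
    if score <= 0.80:
        return "High"
    return "Very High"      # almost certainly AI-generated

print(confidence_label(0.78))  # High
```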
## Project Structure
```
slop-detect/
β”œβ”€β”€ backend/
β”‚   β”œβ”€β”€ config/
β”‚   β”‚   β”œβ”€β”€ settings.py           # App settings
β”‚   β”‚   └── detectors_config.py   # Detector configuration (FEATURE FLAGS HERE)
β”‚   β”œβ”€β”€ detectors/
β”‚   β”‚   β”œβ”€β”€ base.py               # Base detector class
β”‚   β”‚   β”œβ”€β”€ roberta.py            # RoBERTa detector
β”‚   β”‚   β”œβ”€β”€ perplexity.py         # Perplexity detector
β”‚   β”‚   β”œβ”€β”€ llmdet.py             # LLMDet detector
β”‚   β”‚   β”œβ”€β”€ hf_classifier.py      # HF classifier
β”‚   β”‚   β”œβ”€β”€ outfox.py             # OUTFOX detector
β”‚   β”‚   └── ensemble.py           # Ensemble manager
β”‚   β”œβ”€β”€ database/
β”‚   β”‚   β”œβ”€β”€ models.py             # SQLAlchemy models
β”‚   β”‚   └── db.py                 # Database manager
β”‚   β”œβ”€β”€ api/
β”‚   β”‚   β”œβ”€β”€ routes.py             # Flask API routes
β”‚   β”‚   └── models.py             # Pydantic request/response models
β”‚   β”œβ”€β”€ utils/
β”‚   β”‚   β”œβ”€β”€ file_parser.py        # PDF/DOCX/TXT parsing
β”‚   β”‚   └── highlighter.py        # Text highlighting utilities
β”‚   β”œβ”€β”€ main.py                   # Flask app entry point
β”‚   └── requirements.txt          # Python dependencies
β”œβ”€β”€ frontend/
β”‚   └── index.html                # Web UI (HTML + CSS + JS)
└── README.md                     # This file
```
## Customization
### Change Detector Weights
In `backend/config/detectors_config.py`:
```python
DETECTOR_WEIGHTS: Dict[str, float] = {
    "roberta": 0.40,         # Increase weight
    "perplexity": 0.30,
    "llmdet": 0.20,
    "hf_classifier": 0.10,
}
```
### Change Aggregation Method
In `backend/config/detectors_config.py`:
```python
AGGREGATION_METHOD = "max" # Options: weighted_average, max, voting
```
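For reference, plausible sketches of the `max` and `voting` strategies. The actual logic lives in `ensemble.py` and may differ, for example in the voting threshold.

```python
def aggregate_max(scores: dict) -> float:
    """'max': the most suspicious detector wins."""
    return max(scores.values())

def aggregate_voting(scores: dict, threshold: float = 0.5) -> float:
    """'voting': fraction of detectors that vote 'AI' (score above threshold)."""
    votes = sum(1 for s in scores.values() if s > threshold)
    return votes / len(scores)

scores = {"roberta": 0.85, "perplexity": 0.60, "llmdet": 0.40}
print(aggregate_max(scores))     # 0.85
print(aggregate_voting(scores))  # 2 of 3 voted AI
```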
### Use Different Models
In `backend/config/detectors_config.py`:
```python
ROBERTA_MODEL = "distilbert-base-uncased"
PERPLEXITY_MODEL = "gpt2-medium"
HF_CLASSIFIER_MODEL = "your-custom-model"
```
### Add Custom Detectors
1. Create a new file in `backend/detectors/`
2. Inherit from `BaseDetector`
3. Implement `detect()` method
4. Add to `ensemble.py` initialization
5. Add to `ENABLED_DETECTORS` in config
Example:
```python
from detectors.base import BaseDetector, DetectorResult
class CustomDetector(BaseDetector):
def __init__(self):
super().__init__(name="custom")
def detect(self, text: str) -> DetectorResult:
# Your detection logic here
score = calculate_ai_score(text)
return DetectorResult(
detector_name=self.name,
score=score,
explanation="Custom detection result"
)
```
## Performance Tips
1. **Model Caching** - Models are lazy-loaded and cached in memory
2. **Parallel Detection** - Detectors can run in parallel (future enhancement)
3. **Batch Processing** - Configure batch size for GPU processing
4. **Disable Unused Detectors** - Reduce load by disabling detectors you don't need
## Troubleshooting
### Slow First Run
- Models need to be downloaded from Hugging Face Hub
- Subsequent runs will use cached models
- First model download can take 1-5 minutes
### Out of Memory
- Reduce batch size in config
- Disable memory-intensive detectors
- Run on a machine with more RAM
### Model Not Found
```
transformers.utils.RepositoryNotFoundError: Model not found
```
- Model name is incorrect in config
- Check Hugging Face Hub for correct model name
### Database Locked
```
sqlite3.OperationalError: database is locked
```
- Close other connections to the database
- Ensure only one Flask instance is running
- Delete `.db-journal` file if present
## Future Enhancements
- [ ] Parallel detector execution
- [ ] GPU support optimization
- [ ] Custom model fine-tuning
- [ ] Batch analysis API
- [ ] User authentication/authorization
- [ ] Document highlighting with suspicious sections
- [ ] Advanced filtering and search
- [ ] Export results to PDF/Excel
- [ ] API rate limiting
- [ ] Webhook notifications
## License
MIT License - feel free to use and modify
## References
- [LLMDet](https://github.com/TrustedLLM/LLMDet)
- [RAID](https://github.com/liamdugan/raid)
- [OUTFOX](https://github.com/ryuryukke/OUTFOX)
- [AIGTD Survey](https://github.com/Nicozwy/AIGTD-Survey)
- [Plagiarism Detection](https://github.com/Kyle6012/plagiarism-detection)
- [Hugging Face Transformers](https://huggingface.co/transformers/)
## Support
For issues, questions, or suggestions, please open an issue on the project repository.