plg4-dev-server / backend /docs /sanitization_guide.md

Jesse Johnson

New commit for backend deployment: 2025-09-25_13-24-03

c59d808 5 months ago

	# Simplified Data Sanitization Documentation

	## Overview

	The simplified data sanitization module provides focused input validation and sanitization for the Recipe Recommendation Bot API. It is designed specifically for a recipe chatbot context and covers only the essential security protections.

	## Features

	### 🛡️ Essential Security Protection
	- XSS Prevention: HTML encoding and basic script removal
	- Input Validation: Length limits and content validation
	- Whitespace Normalization: Clean formatting

	### 🔧 Simple Configuration
	- Maximum Message Length: 1000 characters
	- Minimum Message Length: 1 character
	- Single Method: One sanitization method for all inputs

	## Usage

	### Basic Sanitization

	```python
	from utils.sanitization import sanitize_user_input

	# Sanitize any user input (chat messages, demo prompts)
	clean_input = sanitize_user_input("What are some chicken recipes?")
	```

	### Advanced Usage

	```python
	from utils.sanitization import DataSanitizer

	# Direct class usage
	sanitizer = DataSanitizer()
	clean_text = sanitizer.sanitize_input("User input")
	```

	## Security Patterns Handled

	### Basic XSS Protection
	- `<script>` tags → Removed
	- `javascript:` URLs → Cleaned
	- Event handlers (`onclick`, `onload`) → Removed
	- HTML special characters (such as `<`, `>`, `&`) → Encoded as HTML entities

	### Input Validation
	- Length limits (1-1000 characters)
	- Empty input detection
	- Whitespace normalization
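
The patterns above could be combined into a single pass along the following lines. This is a minimal sketch, not the module's actual implementation: the function name `sketch_sanitize`, the regular expressions, and the order of the steps are assumptions made for illustration. Only the length limits and the error messages come from this guide.

```python
import html
import re

MAX_LENGTH = 1000  # documented maximum message length
MIN_LENGTH = 1     # documented minimum message length

# Hypothetical patterns; the real module's regexes are not shown in this guide.
SCRIPT_TAG = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
JS_URL = re.compile(r"javascript\s*:", re.IGNORECASE)
EVENT_HANDLER = re.compile(
    r"""\bon\w+\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE
)
WHITESPACE = re.compile(r"\s+")


def sketch_sanitize(text: str) -> str:
    """Illustrative sanitizer: strip risky markup, encode, normalize, validate."""
    if not isinstance(text, str):
        raise ValueError("Input must be a string")
    cleaned = SCRIPT_TAG.sub("", text)        # drop <script>...</script> blocks
    cleaned = JS_URL.sub("", cleaned)         # drop the javascript: scheme
    cleaned = EVENT_HANDLER.sub("", cleaned)  # drop onclick=..., onload=..., etc.
    cleaned = html.escape(cleaned, quote=False)  # encode &, < and >
    cleaned = WHITESPACE.sub(" ", cleaned).strip()  # collapse runs of whitespace
    if len(cleaned) < MIN_LENGTH:
        raise ValueError("Input cannot be empty")
    if len(cleaned) > MAX_LENGTH:
        raise ValueError(f"Input too long (maximum {MAX_LENGTH} characters)")
    return cleaned
```

In this sketch the length check runs after cleaning, so a message that consists only of a `<script>` block is rejected as empty. `html.escape` is called with `quote=False` so that apostrophes and quotation marks in ordinary recipe questions stay as typed; the real module may encode them.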

	## Integration

	The sanitization is automatically applied in FastAPI endpoints:

	### Chat Endpoint
	```python
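	from pydantic import BaseModel, Field, validator  # Pydantic v1-style API
	from utils.sanitization import sanitize_user_input

	class ChatMessage(BaseModel):
	    message: str = Field(..., min_length=1, max_length=1000)

	    # Replace the raw message with its sanitized form during validation
	    @validator('message')
	    def sanitize_message_field(cls, v):
	        return sanitize_user_input(v)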
	```

	### Demo Endpoint
	```python
	@app.get("/demo")
	def demo(prompt: str = "What recipes do you have?"):
	    # Sanitize the query parameter before any further processing
	    sanitized_prompt = sanitize_user_input(prompt)
	    # ... rest of the logic
	```

	## Error Handling

	`sanitize_user_input` raises `ValueError` for invalid input, such as empty or over-length messages:

	```python
	try:
	    clean_input = sanitize_user_input(user_input)
	except ValueError as e:
	    # Surface the sanitizer's message to the caller
	    return {"error": f"Invalid input: {e}"}
	```
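
The same pattern can be wrapped in a small helper so that every caller gets a consistent response shape. The sketch below is illustrative only: `handle_message` and `strict_strip` are hypothetical names, and `strict_strip` is a stand-in for `sanitize_user_input` so the example runs without the project installed.

```python
from typing import Callable, Dict


def handle_message(user_input: str, sanitize: Callable[[str], str]) -> Dict[str, str]:
    """Return the cleaned message, or an error payload if sanitization rejects it."""
    try:
        return {"message": sanitize(user_input)}
    except ValueError as exc:
        return {"error": f"Invalid input: {exc}"}


def strict_strip(text: str) -> str:
    """Hypothetical stand-in for sanitize_user_input: trims and rejects empty text."""
    stripped = text.strip()
    if not stripped:
        raise ValueError("Input cannot be empty")
    return stripped


print(handle_message("  How to cook rice?  ", strict_strip))
print(handle_message("   ", strict_strip))
```

Passing the sanitizer in as an argument keeps the error-handling logic testable on its own; in the application the real `sanitize_user_input` would be supplied instead of the stand-in.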

	## Testing

	Run the sanitization tests:

	```bash
	python3 test_sanitization.py
	```

	The test suite covers:
	- Normal recipe-related messages
	- Basic harmful content (scripts, JavaScript)
	- Length validation
	- Whitespace normalization
	- Edge cases
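
The contents of `test_sanitization.py` are not shown in this guide. As a rough sketch of what tests for the categories above might look like, the following uses `unittest` with a deliberately tiny stand-in sanitizer. `stand_in_sanitize` and `SanitizationCases` are hypothetical names, and the stand-in only mimics the documented behavior closely enough for the cases to run.

```python
import html
import re
import unittest


def stand_in_sanitize(text: str) -> str:
    """Hypothetical stand-in so the cases run without the project installed."""
    text = re.sub(r"<script\b.*?</script\s*>", "", text, flags=re.I | re.S)
    text = re.sub(r"\s+", " ", html.escape(text, quote=False)).strip()
    if not text:
        raise ValueError("Input cannot be empty")
    if len(text) > 1000:
        raise ValueError("Input too long (maximum 1000 characters)")
    return text


class SanitizationCases(unittest.TestCase):
    # Swap in the real sanitize_user_input when running inside the project.
    sanitize = staticmethod(stand_in_sanitize)

    def test_normal_message_is_unchanged(self):
        self.assertEqual(self.sanitize("What are chicken recipes?"),
                         "What are chicken recipes?")

    def test_script_block_is_removed(self):
        dirty = "<script>alert('xss')</script>Tell me about pasta"
        self.assertEqual(self.sanitize(dirty), "Tell me about pasta")

    def test_whitespace_is_normalized(self):
        self.assertEqual(self.sanitize(" How to  cook\trice? "), "How to cook rice?")

    def test_length_limits(self):
        self.assertEqual(len(self.sanitize("a" * 1000)), 1000)
        with self.assertRaises(ValueError):
            self.sanitize("a" * 1001)

    def test_empty_and_blank_input_rejected(self):
        for value in ("", "   "):
            with self.assertRaises(ValueError):
                self.sanitize(value)
```

Saved as a module, these cases can be run with `python3 -m unittest <module_name>`.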

	## What's Simplified

	### Removed Overly Complex Features:
	- ❌ SQL injection patterns (not relevant for an LLM chatbot)
	- ❌ Command injection patterns (not applicable)
	- ❌ Separate strict/relaxed modes (unnecessary complexity)
	- ❌ Multiple sanitization methods (unified approach)

	### Kept Essential Features:
	- ✅ Basic XSS protection
	- ✅ Input length validation
	- ✅ HTML encoding
	- ✅ Whitespace normalization
	- ✅ Clear error messages

	## Performance

	- Lightweight: Minimal regex patterns
	- Fast: Simple operations only
	- Memory Efficient: No complex state
	- Recipe-Focused: Context-appropriate validation

	## Examples

	### Valid Inputs (Cleaned):
	```python
	"What are chicken recipes?" → "What are chicken recipes?"
	"<script>alert('xss')</script>Tell me about pasta" → "Tell me about pasta"
	" How to cook rice? " → "How to cook rice?"
	"What about desserts & sweets?" → "What about desserts &amp; sweets?"
	```

	### Invalid Inputs (Rejected):
	```python
	"" → ValueError: Input cannot be empty
	"a" * 1001 → ValueError: Input too long (maximum 1000 characters)
	```

	## Best Practices

	1. Keep It Simple: Focus on the actual threats to a recipe chatbot
	2. Context Appropriate: Don't over-engineer for non-existent threats
	3. User Friendly: Allow normal recipe-related punctuation
	4. Clear Errors: Provide helpful error messages
	5. Test Regularly: Verify with real recipe queries

	This simplified approach provides adequate protection while maintaining usability in a recipe recommendation chatbot context.