Jesse Johnson
New commit for backend deployment: 2025-09-25_13-24-03
# Simplified Data Sanitization Documentation
## Overview
The simplified data sanitization module provides focused input validation and sanitization for the Recipe Recommendation Bot API. It is designed specifically for a recipe chatbot context and covers only the essential security protections listed below.
## Features
### 🛡️ **Essential Security Protection**
- **XSS Prevention**: HTML encoding and basic script removal
- **Input Validation**: Length limits and content validation
- **Whitespace Normalization**: Clean formatting
### 🔧 **Simple Configuration**
- **Maximum Message Length**: 1000 characters
- **Minimum Message Length**: 1 character
- **Single Method**: One sanitization method for all inputs
## Usage
### Basic Sanitization
```python
from utils.sanitization import sanitize_user_input
# Sanitize any user input (chat messages, demo prompts)
clean_input = sanitize_user_input("What are some chicken recipes?")
```
### Advanced Usage
```python
from utils.sanitization import DataSanitizer
# Direct class usage
sanitizer = DataSanitizer()
clean_text = sanitizer.sanitize_input("User input")
```
## Security Patterns Handled
### Basic XSS Protection
- `<script>` tags → Removed
- `javascript:` URLs → Cleaned
- Event handlers (`onclick`, `onload`) → Removed
- HTML special characters (such as `&`, `<`, `>`) → Encoded as entities
### Input Validation
- Length limits (1-1000 characters)
- Empty input detection
- Whitespace normalization
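
The patterns and checks listed above could be combined roughly as follows. This is a minimal sketch, not the module's actual code: the function name `sketch_sanitize`, the regular expressions, the order of the steps, and the choice to leave quotes unescaped are all assumptions made for illustration. Only the length limits and the error messages come from this guide.

```python
import html
import re

MAX_LENGTH = 1000  # limits taken from the configuration section of this guide
MIN_LENGTH = 1

# Hypothetical patterns: the real module's regexes are not shown in this guide.
_SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
_JS_URL = re.compile(r"javascript\s*:", re.IGNORECASE)
# A short explicit list avoids matching ordinary words such as "onion".
_EVENT_HANDLER = re.compile(
    r"\bon(?:click|load|error|mouseover)\s*=\s*(?:\"[^\"]*\"|'[^']*'|[^\s>]+)",
    re.IGNORECASE,
)


def sketch_sanitize(text: str) -> str:
    """Illustrative pipeline: strip risky markup, normalize, validate, encode."""
    if not isinstance(text, str):
        raise ValueError("Input must be a string")

    cleaned = _SCRIPT_BLOCK.sub("", text)      # drop <script>...</script> blocks
    cleaned = _JS_URL.sub("", cleaned)         # drop the javascript: scheme
    cleaned = _EVENT_HANDLER.sub("", cleaned)  # drop inline event handlers
    cleaned = " ".join(cleaned.split())        # collapse and trim whitespace

    if len(cleaned) < MIN_LENGTH:
        raise ValueError("Input cannot be empty")
    if len(cleaned) > MAX_LENGTH:
        raise ValueError(f"Input too long (maximum {MAX_LENGTH} characters)")

    # Assumption: quotes are left alone so apostrophes in recipe queries survive.
    return html.escape(cleaned, quote=False)
```

In this sketch the length check runs on the cleaned text before encoding, so a message made up entirely of a `<script>` block is rejected as empty. Whether the real module checks length before or after cleaning is not stated in this guide.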
## Integration
The sanitization is automatically applied in FastAPI endpoints:
### Chat Endpoint
```python
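from pydantic import BaseModel, Field, validator
from utils.sanitization import sanitize_user_input

class ChatMessage(BaseModel):
    message: str = Field(..., min_length=1, max_length=1000)

    # Sanitize the message as part of model validation
    @validator('message')
    def sanitize_message_field(cls, v):
        return sanitize_user_input(v)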
```
### Demo Endpoint
```python
@app.get("/demo")
def demo(prompt: str = "What recipes do you have?"):
    sanitized_prompt = sanitize_user_input(prompt)
    # ... rest of the logic
```
## Error Handling
`sanitize_user_input` raises `ValueError` for invalid input:
```python
try:
    clean_input = sanitize_user_input(user_input)
except ValueError as e:
    return {"error": f"Invalid input: {str(e)}"}
```
## Testing
Run the sanitization tests:
```bash
python3 test_sanitization.py
```
The test suite covers:
- Normal recipe-related messages
- Basic harmful content (scripts, JavaScript)
- Length validation
- Whitespace normalization
- Edge cases
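
The contents of `test_sanitization.py` are not shown in this guide. As a rough illustration of the categories above, a few `unittest` cases might look like the sketch below. The class name and the individual cases are assumptions, and the local `sanitize_user_input` is a simplified placeholder so the block runs on its own. In the real suite it would be replaced by `from utils.sanitization import sanitize_user_input`.

```python
import html
import unittest


def sanitize_user_input(text: str) -> str:
    """Simplified placeholder; swap for the real import in the actual suite."""
    cleaned = " ".join(text.split())
    if not cleaned:
        raise ValueError("Input cannot be empty")
    if len(cleaned) > 1000:
        raise ValueError("Input too long (maximum 1000 characters)")
    return html.escape(cleaned, quote=False)


class SanitizationSketchTests(unittest.TestCase):
    def test_normal_message_is_unchanged(self):
        self.assertEqual(
            sanitize_user_input("What are chicken recipes?"),
            "What are chicken recipes?",
        )

    def test_whitespace_is_normalized(self):
        self.assertEqual(sanitize_user_input("  How to   cook rice? "), "How to cook rice?")

    def test_ampersand_is_encoded(self):
        self.assertEqual(sanitize_user_input("desserts & sweets"), "desserts &amp; sweets")

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            sanitize_user_input("")

    def test_length_boundaries(self):
        # 1000 characters is the documented maximum; 1001 must be rejected.
        self.assertEqual(len(sanitize_user_input("a" * 1000)), 1000)
        with self.assertRaises(ValueError):
            sanitize_user_input("a" * 1001)
```

Saved as a module, these cases can be run with `python3 -m unittest`.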
## What's Simplified
### Removed Overly Complex Features:
- ❌ SQL injection patterns (not relevant for LLM chatbot)
- ❌ Command injection patterns (not applicable)
- ❌ Separate strict/relaxed modes (unnecessary complexity)
- ❌ Multiple sanitization methods (unified approach)
### Kept Essential Features:
- ✅ Basic XSS protection
- ✅ Input length validation
- ✅ HTML encoding
- ✅ Whitespace normalization
- ✅ Clear error messages
## Performance
- **Lightweight**: Minimal regex patterns
- **Fast**: Simple operations only
- **Memory Efficient**: No complex state
- **Recipe-Focused**: Context-appropriate validation
## Examples
### Valid Inputs (Cleaned):
```python
"What are chicken recipes?" → "What are chicken recipes?"
"<script>alert('xss')</script>Tell me about pasta" → "Tell me about pasta"
" How to cook rice? " → "How to cook rice?"
"What about desserts & sweets?" → "What about desserts &amp; sweets?"
```
### Invalid Inputs (Rejected):
```python
"" → ValueError: Input cannot be empty
"a" * 1001 → ValueError: Input too long (maximum 1000 characters)
```
## Best Practices
1. **Keep It Simple**: Focus on the actual threats a recipe chatbot faces
2. **Context Appropriate**: Don't over-engineer for non-existent threats
3. **User Friendly**: Allow normal recipe-related punctuation
4. **Clear Errors**: Provide helpful error messages
5. **Test Regularly**: Verify with real recipe queries
This simplified approach provides adequate protection while maintaining usability for a recipe recommendation chatbot context.