# Simplified Data Sanitization Documentation

## Overview

The simplified data sanitization module validates and sanitizes user input for the Recipe Recommendation Bot API. It is scoped to the recipe chatbot context and covers only the essential security protections.

## Features

### 🛡️ **Essential Security Protection**
- **XSS Prevention**: HTML encoding and basic script removal
- **Input Validation**: Length limits and content validation
- **Whitespace Normalization**: Clean formatting

### 🔧 **Simple Configuration**
- **Maximum Message Length**: 1000 characters
- **Minimum Message Length**: 1 character
- **Single Method**: One sanitization method for all inputs

## Usage

### Basic Sanitization

```python
from utils.sanitization import sanitize_user_input

# Sanitize any user input (chat messages, demo prompts)
clean_input = sanitize_user_input("What are some chicken recipes?")
```

### Advanced Usage

```python
from utils.sanitization import DataSanitizer

# Direct class usage
sanitizer = DataSanitizer()
clean_text = sanitizer.sanitize_input("User input")
```

## Security Patterns Handled

### Basic XSS Protection
- `<script>` tags → Removed
- `javascript:` URLs → Cleaned
- Event handlers (`onclick`, `onload`) → Removed
- HTML special characters (such as `&`) → Encoded as entities

### Input Validation
- Length limits (1-1000 characters)
- Empty input detection
- Whitespace normalization
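
The patterns above can be sketched with the standard library alone. The following is a minimal illustration, not the module's actual code: the function name `sketch_sanitize`, the regular expressions, and the order of operations (strip, normalize, validate length, then encode) are all assumptions chosen to match the behavior documented here.

```python
import html
import re

# Limits taken from the documented configuration.
MAX_MESSAGE_LENGTH = 1000
MIN_MESSAGE_LENGTH = 1

# Hypothetical patterns; the real module's regexes may differ.
SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
JS_URL = re.compile(r"javascript\s*:", re.IGNORECASE)
EVENT_HANDLER = re.compile(
    r"\bon\w+\s*=\s*(\"[^\"]*\"|'[^']*'|[^\s>]+)", re.IGNORECASE
)


def sketch_sanitize(text: str) -> str:
    """Illustrative stand-in for sanitize_user_input (assumed behavior)."""
    if not isinstance(text, str):
        raise ValueError("Input must be a string")

    # 1. Remove script blocks, javascript: prefixes, and inline event handlers.
    cleaned = SCRIPT_BLOCK.sub("", text)
    cleaned = JS_URL.sub("", cleaned)
    cleaned = EVENT_HANDLER.sub("", cleaned)

    # 2. Collapse runs of whitespace and trim both ends.
    cleaned = " ".join(cleaned.split())

    # 3. Enforce length limits before encoding (an assumption), so that
    #    entity expansion such as & -> &amp; does not count against the limit.
    if len(cleaned) < MIN_MESSAGE_LENGTH:
        raise ValueError("Input cannot be empty")
    if len(cleaned) > MAX_MESSAGE_LENGTH:
        raise ValueError(f"Input too long (maximum {MAX_MESSAGE_LENGTH} characters)")

    # 4. Encode HTML special characters as entities.
    return html.escape(cleaned)
```

Regex-based stripping of this kind is deliberately crude. For instance, the event-handler pattern in this sketch would also strip a harmless phrase like `onions=3`, so the final HTML encoding step is what the protection should ultimately rest on.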

## Integration

Sanitization is applied automatically in the FastAPI endpoints:

### Chat Endpoint
```python
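from pydantic import BaseModel, Field, validator
from utils.sanitization import sanitize_user_input

class ChatMessage(BaseModel):
    message: str = Field(..., min_length=1, max_length=1000)

    # Pydantic v1-style validator; a ValueError raised here becomes a validation error
    @validator('message')
    def sanitize_message_field(cls, v):
        return sanitize_user_input(v)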
```

### Demo Endpoint
```python
@app.get("/demo")
def demo(prompt: str = "What recipes do you have?"):
    sanitized_prompt = sanitize_user_input(prompt)
    # ... rest of the logic
```

## Error Handling

The sanitization raises `ValueError` for invalid input:

```python
# Inside a request handler, where user_input is the raw text from the client
try:
    clean_input = sanitize_user_input(user_input)
except ValueError as e:
    return {"error": f"Invalid input: {e}"}
```

## Testing

Run the sanitization tests:

```bash
python3 test_sanitization.py
```

The test suite covers:
- Normal recipe-related messages
- Basic harmful content (scripts, JavaScript)
- Length validation
- Whitespace normalization
- Edge cases
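
As a rough illustration of what such a suite might check, the sketch below turns the documented examples into a table of cases. The names `CASES` and `check_sanitizer` are hypothetical and are not taken from `test_sanitization.py`; the sanitizer is passed in as an argument so the block stays self-contained.

```python
from typing import Callable, List, Optional, Tuple

# (raw input, expected output); None means a ValueError is expected.
# Cases are copied from the Examples section of this document.
CASES: List[Tuple[str, Optional[str]]] = [
    ("What are chicken recipes?", "What are chicken recipes?"),
    ("<script>alert('xss')</script>Tell me about pasta", "Tell me about pasta"),
    ("   How to cook rice?   ", "How to cook rice?"),
    ("What about desserts & sweets?", "What about desserts &amp; sweets?"),
    ("", None),
    ("a" * 1001, None),
]


def check_sanitizer(sanitize: Callable[[str], str]) -> List[str]:
    """Run every case against `sanitize` and return readable failure messages."""
    failures: List[str] = []
    for raw, expected in CASES:
        try:
            actual = sanitize(raw)
        except ValueError:
            if expected is not None:
                failures.append(f"{raw[:30]!r}: unexpected ValueError")
            continue
        if expected is None:
            failures.append(f"{raw[:30]!r}: expected ValueError, got {actual[:30]!r}")
        elif actual != expected:
            failures.append(f"{raw[:30]!r}: expected {expected!r}, got {actual!r}")
    return failures
```

In the project this would presumably be called as `check_sanitizer(sanitize_user_input)`, with an empty list meaning every documented example behaves as described.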

## What's Simplified

### Removed Overly Complex Features:
- ❌ SQL injection patterns (not relevant for an LLM chatbot)
- ❌ Command injection patterns (not applicable)
- ❌ Separate strict/relaxed modes (unnecessary complexity)
- ❌ Multiple sanitization methods (unified approach)

### Kept Essential Features:
- ✅ Basic XSS protection
- ✅ Input length validation
- ✅ HTML encoding
- ✅ Whitespace normalization
- ✅ Clear error messages

## Performance

- **Lightweight**: Minimal regex patterns
- **Fast**: Simple operations only
- **Memory Efficient**: No complex state
- **Recipe-Focused**: Context-appropriate validation

## Examples

### Valid Inputs (Cleaned):
```text
"What are chicken recipes?" → "What are chicken recipes?"
"<script>alert('xss')</script>Tell me about pasta" → "Tell me about pasta"
"   How to cook rice?   " → "How to cook rice?"
"What about desserts & sweets?" → "What about desserts &amp; sweets?"
```

### Invalid Inputs (Rejected):
```text
"" → ValueError: Input cannot be empty
"a" * 1001 → ValueError: Input too long (maximum 1000 characters)
```

## Best Practices

1. **Keep It Simple**: Focus on the threats a recipe chatbot actually faces
2. **Context Appropriate**: Don't over-engineer for non-existent threats
3. **User Friendly**: Allow normal recipe-related punctuation
4. **Clear Errors**: Provide helpful error messages
5. **Test Regularly**: Verify with real recipe queries

This simplified approach provides adequate protection while keeping the recipe recommendation chatbot easy to use.