| # Simplified Data Sanitization Documentation | |
| ## Overview | |
| The simplified data sanitization module provides focused input validation and sanitization for the Recipe Recommendation Bot API. It is designed specifically for a recipe chatbot context and covers only the essential security protections. | |
| ## Features | |
| ### 🛡️ **Essential Security Protection** | |
| - **XSS Prevention**: HTML encoding and basic script removal | |
| - **Input Validation**: Length limits and content validation | |
| - **Whitespace Normalization**: Clean formatting | |
| ### 🔧 **Simple Configuration** | |
| - **Maximum Message Length**: 1000 characters | |
| - **Minimum Message Length**: 1 character | |
| - **Single Method**: One sanitization method for all inputs | |
| ## Usage | |
| ### Basic Sanitization | |
| ```python | |
| from utils.sanitization import sanitize_user_input | |
| # Sanitize any user input (chat messages, demo prompts) | |
| clean_input = sanitize_user_input("What are some chicken recipes?") | |
| ``` | |
| ### Advanced Usage | |
| ```python | |
| from utils.sanitization import DataSanitizer | |
| # Direct class usage | |
| sanitizer = DataSanitizer() | |
| clean_text = sanitizer.sanitize_input("User input") | |
| ``` | |
| ## Security Patterns Handled | |
| ### Basic XSS Protection | |
| - `<script>` tags → Removed | |
| - `javascript:` URLs → Cleaned | |
| - Event handlers (`onclick`, `onload`) → Removed | |
| - HTML entities → Properly encoded | |
| ### Input Validation | |
| - Length limits (1-1000 characters) | |
| - Empty input detection | |
| - Whitespace normalization | |
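As a rough illustration of how the patterns and limits above could fit together, the following is a minimal sketch, not the module's actual code. The function name `sketch_sanitize`, the regular expressions, the order of the steps, and the choice to check the maximum length before cleaning are all assumptions; only the 1-1000 character limits and the two error messages come from this document.

```python
import html
import re

MAX_LEN = 1000  # documented maximum message length
MIN_LEN = 1     # documented minimum message length

# Illustrative patterns only; the real module's patterns may differ.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
JS_URL_RE = re.compile(r"javascript\s*:", re.IGNORECASE)
# Restricted to a few known handler names so that ordinary words such as
# "onion" are never mistaken for an event attribute.
EVENT_ATTR_RE = re.compile(
    r"\bon(?:click|load|error|mouseover)\s*=\s*(?:\"[^\"]*\"|'[^']*'|[^\s>]+)",
    re.IGNORECASE,
)


def sketch_sanitize(text: str) -> str:
    """Hypothetical stand-in for sanitize_user_input."""
    if not isinstance(text, str):
        raise ValueError("Input must be a string")
    if len(text) > MAX_LEN:
        raise ValueError(f"Input too long (maximum {MAX_LEN} characters)")

    cleaned = SCRIPT_RE.sub("", text)         # drop <script>...</script> blocks
    cleaned = JS_URL_RE.sub("", cleaned)      # strip the javascript: scheme
    cleaned = EVENT_ATTR_RE.sub("", cleaned)  # drop inline event handlers
    cleaned = html.escape(cleaned, quote=True)  # encode &, <, >, " and '
    cleaned = " ".join(cleaned.split())       # collapse runs of whitespace

    if len(cleaned) < MIN_LEN:
        raise ValueError("Input cannot be empty")
    return cleaned
```

One consequence of encoding with `html.escape(..., quote=True)` is that apostrophes are also encoded (as `&#x27;`), so a message such as "What's for dinner?" comes back with an entity in it. Whether that is acceptable depends on how the cleaned text is rendered downstream.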
| ## Integration | |
| Sanitization is applied automatically in the FastAPI endpoints: | |
| ### Chat Endpoint | |
| ```python | |
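| from pydantic import BaseModel, Field, validator | |
| from utils.sanitization import sanitize_user_input | |
| class ChatMessage(BaseModel): | |
|     message: str = Field(..., min_length=1, max_length=1000) | |
|     # Pydantic v1-style validator (Pydantic v2 uses field_validator) | |
|     @validator('message') | |
|     def sanitize_message_field(cls, v): | |
|         return sanitize_user_input(v) | |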
| ``` | |
| ### Demo Endpoint | |
| ```python | |
| @app.get("/demo") | |
| def demo(prompt: str = "What recipes do you have?"): | |
|     sanitized_prompt = sanitize_user_input(prompt) | |
|     # ... rest of the logic | |
| ``` | |
| ## Error Handling | |
| `sanitize_user_input` raises `ValueError` for invalid input: | |
| ```python | |
| try: | |
|     clean_input = sanitize_user_input(user_input) | |
| except ValueError as e: | |
|     return {"error": f"Invalid input: {e}"} | |
| ``` | |
| ## Testing | |
| Run the sanitization tests: | |
| ```bash | |
| python3 test_sanitization.py | |
| ``` | |
| The test suite covers: | |
| - Normal recipe-related messages | |
| - Basic harmful content (scripts, JavaScript) | |
| - Length validation | |
| - Whitespace normalization | |
| - Edge cases | |
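For readers without access to `test_sanitization.py`, the categories above might be exercised along the following lines. This is a hedged sketch rather than the project's test file: the class and function names are invented here, and the fallback `sanitize_user_input` is a simplified stand-in that is used only when the real module cannot be imported.

```python
import html
import re
import unittest

try:
    from utils.sanitization import sanitize_user_input
except ImportError:
    # Hypothetical fallback so the sketch runs on its own.
    def sanitize_user_input(text: str) -> str:
        if len(text) > 1000:
            raise ValueError("Input too long (maximum 1000 characters)")
        text = re.sub(r"<script\b[^>]*>.*?</script\s*>", "", text,
                      flags=re.IGNORECASE | re.DOTALL)
        text = " ".join(html.escape(text).split())
        if not text:
            raise ValueError("Input cannot be empty")
        return text


class SanitizationSketchTests(unittest.TestCase):
    def test_normal_message_unchanged(self):
        self.assertEqual(sanitize_user_input("What are chicken recipes?"),
                         "What are chicken recipes?")

    def test_script_removed(self):
        result = sanitize_user_input(
            "<script>alert('xss')</script>Tell me about pasta")
        self.assertEqual(result, "Tell me about pasta")

    def test_whitespace_normalized(self):
        self.assertEqual(sanitize_user_input("  How to cook rice?  "),
                         "How to cook rice?")

    def test_length_limits(self):
        with self.assertRaises(ValueError):
            sanitize_user_input("")
        with self.assertRaises(ValueError):
            sanitize_user_input("a" * 1001)
        # Edge case: exactly 1000 characters is still accepted.
        self.assertEqual(len(sanitize_user_input("a" * 1000)), 1000)


def run_sketch_tests() -> unittest.TestResult:
    """Run the sketch suite and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        SanitizationSketchTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_sketch_tests().wasSuccessful()` reports whether every case passed; saved as a file, the class can also be picked up by `python3 -m unittest`.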
| ## What's Simplified | |
| ### Removed Overly Complex Features: | |
| - ❌ SQL injection patterns (not relevant for LLM chatbot) | |
| - ❌ Command injection patterns (not applicable) | |
| - ❌ Separate strict/relaxed modes (unnecessary complexity) | |
| - ❌ Multiple sanitization methods (unified approach) | |
| ### Kept Essential Features: | |
| - ✅ Basic XSS protection | |
| - ✅ Input length validation | |
| - ✅ HTML encoding | |
| - ✅ Whitespace normalization | |
| - ✅ Clear error messages | |
| ## Performance | |
| - **Lightweight**: Minimal regex patterns | |
| - **Fast**: Simple operations only | |
| - **Memory Efficient**: No complex state | |
| - **Recipe-Focused**: Context-appropriate validation | |
| ## Examples | |
| ### Valid Inputs (Cleaned): | |
| ```python | |
| "What are chicken recipes?" → "What are chicken recipes?" | |
| "<script>alert('xss')</script>Tell me about pasta" → "Tell me about pasta" | |
| " How to cook rice? " → "How to cook rice?" | |
| "What about desserts & sweets?" → "What about desserts &amp; sweets?" | |
| ``` | |
| ### Invalid Inputs (Rejected): | |
| ```python | |
| "" → ValueError: Input cannot be empty | |
| "a" * 1001 → ValueError: Input too long (maximum 1000 characters) | |
| ``` | |
| ## Best Practices | |
| 1. **Keep It Simple**: Focus on the actual threats to a recipe chatbot | |
| 2. **Context Appropriate**: Don't over-engineer for non-existent threats | |
| 3. **User Friendly**: Allow normal recipe-related punctuation | |
| 4. **Clear Errors**: Provide helpful error messages | |
| 5. **Test Regularly**: Verify with real recipe queries | |
| This simplified approach provides adequate protection while maintaining usability in a recipe recommendation chatbot context. | |