plg4-dev-server/backend/docs/sanitization_guide.md
Jesse Johnson
New commit for backend deployment: 2025-09-25_13-24-03
c59d808

Simplified Data Sanitization Documentation

Overview

The simplified data sanitization module provides focused input validation and sanitization for the Recipe Recommendation Bot API. It is tailored specifically to the recipe chatbot context and covers the essential security protections.

Features

🛡️ Essential Security Protection

  • XSS Prevention: HTML encoding and basic script removal
  • Input Validation: Length limits and content validation
  • Whitespace Normalization: Clean formatting

🔧 Simple Configuration

  • Maximum Message Length: 1000 characters
  • Minimum Message Length: 1 character
  • Single Method: One sanitization method for all inputs

Usage

Basic Sanitization

```python
from utils.sanitization import sanitize_user_input

# Sanitize any user input (chat messages, demo prompts)
clean_input = sanitize_user_input("What are some chicken recipes?")
```

Advanced Usage

```python
from utils.sanitization import DataSanitizer

# Direct class usage
sanitizer = DataSanitizer()
clean_text = sanitizer.sanitize_input("User input")
```

Security Patterns Handled

Basic XSS Protection

  • <script> tags → Removed
  • javascript: URLs → Cleaned
  • Event handlers (onclick, onload) → Removed
  • HTML entities → Properly encoded

Input Validation

  • Length limits (1-1000 characters)
  • Empty input detection
  • Whitespace normalization
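
Taken together, the patterns above amount to one small routine. The following is a minimal, hypothetical sketch of what sanitize_user_input could look like; the regexes and error messages are illustrative and may differ from the actual module.

```python
import html
import re

# Illustrative patterns for the documented protections; the real
# module's regexes may be stricter or broader.
_SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL)
_JS_URL_RE = re.compile(r"javascript\s*:", re.IGNORECASE)
_EVENT_RE = re.compile(r"\bon\w+\s*=\s*(\"[^\"]*\"|'[^']*'|\S+)", re.IGNORECASE)

def sanitize_user_input(text: str) -> str:
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    if not text:
        raise ValueError("Input cannot be empty")
    if len(text) > 1000:
        raise ValueError("Input too long (maximum 1000 characters)")
    # Strip <script> blocks, javascript: URLs, and inline event handlers.
    text = _SCRIPT_RE.sub("", text)
    text = _JS_URL_RE.sub("", text)
    text = _EVENT_RE.sub("", text)
    # Encode any remaining HTML-significant characters last, so the
    # removal passes above operate on the raw input.
    return html.escape(text).strip()
```

Ordering matters in a sketch like this: length is checked after whitespace normalization, and HTML encoding runs last so the pattern-removal steps see the raw markup rather than its encoded form.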

Integration

The sanitization is automatically applied in FastAPI endpoints:

Chat Endpoint

```python
from pydantic import BaseModel, Field, validator
from utils.sanitization import sanitize_user_input

class ChatMessage(BaseModel):
    message: str = Field(..., min_length=1, max_length=1000)

    # Pydantic v1-style validator (Pydantic v2 renames this to field_validator)
    @validator('message')
    def sanitize_message_field(cls, v):
        return sanitize_user_input(v)
```

Demo Endpoint

```python
from fastapi import FastAPI
from utils.sanitization import sanitize_user_input

app = FastAPI()

@app.get("/demo")
def demo(prompt: str = "What recipes do you have?"):
    sanitized_prompt = sanitize_user_input(prompt)
    # ... rest of the logic
```

Error Handling

The sanitization raises ValueError for invalid input:

```python
try:
    clean_input = sanitize_user_input(user_input)
except ValueError as e:
    return {"error": f"Invalid input: {str(e)}"}
```

Testing

Run the sanitization tests:

```shell
python3 test_sanitization.py
```

The test suite covers:

  • Normal recipe-related messages
  • Basic harmful content (scripts, JavaScript)
  • Length validation
  • Whitespace normalization
  • Edge cases
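
test_sanitization.py itself is not reproduced in this guide; a minimal, hypothetical script along those lines might look like the following. It inlines a stand-in with the documented behavior so the sketch runs on its own, whereas the real tests would import sanitize_user_input from utils.sanitization.

```python
import html
import re

def sanitize_user_input(text: str) -> str:
    # Stand-in with the documented behavior: normalize whitespace,
    # enforce 1-1000 characters, drop <script> blocks, encode HTML.
    text = re.sub(r"\s+", " ", text).strip()
    if not text:
        raise ValueError("Input cannot be empty")
    if len(text) > 1000:
        raise ValueError("Input too long (maximum 1000 characters)")
    text = re.sub(r"<script\b[^>]*>.*?</script>", "", text, flags=re.I | re.S)
    return html.escape(text).strip()

def test_normal_message():
    assert sanitize_user_input("What are chicken recipes?") == "What are chicken recipes?"

def test_script_removed():
    assert "script" not in sanitize_user_input("<script>alert('x')</script>pasta")

def test_length_rejected():
    try:
        sanitize_user_input("a" * 1001)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for over-long input")

if __name__ == "__main__":
    test_normal_message()
    test_script_removed()
    test_length_rejected()
    print("all sanitization checks passed")
```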

What's Simplified

Removed Overly Complex Features:

  • ❌ SQL injection patterns (not relevant for an LLM chatbot)
  • ❌ Command injection patterns (not applicable)
  • ❌ Separate strict/relaxed modes (unnecessary complexity)
  • ❌ Multiple sanitization methods (unified approach)

Kept Essential Features:

  • ✅ Basic XSS protection
  • ✅ Input length validation
  • ✅ HTML encoding
  • ✅ Whitespace normalization
  • ✅ Clear error messages

Performance

  • Lightweight: Minimal regex patterns
  • Fast: Simple operations only
  • Memory Efficient: No complex state
  • Recipe-Focused: Context-appropriate validation

Examples

Valid Inputs (Cleaned):

"What are chicken recipes?" β†’ "What are chicken recipes?"
"<script>alert('xss')</script>Tell me about pasta" β†’ "Tell me about pasta"
"   How to cook rice?   " β†’ "How to cook rice?"
"What about desserts & sweets?" β†’ "What about desserts &amp; sweets?"

Invalid Inputs (Rejected):

"" β†’ ValueError: Input cannot be empty
"a" * 1001 β†’ ValueError: Input too long (maximum 1000 characters)

Best Practices

  1. Keep It Simple: Focus on actual threats for recipe chatbot
  2. Context Appropriate: Don't over-engineer for non-existent threats
  3. User Friendly: Allow normal recipe-related punctuation
  4. Clear Errors: Provide helpful error messages
  5. Test Regularly: Verify with real recipe queries

This simplified approach provides adequate protection while maintaining usability for a recipe recommendation chatbot context.