
Memory Handler Migration Guide

Why Memory-Based File Handling?

The original FileUploadHandler saves files to the /tmp directory, which can cause 403 Forbidden errors in restricted environments such as:

  • Hugging Face Spaces
  • Some cloud platforms with read-only filesystems
  • Containers with security restrictions

The MemoryFileHandler processes files entirely in memory, avoiding filesystem access.

Caveats and Limitations

1. Memory Usage

  • Issue: All file content is loaded into RAM
  • Impact: Large files (near the 300MB limit) could cause memory issues
  • Mitigation: The 300MB file size limit helps prevent OOM errors

2. ZIP File Handling

  • Issue: ZIP files need special handling because zipfile requires a seekable file-like object
  • Current approach: Load entire ZIP into memory using BytesIO
  • Limitation: Extracting large ZIP files could spike memory usage
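The BytesIO approach described above can be sketched like this (the helper name read_zip_in_memory is hypothetical; the repository's handle_zip_file may differ):

```python
import zipfile
from io import BytesIO


def read_zip_in_memory(zip_bytes, encoding="utf-8"):
    """Extract every file in a ZIP archive without writing to disk.

    zipfile.ZipFile needs a seekable file-like object, so the raw upload
    bytes are wrapped in BytesIO. Note that the whole archive plus every
    extracted member is held in memory at the same time.
    """
    contents = {}
    with zipfile.ZipFile(BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries
            contents[info.filename] = zf.read(info.filename).decode(
                encoding, errors="replace"
            )
    return contents
```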

3. Session State Persistence

  • Issue: Streamlit reruns re-execute the script, discarding ordinary in-memory variables
  • Solution: Store processed content in st.session_state
  • Limitation: Session state also uses memory

4. Multiple File Processing

  • Issue: Batch processing multiple files multiplies memory usage
  • Example: 10 files × 30MB each = 300MB in memory
  • Mitigation: Process files sequentially, not in parallel

5. Binary vs Text Files

  • Issue: Binary files (images, etc.) need different handling
  • Solution: the as_text parameter in process_uploaded_file() selects decoded text or raw bytes

Implementation Status

✅ Completed:

  • ui_components.py - Text input file uploads
  • comparison_functions.py - Comparison file uploads
  • frequency_handlers.py - Created frequency_handlers_updated.py
  • utils/__init__.py - Exports both handlers

⚠️ Need Updates:

  • analysis_handlers.py - Complex due to ZIP file handling
  • pos_handlers.py - Batch file processing
  • reference_manager.py - Custom reference uploads
  • config_manager.py - YAML config uploads

Migration Examples

Simple File Upload

# OLD - FileUploadHandler
temp_path = FileUploadHandler.save_to_temp(uploaded_file, prefix="text")
if temp_path:
    content = FileUploadHandler.read_from_temp(temp_path)
    # ... process content
    FileUploadHandler.cleanup_temp_file(temp_path)

# NEW - MemoryFileHandler
content = MemoryFileHandler.process_uploaded_file(uploaded_file, as_text=True)
if content:
    ...  # process content -- no cleanup needed!

ZIP File Handling

# OLD - FileUploadHandler
zip_file = FileUploadHandler.handle_zip_file(uploaded_file)
with zip_file as zip_ref:
    for file_info in zip_ref.infolist():
        content = zip_ref.read(file_info.filename)
        # ... process content

# NEW - MemoryFileHandler
file_contents = MemoryFileHandler.handle_zip_file(uploaded_file)
if file_contents:
    for filename, content in file_contents.items():
        ...  # process each file

DataFrame Processing

# OLD - Manual CSV parsing
content = FileUploadHandler.read_from_temp(temp_path)
df = pd.read_csv(StringIO(content.decode('utf-8')))

# NEW - Direct DataFrame creation
df = MemoryFileHandler.process_csv_tsv_file(uploaded_file)

When to Use Which Handler

Use MemoryFileHandler when:

  • Deploying to restricted environments (Hugging Face Spaces)
  • Files are reasonably sized (<100MB preferred)
  • You need maximum compatibility

Consider FileUploadHandler when:

  • Processing very large files (>200MB)
  • Running locally with full filesystem access
  • You need to preserve files across sessions

Complete Migration Steps

  1. Update imports:

    from web_app.utils import MemoryFileHandler
    
  2. Replace file operations:

    • Remove save_to_temp() calls
    • Remove cleanup_temp_file() calls
    • Use process_uploaded_file() directly
  3. Update error handling:

    • Remove 403-specific error messages
    • Add memory-related error handling
  4. Test thoroughly:

    • Test with small files first
    • Test with maximum size files
    • Test with multiple files

Performance Considerations

Memory Usage Formula:

Total Memory = File Size + Processing Overhead + Session State Storage

Example for 50MB file:

  • File content: 50MB
  • String conversion: ~50MB (if text)
  • DataFrame creation: ~100-200MB (depends on data)
  • Total: ~200-300MB peak usage

Recommendations

  1. For Hugging Face Spaces: Use MemoryFileHandler exclusively
  2. For local deployment: Either handler works, choose based on file sizes
  3. For production: Consider implementing both with automatic fallback
  4. Monitor memory: Add memory usage tracking for large deployments
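For recommendation 4, the standard library's tracemalloc is one way to sanity-check the memory estimates above against real files before deploying (measure_peak_memory is a hypothetical helper):

```python
import tracemalloc


def measure_peak_memory(fn, *args, **kwargs):
    """Run fn and report its peak Python-level allocation in bytes.

    tracemalloc only tracks allocations made by the Python allocator,
    so this is an estimate, not a full process-RSS measurement.
    """
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak
```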