
Memory Handler Migration Guide

Why Memory-Based File Handling?

The original FileUploadHandler saves files to the /tmp directory, which can cause 403 Forbidden errors in restricted environments such as:

  • Hugging Face Spaces
  • Some cloud platforms with read-only filesystems
  • Containers with security restrictions

The MemoryFileHandler processes files entirely in memory, avoiding filesystem access.

Caveats and Limitations

1. Memory Usage

  • Issue: All file content is loaded into RAM
  • Impact: Large files (near the 300MB limit) could cause memory issues
  • Mitigation: The 300MB file size limit helps prevent OOM errors

2. ZIP File Handling

  • Issue: ZIP files need special handling because zipfile requires a seekable file-like object
  • Current approach: Load entire ZIP into memory using BytesIO
  • Limitation: Extracting large ZIP files could spike memory usage
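The BytesIO approach described above can be sketched like this (the helper name read_zip_in_memory is hypothetical; the repository's handle_zip_file may differ):

```python
import zipfile
from io import BytesIO


def read_zip_in_memory(zip_bytes, encoding="utf-8"):
    """Extract every file in a ZIP archive without writing to disk.

    zipfile.ZipFile needs a seekable file-like object, so the raw upload
    bytes are wrapped in BytesIO. Note that the whole archive plus every
    extracted member is held in memory at the same time.
    """
    contents = {}
    with zipfile.ZipFile(BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries
            contents[info.filename] = zf.read(info.filename).decode(
                encoding, errors="replace"
            )
    return contents
```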

3. Session State Persistence

  • Issue: Streamlit reruns re-execute the script, discarding ordinary in-memory variables
  • Solution: Store processed content in st.session_state
  • Limitation: Session state also uses memory

4. Multiple File Processing

  • Issue: Batch processing multiple files multiplies memory usage
  • Example: 10 files × 30MB each = 300MB in memory
  • Mitigation: Process files sequentially, not in parallel

5. Binary vs Text Files

  • Issue: Binary files (images, etc.) need different handling
  • Solution: the as_text parameter in process_uploaded_file() selects decoded text or raw bytes

Implementation Status

✅ Completed:

  • ui_components.py - Text input file uploads
  • comparison_functions.py - Comparison file uploads
  • frequency_handlers.py - Created frequency_handlers_updated.py
  • utils/__init__.py - Exports both handlers

⚠️ Need Updates:

  • analysis_handlers.py - Complex due to ZIP file handling
  • pos_handlers.py - Batch file processing
  • reference_manager.py - Custom reference uploads
  • config_manager.py - YAML config uploads

Migration Examples

Simple File Upload

# OLD - FileUploadHandler
temp_path = FileUploadHandler.save_to_temp(uploaded_file, prefix="text")
if temp_path:
    content = FileUploadHandler.read_from_temp(temp_path)
    # ... process content
    FileUploadHandler.cleanup_temp_file(temp_path)

# NEW - MemoryFileHandler
content = MemoryFileHandler.process_uploaded_file(uploaded_file, as_text=True)
if content:
    ...  # process content -- no cleanup needed!

ZIP File Handling

# OLD - FileUploadHandler
zip_file = FileUploadHandler.handle_zip_file(uploaded_file)
with zip_file as zip_ref:
    for file_info in zip_ref.infolist():
        content = zip_ref.read(file_info.filename)
        # ... process content

# NEW - MemoryFileHandler
file_contents = MemoryFileHandler.handle_zip_file(uploaded_file)
if file_contents:
    for filename, content in file_contents.items():
        ...  # process each file

DataFrame Processing

# OLD - Manual CSV parsing
content = FileUploadHandler.read_from_temp(temp_path)
df = pd.read_csv(StringIO(content.decode('utf-8')))

# NEW - Direct DataFrame creation
df = MemoryFileHandler.process_csv_tsv_file(uploaded_file)

When to Use Which Handler

Use MemoryFileHandler when:

  • Deploying to restricted environments (Hugging Face Spaces)
  • Files are reasonably sized (<100MB preferred)
  • You need maximum compatibility

Consider FileUploadHandler when:

  • Processing very large files (>200MB)
  • Running locally with full filesystem access
  • You need to preserve files across sessions

Complete Migration Steps

  1. Update imports:

    from web_app.utils import MemoryFileHandler
    
  2. Replace file operations:

    • Remove save_to_temp() calls
    • Remove cleanup_temp_file() calls
    • Use process_uploaded_file() directly
  3. Update error handling:

    • Remove 403-specific error messages
    • Add memory-related error handling
  4. Test thoroughly:

    • Test with small files first
    • Test with maximum size files
    • Test with multiple files

Performance Considerations

Memory Usage Formula:

Total Memory = File Size + Processing Overhead + Session State Storage

Example for 50MB file:

  • File content: 50MB
  • String conversion: ~50MB (if text)
  • DataFrame creation: ~100-200MB (depends on data)
  • Total: ~200-300MB peak usage

Recommendations

  1. For Hugging Face Spaces: Use MemoryFileHandler exclusively
  2. For local deployment: Either handler works, choose based on file sizes
  3. For production: Consider implementing both with automatic fallback
  4. Monitor memory: Add memory usage tracking for large deployments
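For recommendation 4, the standard library's tracemalloc is one way to sanity-check the memory estimates above against real files before deploying (measure_peak_memory is a hypothetical helper):

```python
import tracemalloc


def measure_peak_memory(fn, *args, **kwargs):
    """Run fn and report its peak Python-level allocation in bytes.

    tracemalloc only tracks allocations made by the Python allocator,
    so this is an estimate, not a full process-RSS measurement.
    """
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak
```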