# Memory Handler Migration Guide

## Why Memory-Based File Handling?

The original `FileUploadHandler` saves files to the `/tmp` directory, which can cause `403 Forbidden` errors in restricted environments such as:
- Hugging Face Spaces
- Some cloud platforms with read-only filesystems
- Containers with security restrictions
The `MemoryFileHandler` processes files entirely in memory, avoiding filesystem access altogether.
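Conceptually, the in-memory approach just wraps the upload's raw bytes in a file-like object instead of writing them to disk. A minimal sketch (`to_file_like` is an illustrative stand-in, not the actual `MemoryFileHandler` API):

```python
from io import BytesIO

def to_file_like(data: bytes) -> BytesIO:
    """Wrap raw upload bytes (e.g. uploaded_file.getvalue()) in a
    file-like object backed by RAM, so downstream code that expects
    an open file never touches /tmp."""
    return BytesIO(data)
```

Any consumer that accepts a file object (CSV parsers, `zipfile`, image libraries) can read from this wrapper exactly as it would from a real file.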
## Caveats and Limitations

### 1. Memory Usage

- **Issue:** All file content is loaded into RAM
- **Impact:** Large files (near the 300 MB limit) can cause memory pressure
- **Mitigation:** The 300 MB file size limit helps prevent out-of-memory (OOM) errors
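A size guard along these lines can enforce the limit *before* any content is read into RAM; `MAX_BYTES` and the byte-count argument are assumptions here, since the real handler's check may differ:

```python
MAX_BYTES = 300 * 1024 * 1024  # the 300 MB cap discussed above

def guard_upload(num_bytes: int) -> None:
    """Reject oversized uploads before their content is read into RAM.

    Streamlit's UploadedFile exposes a .size attribute that can be
    passed in here.
    """
    if num_bytes > MAX_BYTES:
        raise ValueError(
            f"File is {num_bytes / 1024 ** 2:.0f} MB; "
            f"limit is {MAX_BYTES / 1024 ** 2:.0f} MB"
        )
```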
### 2. ZIP File Handling

- **Issue:** ZIP files require file-like objects, so they need special handling
- **Current approach:** Load the entire ZIP into memory using `BytesIO`
- **Limitation:** Extracting large ZIP archives can spike memory usage
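The `BytesIO` approach can be sketched with the standard library alone; `read_zip_in_memory` is an illustrative helper, not the project's `handle_zip_file`:

```python
import zipfile
from io import BytesIO

def read_zip_in_memory(zip_bytes: bytes) -> dict:
    """Extract every member of a ZIP archive without writing to disk.

    Note: the whole archive *and* every extracted member live in RAM
    at the same time, which is the memory spike described above.
    """
    contents = {}
    with zipfile.ZipFile(BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if not info.is_dir():
                contents[info.filename] = zf.read(info.filename)
    return contents
```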
### 3. Session State Persistence

- **Issue:** Streamlit reruns can clear in-memory data
- **Solution:** Store processed content in `st.session_state`
- **Limitation:** Session state also consumes memory
### 4. Multiple File Processing

- **Issue:** Batch processing multiple files multiplies memory usage
- **Example:** 10 files × 30 MB each = 300 MB in memory
- **Mitigation:** Process files sequentially, not in parallel
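Sequential processing can be sketched as below; `process_batch` is hypothetical, and `len()` stands in for whatever real per-file work is done:

```python
def process_batch(files: list) -> list:
    """Process uploads one at a time, so only a single file's content
    (plus its much smaller result) is resident in memory at once."""
    results = []
    for data in files:
        results.append(len(data))  # stand-in for real per-file work
        del data  # drop the reference before the next iteration
    return results
```

Holding only the per-file *results* (rather than the raw contents) keeps the peak footprint close to the largest single file instead of the sum of all files.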
### 5. Binary vs. Text Files

- **Issue:** Binary files (images, etc.) need different handling than text
- **Solution:** Use the `as_text` parameter of `process_uploaded_file()`
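One way such a parameter might behave is sketched below; the utf-8-then-latin-1 fallback is an illustrative choice and may not match the real implementation:

```python
def decode_upload(data: bytes, as_text: bool):
    """Return str for text files, raw bytes for binary ones."""
    if not as_text:
        return data  # images etc. stay as bytes
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 never fails: every byte value maps to a character
        return data.decode("latin-1")
```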
## Implementation Status

✅ **Completed:**

- `ui_components.py` - Text input file uploads
- `comparison_functions.py` - Comparison file uploads
- `frequency_handlers.py` - Created `frequency_handlers_updated.py`
- `utils/__init__.py` - Exports both handlers

⚠️ **Need Updates:**

- `analysis_handlers.py` - Complex due to ZIP file handling
- `pos_handlers.py` - Batch file processing
- `reference_manager.py` - Custom reference uploads
- `config_manager.py` - YAML config uploads
## Migration Examples

### Simple File Upload

```python
# OLD - FileUploadHandler
temp_path = FileUploadHandler.save_to_temp(uploaded_file, prefix="text")
if temp_path:
    content = FileUploadHandler.read_from_temp(temp_path)
    # ... process content
    FileUploadHandler.cleanup_temp_file(temp_path)

# NEW - MemoryFileHandler
content = MemoryFileHandler.process_uploaded_file(uploaded_file, as_text=True)
if content:
    # ... process content
    pass  # no cleanup needed!
```
### ZIP File Handling

```python
# OLD - FileUploadHandler
zip_file = FileUploadHandler.handle_zip_file(uploaded_file)
with zip_file as zip_ref:
    for file_info in zip_ref.infolist():
        content = zip_ref.read(file_info.filename)

# NEW - MemoryFileHandler
file_contents = MemoryFileHandler.handle_zip_file(uploaded_file)
if file_contents:
    for filename, content in file_contents.items():
        pass  # process each file
```
### DataFrame Processing

```python
from io import StringIO

import pandas as pd

# OLD - Manual CSV parsing
content = FileUploadHandler.read_from_temp(temp_path)
df = pd.read_csv(StringIO(content.decode('utf-8')))

# NEW - Direct DataFrame creation
df = MemoryFileHandler.process_csv_tsv_file(uploaded_file)
```
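For reference, the same in-memory parsing can be done with only the standard library; `parse_csv_in_memory` is an illustrative stand-in for what `process_csv_tsv_file` does with pandas:

```python
import csv
from io import StringIO

def parse_csv_in_memory(raw: bytes, delimiter: str = ",") -> list:
    """Parse CSV/TSV bytes into a list of row dicts, no temp file.

    Use a tab delimiter for TSV input.
    """
    reader = csv.DictReader(StringIO(raw.decode("utf-8")),
                            delimiter=delimiter)
    return list(reader)
```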
## When to Use Which Handler

**Use `MemoryFileHandler` when:**

- Deploying to restricted environments (e.g., Hugging Face Spaces)
- Files are reasonably sized (under 100 MB preferred)
- You need maximum compatibility

**Consider `FileUploadHandler` when:**

- Processing very large files (over 200 MB)
- Running locally with full filesystem access
- Files must persist across sessions
## Complete Migration Steps

1. **Update imports:**

   ```python
   from web_app.utils import MemoryFileHandler
   ```

2. **Replace file operations:**

   - Remove `save_to_temp()` calls
   - Remove `cleanup_temp_file()` calls
   - Use `process_uploaded_file()` directly

3. **Update error handling:**

   - Remove 403-specific error messages
   - Add memory-related error handling

4. **Test thoroughly:**

   - Test with small files first
   - Test with maximum-size files
   - Test with multiple files
## Performance Considerations

**Memory usage formula:**

```
Total Memory = File Size + Processing Overhead + Session State Storage
```

**Example for a 50 MB file:**

- File content: 50 MB
- String conversion: ~50 MB (if text)
- DataFrame creation: ~100-200 MB (depends on the data)
- Total: ~200-300 MB peak usage
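The formula can be turned into a rough estimator; the overhead factors below are illustrative assumptions consistent with the worked example, not measured values:

```python
def estimate_peak_mb(file_mb: float, is_text: bool = True,
                     df_factor: float = 3.0) -> float:
    """Rough peak-memory estimate for processing one uploaded file.

    Assumes text decoding roughly doubles the footprint and that
    DataFrame construction costs df_factor x the file size.
    """
    peak = file_mb           # raw bytes
    if is_text:
        peak += file_mb      # decoded string copy
    peak += file_mb * df_factor  # DataFrame / processing overhead
    return peak
```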
## Recommendations

- **For Hugging Face Spaces:** Use `MemoryFileHandler` exclusively
- **For local deployment:** Either handler works; choose based on file sizes
- **For production:** Consider implementing both with an automatic fallback
- **Monitor memory:** Add memory usage tracking for large deployments
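The automatic-fallback idea can be sketched as a small wrapper; both reader callables are hypothetical stand-ins for the two handlers' read paths, injected so the wrapper stays testable:

```python
def read_with_fallback(uploaded, memory_reader, temp_reader):
    """Try the in-memory path first; fall back to the temp-file path
    if memory pressure makes it fail."""
    try:
        return memory_reader(uploaded)
    except MemoryError:
        return temp_reader(uploaded)
```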