Batch Processing Performance Optimization
Performance Issues Identified
1. Multiple Analysis Calls Per File (Biggest Issue)
The original implementation made 3 separate calls to analyze_text() for each file:
- One for Content Words (CW)
- One for Function Words (FW)
- One for n-grams (without word type filter)
Each call runs the entire SpaCy pipeline (tokenization, POS tagging, dependency parsing), essentially tripling the processing time.
2. Memory Accumulation
- All results stored in memory with detailed token information
- No streaming or chunking capabilities
- Everything stays in memory until batch completes
3. Default Model Size
- Default SpaCy model is 'trf' (transformer-based), which is much slower than 'md'
- Found in session_manager.py: 'model_size': 'trf'
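A minimal sketch of how the default could be switched to the faster model. The dict name and exact structure of the settings in session_manager.py are assumptions for illustration, not the project's actual code:

```python
# Hypothetical excerpt of session defaults (the real session_manager.py
# layout may differ). Switching 'trf' -> 'md' trades some accuracy for speed.
DEFAULT_SETTINGS = {
    'model_size': 'md',  # was 'trf'; 'md' is considerably faster for batches
}
```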
Optimizations Implemented
Phase 1: Single-Pass Analysis (70% Performance Gain)
Changes Made:
Modified the analyze_text() method to support a separate_word_types parameter:
- Processes both CW and FW in a single pass through the text
- Collects statistics for both word types simultaneously
- N-grams are processed in the same pass
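The single-pass idea can be sketched independently of the real analyze_text() implementation. The helper and POS sets below are illustrative (the project's actual classification rules may differ): one loop over already-tagged tokens fills CW counts, FW counts, and n-grams simultaneously, so the pipeline runs once instead of three times.

```python
from collections import Counter

# Content words are typically open-class POS tags, function words closed-class.
# These sets are illustrative; the real analyzer's rules may differ.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}
FUNCTION_POS = {"DET", "ADP", "PRON", "AUX", "CCONJ", "SCONJ", "PART"}

def analyze_single_pass(tagged_tokens, ngram_n=2):
    """One pass over (token, pos) pairs: collect CW stats, FW stats,
    and n-grams together instead of re-running the pipeline three times."""
    cw, fw, ngrams = Counter(), Counter(), Counter()
    tokens = [tok for tok, _ in tagged_tokens]
    for tok, pos in tagged_tokens:
        if pos in CONTENT_POS:
            cw[tok.lower()] += 1
        elif pos in FUNCTION_POS:
            fw[tok.lower()] += 1
    for i in range(len(tokens) - ngram_n + 1):
        ngrams[tuple(tokens[i:i + ngram_n])] += 1
    return {"CW": cw, "FW": fw, "ngrams": ngrams}
```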
Updated batch processing handlers to use single-pass analysis:
```python
# OLD: 3 separate calls
for word_type in ['CW', 'FW']:
    analysis = analyzer.analyze_text(text, ...)
full_analysis = analyzer.analyze_text(text, ...)  # for n-grams

# NEW: Single optimized call
analysis = analyzer.analyze_text(
    text_content,
    selected_indices,
    separate_word_types=True  # Process CW/FW separately in same pass
)
```
Added optimized batch method analyze_batch_memory():
- Works directly with in-memory file contents
- Supports all new analysis parameters
- Maintains backward compatibility
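As a rough sketch, an in-memory batch method of this kind could look like the following. The signature and parameter names here are assumptions based on the description above, not the real analyze_batch_memory() code:

```python
# Hypothetical shape of the in-memory batch API described above.
def analyze_batch_memory(analyzer, file_contents, **params):
    """file_contents: dict of {filename: text}. Returns {filename: result}.
    One single-pass analysis per file; no redundant CW/FW/n-gram calls."""
    results = {}
    for name, text in file_contents.items():
        results[name] = analyzer.analyze_text(
            text, separate_word_types=True, **params
        )
    return results
```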
Performance Recommendations
1. Use 'md' Model Instead of 'trf'
The transformer model ('trf') is significantly slower. For batch processing, consider using 'md':
- 3-5x faster processing
- Still provides good accuracy for lexical sophistication analysis
2. Enable Smart Defaults
Smart defaults optimize which measures to compute, reducing unnecessary calculations.
3. For Very Large Batches
Consider implementing:
- Chunk processing (process N files at a time)
- Parallel processing using multiprocessing
- Results streaming to disk instead of memory accumulation
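A chunked-processing loop of the kind suggested above can be sketched in a few lines; the function names here are illustrative, not part of the project:

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of up to `size` items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def process_in_chunks(filenames, analyze_fn, write_fn, chunk_size=50):
    """Process N files at a time and flush results between chunks,
    so memory only ever holds one chunk's worth of detailed token data."""
    for chunk in chunked(filenames, chunk_size):
        results = {name: analyze_fn(name) for name in chunk}
        write_fn(results)  # stream to disk instead of accumulating in memory
```

The same chunk loop is also a natural seam for parallelism: each chunk could be handed to a multiprocessing pool instead of being analyzed serially.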
Expected Performance Gains
With the optimizations implemented:
- ~70% reduction in processing time from eliminating redundant analysis calls
- Additional 20-30% possible by switching from 'trf' to 'md' model
- Memory usage remains similar but could be optimized further with streaming
How to Use the Optimized Version
The optimizations are transparent to users. The batch processing will automatically use the single-pass analysis when:
- No specific word type filter is selected
- The files being processed need both CW and FW analysis
For legacy compatibility, the old analyze_batch() method has been updated to use the optimized approach internally.
GPU Status Monitoring in Debug Mode
The web app now includes comprehensive GPU status information in debug mode. To access:
- Enable "🐛 Debug Mode" in the sidebar
- Expand the "GPU Status" section
Features
PyTorch/CUDA Information:
- PyTorch installation and version
- CUDA availability and version
- Number of GPUs and their names
- GPU memory usage (allocated, reserved, free)
SpaCy GPU Configuration:
- SpaCy GPU enablement status
- Current GPU device being used
- spacy-transformers installation status
Active Model GPU Status:
- Current model's device configuration
- GPU optimization status (mixed precision, batch sizes)
- SpaCy version information
Performance Tips:
- Optimization recommendations
- Common troubleshooting guidance
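The kind of checks behind this panel can be sketched as follows. This is not the app's actual code; it only illustrates gathering PyTorch/CUDA facts while degrading gracefully when torch is absent:

```python
def gpu_status():
    """Collect PyTorch/CUDA facts of the kind the debug panel reports.
    Returns a plain dict and works even when torch is not installed."""
    info = {"torch_available": False, "cuda_available": False, "devices": []}
    try:
        import torch
    except ImportError:
        return info  # no PyTorch: report unavailability instead of crashing
    info["torch_available"] = True
    info["torch_version"] = torch.__version__
    if torch.cuda.is_available():
        info["cuda_available"] = True
        for i in range(torch.cuda.device_count()):
            info["devices"].append({
                "name": torch.cuda.get_device_name(i),
                "allocated_bytes": torch.cuda.memory_allocated(i),
                "reserved_bytes": torch.cuda.memory_reserved(i),
            })
    return info
```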
Benefits
This integrated GPU monitoring eliminates the need for the separate test_gpu_support.py script for most use cases. Developers can now:
- Quickly verify GPU availability without running external scripts
- Monitor GPU memory usage during batch processing
- Confirm that models are correctly utilizing GPU acceleration
- Troubleshoot performance issues more effectively
Usage Example
When processing large batches with transformer models:
- Enable debug mode to monitor GPU utilization
- Check that the model is using GPU (not CPU fallback)
- Monitor memory usage to prevent out-of-memory errors
- Adjust batch sizes based on available GPU memory
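The last step can be automated with a simple heuristic. The per-document memory estimate and headroom factor below are placeholder assumptions; profile your own model to calibrate them:

```python
def suggest_batch_size(free_bytes, per_doc_bytes=64 * 1024 * 1024,
                       floor=1, ceiling=128):
    """Derive a batch size from free GPU memory, leaving 20% headroom.
    The 64 MiB-per-document figure is a placeholder, not a measured value."""
    size = int(free_bytes * 0.8 // per_doc_bytes)
    return max(floor, min(ceiling, size))
```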