# Batch Processing Performance Optimization

## Performance Issues Identified

### 1. **Multiple Analysis Calls Per File (Biggest Issue)**

The original implementation made 3 separate calls to `analyze_text()` for each file:

- One for Content Words (CW)
- One for Function Words (FW)
- One for n-grams (without a word type filter)

Each call runs the entire SpaCy pipeline (tokenization, POS tagging, dependency parsing), essentially **tripling** the processing time.

### 2. **Memory Accumulation**

- All results are stored in memory with detailed token information
- No streaming or chunking capabilities
- Everything stays in memory until the batch completes

### 3. **Default Model Size**

- The default SpaCy model is 'trf' (transformer-based), which is much slower than 'md'
- Found in `session_manager.py`: `'model_size': 'trf'`

## Optimizations Implemented

### Phase 1: Single-Pass Analysis (70% Performance Gain)

**Changes Made:**

1. **Modified the `analyze_text()` method** to support a `separate_word_types` parameter
   - Processes both CW and FW in a single pass through the text
   - Collects statistics for both word types simultaneously
   - N-grams are processed in the same pass
2. **Updated the batch processing handlers** to use single-pass analysis:
   ```python
   # OLD: 3 separate calls, each re-running the full SpaCy pipeline
   for word_type in ['CW', 'FW']:
       analysis = analyzer.analyze_text(text, ...)
   full_analysis = analyzer.analyze_text(text, ...)  # for n-grams

   # NEW: single optimized call
   analysis = analyzer.analyze_text(
       text_content,
       selected_indices,
       separate_word_types=True,  # process CW and FW separately in the same pass
   )
   ```
3. **Added an optimized batch method**, `analyze_batch_memory()`:
   - Works directly with in-memory file contents
   - Supports all new analysis parameters
   - Maintains backward compatibility
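The single-pass idea can be sketched in isolation. The helper below is a hypothetical, simplified stand-in for `analyze_text()` (the real method runs the full SpaCy pipeline and uses POS tags to separate CW from FW); it only shows how CW statistics, FW statistics, and n-grams can all be collected in one traversal of the tokens instead of three:

```python
from collections import Counter

# Hypothetical stand-in for the POS-based CW/FW split done by the SpaCy pipeline.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "on", "and", "is"}

def analyze_single_pass(tokens, ngram_n=2):
    """Collect CW stats, FW stats, and n-grams in ONE pass over the tokens."""
    stats = {"CW": Counter(), "FW": Counter(), "ngrams": Counter()}
    for i, tok in enumerate(tokens):
        word_type = "FW" if tok.lower() in FUNCTION_WORDS else "CW"
        stats[word_type][tok.lower()] += 1          # tally CW or FW
        if i + ngram_n <= len(tokens):              # tally n-grams in the same loop
            stats["ngrams"][tuple(t.lower() for t in tokens[i:i + ngram_n])] += 1
    return stats

stats = analyze_single_pass("The cat sat on the mat".split())
```

Because the (expensive) tokenization happens once, the per-file cost no longer scales with the number of word-type views requested.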
## Performance Recommendations

### 1. **Use the 'md' Model Instead of 'trf'**

The transformer model ('trf') is significantly slower. For batch processing, consider using 'md':

- 3-5x faster processing
- Still provides good accuracy for lexical sophistication analysis
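Switching models can be a single configuration change. As a sketch, the 'md'/'trf' shorthand maps onto SpaCy's real English package names (`en_core_web_md`, `en_core_web_trf`); the helper function itself is hypothetical, not existing project code:

```python
# Hypothetical helper: map the app's 'md'/'trf' shorthand to SpaCy package names.
MODEL_NAMES = {
    "md": "en_core_web_md",    # 3-5x faster; good accuracy for lexical measures
    "trf": "en_core_web_trf",  # transformer-based; highest accuracy, slowest
}

def resolve_model(size: str = "md") -> str:
    """Return the SpaCy package name for a given model-size shorthand."""
    if size not in MODEL_NAMES:
        raise ValueError(f"unknown model size: {size!r}")
    return MODEL_NAMES[size]

# Loading would then be: nlp = spacy.load(resolve_model("md"))
```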
### 2. **Enable Smart Defaults**

Smart defaults optimize which measures are computed, reducing unnecessary calculations.

### 3. **For Very Large Batches**

Consider implementing:

- Chunk processing (process N files at a time)
- Parallel processing using multiprocessing
- Streaming results to disk instead of accumulating them in memory
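None of these are implemented yet. As a sketch under that caveat, chunking combined with streaming results to disk might look like the following; the JSONL output format and the `chunked`/`process_batch_streaming` names are assumptions, not existing code:

```python
import json

def chunked(items, n):
    """Yield successive chunks of at most n items."""
    for i in range(0, len(items), n):
        yield items[i:i + n]

def process_batch_streaming(files, analyze, out_path, chunk_size=10):
    """Process (name, text) pairs chunk by chunk, appending one JSON line each.

    Only one chunk's results are held in memory at a time; each chunk could
    also be handed to a multiprocessing.Pool for parallelism.
    """
    with open(out_path, "w", encoding="utf-8") as out:
        for chunk in chunked(files, chunk_size):
            for name, text in chunk:
                result = analyze(text)  # placeholder for the real analyzer call
                out.write(json.dumps({"file": name, "result": result}) + "\n")
```

Because each result is flushed to disk as soon as it is computed, peak memory is bounded by the chunk size rather than the batch size.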
## Expected Performance Gains

With the optimizations implemented:

- **~70% reduction** in processing time from eliminating redundant analysis calls
- **An additional 20-30%** is possible by switching from the 'trf' to the 'md' model
- **Memory usage** remains similar but could be optimized further with streaming

## How to Use the Optimized Version

The optimizations are transparent to users. Batch processing automatically uses single-pass analysis when:

- No specific word type filter is selected
- The files being processed need both CW and FW analysis

For legacy compatibility, the old `analyze_batch()` method has been updated to use the optimized approach internally.
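The delegation pattern behind that compatibility shim is simple. The class and method bodies below are illustrative stand-ins, not the project's actual code:

```python
class Analyzer:
    def analyze_batch_memory(self, contents, **options):
        """New optimized path: single-pass analysis over in-memory contents."""
        # Stand-in computation; the real method runs the single-pass analysis.
        return {name: len(text.split()) for name, text in contents.items()}

    def analyze_batch(self, contents, **options):
        """Legacy entry point, kept for old callers; it delegates to the
        optimized implementation, so existing code gets the speedup for free."""
        return self.analyze_batch_memory(contents, **options)
```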
## GPU Status Monitoring in Debug Mode

The web app now includes comprehensive GPU status information in debug mode. To access it:

1. Enable "Debug Mode" in the sidebar
2. Expand the "GPU Status" section

### Features

**PyTorch/CUDA Information:**

- PyTorch installation and version
- CUDA availability and version
- Number of GPUs and their names
- GPU memory usage (allocated, reserved, free)

**SpaCy GPU Configuration:**

- SpaCy GPU enablement status
- Current GPU device in use
- spacy-transformers installation status

**Active Model GPU Status:**

- Current model's device configuration
- GPU optimization status (mixed precision, batch sizes)
- SpaCy version information

**Performance Tips:**

- Optimization recommendations
- Common troubleshooting guidance
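A minimal sketch of how such a status report can be gathered: the function name is hypothetical, while the `torch.cuda` calls are the real PyTorch API, guarded so the report degrades gracefully when PyTorch is absent:

```python
def gpu_status_report():
    """Collect a small GPU status dict; safe to call without PyTorch installed."""
    report = {"torch_installed": False, "cuda_available": False, "gpus": []}
    try:
        import torch
    except ImportError:
        return report  # no PyTorch: report CPU-only status instead of crashing
    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    if torch.cuda.is_available():
        report["cuda_available"] = True
        for i in range(torch.cuda.device_count()):
            report["gpus"].append({
                "name": torch.cuda.get_device_name(i),
                "allocated_bytes": torch.cuda.memory_allocated(i),
                "reserved_bytes": torch.cuda.memory_reserved(i),
            })
    return report
```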
### Benefits

This integrated GPU monitoring eliminates the need for the separate `test_gpu_support.py` script for most use cases. Developers can now:

- Quickly verify GPU availability without running external scripts
- Monitor GPU memory usage during batch processing
- Confirm that models are correctly utilizing GPU acceleration
- Troubleshoot performance issues more effectively

### Usage Example

When processing large batches with transformer models:

1. Enable debug mode to monitor GPU utilization
2. Check that the model is using the GPU (not the CPU fallback)
3. Monitor memory usage to prevent out-of-memory errors
4. Adjust batch sizes based on available GPU memory
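Step 4 can be reduced to a simple heuristic. The function, the headroom fraction, and the cap below are illustrative assumptions, not measured values:

```python
def safe_batch_size(free_gpu_bytes, per_item_bytes, headroom=0.2, max_batch=128):
    """Pick a batch size that fits in free GPU memory with a safety headroom.

    headroom reserves a fraction of free memory for fragmentation and
    activation spikes; max_batch caps the result regardless of memory.
    """
    usable = free_gpu_bytes * (1.0 - headroom)
    size = int(usable // per_item_bytes)
    return max(1, min(size, max_batch))

# e.g. 8 GiB free, ~64 MiB per item:
batch = safe_batch_size(8 * 1024**3, 64 * 1024**2)
```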