# Batch Processing Performance Optimization
## Performance Issues Identified
### 1. **Multiple Analysis Calls Per File (Biggest Issue)**
The original implementation made 3 separate calls to `analyze_text()` for each file:
- One for Content Words (CW)
- One for Function Words (FW)
- One for n-grams (without word type filter)
Each call runs the entire SpaCy pipeline (tokenization, POS tagging, dependency parsing), essentially **tripling** the processing time.
### 2. **Memory Accumulation**
- All results stored in memory with detailed token information
- No streaming or chunking capabilities
- Everything stays in memory until batch completes
### 3. **Default Model Size**
- Default SpaCy model is 'trf' (transformer-based), which is much slower than 'md'
- Found in `session_manager.py`: `'model_size': 'trf'`
## Optimizations Implemented
### Phase 1: Single-Pass Analysis (70% Performance Gain)
**Changes Made:**
1. **Modified `analyze_text()` method** to support `separate_word_types` parameter
- Processes both CW and FW in a single pass through the text
- Collects statistics for both word types simultaneously
- N-grams are processed in the same pass
2. **Updated batch processing handlers** to use single-pass analysis:
```python
# OLD: 3 separate calls, each re-running the full SpaCy pipeline
for word_type in ['CW', 'FW']:
    analysis = analyzer.analyze_text(text, ...)
full_analysis = analyzer.analyze_text(text, ...)  # for n-grams

# NEW: single optimized call
analysis = analyzer.analyze_text(
    text_content,
    selected_indices,
    separate_word_types=True  # process CW and FW separately in the same pass
)
```
3. **Added optimized batch method** `analyze_batch_memory()`:
- Works directly with in-memory file contents
- Supports all new analysis parameters
- Maintains backward compatibility
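As a rough illustration of the method's shape, it iterates over in-memory contents and calls the single-pass analyzer once per file. The `analyze_text()` internals below are stubbed; only the overall wiring follows this document:

```python
class Analyzer:
    def analyze_text(self, text, selected_indices=None, separate_word_types=True):
        # Stand-in for the real single-pass spaCy analysis described above.
        return {"n_tokens": len(text.split()),
                "separate_word_types": separate_word_types}

    def analyze_batch_memory(self, file_contents, selected_indices=None,
                             separate_word_types=True):
        """Analyze {filename: text} pairs already held in memory."""
        return {
            name: self.analyze_text(text, selected_indices,
                                    separate_word_types=separate_word_types)
            for name, text in file_contents.items()
        }
```

Because the method accepts plain `{filename: text}` dicts, callers that previously read files themselves can pass their buffers straight through, which is what keeps the change backward compatible.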
## Performance Recommendations
### 1. **Use 'md' Model Instead of 'trf'**
The transformer model ('trf') is significantly slower. For batch processing, consider using 'md':
- 3-5x faster processing
- Still provides good accuracy for lexical sophistication analysis
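A minimal sketch of wiring this recommendation into model loading. The package names assume the English spaCy models, and the `model_size` key mirrors the `session_manager.py` setting mentioned above:

```python
# Map the app's model_size setting to spaCy package names (English assumed).
MODEL_BY_SIZE = {
    "md": "en_core_web_md",    # ~3-5x faster; good accuracy for lexical measures
    "trf": "en_core_web_trf",  # transformer-based; most accurate but slowest
}

def load_model(model_size: str = "md"):
    """Load the spaCy pipeline for the requested model size."""
    import spacy  # deferred so the mapping can be inspected without spaCy
    return spacy.load(MODEL_BY_SIZE[model_size])
```

Switching the default from `'trf'` to `'md'` in the session settings is then a one-line change.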
### 2. **Enable Smart Defaults**
Smart defaults optimize which measures to compute, reducing unnecessary calculations.
### 3. **For Very Large Batches**
Consider implementing:
- Chunk processing (process N files at a time)
- Parallel processing using multiprocessing
- Results streaming to disk instead of memory accumulation
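A hedged sketch combining all three ideas: process files in fixed-size chunks, fan each chunk out to worker processes, and stream each result to disk as JSON lines. The `analyze_one` worker is a placeholder for the real analyzer call:

```python
import json
from multiprocessing import Pool

def analyze_one(item):
    """Placeholder worker; the real version would call analyzer.analyze_text()."""
    name, text = item
    return name, {"n_tokens": len(text.split())}

def process_in_chunks(files, chunk_size=8, out_path="results.jsonl"):
    """Analyze `files` ({name: text}) N at a time, streaming results to disk."""
    items = list(files.items())
    with open(out_path, "w", encoding="utf-8") as out, Pool() as pool:
        for start in range(0, len(items), chunk_size):
            chunk = items[start:start + chunk_size]
            for name, result in pool.map(analyze_one, chunk):
                out.write(json.dumps({"file": name, "result": result}) + "\n")
```

Because only one chunk of results is ever in memory, peak usage stays flat no matter how many files the batch contains.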
## Expected Performance Gains
With the optimizations implemented:
- **~70% reduction** in processing time from eliminating redundant analysis calls
- **Additional 20-30%** possible by switching from 'trf' to 'md' model
- **Memory usage** remains similar but could be optimized further with streaming
## How to Use the Optimized Version
The optimizations are transparent to users. Batch processing automatically uses the single-pass analysis when:
- No specific word type filter is selected
- The files being processed need both CW and FW analysis
For legacy compatibility, the old `analyze_batch()` method has been updated to use the optimized approach internally.
## GPU Status Monitoring in Debug Mode
The web app now includes comprehensive GPU status information in debug mode. To access:
1. Enable "🐛 Debug Mode" in the sidebar
2. Expand the "GPU Status" section
### Features
**PyTorch/CUDA Information:**
- PyTorch installation and version
- CUDA availability and version
- Number of GPUs and their names
- GPU memory usage (allocated, reserved, free)
**SpaCy GPU Configuration:**
- SpaCy GPU enablement status
- Current GPU device being used
- spacy-transformers installation status
**Active Model GPU Status:**
- Current model's device configuration
- GPU optimization status (mixed precision, batch sizes)
- SpaCy version information
**Performance Tips:**
- Optimization recommendations
- Common troubleshooting guidance
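The PyTorch/CUDA portion of this panel can be approximated with standard `torch.cuda` calls. This standalone sketch degrades gracefully when PyTorch or a GPU is absent; the `gpu_status` function name is illustrative, not the app's actual API:

```python
def gpu_status():
    """Collect GPU info similar to what the debug panel reports."""
    status = {"torch_installed": False, "cuda_available": False, "devices": []}
    try:
        import torch
    except ImportError:
        return status  # PyTorch not installed; report CPU-only status
    status["torch_installed"] = True
    status["torch_version"] = torch.__version__
    if torch.cuda.is_available():
        status["cuda_available"] = True
        status["cuda_version"] = torch.version.cuda
        for i in range(torch.cuda.device_count()):
            free, total = torch.cuda.mem_get_info(i)  # bytes (free, total)
            status["devices"].append({
                "name": torch.cuda.get_device_name(i),
                "free_mb": free // 2**20,
                "total_mb": total // 2**20,
                "allocated_mb": torch.cuda.memory_allocated(i) // 2**20,
                "reserved_mb": torch.cuda.memory_reserved(i) // 2**20,
            })
    return status
```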
### Benefits
This integrated GPU monitoring eliminates the need for the separate `test_gpu_support.py` script for most use cases. Developers can now:
- Quickly verify GPU availability without running external scripts
- Monitor GPU memory usage during batch processing
- Confirm that models are correctly utilizing GPU acceleration
- Troubleshoot performance issues more effectively
### Usage Example
When processing large batches with transformer models:
1. Enable debug mode to monitor GPU utilization
2. Check that the model is using GPU (not CPU fallback)
3. Monitor memory usage to prevent out-of-memory errors
4. Adjust batch sizes based on available GPU memory
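Step 4 can be automated: derive a batch size from free GPU memory and fall back to a conservative default on CPU. The `per_item_mb` estimate is an assumption you would tune for your model:

```python
def pick_batch_size(per_item_mb=200, default=8):
    """Pick a batch size from free GPU memory; fall back to a default on CPU."""
    try:
        import torch
        if torch.cuda.is_available():
            free_bytes, _total = torch.cuda.mem_get_info()
            return max(1, int(free_bytes // (per_item_mb * 2**20)))
    except ImportError:
        pass
    return default
```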