# Batch Processing Performance Optimization

## Performance Issues Identified

### 1. **Multiple Analysis Calls Per File (Biggest Issue)**

The original implementation made 3 separate calls to `analyze_text()` for each file:

- One for Content Words (CW)
- One for Function Words (FW)
- One for n-grams (without a word type filter)

Each call runs the entire SpaCy pipeline (tokenization, POS tagging, dependency parsing), essentially **tripling** the processing time.

### 2. **Memory Accumulation**

- All results are stored in memory with detailed token information
- No streaming or chunking capabilities
- Everything stays in memory until the batch completes

### 3. **Default Model Size**

- The default SpaCy model is 'trf' (transformer-based), which is much slower than 'md'
- Found in `session_manager.py`: `'model_size': 'trf'`

## Optimizations Implemented

### Phase 1: Single-Pass Analysis (70% Performance Gain)

**Changes Made:**

1. **Modified the `analyze_text()` method** to support a `separate_word_types` parameter
   - Processes both CW and FW in a single pass through the text
   - Collects statistics for both word types simultaneously
   - N-grams are processed in the same pass
2. **Updated the batch processing handlers** to use single-pass analysis:
   ```python
   # OLD: 3 separate calls, each re-running the full SpaCy pipeline
   for word_type in ['CW', 'FW']:
       analysis = analyzer.analyze_text(text, ...)
   full_analysis = analyzer.analyze_text(text, ...)  # for n-grams

   # NEW: single optimized call
   analysis = analyzer.analyze_text(
       text_content,
       selected_indices,
       separate_word_types=True,  # process CW and FW separately in the same pass
   )
   ```
3. **Added an optimized batch method**, `analyze_batch_memory()`:
   - Works directly with in-memory file contents
   - Supports all new analysis parameters
   - Maintains backward compatibility
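The single-pass idea can be sketched in isolation. The helper below is a hypothetical, simplified stand-in for `analyze_text()` (the real method runs the full SpaCy pipeline and uses POS tags to separate CW from FW); it only shows how CW statistics, FW statistics, and n-grams can all be collected in one traversal of the tokens instead of three:

```python
from collections import Counter

# Hypothetical stand-in for the POS-based CW/FW split done by the SpaCy pipeline.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "on", "and", "is"}

def analyze_single_pass(tokens, ngram_n=2):
    """Collect CW stats, FW stats, and n-grams in ONE pass over the tokens."""
    stats = {"CW": Counter(), "FW": Counter(), "ngrams": Counter()}
    for i, tok in enumerate(tokens):
        word_type = "FW" if tok.lower() in FUNCTION_WORDS else "CW"
        stats[word_type][tok.lower()] += 1          # tally CW or FW
        if i + ngram_n <= len(tokens):              # tally n-grams in the same loop
            stats["ngrams"][tuple(t.lower() for t in tokens[i:i + ngram_n])] += 1
    return stats

stats = analyze_single_pass("The cat sat on the mat".split())
```

Because the (expensive) tokenization happens once, the per-file cost no longer scales with the number of word-type views requested.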
## Performance Recommendations

### 1. **Use the 'md' Model Instead of 'trf'**

The transformer model ('trf') is significantly slower. For batch processing, consider using 'md':

- 3-5x faster processing
- Still provides good accuracy for lexical sophistication analysis
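Switching models can be a single configuration change. As a sketch, the 'md'/'trf' shorthand maps onto SpaCy's real English package names (`en_core_web_md`, `en_core_web_trf`); the helper function itself is hypothetical, not existing project code:

```python
# Hypothetical helper: map the app's 'md'/'trf' shorthand to SpaCy package names.
MODEL_NAMES = {
    "md": "en_core_web_md",    # 3-5x faster; good accuracy for lexical measures
    "trf": "en_core_web_trf",  # transformer-based; highest accuracy, slowest
}

def resolve_model(size: str = "md") -> str:
    """Return the SpaCy package name for a given model-size shorthand."""
    if size not in MODEL_NAMES:
        raise ValueError(f"unknown model size: {size!r}")
    return MODEL_NAMES[size]

# Loading would then be: nlp = spacy.load(resolve_model("md"))
```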
### 2. **Enable Smart Defaults**

Smart defaults optimize which measures are computed, reducing unnecessary calculations.

### 3. **For Very Large Batches**

Consider implementing:

- Chunk processing (process N files at a time)
- Parallel processing using multiprocessing
- Streaming results to disk instead of accumulating them in memory
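None of these are implemented yet. As a sketch under that caveat, chunking combined with streaming results to disk might look like the following; the JSONL output format and the `chunked`/`process_batch_streaming` names are assumptions, not existing code:

```python
import json

def chunked(items, n):
    """Yield successive chunks of at most n items."""
    for i in range(0, len(items), n):
        yield items[i:i + n]

def process_batch_streaming(files, analyze, out_path, chunk_size=10):
    """Process (name, text) pairs chunk by chunk, appending one JSON line each.

    Only one chunk's results are held in memory at a time; each chunk could
    also be handed to a multiprocessing.Pool for parallelism.
    """
    with open(out_path, "w", encoding="utf-8") as out:
        for chunk in chunked(files, chunk_size):
            for name, text in chunk:
                result = analyze(text)  # placeholder for the real analyzer call
                out.write(json.dumps({"file": name, "result": result}) + "\n")
```

Because each result is flushed to disk as soon as it is computed, peak memory is bounded by the chunk size rather than the batch size.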
## Expected Performance Gains

With the optimizations implemented:

- **~70% reduction** in processing time from eliminating redundant analysis calls
- **An additional 20-30%** is possible by switching from the 'trf' to the 'md' model
- **Memory usage** remains similar but could be optimized further with streaming

## How to Use the Optimized Version

The optimizations are transparent to users. Batch processing automatically uses single-pass analysis when:

- No specific word type filter is selected
- The files being processed need both CW and FW analysis

For legacy compatibility, the old `analyze_batch()` method has been updated to use the optimized approach internally.
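The delegation pattern behind that compatibility shim is simple. The class and method bodies below are illustrative stand-ins, not the project's actual code:

```python
class Analyzer:
    def analyze_batch_memory(self, contents, **options):
        """New optimized path: single-pass analysis over in-memory contents."""
        # Stand-in computation; the real method runs the single-pass analysis.
        return {name: len(text.split()) for name, text in contents.items()}

    def analyze_batch(self, contents, **options):
        """Legacy entry point, kept for old callers; it delegates to the
        optimized implementation, so existing code gets the speedup for free."""
        return self.analyze_batch_memory(contents, **options)
```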
## GPU Status Monitoring in Debug Mode

The web app now includes comprehensive GPU status information in debug mode. To access it:

1. Enable "Debug Mode" in the sidebar
2. Expand the "GPU Status" section

### Features

**PyTorch/CUDA Information:**

- PyTorch installation and version
- CUDA availability and version
- Number of GPUs and their names
- GPU memory usage (allocated, reserved, free)

**SpaCy GPU Configuration:**

- SpaCy GPU enablement status
- Current GPU device in use
- spacy-transformers installation status

**Active Model GPU Status:**

- Current model's device configuration
- GPU optimization status (mixed precision, batch sizes)
- SpaCy version information

**Performance Tips:**

- Optimization recommendations
- Common troubleshooting guidance
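A minimal sketch of how such a status report can be gathered: the function name is hypothetical, while the `torch.cuda` calls are the real PyTorch API, guarded so the report degrades gracefully when PyTorch is absent:

```python
def gpu_status_report():
    """Collect a small GPU status dict; safe to call without PyTorch installed."""
    report = {"torch_installed": False, "cuda_available": False, "gpus": []}
    try:
        import torch
    except ImportError:
        return report  # no PyTorch: report CPU-only status instead of crashing
    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    if torch.cuda.is_available():
        report["cuda_available"] = True
        for i in range(torch.cuda.device_count()):
            report["gpus"].append({
                "name": torch.cuda.get_device_name(i),
                "allocated_bytes": torch.cuda.memory_allocated(i),
                "reserved_bytes": torch.cuda.memory_reserved(i),
            })
    return report
```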
### Benefits

This integrated GPU monitoring eliminates the need for the separate `test_gpu_support.py` script for most use cases. Developers can now:

- Quickly verify GPU availability without running external scripts
- Monitor GPU memory usage during batch processing
- Confirm that models are correctly utilizing GPU acceleration
- Troubleshoot performance issues more effectively

### Usage Example

When processing large batches with transformer models:

1. Enable debug mode to monitor GPU utilization
2. Check that the model is using the GPU (not the CPU fallback)
3. Monitor memory usage to prevent out-of-memory errors
4. Adjust batch sizes based on available GPU memory
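Step 4 can be reduced to a simple heuristic. The function, the headroom fraction, and the cap below are illustrative assumptions, not measured values:

```python
def safe_batch_size(free_gpu_bytes, per_item_bytes, headroom=0.2, max_batch=128):
    """Pick a batch size that fits in free GPU memory with a safety headroom.

    headroom reserves a fraction of free memory for fragmentation and
    activation spikes; max_batch caps the result regardless of memory.
    """
    usable = free_gpu_bytes * (1.0 - headroom)
    size = int(usable // per_item_bytes)
    return max(1, min(size, max_batch))

# e.g. 8 GiB free, ~64 MiB per item:
batch = safe_batch_size(8 * 1024**3, 64 * 1024**2)
```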