raylim Claude Sonnet 4.5 committed on
Commit
6e06a36
·
unverified ·
1 Parent(s): 42bcf72

Add comprehensive logging for batch processing verification

Browse files

Improves logging throughout the batch processing pipeline to clearly
demonstrate that the optimization is working correctly.

Model Manager (src/mosaic/model_manager.py):
- Add detailed GPU detection and memory reporting
- Show total GPU memory and memory before/after loading each model
- Log memory management strategy (T4 aggressive vs A100 caching)
- Clear banner indicating models loaded ONCE per batch
- Show Paladin model cache hits vs new loads with ✓ indicators
- Distinguish between aggressive mode (load/free) and caching mode

Batch Analysis (src/mosaic/batch_analysis.py):
- Add comprehensive timing for entire batch and per-slide processing
- Log batch start with clear separator
- Show per-slide progress with [n/total] indicators
- Track and report slide processing times (avg, min, max, total)
- Calculate and display batch overhead vs processing time
- Add detailed summary at end with:
* Success/failure counts
* Model loading time (done once)
* Total batch time
* Per-slide statistics
* Efficiency metrics
* Optimization benefits summary

Pipeline Analysis (src/mosaic/analysis.py):
- Indicate when using PRE-LOADED models (marker classifier, Aeon)
- Makes it clear models are being reused, not reloaded

Example output format:
================================================================================
BATCH PROCESSING: Starting analysis of 10 slides
================================================================================
GPU detected: NVIDIA Tesla T4
Memory management strategy: AGGRESSIVE (T4)
Loading models...
✓ Marker Classifier loaded (GPU: 0.15 GB)
✓ Aeon model loaded (GPU: 0.45 GB)
✓ All core models loaded (Total: 0.45 GB)
These models will be REUSED for all slides in this batch
Model loading completed in 3.2s

[1/10] Processing: slide1.svs
Using pre-loaded models (no disk I/O for core models)
[1/10] ✓ Completed in 45.2s

BATCH PROCESSING SUMMARY
Total slides: 10
Successfully processed: 10
Model loading time: 3.2s (done ONCE for entire batch)
Total batch time: 458.5s
Per-slide times: Avg: 45.5s, Min: 42.1s, Max: 48.3s
✓ Batch processing optimization benefits:
- Models loaded ONCE (not once per slide)
- Reduced disk I/O for model loading

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

src/mosaic/analysis.py CHANGED
@@ -540,7 +540,7 @@ def _run_inference_pipeline_with_models(
540
  # Step 2: Filter features using pre-loaded marker classifier
541
  start_time = pd.Timestamp.now()
542
  progress(0.35, desc="Filtering features with marker classifier")
543
- logger.info("Filtering features with marker classifier")
544
  _, filtered_coords = filter_features(
545
  ctranspath_features,
546
  coords,
@@ -563,7 +563,7 @@ def _run_inference_pipeline_with_models(
563
 
564
  # Check if cancer subtype is unknown
565
  if cancer_subtype in ["Unknown", None]:
566
- logger.info("Running Aeon inference (cancer subtype unknown)")
567
  aeon_results = _run_aeon_inference_with_model(
568
  features,
569
  model_cache.aeon_model, # Use pre-loaded Aeon model
 
540
  # Step 2: Filter features using pre-loaded marker classifier
541
  start_time = pd.Timestamp.now()
542
  progress(0.35, desc="Filtering features with marker classifier")
543
+ logger.info("Filtering features with PRE-LOADED marker classifier")
544
  _, filtered_coords = filter_features(
545
  ctranspath_features,
546
  coords,
 
563
 
564
  # Check if cancer subtype is unknown
565
  if cancer_subtype in ["Unknown", None]:
566
+ logger.info("Running Aeon inference with PRE-LOADED model (cancer subtype unknown)")
567
  aeon_results = _run_aeon_inference_with_model(
568
  features,
569
  model_cache.aeon_model, # Use pre-loaded Aeon model
src/mosaic/batch_analysis.py CHANGED
@@ -7,6 +7,7 @@ overhead compared to processing slides individually.
7
 
8
  from typing import Dict, List, Optional, Tuple
9
  import pandas as pd
 
10
  from loguru import logger
11
 
12
  from mosaic.model_manager import load_all_models
@@ -77,24 +78,31 @@ def analyze_slides_batch(
77
  progress = lambda frac, desc: None # No-op progress function
78
 
79
  num_slides = len(slides)
80
- logger.info(f"Starting batch analysis of {num_slides} slides with models loaded once")
 
 
 
 
81
 
82
  # Step 1: Load all models once
83
- logger.info("Loading models for batch processing...")
84
  progress(0.0, desc="Loading models for batch processing")
 
85
 
86
  try:
87
  model_cache = load_all_models(
88
  use_gpu=True,
89
  aggressive_memory_mgmt=aggressive_memory_mgmt,
90
  )
91
- logger.info("Models loaded successfully")
 
 
 
92
 
93
  # Log memory strategy
94
  if model_cache.aggressive_memory_mgmt:
95
  logger.info(
96
- "Using aggressive memory management (T4-style): "
97
- "Paladin models will be loaded and freed per slide"
98
  )
99
  else:
100
  logger.info(
@@ -110,6 +118,11 @@ def analyze_slides_batch(
110
  all_slide_masks = []
111
  all_aeon_results = []
112
  all_paladin_results = []
 
 
 
 
 
113
 
114
  try:
115
  for idx, (slide_path, (_, row)) in enumerate(zip(slides, settings_df.iterrows())):
@@ -119,7 +132,10 @@ def analyze_slides_batch(
119
  progress_frac = (idx + 0.1) / num_slides
120
  progress(progress_frac, desc=f"Analyzing slide {idx + 1}/{num_slides}: {slide_name}")
121
 
122
- logger.info(f"Processing slide {idx + 1}/{num_slides}: {slide_name}")
 
 
 
123
 
124
  try:
125
  # Use batch-optimized analysis with pre-loaded models
@@ -137,6 +153,9 @@ def analyze_slides_batch(
137
  progress=progress,
138
  )
139
 
 
 
 
140
  # Collect results
141
  if slide_mask is not None:
142
  all_slide_masks.append((slide_mask, slide_name))
@@ -154,24 +173,66 @@ def analyze_slides_batch(
154
  )
155
  all_paladin_results.append(paladin_results)
156
 
157
- logger.info(f"Successfully processed slide {idx + 1}/{num_slides}")
158
 
159
  except Exception as e:
160
- logger.exception(f"Error processing slide {slide_name}: {e}")
 
 
161
  # Continue with next slide instead of failing entire batch
162
  continue
163
 
164
  finally:
165
  # Step 3: Always cleanup models (even if there were errors)
 
 
166
  logger.info("Cleaning up models...")
167
  progress(0.99, desc="Cleaning up models")
168
  model_cache.cleanup()
169
- logger.info("Model cleanup complete")
170
-
171
- progress(1.0, desc=f"Batch analysis complete ({num_slides} slides)")
172
- logger.info(
173
- f"Batch analysis complete: "
174
- f"Processed {len(all_slide_masks)}/{num_slides} slides successfully"
175
- )
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
176
 
177
  return all_slide_masks, all_aeon_results, all_paladin_results
 
7
 
8
  from typing import Dict, List, Optional, Tuple
9
  import pandas as pd
10
+ import time
11
  from loguru import logger
12
 
13
  from mosaic.model_manager import load_all_models
 
78
  progress = lambda frac, desc: None # No-op progress function
79
 
80
  num_slides = len(slides)
81
+ batch_start_time = time.time()
82
+
83
+ logger.info("=" * 80)
84
+ logger.info(f"BATCH PROCESSING: Starting analysis of {num_slides} slides")
85
+ logger.info("=" * 80)
86
 
87
  # Step 1: Load all models once
 
88
  progress(0.0, desc="Loading models for batch processing")
89
+ model_load_start = time.time()
90
 
91
  try:
92
  model_cache = load_all_models(
93
  use_gpu=True,
94
  aggressive_memory_mgmt=aggressive_memory_mgmt,
95
  )
96
+
97
+ model_load_time = time.time() - model_load_start
98
+ logger.info(f"Model loading completed in {model_load_time:.2f}s")
99
+ logger.info("")
100
 
101
  # Log memory strategy
102
  if model_cache.aggressive_memory_mgmt:
103
  logger.info(
104
+ "Memory strategy: AGGRESSIVE (T4-style) - "
105
+ "Paladin models loaded/freed per slide"
106
  )
107
  else:
108
  logger.info(
 
118
  all_slide_masks = []
119
  all_aeon_results = []
120
  all_paladin_results = []
121
+ slide_times = []
122
+
123
+ logger.info("=" * 80)
124
+ logger.info("Processing slides with PRE-LOADED models (no model reloading!)")
125
+ logger.info("=" * 80)
126
 
127
  try:
128
  for idx, (slide_path, (_, row)) in enumerate(zip(slides, settings_df.iterrows())):
 
132
  progress_frac = (idx + 0.1) / num_slides
133
  progress(progress_frac, desc=f"Analyzing slide {idx + 1}/{num_slides}: {slide_name}")
134
 
135
+ logger.info("")
136
+ logger.info(f"[{idx + 1}/{num_slides}] Processing: {slide_name}")
137
+ logger.info(f" Using pre-loaded models (no disk I/O for core models)")
138
+ slide_start_time = time.time()
139
 
140
  try:
141
  # Use batch-optimized analysis with pre-loaded models
 
153
  progress=progress,
154
  )
155
 
156
+ slide_time = time.time() - slide_start_time
157
+ slide_times.append(slide_time)
158
+
159
  # Collect results
160
  if slide_mask is not None:
161
  all_slide_masks.append((slide_mask, slide_name))
 
173
  )
174
  all_paladin_results.append(paladin_results)
175
 
176
+ logger.info(f"[{idx + 1}/{num_slides}] βœ“ Completed in {slide_time:.2f}s")
177
 
178
  except Exception as e:
179
+ slide_time = time.time() - slide_start_time
180
+ slide_times.append(slide_time)
181
+ logger.exception(f"[{idx + 1}/{num_slides}] βœ— Failed after {slide_time:.2f}s: {e}")
182
  # Continue with next slide instead of failing entire batch
183
  continue
184
 
185
  finally:
186
  # Step 3: Always cleanup models (even if there were errors)
187
+ logger.info("")
188
+ logger.info("=" * 80)
189
  logger.info("Cleaning up models...")
190
  progress(0.99, desc="Cleaning up models")
191
  model_cache.cleanup()
192
+ logger.info("βœ“ Model cleanup complete")
193
+
194
+ # Calculate batch statistics
195
+ batch_total_time = time.time() - batch_start_time
196
+ num_successful = len(all_slide_masks)
197
+ num_failed = num_slides - num_successful
198
+
199
+ # Log comprehensive summary
200
+ logger.info("=" * 80)
201
+ logger.info("BATCH PROCESSING SUMMARY")
202
+ logger.info("=" * 80)
203
+ logger.info(f"Total slides: {num_slides}")
204
+ logger.info(f"Successfully processed: {num_successful}")
205
+ logger.info(f"Failed: {num_failed}")
206
+ logger.info("")
207
+ logger.info(f"Model loading time: {model_load_time:.2f}s (done ONCE for entire batch)")
208
+ logger.info(f"Total batch time: {batch_total_time:.2f}s")
209
+
210
+ if slide_times:
211
+ avg_slide_time = sum(slide_times) / len(slide_times)
212
+ min_slide_time = min(slide_times)
213
+ max_slide_time = max(slide_times)
214
+ total_slide_time = sum(slide_times)
215
+
216
+ logger.info("")
217
+ logger.info("Per-slide processing times:")
218
+ logger.info(f" Average: {avg_slide_time:.2f}s")
219
+ logger.info(f" Min: {min_slide_time:.2f}s")
220
+ logger.info(f" Max: {max_slide_time:.2f}s")
221
+ logger.info(f" Total: {total_slide_time:.2f}s")
222
+
223
+ # Calculate efficiency
224
+ overhead_time = batch_total_time - total_slide_time
225
+ logger.info("")
226
+ logger.info(f"Batch overhead: {overhead_time:.2f}s ({overhead_time/batch_total_time*100:.1f}%)")
227
+ logger.info(f"Slide processing: {total_slide_time:.2f}s ({total_slide_time/batch_total_time*100:.1f}%)")
228
+
229
+ logger.info("")
230
+ logger.info("βœ“ Batch processing optimization benefits:")
231
+ logger.info(" - Models loaded ONCE (not once per slide)")
232
+ logger.info(" - Reduced disk I/O for model loading")
233
+ logger.info(f" - Processed {num_slides} slides with shared model cache")
234
+ logger.info("=" * 80)
235
+
236
+ progress(1.0, desc=f"Batch analysis complete ({num_successful}/{num_slides} successful)")
237
 
238
  return all_slide_masks, all_aeon_results, all_paladin_results
src/mosaic/model_manager.py CHANGED
@@ -117,7 +117,9 @@ def load_all_models(
117
  FileNotFoundError: If model files are not found in data/ directory
118
  RuntimeError: If CUDA is requested but not available
119
  """
120
- logger.info("Loading models for batch processing...")
 
 
121
 
122
  # Detect GPU type
123
  device = torch.device("cpu")
@@ -127,15 +129,23 @@ def load_all_models(
127
  device = torch.device("cuda")
128
  gpu_name = torch.cuda.get_device_name(0)
129
  is_t4_gpu = "T4" in gpu_name
130
- logger.info(f"Detected GPU: {gpu_name}")
 
 
 
 
 
 
131
 
132
  # Auto-detect memory management strategy
133
  if aggressive_memory_mgmt is None:
134
  aggressive_memory_mgmt = is_t4_gpu
135
- logger.info(
136
- f"Auto-detected memory management: "
137
- f"{'aggressive (T4)' if is_t4_gpu else 'caching (high-memory GPU)'}"
138
- )
 
 
139
  elif use_gpu and not torch.cuda.is_available():
140
  logger.warning("GPU requested but CUDA not available, falling back to CPU")
141
  use_gpu = False
@@ -173,7 +183,11 @@ def load_all_models(
173
 
174
  with open(marker_classifier_path, "rb") as f:
175
  marker_classifier = pickle.load(f) # nosec
176
- logger.info("Marker Classifier loaded successfully")
 
 
 
 
177
 
178
  # Load Aeon model
179
  logger.info("Loading Aeon model...")
@@ -185,12 +199,23 @@ def load_all_models(
185
  aeon_model = pickle.load(f) # nosec
186
  aeon_model.to(device)
187
  aeon_model.eval()
188
- logger.info("Aeon model loaded successfully")
189
 
190
- # Log memory usage
 
 
 
 
 
191
  if use_gpu and torch.cuda.is_available():
192
  mem_allocated = torch.cuda.memory_allocated() / (1024**3)
193
- logger.info(f"GPU memory after loading core models: {mem_allocated:.2f} GB")
 
 
 
 
 
 
194
 
195
  # Create ModelCache
196
  cache = ModelCache(
@@ -203,7 +228,6 @@ def load_all_models(
203
  device=device,
204
  )
205
 
206
- logger.info("All core models loaded successfully")
207
  return cache
208
 
209
 
@@ -232,11 +256,15 @@ def load_paladin_model_for_inference(
232
 
233
  # Check cache first (only used in non-aggressive mode)
234
  if not cache.aggressive_memory_mgmt and model_key in cache.paladin_models:
235
- logger.debug(f"Using cached Paladin model: {model_path.name}")
236
  return cache.paladin_models[model_key]
237
 
238
  # Load model from disk
239
- logger.debug(f"Loading Paladin model: {model_path.name}")
 
 
 
 
240
  with open(model_path, "rb") as f:
241
  model = pickle.load(f) # nosec
242
 
@@ -246,6 +274,6 @@ def load_paladin_model_for_inference(
246
  # Cache if not in aggressive mode
247
  if not cache.aggressive_memory_mgmt:
248
  cache.paladin_models[model_key] = model
249
- logger.debug(f"Cached Paladin model: {model_path.name}")
250
 
251
  return model
 
117
  FileNotFoundError: If model files are not found in data/ directory
118
  RuntimeError: If CUDA is requested but not available
119
  """
120
+ logger.info("=" * 80)
121
+ logger.info("BATCH PROCESSING: Loading models (this happens ONCE per batch)")
122
+ logger.info("=" * 80)
123
 
124
  # Detect GPU type
125
  device = torch.device("cpu")
 
129
  device = torch.device("cuda")
130
  gpu_name = torch.cuda.get_device_name(0)
131
  is_t4_gpu = "T4" in gpu_name
132
+ gpu_memory_total = torch.cuda.get_device_properties(0).total_memory / (1024**3)
133
+ logger.info(f"GPU detected: {gpu_name}")
134
+ logger.info(f"GPU total memory: {gpu_memory_total:.2f} GB")
135
+
136
+ # Log initial GPU memory
137
+ mem_before = torch.cuda.memory_allocated() / (1024**3)
138
+ logger.info(f"GPU memory before loading models: {mem_before:.2f} GB")
139
 
140
  # Auto-detect memory management strategy
141
  if aggressive_memory_mgmt is None:
142
  aggressive_memory_mgmt = is_t4_gpu
143
+ strategy = "AGGRESSIVE (T4)" if is_t4_gpu else "CACHING (High-Memory GPU)"
144
+ logger.info(f"Memory management strategy: {strategy}")
145
+ if is_t4_gpu:
146
+ logger.info(" β†’ Paladin models will be loaded and freed per slide")
147
+ else:
148
+ logger.info(" β†’ Paladin models will be cached and reused across slides")
149
  elif use_gpu and not torch.cuda.is_available():
150
  logger.warning("GPU requested but CUDA not available, falling back to CPU")
151
  use_gpu = False
 
183
 
184
  with open(marker_classifier_path, "rb") as f:
185
  marker_classifier = pickle.load(f) # nosec
186
+ logger.info("βœ“ Marker Classifier loaded")
187
+
188
+ if use_gpu and torch.cuda.is_available():
189
+ mem = torch.cuda.memory_allocated() / (1024**3)
190
+ logger.info(f" GPU memory: {mem:.2f} GB")
191
 
192
  # Load Aeon model
193
  logger.info("Loading Aeon model...")
 
199
  aeon_model = pickle.load(f) # nosec
200
  aeon_model.to(device)
201
  aeon_model.eval()
202
+ logger.info("βœ“ Aeon model loaded")
203
 
204
+ if use_gpu and torch.cuda.is_available():
205
+ mem = torch.cuda.memory_allocated() / (1024**3)
206
+ logger.info(f" GPU memory: {mem:.2f} GB")
207
+
208
+ # Log final memory usage
209
+ logger.info("-" * 80)
210
  if use_gpu and torch.cuda.is_available():
211
  mem_allocated = torch.cuda.memory_allocated() / (1024**3)
212
+ logger.info(f"βœ“ All core models loaded to GPU")
213
+ logger.info(f" Total GPU memory used: {mem_allocated:.2f} GB")
214
+ logger.info(f" These models will be REUSED for all slides in this batch")
215
+ else:
216
+ logger.info("βœ“ All core models loaded to CPU")
217
+ logger.info(" These models will be REUSED for all slides in this batch")
218
+ logger.info("-" * 80)
219
 
220
  # Create ModelCache
221
  cache = ModelCache(
 
228
  device=device,
229
  )
230
 
 
231
  return cache
232
 
233
 
 
256
 
257
  # Check cache first (only used in non-aggressive mode)
258
  if not cache.aggressive_memory_mgmt and model_key in cache.paladin_models:
259
+ logger.info(f" βœ“ Using CACHED Paladin model: {model_path.name} (no disk I/O!)")
260
  return cache.paladin_models[model_key]
261
 
262
  # Load model from disk
263
+ if cache.aggressive_memory_mgmt:
264
+ logger.info(f" β†’ Loading Paladin model: {model_path.name} (will free after use)")
265
+ else:
266
+ logger.info(f" β†’ Loading Paladin model: {model_path.name} (will cache for reuse)")
267
+
268
  with open(model_path, "rb") as f:
269
  model = pickle.load(f) # nosec
270
 
 
274
  # Cache if not in aggressive mode
275
  if not cache.aggressive_memory_mgmt:
276
  cache.paladin_models[model_key] = model
277
+ logger.info(f" βœ“ Cached Paladin model for future reuse")
278
 
279
  return model