# IVUS Segmentation and Bifurcation Detection
## Comprehensive Multi-Task Fine-Tuning Report
Date: February 20, 2026
## 1) Purpose and Scope
This report documents the full methodology used to adapt a pretrained IVUS segmentation model into a multi-task model that performs:
- Lumen segmentation (pixel-level)
- Bifurcation detection (frame-level)
The goal is to provide a self-contained technical description of model design, training behavior, threshold calibration, results, and limitations.
## 2) Problem Setup
Given an IVUS frame `x`, we optimize two tasks:
1. Segmentation output `M_hat`: lumen mask over pixels
2. Classification output `y_hat`: bifurcation probability in `[0,1]`
The model is trained at frame level. There is no temporal model (no recurrence, no sequence transformer, no optical flow objective).
## 3) Data and Labels
### 3.1 Data organization
The dataset is built from a frame bank of manually labeled IVUS frames, partitioned into train, validation, and test splits (600 frames in total, a 70/15/15 split).
Split counts:
- Train: 420
- Validation: 90
- Test: 90
### 3.2 Label distributions
Bifurcation positive rate by split:
- Train: 65.2%
- Validation: 65.6%
- Test: 65.6%
Lumen annotation coverage by split:
- Train: 47.4%
- Validation: 51.1%
- Test: 53.3%
Only about half of the frames in each split carry a lumen annotation, so classification supervision is denser than segmentation supervision in the multi-task setting.
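To make the supervision gap concrete, the reported rates can be converted back into approximate frame counts. This is a small sketch: the counts are derived by rounding the percentages above and are not figures taken from the dataset itself.

```python
# Split sizes and rates as reported above; counts are derived by rounding.
splits = {
    "train": {"n": 420, "bifurcation_rate": 0.652, "lumen_rate": 0.474},
    "validation": {"n": 90, "bifurcation_rate": 0.656, "lumen_rate": 0.511},
    "test": {"n": 90, "bifurcation_rate": 0.656, "lumen_rate": 0.533},
}


def approx_counts(n, bifurcation_rate, lumen_rate):
    """Convert per-split rates into approximate frame counts."""
    return {
        "bifurcation_positive": round(n * bifurcation_rate),
        "lumen_labeled": round(n * lumen_rate),
    }


derived = {name: approx_counts(**s) for name, s in splits.items()}
for name, counts in derived.items():
    print(name, counts)
```

Under this rounding, roughly 274 of 420 training frames are bifurcation-positive, while only about 199 carry a lumen annotation.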
### 3.3 Balance visualizations
![Split class balance](./memo_assets/split_class_balance_stacked.png)
![Positive rate by split](./memo_assets/positive_rate_by_split.png)
![Lumen coverage by split](./memo_assets/lumen_coverage_by_split.png)
## 4) Model Design
### 4.1 Backbone + multi-task head
A pretrained segmentation backbone is reused as initialization.
A lightweight **multi-task classification head** is attached on top of segmentation logits:
- Global average pooling over spatial dimensions
- Dense layer (ReLU)
- Dropout
- Final sigmoid output for bifurcation probability
This is a multi-task head, not an attention module.
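The head described above can be sketched as a plain NumPy forward pass. This is an illustration only: the layer widths, initialization, and dropout rate are assumptions, and the names `init_head` and `head_forward` are hypothetical rather than taken from the project code.

```python
import numpy as np


def init_head(in_channels, hidden_units, seed=0):
    """Create small random weights for the head (sizes are assumptions)."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0.0, 0.1, size=(in_channels, hidden_units)),
        "b1": np.zeros(hidden_units),
        "w2": rng.normal(0.0, 0.1, size=(hidden_units, 1)),
        "b2": np.zeros(1),
    }


def head_forward(seg_logits, params, dropout_rate=0.0, rng=None):
    """seg_logits: (batch, H, W, C). Returns probabilities of shape (batch,)."""
    # Global average pooling over the spatial dimensions -> (batch, C)
    pooled = seg_logits.mean(axis=(1, 2))
    # Dense layer with ReLU
    hidden = np.maximum(pooled @ params["w1"] + params["b1"], 0.0)
    # Inverted dropout, applied only when an rng is supplied (training mode)
    if rng is not None and dropout_rate > 0.0:
        keep = rng.random(hidden.shape) >= dropout_rate
        hidden = hidden * keep / (1.0 - dropout_rate)
    # Final sigmoid output for bifurcation probability
    z = hidden @ params["w2"] + params["b2"]
    return (1.0 / (1.0 + np.exp(-z)))[:, 0]
```

Because the head pools segmentation logits globally, it sees only a per-channel summary of each frame, which is what keeps it lightweight.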
### 4.2 Task coupling strategy
The segmentation branch and classification branch share upstream representation. This encourages feature reuse while keeping task-specific outputs separate.
### 4.3 Conceptual architecture
![Multi-task training and inference diagram](./memo_assets/multitask_pipeline_diagram.png)
## 5) Preprocessing and Input Construction
For each frame:
1. Apply central black-circle preprocessing (to suppress catheter/artifacts near center).
2. Convert grayscale to network input representation.
3. Align labels to frame indices.
For segmentation labels, only frames with valid lumen polygons are supervised.
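The first two preprocessing steps might look like the sketch below. The memo does not state the circle radius or the exact input representation, so `radius_px` and the scaling in `to_network_input` are assumptions, and both function names are hypothetical.

```python
import numpy as np


def mask_central_circle(frame, radius_px):
    """Zero out a disc of radius_px pixels around the image center."""
    h, w = frame.shape
    yy, xx = np.ogrid[:h, :w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    inside = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius_px ** 2
    out = frame.copy()  # leave the caller's array untouched
    out[inside] = 0
    return out


def to_network_input(frame):
    """Assumed conversion: scale 8-bit grayscale to [0, 1], add a channel axis."""
    return (frame.astype(np.float32) / 255.0)[..., np.newaxis]
```

Applying the same circle mask at training and inference time keeps the input distribution consistent between the two.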
## 6) Loss Functions and Optimization
Let `i` index samples in a minibatch.
- `m_i in {0,1}^{H x W}`: ground-truth lumen mask
- `m_hat_i`: predicted lumen probability map
- `y_i in {0,1}`: bifurcation label
- `y_hat_i in (0,1)`: bifurcation probability
- `h_i in {0,1}`: has-mask indicator (1 if segmentation label exists)
### 6.1 Segmentation loss
Weighted BCE + Dice:
```text
L_seg,i = L_wbce(m_i, m_hat_i; w_pos) + lambda_dice * L_dice(m_i, m_hat_i)
```
Masked batch aggregation (only labeled masks contribute):
```text
L_seg = (sum_i h_i * L_seg,i) / (sum_i h_i + eps)
```
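A minimal NumPy sketch of the masked aggregation follows. The exact weighted-BCE form and the Dice smoothing constant are assumptions; the point illustrated is that samples with `h_i = 0` contribute nothing to `L_seg`.

```python
import numpy as np


def weighted_bce(m, m_hat, w_pos, eps=1e-7):
    """Per-sample weighted BCE; w_pos scales the positive (lumen) term."""
    p = np.clip(m_hat, eps, 1.0 - eps)
    per_pixel = -(w_pos * m * np.log(p) + (1.0 - m) * np.log(1.0 - p))
    return per_pixel.mean(axis=(1, 2))


def dice_loss(m, m_hat, smooth=1.0):
    """Per-sample soft Dice loss (the smoothing term is an assumption)."""
    inter = (m * m_hat).sum(axis=(1, 2))
    denom = m.sum(axis=(1, 2)) + m_hat.sum(axis=(1, 2))
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)


def masked_seg_loss(m, m_hat, has_mask, w_pos=1.0, lambda_dice=1.0, eps=1e-7):
    """Average L_seg,i over the samples whose has_mask indicator is 1."""
    per_sample = weighted_bce(m, m_hat, w_pos) + lambda_dice * dice_loss(m, m_hat)
    return float((has_mask * per_sample).sum() / (has_mask.sum() + eps))
```

The `eps` in the denominator keeps the loss finite (and equal to zero) for a minibatch that happens to contain no labeled masks.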
### 6.2 Classification loss
Binary cross entropy:
```text
L_cls = (1/B) * sum_i L_bce(y_i, y_hat_i)
```
### 6.3 Total objective
```text
L_total = w_seg * L_seg + w_cls * L_cls
```
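The classification term and the weighted sum can be sketched in plain Python. The default weights of 1.0 are placeholders, since the memo does not report the values of `w_seg` and `w_cls`.

```python
import math


def bce(y, y_hat, eps=1e-7):
    """Binary cross entropy for one sample, with clipping for stability."""
    p = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def classification_loss(labels, probs):
    """L_cls: mean BCE over the minibatch."""
    return sum(bce(y, p) for y, p in zip(labels, probs)) / len(labels)


def total_loss(l_seg, l_cls, w_seg=1.0, w_cls=1.0):
    """L_total = w_seg * L_seg + w_cls * L_cls (weights are placeholders)."""
    return w_seg * l_seg + w_cls * l_cls
```

Note that `L_cls` averages over all `B` samples, whereas `L_seg` averages only over samples with a mask, so the two terms are normalized by different counts.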
### 6.4 Optimization behavior
- GradientTape-style explicit optimization loop for multi-task fine-tuning
- Gradient clipping by global norm for stability
- Early stopping using validation objective
- Best-checkpoint restore before final export
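Two of the behaviors listed above, global-norm clipping and early stopping, can be sketched framework-free as follows. The clip norm and patience are hypothetical parameters, and the sketch assumes a validation objective where lower is better.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """grads: list of flat lists of floats. Rescale all gradients jointly
    if their combined L2 norm exceeds clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, clip_norm / global_norm) if global_norm > 0.0 else 1.0
    return [[g * scale for g in grad] for grad in grads], global_norm


class EarlyStopping:
    """Track the best validation objective (lower is better, by assumption)."""

    def __init__(self, patience):
        self.patience = patience
        self.best_value = math.inf
        self.best_epoch = None
        self.bad_epochs = 0

    def update(self, epoch, val_objective):
        """Record one epoch; return True when training should stop."""
        if val_objective < self.best_value:
            self.best_value = val_objective
            self.best_epoch = epoch  # checkpoint to restore before export
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Clipping by the global norm preserves the direction of the combined update, which matters when two task losses feed gradients into shared layers.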
## 7) Threshold Selection and Operating Point
After model training, the bifurcation threshold `t` is selected on validation data by grid search over candidate thresholds.
For each `t`:
```text
y_hat_i^(t) = 1[y_hat_i >= t]
```
Compute precision, recall, F1, and accuracy at each threshold, then choose:
```text
t* = argmax_t F1_val(t)
```
The selected threshold is persisted and reused during runtime inference.
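The selection rule can be sketched in a few lines of plain Python. The candidate grid and the tie-breaking behavior are assumptions, since the memo does not specify either.

```python
def f1_at_threshold(labels, probs, t):
    """F1 of the thresholded predictions 1[p >= t]."""
    preds = [1 if p >= t else 0 for p in probs]
    tp = sum(1 for y, q in zip(labels, preds) if y == 1 and q == 1)
    fp = sum(1 for y, q in zip(labels, preds) if y == 0 and q == 1)
    fn = sum(1 for y, q in zip(labels, preds) if y == 1 and q == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def select_threshold(labels, probs, candidates):
    """Return the candidate with the highest validation F1.
    max() keeps the first maximum, so ties go to the earliest candidate."""
    return max(candidates, key=lambda t: f1_at_threshold(labels, probs, t))


# Toy validation data, for illustration only.
val_labels = [1, 1, 1, 0, 0]
val_probs = [0.9, 0.8, 0.45, 0.3, 0.1]
grid = [i / 10 for i in range(1, 10)]
t_star = select_threshold(val_labels, val_probs, grid)
```

With only 90 validation frames, the F1 curve is coarse, so neighboring thresholds can score identically; a documented tie-breaking rule keeps the persisted threshold reproducible.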
## 8) Training Dynamics
### 8.1 Multi-task fine-tuning dynamics
![Multi-task training dynamics](./memo_assets/multitask_training_dynamics.png)
Observed behavior:
- Validation classification AUC stabilizes high relatively early.
- Validation F1 is more threshold-sensitive and fluctuates more.
- Segmentation metrics remain strong but vary with sparse segmentation supervision.
### 8.2 Lumen-only fine-tuning dynamics
![Lumen fine-tune dynamics](./memo_assets/lumen_finetune_dynamics.png)
## 9) Test Performance Summary
### 9.1 Multi-task test metrics
Segmentation (subset with lumen labels):
- IoU: 0.856
- Dice: 0.923
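For reference, both segmentation metrics can be computed from a pair of binary masks as sketched below. The convention for two empty masks (returning 1.0) is an assumption, and the function name is hypothetical.

```python
import numpy as np


def iou_and_dice(mask_true, mask_pred):
    """IoU and Dice for one pair of boolean masks."""
    inter = np.logical_and(mask_true, mask_pred).sum()
    union = np.logical_or(mask_true, mask_pred).sum()
    total = mask_true.sum() + mask_pred.sum()
    iou = inter / union if union else 1.0    # both masks empty: perfect match
    dice = 2 * inter / total if total else 1.0
    return float(iou), float(dice)
```

For a single mask pair the two metrics are linked by Dice = 2·IoU / (1 + IoU), which maps an IoU of 0.856 to about 0.922. The small gap to the reported 0.923 is what one would expect if the metrics are averaged per frame, although the memo does not state the averaging scheme.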
Bifurcation classification:
- Accuracy: 0.900
- Precision: 0.891
- Recall: 0.966
- F1: 0.927
- AUC: 0.961
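As a consistency check, the threshold-dependent metrics can be recomputed from confusion counts. The counts below are hypothetical: they are one set of integers that is consistent with the reported figures, assuming 59 positives among the 90 test frames (65.6%), and they were not read from the source. The confusion matrix figure remains the authoritative record. AUC is threshold-free and cannot be recovered from these counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts reconstructed to match the reported metrics.
metrics = classification_metrics(tp=57, fp=7, fn=2, tn=24)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these counts the operating point favors recall: 2 missed bifurcations against 7 false alarms.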
Confusion matrix:
![Multitask confusion matrix](./memo_assets/multitask_test_confusion_matrix.png)
Metric snapshot:
![Multitask metric snapshot](./memo_assets/multitask_test_metric_snapshot.png)
### 9.2 Segmentation regime comparison
![Segmentation comparison](./memo_assets/segmentation_regime_comparison.png)
Note: the compared evaluations do not use identical sample sets, so the comparison indicates direction only and is not a controlled benchmark.
## 10) Threshold and Calibration Diagnostics
Standalone classifier diagnostics (supporting analysis):
![Threshold sweep](./memo_assets/standalone_threshold_sweep.png)
![Probability histogram](./memo_assets/standalone_probability_hist.png)
![Reliability diagram](./memo_assets/standalone_reliability_diagram.png)
![Precision-recall curve with operating point](./memo_assets/precision_recall_curve_with_operating_point.png)
These plots illustrate threshold sensitivity, score separation, and calibration quality.
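The data behind a reliability diagram can be sketched as below, together with a common scalar summary, the expected calibration error (ECE). The memo does not report ECE; it is included here only as an illustration, and the choice of 10 equal-width bins is an assumption.

```python
def reliability_bins(labels, probs, n_bins=10):
    """Group predictions into equal-width probability bins and return
    (mean confidence, observed positive rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    rows = []
    for members in bins:
        if members:
            conf = sum(p for _, p in members) / len(members)
            freq = sum(y for y, _ in members) / len(members)
            rows.append((conf, freq, len(members)))
    return rows


def expected_calibration_error(labels, probs, n_bins=10):
    """Count-weighted mean gap between confidence and observed frequency."""
    n = len(labels)
    rows = reliability_bins(labels, probs, n_bins)
    return sum(cnt / n * abs(conf - freq) for conf, freq, cnt in rows)
```

A well-calibrated classifier has bin frequencies close to bin confidences, so the diagram hugs the diagonal and the ECE is near zero.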
## 11) Limitations
### 11.1 Split caveat: source overlap
Train/validation/test share source pullback files (frame-level partitioning rather than source-level partitioning).
Because the model is frame-independent, this is not temporal leakage. However, frames from the same pullback share acquisition style and image statistics, so in-domain metrics are likely optimistic.
![Split source overlap](./memo_assets/split_source_overlap_heatmap.png)
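A source-level partition, in which every pullback lands in exactly one split, could be sketched as follows. This is not the protocol used for the results in this report; the function name, the 70/15/15 fractions applied to sources, and the frame-to-source mapping format are all assumptions.

```python
import random
from collections import defaultdict


def split_by_source(frame_sources, fractions=(0.7, 0.15, 0.15), seed=0):
    """frame_sources: dict of frame_id -> source pullback id.
    Returns dict of frame_id -> split name, with no source shared across splits."""
    by_source = defaultdict(list)
    for frame_id, source in frame_sources.items():
        by_source[source].append(frame_id)
    sources = sorted(by_source)
    random.Random(seed).shuffle(sources)  # deterministic for a given seed
    n_train = int(round(fractions[0] * len(sources)))
    n_val = int(round(fractions[1] * len(sources)))
    assignment = {}
    for i, source in enumerate(sources):
        # Sources beyond train and validation fall through to test.
        if i < n_train:
            split = "train"
        elif i < n_train + n_val:
            split = "validation"
        else:
            split = "test"
        for frame_id in by_source[source]:
            assignment[frame_id] = split
    return assignment
```

Because pullbacks differ in length, the resulting frame counts only approximate the requested fractions; that trade-off is the price of removing source overlap.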
### 11.2 Uneven supervision density
Only about half of the samples carry segmentation labels (47.4% to 53.3%, depending on the split). This creates an imbalance between classification and segmentation supervision in multi-task training.
### 11.3 Domain shift across source groups
Performance can vary substantially by source group.
![Group-wise standalone metrics](./memo_assets/standalone_group_metrics.png)
This indicates a need for stronger cross-source robustness analysis.
### 11.4 Head capacity tradeoff
The current multi-task head is intentionally lightweight. This helps stability and runtime cost, but may under-capture fine spatial context around bifurcation patterns.
## 12) Practical Conclusions
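1. The current multi-task approach is effective on the in-domain test split (lumen Dice 0.923, bifurcation F1 0.927) and operationally coherent: one model serves both tasks with a single persisted threshold.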
2. Validation-driven thresholding is critical and should remain part of deployment.
3. The largest methodological caveat is source-overlap evaluation, not temporal modeling leakage.
4. The next major quality gain will likely come from stricter source-level split protocols and robustness-focused evaluation.
## 13) Reproducibility Note
This report is intended to be self-contained. Supporting figures are stored under `docs/memo_assets/`.
PDF export command:
```bash
scripts/analysis/export_memo_pdf.sh
```