---
library_name: keras
base_model: google/tipsv2-l14
pipeline_tag: image-classification
gated: true
extra_gated_prompt: "Request access for research or deployment evaluation. Please share a short justification for why you need the DermoLens model."
extra_gated_fields:
  Affiliation: text
  Intended use: text
  Research use only: checkbox
tags:
  - dermatology
  - medical-imaging
  - multiple-instance-learning
  - tensorflow
  - pytorch
  - tipsv2
  - binary-classification
  - infectious-screening
license: other
---
# DermoLens TIPSv2 + MIL Infectious Screening Deployment Package

This folder packages the latest Training-C production candidate for Hugging Face or container-based deployment.

The realistic deployment model is a **two-tier pipeline**:

1. Raw dermatology images are passed through `google/tipsv2-l14`.
2. The resulting TIPSv2 CLS embeddings are passed into the DermoLens MIL classifier.
3. The MIL probability is converted to a final class using the production threshold.
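End to end, the two tiers compose in a few lines. The following is a minimal sketch, not the packaged entry point: `encode_case_image` and `build_casebag` are placeholder names for the exact logic specified in the "Encoding Contract From Training-C" and "Exact Casebag Behavior" sections below, and `src/inference.py` remains the canonical implementation.

```python
# Illustrative composition of the two tiers; helper names are placeholders
# for the exact encoding/casebag rules documented later in this README.
embeddings = [encode_case_image(tipsv2_model, p) for p in case_image_paths]  # tier 1: TIPSv2 CLS
casebag = build_casebag(embeddings)                             # zero-padded to (3, 1024)
p_infectious = float(mil_model.predict(casebag[None, ...])[0])  # tier 2: MIL head
prediction = "Infectious" if p_infectious >= 0.35 else "Non Infectious"
```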
## Caution

This package is TIPSv2-only. Do not use Derm Foundation embeddings, Derm Foundation `.npz` archives, or 6144-d feature files.

## Access Requests

This repository is intended to be published as a gated Hugging Face model card.

By default, Hugging Face already collects the requester email and username for gated models. The extra fields above add:

- a short free-text justification
- intended use
- a research-only acknowledgment checkbox

If the repository remains private, the request form will not be visible. To use the request workflow, the model should be published as a public gated repo.

## What Is Included

```text
deploy-hf/
  README.md
  README.production-bundle.md
  Dockerfile
  requirements.txt
  deployment_config.json
  model/
    binary_tipsv2_screening_model.keras
  metadata/
    thresholds.json
    production_config.json
    production_validation_metrics.json
    production_training_history.csv
    production_validation_predictions.npz
    revised_binary_label_summary.json
    best_hyperparameters.json
  figures/
    production_learning_curves.png
  src/
    inference.py
    tipsv2_common_training_reference.py
  scripts/
    download_tipsv2.py
  tipsv2-local-reference/
    configuration and remote-code reference files from the local TIPSv2 checkout
```

## Model Formats

There are two model components:

| Component | Model | Framework / Format | Role |
| :--- | :--- | :--- | :--- |
| Feature extractor | `google/tipsv2-l14` | PyTorch via Hugging Face Transformers remote code, `safetensors` weights | Converts raw images to 1024-d CLS embeddings |
| MIL classifier | `binary_tipsv2_screening_model.keras` | TensorFlow / Keras `.keras` | Converts a `(3, 1024)` casebag to `P(Infectious)` |

The current system is therefore mixed-framework: **PyTorch for TIPSv2** and **TensorFlow/Keras for the MIL head**.
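A minimal loading sketch for both components follows. It assumes `google/tipsv2-l14` loads through `AutoModel` with `trust_remote_code=True` (consistent with the remote-code format above) and that the MIL head loads with Keras 3's `keras.saving.load_model`; adjust paths to your checkout.

```python
# Hedged loading sketch: one PyTorch component, one Keras component.
import keras
from transformers import AutoModel

# Feature extractor: PyTorch, Hugging Face remote code, safetensors weights.
tipsv2_model = AutoModel.from_pretrained("google/tipsv2-l14", trust_remote_code=True)
tipsv2_model.eval()

# MIL classifier: TensorFlow/Keras .keras file shipped in this package.
mil_model = keras.saving.load_model("model/binary_tipsv2_screening_model.keras")
```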
## Comprehensive Benchmarks

The full benchmark ledger is also copied into this package as [`FINAL_BENCHMARKS.md`](FINAL_BENCHMARKS.md). The same content is mirrored below so the Hugging Face model card is self-contained.

### Binary Screening

*Evaluation performed on the full 3,061-case validation set.*
*Operating threshold: `P(Infectious) >= 0.35`.*

| Model Architecture / Format | Model Size | AUC | Accuracy | Precision | Recall (Sensitivity) | F1 Score | Notes |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **Original Keras (Training-C)** | 1.15 GB+ | 0.755784 | 0.661875 | 0.544653 | 0.780960 | 0.641745 | The original fragmented FP32 pipeline. |
| **PyTorch Unified (FP32)** | 1,865 MB | 0.755784 | 0.661875 | 0.544653 | 0.780960 | 0.641745 | The final production monolith. Mathematically identical to Keras. |
| **PyTorch Unified (FP16)** | 932 MB | 0.755789 | 0.661875 | 0.544653 | 0.780960 | 0.641745 | Halves RAM usage with essentially no accuracy loss. |
| **LiteRT Edge (FP32)** | 1,163 MB | 0.755784 | 0.661875 | 0.544653 | 0.780960 | 0.641745 | Mathematically identical to PyTorch FP32. |
| **LiteRT Edge (INT8 PTQ)** | 297 MB | 0.736973 | 0.669716 | 0.561798 | 0.673968 | 0.612792 | The quantization tradeoff: lower sensitivity. |

### 10-Disease Classification

*Evaluation performed on the preliminary 2,336-case dataset.*
*Representative class-level agreement is shown below; equivalence holds across all 10 classes.*

#### Class 0 (Eczema) - Threshold: 0.4747

| Model Architecture / Format | AUC | Accuracy | Precision | Recall | F1 Score |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **Original Keras (Training-A)** | 0.739529 | 0.656678 | 0.598756 | 0.729167 | 0.657558 |
| **PyTorch Unified (FP32)** | 0.739529 | 0.656678 | 0.598756 | 0.729167 | 0.657558 |
| **PyTorch Unified (FP16)** | 0.739538 | 0.656678 | 0.598756 | 0.729167 | 0.657558 |

#### Class 1 (Allergic Contact Dermatitis) - Threshold: 0.3838

| Model Architecture / Format | AUC | Accuracy | Precision | Recall | F1 Score |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **Original Keras (Training-A)** | 0.739767 | 0.684932 | 0.572334 | 0.620848 | 0.595604 |
| **PyTorch Unified (FP32)** | 0.739767 | 0.684932 | 0.572334 | 0.620848 | 0.595604 |
| **PyTorch Unified (FP16)** | 0.739774 | 0.685360 | 0.572785 | 0.621993 | 0.596376 |

## Technical Conclusions

1. **Mathematical Equivalence:** The manual port of the gated attention pooling and global average pooling layers from Keras to PyTorch is numerically aligned across the supported benchmark runs.
2. **The Power of FP16:** Converting the PyTorch unified engine to FP16 roughly halves the Docker container memory footprint while preserving the clinical ROC-AUC and sensitivity of the FP32 runs.
3. **LiteRT Limitations:** The LiteRT FP32 export is mathematically sound, but FP16 conversion of large vision transformers can fail in the Google AI Edge toolchain. INT8 PTQ succeeds but reduces clinical sensitivity.

**Final Deployment Target:** `unified_engine_fp16_weights.pt` running on CPU via FastAPI.
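The serving shape this implies is small. Below is a hedged FastAPI sketch: `predict_case` is a hypothetical stand-in for the unified-engine pipeline (TIPSv2 encode, casebag padding, MIL head) and is not the packaged `src/inference.py` API.

```python
# Hedged FastAPI sketch for CPU serving; predict_case is a hypothetical
# placeholder to be wired to the real unified-engine inference code.
from fastapi import FastAPI, UploadFile

app = FastAPI()
THRESHOLD = 0.35  # production threshold, see the next section

def predict_case(image_bytes: list[bytes]) -> float:
    """Placeholder: run TIPSv2 encoding + MIL head, return P(Infectious)."""
    raise NotImplementedError

@app.post("/predict")
async def predict(files: list[UploadFile]):
    images = [await f.read() for f in files[:3]]  # at most 3 images per case
    p = predict_case(images)
    return {
        "p_infectious": p,
        "threshold": THRESHOLD,
        "prediction": "Infectious" if p >= THRESHOLD else "Non Infectious",
    }
```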
## Production Decision Rule

The MIL model outputs one probability:

```text
P(Infectious)
```

The production threshold is:

```text
0.35
```

Final classification:

```python
if p_infectious >= 0.35:
    prediction = "Infectious"
else:
    prediction = "Non Infectious"
```

Do not silently use `0.5` for production inference.
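To avoid hardcoding, the threshold can also be read from the packaged metadata. This is a sketch under an assumption: the key name inside `metadata/thresholds.json` (`"binary_screening"` here) is hypothetical; check the actual file before relying on it.

```python
# Hedged sketch: load the production threshold from metadata instead of
# hardcoding it. The "binary_screening" key name is a hypothetical guess.
import json

with open("metadata/thresholds.json") as f:
    thresholds = json.load(f)

threshold = thresholds.get("binary_screening", 0.35)  # fall back to the documented 0.35
prediction = "Infectious" if p_infectious >= threshold else "Non Infectious"
```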
## Input Contract

Input to the full deployment pipeline:

```text
1 to 3 RGB images from the same patient case
```

Input to the MIL classifier after TIPSv2:

```text
casebag.shape == (3, 1024)
```

Rules:

- Each submitted image is converted to RGB.
- Each image is resized to `448 x 448`, matching the Training-C extraction process.
- Each image is passed through `google/tipsv2-l14` using `model.encode_image(pixel_values)`.
- Each row is the final-layer TIPSv2 CLS token: `out.cls_token[0, 0]`.
- Each CLS embedding must be 1024-d.
- Cases with fewer than 3 images are automatically zero-padded to 3 MIL slots.
- Do not mix images from different patient cases.
- Do not flatten this into image-level classification unless explicitly doing a different experiment.

## Exact Casebag Behavior

The MIL model always receives exactly 3 slots:

```text
(3, 1024)
```

If 1 image is submitted:

```text
slot 1 = TIPSv2(image_1)
slot 2 = zeros(1024)
slot 3 = zeros(1024)
```

If 2 images are submitted:

```text
slot 1 = TIPSv2(image_1)
slot 2 = TIPSv2(image_2)
slot 3 = zeros(1024)
```

If 3 images are submitted:

```text
slot 1 = TIPSv2(image_1)
slot 2 = TIPSv2(image_2)
slot 3 = TIPSv2(image_3)
```
Padding is handled automatically by `src/inference.py`, and the submitted image order is preserved. If a case has more than 3 images, do not pass them all blindly; select or split intentionally, because this model was trained with at most 3 images per case. A minimal sketch of the padding logic follows.
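This sketch assumes NumPy inputs; `src/inference.py` remains the canonical version:

```python
import numpy as np

def build_casebag(embeddings: list[np.ndarray]) -> np.ndarray:
    """Pad 1-3 TIPSv2 CLS embeddings to the fixed (3, 1024) MIL input.

    Order is preserved; empty slots stay zero vectors, matching the
    slot tables above.
    """
    if not 1 <= len(embeddings) <= 3:
        raise ValueError("a casebag takes 1 to 3 images from one patient case")
    casebag = np.zeros((3, 1024), dtype=np.float32)
    for i, emb in enumerate(embeddings):
        if emb.shape != (1024,):
            raise ValueError("each TIPSv2 CLS embedding must be 1024-d")
        casebag[i] = emb
    return casebag
```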
## Encoding Contract From Training-C

The deployment encoder must match Training-C:

```python
from PIL import Image
from torchvision.transforms import Resize, ToTensor

image = Image.open(image_path).convert("RGB")
pixel_values = Resize((448, 448))(image)              # PIL resize to 448 x 448
pixel_values = ToTensor()(pixel_values).unsqueeze(0)  # (1, 3, 448, 448) float tensor
out = tipsv2_model.encode_image(pixel_values)
embedding = out.cls_token[0, 0].float().cpu().numpy()  # 1024-d final-layer CLS token
```

This is the same logic used in `APR26/data_extraction/extract_all_cases_tipsv2.py`.

Do not use:

- patch-token averages,
- register tokens,
- normalized text/image similarity vectors,
- Derm Foundation embeddings,
- image-level logits from another model.
## Production Metrics

Training-C production validation metrics at threshold `0.35`:

| Metric | Value |
| :--- | ---: |
| ROC AUC | 0.7194 |
| PR-AUC | 0.5868 |
| Sensitivity / Recall | 0.7697 |
| Specificity | 0.5851 |
| Accuracy | 0.6565 |
| Precision | 0.5394 |
| F1 | 0.6343 |
| Youden J | 0.3548 |

Dataset state:

| Item | Value |
| :--- | ---: |
| Cases | 3,061 |
| Images / embeddings | 6,517 |
| Infectious cases | 1,187 |
| Non-infectious cases | 1,874 |
| Feature dimension | 1024 |
## Running Inference Locally

From this folder:

```bash
pip install -r requirements.txt
python src/inference.py case_image_1.png case_image_2.png
```

The script accepts 1 to 3 images from the same patient case.

If TIPSv2 is already cached locally:

```bash
python src/inference.py case_image_1.png --local-files-only
```

If using a vendored/local TIPSv2 folder:

```bash
python src/inference.py case_image_1.png --tipsv2-model /path/to/google/tipsv2-l14/snapshot
```

Output example:

```json
{
  "prediction": "Infectious",
  "p_infectious": 0.47,
  "threshold": 0.35,
  "image_count": 2,
  "rule": "Infectious if P(Infectious) >= threshold else Non Infectious"
}
```

## Docker Usage

Build:

```bash
docker build -t dermolens-tipsv2-mil .
```

Run:

```bash
docker run --rm -v "$PWD/examples:/data" dermolens-tipsv2-mil /data/case_image_1.png /data/case_image_2.png
```

The default Dockerfile does not bake the 1.8 GB TIPSv2 weights into the image. This keeps the image smaller and lets the runtime download or mount the Hugging Face cache.

For a self-contained container, uncomment this line in the Dockerfile:

```dockerfile
# RUN python scripts/download_tipsv2.py
```

That will pre-cache `google/tipsv2-l14` inside the image.
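Functionally, pre-caching amounts to pulling the TIPSv2 snapshot into the Hugging Face cache at build time. The packaged `scripts/download_tipsv2.py` is the canonical version; a hedged equivalent using `huggingface_hub`:

```python
# Hedged equivalent of the pre-cache step; the packaged script is canonical.
from huggingface_hub import snapshot_download

# Download the full google/tipsv2-l14 snapshot into the HF cache.
snapshot_download(repo_id="google/tipsv2-l14")
```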
## Hugging Face Push Strategy

Recommended setup:

1. Push this `deploy-hf/` folder as the DermoLens model repository.
2. Reference `google/tipsv2-l14` as the upstream feature extractor instead of duplicating the full TIPSv2 weights.
3. Include `src/inference.py` as the canonical end-to-end raw-image inference code.
4. Put the raw image dataset in a separate Hugging Face dataset repository.
5. Keep case-level metadata in the dataset repository so MIL grouping is preserved.

This is better than copying TIPSv2 weights into our repo because:

- TIPSv2 is already a Hugging Face model with its own versioning.
- The real weights are about 1.8 GB.
- Duplicating them creates storage, sync, and licensing ambiguity.
- The deployment container can still be fully self-contained by pre-caching TIPSv2 at Docker build time.
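For step 1, a minimal push sketch with `huggingface_hub`, assuming you are authenticated (`huggingface-cli login`) and the gated model repo already exists:

```python
# Hedged sketch of pushing the deploy-hf/ folder as the model repo.
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="deploy-hf",
    repo_id="HawkFranklin-Research/PelliScope",  # target model repo
    repo_type="model",
)
```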
## Should We Convert Models?

Current recommendation: **do not convert yet**.

Reasons:

- TIPSv2 uses PyTorch / Hugging Face remote code.
- The MIL head is small and already saved as TensorFlow/Keras.
- Mixed-framework inference is acceptable inside Docker.
- Conversion adds risk unless we have a specific deployment target that requires ONNX/TFLite/TensorRT.

Future conversion options:

| Option | When useful | Risk |
| :--- | :--- | :--- |
| Convert MIL Keras model to ONNX | If we want one ONNX runtime for the MIL head | Low to moderate |
| Convert TIPSv2 to ONNX | If deploying to a strict ONNX/TensorRT environment | Higher, because custom remote code and image encoder outputs must be validated |
| Retrain/rebuild MIL head in PyTorch | If we want a single PyTorch-only pipeline | Moderate; requires reproducing Training-C weights or retraining |
| Keep mixed PyTorch + TensorFlow | Best current path for Hugging Face/Cloud Run/GCE | Larger dependency footprint |

For Hugging Face, GCloud, Firebase-backed services, or generic Docker deployment, the current mixed-framework package is the pragmatic choice.
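If the low-risk row above is ever pursued, a hedged starting point is `tf2onnx`. Assumptions: the `.keras` file loads as a `tf.keras` model in your environment, and `tf2onnx` supports all of its layers (the gated attention pooling in particular should be validated output-for-output against Keras before use).

```python
# Hedged sketch: export the MIL head to ONNX, then validate against Keras.
import tensorflow as tf
import tf2onnx

mil_model = tf.keras.models.load_model("model/binary_tipsv2_screening_model.keras")
spec = (tf.TensorSpec((None, 3, 1024), tf.float32, name="casebag"),)
onnx_model, _ = tf2onnx.convert.from_keras(
    mil_model, input_signature=spec, output_path="mil_head.onnx"
)
```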
## Deployment Interpretation

This is a research production candidate, not a standalone clinical diagnostic device. It is suitable for controlled research inference, screening-threshold experiments, and deployment engineering validation.