---
license: apache-2.0
metrics:
- accuracy
pipeline_tag: image-feature-extraction
tags:
- 3D medical CLIP
- Image-text retrieval
---

M3D-CLIP is a 3D medical CLIP model that aligns vision and language with a contrastive loss on the [M3D-Cap](https://huggingface.co/datasets/GoodBaiBai88/M3D-Cap) dataset.
The vision encoder is a 3D ViT with a 32×256×256 input size and a 4×16×16 patch size.
The text encoder is initialized from a pre-trained BERT.
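
As a quick sanity check on those dimensions, non-overlapping 4×16×16 patches tile a 32×256×256 volume into 2048 tokens (any extra class token is an assumption about a standard ViT setup, not stated in this card):

```python
# Token count for the 3D ViT, assuming standard non-overlapping patching.
D, H, W = 32, 256, 256        # input volume
pd, ph, pw = 4, 16, 16        # patch size
print((D // pd) * (H // ph) * (W // pw))  # 2048 patch tokens
```
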
# Quickstart

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

device = torch.device("cuda")  # or torch.device("cpu")

tokenizer = AutoTokenizer.from_pretrained(
    "GoodBaiBai88/M3D-CLIP",
    model_max_length=512,
    padding_side="right",
    use_fast=False,
)
model = AutoModel.from_pretrained(
    "GoodBaiBai88/M3D-CLIP",
    trust_remote_code=True,
)
model = model.to(device=device)

# Prepare your 3D medical image:
# 1. Process the volume to shape 1*32*256*256, e.g. by resizing.
# 2. Normalize intensities to [0, 1], e.g. with min-max normalization.
# 3. Save the array in .npy format.
# 4. Although the model was not trained on 2D images, in theory a 2D image
#    can be interpolated to 1*32*256*256 and used as input.

image_path = ""  # path to your preprocessed .npy volume
input_txt = ""   # text to pair with the image

text_tensor = tokenizer(input_txt, return_tensors="pt")
input_id = text_tensor["input_ids"].to(device=device)
attention_mask = text_tensor["attention_mask"].to(device=device)
# np.load returns a NumPy array, which has no .to() method; convert it to a
# float tensor and add a batch dimension so the model sees a 1*1*32*256*256 input.
image = torch.from_numpy(np.load(image_path)).unsqueeze(0).float().to(device=device)

with torch.inference_mode():
    image_features = model.encode_image(image)[:, 0]
    text_features = model.encode_text(input_id, attention_mask)[:, 0]
```
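
The preprocessing described in the comments above might look like the following sketch. The helper name, the use of scipy's `zoom` for resizing, and the channel-dimension handling are illustrative assumptions, not part of the official repo:

```python
import numpy as np
from scipy.ndimage import zoom  # assumption: scipy used for resizing

def prepare_volume(volume: np.ndarray) -> np.ndarray:
    """Hypothetical helper: resize a (D, H, W) volume to 32*256*256 and
    min-max normalize it to [0, 1]."""
    target = (32, 256, 256)
    factors = [t / s for t, s in zip(target, volume.shape)]
    volume = zoom(volume, factors, order=1)          # linear interpolation
    vmin, vmax = volume.min(), volume.max()
    volume = (volume - vmin) / (vmax - vmin + 1e-8)  # min-max normalization
    return volume[np.newaxis].astype(np.float32)     # add channel dim -> 1*32*256*256

# np.save("image.npy", prepare_volume(raw_volume))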
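
Because the model is trained with a contrastive loss, image-text retrieval reduces to comparing the two pooled features from the quickstart. A minimal sketch, assuming standard CLIP practice of cosine similarity on L2-normalized embeddings (the normalization step is our assumption, not stated in this card):

```python
import torch.nn.functional as F

# Cosine similarity between the pooled embeddings computed above.
img = F.normalize(image_features, dim=-1)
txt = F.normalize(text_features, dim=-1)
score = (img @ txt.T).item()  # one image vs. one text; higher = better match
print(f"image-text similarity: {score:.4f}")
```

For retrieval over many candidates, the same matrix product ranks a batch of text embeddings against one image (or vice versa).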