FlexiCT-3D-VLM

FlexiCT-3D-VLM aligns the FlexiCT 3D vision encoder with a Qwen3 embedding text tower.

Input and preprocessing

Image preprocessing matches FlexiCT-3D. By default the processor outputs a tensor of shape [B, 1, 160, 160, 160]: batch size B, one channel, and a 160 x 160 x 160 voxel volume.
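To give a feel for what the default shape means in memory terms, here is a small arithmetic sketch. It only multiplies out the documented shape. The helper names (`batch_shape`, `batch_bytes`, `batch_mib`) are hypothetical and are not part of the model or processor API.

```python
# Rough memory footprint of one preprocessed batch of shape [B, 1, 160, 160, 160].
# Plain arithmetic on the documented shape; no model or processor call is made.

BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2}


def batch_shape(batch_size, side=160, channels=1):
    """Return the documented default input shape for a given batch size."""
    return (batch_size, channels, side, side, side)


def batch_bytes(batch_size, dtype="float32"):
    """Number of bytes needed to hold one batch in the given dtype."""
    n_elements = 1
    for dim in batch_shape(batch_size):
        n_elements *= dim
    return n_elements * BYTES_PER_ELEMENT[dtype]


def batch_mib(batch_size, dtype="float32"):
    """Same as batch_bytes, expressed in MiB (2**20 bytes)."""
    return batch_bytes(batch_size, dtype) / 2**20


if __name__ == "__main__":
    for b in (1, 4):
        print(b, batch_shape(b), f"{batch_mib(b):.1f} MiB in float32")
```

This covers the input tensor only. Activations inside the encoder will take considerably more memory than the input itself.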

Text is tokenized with the Qwen3 tokenizer, using left padding and a maximum length of 8192 tokens.
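As an illustration of what left padding with a length cap does to a batch, here is a minimal pure-Python sketch. It is not the Qwen3 tokenizer: the token ids are made up, `pad_id=0` is a placeholder, and truncating from the right is an assumption. In practice the processor handles all of this.

```python
def left_pad(sequences, pad_id=0, max_length=8192):
    """Truncate each sequence to max_length, then left-pad to the longest one.

    Returns (input_ids, attention_mask) as lists of equal-length rows.
    """
    # Assumption: over-long sequences keep their first max_length tokens.
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)

    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = width - len(seq)
        # Padding goes in front, so real tokens are right-aligned.
        input_ids.append([pad_id] * n_pad + seq)
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


ids, mask = left_pad([[11, 12, 13], [21]])
print(ids)   # [[11, 12, 13], [0, 0, 21]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

With left padding, the last position of every row holds a real token. Embedding models that pool the final token often rely on this. Whether this model pools that way is not stated here.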

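import torch
from transformers import AutoModel, AutoProcessor

# trust_remote_code=True runs the custom model code shipped in the repo.
processor = AutoProcessor.from_pretrained("ricklisz123/FlexiCT-3D-VLM", trust_remote_code=True)
model = AutoModel.from_pretrained("ricklisz123/FlexiCT-3D-VLM", trust_remote_code=True)

# One CT volume scored against two candidate text prompts.
inputs = processor(
    images="/path/to/ct.nii.gz",
    text=["pulmonary nodule", "no acute abnormality"],
    return_tensors="pt",
)
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)
similarity = outputs.logits_per_image  # image-text similarity scores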

Outputs

image_embeds and text_embeds are L2-normalized embeddings. logits_per_image contains image-text similarity scores scaled by a learned temperature.
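The relationship between the normalized embeddings and the scaled scores can be sketched in plain Python. This is an illustration, not the model's actual code. It assumes a CLIP-style scheme in which a learned scalar is exponentiated and multiplied into the cosine similarity. The function names and the toy two-dimensional vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def scaled_similarity(image_embeds, text_embeds, logit_scale):
    """Cosine similarity times a temperature factor.

    Rows index images and columns index texts.
    Assumption: the factor is exp(logit_scale), as in CLIP-style models.
    """
    scale = math.exp(logit_scale)
    return [
        [scale * sum(a * b for a, b in zip(img, txt)) for txt in text_embeds]
        for img in image_embeds
    ]


def softmax(row):
    """Turn one row of scores into probabilities over the text prompts."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: one image, two texts (the first aligned, the second orthogonal).
image = [l2_normalize([3.0, 4.0])]
texts = [l2_normalize([3.0, 4.0]), l2_normalize([4.0, -3.0])]

logits = scaled_similarity(image, texts, logit_scale=math.log(10.0))
probs = softmax(logits[0])
print(logits)
print(probs)
```

Because the embeddings are unit length, the dot product is a cosine similarity in [-1, 1]. The temperature factor stretches that range, so a softmax over the text prompts gives a sharper ranking.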

Limitations

This model is intended for research use in retrieval and image-text scoring. It is not a diagnostic device. Loading the text tower requires the Qwen3 tokenizer and config files, which must be downloaded unless they are already cached or included in the local repo.

Model size: 0.7B params (Safetensors). Tensor types: F32, BF16.