---
license: apache-2.0
datasets:
- jeffrey423/ToothXpert.MM-OPG-Annotations
language:
- en
tags:
- dental
- medical
- multimodal
- vision-language
- llava
- clip
- sam
- lora
- orthopantomography
- opg
- x-ray
- diagnosis
base_model: liuhaotian/llava-v1.5-7b
pipeline_tag: image-text-to-text
library_name: transformers
---
# ToothXpert Model
ToothXpert is a multimodal AI model for comprehensive analysis of dental panoramic X-rays (orthopantomograms, OPGs), combining vision and language understanding for automated diagnosis and condition detection.
## Quick Start
### Installation
```bash
pip install torch torchvision transformers
pip install opencv-python einops peft medpy
pip install "numpy<2.0" # Important for compatibility
```
### Download Model
```python
from huggingface_hub import snapshot_download
model_path = snapshot_download(
    repo_id='jeffrey423/ToothXpert',
    local_dir='./ToothXpert_pretrained'
)
```
### Simple Inference
```python
import cv2
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, CLIPImageProcessor
from model.ToothXpert_MOE import ToothXpertForCausalLMMOE
from model.llava import conversation as conversation_lib
from model.llava.mm_utils import tokenizer_image_token
from model.segment_anything.utils.transforms import ResizeLongestSide
from utils.utils import (DEFAULT_IM_END_TOKEN, DEFAULT_IM_START_TOKEN,
                         DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX)
# Preprocessing function
def preprocess(x,
               pixel_mean=torch.Tensor([123.675, 116.28, 103.53]).view(-1, 1, 1),
               pixel_std=torch.Tensor([58.395, 57.12, 57.375]).view(-1, 1, 1),
               img_size=1024):
    """Normalize with SAM pixel statistics and zero-pad to img_size x img_size."""
    x = (x - pixel_mean) / pixel_std
    h, w = x.shape[-2:]
    padh = img_size - h
    padw = img_size - w
    x = F.pad(x, (0, padw, 0, padh))
    return x
# Load model
model_path = "./ToothXpert_pretrained"
device = "cuda:0"
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    model_max_length=512,
    padding_side="right",
    use_fast=False,
)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.add_tokens("[SEG]")
seg_token_idx = tokenizer("[SEG]", add_special_tokens=False).input_ids[0]
tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
moe_lora_args = {
    "lora_r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "lora_target_modules": "q_proj,v_proj",
    "moe_lora": False,
    "expert_num": 3,
    "guide": True,
    "guide_mode": "smmulsm",
    "vocab_size": len(tokenizer),
}
model = ToothXpertForCausalLMMOE.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    vision_tower="openai/clip-vit-large-patch14",
    seg_token_idx=seg_token_idx,
    torch_dtype=torch.bfloat16,
    train_mask_decoder=True,
    out_dim=256,
    moe_lora_args=moe_lora_args,
)
model.config.eos_token_id = tokenizer.eos_token_id
model.config.bos_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.get_model().initialize_vision_modules(model.get_model().config)
vision_tower = model.get_model().get_vision_tower()
vision_tower.to(dtype=torch.bfloat16, device=device)
model = model.bfloat16().to(device)
model.eval()
# Load and process image
image_path = "your_dental_xray.png"
image_np = cv2.imread(image_path)
image_np = cv2.cvtColor(image_np, cv2.COLOR_BGR2RGB)
original_size_list = [image_np.shape[:2]]
clip_image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
transform = ResizeLongestSide(1024)
image_clip = (
    clip_image_processor.preprocess(image_np, return_tensors="pt")["pixel_values"][0]
    .unsqueeze(0).to(device).bfloat16()
)
image = transform.apply_image(image_np)
resize_list = [image.shape[:2]]
image = (
    preprocess(torch.from_numpy(image).permute(2, 0, 1).contiguous())
    .unsqueeze(0).to(device).bfloat16()
)
# Prepare prompt
question = "Can you describe the image for me?"
conv = conversation_lib.conv_templates["llava_v1"].copy()
conv.messages = []
prompt = DEFAULT_IMAGE_TOKEN + "\n" + question
prompt = prompt.replace(DEFAULT_IMAGE_TOKEN,
                        DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN)
conv.append_message(conv.roles[0], prompt)
conv.append_message(conv.roles[1], "")
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, return_tensors="pt")
input_ids = input_ids.unsqueeze(0).to(device)
# Run inference
with torch.no_grad():
    output_ids, pred_masks = model.evaluate(
        image_clip,
        image,
        input_ids,
        resize_list,
        original_size_list,
        max_new_tokens=512,
        tokenizer=tokenizer,
    )
output_ids = output_ids[0][output_ids[0] != IMAGE_TOKEN_INDEX]
text_output = tokenizer.decode(output_ids, skip_special_tokens=False)
text_output = text_output.split('ASSISTANT:')[-1].replace('</s>', '').strip()
print(f"Question: {question}")
print(f"Answer: {text_output}")
```
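### Visualizing Predicted Masks
`model.evaluate` also returns `pred_masks`. The snippet below is a minimal sketch that overlays them on the input image, assuming (as in LISA-style models) that `pred_masks` is a list with one float tensor per input image, shaped `[num_masks, H, W]`, where positive logits mark the predicted region; it reuses `pred_masks`, `image_np`, and `cv2` from the script above.
```python
import numpy as np

for i, mask_logits in enumerate(pred_masks):
    if mask_logits is None or mask_logits.numel() == 0:
        continue  # the answer contained no [SEG] token for this image
    for j in range(mask_logits.shape[0]):
        # Threshold logits at 0 to get a binary mask at the original resolution
        mask = mask_logits[j].detach().float().cpu().numpy() > 0
        overlay = image_np.copy()
        overlay[mask] = (0.5 * overlay[mask] + 0.5 * np.array([255, 0, 0])).astype(np.uint8)
        cv2.imwrite(f"mask_overlay_{i}_{j}.png", cv2.cvtColor(overlay, cv2.COLOR_RGB2BGR))
```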
## Example Questions
**General Description:**
- "Can you describe the image for me?"
**Specific Conditions:**
- "Is there any amalgam restorations in the image?"
- "Any R/L suggestive of caries present?"
- "Is there any dental implant present?"
- "Is there any root canal treated teeth?"
## Supported Conditions
ToothXpert can detect 11 dental conditions (a batch-screening sketch follows the list):
1. Amalgam restorations
2. Caries (R/L)
3. Crestal bone loss (mandible)
4. Crestal bone loss (maxilla)
5. Implant-supported bridge
6. Dental implant
7. Metallic/non-metallic post
8. Non-metallic restorations
9. Periapical radiolucency
10. Root canal treated teeth
11. Tooth-supported bridge
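To screen a single OPG for every condition above, you can loop over yes/no prompts. This sketch reuses the objects prepared in the Quick Start (`model`, `tokenizer`, `image_clip`, `image`, `resize_list`, `original_size_list`, `device`); the `ask` helper and the exact condition phrasings are illustrative assumptions, not an official API.
```python
def ask(question: str) -> str:
    """Hypothetical helper: rebuild the LLaVA prompt and rerun model.evaluate."""
    conv = conversation_lib.conv_templates["llava_v1"].copy()
    conv.messages = []
    prompt = (DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN +
              DEFAULT_IM_END_TOKEN + "\n" + question)
    conv.append_message(conv.roles[0], prompt)
    conv.append_message(conv.roles[1], "")
    ids = tokenizer_image_token(conv.get_prompt(), tokenizer,
                                return_tensors="pt").unsqueeze(0).to(device)
    with torch.no_grad():
        out_ids, _ = model.evaluate(image_clip, image, ids, resize_list,
                                    original_size_list, max_new_tokens=512,
                                    tokenizer=tokenizer)
    out_ids = out_ids[0][out_ids[0] != IMAGE_TOKEN_INDEX]
    text = tokenizer.decode(out_ids, skip_special_tokens=False)
    return text.split('ASSISTANT:')[-1].replace('</s>', '').strip()

conditions = [
    "amalgam restorations", "R/L suggestive of caries",
    "crestal bone loss in the mandible", "crestal bone loss in the maxilla",
    "an implant-supported bridge", "a dental implant",
    "a metallic or non-metallic post", "non-metallic restorations",
    "periapical radiolucency", "root canal treated teeth",
    "a tooth-supported bridge",
]
for condition in conditions:
    answer = ask(f"Is there any {condition} in the image?")
    print(f"{condition}: {answer}")
```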
## Requirements
- **GPU**: NVIDIA GPU with at least 16GB VRAM
- **Python**: 3.11 (recommended)
- **CUDA**: 12.1 or compatible
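A quick environment sanity check before downloading the ~15GB checkpoint (a sketch; the 16GB threshold is the requirement above):
```python
import torch

assert torch.cuda.is_available(), "ToothXpert inference requires a CUDA GPU"
props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024 ** 3
print(f"{props.name}: {vram_gb:.1f} GB VRAM, CUDA {torch.version.cuda}")
if vram_gb < 16:
    print("Warning: less than the recommended 16 GB of VRAM")
```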
## Model Details
- **Base Model**: LLaVA-1.5-7B
- **Vision Encoder**: CLIP ViT-L/14
- **Segmentation**: SAM (Segment Anything Model) ViT-H
- **Adaptation**: Guided Mixture of LoRA Experts (G-MoLE)
- **Model Size**: ~15GB
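This card does not spell out the G-MoLE internals, so the following is only a minimal, illustrative sketch of a gated mixture of LoRA experts wrapped around a frozen linear layer, using the `expert_num=3`, `lora_r=8`, `lora_alpha=16` settings from the Quick Start config. The class name, router, and the absence of a guidance signal are assumptions, not ToothXpert's actual implementation.
```python
import torch
import torch.nn as nn

class MoLELinear(nn.Module):
    """Illustrative gated mixture of LoRA experts over a frozen linear layer."""
    def __init__(self, in_features, out_features, expert_num=3, lora_r=8, lora_alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)  # frozen pretrained weight
        self.lora_A = nn.ModuleList(
            nn.Linear(in_features, lora_r, bias=False) for _ in range(expert_num))
        self.lora_B = nn.ModuleList(
            nn.Linear(lora_r, out_features, bias=False) for _ in range(expert_num))
        self.gate = nn.Linear(in_features, expert_num)  # router over experts
        self.scaling = lora_alpha / lora_r

    def forward(self, x):
        # Per-token softmax routing weights, shape (..., expert_num)
        weights = torch.softmax(self.gate(x), dim=-1)
        out = self.base(x)
        # Sum the low-rank expert updates, each weighted by its routing score
        for i, (A, B) in enumerate(zip(self.lora_A, self.lora_B)):
            out = out + weights[..., i:i + 1] * B(A(x)) * self.scaling
        return out

# e.g., standing in for q_proj/v_proj inside an attention block
layer = MoLELinear(4096, 4096)
y = layer(torch.randn(2, 16, 4096))  # (batch, seq, hidden)
```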
## Citation
If you use ToothXpert in your research, please cite:
```bibtex
@article{liu2026toothxpert,
  title={Developing and Evaluating Multimodal Large Language Model for Orthopantomography Analysis to Support Clinical Dentistry},
  author={Liu, Xinyu and Hung, Kuo Feng and Yu, Weihao and Ng, Ray Anthony W T and Li, Wuyang and Niu, Tianye and Chen, Hui and Yuan, Yixuan},
  journal={Cell Reports Medicine},
  year={2026}
}
```
## Links
- **GitHub Repository**: [CUHK-AIM-Group/ToothXpert](https://github.com/CUHK-AIM-Group/ToothXpert)
- **Dataset**: [jeffrey423/ToothXpert.MM-OPG-Annotations](https://huggingface.co/datasets/jeffrey423/ToothXpert.MM-OPG-Annotations)
## License
Apache License 2.0