---
license: mit
language:
- en
tags:
- medical
- radiology
- bone-tumor
- vision-language-model
- internvl
- fine-tuned
- classification
- report-generation
datasets:
- btxrd
pipeline_tag: image-text-to-text
base_model: OpenGVLab/InternVL3_5-8B
metrics:
- accuracy
- f1
- rouge
---
# BoneVision-8B — Bone Tumor X-ray Classifier
Fine-tuned vision-language model for automatic classification and structured
report generation from bone X-ray images. Adapted from InternVL3.5-8B on the
[BTXRD](https://www.nature.com/articles/s41597-024-04311-y) dataset using LoRA.
## Model description
This model takes a bone X-ray image and optional clinical metadata (patient
age, sex, anatomical location) and produces a structured radiology report
with two fields:
- **Diagnosis** — one of 7 bone tumor classes
- **Findings** — a narrative description of the radiographic findings that
support the diagnosis
### Architecture
| Component | Details |
|-----------|---------|
| Vision encoder | InternViT (ViT, 24 layers, hidden=1024, patch=14×14, img=448×448) |
| Projection | MLP 2-layer with GELU (mlp1) |
| Language model | Qwen3-8B |
| Total parameters | ~8B |
| Fine-tuning method | LoRA (r=32, α=64) on all attention + FFN projections |
| Trainable params | ~83M (~1% of total) |
| Precision | bfloat16 |
### Classes
The model classifies among 7 clinically well-characterised bone tumor types:
| Class | Type |
|-------|------|
| Osteochondroma | Benign |
| Multiple osteochondromas | Benign |
| Giant cell tumor | Benign |
| Synovial osteochondroma | Benign |
| Osteofibroma | Benign |
| Simple bone cyst | Benign |
| Osteosarcoma | Malignant |
## Training data
**BTXRD** (*Bone Tumor X-ray Radiograph Dataset*, Yao et al. 2025): 3,746
bone X-ray images from 3 medical centres, covering 9 tumor subtypes confirmed
by histopathology. Two heterogeneous catch-all classes were removed, leaving
2,009 samples across 7 clean classes. These were split into 1,662 samples for
training, 172 for validation, and 175 for final evaluation. In the training
split, minority classes were oversampled to a minimum of 150 samples per class.

The textual references (Findings field) were generated synthetically with
GPT-4o through the Batch API, conditioned on the confirmed diagnosis and
clinical metadata. No real radiologist reports are available in BTXRD.
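
The minority-class oversampling mentioned above can be sketched as follows. This is a minimal illustration that assumes random duplication with replacement; the card does not state how the extra samples were drawn, and `oversample_to_minimum` and the toy file names are hypothetical.

```python
import random
from collections import defaultdict


def oversample_to_minimum(samples, min_per_class=150, seed=0):
    """Duplicate minority-class samples until each class has min_per_class items.

    `samples` is a list of (image_path, label) pairs. Classes that already
    reach the minimum are left unchanged.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item in samples:
        by_class[item[1]].append(item)

    result = []
    for items in by_class.values():
        result.extend(items)
        shortfall = min_per_class - len(items)
        if shortfall > 0:
            # Assumption: duplicates are drawn with replacement from the same class.
            result.extend(rng.choices(items, k=shortfall))
    rng.shuffle(result)
    return result


# Toy data: one frequent class and one rare class.
toy = [(f"img_{i}.jpg", "osteochondroma") for i in range(400)]
toy += [(f"img_{i}.jpg", "osteofibroma") for i in range(400, 430)]
balanced = oversample_to_minimum(toy, min_per_class=150)
```

Oversampling of this kind is normally applied to the training split only, so that validation and test images are never duplicated.
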
## Performance
Evaluated on the held-out test set (n=175, stratified).
### Classification metrics
| Model | Classes | n | Accuracy | F1-macro |
|-------|---------|---|----------|----------|
| Base (zero-shot) | 7 | 175 | 35.43 % | 0.354 |
| **BoneVision-8B (this model)** | **7** | **175** | **74.86 %** | **0.720** |
### Per-class metrics
| Class | Precision | Recall | F1 |
|-------|-----------|--------|----|
| Osteochondroma | 0.87 | 0.96 | 0.91 |
| Multiple osteochondromas | 0.87 | 0.96 | 0.91 |
| Osteosarcoma | 0.79 | 0.84 | 0.82 |
| Giant cell tumor | 0.78 | 0.70 | 0.74 |
| Synovial osteochondroma | 0.86 | 0.67 | 0.75 |
| Osteofibroma | 0.67 | 0.80 | 0.73 |
| Simple bone cyst | 0.50 | 0.45 | 0.47 |
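
The F1 column can be cross-checked against the precision and recall columns, since F1 is their harmonic mean. The sketch below recomputes it from the table values; because precision and recall are themselves rounded to two decimals, the recomputed figure can differ from the reported one in the last digit. The helper name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
per_class = {
    "Osteochondroma": (0.87, 0.96, 0.91),
    "Multiple osteochondromas": (0.87, 0.96, 0.91),
    "Osteosarcoma": (0.79, 0.84, 0.82),
    "Giant cell tumor": (0.78, 0.70, 0.74),
    "Synovial osteochondroma": (0.86, 0.67, 0.75),
    "Osteofibroma": (0.67, 0.80, 0.73),
    "Simple bone cyst": (0.50, 0.45, 0.47),
}

for name, (p, r, reported) in per_class.items():
    print(f"{name:26s} computed={f1_score(p, r):.3f} reported={reported:.2f}")
```
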
### Text quality (ROUGE)
ROUGE scores are computed against the synthetic GPT-4o reports as references.
| ROUGE-1 | ROUGE-2 | ROUGE-L |
|---------|---------|---------|
| 0.771 | 0.645 | 0.705 |
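
For orientation, the sketch below shows how ROUGE-1 and ROUGE-L F-measures are defined. It is a simplified version (lowercased whitespace tokens, no stemming); the card does not say which ROUGE implementation or settings produced the table, so this code is not expected to reproduce those numbers. The function names and the toy strings are hypothetical.

```python
from collections import Counter


def _tokens(text):
    return text.lower().split()


def _f_measure(overlap, n_candidate, n_reference):
    if overlap == 0:
        return 0.0
    precision = overlap / n_candidate
    recall = overlap / n_reference
    return 2 * precision * recall / (precision + recall)


def rouge_1(candidate, reference):
    """Unigram-overlap F-measure with clipped counts."""
    cand, ref = _tokens(candidate), _tokens(reference)
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return _f_measure(overlap, len(cand), len(ref))


def rouge_l(candidate, reference):
    """F-measure based on the longest common subsequence (LCS) of tokens."""
    cand, ref = _tokens(candidate), _tokens(reference)
    # Dynamic-programming table holding LCS lengths of all prefixes.
    lcs = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i, c in enumerate(cand, start=1):
        for j, r in enumerate(ref, start=1):
            if c == r:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return _f_measure(lcs[-1][-1], len(cand), len(ref))


# Toy strings, not taken from the dataset.
generated = "lytic lesion with cortical destruction"
reference = "aggressive lytic lesion with cortical destruction and periosteal reaction"
print(rouge_1(generated, reference), rouge_l(generated, reference))
```
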
## How to use
```python
import torch
from transformers import AutoTokenizer, AutoModel
from PIL import Image
import torchvision.transforms as T

model_path = "javierespantaleon/BoneVision-8B"

# trust_remote_code=True lets transformers load the custom model code shipped with the checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
).eval()

# Load and preprocess image (resize to 448×448, ImageNet mean/std normalisation)
image = Image.open("xray.jpg").convert("RGB")
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# Add a batch dimension and match the model's dtype
pixel_values = transform(image).unsqueeze(0).to(torch.bfloat16).cuda()

# Build prompt
prompt = (
    "<image>\n"
    "Provide a detailed radiological report and locate the lesion.\n"
    "Context: Patient: 45/M, Location: Femur.\n"
    "Your response must follow this exact format:\n"
    " Diagnosis: <one of: giant cell tumor | multiple osteochondromas | "
    " osteochondroma | synovial osteochondroma | osteofibroma | "
    " osteosarcoma | simple bone cyst>\n"
    " Findings: <detailed radiological description>"
)

# Greedy decoding, at most 256 new tokens
generation_config = dict(max_new_tokens=256, do_sample=False)
response = model.chat(tokenizer, pixel_values, prompt, generation_config)
print(response)
```
**Example output:**
```
Diagnosis: osteosarcoma
Findings: The imaging reveals a malignant osteosarcoma characterised by an
aggressive bone lesion with cortical destruction and soft tissue extension.
The tumor exhibits a mixed lytic and sclerotic pattern, with associated
periosteal reaction and possible Codman triangle.
```
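
Downstream code usually needs the two fields separately. The following is a minimal sketch for splitting a response in the format shown above. `parse_report` and `CLASSES` are hypothetical helpers, not part of the model repository, and the regular expression assumes the model followed the requested `Diagnosis:` / `Findings:` layout.

```python
import re

# Class names as listed in the prompt.
CLASSES = (
    "giant cell tumor",
    "multiple osteochondromas",
    "osteochondroma",
    "synovial osteochondroma",
    "osteofibroma",
    "osteosarcoma",
    "simple bone cyst",
)


def parse_report(response):
    """Return the diagnosis and findings as a dict, or None if the format is not found."""
    match = re.search(
        r"Diagnosis:\s*(?P<diagnosis>.*?)\s*Findings:\s*(?P<findings>.*)",
        response,
        flags=re.DOTALL | re.IGNORECASE,
    )
    if match is None:
        return None
    diagnosis = match.group("diagnosis").strip().lower()
    # Collapse line breaks inside the narrative into single spaces.
    findings = " ".join(match.group("findings").split())
    return {
        "diagnosis": diagnosis,
        "is_known_class": diagnosis in CLASSES,
        "findings": findings,
    }


example = (
    "Diagnosis: osteosarcoma\n"
    "Findings: The imaging reveals a malignant osteosarcoma characterised by an\n"
    "aggressive bone lesion with cortical destruction and soft tissue extension."
)
report = parse_report(example)
print(report["diagnosis"], report["is_known_class"])
```

Checking `is_known_class` before using the diagnosis guards against responses that drift from the seven expected labels.
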
## Training details
| Hyperparameter | Value |
|----------------|-------|
| LoRA rank (r) | 32 |
| LoRA alpha (α) | 64 |
| LoRA target modules | q/k/v/o/gate/up/down_proj |
| Learning rate (LLM) | 2×10⁻⁴ |
| Learning rate (MLP) | 2×10⁻⁵ |
| Batch size | 8 |
| Epochs | 5 (early stopping) |
| Optimizer | AdamW |
| Hardware | NVIDIA A100 40GB (Google Colab) |
| Training time | ~4 h |
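
For reference, the LoRA rank and alpha in the table enter the adapted weights as follows. This assumes the standard LoRA scaling (not the rank-stabilised variant); the symbols $W_0$, $A$, $B$, $d_{\text{in}}$ and $d_{\text{out}}$ are notation introduced for this note, not names taken from the training code.

$$
W = W_0 + \frac{\alpha}{r}\, B A, \qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad A \in \mathbb{R}^{r \times d_{\text{in}}}
$$

With the values above, the update is scaled by a factor of 2, and each adapted projection adds $r\,(d_{\text{in}} + d_{\text{out}})$ trainable parameters while $W_0$ stays frozen:

$$
\frac{\alpha}{r} = \frac{64}{32} = 2, \qquad r\,(d_{\text{in}} + d_{\text{out}}) = 32\,(d_{\text{in}} + d_{\text{out}})
$$
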
## Limitations
- **Synthetic references**: ROUGE metrics use GPT-4o generated reports as
ground truth, not real radiologist annotations. Text quality is validated
against a proxy, not a clinical gold standard.
- **Dataset distribution**: BTXRD is heavily skewed toward osteochondroma
(≈43% of test samples). Performance on rare classes (osteofibroma,
synovial osteochondroma) should be interpreted cautiously, and simple bone
cyst is the weakest class on the test set (F1 = 0.47).
- **Not for clinical use**: This model is a research prototype and has not
been validated for clinical decision support.
- **Language**: The model generates findings in English regardless of
prompt language.
## Citation
If you use this model, please cite:
```bibtex
@article{wang2025internvl3_5,
title={InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency},
author={Wang, Weiyun and Gao, Zhangwei and Gu, Lixin and Pu, Hengjun and Cui, Long and Wei, Xingguang and Liu, Zhaoyang and Jing, Linglin and Ye, Shenglong and Shao, Jie and others},
journal={arXiv preprint arXiv:2508.18265},
year={2025}
}
```
## License
MIT — same as the base InternVL3 model.