---
license: mit
language:
- en
tags:
- medical
- radiology
- bone-tumor
- vision-language-model
- internvl
- fine-tuned
- classification
- report-generation
datasets:
- btxrd
pipeline_tag: image-text-to-text
base_model: OpenGVLab/InternVL3_5-8B
metrics:
- accuracy
- f1
- rouge
---

# BoneVision-8B — Bone Tumor X-ray Classifier

Fine-tuned vision-language model for automatic classification and structured
report generation from bone X-ray images. Adapted from InternVL3.5-8B on the
[BTXRD](https://www.nature.com/articles/s41597-024-04311-y) dataset using LoRA.

## Model description

This model takes a bone X-ray image and optional clinical metadata (patient
age, sex, anatomical location) and produces a structured radiology report
with two fields:

- **Diagnosis** — one of 7 bone tumor classes
- **Findings** — a narrative description of the radiographic findings that
  support the diagnosis

### Architecture

| Component | Details |
|-----------|---------|
| Vision encoder | InternViT (ViT, 24 layers, hidden=1024, patch=14×14, img=448×448) |
| Projection | MLP 2-layer with GELU (mlp1) |
| Language model | Qwen3-8B |
| Total parameters | ~8B |
| Fine-tuning method | LoRA (r=32, α=64) on all attention + FFN projections |
| Trainable params | ~83M (~1% of total) |
| Precision | bfloat16 |

### Classes

The model classifies among 7 clinically well-characterised bone tumor types:

| Class | Type |
|-------|------|
| Osteochondroma | Benign |
| Multiple osteochondromas | Benign |
| Giant cell tumor | Benign |
| Synovial osteochondroma | Benign |
| Osteofibroma | Benign |
| Simple bone cyst | Benign |
| Osteosarcoma | Malignant |

## Training data

**BTXRD** (*Bone Tumor X-ray Radiograph Dataset*, Yao et al. 2025) — 3,746
bone X-ray images from 3 medical centres, covering 9 tumor subtypes confirmed
by histopathology. Two heterogeneous catch-all classes were removed, leaving
2,009 samples across 7 clean classes. Training used 1,662 samples (with
minority-class oversampling to a minimum of 150 samples per class), 172
for validation, and 175 for final evaluation.
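
The minority-class oversampling mentioned above can be sketched as follows. This is a minimal illustration, not the training pipeline itself: the function name `oversample_minority`, the `label`/`path` keys, and the choice of sampling with replacement are all assumptions made for the example.

```python
import random
from collections import defaultdict


def oversample_minority(samples, min_per_class=150, seed=0):
    """Duplicate random items of rare classes until each class reaches min_per_class.

    `samples` is a list of dicts with a "label" key (hypothetical schema).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample in samples:
        by_class[sample["label"]].append(sample)

    balanced = []
    for label, items in by_class.items():
        balanced.extend(items)
        shortfall = min_per_class - len(items)
        if shortfall > 0:
            # Sample with replacement so very small classes can still reach the floor.
            balanced.extend(rng.choices(items, k=shortfall))
    rng.shuffle(balanced)
    return balanced


# Toy data: one frequent class and one rare class.
toy = [{"path": f"img_{i}.jpg", "label": "osteochondroma"} for i in range(300)]
toy += [{"path": f"cyst_{i}.jpg", "label": "simple bone cyst"} for i in range(40)]
balanced = oversample_minority(toy, min_per_class=150)
```

Oversampling of this kind is normally applied to the training split only, so that validation and test sets keep their original class distribution.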

Textual references (the Findings field) were generated synthetically with
GPT-4o through the Batch API, conditioned on the confirmed diagnosis and
clinical metadata. BTXRD does not include real radiologist reports.

## Performance

Evaluated on the held-out test set (n=175, stratified).

### Classification metrics

| Model | Classes | n | Accuracy | F1-macro |
|-------|---------|---|----------|----------|
| Base (zero-shot) | 7 | 175 | 35.43 % | 0.354 |
| **BoneVision-8B (this model)** | **7** | **175** | **74.86 %** | **0.720** |

### Per-class metrics

| Class | Precision | Recall | F1 |
|-------|-----------|--------|----|
| Osteochondroma | 0.87 | 0.96 | 0.91 |
| Multiple osteochondromas | 0.87 | 0.96 | 0.91 |
| Osteosarcoma | 0.79 | 0.84 | 0.82 |
| Giant cell tumor | 0.78 | 0.70 | 0.74 |
| Synovial osteochondroma | 0.86 | 0.67 | 0.75 |
| Osteofibroma | 0.67 | 0.80 | 0.73 |
| Simple bone cyst | 0.50 | 0.45 | 0.47 |
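
For readers who want to recompute these quantities on their own predictions, the sketch below shows how accuracy, per-class precision/recall/F1, and macro-F1 are defined. It is a plain-Python illustration on toy labels, not the evaluation script used for this card; the function names are chosen for the example.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: {"precision", "recall", "f1"}} for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    report = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def macro_f1(report):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    return sum(m["f1"] for m in report.values()) / len(report)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels, for illustration only.
y_true = ["osteosarcoma", "osteosarcoma", "simple bone cyst", "simple bone cyst"]
y_pred = ["osteosarcoma", "simple bone cyst", "simple bone cyst", "simple bone cyst"]
report = per_class_metrics(y_true, y_pred)
```

Because macro-F1 weights every class equally, a weak minority class such as simple bone cyst pulls it down more than it pulls down accuracy.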

### Text quality (ROUGE)

Computed against GPT-4o synthetic reports as reference.

| ROUGE-1 | ROUGE-2 | ROUGE-L |
|---------|---------|---------|
| 0.771 | 0.645 | 0.705 |
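
As a reminder of what these scores measure, here is a simplified sketch of ROUGE-N and ROUGE-L as F-measures over lowercase, whitespace-split tokens. It assumes no stemming and no special tokenisation, so it will not exactly reproduce the numbers of a full ROUGE package; the card does not state which implementation or variant produced the table above.

```python
from collections import Counter


def _tokens(text):
    return text.lower().split()


def _f_measure(overlap, cand_total, ref_total):
    if overlap == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    """F-measure over clipped n-gram overlap."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(_tokens(candidate)), ngrams(_tokens(reference))
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum count per n-gram
    return _f_measure(overlap, sum(cand.values()), sum(ref.values()))


def _lcs_length(a, b):
    """Length of the longest common subsequence, using one rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """F-measure over the longest common subsequence of tokens."""
    cand, ref = _tokens(candidate), _tokens(reference)
    return _f_measure(_lcs_length(cand, ref), len(cand), len(ref))


# Toy sentences, for illustration only.
reference = "lytic lesion with cortical destruction"
candidate = "lytic lesion with periosteal reaction"
r1 = rouge_n(candidate, reference, n=1)
r2 = rouge_n(candidate, reference, n=2)
rl = rouge_l(candidate, reference)
```

High overlap with a synthetic reference indicates that the model has learned the reference style and vocabulary; it does not by itself show that the findings are clinically correct.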

## How to use

```python
import torch
from transformers import AutoTokenizer, AutoModel
from PIL import Image
import torchvision.transforms as T

model_path = "javierespantaleon/BoneVision-8B"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
).eval()

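# Load the image and preprocess it as a single 448×448 tile with ImageNet normalisation
image = Image.open("xray.jpg").convert("RGB")
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
# Add a batch dimension, then match the model's dtype and device
pixel_values = transform(image).unsqueeze(0).to(device=model.device, dtype=torch.bfloat16)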

# Build prompt
prompt = (
    "<image>\n"
    "Provide a detailed radiological report and locate the lesion.\n"
    "Context: Patient: 45/M, Location: Femur.\n"
    "Your response must follow this exact format:\n"
    "    Diagnosis: <one of: giant cell tumor | multiple osteochondromas | "
    "                osteochondroma | synovial osteochondroma | osteofibroma | "
    "                osteosarcoma | simple bone cyst>\n"
    "    Findings: <detailed radiological description>"
)

generation_config = dict(max_new_tokens=256, do_sample=False)
response = model.chat(tokenizer, pixel_values, prompt, generation_config)
print(response)
```

**Example output:**
```
Diagnosis: osteosarcoma
Findings: The imaging reveals a malignant osteosarcoma characterised by an
aggressive bone lesion with cortical destruction and soft tissue extension.
The tumor exhibits a mixed lytic and sclerotic pattern, with associated
periosteal reaction and possible Codman triangle.
```
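
Since the response follows a fixed two-field format, it can be split into structured fields for downstream use. The helper below is a minimal sketch: `parse_report` and the `valid` flag are names introduced for this example, and the regular expression assumes the model kept to the `Diagnosis:` / `Findings:` layout requested in the prompt.

```python
import re

# Class names as listed in the prompt above.
CLASSES = [
    "giant cell tumor", "multiple osteochondromas", "osteochondroma",
    "synovial osteochondroma", "osteofibroma", "osteosarcoma", "simple bone cyst",
]

_REPORT_RE = re.compile(
    r"Diagnosis:\s*(.+?)\s*Findings:\s*(.+)", flags=re.DOTALL | re.IGNORECASE
)


def parse_report(response):
    """Split a generated report into diagnosis and findings.

    Returns valid=False when the format is missing or the diagnosis
    is not one of the 7 expected classes.
    """
    match = _REPORT_RE.search(response)
    if match is None:
        return {"diagnosis": None, "findings": None, "valid": False}
    diagnosis = match.group(1).strip().rstrip(".").lower()
    findings = " ".join(match.group(2).split())  # collapse line breaks and extra spaces
    return {"diagnosis": diagnosis, "findings": findings, "valid": diagnosis in CLASSES}


example = (
    "Diagnosis: osteosarcoma\n"
    "Findings: The imaging reveals a malignant osteosarcoma characterised by an\n"
    "aggressive bone lesion with cortical destruction and soft tissue extension."
)
report = parse_report(example)
```

Checking `valid` before using the diagnosis is a cheap guard against off-format generations or class names outside the expected set.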

## Training details

| Hyperparameter | Value |
|----------------|-------|
| LoRA rank (r) | 32 |
| LoRA alpha (α) | 64 |
| LoRA target modules | q/k/v/o/gate/up/down_proj |
| Learning rate (LLM) | 2×10⁻⁴ |
| Learning rate (MLP) | 2×10⁻⁵ |
| Batch size | 8 |
| Epochs | 5 (early stopping) |
| Optimizer | AdamW |
| Hardware | NVIDIA A100 40GB (Google Colab) |
| Training time | ~4 h |
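
To give a feel for where the trainable-parameter count comes from, the sketch below counts the weights LoRA adds at rank 32 to the seven target projections. The layer dimensions are assumptions typical of an 8B decoder with grouped-query attention and are not stated in this card; check the model's `config.json` before relying on them. With α=64 and r=32, the standard LoRA scaling factor α/r is 2.

```python
def lora_params(d_in, d_out, rank):
    """Weights LoRA adds to one linear layer: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# ASSUMED dimensions, for illustration only (not taken from this card).
HIDDEN = 4096   # model width
KV_DIM = 1024   # key/value width under grouped-query attention
FFN = 12288     # feed-forward inner width
LAYERS = 36     # number of decoder layers
RANK = 32       # LoRA rank, as in the table above

# (input width, output width) of each target projection.
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, FFN),
    "up_proj": (HIDDEN, FFN),
    "down_proj": (FFN, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in projections.values())
total = per_layer * LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total / 1e6:.1f}M")
```

Under these assumed shapes the total lands in the same range as the ~83M reported above; the exact figure depends on the true layer dimensions and on which modules were actually wrapped.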

## Limitations

- **Synthetic references**: ROUGE metrics use GPT-4o generated reports as
  ground truth, not real radiologist annotations. Text quality is validated
  against a proxy, not a clinical gold standard.
- **Dataset distribution**: BTXRD is heavily skewed toward osteochondroma
  (≈43% of test samples). Performance on rare classes (osteofibroma,
  synovial osteochondroma) should be interpreted cautiously.
- **Not for clinical use**: This model is a research prototype and has not
  been validated for clinical decision support.
- **Language**: The model generates findings in English regardless of
  prompt language.

## Citation

If you use this model, please cite:

```bibtex
@article{wang2025internvl3_5,
  title={InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency},
  author={Wang, Weiyun and Gao, Zhangwei and Gu, Lixin and Pu, Hengjun and Cui, Long and Wei, Xingguang and Liu, Zhaoyang and Jing, Linglin and Ye, Shenglong and Shao, Jie and others},
  journal={arXiv preprint arXiv:2508.18265},
  year={2025}
}
```

## License

MIT — same as the base InternVL3 model.