license: mit
language:
- en
tags:
- medical
- radiology
- bone-tumor
- vision-language-model
- internvl
- fine-tuned
- classification
- report-generation
datasets:
- btxrd
pipeline_tag: image-text-to-text
base_model: OpenGVLab/InternVL3_5-8B
metrics:
- accuracy
- f1
- rouge
BoneVision-8B — Bone Tumor X-ray Classifier
BoneVision-8B is a fine-tuned vision-language model that classifies bone tumors and generates structured reports from bone X-ray images. It was adapted from InternVL3.5-8B on the BTXRD dataset using LoRA.
Model description
This model takes a bone X-ray image and optional clinical metadata (patient age, sex, anatomical location) and produces a structured radiology report with two fields:
- Diagnosis — one of 7 bone tumor classes
- Findings — a narrative description of the radiographic findings that support the diagnosis
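The two-field output can be turned into structured data with a small parser. The following is a minimal sketch, not part of the released model code: `parse_report` and `CLASSES` are hypothetical names, and the sketch assumes the model emits the literal labels `Diagnosis:` and `Findings:` as in the prompt template used elsewhere in this card.

```python
import re

# The seven class names as they appear in this card's prompt template.
CLASSES = (
    "giant cell tumor",
    "multiple osteochondromas",
    "osteochondroma",
    "synovial osteochondroma",
    "osteofibroma",
    "osteosarcoma",
    "simple bone cyst",
)


def parse_report(text):
    """Split a generated report into its Diagnosis and Findings fields.

    Hypothetical helper: a field that cannot be found is returned as None.
    """
    # Diagnosis is expected on a single line; Findings may span several lines.
    diag = re.search(r"Diagnosis:\s*(.+)", text)
    find = re.search(r"Findings:\s*(.+)", text, flags=re.DOTALL)
    diagnosis = diag.group(1).strip().lower() if diag else None
    findings = " ".join(find.group(1).split()) if find else None
    return {
        "diagnosis": diagnosis,
        "findings": findings,
        # Flags outputs whose diagnosis is not one of the seven classes.
        "known_class": diagnosis in CLASSES,
    }
```

Checking `known_class` is one way to catch malformed generations before they are counted in any evaluation.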
Architecture
| Component | Details |
|---|---|
| Vision encoder | InternViT (ViT, 24 layers, hidden=1024, patch=14×14, img=448×448) |
| Projection | 2-layer MLP with GELU (mlp1) |
| Language model | Qwen3-8B |
| Total parameters | ~8B |
| Fine-tuning method | LoRA (r=32, α=64) on all attention + FFN projections |
| Trainable params | |
| Precision | bfloat16 |
Classes
The model assigns each image to one of 7 clinically well-characterised bone tumor types:
| Class | Type |
|---|---|
| Osteochondroma | Benign |
| Multiple osteochondromas | Benign |
| Giant cell tumor | Benign |
| Synovial osteochondroma | Benign |
| Osteofibroma | Benign |
| Simple bone cyst | Benign |
| Osteosarcoma | Malignant |
Training data
The model was fine-tuned on BTXRD (Bone Tumor X-ray Radiograph Dataset, Yao et al. 2025), which contains 3,746 bone X-ray images from 3 medical centres and covers 9 tumor subtypes confirmed by histopathology. Two heterogeneous catch-all classes were removed, leaving 2,009 samples across 7 clean classes. These were split into 1,662 samples for training (with minority-class oversampling to a minimum of 150 samples per class), 172 for validation, and 175 for final evaluation.
The textual references for the Findings field were generated synthetically with GPT-4o through the Batch API, conditioned on the confirmed diagnosis and clinical metadata. No real radiologist reports are available in BTXRD.
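Minority-class oversampling to a per-class minimum can be sketched as follows. This is an illustration under stated assumptions rather than the actual training script: `oversample_to_minimum` is a hypothetical helper, samples are assumed to be `(image_path, label)` pairs, duplicates are drawn uniformly at random with replacement, and the step is assumed to be applied to the training split only.

```python
import random
from collections import defaultdict


def oversample_to_minimum(samples, min_per_class=150, seed=0):
    """Duplicate minority-class samples until every class has at least
    `min_per_class` entries. Classes already at or above the minimum
    are left unchanged.
    """
    rng = random.Random(seed)

    # Group the (image_path, label) pairs by label.
    by_label = defaultdict(list)
    for item in samples:
        by_label[item[1]].append(item)

    result = list(samples)
    for label, items in by_label.items():
        deficit = min_per_class - len(items)
        if deficit > 0:
            # Sample with replacement from the existing items of this class.
            result.extend(rng.choices(items, k=deficit))

    rng.shuffle(result)
    return result
```

Keeping the validation and test splits untouched avoids leaking duplicated images into the evaluation.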
Performance
Evaluated on the held-out test set (n=175, stratified).
Classification metrics
| Model | Classes | n | Accuracy | F1-macro |
|---|---|---|---|---|
| Base (zero-shot) | 7 | 175 | 35.43 % | 0.354 |
| BoneVision-8B (this model) | 7 | 175 | 74.86 % | 0.720 |
Per-class precision, recall, and F1
| Class | Precision | Recall | F1 |
|---|---|---|---|
| Osteochondroma | 0.87 | 0.96 | 0.91 |
| Multiple osteochondromas | 0.87 | 0.96 | 0.91 |
| Osteosarcoma | 0.79 | 0.84 | 0.82 |
| Giant cell tumor | 0.78 | 0.70 | 0.74 |
| Synovial osteochondroma | 0.86 | 0.67 | 0.75 |
| Osteofibroma | 0.67 | 0.80 | 0.73 |
| Simple bone cyst | 0.50 | 0.45 | 0.47 |
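For readers who want to recompute these metrics from their own predictions, the sketch below derives accuracy, per-class precision, recall, and F1, and macro-F1 from two label lists. `score_predictions` is a hypothetical helper, not the evaluation code used for this card; it assumes macro-F1 is the unweighted mean of per-class F1 over every label that appears in either list.

```python
def score_predictions(y_true, y_pred):
    """Return accuracy, macro-F1, and per-class precision/recall/F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}

    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Macro-F1: every class counts equally, regardless of its support.
    macro_f1 = sum(m["f1"] for m in per_class.values()) / len(per_class)
    return {"accuracy": accuracy, "macro_f1": macro_f1, "per_class": per_class}
```

Because macro-F1 weights all classes equally, a weak minority class lowers it more than it lowers accuracy, which is dominated by the most frequent class.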
Text quality (ROUGE)
Computed against the synthetic GPT-4o reports as references.
| ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|
| 0.771 | 0.645 | 0.705 |
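To make the three scores concrete, here is a simplified sketch of ROUGE-N and ROUGE-L as F-measures over lowercased whitespace tokens. It is an assumption-laden illustration: the card does not state which ROUGE package, tokenizer, or variant (precision, recall, or F-measure) produced the numbers above, so values from `rouge_n` and `rouge_l` (hypothetical names) may differ from those of a standard scoring library.

```python
from collections import Counter


def _f_measure(overlap, n_candidate, n_reference):
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0:
        return 0.0
    precision = overlap / n_candidate
    recall = overlap / n_reference
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    """ROUGE-N F-measure using clipped n-gram overlap."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    return _f_measure(overlap, sum(cand.values()), sum(ref.values()))


def rouge_l(candidate, reference):
    """ROUGE-L F-measure from the longest common subsequence of tokens."""
    a, b = candidate.lower().split(), reference.lower().split()
    # Dynamic-programming table for the LCS length.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return _f_measure(table[len(a)][len(b)], len(a), len(b))
```

High overlap with a synthetic reference indicates that the model reproduces the style of the GPT-4o reports; it does not by itself measure clinical correctness.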
How to use
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_path = "javierespantaleon/BoneVision-8B"

# trust_remote_code=True is needed because the InternVL model class
# (including its chat() method) is loaded from the model repository.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
).eval()

# Load and preprocess image (single 448×448 tile, ImageNet normalisation)
image = Image.open("xray.jpg").convert("RGB")
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# Move the batch to the model's device rather than assuming a CUDA device
pixel_values = transform(image).unsqueeze(0).to(model.device, dtype=torch.bfloat16)

# Build prompt
prompt = (
    "<image>\n"
    "Provide a detailed radiological report and locate the lesion.\n"
    "Context: Patient: 45/M, Location: Femur.\n"
    "Your response must follow this exact format:\n"
    " Diagnosis: <one of: giant cell tumor | multiple osteochondromas | "
    " osteochondroma | synovial osteochondroma | osteofibroma | "
    " osteosarcoma | simple bone cyst>\n"
    " Findings: <detailed radiological description>"
)

# Greedy decoding keeps the output deterministic
generation_config = dict(max_new_tokens=256, do_sample=False)
response = model.chat(tokenizer, pixel_values, prompt, generation_config)
print(response)
Example output:
Diagnosis: osteosarcoma
Findings: The imaging reveals a malignant osteosarcoma characterised by an
aggressive bone lesion with cortical destruction and soft tissue extension.
The tumor exhibits a mixed lytic and sclerotic pattern, with associated
periosteal reaction and possible Codman triangle.
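Since the clinical metadata is optional, the prompt can be assembled programmatically. The helper below is a sketch under assumptions: `build_prompt` and `CLASS_NAMES` are hypothetical names, omitting the `Context:` line when no metadata is given is a guess about how the model was prompted during training, and the whitespace inside the format block is normalised relative to the prompt shown above.

```python
# Class names in the order used by the prompt in this card.
CLASS_NAMES = [
    "giant cell tumor",
    "multiple osteochondromas",
    "osteochondroma",
    "synovial osteochondroma",
    "osteofibroma",
    "osteosarcoma",
    "simple bone cyst",
]


def build_prompt(age=None, sex=None, location=None):
    """Assemble the instruction prompt; every metadata field is optional."""
    # "45/M" when both are known, otherwise whichever value is available.
    patient = "/".join(str(v) for v in (age, sex) if v is not None)
    context = []
    if patient:
        context.append(f"Patient: {patient}")
    if location:
        context.append(f"Location: {location}")

    lines = [
        "<image>",
        "Provide a detailed radiological report and locate the lesion.",
    ]
    if context:
        # Assumption: the Context line is dropped when no metadata exists.
        lines.append("Context: " + ", ".join(context) + ".")
    lines.append("Your response must follow this exact format:")
    lines.append("  Diagnosis: <one of: " + " | ".join(CLASS_NAMES) + ">")
    lines.append("  Findings: <detailed radiological description>")
    return "\n".join(lines)
```

For example, `build_prompt(45, "M", "Femur")` yields a prompt containing `Context: Patient: 45/M, Location: Femur.`, matching the context line in the usage snippet.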
Training details
| Hyperparameter | Value |
|---|---|
| LoRA rank (r) | 32 |
| LoRA alpha (α) | 64 |
| LoRA target modules | q/k/v/o/gate/up/down_proj |
| Learning rate (LLM) | 2×10⁻⁴ |
| Learning rate (MLP) | 2×10⁻⁵ |
| Batch size | 8 |
| Epochs | 5 (early stopping) |
| Optimizer | AdamW |
| Hardware | NVIDIA A100 40GB (Google Colab) |
| Training time | ~4 h |
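The LoRA rows of the table can be read as follows. This is a dependency-free sketch, not the training configuration file: `LoraSettings` is a hypothetical container, the expanded module names assume the usual `*_proj` naming of Qwen-style decoder layers, and the scaling factor assumes the standard LoRA formulation in which the low-rank update is multiplied by α/r.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Hypothetical container mirroring the LoRA rows of the table above."""
    r: int = 32
    alpha: int = 64
    # "q/k/v/o/gate/up/down_proj" expanded into individual module names.
    target_modules: tuple = (
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    )

    @property
    def scaling(self):
        # Standard LoRA scales the update B @ A by alpha / r.
        return self.alpha / self.r

    def added_params(self, d_in, d_out):
        # A has shape (r, d_in) and B has shape (d_out, r), so one adapted
        # linear layer gains r * (d_in + d_out) trainable weights.
        return self.r * (d_in + d_out)
```

With r=32 and α=64 the scaling factor is 2.0. As a purely illustrative figure, a square 4096×4096 projection would gain 32 × (4096 + 4096) = 262,144 trainable weights.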
Limitations
- Synthetic references: ROUGE metrics use GPT-4o generated reports as ground truth, not real radiologist annotations. Text quality is validated against a proxy, not a clinical gold standard.
- Dataset distribution: BTXRD is heavily skewed toward osteochondroma (≈43% of test samples). Performance on rare classes (osteofibroma, synovial osteochondroma) should be interpreted cautiously.
- Not for clinical use: This model is a research prototype and has not been validated for clinical decision support.
- Language: The model generates findings in English regardless of prompt language.
Citation
If you use this model, please cite:
@article{wang2025internvl3_5,
title={InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency},
author={Wang, Weiyun and Gao, Zhangwei and Gu, Lixin and Pu, Hengjun and Cui, Long and Wei, Xingguang and Liu, Zhaoyang and Jing, Linglin and Ye, Shenglong and Shao, Jie and others},
journal={arXiv preprint arXiv:2508.18265},
year={2025}
}
License
MIT — same as the base InternVL3 model.