BoneVision-8B — Bone Tumor X-ray Classifier

Fine-tuned vision-language model for automatic classification and structured report generation from bone X-ray images. Adapted from InternVL3.5-8B on the BTXRD dataset using LoRA.

Model description

This model takes a bone X-ray image and optional clinical metadata (patient age, sex, anatomical location) and produces a structured radiology report with two fields:

Diagnosis — one of 7 bone tumor classes
Findings — a narrative description of the radiographic findings that support the diagnosis
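
The inputs described above can be assembled into a prompt with a small helper. This is a minimal sketch, not the author's training code: `build_prompt` and `CLASSES` are hypothetical names, the wording mirrors the prompt in this card's usage example, and dropping the `Context:` line when no metadata is given is an assumption about how "optional" is handled. The helper also normalises the whitespace inside the class list, so the string is not byte-identical to the usage-example prompt; whether that matters for this checkpoint is untested.

```python
from typing import Optional

# Class names copied from the prompt in this card's usage example.
CLASSES = [
    "giant cell tumor", "multiple osteochondromas", "osteochondroma",
    "synovial osteochondroma", "osteofibroma", "osteosarcoma", "simple bone cyst",
]


def build_prompt(age: Optional[int] = None, sex: Optional[str] = None,
                 location: Optional[str] = None) -> str:
    """Assemble the instruction prompt; metadata left as None is omitted (assumption)."""
    context = []
    patient = "/".join(str(v) for v in (age, sex) if v is not None)
    if patient:
        context.append(f"Patient: {patient}")
    if location:
        context.append(f"Location: {location}")

    lines = ["<image>", "Provide a detailed radiological report and locate the lesion."]
    if context:
        lines.append("Context: " + ", ".join(context) + ".")
    lines.append("Your response must follow this exact format:")
    lines.append("    Diagnosis: <one of: " + " | ".join(CLASSES) + ">")
    lines.append("    Findings: <detailed radiological description>")
    return "\n".join(lines)


print(build_prompt(45, "M", "Femur"))
```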

Architecture

Component	Details
Vision encoder	InternViT (ViT, 24 layers, hidden=1024, patch=14×14, img=448×448)
Projection	2-layer MLP with GELU (mlp1)
Language model	Qwen3-8B
Total parameters	~8B
Fine-tuning method	LoRA (r=32, α=64) on all attention + FFN projections
Trainable params	~83M (~1% of total)
Precision	bfloat16
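
To give a feel for why LoRA trains so few parameters, the sketch below counts the weights an adapter adds to a single linear layer and computes the usual LoRA scaling factor α/r. It is illustrative only: `lora_param_count` is a hypothetical helper, and the layer shapes are placeholder assumptions rather than values read from this checkpoint, so the numbers are not meant to reproduce the trainable-parameter total in the table.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen d_out x d_in weight."""
    return r * (d_in + d_out)


R, ALPHA = 32, 64        # rank and alpha from the table above
scaling = ALPHA / R      # standard LoRA multiplies the update B @ A by alpha / r

# Hypothetical layer shapes, for illustration only (not read from the checkpoint).
example_shapes = {"q_proj": (4096, 4096), "down_proj": (12288, 4096)}

for name, (d_in, d_out) in example_shapes.items():
    frozen = d_in * d_out
    added = lora_param_count(d_in, d_out, R)
    print(f"{name}: {added:,} LoRA params vs {frozen:,} frozen "
          f"({100 * added / frozen:.2f}%)")
print("scaling factor:", scaling)
```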

Classes

The model classifies among 7 clinically well-characterised bone tumor types:

Class	Type
Osteochondroma	Benign
Multiple osteochondromas	Benign
Giant cell tumor	Benign
Synovial osteochondroma	Benign
Osteofibroma	Benign
Simple bone cyst	Benign
Osteosarcoma	Malignant

Training data

BTXRD (Bone Tumor X-ray Radiograph Dataset, Yao et al. 2025) contains 3,746 bone X-ray images from 3 medical centres, covering 9 tumor subtypes confirmed by histopathology. Two heterogeneous catch-all classes were removed, leaving 2,009 samples across 7 clean classes. These were split into 1,662 samples for training, 172 for validation, and 175 for final evaluation. Minority classes in the training split were oversampled to a minimum of 150 samples per class.
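
Oversampling a minority class to a fixed minimum can be sketched as below. This is one plausible way to do it (random duplication with replacement), not necessarily the procedure used for this model; `oversample_to_minimum` and the `label`/`path` keys are hypothetical. It should be applied to the training split only, so that duplicated images never leak into validation or test.

```python
import random
from collections import Counter, defaultdict


def oversample_to_minimum(samples, min_per_class=150, seed=0):
    """Duplicate random items of small classes until each has min_per_class items."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item in samples:
        by_class[item["label"]].append(item)

    balanced = []
    for items in by_class.values():
        balanced.extend(items)
        deficit = min_per_class - len(items)
        if deficit > 0:
            # Sample with replacement from the existing items of this class.
            balanced.extend(rng.choices(items, k=deficit))
    rng.shuffle(balanced)
    return balanced


# Toy training split: one large class and one small class.
toy = (
    [{"label": "osteochondroma", "path": f"a{i}.jpg"} for i in range(300)]
    + [{"label": "simple bone cyst", "path": f"b{i}.jpg"} for i in range(40)]
)
balanced = oversample_to_minimum(toy)
print(Counter(item["label"] for item in balanced))
```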

Reference texts for the Findings field were generated synthetically with GPT-4o (Batch API), conditioned on the confirmed diagnosis and clinical metadata. BTXRD does not include real radiologist reports.

Performance

Evaluated on the held-out test set (n=175, stratified).

Classification metrics

Model	Classes	n	Accuracy	F1-macro
Base (zero-shot)	7	175	35.43 %	0.354
BoneVision-8B (this model)	7	175	74.86 %	0.720

Per-class metrics

Class	Precision	Recall	F1
Osteochondroma	0.87	0.96	0.91
Multiple osteochondromas	0.87	0.96	0.91
Osteosarcoma	0.79	0.84	0.82
Giant cell tumor	0.78	0.70	0.74
Synovial osteochondroma	0.86	0.67	0.75
Osteofibroma	0.67	0.80	0.73
Simple bone cyst	0.50	0.45	0.47
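
Each F1 value in the table is the harmonic mean of that row's precision and recall. The short check below recomputes it from the rounded figures; `f1_score` is a hypothetical helper, and because the inputs are already rounded to two decimals, the recomputed value can differ from the reported one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = {
    "Osteochondroma": (0.87, 0.96, 0.91),
    "Multiple osteochondromas": (0.87, 0.96, 0.91),
    "Osteosarcoma": (0.79, 0.84, 0.82),
    "Giant cell tumor": (0.78, 0.70, 0.74),
    "Synovial osteochondroma": (0.86, 0.67, 0.75),
    "Osteofibroma": (0.67, 0.80, 0.73),
    "Simple bone cyst": (0.50, 0.45, 0.47),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: recomputed {f1_score(p, r):.3f}, reported {reported:.2f}")
```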

Text quality (ROUGE)

Computed against the synthetic GPT-4o reports as references.

ROUGE-1	ROUGE-2	ROUGE-L
0.771	0.645	0.705
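
For readers unfamiliar with ROUGE, the sketch below shows what the three scores measure: n-gram overlap (ROUGE-1, ROUGE-2) and longest common subsequence (ROUGE-L), each reported as an F-measure. It is a simplified illustration with lowercase whitespace tokenisation and no stemming. The card does not state which ROUGE implementation produced the numbers above, so this code is not expected to reproduce them exactly, and the function names are hypothetical.

```python
from collections import Counter


def _f_measure(matches: int, n_candidate: int, n_reference: int) -> float:
    if matches == 0:
        return 0.0
    precision, recall = matches / n_candidate, matches / n_reference
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate: str, reference: str, n: int = 1) -> float:
    """F-measure over clipped n-gram overlap."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # multiset intersection clips repeats
    return _f_measure(overlap, sum(cand.values()), sum(ref.values()))


def rouge_l(candidate: str, reference: str) -> float:
    """F-measure over the longest common subsequence of tokens."""
    a, b = candidate.lower().split(), reference.lower().split()
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return _f_measure(prev[-1], len(a), len(b))


generated = "lytic lesion with cortical destruction"
reference = "lytic lesion with periosteal reaction"
print(rouge_n(generated, reference, 1), rouge_n(generated, reference, 2),
      rouge_l(generated, reference))
```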

How to use

import torch
from transformers import AutoTokenizer, AutoModel
from PIL import Image
import torchvision.transforms as T

model_path = "javierespantaleon/BoneVision-8B"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
).eval()

# Load and preprocess image
image = Image.open("xray.jpg").convert("RGB")
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
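# Match the model's device and dtype; with device_map="auto" the model is not guaranteed to sit on the default CUDA device
pixel_values = transform(image).unsqueeze(0).to(model.device, dtype=torch.bfloat16)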

# Build prompt
prompt = (
    "<image>\n"
    "Provide a detailed radiological report and locate the lesion.\n"
    "Context: Patient: 45/M, Location: Femur.\n"
    "Your response must follow this exact format:\n"
    "    Diagnosis: <one of: giant cell tumor | multiple osteochondromas | "
    "                osteochondroma | synovial osteochondroma | osteofibroma | "
    "                osteosarcoma | simple bone cyst>\n"
    "    Findings: <detailed radiological description>"
)

generation_config = dict(max_new_tokens=256, do_sample=False)
response = model.chat(tokenizer, pixel_values, prompt, generation_config)
print(response)

Example output:

Diagnosis: osteosarcoma
Findings: The imaging reveals a malignant osteosarcoma characterised by an
aggressive bone lesion with cortical destruction and soft tissue extension.
The tumor exhibits a mixed lytic and sclerotic pattern, with associated
periosteal reaction and possible Codman triangle.
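
Because the response follows a fixed two-field layout, it can be split into structured data with a small parser. The sketch below is one possible approach, assuming the model sticks to the `Diagnosis:` / `Findings:` format shown above; `parse_report` and `VALID_CLASSES` are hypothetical names. It returns `None` when the format is not followed and flags diagnoses outside the 7 known classes.

```python
import re

# Class names copied from the prompt in the usage example.
VALID_CLASSES = {
    "giant cell tumor", "multiple osteochondromas", "osteochondroma",
    "synovial osteochondroma", "osteofibroma", "osteosarcoma", "simple bone cyst",
}


def parse_report(text: str):
    """Return the diagnosis and findings, or None if the format is not followed."""
    match = re.search(r"Diagnosis:\s*(.+?)\s*\n\s*Findings:\s*(.+)", text,
                      flags=re.S | re.I)
    if match is None:
        return None
    diagnosis = match.group(1).strip().lower()
    findings = " ".join(match.group(2).split())  # collapse line wraps
    return {
        "diagnosis": diagnosis,
        "findings": findings,
        "known_class": diagnosis in VALID_CLASSES,
    }


example = (
    "Diagnosis: osteosarcoma\n"
    "Findings: The imaging reveals a malignant osteosarcoma characterised by an\n"
    "aggressive bone lesion with cortical destruction and soft tissue extension."
)
report = parse_report(example)
print(report["diagnosis"], report["known_class"])
```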

Training details

Hyperparameter	Value
LoRA rank (r)	32
LoRA alpha (α)	64
LoRA target modules	q/k/v/o/gate/up/down_proj
Learning rate (LLM)	2×10⁻⁴
Learning rate (MLP)	2×10⁻⁵
Batch size	8
Epochs	5 (early stopping)
Optimizer	AdamW
Hardware	NVIDIA A100 40GB (Google Colab)
Training time	~4 h
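
The two learning rates in the table imply separate optimizer parameter groups for the LoRA adapters and the MLP projector. The sketch below shows one way such groups could be built from parameter names. It is an assumption-laden illustration, not the training script: `build_param_groups` is hypothetical, the `lora_` and `mlp1` name patterns are guesses based on common PEFT naming and the projector name in the architecture table, and leaving every other parameter out of the optimizer is also an assumption.

```python
def build_param_groups(named_params, lr_llm=2e-4, lr_mlp=2e-5):
    """Split (name, param) pairs into LoRA-adapter and MLP-projector groups."""
    lora_params, mlp_params = [], []
    for name, param in named_params:
        if "mlp1" in name:        # projector between vision encoder and LLM
            mlp_params.append(param)
        elif "lora_" in name:     # adapter weights (assumed PEFT-style naming)
            lora_params.append(param)
        # Everything else is left out of the optimizer in this sketch.
    return [
        {"params": lora_params, "lr": lr_llm},
        {"params": mlp_params, "lr": lr_mlp},
    ]


# Stand-in for model.named_parameters(); the names are hypothetical examples.
toy_params = [
    ("language_model.layers.0.self_attn.q_proj.lora_A.default.weight", "w_lora"),
    ("mlp1.1.weight", "w_mlp"),
    ("vision_model.embeddings.patch_embedding.weight", "w_vit"),
]
groups = build_param_groups(toy_params)
print([(len(g["params"]), g["lr"]) for g in groups])

# With real tensors, a list of dicts in this shape can be passed to
# torch.optim.AdamW(groups), which accepts per-group "lr" values.
```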

Limitations

Synthetic references: ROUGE metrics use GPT-4o generated reports as ground truth, not real radiologist annotations. Text quality is validated against a proxy, not a clinical gold standard.
Dataset distribution: BTXRD is heavily skewed toward osteochondroma (≈43% of test samples). Performance on rare classes (osteofibroma, synovial osteochondroma) should be interpreted cautiously.
Not for clinical use: This model is a research prototype and has not been validated for clinical decision support.
Language: The model generates findings in English regardless of prompt language.

Citation

If you use this model, please cite:

@article{wang2025internvl3_5,
  title={InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency},
  author={Wang, Weiyun and Gao, Zhangwei and Gu, Lixin and Pu, Hengjun and Cui, Long and Wei, Xingguang and Liu, Zhaoyang and Jing, Linglin and Ye, Shenglong and Shao, Jie and others},
  journal={arXiv preprint arXiv:2508.18265},
  year={2025}
}