BoneVision-8B — Bone Tumor X-ray Classifier

Fine-tuned vision-language model for automatic classification and structured report generation from bone X-ray images. Adapted from InternVL3.5-8B on the BTXRD dataset using LoRA.

Model description

This model takes a bone X-ray image and optional clinical metadata (patient age, sex, anatomical location) and produces a structured radiology report with two fields:

  • Diagnosis — one of 7 bone tumor classes
  • Findings — a narrative description of the radiographic findings that support the diagnosis
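
Because the clinical metadata is optional, it can help to assemble the prompt programmatically. The sketch below mirrors the prompt wording from the usage example in this card; the function name `build_prompt`, the normalised whitespace, and the choice to drop the Context line when any metadata field is missing are assumptions, not documented behaviour of the model.

```python
from typing import Optional

# The seven class labels, copied from the prompt in the usage example.
CLASSES = [
    "giant cell tumor", "multiple osteochondromas", "osteochondroma",
    "synovial osteochondroma", "osteofibroma", "osteosarcoma",
    "simple bone cyst",
]


def build_prompt(age: Optional[int] = None,
                 sex: Optional[str] = None,
                 location: Optional[str] = None) -> str:
    """Assemble a report prompt (hypothetical helper).

    Assumption: the Context line is included only when age, sex and
    location are all provided; otherwise it is omitted entirely.
    """
    lines = [
        "<image>",
        "Provide a detailed radiological report and locate the lesion.",
    ]
    if age is not None and sex and location:
        lines.append(f"Context: Patient: {age}/{sex}, Location: {location}.")
    lines.append("Your response must follow this exact format:")
    lines.append("Diagnosis: <one of: " + " | ".join(CLASSES) + ">")
    lines.append("Findings: <detailed radiological description>")
    return "\n".join(lines)


print(build_prompt(45, "M", "Femur"))
```

The exact whitespace used during fine-tuning may matter, so the literal prompt from the usage example remains the safer choice when reproducing the reported results.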

Architecture

| Component | Details |
|---|---|
| Vision encoder | InternViT (ViT, 24 layers, hidden=1024, patch=14×14, img=448×448) |
| Projection | MLP, 2-layer with GELU (mlp1) |
| Language model | Qwen3-8B-Instruct |
| Total parameters | ~8B |
| Fine-tuning method | LoRA (r=32, α=64) on all attention + FFN projections |
| Trainable params | 83M (1% of total) |
| Precision | bfloat16 |
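
The patch and image sizes in the table determine how many patches the vision encoder sees per 448×448 input. The following is a small arithmetic sketch; the helper name `patch_grid` is illustrative and not part of the model code.

```python
from typing import Tuple


def patch_grid(image_size: int = 448, patch_size: int = 14) -> Tuple[int, int]:
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


per_side, total = patch_grid()
print(per_side, total)  # 32 1024
```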

Classes

The model classifies among 7 clinically well-characterised bone tumor types:

| Class | Type |
|---|---|
| Osteochondroma | Benign |
| Multiple osteochondromas | Benign |
| Giant cell tumor | Benign |
| Synovial osteochondroma | Benign |
| Osteofibroma | Benign |
| Simple bone cyst | Benign |
| Osteosarcoma | Malignant |

Training data

BTXRD (Bone Tumor X-ray Radiograph Dataset, Yao et al. 2025) — 3,746 bone X-ray images from 3 medical centres, covering 9 tumor subtypes confirmed by histopathology. Two heterogeneous catch-all classes were removed, leaving 2,009 samples across 7 clean classes. Training used 1,662 samples (with minority-class oversampling to a minimum of 150 samples per class), 172 for validation, and 175 for final evaluation.
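
As a rough illustration of minority-class oversampling to a floor of 150 samples per class, the sketch below duplicates randomly chosen items of under-represented classes. The function name, the random-duplication strategy, and the toy class sizes are assumptions; the card does not state how duplicates were drawn or whether the 1,662 figure is counted before or after oversampling.

```python
import random
from collections import Counter, defaultdict

# The three split sizes quoted above add up to the filtered dataset size.
assert 1662 + 172 + 175 == 2009


def oversample_to_minimum(samples, minimum=150, seed=0):
    """Return a new list in which every class has at least `minimum` items.

    `samples` is a list of (item, label) pairs. Classes below the floor are
    topped up by sampling their own items with replacement (assumption).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append(item)

    balanced = []
    for label, items in by_class.items():
        balanced.extend((item, label) for item in items)
        shortfall = minimum - len(items)
        if shortfall > 0:
            balanced.extend((rng.choice(items), label) for _ in range(shortfall))
    return balanced


# Toy data with made-up class sizes, for illustration only.
toy = [(f"img_a{i}", "osteochondroma") for i in range(400)]
toy += [(f"img_b{i}", "simple bone cyst") for i in range(40)]

counts = Counter(label for _, label in oversample_to_minimum(toy))
print(counts)
```

Oversampling should be applied to the training split only, so that duplicated images never leak into validation or test data.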

Reference texts for the Findings field were generated synthetically with GPT-4o through the Batch API, conditioned on the confirmed diagnosis and clinical metadata, because BTXRD does not include real radiologist reports.

Performance

Evaluated on the held-out test set (n=175, stratified).

Classification metrics

| Model | Classes | n | Accuracy | F1-macro |
|---|---|---|---|---|
| Base (zero-shot) | 7 | 175 | 35.43 % | 0.354 |
| BoneVision-8B (this model) | 7 | 175 | 74.86 % | 0.720 |
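
For reference, accuracy and macro-averaged F1 can be computed from label lists as sketched below. This is a plain-Python illustration, not the evaluation script used for this card, and the toy labels are made up. With n=175, the reported accuracies correspond to 131 and 62 correct predictions respectively.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels that occur."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with two hypothetical classes.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
print(accuracy(y_true, y_pred), round(macro_f1(y_true, y_pred), 4))

# Correct-prediction counts implied by the reported accuracies.
print(round(100 * 131 / 175, 2), round(100 * 62 / 175, 2))  # 74.86 35.43
```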

Per-class F1

| Class | Precision | Recall | F1 |
|---|---|---|---|
| Osteochondroma | 0.87 | 0.96 | 0.91 |
| Multiple osteochondromas | 0.87 | 0.96 | 0.91 |
| Osteosarcoma | 0.79 | 0.84 | 0.82 |
| Giant cell tumor | 0.78 | 0.70 | 0.74 |
| Synovial osteochondroma | 0.86 | 0.67 | 0.75 |
| Osteofibroma | 0.67 | 0.80 | 0.73 |
| Simple bone cyst | 0.50 | 0.45 | 0.47 |
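
Each F1 value is the harmonic mean of the precision and recall in the same row. The sketch below recomputes it from the rounded table entries; because the inputs are rounded to two decimals, differences of up to about 0.01 from the reported F1 are expected.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (precision, recall, reported F1), copied from the table above.
rows = {
    "Osteochondroma": (0.87, 0.96, 0.91),
    "Multiple osteochondromas": (0.87, 0.96, 0.91),
    "Osteosarcoma": (0.79, 0.84, 0.82),
    "Giant cell tumor": (0.78, 0.70, 0.74),
    "Synovial osteochondroma": (0.86, 0.67, 0.75),
    "Osteofibroma": (0.67, 0.80, 0.73),
    "Simple bone cyst": (0.50, 0.45, 0.47),
}

for name, (p, r, reported) in rows.items():
    print(f"{name:26s} computed={f1(p, r):.3f} reported={reported:.2f}")
```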

Text quality (ROUGE)

Computed against GPT-4o synthetic reports as reference.

| ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|
| 0.771 | 0.645 | 0.705 |
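
To make the metrics concrete, here is a simplified sketch of ROUGE-1 and ROUGE-L F-measures using lowercase whitespace tokenisation. It is not the scorer used for the numbers above; standard ROUGE packages apply their own tokenisation and optional stemming, so scores will differ slightly. The two example sentences are made up.

```python
from collections import Counter


def _f_measure(overlap: int, n_cand: int, n_ref: int) -> float:
    """F1 of precision (overlap / n_cand) and recall (overlap / n_ref)."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / n_cand, overlap / n_ref
    return 2 * precision * recall / (precision + recall)


def rouge_1(candidate: str, reference: str) -> float:
    """Unigram-overlap F-measure with clipped counts."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return _f_measure(overlap, len(cand), len(ref))


def rouge_l(candidate: str, reference: str) -> float:
    """F-measure based on the longest common subsequence of tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    prev = [0] * (len(ref) + 1)
    for tok in cand:
        cur = [0]
        for j, ref_tok in enumerate(ref, start=1):
            if tok == ref_tok:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return _f_measure(prev[-1], len(cand), len(ref))


generated = "lytic lesion with cortical destruction"
reference = "lytic lesion with periosteal reaction"
print(rouge_1(generated, reference), rouge_l(generated, reference))
```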

How to use

import torch
from transformers import AutoTokenizer, AutoModel
from PIL import Image
import torchvision.transforms as T

model_path = "javierespantaleon/BoneVision-8B"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
).eval()

# Load and preprocess image
image = Image.open("xray.jpg").convert("RGB")
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
pixel_values = transform(image).unsqueeze(0).to(torch.bfloat16).cuda()

# Build prompt
prompt = (
    "<image>\n"
    "Provide a detailed radiological report and locate the lesion.\n"
    "Context: Patient: 45/M, Location: Femur.\n"
    "Your response must follow this exact format:\n"
    "    Diagnosis: <one of: giant cell tumor | multiple osteochondromas | "
    "                osteochondroma | synovial osteochondroma | osteofibroma | "
    "                osteosarcoma | simple bone cyst>\n"
    "    Findings: <detailed radiological description>"
)

generation_config = dict(max_new_tokens=256, do_sample=False)
response = model.chat(tokenizer, pixel_values, prompt, generation_config)
print(response)

Example output:

Diagnosis: osteosarcoma
Findings: The imaging reveals a malignant osteosarcoma characterised by an
aggressive bone lesion with cortical destruction and soft tissue extension.
The tumor exhibits a mixed lytic and sclerotic pattern, with associated
periosteal reaction and possible Codman triangle.
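
Downstream code usually needs the two fields separately. The following is a minimal parsing sketch, assuming the response follows the `Diagnosis:` / `Findings:` layout shown above; `parse_report` is a hypothetical helper, not part of the model's API, and it returns `None` when the layout is not found.

```python
import re

# Class labels as listed in the prompt of the usage example.
CLASSES = {
    "giant cell tumor", "multiple osteochondromas", "osteochondroma",
    "synovial osteochondroma", "osteofibroma", "osteosarcoma",
    "simple bone cyst",
}

_REPORT_RE = re.compile(
    r"Diagnosis:\s*(.+?)\s*\n\s*Findings:\s*(.+)",
    flags=re.DOTALL | re.IGNORECASE,
)


def parse_report(text: str):
    """Split a generated report into diagnosis and findings."""
    match = _REPORT_RE.search(text)
    if match is None:
        return None
    diagnosis = match.group(1).strip().lower()
    findings = " ".join(match.group(2).split())  # collapse line breaks
    return {
        "diagnosis": diagnosis,
        "findings": findings,
        "valid_class": diagnosis in CLASSES,
    }


example = (
    "Diagnosis: osteosarcoma\n"
    "Findings: The imaging reveals a malignant osteosarcoma characterised by an\n"
    "aggressive bone lesion with cortical destruction and soft tissue extension."
)
print(parse_report(example))
```

Checking `valid_class` is a cheap guard against free-form diagnoses that fall outside the seven trained classes.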

Training details

| Hyperparameter | Value |
|---|---|
| LoRA rank (r) | 32 |
| LoRA alpha (α) | 64 |
| LoRA target modules | q/k/v/o/gate/up/down_proj |
| Learning rate (LLM) | 2×10⁻⁴ |
| Learning rate (MLP) | 2×10⁻⁵ |
| Batch size | 8 |
| Epochs | 5 (early stopping) |
| Optimizer | AdamW |
| Hardware | NVIDIA A100 40GB (Google Colab) |
| Training time | ~4 h |
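
In the standard LoRA formulation, a frozen weight W is augmented with a low-rank update scaled by α/r, and each adapted linear layer adds r·(d_in + d_out) trainable parameters. The sketch below applies this to the rank and alpha from the table; the 4096×4096 layer shape is a hypothetical example, not a value taken from this card.

```python
def lora_scaling(alpha: float = 64, r: int = 32) -> float:
    """Scaling factor applied to the low-rank update B @ A."""
    return alpha / r


def lora_added_params(d_in: int, d_out: int, r: int = 32) -> int:
    """Trainable parameters added to one linear layer.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * (d_in + d_out)


print(lora_scaling())                 # 2.0
print(lora_added_params(4096, 4096))  # 262144 (hypothetical layer shape)
```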

Limitations

  • Synthetic references: ROUGE scores are computed against GPT-4o-generated reports, not real radiologist annotations, so they measure agreement with a proxy rather than with a clinical gold standard.
  • Dataset distribution: BTXRD is heavily skewed toward osteochondroma (≈43% of test samples). Performance on rare classes (osteofibroma, synovial osteochondroma) should be interpreted cautiously.
  • Not for clinical use: This model is a research prototype and has not been validated for clinical decision support.
  • Language: The model generates findings in English regardless of prompt language.
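
To put the class skew in context, the sketch below computes what a trivial classifier that always predicts the majority class would score. It uses only the ≈43% osteochondroma share and the 7 classes mentioned in this card; the helper name is illustrative.

```python
def majority_baseline(majority_share: float, n_classes: int):
    """Accuracy and macro-F1 of always predicting the majority class.

    For the majority class, precision equals its share and recall is 1;
    every other class has F1 = 0.
    """
    accuracy = majority_share
    f1_majority = 2 * majority_share / (1 + majority_share)
    macro_f1 = f1_majority / n_classes
    return accuracy, macro_f1


acc, macro = majority_baseline(0.43, 7)
print(f"accuracy={acc:.2f}, macro-F1={macro:.3f}")
```

Macro-F1 penalises such a classifier heavily, which is why it is the more informative headline metric on a skewed test set.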

Citation

If you use this model, please cite:

@article{wang2025internvl3_5,
  title={InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency},
  author={Wang, Weiyun and Gao, Zhangwei and Gu, Lixin and Pu, Hengjun and Cui, Long and Wei, Xingguang and Liu, Zhaoyang and Jing, Linglin and Ye, Shenglong and Shao, Jie and others},
  journal={arXiv preprint arXiv:2508.18265},
  year={2025}
}

License

MIT — same as the base InternVL3 model.
