---
license: mit
language:
- en
tags:
- medical
- radiology
- bone-tumor
- vision-language-model
- internvl
- fine-tuned
- classification
- report-generation
datasets:
- btxrd
pipeline_tag: image-text-to-text
base_model: OpenGVLab/InternVL3_5-8B
metrics:
- accuracy
- f1
- rouge
---
# BoneVision-8B — Bone Tumor X-ray Classifier
Fine-tuned vision-language model for automatic classification and structured
report generation from bone X-ray images. Adapted from InternVL3.5-8B on the
[BTXRD](https://www.nature.com/articles/s41597-024-04311-y) dataset using LoRA.
## Model description
This model takes a bone X-ray image and optional clinical metadata (patient
age, sex, anatomical location) and produces a structured radiology report
with two fields:
- **Diagnosis** — one of 7 bone tumor classes
- **Findings** — a narrative description of the radiographic findings that
support the diagnosis
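Because the clinical metadata is optional, a small helper can assemble the prompt with or without the context line. The sketch below mirrors the prompt template from the usage example; `build_prompt` and its parameter names are hypothetical, the class list is joined with single spaces (which may differ slightly from the literal training prompt), and how the model behaves when the context line is omitted is not documented here.

```python
CLASSES = [
    "giant cell tumor", "multiple osteochondromas", "osteochondroma",
    "synovial osteochondroma", "osteofibroma", "osteosarcoma",
    "simple bone cyst",
]

def build_prompt(age=None, sex=None, location=None):
    """Assemble the report prompt; every metadata field is optional."""
    lines = [
        "<image>",
        "Provide a detailed radiological report and locate the lesion.",
    ]
    # Only include the pieces of context that were actually supplied.
    context = []
    if age is not None and sex is not None:
        context.append(f"Patient: {age}/{sex}")
    if location is not None:
        context.append(f"Location: {location}")
    if context:
        lines.append("Context: " + ", ".join(context) + ".")
    lines.append("Your response must follow this exact format:")
    lines.append(" Diagnosis: <one of: " + " | ".join(CLASSES) + ">")
    lines.append(" Findings: <detailed radiological description>")
    return "\n".join(lines)

print(build_prompt(age=45, sex="M", location="Femur"))
```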
### Architecture
| Component | Details |
|-----------|---------|
| Vision encoder | InternViT (ViT, 24 layers, hidden=1024, patch=14×14, img=448×448) |
| Projection | 2-layer MLP with GELU (`mlp1`) |
| Language model | Qwen3-8B |
| Total parameters | ~8B |
| Fine-tuning method | LoRA (r=32, α=64) on all attention + FFN projections |
| Trainable params | ~83M (~1% of total) |
| Precision | bfloat16 |
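For intuition about why the trainable share is so small: LoRA keeps each targeted weight matrix frozen and learns a low-rank update scaled by α/r, which adds r·(in + out) parameters per matrix. A minimal sketch of that arithmetic follows; the 4096×4096 layer size is an illustrative assumption, not a dimension taken from this model's configuration.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)

r, alpha = 32, 64      # rank and alpha from the table above
scaling = alpha / r    # factor applied to the low-rank update B @ A

# Hypothetical square projection, chosen only for illustration.
in_features = out_features = 4096
added = lora_params(in_features, out_features, r)
full = in_features * out_features

print(f"scaling = {scaling}")
print(f"adapter params = {added:,} vs frozen weight = {full:,}")
```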
### Classes
The model assigns each image to one of 7 clinically well-characterised bone tumor types:
| Class | Type |
|-------|------|
| Osteochondroma | Benign |
| Multiple osteochondromas | Benign |
| Giant cell tumor | Benign |
| Synovial osteochondroma | Benign |
| Osteofibroma | Benign |
| Simple bone cyst | Benign |
| Osteosarcoma | Malignant |
## Training data
**BTXRD** (*Bone Tumor X-ray Radiograph Dataset*, Yao et al. 2025) — 3,746
bone X-ray images from 3 medical centres, covering 9 tumor subtypes confirmed
by histopathology. Two heterogeneous catch-all classes were removed, leaving
2,009 samples across 7 clean classes. The data were split into 1,662 training
samples (with minority-class oversampling to a minimum of 150 samples per
class), 172 validation samples, and 175 test samples for final evaluation.
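The oversampling step can be pictured as repeating randomly chosen samples from any class that falls below the floor. The sketch below is an assumption about how such a step might look (random duplication with replacement); the exact procedure used for this model is not stated, and `oversample_to_minimum` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def oversample_to_minimum(samples, min_per_class=150, seed=0):
    """Duplicate random samples of under-represented classes up to a floor.

    `samples` is a list of (item, label) pairs; classes already at or above
    the floor are left unchanged.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append((item, label))

    result = list(samples)
    for label, group in by_class.items():
        deficit = min_per_class - len(group)
        if deficit > 0:
            # Draw with replacement so small classes can be repeated many times.
            result.extend(rng.choices(group, k=deficit))
    rng.shuffle(result)
    return result

# Toy data with made-up counts, not the BTXRD class distribution.
toy = [(f"img_{i}", "osteochondroma") for i in range(300)]
toy += [(f"img_{i}", "osteofibroma") for i in range(300, 340)]
balanced = oversample_to_minimum(toy)
print(Counter(label for _, label in balanced))
```

Oversampling of this kind should be applied to the training split only, so that duplicated images never leak into validation or test data.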
Reference texts for the Findings field were generated synthetically with
GPT-4o via the Batch API, conditioned on the confirmed diagnosis and clinical
metadata. No real radiologist reports are available in BTXRD.
## Performance
Evaluated on the held-out test set (n=175, stratified).
### Classification metrics
| Model | Classes | n | Accuracy | F1-macro |
|-------|---------|---|----------|----------|
| Base (zero-shot) | 7 | 175 | 35.43 % | 0.354 |
| **BoneVision-8B (this model)** | **7** | **175** | **74.86 %** | **0.720** |
### Per-class metrics
| Class | Precision | Recall | F1 |
|-------|-----------|--------|----|
| Osteochondroma | 0.87 | 0.96 | 0.91 |
| Multiple osteochondromas | 0.87 | 0.96 | 0.91 |
| Osteosarcoma | 0.79 | 0.84 | 0.82 |
| Giant cell tumor | 0.78 | 0.70 | 0.74 |
| Synovial osteochondroma | 0.86 | 0.67 | 0.75 |
| Osteofibroma | 0.67 | 0.80 | 0.73 |
| Simple bone cyst | 0.50 | 0.45 | 0.47 |
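Each F1 value is the harmonic mean of that row's precision and recall, and a macro F1 is the unweighted mean of the per-class F1 scores. The sketch below shows how these quantities are derived from predictions; the labels are made up for illustration and are not the test-set predictions, and the helper names are hypothetical.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label in y_true."""
    metrics = {}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

def macro_f1(metrics):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for _, _, f1 in metrics.values()) / len(metrics)

# Made-up labels for illustration only.
y_true = ["osteosarcoma", "osteosarcoma", "simple bone cyst", "simple bone cyst"]
y_pred = ["osteosarcoma", "simple bone cyst", "simple bone cyst", "simple bone cyst"]
m = per_class_metrics(y_true, y_pred)
for label, (p, r, f1) in m.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro F1 = {macro_f1(m):.3f}")
```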
### Text quality (ROUGE)
Computed against GPT-4o synthetic reports as reference.
| ROUGE-1 | ROUGE-2 | ROUGE-L |
|---------|---------|---------|
| 0.771 | 0.645 | 0.705 |
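ROUGE-1 and ROUGE-2 measure unigram and bigram overlap between the generated and reference text, while ROUGE-L is based on the longest common subsequence. The following is a simplified sketch of the n-gram F-measure using whitespace tokenisation. It is not the scoring code behind the table above (the evaluation tooling is not stated), and established ROUGE packages apply additional normalisation, so their scores can differ.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate: str, reference: str, n: int = 1) -> float:
    """F-measure of clipped n-gram overlap between candidate and reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Made-up sentences for illustration only.
reference = "lytic lesion with cortical destruction"
candidate = "lytic lesion with periosteal reaction"
print(rouge_n(candidate, reference, 1), rouge_n(candidate, reference, 2))
```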
## How to use
```python
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_path = "javierespantaleon/BoneVision-8B"

# trust_remote_code is needed because the model class (including .chat)
# is defined by custom code in the repository.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
).eval()

# Load the image and preprocess it to a single 448×448 input
# with ImageNet mean/std normalisation.
image = Image.open("xray.jpg").convert("RGB")
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# Add a batch dimension and move the tensor to the model's device and dtype.
pixel_values = transform(image).unsqueeze(0).to(
    device=model.device, dtype=torch.bfloat16
)

# Build the prompt: image placeholder, clinical context, required output format.
prompt = (
    "<image>\n"
    "Provide a detailed radiological report and locate the lesion.\n"
    "Context: Patient: 45/M, Location: Femur.\n"
    "Your response must follow this exact format:\n"
    " Diagnosis: <one of: giant cell tumor | multiple osteochondromas | "
    " osteochondroma | synovial osteochondroma | osteofibroma | "
    " osteosarcoma | simple bone cyst>\n"
    " Findings: <detailed radiological description>"
)

# Greedy decoding (do_sample=False) keeps the output deterministic.
generation_config = dict(max_new_tokens=256, do_sample=False)
response = model.chat(tokenizer, pixel_values, prompt, generation_config)
print(response)
```
**Example output:**
```
Diagnosis: osteosarcoma
Findings: The imaging reveals a malignant osteosarcoma characterised by an
aggressive bone lesion with cortical destruction and soft tissue extension.
The tumor exhibits a mixed lytic and sclerotic pattern, with associated
periosteal reaction and possible Codman triangle.
```
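Since the output follows a fixed two-field layout, downstream code can split it into its parts. A minimal sketch, assuming the model adheres to the `Diagnosis:` / `Findings:` format; `parse_report` is a hypothetical helper, and because real outputs may deviate from the template, it returns `None` for any field it cannot find.

```python
import re

def parse_report(text: str) -> dict:
    """Split a generated report into its diagnosis and findings fields."""
    # The diagnosis is expected on a single line.
    diagnosis = re.search(r"Diagnosis:\s*(.+)", text)
    # Findings may span several lines, so let '.' match newlines too.
    findings = re.search(r"Findings:\s*(.+)", text, flags=re.DOTALL)
    return {
        "diagnosis": diagnosis.group(1).strip().lower() if diagnosis else None,
        # Collapse line breaks so multi-line findings become one string.
        "findings": " ".join(findings.group(1).split()) if findings else None,
    }

report = parse_report(
    "Diagnosis: osteosarcoma\n"
    "Findings: The imaging reveals an aggressive bone lesion\n"
    "with cortical destruction."
)
print(report["diagnosis"])
print(report["findings"])
```

Checking the parsed diagnosis against the list of 7 class names before using it is a sensible safeguard, since free-text generation is not guaranteed to stay within the label set.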
## Training details
| Hyperparameter | Value |
|----------------|-------|
| LoRA rank (r) | 32 |
| LoRA alpha (α) | 64 |
| LoRA target modules | q/k/v/o/gate/up/down_proj |
| Learning rate (LLM) | 2×10⁻⁴ |
| Learning rate (MLP) | 2×10⁻⁵ |
| Batch size | 8 |
| Epochs | 5 (early stopping) |
| Optimizer | AdamW |
| Hardware | NVIDIA A100 40GB (Google Colab) |
| Training time | ~4 h |
## Limitations
- **Synthetic references**: ROUGE metrics use GPT-4o generated reports as
ground truth, not real radiologist annotations. Text quality is validated
against a proxy, not a clinical gold standard.
- **Dataset distribution**: BTXRD is heavily skewed toward osteochondroma
(≈43% of test samples). Performance on rare classes (osteofibroma,
synovial osteochondroma) should be interpreted cautiously.
- **Not for clinical use**: This model is a research prototype and has not
been validated for clinical decision support.
- **Language**: The model generates findings in English regardless of
prompt language.
## Citation
If you use this model, please cite:
```bibtex
@article{wang2025internvl3_5,
title={InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency},
author={Wang, Weiyun and Gao, Zhangwei and Gu, Lixin and Pu, Hengjun and Cui, Long and Wei, Xingguang and Liu, Zhaoyang and Jing, Linglin and Ye, Shenglong and Shao, Jie and others},
journal={arXiv preprint arXiv:2508.18265},
year={2025}
}
```
## License
MIT — same as the base InternVL3 model.