---
license: mit
language:
- en
tags:
- medical
- radiology
- bone-tumor
- vision-language-model
- internvl
- fine-tuned
- classification
- report-generation
datasets:
- btxrd
pipeline_tag: image-text-to-text
base_model: OpenGVLab/InternVL3_5-8B
metrics:
- accuracy
- f1
- rouge
---

# BoneVision-8B — Bone Tumor X-ray Classifier

Fine-tuned vision-language model for automatic classification and structured
report generation from bone X-ray images. Adapted from InternVL3.5-8B on the
[BTXRD](https://www.nature.com/articles/s41597-024-04311-y) dataset using LoRA.

## Model description

This model takes a bone X-ray image and optional clinical metadata (patient
age, sex, anatomical location) and produces a structured radiology report
with two fields:

- **Diagnosis** — one of 7 bone tumor classes
- **Findings** — a narrative description of the radiographic findings that
  support the diagnosis

### Architecture

| Component | Details |
|-----------|---------|
| Vision encoder | InternViT (ViT, 24 layers, hidden=1024, patch=14×14, img=448×448) |
| Projection | MLP 2-layer with GELU (mlp1) |
| Language model | Qwen3-8B |
| Total parameters | ~8B |
| Fine-tuning method | LoRA (r=32, α=64) on all attention + FFN projections |
| Trainable params | ~83M (~1% of total) |
| Precision | bfloat16 |

### Classes

The model classifies among 7 clinically well-characterised bone tumor types:

| Class | Type |
|-------|------|
| Osteochondroma | Benign |
| Multiple osteochondromas | Benign |
| Giant cell tumor | Benign |
| Synovial osteochondroma | Benign |
| Osteofibroma | Benign |
| Simple bone cyst | Benign |
| Osteosarcoma | Malignant |

## Training data

**BTXRD** (*Bone Tumor X-ray Radiograph Dataset*, Yao et al. 2025) — 3,746
bone X-ray images from 3 medical centres, covering 9 tumor subtypes confirmed
by histopathology. Two heterogeneous catch-all classes were removed, leaving
2,009 samples across 7 clean classes. Training used 1,662 samples (with
minority-class oversampling to a minimum of 150 samples per class), 172
for validation, and 175 for final evaluation.
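
The minority-class oversampling mentioned above can be sketched as follows. This is a minimal illustration, assuming samples are duplicated at random (with replacement) until each class reaches the minimum; the actual procedure used for training is not described in this card, and the function name `oversample_to_minimum` and the toy data are hypothetical.

```python
import random
from collections import defaultdict

def oversample_to_minimum(samples, min_per_class=150, seed=0):
    """Duplicate random members of small classes until each has min_per_class items.

    `samples` is a list of (item, label) pairs. Classes already at or above
    the minimum are left unchanged.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append((item, label))

    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        shortfall = min_per_class - len(group)
        if shortfall > 0:
            # Draw the missing samples with replacement from the same class.
            balanced.extend(rng.choices(group, k=shortfall))
    rng.shuffle(balanced)
    return balanced

# Toy data: the counts are illustrative, not the BTXRD class distribution.
toy = [(f"img_{i}.jpg", "osteochondroma") for i in range(400)]
toy += [(f"img_{i}.jpg", "osteofibroma") for i in range(400, 430)]
balanced = oversample_to_minimum(toy, min_per_class=150)
```

Oversampling should be applied to the training split only, so that duplicated images never leak into validation or test data.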

Reference texts for the Findings field were generated synthetically with
GPT-4o (via the Batch API), conditioned on the confirmed diagnosis and
clinical metadata. BTXRD does not include radiologist reports.

## Performance

Evaluated on the held-out test set (n=175, stratified).

### Classification metrics

| Model | Classes | n | Accuracy | F1-macro |
|-------|---------|---|----------|----------|
| Base (zero-shot) | 7 | 175 | 35.43 % | 0.354 |
| **BoneVision-8B (this model)** | **7** | **175** | **74.86 %** | **0.720** |

### Per-class metrics

| Class | Precision | Recall | F1 |
|-------|-----------|--------|----|
| Osteochondroma | 0.87 | 0.96 | 0.91 |
| Multiple osteochondromas | 0.87 | 0.96 | 0.91 |
| Osteosarcoma | 0.79 | 0.84 | 0.82 |
| Giant cell tumor | 0.78 | 0.70 | 0.74 |
| Synovial osteochondroma | 0.86 | 0.67 | 0.75 |
| Osteofibroma | 0.67 | 0.80 | 0.73 |
| Simple bone cyst | 0.50 | 0.45 | 0.47 |
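
For readers who want to reproduce this kind of table on their own predictions, the sketch below computes accuracy, per-class precision/recall/F1, and macro-F1 (the unweighted mean of the per-class F1 scores) in plain Python. It is an illustration of the standard definitions, not the evaluation script used for this card; the function name `evaluate_predictions` and the toy labels are hypothetical.

```python
def evaluate_predictions(y_true, y_pred):
    """Return (accuracy, macro_f1, per_class).

    per_class maps each label to a (precision, recall, f1) tuple.
    """
    pairs = list(zip(y_true, y_pred))
    per_class = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = (precision, recall, f1)

    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Macro-F1 weights every class equally, regardless of its sample count.
    macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(per_class)
    return accuracy, macro_f1, per_class

# Toy example with two classes (not real model outputs).
y_true = ["osteosarcoma", "osteosarcoma", "simple bone cyst", "simple bone cyst"]
y_pred = ["osteosarcoma", "simple bone cyst", "simple bone cyst", "simple bone cyst"]
accuracy, macro_f1, per_class = evaluate_predictions(y_true, y_pred)
```

Because macro-F1 ignores class frequency, a weak minority class lowers it more than it lowers accuracy.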

### Text quality (ROUGE)

Computed against GPT-4o synthetic reports as reference.

| ROUGE-1 | ROUGE-2 | ROUGE-L |
|---------|---------|---------|
| 0.771 | 0.645 | 0.705 |
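
To make the metric concrete, here is a simplified sketch of the ROUGE-N and ROUGE-L F-measures. It assumes lowercase whitespace tokenization with no stemming; this card does not state which ROUGE implementation produced the scores above, so the numbers from this sketch may differ from those of a standard package. The function names are hypothetical.

```python
from collections import Counter

def _f_measure(overlap, n_candidate, n_reference):
    """Harmonic mean of precision and recall for a given overlap count."""
    if overlap == 0:
        return 0.0
    precision = overlap / n_candidate
    recall = overlap / n_reference
    return 2 * precision * recall / (precision + recall)

def rouge_n(candidate, reference, n=1):
    """ROUGE-N F-measure based on clipped n-gram overlap."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    return _f_measure(overlap, sum(cand.values()), sum(ref.values()))

def rouge_l(candidate, reference):
    """ROUGE-L F-measure based on the longest common subsequence (LCS)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    prev = [0] * (len(ref) + 1)  # LCS lengths for the previous candidate token
    for tok in cand:
        curr = [0]
        for j, ref_tok in enumerate(ref, start=1):
            if tok == ref_tok:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return _f_measure(prev[-1], len(cand), len(ref))

# Toy sentences, not taken from the evaluation set.
generated = "lytic lesion with cortical destruction"
reference = "lytic lesion with periosteal reaction"
scores = (rouge_n(generated, reference, 1),
          rouge_n(generated, reference, 2),
          rouge_l(generated, reference))
```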

## How to use

```python
import torch
from transformers import AutoTokenizer, AutoModel
from PIL import Image
import torchvision.transforms as T

model_path = "javierespantaleon/BoneVision-8B"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
).eval()

# Load and preprocess image
image = Image.open("xray.jpg").convert("RGB")
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
# Move the tensor to the device the model was loaded on
pixel_values = transform(image).unsqueeze(0).to(device=model.device, dtype=torch.bfloat16)

# Build prompt
prompt = (
    "<image>\n"
    "Provide a detailed radiological report and locate the lesion.\n"
    "Context: Patient: 45/M, Location: Femur.\n"
    "Your response must follow this exact format:\n"
    "    Diagnosis: <one of: giant cell tumor | multiple osteochondromas | "
    "                osteochondroma | synovial osteochondroma | osteofibroma | "
    "                osteosarcoma | simple bone cyst>\n"
    "    Findings: <detailed radiological description>"
)

generation_config = dict(max_new_tokens=256, do_sample=False)
response = model.chat(tokenizer, pixel_values, prompt, generation_config)
print(response)
```

**Example output:**
```
Diagnosis: osteosarcoma
Findings: The imaging reveals a malignant osteosarcoma characterised by an
aggressive bone lesion with cortical destruction and soft tissue extension.
The tumor exhibits a mixed lytic and sclerotic pattern, with associated
periosteal reaction and possible Codman triangle.
```
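
Because the response follows a fixed two-field format, it can be split into its parts for downstream use. The sketch below is one possible parser, assuming the output matches the format shown above; the helper `parse_report` and the `CLASSES` set are hypothetical and not part of the model repository. A diagnosis outside the 7 supported classes is returned as `None` rather than passed through.

```python
import re

# The 7 class names listed in the prompt format.
CLASSES = {
    "giant cell tumor", "multiple osteochondromas", "osteochondroma",
    "synovial osteochondroma", "osteofibroma", "osteosarcoma",
    "simple bone cyst",
}

def parse_report(response):
    """Split a model response into (diagnosis, findings).

    diagnosis is None when the field is missing or is not one of CLASSES.
    findings has its line breaks collapsed into single spaces.
    """
    match = re.search(
        r"Diagnosis:\s*(.*?)\s*Findings:\s*(.*)",
        response,
        flags=re.DOTALL | re.IGNORECASE,
    )
    if match is None:
        # Unexpected format: return the raw text so nothing is lost.
        return None, response.strip()
    diagnosis = match.group(1).strip().lower()
    findings = " ".join(match.group(2).split())
    return (diagnosis if diagnosis in CLASSES else None), findings

example = (
    "Diagnosis: osteosarcoma\n"
    "Findings: The imaging reveals a malignant osteosarcoma characterised by an\n"
    "aggressive bone lesion with cortical destruction and soft tissue extension."
)
diagnosis, findings = parse_report(example)
```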

## Training details

| Hyperparameter | Value |
|----------------|-------|
| LoRA rank (r) | 32 |
| LoRA alpha (α) | 64 |
| LoRA target modules | q/k/v/o/gate/up/down_proj |
| Learning rate (LLM) | 2×10⁻⁴ |
| Learning rate (MLP) | 2×10⁻⁵ |
| Batch size | 8 |
| Epochs | 5 (early stopping) |
| Optimizer | AdamW |
| Hardware | NVIDIA A100 40GB (Google Colab) |
| Training time | ~4 h |
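
As a rough guide, the LoRA rows of this table map onto a Hugging Face `peft` configuration like the one below. This is a sketch under assumptions: only `r`, `lora_alpha`, and the target module names come from the table, every other setting (dropout, bias handling, and so on) is left at the `peft` defaults and may differ from the actual training run, and the training framework used is not stated in this card.

```python
from peft import LoraConfig

# r, lora_alpha, and target_modules are taken from the hyperparameter table.
# All remaining options use peft defaults (an assumption, not the documented setup).
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # feed-forward projections
    ],
)
```

With these values the standard LoRA scaling factor `lora_alpha / r` is 2.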

## Limitations

- **Synthetic references**: ROUGE metrics use GPT-4o generated reports as
  ground truth, not real radiologist annotations. Text quality is validated
  against a proxy, not a clinical gold standard.
- **Dataset distribution**: BTXRD is heavily skewed toward osteochondroma
  (≈43% of test samples). Performance on rare classes (osteofibroma,
  synovial osteochondroma) should be interpreted cautiously.
- **Not for clinical use**: This model is a research prototype and has not
  been validated for clinical decision support.
- **Language**: The model generates findings in English regardless of
  prompt language.

## Citation

If you use this model, please cite:

```bibtex
@article{wang2025internvl3_5,
  title={InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency},
  author={Wang, Weiyun and Gao, Zhangwei and Gu, Lixin and Pu, Hengjun and Cui, Long and Wei, Xingguang and Liu, Zhaoyang and Jing, Linglin and Ye, Shenglong and Shao, Jie and others},
  journal={arXiv preprint arXiv:2508.18265},
  year={2025}
}
```

## License

MIT, same as the base InternVL3.5 model.