| --- |
| license: mit |
| language: |
| - en |
| tags: |
| - medical |
| - radiology |
| - bone-tumor |
| - vision-language-model |
| - internvl |
| - fine-tuned |
| - classification |
| - report-generation |
| datasets: |
| - btxrd |
| pipeline_tag: image-text-to-text |
| base_model: OpenGVLab/InternVL3_5-8B |
| metrics: |
| - accuracy |
| - f1 |
| - rouge |
| --- |
| |
| # BoneVision-8B — Bone Tumor X-ray Classifier |
|
|
| Fine-tuned vision-language model for automatic classification and structured |
| report generation from bone X-ray images. Adapted from InternVL3.5-8B on the |
| [BTXRD](https://www.nature.com/articles/s41597-024-04311-y) dataset using LoRA. |
|
|
| ## Model description |
|
|
| This model takes a bone X-ray image and optional clinical metadata (patient |
| age, sex, anatomical location) and produces a structured radiology report |
| with two fields: |
|
|
| - **Diagnosis** — one of 7 bone tumor classes |
| - **Findings** — a narrative description of the radiographic findings that |
| support the diagnosis |
|
|
| ### Architecture |
|
|
| | Component | Details | |
| |-----------|---------| |
| | Vision encoder | InternViT (ViT, 24 layers, hidden=1024, patch=14×14, img=448×448) | |
| | Projection | MLP 2-layer with GELU (mlp1) | |
| | Language model | Qwen3-8B | |
| | Total parameters | ~8B | |
| | Fine-tuning method | LoRA (r=32, α=64) on all attention + FFN projections | |
| | Trainable params | ~83M (~1% of total) | |
| | Precision | bfloat16 | |
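As a rough sanity check on the trainable-parameter figure, the size of a LoRA adapter can be estimated from the layer shapes: each adapted linear layer of shape `d_in × d_out` adds `r · (d_in + d_out)` weights. The sketch below applies that formula. The layer shapes and layer count are assumptions about the language model, not values taken from this card, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(shapes, rank, num_layers):
    """Count LoRA weights: each adapted layer adds rank * (d_in + d_out)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


# ASSUMED (d_in, d_out) shapes for one transformer block of the language model.
ASSUMED_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 12288),
    "up_proj": (4096, 12288),
    "down_proj": (12288, 4096),
}
ASSUMED_NUM_LAYERS = 36  # assumption, not stated in this card

print(lora_param_count(ASSUMED_SHAPES, rank=32, num_layers=ASSUMED_NUM_LAYERS))
```

Under these assumed dimensions the count lands in the high tens of millions, the same range as the ~83M quoted in the table. The exact figure depends on the real layer shapes and on whether the projector is trained as well.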
|
|
| ### Classes |
|
|
| The model classifies among 7 clinically well-characterised bone tumor types: |
|
|
| | Class | Type | |
| |-------|------| |
| | Osteochondroma | Benign | |
| | Multiple osteochondromas | Benign | |
| | Giant cell tumor | Benign | |
| | Synovial osteochondroma | Benign | |
| | Osteofibroma | Benign | |
| | Simple bone cyst | Benign | |
| | Osteosarcoma | Malignant | |
|
|
| ## Training data |
|
|
| **BTXRD** (*Bone Tumor X-ray Radiograph Dataset*, Yao et al. 2025) — 3,746 |
| bone X-ray images from 3 medical centres, covering 9 tumor subtypes confirmed |
| by histopathology. Two heterogeneous catch-all classes were removed, leaving |
| 2,009 samples across 7 clean classes. Training used 1,662 samples (with |
| minority-class oversampling to a minimum of 150 samples per class), 172 |
| for validation, and 175 for final evaluation. |
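The card does not say how the oversampling was implemented. One common approach, shown here as a minimal sketch, duplicates randomly chosen samples of any class that falls below the minimum. The `label` key, the `oversample_to_minimum` helper, and sampling with replacement are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def oversample_to_minimum(samples, min_per_class=150, seed=0):
    """Duplicate minority-class samples until each class has min_per_class items."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample in samples:
        by_class[sample["label"]].append(sample)

    balanced = []
    for items in by_class.values():
        balanced.extend(items)
        deficit = min_per_class - len(items)
        if deficit > 0:
            # Draw extra copies with replacement from the same class.
            balanced.extend(rng.choices(items, k=deficit))
    rng.shuffle(balanced)
    return balanced
```

Oversampling of this kind should be applied to the training split only, so that validation and test metrics reflect the original class distribution.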
|
|
| Reference texts for the Findings field were generated synthetically with |
| GPT-4o through the OpenAI Batch API, conditioned on the confirmed diagnosis |
| and clinical metadata. BTXRD does not include real radiologist reports. |
|
|
| ## Performance |
|
|
| Evaluated on the held-out test set (n=175, stratified). |
|
|
| ### Classification metrics |
|
|
| | Model | Classes | n | Accuracy | F1-macro | |
| |-------|---------|---|----------|----------| |
| | Base (zero-shot) | 7 | 175 | 35.43 % | 0.354 | |
| | **BoneVision-8B (this model)** | **7** | **175** | **74.86 %** | **0.720** | |
|
|
| ### Per-class metrics |
|
|
| | Class | Precision | Recall | F1 | |
| |-------|-----------|--------|----| |
| | Osteochondroma | 0.87 | 0.96 | 0.91 | |
| | Multiple osteochondromas | 0.87 | 0.96 | 0.91 | |
| | Osteosarcoma | 0.79 | 0.84 | 0.82 | |
| | Giant cell tumor | 0.78 | 0.70 | 0.74 | |
| | Synovial osteochondroma | 0.86 | 0.67 | 0.75 | |
| | Osteofibroma | 0.67 | 0.80 | 0.73 | |
| | Simple bone cyst | 0.50 | 0.45 | 0.47 | |
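For readers who want to reproduce this kind of table from their own predictions, the following sketch computes per-class precision, recall, and F1, plus accuracy and macro-F1 (the unweighted mean of per-class F1). It is a plain-Python illustration rather than the evaluation script used for this card, and the function names are hypothetical.

```python
from collections import Counter


def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def accuracy_and_macro_f1(y_true, y_pred):
    """Accuracy and the unweighted mean of per-class F1 scores."""
    scores = per_class_prf(y_true, y_pred)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)
    return accuracy, macro_f1
```

Because macro-F1 weights every class equally, a weak minority class pulls it down more than it pulls down accuracy.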
|
|
| ### Text quality (ROUGE) |
|
|
| Computed against GPT-4o synthetic reports as reference. |
|
|
| | ROUGE-1 | ROUGE-2 | ROUGE-L | |
| |---------|---------|---------| |
| | 0.771 | 0.645 | 0.705 | |
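To make the ROUGE-N numbers easier to interpret, here is a simplified sketch of the n-gram overlap F-measure. It uses whitespace tokenisation with lowercasing and no stemming, so it will not exactly match a standard ROUGE package, and whether the scores above are F-measures is an assumption. `rouge_n_f1` is a hypothetical helper.

```python
from collections import Counter


def rouge_n_f1(candidate, reference, n=1):
    """Simplified ROUGE-N F-measure based on clipped n-gram overlap."""

    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Since both the references and the training targets come from the same synthetic generator, high overlap mainly shows that the model has learned the reference style.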
|
|
| ## How to use |
|
|
| ```python |
| import torch |
| from transformers import AutoTokenizer, AutoModel |
| from PIL import Image |
| import torchvision.transforms as T |
| |
| model_path = "javierespantaleon/BoneVision-8B" |
| |
| tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) |
| model = AutoModel.from_pretrained( |
|     model_path, |
|     torch_dtype=torch.bfloat16, |
|     trust_remote_code=True, |
|     device_map="auto", |
| ).eval() |
| |
| # Load and preprocess image |
| image = Image.open("xray.jpg").convert("RGB") |
| transform = T.Compose([ |
|     T.Resize((448, 448)), |
|     T.ToTensor(), |
|     T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), |
| ]) |
| # Move the tensor to the model's device instead of assuming a single CUDA device |
| pixel_values = transform(image).unsqueeze(0).to(model.device, dtype=torch.bfloat16) |
| |
| # Build prompt |
| prompt = ( |
|     "<image>\n" |
|     "Provide a detailed radiological report and locate the lesion.\n" |
|     "Context: Patient: 45/M, Location: Femur.\n" |
|     "Your response must follow this exact format:\n" |
|     " Diagnosis: <one of: giant cell tumor | multiple osteochondromas | " |
|     " osteochondroma | synovial osteochondroma | osteofibroma | " |
|     " osteosarcoma | simple bone cyst>\n" |
|     " Findings: <detailed radiological description>" |
| ) |
| |
| generation_config = dict(max_new_tokens=256, do_sample=False) |
| response = model.chat(tokenizer, pixel_values, prompt, generation_config) |
| print(response) |
| ``` |
|
|
| **Example output:** |
| ``` |
| Diagnosis: osteosarcoma |
| Findings: The imaging reveals a malignant osteosarcoma characterised by an |
| aggressive bone lesion with cortical destruction and soft tissue extension. |
| The tumor exhibits a mixed lytic and sclerotic pattern, with associated |
| periosteal reaction and possible Codman triangle. |
| ``` |
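Downstream code usually needs the two fields separately. The sketch below splits a response in the format shown above into a diagnosis and a findings string, and checks the diagnosis against the seven class names listed in the prompt. `parse_report` is a hypothetical helper, not part of the model's API.

```python
import re

VALID_CLASSES = {
    "giant cell tumor",
    "multiple osteochondromas",
    "osteochondroma",
    "synovial osteochondroma",
    "osteofibroma",
    "osteosarcoma",
    "simple bone cyst",
}


def parse_report(response):
    """Split a 'Diagnosis: ... Findings: ...' response into its two fields."""
    match = re.search(
        r"Diagnosis:\s*(?P<diagnosis>.+?)\s*Findings:\s*(?P<findings>.+)",
        response,
        flags=re.DOTALL | re.IGNORECASE,
    )
    if match is None:
        return {"diagnosis": None, "findings": None, "valid": False}
    diagnosis = " ".join(match.group("diagnosis").lower().split())
    # Collapse line breaks inside the findings text into single spaces.
    findings = " ".join(match.group("findings").split())
    return {
        "diagnosis": diagnosis,
        "findings": findings,
        "valid": diagnosis in VALID_CLASSES,
    }
```

Checking the `valid` flag before using the diagnosis guards against free-form outputs that ignore the requested format.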
|
|
| ## Training details |
|
|
| | Hyperparameter | Value | |
| |----------------|-------| |
| | LoRA rank (r) | 32 | |
| | LoRA alpha (α) | 64 | |
| | LoRA target modules | q/k/v/o/gate/up/down_proj | |
| | Learning rate (LLM) | 2×10⁻⁴ | |
| | Learning rate (MLP) | 2×10⁻⁵ | |
| | Batch size | 8 | |
| | Epochs | 5 (early stopping) | |
| | Optimizer | AdamW | |
| | Hardware | NVIDIA A100 40GB (Google Colab) | |
| | Training time | ~4 h | |
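The table lists separate learning rates for the language model adapters and the MLP projector. One way to express that is with optimizer parameter groups, sketched below in plain Python. The list of dicts follows the format PyTorch optimizers such as `torch.optim.AdamW` accept. Matching parameters by the substrings `lora_` and `mlp1` is an assumption about parameter naming, and `build_param_groups` is a hypothetical helper, not the training script used for this model.

```python
def build_param_groups(named_params, llm_lr=2e-4, mlp_lr=2e-5):
    """Group trainable parameters so each group gets its own learning rate."""
    llm_params, mlp_params = [], []
    for name, param in named_params:
        # Skip frozen weights; objects without the flag are treated as trainable.
        if not getattr(param, "requires_grad", True):
            continue
        if "mlp1" in name:          # ASSUMED name of the projector module
            mlp_params.append(param)
        elif "lora_" in name:       # ASSUMED naming of LoRA adapter weights
            llm_params.append(param)
    return [
        {"params": llm_params, "lr": llm_lr},
        {"params": mlp_params, "lr": mlp_lr},
    ]
```

With PyTorch installed, the result could be passed as `torch.optim.AdamW(build_param_groups(model.named_parameters()))`.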
| |
| ## Limitations |
| |
| - **Synthetic references**: ROUGE metrics use GPT-4o generated reports as |
| ground truth, not real radiologist annotations. Text quality is validated |
| against a proxy, not a clinical gold standard. |
| - **Dataset distribution**: BTXRD is heavily skewed toward osteochondroma |
| (≈43% of test samples). Performance on rare classes (osteofibroma, |
| synovial osteochondroma) should be interpreted cautiously. |
| - **Not for clinical use**: This model is a research prototype and has not |
| been validated for clinical decision support. |
| - **Language**: The model generates findings in English regardless of |
| prompt language. |
| |
| ## Citation |
| |
| If you use this model, please cite: |
| |
| ```bibtex |
| @article{wang2025internvl3_5, |
| title={InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency}, |
| author={Wang, Weiyun and Gao, Zhangwei and Gu, Lixin and Pu, Hengjun and Cui, Long and Wei, Xingguang and Liu, Zhaoyang and Jing, Linglin and Ye, Shenglong and Shao, Jie and others}, |
| journal={arXiv preprint arXiv:2508.18265}, |
| year={2025} |
| } |
| ``` |
| |
| ## License |
| |
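| MIT for this fine-tuned release. The base model, InternVL3.5-8B, carries its own license terms; check its model card before redistributing derived weights. |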
| |