---
license: mit
language:
- en
tags:
- medical
- radiology
- bone-tumor
- vision-language-model
- internvl
- fine-tuned
- classification
- report-generation
datasets:
- btxrd
pipeline_tag: image-text-to-text
base_model: OpenGVLab/InternVL3_5-8B
metrics:
- accuracy
- f1
- rouge
---

# BoneVision-8B: Bone Tumor X-ray Classifier

Fine-tuned vision-language model for automatic classification and structured report generation from bone X-ray images. Adapted from InternVL3.5-8B on the [BTXRD](https://www.nature.com/articles/s41597-024-04311-y) dataset using LoRA.

## Model description

This model takes a bone X-ray image and optional clinical metadata (patient age, sex, anatomical location) and produces a structured radiology report with two fields:

- **Diagnosis**: one of 7 bone tumor classes
- **Findings**: a narrative description of the radiographic findings that support the diagnosis

### Architecture

| Component | Details |
|-----------|---------|
| Vision encoder | InternViT (ViT, 24 layers, hidden=1024, patch=14×14, img=448×448) |
| Projection | MLP 2-layer with GELU (mlp1) |
| Language model | Qwen3-8B-Instruct |
| Total parameters | ~8B |
| Fine-tuning method | LoRA (r=32, α=64) on all attention + FFN projections |
| Trainable params | ~83M (~1% of total) |
| Precision | bfloat16 |

### Classes

The model classifies among 7 clinically well-characterised bone tumor types:

| Class | Type |
|-------|------|
| Osteochondroma | Benign |
| Multiple osteochondromas | Benign |
| Giant cell tumor | Benign |
| Synovial osteochondroma | Benign |
| Osteofibroma | Benign |
| Simple bone cyst | Benign |
| Osteosarcoma | Malignant |

## Training data

**BTXRD** (*Bone Tumor X-ray Radiograph Dataset*, Yao et al. 2025): 3,746 bone X-ray images from 3 medical centres, covering 9 tumor subtypes confirmed by histopathology. Two heterogeneous catch-all classes were removed, leaving 2,009 samples across 7 clean classes.
Training used 1,662 samples (with minority-class oversampling to a minimum of 150 samples per class), 172 for validation, and 175 for final evaluation.

Textual references (Findings field) were generated synthetically using GPT-4o Batch API conditioned on the confirmed diagnosis and clinical metadata. No real radiologist reports are available in BTXRD.

## Performance

Evaluated on the held-out test set (n=175, stratified).

### Classification metrics

| Model | Classes | n | Accuracy | F1-macro |
|-------|---------|---|----------|----------|
| Base (zero-shot) | 7 | 175 | 35.43% | 0.354 |
| **BoneVision-8B (this model)** | **7** | **175** | **74.86%** | **0.720** |

### Per-class F1

| Class | Precision | Recall | F1 |
|-------|-----------|--------|----|
| Osteochondroma | 0.87 | 0.96 | 0.91 |
| Multiple osteochondromas | 0.87 | 0.96 | 0.91 |
| Osteosarcoma | 0.79 | 0.84 | 0.82 |
| Giant cell tumor | 0.78 | 0.70 | 0.74 |
| Synovial osteochondroma | 0.86 | 0.67 | 0.75 |
| Osteofibroma | 0.67 | 0.80 | 0.73 |
| Simple bone cyst | 0.50 | 0.45 | 0.47 |

### Text quality (ROUGE)

Computed against GPT-4o synthetic reports as reference.

| ROUGE-1 | ROUGE-2 | ROUGE-L |
|---------|---------|---------|
| 0.771 | 0.645 | 0.705 |

## How to use

```python
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_path = "javierespantaleon/BoneVision-8B"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
).eval()

# Load the image and preprocess it as a single 448×448 tile
# with ImageNet mean/std normalisation
image = Image.open("xray.jpg").convert("RGB")
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# Move the tensor to the model's device instead of hard-coding .cuda()
pixel_values = transform(image).unsqueeze(0).to(device=model.device, dtype=torch.bfloat16)

# Build prompt; "<image>" is the InternVL placeholder for the image tokens
prompt = (
    "<image>\n"
    "Provide a detailed radiological report and locate the lesion.\n"
    "Context: Patient: 45/M, Location: Femur.\n"
    "Your response must follow this exact format:\n"
    " Diagnosis: \n"
    " Findings: "
)

# Greedy decoding, capped at 256 new tokens
generation_config = dict(max_new_tokens=256, do_sample=False)
response = model.chat(tokenizer, pixel_values, prompt, generation_config)
print(response)
```

**Example output:**

```
Diagnosis: osteosarcoma
Findings: The imaging reveals a malignant osteosarcoma characterised by an aggressive bone lesion with cortical destruction and soft tissue extension. The tumor exhibits a mixed lytic and sclerotic pattern, with associated periosteal reaction and possible Codman triangle.
```

## Training details

| Hyperparameter | Value |
|----------------|-------|
| LoRA rank (r) | 32 |
| LoRA alpha (α) | 64 |
| LoRA target modules | q/k/v/o/gate/up/down_proj |
| Learning rate (LLM) | 2×10⁻⁴ |
| Learning rate (MLP) | 2×10⁻⁵ |
| Batch size | 8 |
| Epochs | 5 (early stopping) |
| Optimizer | AdamW |
| Hardware | NVIDIA A100 40GB (Google Colab) |
| Training time | ~4 h |

## Limitations

- **Synthetic references**: ROUGE metrics use GPT-4o-generated reports as ground truth, not real radiologist annotations.
  Text quality is validated against a proxy, not a clinical gold standard.
- **Dataset distribution**: BTXRD is heavily skewed toward osteochondroma (≈43% of test samples). Performance on rare classes (osteofibroma, synovial osteochondroma) should be interpreted cautiously.
- **Not for clinical use**: This model is a research prototype and has not been validated for clinical decision support.
- **Language**: The model generates findings in English regardless of prompt language.

## Citation

If you use this model, please cite:

```bibtex
@article{wang2025internvl3_5,
  title={InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency},
  author={Wang, Weiyun and Gao, Zhangwei and Gu, Lixin and Pu, Hengjun and Cui, Long and Wei, Xingguang and Liu, Zhaoyang and Jing, Linglin and Ye, Shenglong and Shao, Jie and others},
  journal={arXiv preprint arXiv:2508.18265},
  year={2025}
}
```

## License

MIT, same as the base InternVL3 model.
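
## Parsing the output

The model returns its report as plain text with a `Diagnosis:` line followed by a `Findings:` block, as in the example output under "How to use". The sketch below shows one possible way to split that text into fields and check the diagnosis against the 7 classes listed in this card. The function name `parse_report`, the returned dictionary keys, and the lower-casing of class names are assumptions made for illustration; they are not part of the released model code.

```python
import re

# Class names copied from the "Classes" table, lower-cased for matching (assumption).
CLASSES = {
    "osteochondroma",
    "multiple osteochondromas",
    "giant cell tumor",
    "synovial osteochondroma",
    "osteofibroma",
    "simple bone cyst",
    "osteosarcoma",
}

# "Diagnosis: <text> Findings: <text>", tolerant of case and of line breaks.
_REPORT_RE = re.compile(
    r"Diagnosis:\s*(?P<diagnosis>.*?)\s*Findings:\s*(?P<findings>.*)",
    re.IGNORECASE | re.DOTALL,
)


def parse_report(response: str) -> dict:
    """Split a generated report into diagnosis and findings (hypothetical helper)."""
    match = _REPORT_RE.search(response)
    if match is None:
        # The model did not follow the requested format.
        return {"diagnosis": None, "findings": None, "known_class": False}
    # Collapse runs of whitespace so multi-line output compares cleanly.
    diagnosis = " ".join(match.group("diagnosis").split()).lower()
    findings = " ".join(match.group("findings").split())
    return {
        "diagnosis": diagnosis,
        "findings": findings,
        "known_class": diagnosis in CLASSES,
    }


example = (
    "Diagnosis: osteosarcoma\n"
    "Findings: Aggressive bone lesion with cortical destruction."
)
print(parse_report(example))
```

Keeping a `known_class` flag rather than raising an error lets an evaluation script count off-format or out-of-vocabulary predictions as misclassifications instead of crashing.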