	---
	license: mit
	language:
	- en
	tags:
	- medical
	- radiology
	- bone-tumor
	- vision-language-model
	- internvl
	- fine-tuned
	- classification
	- report-generation
	datasets:
	- btxrd
	pipeline_tag: image-text-to-text
	base_model: OpenGVLab/InternVL3_5-8B
	metrics:
	- accuracy
	- f1
	- rouge
	---

	# BoneVision-8B — Bone Tumor X-ray Classifier

	A vision-language model fine-tuned for bone tumor classification and
	structured report generation from bone X-ray images. It adapts InternVL3.5-8B
	to the [BTXRD](https://www.nature.com/articles/s41597-024-04311-y) dataset using LoRA.

	## Model description

	This model takes a bone X-ray image and optional clinical metadata (patient
	age, sex, anatomical location) and produces a structured radiology report
	with two fields:

	- Diagnosis — one of 7 bone tumor classes
	- Findings — a narrative description of the radiographic findings that
	support the diagnosis
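The optional metadata enters the model through the text prompt. The helper below is a minimal sketch of how such a prompt could be assembled: `build_prompt`, `CLASSES`, and the rule of dropping the context line when no metadata is given are hypothetical and not part of the released code. Only the prompt wording comes from the usage example in this card, and the exact whitespace used during fine-tuning may differ.

```python
# Class names as they appear in the prompt of the usage example.
CLASSES = [
    "giant cell tumor",
    "multiple osteochondromas",
    "osteochondroma",
    "synovial osteochondroma",
    "osteofibroma",
    "osteosarcoma",
    "simple bone cyst",
]


def build_prompt(age=None, sex=None, location=None):
    """Assemble the report prompt; metadata fields are optional (assumed behaviour)."""
    lines = [
        "<image>",
        "Provide a detailed radiological report and locate the lesion.",
    ]
    context = []
    if age is not None and sex is not None:
        context.append(f"Patient: {age}/{sex}")
    if location:
        context.append(f"Location: {location}")
    if context:
        # Assumption: the context line is simply omitted when no metadata is known.
        lines.append("Context: " + ", ".join(context) + ".")
    lines.append("Your response must follow this exact format:")
    lines.append(" Diagnosis: <one of: " + " | ".join(CLASSES) + ">")
    lines.append(" Findings: <detailed radiological description>")
    return "\n".join(lines)


prompt = build_prompt(age=45, sex="M", location="Femur")
```

Keeping the class list in one place means the same names can be reused when validating the model's answer.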

	### Architecture

	| Component | Details |
	|-----------|---------|
	| Vision encoder | InternViT (ViT, 24 layers, hidden=1024, patch=14×14, img=448×448) |
	| Projection | MLP 2-layer with GELU (mlp1) |
	| Language model | Qwen3-8B |
	| Total parameters | ~8B |
	| Fine-tuning method | LoRA (r=32, α=64) on all attention + FFN projections |
	| Trainable params | ~83M (~1% of total) |
	| Precision | bfloat16 |

	### Classes

	The model classifies among 7 clinically well-characterised bone tumor types:

	| Class | Type |
	|-------|------|
	| Osteochondroma | Benign |
	| Multiple osteochondromas | Benign |
	| Giant cell tumor | Benign |
	| Synovial osteochondroma | Benign |
	| Osteofibroma | Benign |
	| Simple bone cyst | Benign |
	| Osteosarcoma | Malignant |

	## Training data

	BTXRD (Bone Tumor X-ray Radiograph Dataset, Yao et al. 2025) contains 3,746
	bone X-ray images from 3 medical centres, covering 9 tumor subtypes confirmed
	by histopathology. Two heterogeneous catch-all classes were removed, leaving
	2,009 samples across 7 clean classes. These were split into 1,662 training
	samples (counted before minority classes were oversampled to a minimum of
	150 samples per class), 172 validation samples, and 175 test samples.
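The oversampling step can be illustrated with a short sketch. The card does not say how the duplicates were drawn, so sampling with replacement from each minority class is an assumption, and `oversample_to_minimum` and the toy records are hypothetical.

```python
import random
from collections import Counter, defaultdict


def oversample_to_minimum(samples, min_per_class=150, seed=0):
    """Duplicate random samples until every class has at least min_per_class items."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample in samples:
        by_class[sample["label"]].append(sample)

    balanced = []
    for items in by_class.values():
        balanced.extend(items)
        shortfall = min_per_class - len(items)
        if shortfall > 0:
            # Assumption: duplicates are drawn with replacement from the same class.
            balanced.extend(rng.choices(items, k=shortfall))
    rng.shuffle(balanced)
    return balanced


# Toy data: one majority class and one minority class.
toy = [{"image": f"a{i}.jpg", "label": "osteochondroma"} for i in range(200)]
toy += [{"image": f"b{i}.jpg", "label": "osteofibroma"} for i in range(20)]

balanced = oversample_to_minimum(toy)
counts = Counter(sample["label"] for sample in balanced)
```

Classes that already meet the minimum are left unchanged, so only the training split grows; validation and test data should not be oversampled.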

	The reference texts for the Findings field were generated synthetically with
	the GPT-4o Batch API, conditioned on the confirmed diagnosis and clinical
	metadata. No real radiologist reports are available in BTXRD.

	## Performance

	Evaluated on the held-out test set (n=175, stratified).

	### Classification metrics

	| Model | Classes | n | Accuracy | F1-macro |
	|-------|---------|---|----------|----------|
	| Base (zero-shot) | 7 | 175 | 35.43% | 0.354 |
	| BoneVision-8B (this model) | 7 | 175 | 74.86% | 0.720 |

	### Per-class F1

	| Class | Precision | Recall | F1 |
	|-------|-----------|--------|----|
	| Osteochondroma | 0.87 | 0.96 | 0.91 |
	| Multiple osteochondromas | 0.87 | 0.96 | 0.91 |
	| Osteosarcoma | 0.79 | 0.84 | 0.82 |
	| Giant cell tumor | 0.78 | 0.70 | 0.74 |
	| Synovial osteochondroma | 0.86 | 0.67 | 0.75 |
	| Osteofibroma | 0.67 | 0.80 | 0.73 |
	| Simple bone cyst | 0.50 | 0.45 | 0.47 |
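For readers who want to recompute these metrics on their own predictions, the sketch below uses the standard definitions of accuracy, per-class precision/recall/F1, and macro-F1 (the unweighted mean of per-class F1). The card does not state which library produced the reported numbers; the function names and toy labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, classes):
    """Return {class: (precision, recall, f1)} from two label lists."""
    metrics = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[c] = (precision, recall, f1)
    return metrics


def macro_f1(metrics):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for _, _, f1 in metrics.values()) / len(metrics)


# Toy labels, not taken from the test set.
y_true = ["osteosarcoma", "osteosarcoma", "simple bone cyst", "simple bone cyst"]
y_pred = ["osteosarcoma", "simple bone cyst", "simple bone cyst", "simple bone cyst"]

m = per_class_metrics(y_true, y_pred, ["osteosarcoma", "simple bone cyst"])
acc = accuracy(y_true, y_pred)
score = macro_f1(m)
```

Because macro-F1 weights every class equally, a weak minority class lowers it more than it lowers accuracy.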

	### Text quality (ROUGE)

	Computed against GPT-4o synthetic reports as reference.

	| ROUGE-1 | ROUGE-2 | ROUGE-L |
	|---------|---------|---------|
	| 0.771 | 0.645 | 0.705 |
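As a reminder of what these scores measure, here is a simplified, dependency-free sketch of ROUGE-N and ROUGE-L F-measures. It assumes lowercase whitespace tokenisation with no stemming, so it will not reproduce the numbers above exactly; the evaluation package actually used is not stated in this card, and the function names are hypothetical.

```python
from collections import Counter


def _f_measure(overlap, n_pred, n_ref):
    """Harmonic mean of precision (overlap/n_pred) and recall (overlap/n_ref)."""
    if overlap == 0:
        return 0.0
    precision = overlap / n_pred
    recall = overlap / n_ref
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n=1):
    """F-measure over clipped n-gram overlap."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred = ngrams(prediction.lower().split())
    ref = ngrams(reference.lower().split())
    overlap = sum((pred & ref).values())  # Counter intersection clips the counts
    return _f_measure(overlap, sum(pred.values()), sum(ref.values()))


def rouge_l(prediction, reference):
    """F-measure over the longest common subsequence of tokens."""
    a, b = prediction.lower().split(), reference.lower().split()
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return _f_measure(prev[-1], len(a), len(b))


pred_text = "lytic lesion with cortical destruction"
ref_text = "lytic lesion with periosteal reaction"
r1 = rouge_n(pred_text, ref_text, n=1)
r2 = rouge_n(pred_text, ref_text, n=2)
rl = rouge_l(pred_text, ref_text)
```

These scores only measure lexical overlap with the reference text, which is why high values against synthetic references do not by themselves indicate clinical correctness.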

	## How to use

	```python
	import torch
	from transformers import AutoTokenizer, AutoModel
	from PIL import Image
	import torchvision.transforms as T

	model_path = "javierespantaleon/BoneVision-8B"

	tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
	model = AutoModel.from_pretrained(
	    model_path,
	    torch_dtype=torch.bfloat16,
	    trust_remote_code=True,
	    device_map="auto",
	).eval()

	# Load and preprocess image
	image = Image.open("xray.jpg").convert("RGB")
	transform = T.Compose([
	    T.Resize((448, 448)),
	    T.ToTensor(),
	    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
	])
	pixel_values = transform(image).unsqueeze(0).to(torch.bfloat16).cuda()

	# Build prompt
	prompt = (
	    "<image>\n"
	    "Provide a detailed radiological report and locate the lesion.\n"
	    "Context: Patient: 45/M, Location: Femur.\n"
	    "Your response must follow this exact format:\n"
	    " Diagnosis: <one of: giant cell tumor | multiple osteochondromas | "
	    " osteochondroma | synovial osteochondroma | osteofibroma | "
	    " osteosarcoma | simple bone cyst>\n"
	    " Findings: <detailed radiological description>"
	)

	generation_config = dict(max_new_tokens=256, do_sample=False)
	response = model.chat(tokenizer, pixel_values, prompt, generation_config)
	print(response)
	```

	Example output:
	```
	Diagnosis: osteosarcoma
	Findings: The imaging reveals a malignant osteosarcoma characterised by an
	aggressive bone lesion with cortical destruction and soft tissue extension.
	The tumor exhibits a mixed lytic and sclerotic pattern, with associated
	periosteal reaction and possible Codman triangle.
	```
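Downstream code usually needs the two fields separately. The parser below is a minimal sketch that assumes the response follows the `Diagnosis:` / `Findings:` layout shown above; `parse_report` and `CLASSES` are hypothetical names, and returning `None` for an unrecognised diagnosis is an illustrative choice rather than the author's method.

```python
import re

# The 7 class names accepted as a valid diagnosis.
CLASSES = {
    "giant cell tumor",
    "multiple osteochondromas",
    "osteochondroma",
    "synovial osteochondroma",
    "osteofibroma",
    "osteosarcoma",
    "simple bone cyst",
}


def parse_report(response):
    """Split a generated report into a validated diagnosis and the findings text."""
    diagnosis = re.search(r"Diagnosis:[ \t]*(.+)", response)
    findings = re.search(r"Findings:\s*(.+)", response, flags=re.DOTALL)

    label = diagnosis.group(1).strip().lower() if diagnosis else None
    if label not in CLASSES:
        label = None  # free-text or malformed answers are rejected

    # Collapse line breaks so the findings become a single paragraph.
    text = " ".join(findings.group(1).split()) if findings else None
    return {"diagnosis": label, "findings": text}


example = (
    "Diagnosis: osteosarcoma\n"
    "Findings: The imaging reveals a malignant osteosarcoma characterised by an\n"
    "aggressive bone lesion with cortical destruction and soft tissue extension."
)
report = parse_report(example)
```

Validating the label against the known class list makes it easy to count malformed generations separately from misclassifications.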

	## Training details

	| Hyperparameter | Value |
	|----------------|-------|
	| LoRA rank (r) | 32 |
	| LoRA alpha (α) | 64 |
	| LoRA target modules | q/k/v/o/gate/up/down_proj |
	| Learning rate (LLM) | 2×10⁻⁴ |
	| Learning rate (MLP) | 2×10⁻⁵ |
	| Batch size | 8 |
	| Epochs | 5 (early stopping) |
	| Optimizer | AdamW |
	| Hardware | NVIDIA A100 40GB (Google Colab) |
	| Training time | ~4 h |
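The LoRA rows of the table could be expressed as a configuration like the one below. This is a sketch under assumptions: the card does not name the training library, so the use of Hugging Face PEFT's `LoraConfig` is a guess, and settings not listed in the table (such as dropout) are left at library defaults.

```python
# Hyperparameters copied from the table above.
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # feed-forward projections
    ],
}

# In standard LoRA the low-rank update is scaled by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]

try:
    # Assumption: PEFT is installed; it is not a stated dependency of this model.
    from peft import LoraConfig

    lora_config = LoraConfig(**lora_kwargs)
except ImportError:
    lora_config = None
```

With r=32 and α=64 the scaling factor is 2, so the adapter update is applied at twice the magnitude of the raw low-rank product.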

	## Limitations

	- Synthetic references: ROUGE metrics use GPT-4o generated reports as
	ground truth, not real radiologist annotations. Text quality is validated
	against a proxy, not a clinical gold standard.
	- Dataset distribution: BTXRD is heavily skewed toward osteochondroma
	(≈43% of test samples). Performance on rare classes (osteofibroma,
	synovial osteochondroma) should be interpreted cautiously.
	- Not for clinical use: This model is a research prototype and has not
	been validated for clinical decision support.
	- Language: The model generates findings in English regardless of
	prompt language.

	## Citation

	If you use this model, please cite:

	```bibtex
	@article{wang2025internvl3_5,
	title={InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency},
	author={Wang, Weiyun and Gao, Zhangwei and Gu, Lixin and Pu, Hengjun and Cui, Long and Wei, Xingguang and Liu, Zhaoyang and Jing, Linglin and Ye, Shenglong and Shao, Jie and others},
	journal={arXiv preprint arXiv:2508.18265},
	year={2025}
	}
	```

	## License

	MIT — same as the base InternVL3 model.