LLaVA-Next-Med-OLAB
A LLaVA-Next backbone trained with LLaVA-Med's curriculum, built by OLAB at NYU Langone Health
Model Details
We combined the backbone and pretraining of LLaVA-Next (alternative link) with the staged medical curriculum of the original LLaVA-Med as part of our work on Repurposing the scientific literature with vision-language models. This model served as an intermediate step in training CNS-Obsidian.
Model Description
- Developed by: @alyakin314 (@NYU-OLAB)
- Model type: Autoregressive Image-Text-to-Text model (VLM)
- Language(s) (NLP): English
- License: See below.
- Finetuned from model: llava-hf/llava-v1.6-34b-hf (alternative link: liuhaotian/llava-v1.6-34b)
- Model date: Trained in September 2024, arXiv'd in February 2025, model weights made public in July 2025.
Model Sources
- Repository: alyakin314/CNS-Obsidian
- Paper: CNS-Obsidian: A Neurosurgical Vision-Language Model Built From Scientific Publications
License
This model may be subject to multiple licenses; where their terms differ, the strictest license terms apply:
- NousResearch/Nous-Hermes-2-Yi-34B: Apache License 2.0
- LLaVA-Next: Apache License 2.0
- LLaVA-Med Data: CC BY NC 4.0
- LLaVA-Med (if relevant): Microsoft Research License Terms
Uses
Primary Intended Use
The data, code, and model checkpoints are intended to be used solely for (I) future research on vision-language processing and (II) reproducibility of the experimental results. The primary intended use is to support AI researchers reproducing and building on top of this work, much as we built on LLaVA-Next and LLaVA-Med. CNS-Obsidian and its associated models should be helpful for exploring various biomedical vision-language processing (VLP) and visual question answering (VQA) research questions.
Out-of-Scope Use
Any deployed use case of the model, commercial or otherwise, is out of scope. The data, code, and model checkpoints are intended for research use only and are not intended for deployed use in clinical care or for any clinical decision-making purposes.
Bias, Risks, and Limitations
This model was developed using English corpora and thus may be considered English-only. It is not suitable for use in any clinical setting. Under some conditions, the model may make inaccurate predictions and display limitations that may require additional mitigation strategies. In particular, it is likely to inherit many of the limitations of the model from which it is derived, LLaVA-Next (as well as LLaVA and LLaVA-Med).
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
How to Get Started with the Model
We recommend running the model with vLLM on at least two A100/H100 80 GB GPUs (or equivalent):
```python
from io import BytesIO

import requests
from PIL import Image
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "NYU-OLAB/LLaVA-Next-Med-OLAB"

tokenizer = AutoTokenizer.from_pretrained(model_path)
llm = LLM(
    model=model_path,
    tensor_parallel_size=2,  # shard across two 80 GB GPUs
    max_model_len=4096,
    dtype="bfloat16",
)

# Example image: a brain MRI from Radiopaedia.
url = "https://prod-images-static.radiopaedia.org/images/940/67dcac388ad1cc69f7252e9ab44516_big_gallery.jpeg"
image = Image.open(BytesIO(requests.get(url).content))

messages = [
    {
        "role": "user",
        "content": (
            "<image>\n"
            "What kind of scan is this? "
            "What pathology does the patient have? "
            "And what's their prognosis?"
        ),
    }
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

sampling_params = SamplingParams(
    temperature=0.0,
    top_p=0.95,
    max_tokens=100,
)
outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    },
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```
should produce:
This is an MRI (Magnetic Resonance Imaging) scan of the brain. The patient has a pathology called glioblastoma, which is a type of aggressive brain tumor. The prognosis for glioblastoma is generally poor, as it is a highly malignant tumor that tends to grow rapidly and invade surrounding brain tissue. Treatment options may include surgery, radiation therapy, and chemotherapy, but the overall survival rate for glioblastoma is low, with a median survival time of around 12 to 15 months after diagnosis. It's important to note that individual prognoses can vary depending on factors such as the patient's age, overall health, and the specific characteristics of the tumor.
Training Details
Training Data
This model builds upon LLaVA-Med, which in turn builds upon the PMC-15M dataset. PMC-15M is a large-scale parallel image-text dataset for biomedical vision-language processing, containing 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central. It covers a diverse range of biomedical image types, such as microscopy, radiography, histology, and more.
We obtained the Stage 1 and 2 training data using the data downloading script from the LLaVA-Med GitHub repository. Through this process, we were able to recover 467K biomedical image-text pairs for Stage 1 alignment and 56K instruction-following samples for Stage 2 fine-tuning (from the originally reported 500K and 60K respectively).
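As a sanity check, the download yield implied by these counts works out to roughly 93% for each stage (counts copied from the paragraph above):

```python
# Reported vs. recovered sample counts, as stated above.
reported = {"stage1_align": 500_000, "stage2_ift": 60_000}
recovered = {"stage1_align": 467_000, "stage2_ift": 56_000}

for stage in reported:
    rate = recovered[stage] / reported[stage]
    print(f"{stage}: {recovered[stage]:,} / {reported[stage]:,} = {rate:.1%}")
# stage1_align: 467,000 / 500,000 = 93.4%
# stage2_ift: 56,000 / 60,000 = 93.3%
```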
We followed paper-level data splits (no figures from the same paper appear in different splits) with 95% train / 2.5% val / 2.5% test.
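A paper-level split of this kind can be sketched as follows: shuffle the unique paper IDs, then assign every figure from a paper to a single split. This is a minimal illustration, not our actual preprocessing code, and the `paper_id` field is a hypothetical schema rather than the LLaVA-Med data format:

```python
import random

def paper_level_split(pairs, seed=0, train=0.95, val=0.025):
    """Split figure-caption pairs so that no paper spans two splits."""
    papers = sorted({p["paper_id"] for p in pairs})
    random.Random(seed).shuffle(papers)
    n = len(papers)
    train_ids = set(papers[: int(n * train)])
    val_ids = set(papers[int(n * train): int(n * (train + val))])
    splits = {"train": [], "val": [], "test": []}
    for p in pairs:
        if p["paper_id"] in train_ids:
            splits["train"].append(p)
        elif p["paper_id"] in val_ids:
            splits["val"].append(p)
        else:
            splits["test"].append(p)
    return splits

# Toy example: 200 figures drawn from 100 papers; figures from the
# same paper always land in the same split.
pairs = [{"paper_id": i // 2, "figure": i} for i in range(200)]
splits = paper_level_split(pairs)
```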
Training Procedure
We used a two-stage curriculum:
- Stage 1 – Medical alignment
- Freeze vision and language models; train projection layers only.
- Data: PMC-15M-style biomedical figure–caption pairs (LLaVA-Med Align).
- Stage 2 – General medical IFT
- Freeze vision model, train language model + projection layers.
- Data: PMC biomedical instruction-following conversations (LLaVA-Med IFT).
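The freezing pattern of the two stages can be sketched with plain PyTorch `requires_grad` flags. The module names below are illustrative stand-ins, not the actual LLaVA-Next attribute names:

```python
import torch.nn as nn

class ToyVLM(nn.Module):
    """Stand-in with the three blocks a LLaVA-style model has."""
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(8, 8)    # frozen in both stages
        self.mm_projector = nn.Linear(8, 8)    # trained in both stages
        self.language_model = nn.Linear(8, 8)  # trained in Stage 2 only

def set_trainable(model, stage):
    # Vision encoder stays frozen throughout; the projector always trains;
    # the language model is unfrozen only for Stage 2 instruction tuning.
    for p in model.vision_tower.parameters():
        p.requires_grad = False
    for p in model.mm_projector.parameters():
        p.requires_grad = True
    for p in model.language_model.parameters():
        p.requires_grad = (stage == 2)

model = ToyVLM()
set_trainable(model, stage=1)
```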
Training Hyperparameters
- Optimizer: AdamW (default parameters)
- Precision / regime: bf16 mixed precision
- Parallelism: PyTorch FSDP on 13 nodes (8 GPUs each, 104 GPUs total)
- Learning rate:
- Stage 1: 1e-3
- Stage 2: 1e-5
- Cosine LR schedule with warm-up
- Per-GPU microbatch: 4
- Gradient accumulation:
- Stage 1: 4
- Stage 2: 1
- Effective batch size:
- Stage 1: 1664
- Stage 2: 416
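The effective batch sizes above are simply per-GPU microbatch × gradient accumulation steps × GPU count; a quick arithmetic check using the numbers from this list:

```python
NODES, GPUS_PER_NODE = 13, 8
WORLD_SIZE = NODES * GPUS_PER_NODE   # 104 GPUs total
MICROBATCH = 4                       # per-GPU microbatch

def effective_batch(grad_accum_steps: int) -> int:
    """Global batch size seen by each optimizer step."""
    return MICROBATCH * grad_accum_steps * WORLD_SIZE

print(effective_batch(4))  # Stage 1: 4 * 4 * 104 = 1664
print(effective_batch(1))  # Stage 2: 4 * 1 * 104 = 416
```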
Speeds, Sizes, Times
| Stage | Approx. Duration per Epoch |
|---|---|
| Stage 1 | ~3.5 hours |
| Stage 2 | ~30 minutes |
Disclosure
Our work was performed and arXiv'd in parallel with LLaVA-NeXT-Med: Medical Multimodal Large Language Model by Yunfei Guo and Wu Huang. It is NOT the same model, but it was trained in a very similar fashion. We added the -OLAB clarifier to our model's name to avoid confusion.
BibTeX entry and citation info
@misc{alyakin2025cnsobsidian,
  title={Repurposing the scientific literature with vision-language models},
  author={Anton Alyakin and Jaden Stryker and Daniel Alexander Alber and Karl L. Sangwon and Jin Vivian Lee and Brandon Duderstadt and Akshay Save and David Kurland and Spencer Frome and Shrutika Singh and Jeff Zhang and Eunice Yang and Ki Yun Park and Cordelia Orillac and Aly A. Valliani and Sean Neifert and Albert Liu and Aneek Patel and Christopher Livia and Darryl Lau and Ilya Laufer and Peter A. Rozman and Eveline Teresa Hidalgo and Howard Riina and Rui Feng and Todd Hollon and Yindalon Aphinyanaphongs and John G. Golfinos and Laura Snyder and Eric Leuthardt and Douglas Kondziolka and Eric Karl Oermann},
  year={2025},
  eprint={2502.19546},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2502.19546},
}