AstroLLaVA Stage-2 (connector + LoRA instruction tuning)

A LLaVA-style vision–language model that lets Qwen2.5-1.5B-Instruct answer questions about astronomy images encoded by CLIP ViT-L/14. This is the Stage-2 model: it warm-starts from the Stage-1 connector and continues training it jointly with LoRA adapters on the Qwen LLM, using the caption + GPT-4 QA records of UniverseTBD/AstroLLaVA_convos. The CLIP vision tower stays frozen. A disjoint, per-image test split is held out of training so the model can be evaluated on unseen images.

Stage 1 aligned the connector while the LLM stayed frozen: the result grounds coarse visual structure but hallucinates fine specifics. Stage 2 opens up the LLM (via LoRA) so the model learns to use the visual evidence when committing to answers. This is the instruction-tuning step of the LLaVA recipe.

⚠️ This bundle ships the connector + LoRA adapter only (not full LLM weights). It is not a standalone transformers model — it needs the custom VLM code from the astronomy-vlm repo, the two base models (auto-downloaded from the Hub), and peft to run.

Download

A single bundle holds the final checkpoint and everything needed to run / reproduce it:

Bundle	Contents
`astrollava-stage2.zip`	`checkpoint-2526/` (`connector.safetensors` + `lora/`), `predictions_test_stage2.jsonl`, `finetune_astrollava_stage2.yaml`, `test.json`, `REPRODUCE.md`

checkpoint-2526/ contains the connector after continued training (connector.safetensors), the trained LoRA adapter (lora/adapter_model.safetensors + adapter_config.json), optimizer/scheduler state (training_state.pt), and meta.json (step + final loss). Both the connector and the LoRA are required at inference.
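A quick sanity check after unzipping can confirm that the files listed above are actually present. The sketch below is not part of the astronomy-vlm repo: `missing_files` is a hypothetical helper, and it assumes adapter_config.json sits inside lora/ next to the adapter weights, which is the usual PEFT layout.

```python
from pathlib import Path

# File names come from the bundle description above. The location of
# adapter_config.json inside lora/ is an assumption (usual PEFT layout).
REQUIRED_FOR_INFERENCE = [
    "connector.safetensors",
    "lora/adapter_model.safetensors",
    "lora/adapter_config.json",
]
OTHER_FILES = ["training_state.pt", "meta.json"]


def missing_files(checkpoint_dir, names):
    """Return the entries of `names` that are not files under checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    ckpt = "ckpt2/checkpoint-2526"  # path used in the Usage section
    print("missing for inference:", missing_files(ckpt, REQUIRED_FOR_INFERENCE))
    print("missing extras:", missing_files(ckpt, OTHER_FILES))
```

An empty first list means both pieces needed at inference (connector and LoRA) were found.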

Architecture

image ─► CLIP ViT-L/14 (FROZEN) ─► MLP connector (TRAINED, init from Stage-1) ─► Qwen2.5-1.5B + LoRA (base FROZEN, LoRA TRAINED) ─► text
                                    1024 → 1536 → 1536

Vision: openai/clip-vit-large-patch14, penultimate-layer patch features (frozen)
Connector: 2-layer MLP with GELU, 1024→1536→1536; warm-started from Stage-1 checkpoint-3789 and kept trainable
LLM: Qwen/Qwen2.5-1.5B-Instruct, base frozen + LoRA adapters (r=16, α=32, dropout 0.05) on q/k/v/o/gate/up/down_proj across all 28 layers
Trainable / total: 22,400,000 / 1,868,879,360 (1.20%) — connector 3,935,232 + LoRA 18,464,768
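The trainable-parameter figures above can be re-derived with a few lines of arithmetic. This is a sketch under stated assumptions: the connector's linear layers are taken to have biases (the reading that reproduces the stated count), and the Qwen2.5-1.5B projection shapes (hidden size 1536, MLP width 8960, key/value width 256) are taken from the public model config rather than from this card.

```python
def linear_params(d_in, d_out, bias=True):
    """Parameter count of a dense layer."""
    return d_in * d_out + (d_out if bias else 0)


# Connector: Linear(1024 -> 1536) + GELU + Linear(1536 -> 1536), biases assumed.
connector = linear_params(1024, 1536) + linear_params(1536, 1536)

# Qwen2.5-1.5B projection shapes as (d_in, d_out); assumed from the public config.
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 1536, 8960, 256, 28
PROJ_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}
RANK = 16

# LoRA adds A (r x d_in) and B (d_out x r) to each wrapped projection, no bias.
lora_per_layer = sum(RANK * (d_in + d_out) for d_in, d_out in PROJ_SHAPES.values())
lora = lora_per_layer * LAYERS

trainable = connector + lora
total = 1_868_879_360  # total parameter count quoted above
print(connector, lora, trainable, f"{100 * trainable / total:.2f}%")
# prints: 3935232 18464768 22400000 1.20%
```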

Training


Data	`UniverseTBD/AstroLLaVA_convos`, same per-image held-out split as Stage-1: train 161,653 records / 29,151 images; test 3,271 records / 591 images
Initialization	connector ← Stage-1 `checkpoint-3789` (epoch 3); LoRA ← fresh (no-op init)
Objective	next-token cross-entropy on answer tokens only (connector + LoRA trainable)
Epochs / steps	1 epoch, 2,526 update steps
Effective batch	64 (per-device 4 × grad-accum 16)
LR / schedule	2e-4, cosine with 3% warmup (75 steps)
Max length	512 (+256 image tokens)
Precision	bf16 (autocast) + gradient checkpointing
Hardware	1× RTX 6000 Ada (48 GB), ~15 samples/s (~3 h)
Loss	began ~1.47 (equal to the Stage-1 warm-start, since LoRA initializes as a no-op); final value in `checkpoint-2526/meta.json`
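The "no-op init" in the table refers to how a fresh LoRA adapter starts: with the low-rank matrix B set to zero, the update B·A contributes nothing, so the adapted layer reproduces the frozen base layer exactly and the loss starts where Stage-1 left off. The toy example below illustrates this in plain Python. It assumes the default PEFT-style initialization (A random, B zero); the sizes and helper names are illustrative and not taken from the repo.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x): frozen base output plus low-rank update."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


rng = random.Random(0)
d_in, d_out, r, alpha = 6, 4, 2, 4  # toy sizes; the model uses r=16, alpha=32
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base
A = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # random init
B = [[0.0] * r for _ in range(d_out)]                               # zero init
x = [rng.gauss(0, 1) for _ in range(d_in)]

# With B = 0 the adapter adds exactly zero, so the outputs are identical.
print(lora_forward(W, A, B, x, alpha, r) == matvec(W, x))
```

Once training updates B away from zero, the second term starts to shift the layer's output while W itself stays untouched.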

The full-LLM backward pass (absent in Stage-1) is the memory driver, hence per-device batch 4 with gradient checkpointing to fit in 48 GB. One epoch is the LLaVA instruction-tuning convention: the model only needs to learn to use the already-aligned visual features, not to align them from scratch.
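The step counts, warmup length, image-token count, and wall-clock estimate in the training table are mutually consistent, which the arithmetic below checks. It is a sketch: rounding the warmup down to a whole step and the exact shape of the schedule (linear warmup, cosine decay to zero) are assumptions about the trainer, and `lr_at` is a hypothetical helper.

```python
import math

TRAIN_RECORDS = 161_653
PER_DEVICE_BATCH, GRAD_ACCUM = 4, 16
PEAK_LR, WARMUP_FRACTION = 2e-4, 0.03
SAMPLES_PER_SEC = 15          # approximate throughput from the table
IMAGE_SIZE, PATCH_SIZE = 224, 14

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM            # 64
total_steps = math.ceil(TRAIN_RECORDS / effective_batch)   # 2526
warmup_steps = int(WARMUP_FRACTION * total_steps)          # 75 (floor: assumption)
image_tokens = (IMAGE_SIZE // PATCH_SIZE) ** 2             # 256 patch tokens
hours = TRAIN_RECORDS / SAMPLES_PER_SEC / 3600             # roughly 3


def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, total_steps, warmup_steps, image_tokens, round(hours, 1))
```

With 256 image tokens added to a 512-token text budget, each sequence can reach 768 positions, which is part of why activation memory dominates at per-device batch 4.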

Usage

# 1. get the code
git clone https://github.com/crimsonKn1ght/astronomy-vlm && cd astronomy-vlm
pip install -r requirements.txt        # includes peft

# 2. download + unzip the bundle
hf download grKnight/astrollava-stage2 astrollava-stage2.zip --local-dir .
unzip astrollava-stage2.zip -d ckpt2

# 3. answer a question about an image (CLIP + Qwen auto-download; peft loads the LoRA)
python inference.py \
  --config ckpt2/finetune_astrollava_stage2.yaml \
  --checkpoint ckpt2/checkpoint-2526 \
  --image your_astro_image.jpg \
  --prompt "What type of object is this and what is notable about it?" \
  --temperature 0

Pass the Stage-2 config so the LoRA modules are built before the adapter weights load; the loader then restores both the connector and the LoRA automatically. The bundled predictions_test_stage2.jsonl holds the held-out outputs with their reference captions.

Capabilities & limitations

Stage 2 fine-tunes the LLM (LoRA) jointly with the connector, so, unlike in Stage-1, the language model itself learns from the QA pairs rather than improvising specifics from its frozen prior. The intended effect is fewer hallucinated fine details (catalog numbers, instruments, dates) on question-answering prompts, on top of Stage-1's coarse visual grounding. To see the difference, compare the bundled predictions_test_stage2.jsonl with Stage-1's predictions_test_ep3.jsonl (held out, same images).
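One way to line up the two prediction files for a side-by-side read is sketched below. The record schema of the JSONL files is not documented here, so the field names are deliberately left as a parameter: `key_fields` is hypothetical and should be replaced with whatever fields uniquely identify a record once the real keys have been inspected.

```python
import json


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def pair_records(stage1_rows, stage2_rows, key_fields):
    """Pair records from both files that share the same values under key_fields.

    key_fields is a tuple of field names (hypothetical; check the real schema).
    Records present in only one file are skipped.
    """
    def key(row):
        return tuple(row[f] for f in key_fields)

    stage1_index = {key(row): row for row in stage1_rows}
    return [(stage1_index[key(row)], row)
            for row in stage2_rows if key(row) in stage1_index]


# Usage sketch (paths follow the Usage section; field names must be checked first):
#   rows2 = load_jsonl("ckpt2/predictions_test_stage2.jsonl")
#   print(sorted(rows2[0].keys()))   # discover the actual field names
```

Because the test split has several records per image, an image identifier alone is probably not a unique key; combining it with the question text is one option if both fields exist.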

Limitations carried over from the design: CLIP's 224×224 input discards fine astronomical detail; the base LLM is small (1.5B); and LoRA is a low-rank adaptation, not a full fine-tune. Evaluation is a held-out generation set, not a full quantitative benchmark — read results qualitatively.

Reproduction

The bundle's REPRODUCE.md pins the exact code commit, base models, the seeded dataset-build command, the training command, and package versions (torch, transformers, peft). The split is seeded, so the build reproduces the exact train/test partition.

prereq: Stage-1 connector checkpoint-3789 (grKnight/astrollava-stage1 ep3 bundle)
build:  python scripts/build_astrollava_trainset.py --include-qa --max-image-size 384 --test-fraction 0.02 --seed 42
train:  python train.py --config configs/finetune_astrollava_stage2.yaml
eval:   python scripts/batch_inference.py --config configs/finetune_astrollava_stage2.yaml --records-json datasets/astrollava_llava/test.json --num-samples 0 ...
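The idea behind a seeded, per-image split can be illustrated as follows. This is not the algorithm in build_astrollava_trainset.py, which is not shown here, and it will not necessarily reproduce the 591-image test set; the `image` field name and the `split_by_image` helper are hypothetical. It only demonstrates the two properties the card relies on: the same seed gives the same partition, and no image lands in both splits.

```python
import random


def split_by_image(records, test_fraction, seed, image_key="image"):
    """Assign whole images to train or test so no image appears in both."""
    # Sort before shuffling so the result does not depend on set ordering.
    images = sorted({rec[image_key] for rec in records})
    rng = random.Random(seed)
    rng.shuffle(images)
    n_test = round(test_fraction * len(images))
    test_images = set(images[:n_test])
    train = [r for r in records if r[image_key] not in test_images]
    test = [r for r in records if r[image_key] in test_images]
    return train, test


# Synthetic data: 100 images with 3 records each (illustrative only).
records = [{"image": f"img_{i:03d}.jpg", "turn": t} for i in range(100) for t in range(3)]
train, test = split_by_image(records, test_fraction=0.02, seed=42)
print(len(train), len(test))
```

Splitting by image rather than by record matters here because each image carries several QA records; a record-level split would leak test images into training.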

License & attribution

Weights: CC-BY-SA-4.0, inherited from the training data.
Training data: UniverseTBD/AstroLLaVA_convos (CC-BY-SA-4.0); imagery from NASA APOD, ESO, and NASA/ESA Hubble.
Base models: Qwen2.5-1.5B-Instruct (Apache-2.0), CLIP ViT-L/14 (OpenAI, MIT).
Builds on: AstroLLaVA Stage-1 and the AstroLLaVA work (arXiv:2504.08583).