---
license: apache-2.0
language:
- en
tags:
- multimodal
- vision-language
- vlm
- artemis
- schneewolf-labs
- preview
- stage1
- projector-only
base_model:
- schneewolflabs/A2
- Qwen/Qwen3-VL-2B-Instruct
datasets:
- BLIP3o/BLIP3o-Pretrain-Long-Caption
pipeline_tag: image-text-to-text
---
# A3-preview
> **Project Artemis: Stage-1 alignment proof-of-concept.**
> *This is a preview, not a production VLM.* It demonstrates that the
> Schneewolf Labs A-series text decoder can be successfully extended to
> vision-language with a small learned projector. A3-preview is the
> training milestone *between* A2 (text-only flagship) and A3 (the real
> multimodal release).
## What this is
A LLaVA-style graft assembling three pieces:
| Component | Source | Role |
|---|---|---|
| **Vision tower** | `Qwen/Qwen3-VL-2B-Instruct` (ViT, ~600M params) | Image → visual feature tokens |
| **Projector** | Fresh 2-layer MLP, ~45M params | Visual hidden → text hidden bridge |
| **Language model** | `schneewolflabs/A2` (~12B params) | Unchanged decoder |

Only the projector was trained. The vision tower and decoder are frozen
exactly as published, so A2's text capabilities (reasoning, tool calls,
identity, Qwen3 chat template support) are preserved by construction.
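
To make the size of the trained component concrete, the sketch below counts the parameters of a generic 2-layer MLP projector and the share of the assembled model that is trainable. The layer widths are hypothetical (this card does not state them), and the component sizes are the approximate figures from the table above.

```python
def mlp_projector_params(d_vision: int, d_hidden: int, d_text: int) -> int:
    """Weights + biases of Linear(d_vision, d_hidden) -> activation -> Linear(d_hidden, d_text)."""
    first_layer = d_vision * d_hidden + d_hidden
    second_layer = d_hidden * d_text + d_text
    return first_layer + second_layer


def trainable_fraction(projector: float, vision_tower: float, decoder: float) -> float:
    """Share of all parameters that receive gradients when only the projector is trained."""
    return projector / (projector + vision_tower + decoder)


# Hypothetical widths, chosen only for illustration (not the real A3-preview dimensions).
example_params = mlp_projector_params(d_vision=2048, d_hidden=4096, d_text=4096)
print(f"example projector: {example_params / 1e6:.1f}M parameters")

# Approximate sizes from the component table: ~45M projector, ~600M vision tower, ~12B decoder.
share = trainable_fraction(45e6, 600e6, 12e9)
print(f"trainable share: {share:.2%}")
```

With the approximate sizes from the table, well under 1% of the assembled model receives gradient updates.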
## Training details
| Setting | Value |
|---|---|
| Corpus | `BLIP3o/BLIP3o-Pretrain-Long-Caption` (25,000 streamed samples) |
| Optimizer | AdamW (fp32 moments), lr 1e-3 cosine to 0 |
| Effective batch | 8 (bs=2 × grad_accum=4) |
| Steps | 3,094 (1 epoch) |
| Precision | bfloat16 |
| Wall clock | ~3.4 hours on a single NVIDIA GB10 (DGX Spark) |
| Train loss | 5.44 → **0.88** |
| Eval loss | **0.77** on held-out BLIP3o (lower than train loss, so no sign of memorization) |
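
As a rough illustration of the schedule and batch arithmetic in the table, the sketch below computes a plain cosine decay from 1e-3 to 0 over 3,094 steps and the number of samples consumed. It assumes no warmup phase, which the table does not mention either way, so the real schedule may differ in its first steps.

```python
import math


def cosine_lr(step: int, total_steps: int = 3094, peak_lr: float = 1e-3) -> float:
    """Cosine decay from peak_lr at step 0 to 0 at total_steps (assumes no warmup)."""
    progress = min(max(step, 0), total_steps) / total_steps
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


def samples_seen(steps: int = 3094, micro_batch: int = 2, grad_accum: int = 4) -> int:
    """Samples consumed after `steps` optimizer steps at the given effective batch size."""
    return steps * micro_batch * grad_accum


for step in (0, 1547, 3094):
    print(step, cosine_lr(step))
print("samples seen:", samples_seen())
```

At an effective batch of 8, 3,094 optimizer steps consume 24,752 samples, just under the 25,000 streamed.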
## What works
Tested on a small held-out battery (BLIP3o + entirely out-of-distribution
Japanese photos). The projector is **image-grounded**: captions describe
what's actually in each image, including specific named objects on OOD inputs
(brand text on bottles, identification of a "Gundam statue" at a specific
"Lalaport" mall, etc.). This is what we hoped for from Stage-1 alignment
and it sets up a real Stage-1 run.
## What this is *not*
- **Not a production VLM.** 25k samples is a fraction of what serious
projector alignment needs (LLaVA-1.5 used 558k; LLaVA-NeXT used 1.3M).
- **Captions stay close to "describe the image" patterns.** Visual
reasoning, OCR, VQA, multi-image, and detailed counting were not trained
for and won't work reliably.
- **No instruction tuning on multimodal data yet.** That's Stage-2.
- **No safety / refusal tuning** on visual inputs.
## What's next
- **A3**: full Stage-1 (~1M samples on BLIP3o-Long-Caption), currently training on
a single NVIDIA GB10. A3 is the projector-aligned successor to A3-preview.
- **Artemis**: Stage-2, multimodal instruction full fine-tuning (FFT) with text
rehearsal so that A2's reasoning, tool calling, and identity survive. The named
flagship multimodal release after A3.
## Install
```bash
pip install 'artemis-vlm @ git+https://github.com/Schneewolf-Labs/Artemis.git@v0.1.0'
```
The [`artemis-vlm`](https://github.com/Schneewolf-Labs/Artemis) package contains
the model definition, processor, and data collator. On import, it registers
`artemis_vlm` with Hugging Face `AutoConfig` and `AutoModelForCausalLM`, so
`from_pretrained()` resolves without `trust_remote_code`.
## Usage
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

import artemis_vlm  # registers ArtemisVLM with AutoConfig / AutoModel

model = AutoModelForCausalLM.from_pretrained(
    "schneewolflabs/A3-preview", dtype=torch.bfloat16,
).to("cuda").eval()
tok = AutoTokenizer.from_pretrained("schneewolflabs/A3-preview")
processor = artemis_vlm.ArtemisVLMProcessor(
    tokenizer=tok, vision_config=model.visual.config,
    min_pixels=32 * 32, max_pixels=512 * 512,
)

# Qwen3 chat-template style multimodal message
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in detail."},
]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

image = Image.open("your_image.jpg")
batch = processor(text=text, images=[image], return_tensors="pt").to("cuda")

with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][batch["input_ids"].shape[1]:], skip_special_tokens=True))
```
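
The `min_pixels` / `max_pixels` arguments bound how many visual tokens an image can contribute to the prompt. The sketch below estimates that range under two assumptions that this card does not state: a 16-pixel ViT patch and a 2×2 spatial merge, so that one visual token covers a 32×32 pixel area. The real processor may round image dimensions differently, so treat the result as an estimate.

```python
def visual_token_budget(
    min_pixels: int,
    max_pixels: int,
    patch_size: int = 16,  # assumed ViT patch edge in pixels
    merge_size: int = 2,   # assumed spatial merge factor per axis
) -> tuple[int, int]:
    """Approximate (min, max) visual tokens per image for the given pixel bounds."""
    pixels_per_token = (patch_size * merge_size) ** 2
    lowest = max(1, min_pixels // pixels_per_token)
    highest = max(1, max_pixels // pixels_per_token)
    return lowest, highest


# Pixel bounds taken from the usage snippet above.
print(visual_token_budget(32 * 32, 512 * 512))
```

Under these assumptions, raising `max_pixels` trades more visual detail for a longer prompt.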
## Architecture notes
A3-preview uses the *Path B* (composition, not modification) approach to
extending a text LLM into a VLM: the decoder is untouched, the vision
encoder is taken intact from a pretrained VLM, and only the projector
between them is new. This keeps the underlying text model's reasoning,
tool, and identity capabilities exactly as in A2: the multimodal
addition cannot regress text capability because the text computation
path is byte-identical.
Image tokens are inserted using A2's repurposed reserved-token layout
(`<|image_pad|>` is token id 22; see the A1 release notes for the
full token-id allocation across `<think>`, `<tool_call>`, vision, etc.).
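
A minimal sketch of how LLaVA-style models typically splice projected image features into the decoder input follows. It uses plain Python lists in place of tensors; the function name is hypothetical and this is not the `artemis-vlm` implementation. Only the token id 22 for `<|image_pad|>` comes from this card.

```python
IMAGE_PAD_ID = 22  # `<|image_pad|>` token id, as stated above


def splice_image_features(input_ids, text_embeds, image_embeds, image_pad_id=IMAGE_PAD_ID):
    """Return a copy of text_embeds whose rows at image-pad positions are
    replaced, in order, by the projected image features."""
    positions = [i for i, tok in enumerate(input_ids) if tok == image_pad_id]
    if len(positions) != len(image_embeds):
        raise ValueError(
            f"{len(positions)} image-pad tokens but {len(image_embeds)} image features"
        )
    merged = [list(row) for row in text_embeds]
    for pos, feature in zip(positions, image_embeds):
        merged[pos] = list(feature)
    return merged


# Toy example: two pad slots inside a four-token prompt, 1-dimensional embeddings.
ids = [5, IMAGE_PAD_ID, IMAGE_PAD_ID, 9]
text = [[0.1], [0.0], [0.0], [0.9]]
print(splice_image_features(ids, text, [[1.0], [2.0]]))
```

In this sketch, a prompt with no image-pad tokens passes through with its embeddings unchanged, which is the sense in which a composition approach leaves the text path untouched.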
## License
Apache 2.0. Same as A1, A2, and the underlying Qwen3-VL vision tower.
## Acknowledgements
- BLIP3o team for the Long-Caption pretraining corpus
- Qwen team for the Qwen3-VL vision encoder
- LLaVA project for the architectural template

Schneewolf Labs · Project Artemis