	---
	license: apache-2.0
	language:
	- en
	tags:
	- multimodal
	- vision-language
	- vlm
	- artemis
	- schneewolf-labs
	- preview
	- stage1
	- projector-only
	base_model:
	- schneewolflabs/A2
	- Qwen/Qwen3-VL-2B-Instruct
	datasets:
	- BLIP3o/BLIP3o-Pretrain-Long-Caption
	pipeline_tag: image-text-to-text
	---

	# A3-preview

	> Project Artemis — Stage-1 alignment proof-of-concept.
	> This is a preview, not a production VLM. It demonstrates that the
	> Schneewolf Labs A-series text decoder can be successfully extended to
	> vision-language with a small learned projector. A3-preview is the
	> training milestone between A2 (text-only flagship) and A3 (the real
	> multimodal release).

	## What this is

	A LLaVA-style graft assembling three pieces:

	| Component | Source | Role |
	|---|---|---|
	| Vision tower | `Qwen/Qwen3-VL-2B-Instruct` (ViT, ~600M params) | Image → visual feature tokens |
	| Projector | Fresh 2-layer MLP, ~45M params | Visual hidden → text hidden bridge |
	| Language model | `schneewolflabs/A2` (~12B params) | Unchanged decoder |

	Only the projector was trained. The vision tower and decoder are frozen
	exactly as published, so A2's text capabilities (reasoning, tool calls,
	identity, Qwen3 chat template support) are preserved by construction.
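To put "only the projector was trained" in proportion, the short sketch below computes the trainable share of the assembled model from the approximate parameter counts in the table above. The dictionary layout and function name are illustrative only and are not part of `artemis-vlm`.

```python
# Approximate parameter counts taken from the component table above.
# Each entry is (parameter_count, is_trainable).
COMPONENTS = {
    "vision_tower": (600e6, False),   # frozen
    "projector": (45e6, True),        # the only trained piece
    "language_model": (12e9, False),  # frozen
}


def trainable_share(components):
    """Return the fraction of all parameters that receive gradient updates."""
    total = sum(count for count, _ in components.values())
    trainable = sum(count for count, is_trainable in components.values() if is_trainable)
    return trainable / total


share = trainable_share(COMPONENTS)
print(f"trainable share: {share:.2%}")  # roughly 0.36% with the rounded counts above
```

Because the inputs are rounded figures, the result is an order-of-magnitude estimate rather than an exact count.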

	## Training details

	| Setting | Value |
	|---|---|
	| Corpus | `BLIP3o/BLIP3o-Pretrain-Long-Caption` (25,000 streamed samples) |
	| Optimizer | AdamW (fp32 moments), lr 1e-3 with cosine decay to 0 |
	| Effective batch | 8 (bs=2 × grad_accum=4) |
	| Steps | 3,094 (1 epoch) |
	| Precision | bfloat16 |
	| Wall clock | ~3.4 hours on a single NVIDIA GB10 (DGX Spark) |
	| Train loss | 5.44 → 0.88 |
	| Eval loss | 0.77 on held-out BLIP3o (lower than train loss, so the projector is not memorizing) |
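As a rough illustration of the schedule in the table, the sketch below evaluates a plain cosine decay from 1e-3 to 0 over 3,094 steps and derives the effective batch size. It assumes no warmup phase, which the table does not specify, and the function and constant names are made up for this example.

```python
import math

BASE_LR = 1e-3       # peak learning rate from the table
TOTAL_STEPS = 3094   # optimizer steps from the table
MICRO_BATCH = 2      # per-step batch size (bs)
GRAD_ACCUM = 4       # gradient accumulation steps


def cosine_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS):
    """Cosine decay from base_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    progress = min(max(step, 0), total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


effective_batch = MICRO_BATCH * GRAD_ACCUM  # samples contributing to each optimizer step

for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>4}: lr = {cosine_lr(step):.6f}")
print("effective batch:", effective_batch)
```

The learning rate is halfway down at the midpoint of training and reaches zero on the final step.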

	## What works

	Tested on a small held-out battery (BLIP3o + entirely out-of-distribution
	Japanese photos). The projector is image-grounded — captions describe
	what's actually in each image, including specific named objects on OOD inputs
	(brand text on bottles, identification of a "Gundam statue" at a specific
	"Lalaport" mall, etc.). This is what we hoped for from Stage-1 alignment
	and it sets up a real Stage-1 run.

	## What this is not

	- Not a production VLM. 25k samples is a fraction of what serious
	projector alignment needs (LLaVA-1.5 used 558k; LLaVA-NeXT used 1.3M).
	- Captions stay close to "describe the image" patterns. Visual
	reasoning, OCR, VQA, multi-image, and detailed counting were not trained
	for and won't work reliably.
	- No instruction tuning on multimodal data yet. That's Stage-2.
	- No safety / refusal tuning on visual inputs.

	## What's next

	- A3 — full Stage-1 (~1M samples on BLIP3o-Long-Caption) currently training on
	a single NVIDIA GB10. A3 is the projector-aligned successor to A3-preview.
	- Artemis — Stage-2 (multimodal instruction FFT with text rehearsal so A2's
	reasoning / tool calling / identity survive). The named flagship multimodal
	release after A3.

	## Install

	```bash
	pip install 'artemis-vlm @ git+https://github.com/Schneewolf-Labs/Artemis.git@v0.1.0'
	```

	The [`artemis-vlm`](https://github.com/Schneewolf-Labs/Artemis) package contains
	the model definition, processor, and data collator. On import, it registers
	`artemis_vlm` with Hugging Face `AutoConfig` and `AutoModelForCausalLM`, so
	`from_pretrained()` resolves without `trust_remote_code`.
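Since the registration happens as a side effect of importing the package, it can help to confirm the module is importable before loading the checkpoint. The helper below is a minimal sketch using only the standard library; `package_available` is a hypothetical name, not something `artemis-vlm` provides.

```python
import importlib.util


def package_available(module_name: str) -> bool:
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if not package_available("artemis_vlm"):
    print("artemis_vlm is not installed; run the pip command above first.")
```

Note that finding the module is not the same as importing it: the `import artemis_vlm` line is still required for the Auto classes to know about the model.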

	## Usage

	```python
	import torch
	from PIL import Image
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import artemis_vlm  # registers ArtemisVLM with AutoConfig / AutoModelForCausalLM

	# Load the composed model in bfloat16 and move it to the GPU
	model = AutoModelForCausalLM.from_pretrained(
	    "schneewolflabs/A3-preview", dtype=torch.bfloat16,
	).to("cuda").eval()

	tok = AutoTokenizer.from_pretrained("schneewolflabs/A3-preview")
	processor = artemis_vlm.ArtemisVLMProcessor(
	    tokenizer=tok, vision_config=model.visual.config,
	    min_pixels=32 * 32, max_pixels=512 * 512,
	)

	# Qwen3 chat-template style multimodal message
	messages = [{"role": "user", "content": [
	    {"type": "image"},
	    {"type": "text", "text": "Describe this image in detail."},
	]}]
	text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

	image = Image.open("your_image.jpg")
	batch = processor(text=text, images=[image], return_tensors="pt").to("cuda")
	with torch.no_grad():
	    out = model.generate(**batch, max_new_tokens=200, do_sample=False)

	# Decode only the newly generated tokens, skipping the prompt
	print(tok.decode(out[0][batch["input_ids"].shape[1]:], skip_special_tokens=True))
	```

	## Architecture notes

	A3-preview uses the Path B (composition, not modification) approach to
	extending a text LLM into a VLM: the decoder is untouched, the vision
	encoder is taken intact from a pretrained VLM, and only the projector
	between them is new. This keeps the underlying text model's reasoning,
	tool-calling, and identity capabilities exactly as in A2. The multimodal
	addition cannot regress text capability because the text computation
	path is byte-identical.

	Image tokens are inserted using A2's repurposed reserved-token layout
	(`<|image_pad|>` is token id 22; see the A1 release notes for the
	full token-id allocation across `<think>`, `<tool_call>`, vision, etc.).
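The usual LLaVA-style way to insert image tokens is to substitute projected visual features at the positions of the image placeholder token. The sketch below shows that substitution with plain Python lists standing in for embedding tensors. It is an illustration of the general technique and not the `artemis-vlm` implementation, which may differ; `merge_visual_features` is a hypothetical name.

```python
IMAGE_PAD_ID = 22  # id of <|image_pad|>, as stated above


def merge_visual_features(input_ids, text_embeds, visual_embeds, image_pad_id=IMAGE_PAD_ID):
    """Replace the embedding at each image-pad position with the next visual feature.

    input_ids     : list of token ids for one sequence
    text_embeds   : one embedding (list of floats) per token id
    visual_embeds : projected visual features, one per image-pad token, in order
    """
    n_slots = sum(1 for token in input_ids if token == image_pad_id)
    if n_slots != len(visual_embeds):
        raise ValueError(
            f"{n_slots} image-pad tokens but {len(visual_embeds)} visual features"
        )
    visual_iter = iter(visual_embeds)
    return [
        next(visual_iter) if token == image_pad_id else embed
        for token, embed in zip(input_ids, text_embeds)
    ]


# Toy example: two placeholder positions surrounded by ordinary text tokens.
ids = [5, 22, 22, 9]
text_vectors = [[0.1], [0.0], [0.0], [0.4]]
visual_vectors = [[1.0], [2.0]]
merged = merge_visual_features(ids, text_vectors, visual_vectors)
print(merged)
```

Text positions keep their original embeddings, which is consistent with the composition approach described above: only the placeholder slots carry new information.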

	## License

	Apache 2.0. Same as A1, A2, and the underlying Qwen3-VL vision tower.

	## Acknowledgements

	- BLIP3o team for the Long-Caption pretraining corpus
	- Qwen team for the Qwen3-VL vision encoder
	- LLaVA project for the architectural template

	— Schneewolf Labs · Project Artemis