---
license: apache-2.0
language:
- en
tags:
- multimodal
- vision-language
- vlm
- artemis
- schneewolf-labs
- preview
- stage1
- projector-only
base_model:
- schneewolflabs/A2
- Qwen/Qwen3-VL-2B-Instruct
datasets:
- BLIP3o/BLIP3o-Pretrain-Long-Caption
pipeline_tag: image-text-to-text
---

# A3-preview

> **Project Artemis: Stage-1 alignment proof-of-concept.**
> *This is a preview, not a production VLM.* It demonstrates that the
> Schneewolf Labs A-series text decoder can be successfully extended to
> vision-language with a small learned projector. A3-preview is the
> training milestone *between* A2 (text-only flagship) and A3 (the real
> multimodal release).

## What this is

A LLaVA-style graft assembling three pieces:

| Component | Source | Role |
|---|---|---|
| **Vision tower** | `Qwen/Qwen3-VL-2B-Instruct` (ViT, ~600M params) | Image → visual feature tokens |
| **Projector** | Fresh 2-layer MLP, ~45M params | Visual hidden → text hidden bridge |
| **Language model** | `schneewolflabs/A2` (~12B params) | Unchanged decoder |

Only the projector was trained. The vision tower and decoder are frozen
exactly as published, so A2's text capabilities (reasoning, tool calls,
identity, Qwen3 chat template support) are preserved by construction.
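
To put "only the projector was trained" in numbers, the approximate parameter counts from the table above can be turned into a trainable share. This is a back-of-the-envelope sketch: the counts are the rounded figures from the table, and `PARAMS`, `TRAINABLE`, and `trainable_fraction` are illustrative names, not part of `artemis-vlm`.

```python
# Approximate parameter counts, taken from the component table (rounded).
PARAMS = {
    "vision_tower": 600e6,    # frozen
    "projector": 45e6,        # trained
    "language_model": 12e9,   # frozen
}
TRAINABLE = {"projector"}


def trainable_fraction(params, trainable):
    """Return the share of all parameters that receive gradient updates."""
    total = sum(params.values())
    trained = sum(count for name, count in params.items() if name in trainable)
    return trained / total


if __name__ == "__main__":
    frac = trainable_fraction(PARAMS, TRAINABLE)
    print(f"trainable share: {frac:.2%}")
```

With these rounded counts, the projector accounts for well under 1% of all weights.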

## Training details

| Setting | Value |
|---|---|
| Corpus | `BLIP3o/BLIP3o-Pretrain-Long-Caption` (25,000 streamed samples) |
| Optimizer | AdamW (fp32 moments), lr 1e-3, cosine decay to 0 |
| Effective batch | 8 (bs=2 × grad_accum=4) |
| Steps | 3,094 (1 epoch) |
| Precision | bfloat16 |
| Wall clock | ~3.4 hours on a single NVIDIA GB10 (DGX Spark) |
| Train loss | 5.44 → **0.88** |
| Eval loss | **0.77** on held-out BLIP3o (lower than train loss, so no sign of memorization) |
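
The schedule in the table can be sketched in a few lines. The snippet below computes the effective batch size, shows one split that would reproduce the reported 3,094 steps, and evaluates a cosine decay from 1e-3 to 0. The 1% held-out fraction is an assumption, chosen only because it matches the step count (25,000 samples at batch 8 would otherwise give 3,125 steps); the card does not state the split. The absence of warmup is likewise assumed.

```python
import math

BATCH_SIZE = 2
GRAD_ACCUM = 4
TOTAL_SAMPLES = 25_000
HELD_OUT_FRACTION = 0.01  # assumption: the card does not state the eval split
PEAK_LR = 1e-3


def steps_per_epoch(num_samples, effective_batch):
    """Optimizer steps for one pass, counting a final partial batch."""
    return math.ceil(num_samples / effective_batch)


def cosine_lr(step, total_steps, peak_lr=PEAK_LR):
    """Cosine decay from peak_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


effective_batch = BATCH_SIZE * GRAD_ACCUM
train_samples = TOTAL_SAMPLES - round(TOTAL_SAMPLES * HELD_OUT_FRACTION)
total_steps = steps_per_epoch(train_samples, effective_batch)

if __name__ == "__main__":
    print("effective batch:", effective_batch)
    print("train samples:", train_samples)
    print("optimizer steps:", total_steps)
    for step in (0, total_steps // 2, total_steps):
        print(step, f"{cosine_lr(step, total_steps):.2e}")
```

Under these assumptions the learning rate starts at the peak, passes through half the peak at the midpoint, and reaches zero on the final step.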

## What works

Tested on a small held-out battery (BLIP3o + entirely out-of-distribution
Japanese photos). The projector is **image-grounded**: captions describe
what is actually in each image, including specific named objects on OOD inputs
(brand text on bottles, identification of a "Gundam statue" at a specific
"Lalaport" mall, etc.). This is what we hoped for from Stage-1 alignment,
and it sets up a full-scale Stage-1 run.

## What this is *not*

- **Not a production VLM.** 25k samples is a fraction of what serious
  projector alignment needs (LLaVA-1.5 used 558k; LLaVA-NeXT used 1.3M).
- **Captions stay close to "describe the image" patterns.** Visual
  reasoning, OCR, VQA, multi-image, and detailed counting were not trained
  for and won't work reliably.
- **No instruction tuning on multimodal data yet.** That's Stage-2.
- **No safety / refusal tuning** on visual inputs.

## What's next

- **A3**: full Stage-1 (~1M samples on BLIP3o-Long-Caption), currently training on
  a single NVIDIA GB10. A3 is the projector-aligned successor to A3-preview.
- **Artemis**: Stage-2 (multimodal instruction full fine-tuning (FFT) with text
  rehearsal, so that A2's reasoning / tool calling / identity survive). The named
  flagship multimodal release after A3.

## Install

```bash
pip install 'artemis-vlm @ git+https://github.com/Schneewolf-Labs/Artemis.git@v0.1.0'
```

The [`artemis-vlm`](https://github.com/Schneewolf-Labs/Artemis) package contains
the model definition, processor, and data collator. On import, it registers
`artemis_vlm` with Hugging Face `AutoConfig` and `AutoModelForCausalLM` so
`from_pretrained()` resolves without `trust_remote_code`.

## Usage

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForCausalLM
import artemis_vlm  # registers ArtemisVLM with AutoConfig / AutoModel

model = AutoModelForCausalLM.from_pretrained(
    "schneewolflabs/A3-preview", dtype=torch.bfloat16,
).to("cuda").eval()

tok = AutoTokenizer.from_pretrained("schneewolflabs/A3-preview")
processor = artemis_vlm.ArtemisVLMProcessor(
    tokenizer=tok, vision_config=model.visual.config,
    min_pixels=32 * 32, max_pixels=512 * 512,
)

# Qwen3 chat-template style multimodal message
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in detail."},
]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

image = Image.open("your_image.jpg")
batch = processor(text=text, images=[image], return_tensors="pt").to("cuda")

# Greedy decoding; inference only, so gradients are disabled
with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens (skip the prompt)
print(tok.decode(out[0][batch["input_ids"].shape[1]:], skip_special_tokens=True))
```

## Architecture notes

A3-preview uses the *Path B* (composition, not modification) approach to
extending a text LLM into a VLM: the decoder is untouched, the vision
encoder is taken intact from a pretrained VLM, and only the projector
between them is new. This keeps the underlying text model's reasoning,
tool-calling, and identity capabilities exactly as in A2: the multimodal
addition cannot regress text capability because the text computation
path is byte-identical.

Image tokens are inserted using A2's repurposed reserved-token layout
(`<|image_pad|>` is token id 22; see the A1 release notes for the
full token-id allocation across `<think>`, `<tool_call>`, vision, etc.).
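
In LLaVA-style models, placeholder tokens in the prompt are typically swapped for projected visual features before the decoder runs. The sketch below illustrates that general mechanism with plain Python lists; it is not taken from the `artemis-vlm` source. Only the id 22 for `<|image_pad|>` comes from this card; `splice_image_features` and the toy vectors are hypothetical.

```python
IMAGE_PAD_ID = 22  # <|image_pad|> token id, as stated in this card


def splice_image_features(input_ids, text_embeds, image_feats):
    """Replace the embedding at every <|image_pad|> position with the next
    projected image feature, leaving all text positions untouched."""
    num_slots = sum(1 for tok in input_ids if tok == IMAGE_PAD_ID)
    if num_slots != len(image_feats):
        raise ValueError(
            f"{num_slots} image placeholders but {len(image_feats)} image features"
        )
    feats = iter(image_feats)
    return [
        next(feats) if tok == IMAGE_PAD_ID else emb
        for tok, emb in zip(input_ids, text_embeds)
    ]


if __name__ == "__main__":
    ids = [101, 22, 22, 205]                  # toy prompt with two image slots
    text = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.4, 0.4]]
    image = [[9.0, 9.0], [8.0, 8.0]]          # toy projector outputs
    print(splice_image_features(ids, text, image))
```

The placeholder count must match the number of visual feature vectors, which is why a processor usually expands a single image marker into one `<|image_pad|>` per visual token before tokenization.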

## License

Apache 2.0. Same as A1, A2, and the underlying Qwen3-VL vision tower.

## Acknowledgements

- BLIP3o team for the Long-Caption pretraining corpus
- Qwen team for the Qwen3-VL vision encoder
- LLaVA project for the architectural template

Schneewolf Labs · Project Artemis