---
license: apache-2.0
language:
- en
- multilingual
pipeline_tag: image-text-to-text
tags:
- multimodal
- vlm
- artemis-vlm
- mistral
- qwen3-vl
- schneewolf-labs
- a-series
base_model: schneewolflabs/A3
datasets:
- schneewolflabs/ArtemisMix-v1.1
---

# A3-Instruct

**A3-Instruct** is the instruction-tuned + creative-writing sibling of
[**A3**](https://huggingface.co/schneewolflabs/A3) in the Schneewolf Labs
A-series. It takes A3 (Qwen3-VL ViT + 2-layer MLP projector + A2/Mistral
decoder; Stage-1 projector-only alignment) and runs a full multimodal
instruction fine-tune on [ArtemisMix-v1.1](https://huggingface.co/datasets/schneewolflabs/ArtemisMix-v1.1).

> A3 stays the dense-captioning specialist. A3-Instruct is the one to
> reach for when you want conversation, VQA, multi-turn image grounding,
> tool-call drafting, identity-preserving chat, or creative writing.

## What it is

| | |
|---|---|
| Architecture | Qwen3-VL ViT (frozen, ~0.41 B) + 2-layer MLP projector (trained, 37 M) + A2 Mistral decoder (full fine-tune, 12.25 B) |
| Total params | **12.69 B** |
| Trainable in Stage-2 | 12.28 B (96.8%) — ViT frozen |
| Base | [`schneewolflabs/A3`](https://huggingface.co/schneewolflabs/A3) |
| Training corpus | [`schneewolflabs/ArtemisMix-v1.1`](https://huggingface.co/datasets/schneewolflabs/ArtemisMix-v1.1) (364,816 rows; 333,001 after the 4096-token filter) |
| Epochs | 1 |
| Effective batch | 16 (bs 1 × grad-accum 16) |
| Optimizer | paged AdamW 8-bit |
| Learning rate | 1e-5, cosine, warmup 3% |
| Max seq length | 4096 |
| Hardware | 1× NVIDIA GB10 (DGX Spark, 128 GB unified) |
| Wall-clock | ~7.4 days |
| Final eval loss | **0.7516** (down from 0.85 at the first eval) |
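
The figures in the table can be cross-checked with a little arithmetic. The sketch below is only a sanity check built from the numbers above. The rounding choices are assumptions (a final partial batch counted as one step, warmup rounded to the nearest step), and the per-step time is an upper estimate because the wall-clock figure presumably includes evaluation and checkpointing.

```python
import math

# Figures copied from the table above, in billions of parameters.
vit_b, projector_b, decoder_b = 0.41, 0.037, 12.25

total_b = vit_b + projector_b + decoder_b
trainable_b = projector_b + decoder_b  # the ViT stays frozen in Stage-2
trainable_pct = 100 * trainable_b / total_b

# Corpus size before and after the 4096-token filter.
rows_total, rows_kept = 364_816, 333_001
rows_dropped = rows_total - rows_kept
kept_pct = 100 * rows_kept / rows_total

# One epoch at effective batch 16 (bs 1 x grad-accum 16).
effective_batch = 1 * 16
optimizer_steps = math.ceil(rows_kept / effective_batch)  # assumes the last partial batch is kept
warmup_steps = round(0.03 * optimizer_steps)              # assumes warmup is counted in optimizer steps

# Average wall-clock cost per optimizer step over ~7.4 days.
seconds_per_step = 7.4 * 24 * 3600 / optimizer_steps

print(f"total params      : {total_b:.2f} B")
print(f"trainable         : {trainable_b:.2f} B ({trainable_pct:.1f}%)")
print(f"rows kept/dropped : {rows_kept} / {rows_dropped} ({kept_pct:.1f}% kept)")
print(f"optimizer steps   : {optimizer_steps} (warmup {warmup_steps})")
print(f"seconds per step  : {seconds_per_step:.1f}")
```

The component sizes are rounded, so the sums can differ from the table in the last digit.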

## Strengths

- **Identity is durable.** Asked "Who are you and who built you?" with no system prompt, the model answers "I'm a language model created by Schneewolf Labs, a software research and publishing company based in Pennsylvania." The i-DPO anti-drift bedrock (17% of the corpus) held cleanly through the full fine-tune.
- **Creative writing has voice.** *"The rain fell in gray sheets, turning the neon signs of Nathan Road into smears of color on the wet concrete... The kind of night where trouble walks in on two legs and asks for a drink."* The Athanorlite contribution pulled through.
- **VQA + multi-turn structure work.** Direct factual questions (counting, color, what's in the image) get clean answers; multi-turn follow-ups maintain image context.
- **Architecture stays usable in llama.cpp** via the `Schneewolf-Labs/llama.cpp` fork's Artemis VLM mmproj graft (same path as A3).

## Limitations

- **Hybrid `<think>` gate is currently underdeveloped.** Even with `enable_thinking=True`, the model tends to emit an empty `<think></think>` and put its reasoning in the answer body instead of filling the wrapper. The model *can* reason; it just doesn't use the dedicated block. The likely cause is that the training data had `<think>...</think>` embedded in assistant content, so the model learned to close, not fill, the template-injected wrapper. Under investigation.
- **Visual grounding regressed vs A3** on dense description. A3's caption of a bento box correctly named pink/yellow/blue containers, apricots, almonds, figs, and "meatloaf"; A3-Instruct's caption on the same image is more generic ("plastic containers", "meat", "fruit salad") and occasionally hallucinates (called pineapple+mandarin "apples", then "cake" on a follow-up). This is the alignment tax — only ~30 K of the 365 K rows were detailed-captioning, and the conversational/reasoning data diluted A3's perceptual sharpness. **For dense captioning, use A3.**
- **Structured `<tool_call>` syntax drifted.** A2's tool-call format (`<tool_call>` blocks with JSON) was rehearsed via 30 K oversampled A2-tool-orpo rows, but A3-Instruct emits the *concept* of an API call (e.g. an OpenWeatherMap GET URL with params) rather than the structured token format. The behavior is reasonable; the format isn't.

These three are tracked for the next training run; they are not blockers for chat/VQA/creative use.
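
Two of these limitations can be detected mechanically in a raw completion. The helper below is a minimal sketch, not part of `artemis-vlm`: `inspect_completion` is a hypothetical name, the closing `</tool_call>` tag is an assumption (the card only says the format is `<tool_call>` blocks with JSON), and no particular JSON schema is checked.

```python
import json
import re

# Non-greedy matches so multiple blocks in one completion are handled separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)  # closing tag is assumed


def inspect_completion(text: str) -> dict:
    """Report whether the think block was filled and which tool calls parse as JSON."""
    think = THINK_RE.search(text)
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            calls.append(None)  # block present, but its payload is not valid JSON
    return {
        "has_think_block": think is not None,
        "think_is_empty": think is not None and not think.group(1).strip(),
        "tool_calls": calls,
    }


if __name__ == "__main__":
    # Hypothetical completion showing the empty-wrapper behavior described above.
    print(inspect_completion("<think></think>There are four apples in the image."))
```

An empty `tool_calls` list on a prompt that should have triggered a call, or a `None` entry, is a signal to fall back to manual handling before anything is executed.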

## Intended use

- Conversational VLM for ChatSWL-style internal product use
- VQA, multi-turn image grounding, creative writing, image-grounded discussion
- Foundation for further fine-tuning toward specialized behavior

Not intended for: production tool calling without verification, high-stakes captioning where A3 is the better choice, autonomous decision-making.

## Inference

Load the model with `ArtemisVLMForConditionalGeneration` from the `artemis-vlm` package (PyPI: `artemis-vlm >= 0.1.3`), which follows the standard `transformers` `from_pretrained` interface. The model also runs in llama.cpp via the `Schneewolf-Labs/llama.cpp` fork's `mtmd` support (decoder GGUF + Artemis mmproj GGUF, the same pattern as A3).

```python
import torch
from transformers import AutoConfig, AutoTokenizer
from artemis_vlm import ArtemisVLMForConditionalGeneration, ArtemisVLMProcessor

ckpt = "schneewolflabs/A3-Instruct"

# Load the model weights in bfloat16 and move them to the GPU.
model = ArtemisVLMForConditionalGeneration.from_pretrained(ckpt, dtype=torch.bfloat16).to("cuda")

# The config supplies the vision settings that the processor needs.
cfg = AutoConfig.from_pretrained(ckpt, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)

# Build the processor from the tokenizer and the vision config.
proc = ArtemisVLMProcessor(tokenizer=tok, vision_config=cfg.vision_config)
```

## Lineage

- [`schneewolflabs/A3`](https://huggingface.co/schneewolflabs/A3) — Stage-1 base (projector-only alignment, 1 M BLIP3o samples)
- [`schneewolflabs/A2`](https://huggingface.co/schneewolflabs/A2) — text decoder (Mistral 12.3 B, hidden 5120, Tekken vocab)
- [`schneewolflabs/ArtemisMix-v1.1`](https://huggingface.co/datasets/schneewolflabs/ArtemisMix-v1.1) — training corpus
- [`schneewolflabs/Athanorlite-DPO`](https://huggingface.co/datasets/schneewolflabs/Athanorlite-DPO) — creative-writing source (collapsed to SFT, non-reasoning bucket)
- [`schneewolflabs/i-DPO`](https://huggingface.co/datasets/schneewolflabs/i-DPO) — identity/voice anti-drift bedrock

## What about Artemis?

The **Artemis** name is reserved for a future training run that addresses the three limitations above (explicit thinking-block targets, dense-captioning preservation, structured tool-format rehearsal) and ideally uses the full 500 K corpus with the L2 (multimodal tool/agent) and L3 (custom distill) layers that ArtemisMix-v1.1 deliberately deferred.

## License

Apache-2.0, consistent with the rest of the A-series lineage.