Instructions for using Michael-Kozu/Deimos-A1 with libraries and local apps.

Libraries

How to use Michael-Kozu/Deimos-A1 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

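# The messages below include an image, so use the multimodal chat task;
# the "text-generation" pipeline does not accept the text= keyword used here.
pipe = pipeline("image-text-to-text", model="Michael-Kozu/Deimos-A1")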
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("Michael-Kozu/Deimos-A1")
model = AutoModelForMultimodalLM.from_pretrained("Michael-Kozu/Deimos-A1")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Michael-Kozu/Deimos-A1 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Michael-Kozu/Deimos-A1"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Michael-Kozu/Deimos-A1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
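The same request can be issued from Python. The sketch below uses only the standard library and assumes the vLLM server started above is listening on localhost:8000. The helper names `build_chat_request` and `ask` are illustrative and not part of vLLM; the response is parsed according to the OpenAI chat-completions schema.

```python
import json
import urllib.request

# Assumed defaults taken from the curl example; adjust to your deployment.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "Michael-Kozu/Deimos-A1"


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat-completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, base_url: str = BASE_URL, timeout: float = 60.0) -> str:
    """Send the request and return the assistant message text."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("What is the capital of France?")` should return the model's reply as a string. Pass `base_url="http://localhost:30000/v1"` to target a server on port 30000 instead.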

Use Docker

docker model run hf.co/Michael-Kozu/Deimos-A1

SGLang

How to use Michael-Kozu/Deimos-A1 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Michael-Kozu/Deimos-A1" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Michael-Kozu/Deimos-A1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Michael-Kozu/Deimos-A1" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Michael-Kozu/Deimos-A1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use Michael-Kozu/Deimos-A1 with Docker Model Runner:
```
docker model run hf.co/Michael-Kozu/Deimos-A1
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Deimos A1

Satellite Class · 4.66B · MIT

Concise reasoning · Text generation · Public release

Release signal: ~89% fewer tokens · ~6× faster · Weights available

Overview

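Deimos A1 is a concise chain-of-thought (CCoT) fine-tune of Qwen3.5-4B. It produces dense, stepwise <think> blocks averaging ~1/8 the tokens of the base model. Early runs also showed higher accuracy on the reasoning benchmarks we measured, but those results are preliminary and harness-sensitive; the Benchmarks section states what is and is not claimed.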
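Because the reasoning arrives in a <think> block ahead of the answer, downstream code usually wants to separate the two. The helper below is a minimal sketch, assuming a single closing `</think>` tag precedes the final answer; the opening tag is treated as optional, since some chat templates place it in the prompt rather than in the generated text. The function name `split_reasoning` is illustrative.

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is "" when no </think> tag is present."""
    head, sep, tail = text.partition("</think>")
    if not sep:
        # No closing tag: treat the whole output as the answer.
        return "", text.strip()
    # Drop the opening tag if the model emitted it.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()
```

For example, `split_reasoning("<think>2+2=4</think>\nThe answer is 4.")` yields the reasoning `"2+2=4"` and the answer `"The answer is 4."`.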

Deimos A1 is our first model release. It was trained on Quark, a 4,919-row CCoT SFT dataset whose <think> traces were compressed by a Qwen3.6-35B teacher (NVFP4) running in our internal Tokamak pipeline. Final answers in the training data are byte-identical to the source; only the reasoning channel is rewritten.

The "A1" suffix means Alpha 1 — the first public iteration of the Deimos line. Future revisions (A2, …) will fold in additional sources (Kimi K2.5, larger Quark builds), longer training runs, and answer-channel compression once it lands in the Tokamak pipeline.

Specifications

Model

Class: Satellite
Parameters: 4B
Architecture: Qwen3_5 (Gated DeltaNet + sparse attention)
Base: Qwen/Qwen3.5-4B
Precision: BF16 (merged)
Context: 131,072 tokens

Training

Method: LoRA r=128, α=128 (merged)
Targets: all attention + MLP
Epochs: 3
Optimiser: AdamW (cosine, lr 1e-4)
Wall time: ~3 h 49 m

Training Details

Dataset: Michael-Kozu/Quark — 4,919 rows of CCoT SFT data (Opus 4.6 + GPT-5.4 sources). Train/val/test = 3,937 / 491 / 491.
Adapter: LoRA rank 128, α 128, dropout 0; targets q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj; trainable parameters 169.9M of 4.7B (3.61%). Adapter merged into base before release.
Schedule: 3 epochs · per-device batch 4 · gradient accumulation 4 · effective batch 16 · 741 total steps · cosine LR 1e-4 · 5% warmup · AdamW · weight decay 0.01.
Sequence: max 4,096 tokens · packing disabled (Qwen3.5 multimodal architecture).
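The schedule figures above can be cross-checked with a few lines of arithmetic. This is a sketch under two assumptions that the card does not state outright: training ran on a single device (implied by 4 × 4 = 16), and a trailing partial batch counts as a full optimiser step (which is what reproduces 741).

```python
import math

train_rows = 3_937        # train split size from the card
per_device_batch = 4
grad_accum = 4
epochs = 3
num_devices = 1           # assumption: single device

effective_batch = per_device_batch * grad_accum * num_devices
# Assumption: the final partial batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs

trainable_params = 169.9e6
total_params = 4.7e9      # rounded figure from the card
trainable_pct = 100 * trainable_params / total_params

lora_scale = 128 / 128    # alpha / r, the standard LoRA scaling factor

print(effective_batch, steps_per_epoch, total_steps)   # 16 247 741
print(f"{trainable_pct:.2f}%", lora_scale)              # 3.61% 1.0
```

With α equal to r, the LoRA update is applied at scale 1.0, so the adapter's contribution is neither amplified nor damped when merged.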

Loss trajectory

Training and eval loss curves

Train loss falls from 1.043 to 0.477 over 741 steps. Eval loss bottoms out at 0.814 at the end of epoch 2, then drifts up to 0.831 at the end of epoch 3, indicating mild overfitting. The released weights use the final epoch-3 state; the epoch-2 checkpoint, which has the lower eval loss, is preserved internally.

Learning-rate schedule

Cosine LR schedule with 5% warmup
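For readers without the plot, the shape of the schedule can be sketched as a function of the step index. The card gives the peak rate (1e-4), the total step count (741), and the 5% warmup; the linear warmup shape, the decay floor of zero, and the rounding of the warmup length are assumptions here.

```python
import math

PEAK_LR = 1e-4
TOTAL_STEPS = 741
# Assumption: the exact rounding rule for the 5% warmup is not stated.
WARMUP_STEPS = round(0.05 * TOTAL_STEPS)  # 37


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed floor)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate peaks at step 37, passes through half the peak at step 389 (the midpoint of the decay phase), and reaches zero at step 741.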

Benchmarks

The headline measurement is token efficiency: mean output tokens per problem and wall-clock time for the full benchmark run. Both are direct, reproducible measurements taken with the same harness against both endpoints on the same hardware.
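The two headline ratios follow directly from the rounded figures this card reports (~1,400 vs ~150 mean tokens per problem, and 58 m vs 9 m 38 s for the full benchmark). The snippet below only redoes that arithmetic; it does not re-run any measurement.

```python
base_tokens, deimos_tokens = 1_400, 150   # approximate means from the card
base_seconds = 58 * 60                     # 58 m
deimos_seconds = 9 * 60 + 38               # 9 m 38 s

token_reduction = 1 - deimos_tokens / base_tokens
speedup = base_seconds / deimos_seconds

print(f"token reduction: {token_reduction:.1%}")   # token reduction: 89.3%
print(f"wall-clock speedup: {speedup:.2f}x")       # wall-clock speedup: 6.02x
```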

Token efficiency — Deimos vs Base

Mean tokens / problem: ~1,400 → ~150 (≈ -89%)

Wall-clock (full bench): 58m → 9m 38s (≈ 6× faster)

Contamination (13-gram): 0% overlap between Quark and the GSM8K / MMLU-Pro / ARC-C test sets
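A 13-gram contamination check of the kind reported above can be sketched as follows. The card does not describe its tokenisation or how overlap was scored, so this version makes its own assumptions: lowercased whitespace tokens, and a training row counted as contaminated if it shares at least one n-gram with any test item. The function names are illustrative.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(train_texts: list[str], test_texts: list[str], n: int = 13) -> float:
    """Fraction of training texts sharing at least one n-gram with any test text."""
    test_grams: set[tuple[str, ...]] = set()
    for text in test_texts:
        test_grams |= ngrams(text, n)
    if not train_texts:
        return 0.0
    hits = sum(1 for text in train_texts if ngrams(text, n) & test_grams)
    return hits / len(train_texts)
```

Texts shorter than n tokens produce no n-grams and therefore can never register as overlapping, which is worth keeping in mind when interpreting a 0% result on short items.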

Token efficiency by task

Mean output tokens per problem, Deimos A1 vs Qwen3.5-4B base

Accuracy — comprehensive evaluation in progress.

Initial harness runs show that Deimos consistently emits a parseable answer immediately after its <think> block, while the base model under the same harness more often does not. This makes accuracy comparisons sensitive to the parser, the per-task max_tokens budget, and chat-template handling. We are tuning the eval harness (matching Qwen's recommended max_tokens for thinking mode, and verifying that our scores reproduce Qwen's published baseline numbers before claiming any delta) and will publish a full report with per-task tables, contamination disclosure, and reproduction instructions in a follow-up update to this card.

Until that report lands, the only quantitative claims we make about this model are the token-efficiency and wall-clock numbers above — both of which are measured the same way for both models and are not sensitive to harness parsing.

Limitations & License

Subset benchmarks only. Per-task n=10 (stderr ±16%). MMLU-Pro at n=140 (stderr ±4%). Larger-n runs are planned for the next release.
Inherited Qwen3.5-4B limitations — language coverage, knowledge cutoff, and any biases of the base model. Quark fine-tuning shifts style, not knowledge.
Mild ep-3 overfitting on the 4,919-row Quark training set (val loss 0.814 → 0.831 from ep 2 to ep 3).
English only at this time.
License: MIT, consistent with the Quark dataset and the Qwen3.5-4B base license terms.
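The standard errors quoted for the subset benchmarks (±16% at n=10, ±4% at n=140) match the binomial standard error at its worst case, p = 0.5. The card does not say how they were computed, so the choice of p here is an assumption.

```python
import math


def binomial_stderr(n: int, p: float = 0.5) -> float:
    """Standard error of an accuracy estimate from n independent pass/fail items."""
    return math.sqrt(p * (1 - p) / n)


for n in (10, 140):
    print(f"n={n}: ±{binomial_stderr(n):.1%}")
# n=10: ±15.8%
# n=140: ±4.2%
```

At n=10, a single item flips accuracy by 10 points, so per-task differences smaller than roughly one standard error should not be read as meaningful.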

Kozu AI Turning the laws of reality into unparalleled creation.
