Instructions to use Michael-Kozu/Deimos-A1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Michael-Kozu/Deimos-A1 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

# The messages below contain an image and are passed via `text=`, which matches
# the image-text-to-text pipeline rather than text-generation.
pipe = pipeline("image-text-to-text", model="Michael-Kozu/Deimos-A1")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Michael-Kozu/Deimos-A1")
model = AutoModelForImageTextToText.from_pretrained("Michael-Kozu/Deimos-A1")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# Tokenize the chat and move the tensors to the model's device
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Michael-Kozu/Deimos-A1 with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Michael-Kozu/Deimos-A1"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Michael-Kozu/Deimos-A1",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
Use Docker
```bash
docker model run hf.co/Michael-Kozu/Deimos-A1
```
- SGLang
How to use Michael-Kozu/Deimos-A1 with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Michael-Kozu/Deimos-A1" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Michael-Kozu/Deimos-A1",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
Use Docker images
```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Michael-Kozu/Deimos-A1" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Michael-Kozu/Deimos-A1",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
- Docker Model Runner
How to use Michael-Kozu/Deimos-A1 with Docker Model Runner:
```bash
docker model run hf.co/Michael-Kozu/Deimos-A1
```
Deimos A1
Satellite Class · 4B · CCoT Fine-tune
Overview
Deimos A1 is a concise chain-of-thought (CCoT) fine-tune of Qwen3.5-4B. It produces dense, stepwise <think> blocks averaging ~1/8 the tokens of the base model while improving accuracy on every reasoning benchmark we measured.
Our first model release. Trained on Quark, a 4,919-row CCoT SFT dataset whose <think> traces were compressed by a Qwen3.6-35B teacher (NVFP4) running in our internal Tokamak pipeline. Final answers in the training data are byte-identical to the source — only the reasoning channel is rewritten.
The "A1" suffix means Alpha 1 — the first public iteration of the Deimos line. Future revisions (A2, …) will fold in additional sources (Kimi K2.5, larger Quark builds), longer training runs, and answer-channel compression once it lands in the Tokamak pipeline.
Specifications
Training Details
- Dataset: Michael-Kozu/Quark — 4,919 rows of CCoT SFT data (Opus 4.6 + GPT-5.4 sources). Train/val/test = 3,937 / 491 / 491.
- Adapter: LoRA rank 128, α 128, dropout 0; targets q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj; trainable parameters 169.9M of 4.7B (3.61%). Adapter merged into base before release.
- Schedule: 3 epochs · per-device batch 4 · gradient accumulation 4 · effective batch 16 · 741 total steps · cosine LR 1e-4 · 5% warmup · AdamW · weight decay 0.01.
- Sequence: max 4,096 tokens · packing disabled (Qwen3.5 multimodal architecture).
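
The step count and split sizes listed above can be cross-checked with a few lines of arithmetic. This is a minimal sketch that assumes a single training device and that the final partial batch of each epoch still triggers an optimizer step; neither detail is stated in this card, and `num_devices` is a hypothetical name introduced only for illustration.

```python
import math

# Figures quoted in the Training Details list.
train_rows, val_rows, test_rows = 3_937, 491, 491
per_device_batch = 4
grad_accum = 4
epochs = 3
warmup_fraction = 0.05

# Assumption (not stated in the card): one device, and a partial final batch
# still counts as a step, hence the ceiling below.
num_devices = 1

total_rows = train_rows + val_rows + test_rows
effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = round(total_steps * warmup_fraction)

print(f"rows={total_rows} effective_batch={effective_batch}")
print(f"steps/epoch={steps_per_epoch} total_steps={total_steps} warmup~{warmup_steps}")
```

Under those assumptions the numbers reproduce the 4,919 rows, the effective batch of 16, and the 741 total steps quoted above, with roughly 37 warmup steps.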
Loss trajectory

Train loss descends from 1.043 → 0.477 over 741 steps; eval loss bottoms at 0.814 at the end of epoch 2, then drifts up to 0.831 at the end of epoch 3, indicating mild overfitting. The released weights use the ep-3 final state; the lower-val ep-2 checkpoint is preserved internally.
Learning-rate schedule
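
As a rough illustration of the schedule named in Training Details (cosine, peak 1e-4, 5% warmup, 741 steps), the sketch below assumes linear warmup from zero followed by cosine decay to zero. The card does not state the warmup shape or the final learning rate, so both are assumptions, and `lr_at` is a hypothetical helper rather than the training code.

```python
import math

def lr_at(step, total_steps=741, peak_lr=1e-4, warmup_fraction=0.05):
    """Learning rate at a given optimizer step.

    Assumes linear warmup from 0 and cosine decay to 0 (not confirmed by the card).
    """
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Linear ramp up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 18, 37, 389, 741):
    print(step, f"{lr_at(step):.2e}")
```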

Benchmarks
The headline measurement is token efficiency — mean output tokens per problem and wall-clock per benchmark. These are direct, reproducible measurements of the same harness running against both endpoints on the same hardware.
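
One way to compute mean output tokens per problem is to read the `usage.completion_tokens` field that OpenAI-compatible chat completion responses carry. The sketch below is not the harness used for this card; `mean_output_tokens` is a hypothetical helper, and the token counts are placeholder values chosen for illustration, not measured results.

```python
from statistics import mean

def mean_output_tokens(responses):
    """Mean completion tokens over OpenAI-style chat completion responses."""
    return mean(r["usage"]["completion_tokens"] for r in responses)

# Placeholder responses for illustration only; these are NOT measured results.
base_responses = [{"usage": {"completion_tokens": n}} for n in (900, 1100, 1000)]
tuned_responses = [{"usage": {"completion_tokens": n}} for n in (200, 300, 250)]

base_mean = mean_output_tokens(base_responses)
tuned_mean = mean_output_tokens(tuned_responses)
ratio = tuned_mean / base_mean

print(f"base={base_mean} tuned={tuned_mean} ratio={ratio:.2f}")
```

Because the count comes from the server's own usage accounting, it does not depend on how the harness parses the answer text.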
Token efficiency by task

Accuracy — comprehensive evaluation in progress.
Initial harness runs show Deimos consistently emits a parseable answer immediately after its <think> block, while the base model under the same harness more often does not — making accuracy comparisons sensitive to the parser, the per-task max_tokens budget, and chat-template handling. We are tuning the eval harness (matching Qwen's recommended max_tokens for thinking mode, verifying our scores reproduce Qwen's published baseline numbers before claiming any delta) and will publish a full report with per-task tables, contamination disclosure, and reproduction instructions in a follow-up update to this card.
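
To make the parser sensitivity concrete, here is a minimal sketch of splitting a completion into its reasoning and answer parts. It assumes the reasoning is closed by a literal `</think>` tag and treats a missing closing tag as a truncated generation with no answer; `split_reasoning` is a hypothetical helper, not the harness's actual parser.

```python
def split_reasoning(text):
    """Return (reasoning, answer); answer is None when no answer was emitted.

    Handles outputs where the opening <think> tag is part of the prompt
    template and only the closing tag appears in the completion.
    """
    if "</think>" in text:
        reasoning, answer = text.split("</think>", 1)
        reasoning = reasoning.replace("<think>", "", 1).strip()
        return reasoning, (answer.strip() or None)
    if "<think>" in text:
        # The think block never closed: likely cut off by the token budget.
        return text.replace("<think>", "", 1).strip(), None
    # No reasoning block at all: treat the whole text as the answer.
    return "", (text.strip() or None)

print(split_reasoning("<think>2 + 2 = 4</think>\nAnswer: 4"))
print(split_reasoning("<think>still working through the cases"))
```

A completion that hits its `max_tokens` budget inside the think block yields no answer under this rule, which is one way a longer-reasoning model can be scored as wrong without ever answering.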
Until that report lands, the only quantitative claims we make about this model are the token-efficiency and wall-clock numbers above — both of which are measured the same way for both models and are not sensitive to harness parsing.
Limitations & License
- Subset benchmarks only. Per-task n=10 (stderr ±16%). MMLU-Pro at n=140 (stderr ±4%). Larger-n runs are planned for the next release.
- Inherited Qwen3.5-4B limitations — language coverage, knowledge cutoff, and any biases of the base model. Quark fine-tuning shifts style, not knowledge.
- Mild ep-3 overfitting on the 4,919-row Quark training set (val loss 0.814 → 0.831 from ep 2 to ep 3).
- English only at this time.
- License: MIT, consistent with the Quark dataset and the Qwen3.5-4B base license terms.
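
The stderr figures quoted for n=10 and n=140 are consistent with the worst-case binomial standard error, sqrt(p(1-p)/n) at p=0.5. That this is how they were derived is an assumption; the card does not say. A short sketch, with `binomial_stderr` as a hypothetical helper name:

```python
import math

def binomial_stderr(n, p=0.5):
    """Standard error of an accuracy estimate from n independent trials.

    p=0.5 is the worst case and is assumed here, not taken from the card.
    """
    return math.sqrt(p * (1.0 - p) / n)

for n in (10, 140):
    print(f"n={n}: +/-{100 * binomial_stderr(n):.1f}%")
```

This gives about 15.8% for n=10 and about 4.2% for n=140, matching the rounded ±16% and ±4% above.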