
Atem v1

Ancient logic. Modern intelligence.

A 1.5B reasoning model trained via multi-source knowledge distillation from frontier teacher models.



Overview

Atem is a 1.5B parameter reasoning model built via supervised fine-tuning on a curated corpus of approximately 115,000 examples distilled from multiple frontier teacher models. Starting from Qwen2.5-1.5B-Instruct, Atem was trained using LoRA to preserve base model capabilities while improving performance on reasoning, mathematics, and coding tasks.

This is Stage 1 of a planned multi-stage training series, focused on establishing strong general reasoning across domains. Stage 2, Atem-Wisdom, builds on this foundation by adding explicit chain-of-thought reasoning: the model works through problems inside <think> tags before producing its final answer.


Model Details

| Property | Value |
| --- | --- |
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Training method | LoRA Supervised Fine-Tuning (Stage 1) |
| LoRA config | r=32, alpha=64, dropout=0.05 |
| Target modules | q, k, v, o, gate, up, down projections |
| Parameters | ~1.54B |
| Training records | ~114,932 |
| Epochs | 1 |
| Effective batch size | 64 (batch 8 × grad accum 8) |
| Learning rate | 2e-4, cosine schedule, 5% warmup |
| Final train loss | 0.940 |
| Final val loss | 0.890 |
| Hardware | NVIDIA A100-SXM4 80GB |
| Max sequence length | 4,096 tokens |
| Precision | bfloat16 |
| License | Apache 2.0 |

Intended Use

Atem is designed for open-ended reasoning tasks where structured, accurate thinking adds value:

  • Code explanation, implementation, and debugging
  • Mathematical problem solving with working shown
  • Analytical reasoning and hypothesis evaluation
  • Concept explanation and comparative analysis
  • Logic, argument, and fallacy identification

Atem is not designed for retrieval-heavy factual lookup, real-time information, or tasks requiring broad knowledge beyond its training domains.


Training Data

Atem was trained on a corpus assembled from eleven sources, combining domain-specific generated datasets with publicly available distillation datasets from frontier models. For Stage 1 training, <think> reasoning traces were stripped from all outputs, leaving only the clean final responses.

| Dataset | Records | Source / Teacher |
| --- | --- | --- |
| EphAsad/QWENMillenium-SF | 5,000 | Qwen2.5-14B: Analytical & Scientific |
| EphAsad/Phi4Millennium-SF | 2,932 | Phi-4 14B: Mathematical Reasoning |
| EphAsad/MistralMillenium-SF | 5,000 | Mistral-Nemo-12B: Language & Comprehension |
| Modotte/CodeX-2M-Thinking | 30,000 | Mixed: Coding |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned | 23,000 | Kimi K2.5: General Distillation (English filtered) |
| WithinUsAI/MiniMax_M2.7_Distilled_5k | 5,000 | MiniMax M2.7 |
| tuanha1305/DeepSeek-R1-Distill | 9,000 | DeepSeek-R1 |
| open-r1/OpenThoughts-114k-math | 10,000 | Mixed: Mathematics (correct answers only) |
| flytech/python-codes-25k | 10,000 | Python coding |
| FreedomIntelligence/medical-o1-reasoning-SFT | 10,000 | Medical reasoning (English config) |
| Private dataset | 5,000 | Undisclosed |
| Total | ~114,932 | |
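
As a quick consistency check on the table above, the per-source record counts can be summed and turned into mixture shares. This is only an illustrative sketch: the counts are copied from the table, and the variable names are not from the original training code.

```python
# Record counts copied from the training data table above.
records = {
    "EphAsad/QWENMillenium-SF": 5_000,
    "EphAsad/Phi4Millennium-SF": 2_932,
    "EphAsad/MistralMillenium-SF": 5_000,
    "Modotte/CodeX-2M-Thinking": 30_000,
    "Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned": 23_000,
    "WithinUsAI/MiniMax_M2.7_Distilled_5k": 5_000,
    "tuanha1305/DeepSeek-R1-Distill": 9_000,
    "open-r1/OpenThoughts-114k-math": 10_000,
    "flytech/python-codes-25k": 10_000,
    "FreedomIntelligence/medical-o1-reasoning-SFT": 10_000,
    "Private dataset": 5_000,
}

total = sum(records.values())

# Fraction of the corpus contributed by each source.
shares = {name: count / total for name, count in records.items()}

largest = max(records, key=records.get)
print(total)                                # 114932
print(largest, f"{shares[largest]:.1%}")
```

The sum matches the stated total, and the largest single source (the coding dataset) accounts for roughly a quarter of the corpus.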

The QWENMillenium-SF, Phi4Millennium-SF, and MistralMillenium-SF datasets were generated specifically for this project via batched inference on a Colab A100. OpenThoughts-114k-math was filtered to verified-correct solutions before sampling.
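
The stripping of <think> traces described in this section could look roughly like the following. This is a minimal sketch, assuming the traces are wrapped in literal `<think>...</think>` tags; the function name `strip_think` is hypothetical and the actual preprocessing code is not published in this card.

```python
import re

# Assumption: each reasoning trace is wrapped in literal <think>...</think> tags.
# DOTALL lets the pattern span multi-line traces; the non-greedy .*? stops at
# the first closing tag so multiple blocks are removed independently.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks and return the trimmed final response."""
    return THINK_BLOCK.sub("", text).strip()


sample = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
print(strip_think(sample))  # The answer is 4.
```

Responses without any tags pass through unchanged, so the same function can be applied uniformly across all sources.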


Training Configuration

# Key hyperparameters
lora_r            = 32
lora_alpha        = 64
lora_dropout      = 0.05
max_seq_length    = 4096
learning_rate     = 2e-4
lr_scheduler      = 'cosine'
warmup_ratio      = 0.05
batch_size        = 8
grad_accumulation = 8       # effective batch size: 64
num_epochs        = 1
dtype             = bfloat16
load_in_4bit      = True    # during training
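
For orientation, the learning-rate schedule implied by these hyperparameters can be sketched in plain Python. This is an approximation under stated assumptions: linear warmup followed by cosine decay to zero, and a step count derived from the full ~114,932 records (the card does not state how many were held out for validation, so the real step count may be somewhat lower). The helper `lr_at` is hypothetical, not part of Unsloth or Transformers.

```python
import math


def lr_at(step, total_steps, peak_lr=2e-4, warmup_ratio=0.05):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


effective_batch = 8 * 8                             # batch_size * grad_accumulation
total_steps = math.ceil(114_932 / effective_batch)  # assumes all records are trained on

print(effective_batch, total_steps)
print(lr_at(0, total_steps), lr_at(90, total_steps), lr_at(total_steps, total_steps))
```

Under these assumptions the run is roughly 1,800 optimizer steps, which is consistent with the loss table reporting checkpoints at steps 500, 1000, and 1500 before the final value.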

Training used Unsloth with train_on_responses_only masking, so the loss was computed only on assistant response tokens. Before training, a three-part validation was run: chat template replacement verification, <think> tag strip confirmation, and a mask sanity check.
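
The idea behind response-only masking can be illustrated without any training framework. The sketch below is not Unsloth's implementation: it assumes the role spans are already known, uses toy token ids, and the helper `mask_non_assistant` is hypothetical. The only fixed convention is the label value -100, which PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch cross-entropy loss


def mask_non_assistant(token_ids, segments):
    """Return labels where only assistant tokens contribute to the loss.

    segments: list of (role, start, end) spans over token_ids, end exclusive.
    """
    labels = [IGNORE_INDEX] * len(token_ids)
    for role, start, end in segments:
        if role == "assistant":
            labels[start:end] = token_ids[start:end]
    return labels


# Toy example: three user tokens followed by four assistant tokens.
token_ids = [11, 12, 13, 21, 22, 23, 24]
segments = [("user", 0, 3), ("assistant", 3, 7)]

labels = mask_non_assistant(token_ids, segments)
print(labels)  # [-100, -100, -100, 21, 22, 23, 24]
```

A mask sanity check like the one mentioned above would then confirm that every unmasked label falls inside an assistant span.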

After training, LoRA adapters were merged into the base weights and exported as a full merged model.
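
Merging a LoRA adapter means folding the low-rank update into each target weight matrix: W_merged = W + (alpha / r) · B · A. The toy sketch below shows that arithmetic in pure Python on a 2×2 matrix. It is illustrative only: the matrices are made up, the helpers are hypothetical, and the real merge operates on the projection weights listed in Model Details. With this card's settings (alpha=64, r=32) the scaling factor is 2.0, which the toy reproduces with alpha=2, r=1.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Fold a LoRA update into the base weight: W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # toy base weight (out=2, in=2)
lora_a = [[0.5, -0.5]]             # r x in  = 1 x 2
lora_b = [[1.0], [2.0]]            # out x r = 2 x 1

merged = merge_lora(weight, lora_a, lora_b, alpha=2, r=1)
print(merged)  # [[2.0, -1.0], [2.0, -1.0]]
```

After merging, the adapter matrices are no longer needed at inference time, which is why the released checkpoint loads like an ordinary full model.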

Loss curve:

| Step | Train Loss | Val Loss |
| --- | --- | --- |
| 500 | 0.990 | 0.920 |
| 1000 | 1.020 | 0.900 |
| 1500 | 0.960 | 0.890 |
| Final | 0.940 | 0.890 |

Validation loss converged at 0.890. The final validation loss sits 0.050 below the final train loss, so there is no sign of overfitting over the single epoch.


Evaluation

Benchmark Results

Evaluated against Qwen2.5-1.5B-Instruct (base model) using lm-evaluation-harness with identical conditions: 4-bit inference, batch size 16, zero-shot strict evaluation.

| Task | Base (1.5B) | Atem v1 (1.5B) | Delta (percentage points) |
| --- | --- | --- | --- |
| ARC-Challenge | 43.7% | 45.5% | +1.8 |
| GSM8K | 23.0% | 53.0% | +30.0 |
| HellaSwag | 66.8% | 64.4% | -2.4 |

The GSM8K result is the primary finding. A +30 percentage point improvement on grade school mathematics reflects the targeted training on verified correct mathematical reasoning examples from multiple frontier teacher models.

The HellaSwag regression of 2.4% is modest and is a significant improvement over a prior exploratory training run using full fine-tuning, which produced a 16.2% regression on the same benchmark. This suggests LoRA largely preserved the base model's commonsense capabilities, as intended.

Comparison vs Qwen2.5-7B-Instruct

To contextualise the GSM8K result, Atem was benchmarked against Qwen2.5-7B-Instruct under the same zero-shot strict evaluation conditions.

| Model | Parameters | GSM8K (zero-shot strict) |
| --- | --- | --- |
| Qwen2.5-1.5B-Instruct | 1.5B | 23.0% |
| Atem v1 | 1.5B | 53.0% |
| Qwen2.5-7B-Instruct | 7B | 74.9% |

At baseline, the 1.5B model sits 51.9 points below the 7B. After training, Atem sits 21.9 points below, closing approximately 58% of the gap between the 1.5B and 7B models on GSM8K. Atem achieves 71% of Qwen2.5-7B-Instruct's GSM8K score at roughly 21% of its parameter count (1.5B vs 7B).

Note: Official Qwen2.5-7B-Instruct scores (91.6% GSM8K) use 4-shot chain-of-thought prompting. The 74.9% figure above reflects the same zero-shot strict evaluation format used for Atem, ensuring a fair direct comparison.
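
The gap figures quoted in this comparison follow directly from the three GSM8K scores in the table. The short sketch below reproduces that arithmetic; the variable names are illustrative only.

```python
# GSM8K scores (zero-shot strict) copied from the comparison table.
base_1p5b = 23.0
atem_v1 = 53.0
qwen_7b = 74.9

gap_before = qwen_7b - base_1p5b                  # distance from base 1.5B to 7B
gap_after = qwen_7b - atem_v1                     # distance from Atem v1 to 7B
gap_closed = (atem_v1 - base_1p5b) / gap_before   # fraction of the gap recovered
relative_score = atem_v1 / qwen_7b                # Atem score relative to 7B

print(round(gap_before, 1), round(gap_after, 1))
print(f"{gap_closed:.0%}", f"{relative_score:.0%}")
```

Rounded, this gives gaps of 51.9 and 21.9 points, about 58% of the gap closed, and about 71% of the 7B model's score.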

Qualitative Evaluation

Atem was evaluated against Qwen2.5-1.5B-Instruct across 30 domain-representative questions using matched system prompts, ensuring differences in output reflect trained capability rather than prompt engineering.

| Domain | Questions | Outcome |
| --- | --- | --- |
| Coding | 8 | Atem stronger: more thorough, better structured, catches edge cases |
| Mathematics | 6 | Comparable: both accurate on standard problems |
| Analytical Reasoning | 6 | Atem stronger: better structured arguments |
| General Knowledge | 5 | Comparable |
| Language & Logic | 5 | Atem stronger: correct fallacy identification, greater depth |

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "EphAsad/Atem-v1-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "Write a Python function that checks whether a number is prime."
    }
]

# return_dict=True returns input_ids together with the attention mask
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1000,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(
    output[0][prompt_length:],
    skip_special_tokens=True
)
print(response)

Unsloth (faster inference)

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="EphAsad/Atem-v1-1.5B",
    max_seq_length=4096,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {
        "role": "user",
        "content": "Explain the difference between a stack and a queue, with examples."
    }
]

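# return_dict=True returns input_ids together with the attention mask
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to("cuda")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1000,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt
prompt_length = inputs["input_ids"].shape[1]
print(tokenizer.decode(
    output[0][prompt_length:],
    skip_special_tokens=True
))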

Ollama

# Recommended — best speed/quality balance
ollama run hf.co/EphAsad/Atem-v1-1.5B:Q4_K_M

# Higher quality
ollama run hf.co/EphAsad/Atem-v1-1.5B:Q5_K_M

# Near-lossless
ollama run hf.co/EphAsad/Atem-v1-1.5B:Q8_0

llama.cpp

llama-server -hf EphAsad/Atem-v1-1.5B:Q4_K_M

System Prompt

Atem's identity is baked into the chat template and activates automatically when no system message is provided. For manual override:

You are Atem, a precise and analytical reasoning assistant. You approach 
every problem methodically — identifying core concepts, reasoning step by 
step, and arriving at well-supported conclusions. You show your thinking 
clearly and are thorough, direct, and intellectually honest.
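
To supply a system prompt explicitly, it is usually enough to prepend a system message to the conversation before applying the chat template. The sketch below shows one way to build that list; `build_messages` and the example prompt text are hypothetical, and the behaviour of the bundled chat template when a system message is present is assumed to follow the usual convention of using it in place of the built-in default.

```python
def build_messages(user_prompt, system_prompt=None):
    """Build a chat message list, optionally led by a system message."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


# Hypothetical custom prompt; the text shown above can be pasted here instead.
custom_prompt = "You are a concise assistant. Answer in at most three sentences."

messages = build_messages("What is a binary heap?", system_prompt=custom_prompt)
print([m["role"] for m in messages])  # ['system', 'user']
```

The resulting `messages` list can be passed to `tokenizer.apply_chat_template` exactly as in the Usage examples.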

Available Files

| File | Size | Description |
| --- | --- | --- |
| model.safetensors | ~3.1 GB | Full bfloat16 merged weights |
| Atem-1.5b.Q4_K_M.gguf | ~986 MB | 4-bit quantised, recommended |
| Atem-1.5b.Q5_K_M.gguf | ~1.1 GB | 5-bit quantised |
| Atem-1.5b.Q8_0.gguf | ~1.6 GB | 8-bit quantised, near-lossless |
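
The listed sizes can be sanity-checked from the parameter count and the bits stored per weight. This is a rough estimate under stated assumptions: ~1.54B parameters (from Model Details), 16 bits per weight for bfloat16, and 8.5 bits per weight for Q8_0 (32 int8 values plus one 16-bit scale per block). Real files also carry metadata and may keep some tensors at a different precision, so small deviations are expected. The helper `size_gb` is illustrative.

```python
PARAMS = 1.54e9  # approximate parameter count from Model Details


def size_gb(bits_per_weight, params=PARAMS):
    """Estimate file size in GB (10^9 bytes) from bits stored per weight."""
    return params * bits_per_weight / 8 / 1e9


bf16_gb = size_gb(16)    # bfloat16: 2 bytes per weight
q8_0_gb = size_gb(8.5)   # Q8_0: 34 bytes per block of 32 weights

print(round(bf16_gb, 2), round(q8_0_gb, 2))  # 3.08 1.64
```

Both estimates line up with the ~3.1 GB and ~1.6 GB figures in the table.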

Known Limitations

No thinking traces (Stage 1 by design). Think tags were stripped from all training data for Stage 1. The model does not produce extended <think> reasoning traces. Stage 2 training will layer this capability on top of the Stage 1 foundation.

Mathematical precision on complex problems. On multi-step calculations, the model may make arithmetic slips in intermediate steps while arriving at a structurally correct approach. Answers to high-stakes mathematical problems should be independently verified.

HellaSwag regression. A 2.4% regression on HellaSwag commonsense completion is observed. This is minor and substantially better than the 16.2% regression produced by the earlier exploratory full fine-tune run, suggesting that LoRA largely preserved base commonsense capability.


Roadmap

Atem v1 establishes the Stage 1 foundation. Planned next steps:

  • Stage 2: LoRA SFT on curated chain-of-thought data to add thinking trace capability — using Complex_CoT, inverted_reasoning, and reasoning trace columns held out from Stage 1 training
  • Extended benchmarks: MMLU, BBH, IFEval, WinoGrande, MBPP post-Stage 2
  • Atem v2: Expanded corpus, further domain coverage

Citation

@misc{atem_v1_2026,
  author       = {Asad, Zain},
  title        = {Atem v1: A 1.5B Reasoning Model via 
                  Multi-Source Knowledge Distillation},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/EphAsad/Atem-v1-1.5B}},
}

Support

If you find this model useful for your research or projects, you can support further development of my datasets and models here:
ko-fi.com/ephraim123


License

Released under the Apache 2.0 License, consistent with the base model Qwen2.5-1.5B-Instruct.


Built independently by EphAsad
