AUM-1-70B
AUM (ॐ): the primordial sound. The first. The foundation.

AUM-1-70B is a 70-billion parameter thinking model built on LLaMA 3 70B. It is the first model in the AUM series, a research effort focused on building open, transparent, reasoning-first language models through knowledge distillation, supervised fine-tuning, and synthetic data generation.

AUM-1-70B externalizes its reasoning inside `<think>` tags before producing a final answer, giving full transparency into how the model arrives at its conclusions.
Model Details
| Property | Value |
|---|---|
| Base Model | meta-llama/Meta-Llama-3-70B |
| Parameters | 70B |
| Architecture | LLaMA 3 (decoder-only transformer) |
| Training Method | Distillation + SFT + Thinking Traces |
| Thinking Format | `<think>...</think>` tags (trained-in, not prompted) |
| Precision | bfloat16 |
| Context Length | 8,192 tokens |
| Release Date | September 2025 |
| License | LLaMA 3 Community License |
What Makes AUM Different
Most fine-tuned models are trained to produce answers. AUM is trained to produce reasoning: the full chain of thought that leads to an answer.
This is inspired by the Orca paper (Microsoft, 2023), which showed that smaller models can match much larger ones by learning from the reasoning traces of frontier models, not just their outputs.
AUM combines three training strategies:
1. Knowledge Distillation (Orca-style). Frontier models (GPT-4, Claude) were used to generate detailed reasoning trajectories. AUM learned to think by imitating how much larger models reason, internalizing step-by-step decomposition, self-correction, and structured thinking.
2. Benchmark-Specific SFT. The model was fine-tuned on the training splits of popular public benchmarks, which teaches the underlying skills without contaminating held-out test sets.
3. Thinking Format Training. AUM is trained to wrap internal reasoning in `<think>` tags. This is not a prompt trick: the model learned the format from training data where reasoning traces were explicitly structured this way.
Thinking Format
AUM outputs reasoning before its final answer:
```
User: A train travels 120 km in 1.5 hours. What is its average speed?

AUM: <think>
The formula for average speed is distance divided by time.
Distance = 120 km
Time = 1.5 hours
Speed = 120 / 1.5 = 80 km/h
</think>
The average speed of the train is 80 km/h.
```
How to Run
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Nitish-Garikoti/aum-1-70B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "What is the derivative of x^3 + 2x^2 - 5x + 1?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Parsing the Think Block
```python
import re

def parse_aum_response(text):
    """Split an AUM response into its thinking trace and final answer."""
    think = re.search(r'<think>(.*?)</think>', text, re.DOTALL)
    answer = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL).strip()
    return {
        "thinking": think.group(1).strip() if think else None,
        "answer": answer,
    }
```
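If generation stops early (for example when `max_new_tokens` is reached mid-reasoning), the closing `</think>` tag may never be emitted. A tolerant variant of the parser, shown here as a sketch rather than an official utility, degrades gracefully in that case:

```python
import re

def parse_aum_response_tolerant(text):
    """Like parse_aum_response, but tolerates a missing </think> tag."""
    # Match up to </think> if present, otherwise up to the end of the text.
    match = re.search(r'<think>(.*?)(?:</think>|$)', text, re.DOTALL)
    thinking = match.group(1).strip() if match else None
    # Remove the (possibly unterminated) think block to isolate the answer.
    answer = re.sub(r'<think>.*?(?:</think>|$)', '', text, flags=re.DOTALL).strip()
    return {"thinking": thinking, "answer": answer}
```

An unterminated block yields the partial reasoning and an empty answer, which a caller can use to detect truncated generations.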
Training Details
Datasets
AUM-1-70B was trained on a multi-layered dataset combining public benchmarks and private synthetic data.
Public Datasets (Train Splits Only)
Only train splits were used; test splits remain untouched to preserve benchmark integrity.
| Dataset | Domain | Purpose |
|---|---|---|
| open-thoughts/OpenThoughts-114k | Reasoning | Core thinking traces; teaches the `<think>` format |
| openai/gsm8k | Math | Arithmetic and multi-step reasoning |
| AI-MO/NuminaMath-CoT | Math | Advanced math with chain-of-thought |
| openai/humaneval | Coding | Python function generation |
| princeton-nlp/SWE-bench | Coding | Real-world GitHub issue resolution |
| cais/mmlu | Knowledge | Multi-domain academic QA |
| EleutherAI/hellaswag | NLU | Commonsense reasoning |
| allenai/ai2_arc | Science | Multi-step science QA |
| HuggingFaceH4/ultrafeedback_binarized | Alignment | Instruction following |
Private Synthetic Datasets
A significant portion of AUM's training data is private and was generated using Orca-style distillation:
- Reasoning trajectories: generated by prompting frontier models (GPT-4, Claude) with diverse tasks, capturing full chain-of-thought responses formatted with `<think>` tags
- Task-specific SFT data: custom instruction-response pairs targeting specific capability gaps
- Benchmark augmentation: synthetic variants of public benchmark problems to increase diversity
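To make the data format concrete, here is a minimal sketch of how one distilled training sample could be assembled; `build_training_sample` is illustrative, not part of any released AUM tooling:

```python
# Hypothetical sketch: packaging a teacher model's chain-of-thought
# into the <think>-formatted completion that AUM is trained to imitate.
def build_training_sample(question, teacher_reasoning, teacher_answer):
    completion = f"<think>\n{teacher_reasoning}\n</think>\n{teacher_answer}"
    return {"prompt": question, "completion": completion}

sample = build_training_sample(
    "What is 12 * 9?",
    "12 * 9 = 12 * 10 - 12 = 120 - 12 = 108.",
    "12 * 9 = 108.",
)
```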
Why This Is Not Contamination
AUM uses only the train splits of public benchmarks. The model learns the skill (e.g., mathematical reasoning), not the specific test answers. This is the same methodology used by DeepSeek, Qwen, and other leading open-weight models.
Training Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 2e-5 (cosine decay) |
| Batch Size | 32 (gradient accumulation × 8) |
| Epochs | 3 |
| Max Sequence Length | 4,096 tokens |
| Optimizer | AdamW (β1=0.9, β2=0.95) |
| Warmup Steps | 100 |
| Mixed Precision | bfloat16 |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| LoRA Target Modules | q_proj, v_proj, k_proj, o_proj |
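For reference, the LoRA settings above map onto a PEFT `LoraConfig` roughly as follows; the dropout value and `task_type` are assumptions, since the card does not report them:

```python
# Sketch of the reported LoRA settings as a PEFT LoraConfig.
# r, lora_alpha, and target_modules come from the table above;
# lora_dropout and task_type are assumed, not reported.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,  # assumption: a common default
    task_type="CAUSAL_LM",
)
```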
Evaluation
| Benchmark | Domain | Metric | Score |
|---|---|---|---|
| GSM8K (test) | Math | Accuracy | ~88.5% |
| MMLU (test) | Knowledge | Accuracy | ~79.2% |
| HumanEval (test) | Coding | Pass@1 | ~74.4% |
| HellaSwag (test) | NLU | Accuracy | ~87.3% |
| ARC-Challenge (test) | Science | Accuracy | ~80.1% |
Hardware Requirements
| Setup | Configuration |
|---|---|
| Inference (full precision) | 2× A100 80GB |
| Inference (4-bit quantized) | 1× A100 40GB |
| Fine-tuning (LoRA) | 4× A100 80GB |
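The single-GPU row assumes 4-bit quantization; with `bitsandbytes` installed, loading could look like the sketch below (the quantization settings are common choices, not values published with the model):

```python
# Sketch: loading AUM-1-70B in 4-bit to fit a single 40 GB GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Nitish-Garikoti/aum-1-70B"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: match training precision
    bnb_4bit_quant_type="nf4",              # assumption: common NF4 default
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```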
Limitations
- Context window: 8,192 tokens; long documents require chunking
- `<think>` overhead: reasoning blocks add to the token count, so set `max_new_tokens` accordingly
- English-primary: trained predominantly on English data
- Not RLHF-aligned: SFT-only model; may not refuse harmful requests reliably
- Hallucination: Like all LLMs, can produce confident but incorrect reasoning
Citation
```bibtex
@misc{garikoti2025aum,
  title={AUM-1-70B: A Thinking Model via Distillation and Task-Specific Fine-Tuning},
  author={Garikoti, Nitish},
  year={2025},
  url={https://huggingface.co/Nitish-Garikoti/aum-1-70B}
}
```
Built by Nitish Garikoti