AUM-1-70B
AUM (ॐ): the primordial sound. The first. The foundation.

AUM-1-70B is a 70-billion parameter thinking model built on LLaMA 3 70B. It is the first model in the AUM series, a research effort focused on building open, transparent, reasoning-first language models through knowledge distillation, supervised fine-tuning, and synthetic data generation.

AUM-1-70B externalizes its reasoning inside `<think>` tags before producing a final answer, giving full transparency into how the model arrives at its conclusions.
Model Details
| Property | Value |
|---|---|
| Base Model | meta-llama/Meta-Llama-3-70B |
| Parameters | 70B |
| Architecture | LLaMA 3 (decoder-only transformer) |
| Training Method | Distillation + SFT + Thinking Traces |
| Thinking Format | `<think>...</think>` tags (trained-in, not prompted) |
| Precision | bfloat16 |
| Context Length | 8,192 tokens |
| Release Date | September 2025 |
| License | LLaMA 3 Community License |
What Makes AUM Different
Most fine-tuned models are trained to produce answers. AUM is trained to produce reasoning: the full chain of thought that leads to an answer.
This is inspired by the Orca paper (Microsoft, 2023), which showed that smaller models can match much larger ones by learning from the reasoning traces of frontier models, not just their outputs.
AUM combines three training strategies:
1. Knowledge Distillation (Orca-style). Frontier models (GPT-4, Claude) were used to generate detailed reasoning trajectories. AUM learned to think by imitating how much larger models reason, internalizing step-by-step decomposition, self-correction, and structured thinking.
2. Benchmark-Specific SFT. The model was fine-tuned on the training splits of popular public benchmarks, which teaches the underlying skills without contaminating held-out test sets.
3. Thinking Format Training. AUM is trained to wrap internal reasoning in `<think>` tags. This is not a prompt trick: the model learned the format from training data where reasoning traces were explicitly structured this way.
Thinking Format
AUM outputs reasoning before its final answer:
```
User: A train travels 120 km in 1.5 hours. What is its average speed?

AUM: <think>
The formula for average speed is distance divided by time.
Distance = 120 km
Time = 1.5 hours
Speed = 120 / 1.5 = 80 km/h
</think>
The average speed of the train is 80 km/h.
```
How to Run
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Nitish-Garikoti/aum-1-70B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "What is the derivative of x^3 + 2x^2 - 5x + 1?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Parsing the Think Block
```python
import re

def parse_aum_response(text):
    """Split an AUM response into its thinking trace and final answer."""
    think = re.search(r'<think>(.*?)</think>', text, re.DOTALL)
    answer = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL).strip()
    return {
        "thinking": think.group(1).strip() if think else None,
        "answer": answer,
    }
```
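If generation stops early (for example when `max_new_tokens` is reached mid-reasoning), the closing `</think>` tag may never be emitted. A tolerant variant of the parser, shown here as a sketch rather than an official utility, degrades gracefully in that case:

```python
import re

def parse_aum_response_tolerant(text):
    """Like parse_aum_response, but tolerates a missing </think> tag."""
    # Match up to </think> if present, otherwise up to the end of the text.
    match = re.search(r'<think>(.*?)(?:</think>|$)', text, re.DOTALL)
    thinking = match.group(1).strip() if match else None
    # Remove the (possibly unterminated) think block to isolate the answer.
    answer = re.sub(r'<think>.*?(?:</think>|$)', '', text, flags=re.DOTALL).strip()
    return {"thinking": thinking, "answer": answer}
```

An unterminated block yields the partial reasoning and an empty answer, which a caller can use to detect truncated generations.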
Training Details
Datasets
AUM-1-70B was trained on a multi-layered dataset combining public benchmarks and private synthetic data.
Public Datasets (Train Splits Only)
Only train splits were used; test splits remain untouched to preserve benchmark integrity.
| Dataset | Domain | Purpose |
|---|---|---|
| open-thoughts/OpenThoughts-114k | Reasoning | Core thinking traces; teaches the `<think>` format |
| openai/gsm8k | Math | Arithmetic and multi-step reasoning |
| AI-MO/NuminaMath-CoT | Math | Advanced math with chain-of-thought |
| openai/humaneval | Coding | Python function generation |
| princeton-nlp/SWE-bench | Coding | Real-world GitHub issue resolution |
| cais/mmlu | Knowledge | Multi-domain academic QA |
| EleutherAI/hellaswag | NLU | Commonsense reasoning |
| allenai/ai2_arc | Science | Multi-step science QA |
| HuggingFaceH4/ultrafeedback_binarized | Alignment | Instruction following |
Private Synthetic Datasets
A significant portion of AUM's training data is private and was generated using Orca-style distillation:
- Reasoning trajectories: generated by prompting frontier models (GPT-4, Claude) with diverse tasks, capturing full chain-of-thought responses formatted with `<think>` tags
- Task-specific SFT data: custom instruction-response pairs targeting specific capability gaps
- Benchmark augmentation: synthetic variants of public benchmark problems to increase diversity
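To make the data format concrete, here is a minimal sketch of how one distilled training sample could be assembled; `build_training_sample` is illustrative, not part of any released AUM tooling:

```python
# Hypothetical sketch: packaging a teacher model's chain-of-thought
# into the <think>-formatted completion that AUM is trained to imitate.
def build_training_sample(question, teacher_reasoning, teacher_answer):
    completion = f"<think>\n{teacher_reasoning}\n</think>\n{teacher_answer}"
    return {"prompt": question, "completion": completion}

sample = build_training_sample(
    "What is 12 * 9?",
    "12 * 9 = 12 * 10 - 12 = 120 - 12 = 108.",
    "12 * 9 = 108.",
)
```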
Why This Is Not Contamination
AUM uses only the train splits of public benchmarks. The model learns the skill (e.g., mathematical reasoning), not the specific test answers. This is the same methodology used by DeepSeek, Qwen, and other leading open-weight models.
Training Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 2e-5 (cosine decay) |
| Batch Size | 32 (gradient accumulation × 8) |
| Epochs | 3 |
| Max Sequence Length | 4,096 tokens |
| Optimizer | AdamW (β1=0.9, β2=0.95) |
| Warmup Steps | 100 |
| Mixed Precision | bfloat16 |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| LoRA Target Modules | q_proj, v_proj, k_proj, o_proj |
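For reference, the LoRA settings above map onto a PEFT `LoraConfig` roughly as follows; the dropout value and `task_type` are assumptions, since the card does not report them:

```python
# Sketch of the reported LoRA settings as a PEFT LoraConfig.
# r, lora_alpha, and target_modules come from the table above;
# lora_dropout and task_type are assumed, not reported.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,  # assumption: a common default
    task_type="CAUSAL_LM",
)
```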
Evaluation
| Benchmark | Domain | Metric | Score |
|---|---|---|---|
| GSM8K (test) | Math | Accuracy | ~88.5% |
| MMLU (test) | Knowledge | Accuracy | ~79.2% |
| HumanEval (test) | Coding | Pass@1 | ~74.4% |
| HellaSwag (test) | NLU | Accuracy | ~87.3% |
| ARC-Challenge (test) | Science | Accuracy | ~80.1% |
Hardware Requirements
| Setup | Configuration |
|---|---|
| Inference (full precision) | 2× A100 80GB |
| Inference (4-bit quantized) | 1× A100 40GB |
| Fine-tuning (LoRA) | 4× A100 80GB |
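The single-GPU row assumes 4-bit quantization; with `bitsandbytes` installed, loading could look like the sketch below (the quantization settings are common choices, not values published with the model):

```python
# Sketch: loading AUM-1-70B in 4-bit to fit a single 40 GB GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Nitish-Garikoti/aum-1-70B"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: match training precision
    bnb_4bit_quant_type="nf4",              # assumption: common NF4 default
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```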
Limitations
- Context window: 8,192 tokens; long documents require chunking
- `<think>` overhead: reasoning blocks add to the token count, so set `max_new_tokens` accordingly
- English-primary: trained predominantly on English data
- Not RLHF-aligned: SFT-only model; may not refuse harmful requests reliably
- Hallucination: Like all LLMs, can produce confident but incorrect reasoning
Citation
```bibtex
@misc{garikoti2025aum,
  title={AUM-1-70B: A Thinking Model via Distillation and Task-Specific Fine-Tuning},
  author={Garikoti, Nitish},
  year={2025},
  url={https://huggingface.co/Nitish-Garikoti/aum-1-70B}
}
```
Built by Nitish Garikoti