# DaisyCore — daisy_milli

## Model Description

A DaisyCore transformer with 26 layers, 14 attention heads, and a model dimension of 1,792. It uses block-causal sliding-window attention (window size 2,048) with the standard attention implementation.
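The masking constraint behind sliding-window causal attention can be sketched at token level (a simplification: the block-causal variant the card names applies the same rule at block granularity, which this sketch does not model):

```python
def sliding_window_causal_mask(seq_len, window):
    """Token i may attend to token j iff j <= i and i - j < window.

    Token-level simplification of the model's block-causal
    sliding-window attention (window size 2,048 in this model).
    """
    return [[j <= i and i - j < window for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_causal_mask(6, 3)
```

With a window of 3, row 5 of `mask` allows positions 3, 4, and 5 only: causal, and never looking back further than the window.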

## Architecture

| Property | Value |
| --- | --- |
| Architecture | DaisyCore |
| Layers | 26 |
| Attention Heads | 14 |
| Model Dimension | 1,792 |
| Head Dimension | 128 |
| Sliding Window Size | 2,048 |
| Max Sequence Length | 131,072 |
| Vocabulary Size | 49,152 |
| Attention Implementation | standard |
| Value Embeddings | True |
| Tied Embeddings | False |
| Skip Mix Mode | linear |
| Tokenizer | jonathanmiddleton/daisy |
| Dtype | bfloat16 |
| Parameters (total) | 2,323,120,245 |
| Parameters (non-embedding) | 1,001,914,485 |
| Parameters (embedding) | 1,321,205,760 |
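The table's figures are internally consistent, which is worth checking when reading a model card: the per-head width times the head count recovers the model dimension, and the embedding and non-embedding parameter counts sum to the total.

```python
num_heads, head_dim, model_dim = 14, 128, 1792

# 14 heads of width 128 reconstruct the 1,792-dim residual stream.
assert num_heads * head_dim == model_dim

emb, non_emb, total = 1_321_205_760, 1_001_914_485, 2_323_120_245

# Embedding + non-embedding parameters account for the full count.
assert emb + non_emb == total
```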

## Training Progress

| Metric | Value |
| --- | --- |
| Checkpoint Step | 39,300 |
| Tokens Processed | 123.46B (123,459,993,600) |
| Target Tokens | 150.00B (150,000,000,000) |
| Progress | 82.3% |
| Best Validation Loss | 1.38709 |
| Evaluations Performed | 814 |
| Saved | 2026-03-03 23:08 UTC |
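The progress figure follows directly from the token counts above:

```python
tokens_processed = 123_459_993_600
target_tokens = 150_000_000_000

progress = tokens_processed / target_tokens
remaining = target_tokens - tokens_processed

print(f"{progress:.1%}")        # 82.3%
print(f"{remaining:,} tokens")  # 26,540,006,400 tokens
```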

## Training Configuration

### Optimizers

| Optimizer | Parameter Group | Learning Rate |
| --- | --- | --- |
| AdamW | head_params | 0.003216 |
| AdamW | embed_params | 0.1865 |
| AdamW | scalar_params | 0.02099 |
| Muon | hidden_matrix_params | 0.025 |

### Schedule & Regularization

| Parameter | Value |
| --- | --- |
| LR Scale | 1.0 |
| LR Schedule | n_phase_linear |
| LR Schedule — begin_after_fraction | 0.0 |
| LR Schedule — cooldown_fraction | 0.0 |
| LR Schedule — floor | 0.0 |
| LR Schedule — phases | [{'progress': 0.0, 'scale': 1.0}, {'progress': 0.5766, 'scale': 0.23492}, {'progress': 0.7733, 'scale': 0.195}, {'progress': 0.818, 'scale': 0.05}, {'progress': 1, 'scale': 0.05}] |
| LR Schedule — warmup_fraction | 0.0 |
| Gradient Accumulation Steps | 3 |
| Muon Warmup Steps | 300 |
| Seed | 1337 |
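The `n_phase_linear` schedule reads as piecewise-linear interpolation of an LR scale between (progress, scale) anchor points. A minimal sketch, assuming that semantics (the actual implementation may differ; warmup and cooldown fractions are 0.0 here and omitted):

```python
def n_phase_linear(progress, phases, floor=0.0):
    """Linearly interpolate the LR scale between phase anchor points."""
    pts = [(p["progress"], p["scale"]) for p in phases]
    for (p0, s0), (p1, s1) in zip(pts, pts[1:]):
        if p0 <= progress <= p1:
            t = 0.0 if p1 == p0 else (progress - p0) / (p1 - p0)
            return max(floor, s0 + t * (s1 - s0))
    return max(floor, pts[-1][1])  # clamp past the last anchor

# Phase table from this run's configuration.
phases = [
    {"progress": 0.0, "scale": 1.0},
    {"progress": 0.5766, "scale": 0.23492},
    {"progress": 0.7733, "scale": 0.195},
    {"progress": 0.818, "scale": 0.05},
    {"progress": 1.0, "scale": 0.05},
]
```

Under this reading, the checkpoint's 82.3% progress sits on the final 0.05 plateau, so for example the Muon group's effective LR would be 0.025 × 0.05 = 0.00125.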

## Training Data

| Type | Sequence Length | Path |
| --- | --- | --- |
| fineweb-edu-dedup | 16,384 | data/fineweb-edu-dedup/fineweb-edu-dedup_jonathanmiddleton_daisy_train_*.bin |
| daisy_assistant_v1 | 1,024 | data/daisy_assistant_v1 |
| daisypie_smart | 16,384 | data/daisypie_smart |
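The fineweb-edu-dedup path is a glob pattern over token shards. A small sketch of how such a pattern selects shard files; the filenames below are hypothetical examples, since the actual shard count and naming are not part of this card:

```python
from fnmatch import fnmatch

pattern = ("data/fineweb-edu-dedup/"
           "fineweb-edu-dedup_jonathanmiddleton_daisy_train_*.bin")

# Hypothetical directory listing; a real run would use glob.glob(pattern).
files = [
    "data/fineweb-edu-dedup/fineweb-edu-dedup_jonathanmiddleton_daisy_train_000.bin",
    "data/fineweb-edu-dedup/fineweb-edu-dedup_jonathanmiddleton_daisy_train_001.bin",
    "data/fineweb-edu-dedup/README.md",
]
shards = [f for f in files if fnmatch(f, pattern)]
```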

## Checkpoint Provenance

- Resumed from: JonathanMiddleton/daisy-milli-18d-1

## All Hyperparameters

| Parameter | Value |
| --- | --- |
| window_size | 2048 |
| vocab_size | 49152 |
| eos_token_id | 49131 |
| num_layers | 26 |
| num_heads | 14 |
| model_dim | 1792 |
| head_dim | 128 |
| max_seq_len | 131072 |
| model_spec | daisy_milli |
| model_class | models.daisy.daisy_core.DaisyCore |
| target_tokens | 100000000000 |
| full_window_target_tokens | 3000000000 |
| torch_coordinate_descent_tuning | False |
| torch_inductor_config_max_autotune | False |
| overfit | False |
| full_windows | False |
| wandb_log | True |
| wandb_project | milli |
| wandb_run_name | milli_v18d.1.1 |
| wandb_group | pretrain |
| resume_checkpoint | JonathanMiddleton/daisy-milli-18d-1 |
| resume_target_tokens_override | 150000000000 |
| use_value_embeddings | True |
| use_tied_embeddings | False |
| seed | 1337 |
| task_val_debug_log_samples | False |
| log_interval | 16384 |
| muon_warmup_steps | 300 |
| lr_scale | 1.0 |
| cooldown_fraction | 0.0 |
| lr_schedule | {"name": "n_phase_linear", "config": {"cooldown_fraction": 0.0, "phases": [{"progress": 0.0, "scale": 1.0}, {"progress": 0.5766, "scale": 0.23492}, {"progress": 0.7733, "scale": 0.195}, {"progress": 0.818, "scale": 0.05}, {"progress": 1, "scale": 0.05}], "floor": 0.0, "warmup_fraction": 0.0, "begin_after_fraction": 0.0}} |
| grad_acc_steps | 3 |
| val_loss_every_tokens | 324403200 |
| checkpoint_warmup_tokens | 93000000000 |
| checkpoint_per_n_tokens | 0 |
| save_checkpoint | True |
| benchmarks_frequency | 1 |
| mmlu_cache_bin_path | data/mmlu_cache/mmlu_cache.bin |
| mmlu_cache_bin_rebuild | False |
| task_training | False |
| track_last_n_layers | 0 |
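The `lr_schedule` entry is stored as a JSON string and parses directly, which is convenient for inspecting the phase table programmatically:

```python
import json

# lr_schedule value copied from the hyperparameter table above.
raw = (
    '{"name": "n_phase_linear", "config": {"cooldown_fraction": 0.0, '
    '"phases": [{"progress": 0.0, "scale": 1.0}, '
    '{"progress": 0.5766, "scale": 0.23492}, '
    '{"progress": 0.7733, "scale": 0.195}, '
    '{"progress": 0.818, "scale": 0.05}, '
    '{"progress": 1, "scale": 0.05}], '
    '"floor": 0.0, "warmup_fraction": 0.0, "begin_after_fraction": 0.0}}'
)
schedule = json.loads(raw)

print(schedule["name"])                    # n_phase_linear
print(len(schedule["config"]["phases"]))   # 5
```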