# DaisyCore - daisy_milli

## Model Description

daisy_milli is a DaisyCore transformer with 26 layers, 14 attention heads, and a model dimension of 1,792 (head dimension 128). It uses block-causal sliding-window attention with a window size of 2,048 tokens and the standard attention implementation.
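For illustration, here is a minimal sketch of the attention mask this implies. It assumes "block-causal" means tokens attend causally within their own document block only; the function name and shapes are illustrative, not taken from the DaisyCore code.

```python
import torch

def block_causal_sliding_window_mask(doc_ids: torch.Tensor, window: int = 2048) -> torch.Tensor:
    """doc_ids: (seq_len,) tensor mapping each token to its document id.
    Returns a (seq_len, seq_len) boolean mask where True = may attend."""
    seq_len = doc_ids.shape[0]
    pos = torch.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]             # query i sees keys j <= i
    in_window = pos[:, None] - pos[None, :] < window  # ...and only j > i - window
    same_doc = doc_ids[:, None] == doc_ids[None, :]   # block-causal: stay in-document
    return causal & in_window & same_doc

# Example: two packed documents of lengths 3 and 2, window of 2 tokens.
mask = block_causal_sliding_window_mask(torch.tensor([0, 0, 0, 1, 1]), window=2)
```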

## Architecture

| Property | Value |
|---|---|
| Architecture | DaisyCore |
| Layers | 26 |
| Attention Heads | 14 |
| Model Dimension | 1,792 |
| Head Dimension | 128 |
| Sliding Window Size | 2,048 |
| Max Sequence Length | 131,072 |
| Vocabulary Size | 49,152 |
| Attention Implementation | standard |
| Value Embeddings | True |
| Tied Embeddings | False |
| Skip Mix Mode | linear |
| Tokenizer | jonathanmiddleton/daisy |
| Dtype | bfloat16 |
| Parameters (total) | 2,323,120,245 |
| Parameters (non-embedding) | 1,001,914,485 |
| Parameters (embedding) | 1,321,205,760 |
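The parameter counts are internally consistent, and the embedding total divides evenly into vocab-sized tables. A quick arithmetic check (the 15-table split, plausibly input embedding, untied output head, and value embeddings, is an inference, as is the 12·d² per-layer estimate):

```python
vocab_size, model_dim, num_layers, num_heads = 49_152, 1_792, 26, 14
total, non_embed, embed = 2_323_120_245, 1_001_914_485, 1_321_205_760

assert model_dim // num_heads == 128   # head dimension checks out
assert non_embed + embed == total      # the three counts are consistent
per_table = vocab_size * model_dim     # 88,080,384 params per vocab-sized table
assert embed == 15 * per_table         # 15 vocab-sized embedding tables
# Assuming ~12*d^2 weights per layer (QKVO plus a 4x MLP), almost everything
# else is accounted for; the small remainder would be scalar parameters.
print(non_embed - num_layers * 12 * model_dim**2)  # -> 117
```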

## Training Progress

| Metric | Value |
|---|---|
| Checkpoint Step | 50 |
| Tokens Processed | 104.86M (104,857,600) |
| Target Tokens | 376.05M (376,045,116) |
| Progress | 27.9% |
| Best Validation Loss | 1.35229 |
| Evaluations Performed | 2 |
| HellaSwag (acc_norm) | 60.19% |
| MMLU (acc) | 34.20% |
| Saved | 2026-03-10 17:46 UTC |
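The throughput numbers line up with the hyperparameters listed below: 104,857,600 tokens over 50 optimizer steps is 2,097,152 tokens per step, which equals grad_acc_steps times the 16,384-token sequence length (assuming one sequence per micro-batch), and the two evaluations match val_loss_every_tokens. A quick check:

```python
tokens, steps = 104_857_600, 50
grad_acc_steps, seq_len = 128, 16_384

assert tokens // steps == grad_acc_steps * seq_len     # 2,097,152 tokens per step
assert tokens // 52_428_800 == 2                       # matches "Evaluations Performed"
print(f"progress: {100 * tokens / 376_045_116:.1f}%")  # -> progress: 27.9%
```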

## Training Configuration

### Optimizers

| Optimizer | Parameter Group | Learning Rate |
|---|---|---|
| AdamW | head_params | 0.003216 |
| AdamW | embed_params | 0.1865 |
| AdamW | scalar_params | 0.02099 |
| Muon | hidden_matrix_params | 0.025 |
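This split follows the usual Muon recipe: 2-D hidden weight matrices are optimized with Muon, while embeddings, the output head, and scalar parameters stay on AdamW. Below is a minimal sketch of how such groups might be built; the name- and shape-based selection rules are assumptions, and only the group names and learning rates come from the table above.

```python
import torch

def build_optimizers(model: torch.nn.Module):
    head, embed, scalars, hidden = [], [], [], []
    for name, p in model.named_parameters():
        if "embed" in name:      # token / value embedding tables
            embed.append(p)
        elif "head" in name:     # output projection
            head.append(p)
        elif p.ndim < 2:         # norms, gains, skip-mix scalars
            scalars.append(p)
        else:                    # 2-D hidden weight matrices
            hidden.append(p)
    adamw = torch.optim.AdamW([
        {"params": head,    "lr": 0.003216},
        {"params": embed,   "lr": 0.1865},
        {"params": scalars, "lr": 0.02099},
    ])
    # Muon is not part of torch; SGD-with-momentum stands in here so the
    # sketch runs. The actual run uses Muon at lr=0.025 for these matrices.
    muon_like = torch.optim.SGD(hidden, lr=0.025, momentum=0.95)
    return adamw, muon_like
```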

### Schedule & Regularization

| Parameter | Value |
|---|---|
| LR Scale | 1.0 |
| LR Schedule | n_phase_linear |
| LR Schedule: begin_after_fraction | 0.0 |
| LR Schedule: cooldown_fraction | 0.0 |
| LR Schedule: floor | 0.0 |
| LR Schedule: phases | `[{'progress': 0.0, 'scale': 0.05}, {'progress': 0.1, 'scale': 0.05}, {'progress': 1, 'scale': 0.001}]` |
| LR Schedule: warmup_fraction | 0.0 |
| Gradient Accumulation Steps | 128 |
| Muon Warmup Steps | 300 |
| Seed | 1337 |
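The card does not define n_phase_linear, but a plausible reading of the phase list is piecewise-linear interpolation between the (progress, scale) knots: hold the LR multiplier at 0.05 for the first 10% of training, then decay linearly to 0.001 at the end. A sketch under that assumption:

```python
def n_phase_linear(progress: float, phases: list[dict], floor: float = 0.0) -> float:
    """Map training progress in [0, 1] to an LR multiplier by interpolating
    linearly between the configured (progress, scale) knots."""
    if progress <= phases[0]["progress"]:
        return max(phases[0]["scale"], floor)
    for left, right in zip(phases, phases[1:]):
        if progress <= right["progress"]:
            t = (progress - left["progress"]) / (right["progress"] - left["progress"])
            return max(left["scale"] + t * (right["scale"] - left["scale"]), floor)
    return max(phases[-1]["scale"], floor)

phases = [{"progress": 0.0, "scale": 0.05},
          {"progress": 0.1, "scale": 0.05},
          {"progress": 1.0, "scale": 0.001}]
print(n_phase_linear(0.279, phases))  # scale at this checkpoint, ~0.0403
```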

### Training Data

| Type | Sequence Length | Path |
|---|---|---|
| smart_sft | 16,384 | data/smart_sft |
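The on-disk format of data/smart_sft is not described in the card. A minimal loading sketch, assuming a flat binary file of uint16 token ids (the 49,152-entry vocabulary fits in uint16) cut into contiguous 16,384-token sequences:

```python
import numpy as np

def iter_sequences(path: str, seq_len: int = 16_384):
    """Yield fixed-length token sequences from a flat binary token file."""
    tokens = np.memmap(path, dtype=np.uint16, mode="r")
    for start in range(0, len(tokens) - seq_len + 1, seq_len):
        yield tokens[start : start + seq_len].astype(np.int64)
```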

## All Hyperparameters

| Parameter | Value |
|---|---|
| window_size | 2048 |
| vocab_size | 49152 |
| eos_token_id | 49131 |
| num_layers | 26 |
| num_heads | 14 |
| model_dim | 1792 |
| head_dim | 128 |
| max_seq_len | 131072 |
| model_spec | daisy_milli |
| model_class | models.daisy.daisy_core.DaisyCore |
| target_tokens | 376045116 |
| full_window_target_tokens | 376045116 |
| torch_coordinate_descent_tuning | False |
| torch_inductor_config_max_autotune | False |
| overfit | False |
| full_windows | True |
| wandb_log | True |
| wandb_project | milli |
| wandb_run_name | milli_v18de.2.1 |
| wandb_group | pretrain |
| init_model | JonathanMiddleton/daisy-milli-18de.2 |
| use_value_embeddings | True |
| use_tied_embeddings | False |
| seed | 1337 |
| task_val_debug_log_samples | False |
| log_interval | 16384 |
| muon_warmup_steps | 300 |
| lr_scale | 1.0 |
| cooldown_fraction | 0.0 |
| lr_schedule | `{"name": "n_phase_linear", "config": {"cooldown_fraction": 0.0, "phases": [{"progress": 0.0, "scale": 0.05}, {"progress": 0.1, "scale": 0.05}, {"progress": 1, "scale": 0.001}], "floor": 0.0, "warmup_fraction": 0.0, "begin_after_fraction": 0.0}}` |
| grad_acc_steps | 128 |
| val_loss_every_tokens | 52428800 |
| checkpoint_warmup_tokens | 1 |
| checkpoint_per_n_tokens | 0 |
| save_checkpoint | True |
| benchmarks | `{"hellaswag_enabled": true, "hellaswag_frequency": 1, "mmlu_enabled": true, "mmlu_frequency": 1}` |
| mmlu_cache_bin_path | data/mmlu_cache/mmlu_cache.bin |
| mmlu_cache_bin_rebuild | False |
| task_training | False |
| track_last_n_layers | 0 |
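The tokenizer settings above can be spot-checked against the jonathanmiddleton/daisy repo from the Architecture table; whether that repo exposes a transformers-compatible tokenizer is an assumption:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("jonathanmiddleton/daisy")
print(tok.vocab_size, tok.eos_token_id)  # expect 49152, 49131
```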