# DaisyCore — daisy_milli

## Model Description

A DaisyCore transformer with 26 layers, 14 attention heads, and a model dimension of 1,792. It uses block-causal sliding-window attention (window size 2,048) with the standard attention implementation.
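
The attention pattern above can be sketched as a boolean mask. This is a minimal illustration of a causal sliding window (query `i` sees keys in `(i - window, i]`); the block-boundary handling implied by "block-causal" is not detailed on this card and is omitted here.

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list[list[bool]]:
    """True at [i][j] when query i may attend to key j.

    Causal with a sliding window: key j is visible iff 0 <= i - j < window,
    i.e. j lies in (i - window, i]. Block-boundary masking (the "block-causal"
    part) is left out for brevity.
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]

# With this model's settings the call would be
# sliding_window_causal_mask(131_072, 2_048); a tiny case shows the pattern:
mask = sliding_window_causal_mask(6, 3)
```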

## Architecture

| Property | Value |
|---|---|
| Architecture | DaisyCore |
| Layers | 26 |
| Attention Heads | 14 |
| Model Dimension | 1,792 |
| Head Dimension | 128 |
| Sliding Window Size | 2,048 |
| Max Sequence Length | 131,072 |
| Vocabulary Size | 49,152 |
| Attention Implementation | standard |
| Value Embeddings | True |
| Tied Embeddings | False |
| Skip Mix Mode | linear |
| Tokenizer | jonathanmiddleton/daisy |
| Dtype | bfloat16 |
| Parameters (total) | 2,323,120,245 |
| Parameters (non-embedding) | 1,001,914,485 |
| Parameters (embedding) | 1,321,205,760 |
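
The parameter counts above are self-consistent, and the embedding total works out to exactly fifteen vocab-by-dim matrices. How those fifteen split across input embeddings, the untied LM head, and the per-layer value embeddings is not stated on the card, so treat that breakdown as unknown; the arithmetic below only checks the published totals.

```python
vocab_size, model_dim = 49_152, 1_792

emb_params     = 1_321_205_760  # Parameters (embedding), from the table
non_emb_params = 1_001_914_485  # Parameters (non-embedding)
total_params   = 2_323_120_245  # Parameters (total)

# Totals add up exactly.
assert emb_params + non_emb_params == total_params

# Embedding total == 15 matrices of shape (vocab_size, model_dim);
# the split across input/head/value embeddings is not given on the card.
assert emb_params == 15 * vocab_size * model_dim
```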

## Training Progress

| Metric | Value |
|---|---|
| Checkpoint Step | 1,200 |
| Tokens Processed | 2.52B (2,516,582,400) |
| Target Tokens | 3.44B (3,442,039,096) |
| Progress | 73.1% |
| Best Validation Loss | 1.42815 |
| Evaluations Performed | 12 |
| Saved | 2026-02-26 17:34 UTC |
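
The progress figure checks out arithmetically. As a derived observation (not stated on the card), the token count also divides evenly by the checkpoint step, giving a power-of-two token budget per optimizer step.

```python
tokens_processed = 2_516_582_400
target_tokens    = 3_442_039_096
checkpoint_step  = 1_200

# Matches the 73.1% reported in the table.
progress = tokens_processed / target_tokens

# Derived, not stated on the card: tokens per step is exactly 2**21.
tokens_per_step = tokens_processed // checkpoint_step
```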

## Training Configuration

### Optimizers

| Optimizer | Parameter Group | Learning Rate |
|---|---|---|
| AdamW | head_params | 0.003216 |
| AdamW | embed_params | 0.1865 |
| AdamW | scalar_params | 0.02099 |
| Muon | hidden_matrix_params | 0.025 |
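
A common way to realize a split like this is to route each parameter tensor by name and shape into per-optimizer groups. The (optimizer, learning-rate) pairs below come from the table; the routing rules and the example parameter names are illustrative assumptions, not DaisyCore's actual code.

```python
# (optimizer, learning rate) per group — values from the table above.
OPTIMIZER_GROUPS = {
    "head_params":          ("AdamW", 3.216e-3),
    "embed_params":         ("AdamW", 1.865e-1),
    "scalar_params":        ("AdamW", 2.099e-2),
    "hidden_matrix_params": ("Muon",  2.5e-2),
}

def route_param(name: str, ndim: int) -> str:
    """Assign a parameter to a group. Rules are hypothetical, for illustration."""
    if "embed" in name:
        return "embed_params"
    if "head" in name:
        return "head_params"
    if ndim >= 2:
        return "hidden_matrix_params"  # 2-D hidden matrices go to Muon
    return "scalar_params"             # 1-D gains/biases stay on AdamW
```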

### Schedule & Regularization

| Parameter | Value |
|---|---|
| LR Scale | 1.0 |
| LR Schedule | n_phase_linear |
| LR Schedule — begin_after_fraction | 0.0 |
| LR Schedule — cooldown_fraction | 0.0 |
| LR Schedule — floor | 0.0 |
| LR Schedule — phases | `[{'progress': 0.0, 'scale': 0}, {'progress': 0.183, 'scale': 0.23237}, {'progress': 1.0, 'scale': 0.1}]` |
| LR Schedule — warmup_fraction | 0.0 |
| Gradient Accumulation Steps | 4 |
| Muon Warmup Steps | 300 |
| Seed | 1337 |
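
Under the natural reading of `n_phase_linear`, the LR scale interpolates linearly between the (progress, scale) phase points, with `floor` as a lower bound. The sketch below is a guess at the schedule's semantics under that assumption; the phase values are taken from the table.

```python
def n_phase_linear(progress: float, phases: list[dict], floor: float = 0.0) -> float:
    """Piecewise-linear LR scale over training progress in [0, 1].

    Assumed semantics: interpolate between consecutive phase points,
    clamp to the endpoints outside the range, never go below `floor`.
    """
    if progress <= phases[0]["progress"]:
        return max(phases[0]["scale"], floor)
    for a, b in zip(phases, phases[1:]):
        if progress <= b["progress"]:
            t = (progress - a["progress"]) / (b["progress"] - a["progress"])
            return max(a["scale"] + t * (b["scale"] - a["scale"]), floor)
    return max(phases[-1]["scale"], floor)

# Phase points from the table: ramp to 0.23237 by 18.3% of training,
# then decay linearly toward 0.1 at the end.
PHASES = [
    {"progress": 0.0,   "scale": 0.0},
    {"progress": 0.183, "scale": 0.23237},
    {"progress": 1.0,   "scale": 0.1},
]
```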

## Training Data

| Type | Sequence Length | Path |
|---|---|---|
| nemotron_cc_hq | 16,384 | `data/nemotron_cc_hq/high_synthetic_distill/high_synthetic_distill_jonathanmiddleton_daisy_train_*.bin` |
| daisy_assistant_v1 | 16,384 | `data/daisy_assistant_v1` |
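
The nemotron_cc_hq path uses a shell-style glob to select its shard files. Only the pattern itself comes from the table; the shard filename below (`..._000001.bin`) is a hypothetical example of what it would match.

```python
from fnmatch import fnmatch

# Glob pattern from the training-data table (basename only).
pattern = "high_synthetic_distill_jonathanmiddleton_daisy_train_*.bin"

# Hypothetical shard names for illustration; actual shard numbering
# is not shown on the card.
assert fnmatch("high_synthetic_distill_jonathanmiddleton_daisy_train_000001.bin", pattern)
assert not fnmatch("high_synthetic_distill_val_000001.bin", pattern)
```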

## All Hyperparameters

| Parameter | Value |
|---|---|
| window_size | 2048 |
| vocab_size | 49152 |
| eos_token_id | 49131 |
| num_layers | 26 |
| num_heads | 14 |
| model_dim | 1792 |
| head_dim | 128 |
| max_seq_len | 131072 |
| model_spec | daisy_milli |
| model_class | `models.daisy.daisy_core.DaisyCore` |
| target_tokens | 3442039096 |
| full_window_target_tokens | 3442039096 |
| torch_coordinate_descent_tuning | False |
| torch_inductor_config_max_autotune | False |
| overfit | False |
| full_windows | True |
| wandb_log | True |
| wandb_project | milli |
| wandb_run_name | milli_v18_mid_v12 |
| wandb_group | mid |
| init_model | JonathanMiddleton/daisy-milli-base-v18 |
| use_value_embeddings | True |
| use_tied_embeddings | False |
| seed | 1337 |
| task_val_debug_log_samples | False |
| log_interval | 16384 |
| muon_warmup_steps | 300 |
| lr_scale | 1.0 |
| cooldown_fraction | 0.0 |
| lr_schedule | `{"name": "n_phase_linear", "config": {"cooldown_fraction": 0.0, "phases": [{"progress": 0.0, "scale": 0}, {"progress": 0.183, "scale": 0.23237}, {"progress": 1.0, "scale": 0.1}], "floor": 0.0, "warmup_fraction": 0.0, "begin_after_fraction": 0.0}}` |
| grad_acc_steps | 4 |
| val_loss_every_tokens | 209715200 |
| checkpoint_warmup_tokens | 1000000000 |
| checkpoint_per_n_tokens | 0 |
| save_checkpoint | True |
| benchmarks_frequency | 2 |
| mmlu_cache_bin_path | `data/mmlu_cache/mmlu_cache.bin` |
| mmlu_cache_bin_rebuild | False |
| task_training | False |
| track_last_n_layers | 0 |
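
The lr_schedule value is stored as a JSON string rather than a nested config, so consumers can recover the structured form with a plain `json.loads`. The string below is reproduced verbatim from the table.

```python
import json

# lr_schedule value, copied verbatim from the hyperparameter table.
lr_schedule_raw = (
    '{"name": "n_phase_linear", "config": {"cooldown_fraction": 0.0, '
    '"phases": [{"progress": 0.0, "scale": 0}, '
    '{"progress": 0.183, "scale": 0.23237}, '
    '{"progress": 1.0, "scale": 0.1}], '
    '"floor": 0.0, "warmup_fraction": 0.0, "begin_after_fraction": 0.0}}'
)
sched = json.loads(lr_schedule_raw)
```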