# DaisyCore — daisy_milli

## Model Description

A DaisyCore transformer with 26 layers, 14 attention heads, and a model dimension of 1,792. It uses block-causal sliding-window attention (window size 2,048) with the standard attention implementation.
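The masking constraint behind sliding-window causal attention can be sketched at token level (a simplification: the block-causal variant the card names applies the same rule at block granularity, which this sketch does not model):

```python
def sliding_window_causal_mask(seq_len, window):
    """Token i may attend to token j iff j <= i and i - j < window.

    Token-level simplification of the model's block-causal
    sliding-window attention (window size 2,048 in this model).
    """
    return [[j <= i and i - j < window for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_causal_mask(6, 3)
```

With a window of 3, row 5 of `mask` allows positions 3, 4, and 5 only: causal, and never looking back further than the window.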

## Architecture

| Property | Value |
| --- | --- |
| Architecture | DaisyCore |
| Layers | 26 |
| Attention Heads | 14 |
| Model Dimension | 1,792 |
| Head Dimension | 128 |
| Sliding Window Size | 2,048 |
| Max Sequence Length | 131,072 |
| Vocabulary Size | 49,152 |
| Attention Implementation | standard |
| Value Embeddings | True |
| Tied Embeddings | False |
| Skip Mix Mode | linear |
| Tokenizer | jonathanmiddleton/daisy |
| Dtype | bfloat16 |
| Parameters (total) | 2,323,120,245 |
| Parameters (non-embedding) | 1,001,914,485 |
| Parameters (embedding) | 1,321,205,760 |
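The table's figures are internally consistent, which is worth checking when reading a model card: the per-head width times the head count recovers the model dimension, and the embedding and non-embedding parameter counts sum to the total.

```python
num_heads, head_dim, model_dim = 14, 128, 1792

# 14 heads of width 128 reconstruct the 1,792-dim residual stream.
assert num_heads * head_dim == model_dim

emb, non_emb, total = 1_321_205_760, 1_001_914_485, 2_323_120_245

# Embedding + non-embedding parameters account for the full count.
assert emb + non_emb == total
```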

## Training Progress

| Metric | Value |
| --- | --- |
| Checkpoint Step | 39,300 |
| Tokens Processed | 123.46B (123,459,993,600) |
| Target Tokens | 150.00B (150,000,000,000) |
| Progress | 82.3% |
| Best Validation Loss | 1.38709 |
| Evaluations Performed | 814 |
| Saved | 2026-03-03 23:08 UTC |
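The progress figure follows directly from the token counts above:

```python
tokens_processed = 123_459_993_600
target_tokens = 150_000_000_000

progress = tokens_processed / target_tokens
remaining = target_tokens - tokens_processed

print(f"{progress:.1%}")        # 82.3%
print(f"{remaining:,} tokens")  # 26,540,006,400 tokens
```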

## Training Configuration

### Optimizers

| Optimizer | Parameter Group | Learning Rate |
| --- | --- | --- |
| AdamW | head_params | 0.003216 |
| AdamW | embed_params | 0.1865 |
| AdamW | scalar_params | 0.02099 |
| Muon | hidden_matrix_params | 0.025 |

### Schedule & Regularization

| Parameter | Value |
| --- | --- |
| LR Scale | 1.0 |
| LR Schedule | n_phase_linear |
| LR Schedule — begin_after_fraction | 0.0 |
| LR Schedule — cooldown_fraction | 0.0 |
| LR Schedule — floor | 0.0 |
| LR Schedule — phases | [{'progress': 0.0, 'scale': 1.0}, {'progress': 0.5766, 'scale': 0.23492}, {'progress': 0.7733, 'scale': 0.195}, {'progress': 0.818, 'scale': 0.05}, {'progress': 1, 'scale': 0.05}] |
| LR Schedule — warmup_fraction | 0.0 |
| Gradient Accumulation Steps | 3 |
| Muon Warmup Steps | 300 |
| Seed | 1337 |
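The `n_phase_linear` schedule reads as piecewise-linear interpolation of an LR scale between (progress, scale) anchor points. A minimal sketch, assuming that semantics (the actual implementation may differ; warmup and cooldown fractions are 0.0 here and omitted):

```python
def n_phase_linear(progress, phases, floor=0.0):
    """Linearly interpolate the LR scale between phase anchor points."""
    pts = [(p["progress"], p["scale"]) for p in phases]
    for (p0, s0), (p1, s1) in zip(pts, pts[1:]):
        if p0 <= progress <= p1:
            t = 0.0 if p1 == p0 else (progress - p0) / (p1 - p0)
            return max(floor, s0 + t * (s1 - s0))
    return max(floor, pts[-1][1])  # clamp past the last anchor

# Phase table from this run's configuration.
phases = [
    {"progress": 0.0, "scale": 1.0},
    {"progress": 0.5766, "scale": 0.23492},
    {"progress": 0.7733, "scale": 0.195},
    {"progress": 0.818, "scale": 0.05},
    {"progress": 1.0, "scale": 0.05},
]
```

Under this reading, the checkpoint's 82.3% progress sits on the final 0.05 plateau, so for example the Muon group's effective LR would be 0.025 × 0.05 = 0.00125.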

## Training Data

| Type | Sequence Length | Path |
| --- | --- | --- |
| fineweb-edu-dedup | 16,384 | data/fineweb-edu-dedup/fineweb-edu-dedup_jonathanmiddleton_daisy_train_*.bin |
| daisy_assistant_v1 | 1,024 | data/daisy_assistant_v1 |
| daisypie_smart | 16,384 | data/daisypie_smart |
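The fineweb-edu-dedup path is a glob pattern over token shards. A small sketch of how such a pattern selects shard files; the filenames below are hypothetical examples, since the actual shard count and naming are not part of this card:

```python
from fnmatch import fnmatch

pattern = ("data/fineweb-edu-dedup/"
           "fineweb-edu-dedup_jonathanmiddleton_daisy_train_*.bin")

# Hypothetical directory listing; a real run would use glob.glob(pattern).
files = [
    "data/fineweb-edu-dedup/fineweb-edu-dedup_jonathanmiddleton_daisy_train_000.bin",
    "data/fineweb-edu-dedup/fineweb-edu-dedup_jonathanmiddleton_daisy_train_001.bin",
    "data/fineweb-edu-dedup/README.md",
]
shards = [f for f in files if fnmatch(f, pattern)]
```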

## Checkpoint Provenance

- Resumed from: JonathanMiddleton/daisy-milli-18d-1

## All Hyperparameters

| Parameter | Value |
| --- | --- |
| window_size | 2048 |
| vocab_size | 49152 |
| eos_token_id | 49131 |
| num_layers | 26 |
| num_heads | 14 |
| model_dim | 1792 |
| head_dim | 128 |
| max_seq_len | 131072 |
| model_spec | daisy_milli |
| model_class | models.daisy.daisy_core.DaisyCore |
| target_tokens | 100000000000 |
| full_window_target_tokens | 3000000000 |
| torch_coordinate_descent_tuning | False |
| torch_inductor_config_max_autotune | False |
| overfit | False |
| full_windows | False |
| wandb_log | True |
| wandb_project | milli |
| wandb_run_name | milli_v18d.1.1 |
| wandb_group | pretrain |
| resume_checkpoint | JonathanMiddleton/daisy-milli-18d-1 |
| resume_target_tokens_override | 150000000000 |
| use_value_embeddings | True |
| use_tied_embeddings | False |
| seed | 1337 |
| task_val_debug_log_samples | False |
| log_interval | 16384 |
| muon_warmup_steps | 300 |
| lr_scale | 1.0 |
| cooldown_fraction | 0.0 |
| lr_schedule | {"name": "n_phase_linear", "config": {"cooldown_fraction": 0.0, "phases": [{"progress": 0.0, "scale": 1.0}, {"progress": 0.5766, "scale": 0.23492}, {"progress": 0.7733, "scale": 0.195}, {"progress": 0.818, "scale": 0.05}, {"progress": 1, "scale": 0.05}], "floor": 0.0, "warmup_fraction": 0.0, "begin_after_fraction": 0.0}} |
| grad_acc_steps | 3 |
| val_loss_every_tokens | 324403200 |
| checkpoint_warmup_tokens | 93000000000 |
| checkpoint_per_n_tokens | 0 |
| save_checkpoint | True |
| benchmarks_frequency | 1 |
| mmlu_cache_bin_path | data/mmlu_cache/mmlu_cache.bin |
| mmlu_cache_bin_rebuild | False |
| task_training | False |
| track_last_n_layers | 0 |
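The `lr_schedule` entry is stored as a JSON string and parses directly, which is convenient for inspecting the phase table programmatically:

```python
import json

# lr_schedule value copied from the hyperparameter table above.
raw = (
    '{"name": "n_phase_linear", "config": {"cooldown_fraction": 0.0, '
    '"phases": [{"progress": 0.0, "scale": 1.0}, '
    '{"progress": 0.5766, "scale": 0.23492}, '
    '{"progress": 0.7733, "scale": 0.195}, '
    '{"progress": 0.818, "scale": 0.05}, '
    '{"progress": 1, "scale": 0.05}], '
    '"floor": 0.0, "warmup_fraction": 0.0, "begin_after_fraction": 0.0}}'
)
schedule = json.loads(raw)

print(schedule["name"])                    # n_phase_linear
print(len(schedule["config"]["phases"]))   # 5
```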