# DaisyCore — daisy_milli

## Model Description

A DaisyCore transformer with 26 layers, 14 attention heads, and a model dimension of 1,792. It uses block-causal sliding-window attention (window size 2,048) with the standard attention implementation.
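
The attention pattern above can be sketched as a boolean mask. This is a minimal illustration of a causal sliding window (query `i` sees keys in `(i - window, i]`); the block-boundary handling implied by "block-causal" is not detailed on this card and is omitted here.

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list[list[bool]]:
    """True at [i][j] when query i may attend to key j.

    Causal with a sliding window: key j is visible iff 0 <= i - j < window,
    i.e. j lies in (i - window, i]. Block-boundary masking (the "block-causal"
    part) is left out for brevity.
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]

# With this model's settings the call would be
# sliding_window_causal_mask(131_072, 2_048); a tiny case shows the pattern:
mask = sliding_window_causal_mask(6, 3)
```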

## Architecture

| Property | Value |
|---|---|
| Architecture | DaisyCore |
| Layers | 26 |
| Attention Heads | 14 |
| Model Dimension | 1,792 |
| Head Dimension | 128 |
| Sliding Window Size | 2,048 |
| Max Sequence Length | 131,072 |
| Vocabulary Size | 49,152 |
| Attention Implementation | standard |
| Value Embeddings | True |
| Tied Embeddings | False |
| Skip Mix Mode | linear |
| Tokenizer | jonathanmiddleton/daisy |
| Dtype | bfloat16 |
| Parameters (total) | 2,323,120,245 |
| Parameters (non-embedding) | 1,001,914,485 |
| Parameters (embedding) | 1,321,205,760 |
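
The parameter counts above are self-consistent, and the embedding total works out to exactly fifteen vocab-by-dim matrices. How those fifteen split across input embeddings, the untied LM head, and the per-layer value embeddings is not stated on the card, so treat that breakdown as unknown; the arithmetic below only checks the published totals.

```python
vocab_size, model_dim = 49_152, 1_792

emb_params     = 1_321_205_760  # Parameters (embedding), from the table
non_emb_params = 1_001_914_485  # Parameters (non-embedding)
total_params   = 2_323_120_245  # Parameters (total)

# Totals add up exactly.
assert emb_params + non_emb_params == total_params

# Embedding total == 15 matrices of shape (vocab_size, model_dim);
# the split across input/head/value embeddings is not given on the card.
assert emb_params == 15 * vocab_size * model_dim
```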

## Training Progress

| Metric | Value |
|---|---|
| Checkpoint Step | 1,200 |
| Tokens Processed | 2.52B (2,516,582,400) |
| Target Tokens | 3.44B (3,442,039,096) |
| Progress | 73.1% |
| Best Validation Loss | 1.42815 |
| Evaluations Performed | 12 |
| Saved | 2026-02-26 17:34 UTC |
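
The progress figure checks out arithmetically. As a derived observation (not stated on the card), the token count also divides evenly by the checkpoint step, giving a power-of-two token budget per optimizer step.

```python
tokens_processed = 2_516_582_400
target_tokens    = 3_442_039_096
checkpoint_step  = 1_200

# Matches the 73.1% reported in the table.
progress = tokens_processed / target_tokens

# Derived, not stated on the card: tokens per step is exactly 2**21.
tokens_per_step = tokens_processed // checkpoint_step
```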

## Training Configuration

### Optimizers

| Optimizer | Parameter Group | Learning Rate |
|---|---|---|
| AdamW | head_params | 0.003216 |
| AdamW | embed_params | 0.1865 |
| AdamW | scalar_params | 0.02099 |
| Muon | hidden_matrix_params | 0.025 |
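
A common way to realize a split like this is to route each parameter tensor by name and shape into per-optimizer groups. The (optimizer, learning-rate) pairs below come from the table; the routing rules and the example parameter names are illustrative assumptions, not DaisyCore's actual code.

```python
# (optimizer, learning rate) per group — values from the table above.
OPTIMIZER_GROUPS = {
    "head_params":          ("AdamW", 3.216e-3),
    "embed_params":         ("AdamW", 1.865e-1),
    "scalar_params":        ("AdamW", 2.099e-2),
    "hidden_matrix_params": ("Muon",  2.5e-2),
}

def route_param(name: str, ndim: int) -> str:
    """Assign a parameter to a group. Rules are hypothetical, for illustration."""
    if "embed" in name:
        return "embed_params"
    if "head" in name:
        return "head_params"
    if ndim >= 2:
        return "hidden_matrix_params"  # 2-D hidden matrices go to Muon
    return "scalar_params"             # 1-D gains/biases stay on AdamW
```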

### Schedule & Regularization

| Parameter | Value |
|---|---|
| LR Scale | 1.0 |
| LR Schedule | n_phase_linear |
| LR Schedule — begin_after_fraction | 0.0 |
| LR Schedule — cooldown_fraction | 0.0 |
| LR Schedule — floor | 0.0 |
| LR Schedule — phases | `[{'progress': 0.0, 'scale': 0}, {'progress': 0.183, 'scale': 0.23237}, {'progress': 1.0, 'scale': 0.1}]` |
| LR Schedule — warmup_fraction | 0.0 |
| Gradient Accumulation Steps | 4 |
| Muon Warmup Steps | 300 |
| Seed | 1337 |
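
Under the natural reading of `n_phase_linear`, the LR scale interpolates linearly between the (progress, scale) phase points, with `floor` as a lower bound. The sketch below is a guess at the schedule's semantics under that assumption; the phase values are taken from the table.

```python
def n_phase_linear(progress: float, phases: list[dict], floor: float = 0.0) -> float:
    """Piecewise-linear LR scale over training progress in [0, 1].

    Assumed semantics: interpolate between consecutive phase points,
    clamp to the endpoints outside the range, never go below `floor`.
    """
    if progress <= phases[0]["progress"]:
        return max(phases[0]["scale"], floor)
    for a, b in zip(phases, phases[1:]):
        if progress <= b["progress"]:
            t = (progress - a["progress"]) / (b["progress"] - a["progress"])
            return max(a["scale"] + t * (b["scale"] - a["scale"]), floor)
    return max(phases[-1]["scale"], floor)

# Phase points from the table: ramp to 0.23237 by 18.3% of training,
# then decay linearly toward 0.1 at the end.
PHASES = [
    {"progress": 0.0,   "scale": 0.0},
    {"progress": 0.183, "scale": 0.23237},
    {"progress": 1.0,   "scale": 0.1},
]
```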

## Training Data

| Type | Sequence Length | Path |
|---|---|---|
| nemotron_cc_hq | 16,384 | `data/nemotron_cc_hq/high_synthetic_distill/high_synthetic_distill_jonathanmiddleton_daisy_train_*.bin` |
| daisy_assistant_v1 | 16,384 | `data/daisy_assistant_v1` |
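
The nemotron_cc_hq path uses a shell-style glob to select its shard files. Only the pattern itself comes from the table; the shard filename below (`..._000001.bin`) is a hypothetical example of what it would match.

```python
from fnmatch import fnmatch

# Glob pattern from the training-data table (basename only).
pattern = "high_synthetic_distill_jonathanmiddleton_daisy_train_*.bin"

# Hypothetical shard names for illustration; actual shard numbering
# is not shown on the card.
assert fnmatch("high_synthetic_distill_jonathanmiddleton_daisy_train_000001.bin", pattern)
assert not fnmatch("high_synthetic_distill_val_000001.bin", pattern)
```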

## All Hyperparameters

| Parameter | Value |
|---|---|
| window_size | 2048 |
| vocab_size | 49152 |
| eos_token_id | 49131 |
| num_layers | 26 |
| num_heads | 14 |
| model_dim | 1792 |
| head_dim | 128 |
| max_seq_len | 131072 |
| model_spec | daisy_milli |
| model_class | `models.daisy.daisy_core.DaisyCore` |
| target_tokens | 3442039096 |
| full_window_target_tokens | 3442039096 |
| torch_coordinate_descent_tuning | False |
| torch_inductor_config_max_autotune | False |
| overfit | False |
| full_windows | True |
| wandb_log | True |
| wandb_project | milli |
| wandb_run_name | milli_v18_mid_v12 |
| wandb_group | mid |
| init_model | JonathanMiddleton/daisy-milli-base-v18 |
| use_value_embeddings | True |
| use_tied_embeddings | False |
| seed | 1337 |
| task_val_debug_log_samples | False |
| log_interval | 16384 |
| muon_warmup_steps | 300 |
| lr_scale | 1.0 |
| cooldown_fraction | 0.0 |
| lr_schedule | `{"name": "n_phase_linear", "config": {"cooldown_fraction": 0.0, "phases": [{"progress": 0.0, "scale": 0}, {"progress": 0.183, "scale": 0.23237}, {"progress": 1.0, "scale": 0.1}], "floor": 0.0, "warmup_fraction": 0.0, "begin_after_fraction": 0.0}}` |
| grad_acc_steps | 4 |
| val_loss_every_tokens | 209715200 |
| checkpoint_warmup_tokens | 1000000000 |
| checkpoint_per_n_tokens | 0 |
| save_checkpoint | True |
| benchmarks_frequency | 2 |
| mmlu_cache_bin_path | `data/mmlu_cache/mmlu_cache.bin` |
| mmlu_cache_bin_rebuild | False |
| task_training | False |
| track_last_n_layers | 0 |
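
The lr_schedule value is stored as a JSON string rather than a nested config, so consumers can recover the structured form with a plain `json.loads`. The string below is reproduced verbatim from the table.

```python
import json

# lr_schedule value, copied verbatim from the hyperparameter table.
lr_schedule_raw = (
    '{"name": "n_phase_linear", "config": {"cooldown_fraction": 0.0, '
    '"phases": [{"progress": 0.0, "scale": 0}, '
    '{"progress": 0.183, "scale": 0.23237}, '
    '{"progress": 1.0, "scale": 0.1}], '
    '"floor": 0.0, "warmup_fraction": 0.0, "begin_after_fraction": 0.0}}'
)
sched = json.loads(lr_schedule_raw)
```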