# DaisyCore - daisy_milli

## Model Description

daisy_milli is a DaisyCore transformer with 26 layers, 14 attention heads, and a model dimension of 1,792 (head dimension 128). It uses block-causal sliding-window attention with a window size of 2,048 tokens and the standard attention implementation.
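For illustration, here is a minimal sketch of the attention mask this implies. It assumes "block-causal" means tokens attend causally within their own document block only; the function name and shapes are illustrative, not taken from the DaisyCore code.

```python
import torch

def block_causal_sliding_window_mask(doc_ids: torch.Tensor, window: int = 2048) -> torch.Tensor:
    """doc_ids: (seq_len,) tensor mapping each token to its document id.
    Returns a (seq_len, seq_len) boolean mask where True = may attend."""
    seq_len = doc_ids.shape[0]
    pos = torch.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]             # query i sees keys j <= i
    in_window = pos[:, None] - pos[None, :] < window  # ...and only j > i - window
    same_doc = doc_ids[:, None] == doc_ids[None, :]   # block-causal: stay in-document
    return causal & in_window & same_doc

# Example: two packed documents of lengths 3 and 2, window of 2 tokens.
mask = block_causal_sliding_window_mask(torch.tensor([0, 0, 0, 1, 1]), window=2)
```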

## Architecture

| Property | Value |
|---|---|
| Architecture | DaisyCore |
| Layers | 26 |
| Attention Heads | 14 |
| Model Dimension | 1,792 |
| Head Dimension | 128 |
| Sliding Window Size | 2,048 |
| Max Sequence Length | 131,072 |
| Vocabulary Size | 49,152 |
| Attention Implementation | standard |
| Value Embeddings | True |
| Tied Embeddings | False |
| Skip Mix Mode | linear |
| Tokenizer | jonathanmiddleton/daisy |
| Dtype | bfloat16 |
| Parameters (total) | 2,323,120,245 |
| Parameters (non-embedding) | 1,001,914,485 |
| Parameters (embedding) | 1,321,205,760 |
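The parameter counts are internally consistent, and the embedding total divides evenly into vocab-sized tables. A quick arithmetic check (the 15-table split, plausibly input embedding, untied output head, and value embeddings, is an inference, as is the 12·d² per-layer estimate):

```python
vocab_size, model_dim, num_layers, num_heads = 49_152, 1_792, 26, 14
total, non_embed, embed = 2_323_120_245, 1_001_914_485, 1_321_205_760

assert model_dim // num_heads == 128   # head dimension checks out
assert non_embed + embed == total      # the three counts are consistent
per_table = vocab_size * model_dim     # 88,080,384 params per vocab-sized table
assert embed == 15 * per_table         # 15 vocab-sized embedding tables
# Assuming ~12*d^2 weights per layer (QKVO plus a 4x MLP), almost everything
# else is accounted for; the small remainder would be scalar parameters.
print(non_embed - num_layers * 12 * model_dim**2)  # -> 117
```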

## Training Progress

| Metric | Value |
|---|---|
| Checkpoint Step | 50 |
| Tokens Processed | 104.86M (104,857,600) |
| Target Tokens | 376.05M (376,045,116) |
| Progress | 27.9% |
| Best Validation Loss | 1.35229 |
| Evaluations Performed | 2 |
| HellaSwag (acc_norm) | 60.19% |
| MMLU (acc) | 34.20% |
| Saved | 2026-03-10 17:46 UTC |
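The throughput numbers line up with the hyperparameters listed below: 104,857,600 tokens over 50 optimizer steps is 2,097,152 tokens per step, which equals grad_acc_steps times the 16,384-token sequence length (assuming one sequence per micro-batch), and the two evaluations match val_loss_every_tokens. A quick check:

```python
tokens, steps = 104_857_600, 50
grad_acc_steps, seq_len = 128, 16_384

assert tokens // steps == grad_acc_steps * seq_len     # 2,097,152 tokens per step
assert tokens // 52_428_800 == 2                       # matches "Evaluations Performed"
print(f"progress: {100 * tokens / 376_045_116:.1f}%")  # -> progress: 27.9%
```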

## Training Configuration

### Optimizers

| Optimizer | Parameter Group | Learning Rate |
|---|---|---|
| AdamW | head_params | 0.003216 |
| AdamW | embed_params | 0.1865 |
| AdamW | scalar_params | 0.02099 |
| Muon | hidden_matrix_params | 0.025 |
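This split follows the usual Muon recipe: 2-D hidden weight matrices are optimized with Muon, while embeddings, the output head, and scalar parameters stay on AdamW. Below is a minimal sketch of how such groups might be built; the name- and shape-based selection rules are assumptions, and only the group names and learning rates come from the table above.

```python
import torch

def build_optimizers(model: torch.nn.Module):
    head, embed, scalars, hidden = [], [], [], []
    for name, p in model.named_parameters():
        if "embed" in name:      # token / value embedding tables
            embed.append(p)
        elif "head" in name:     # output projection
            head.append(p)
        elif p.ndim < 2:         # norms, gains, skip-mix scalars
            scalars.append(p)
        else:                    # 2-D hidden weight matrices
            hidden.append(p)
    adamw = torch.optim.AdamW([
        {"params": head,    "lr": 0.003216},
        {"params": embed,   "lr": 0.1865},
        {"params": scalars, "lr": 0.02099},
    ])
    # Muon is not part of torch; SGD-with-momentum stands in here so the
    # sketch runs. The actual run uses Muon at lr=0.025 for these matrices.
    muon_like = torch.optim.SGD(hidden, lr=0.025, momentum=0.95)
    return adamw, muon_like
```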

### Schedule & Regularization

| Parameter | Value |
|---|---|
| LR Scale | 1.0 |
| LR Schedule | n_phase_linear |
| LR Schedule: begin_after_fraction | 0.0 |
| LR Schedule: cooldown_fraction | 0.0 |
| LR Schedule: floor | 0.0 |
| LR Schedule: phases | `[{'progress': 0.0, 'scale': 0.05}, {'progress': 0.1, 'scale': 0.05}, {'progress': 1, 'scale': 0.001}]` |
| LR Schedule: warmup_fraction | 0.0 |
| Gradient Accumulation Steps | 128 |
| Muon Warmup Steps | 300 |
| Seed | 1337 |
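The card does not define n_phase_linear, but a plausible reading of the phase list is piecewise-linear interpolation between the (progress, scale) knots: hold the LR multiplier at 0.05 for the first 10% of training, then decay linearly to 0.001 at the end. A sketch under that assumption:

```python
def n_phase_linear(progress: float, phases: list[dict], floor: float = 0.0) -> float:
    """Map training progress in [0, 1] to an LR multiplier by interpolating
    linearly between the configured (progress, scale) knots."""
    if progress <= phases[0]["progress"]:
        return max(phases[0]["scale"], floor)
    for left, right in zip(phases, phases[1:]):
        if progress <= right["progress"]:
            t = (progress - left["progress"]) / (right["progress"] - left["progress"])
            return max(left["scale"] + t * (right["scale"] - left["scale"]), floor)
    return max(phases[-1]["scale"], floor)

phases = [{"progress": 0.0, "scale": 0.05},
          {"progress": 0.1, "scale": 0.05},
          {"progress": 1.0, "scale": 0.001}]
print(n_phase_linear(0.279, phases))  # scale at this checkpoint, ~0.0403
```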

### Training Data

| Type | Sequence Length | Path |
|---|---|---|
| smart_sft | 16,384 | data/smart_sft |
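The on-disk format of data/smart_sft is not described in the card. A minimal loading sketch, assuming a flat binary file of uint16 token ids (the 49,152-entry vocabulary fits in uint16) cut into contiguous 16,384-token sequences:

```python
import numpy as np

def iter_sequences(path: str, seq_len: int = 16_384):
    """Yield fixed-length token sequences from a flat binary token file."""
    tokens = np.memmap(path, dtype=np.uint16, mode="r")
    for start in range(0, len(tokens) - seq_len + 1, seq_len):
        yield tokens[start : start + seq_len].astype(np.int64)
```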

## All Hyperparameters

| Parameter | Value |
|---|---|
| window_size | 2048 |
| vocab_size | 49152 |
| eos_token_id | 49131 |
| num_layers | 26 |
| num_heads | 14 |
| model_dim | 1792 |
| head_dim | 128 |
| max_seq_len | 131072 |
| model_spec | daisy_milli |
| model_class | models.daisy.daisy_core.DaisyCore |
| target_tokens | 376045116 |
| full_window_target_tokens | 376045116 |
| torch_coordinate_descent_tuning | False |
| torch_inductor_config_max_autotune | False |
| overfit | False |
| full_windows | True |
| wandb_log | True |
| wandb_project | milli |
| wandb_run_name | milli_v18de.2.1 |
| wandb_group | pretrain |
| init_model | JonathanMiddleton/daisy-milli-18de.2 |
| use_value_embeddings | True |
| use_tied_embeddings | False |
| seed | 1337 |
| task_val_debug_log_samples | False |
| log_interval | 16384 |
| muon_warmup_steps | 300 |
| lr_scale | 1.0 |
| cooldown_fraction | 0.0 |
| lr_schedule | `{"name": "n_phase_linear", "config": {"cooldown_fraction": 0.0, "phases": [{"progress": 0.0, "scale": 0.05}, {"progress": 0.1, "scale": 0.05}, {"progress": 1, "scale": 0.001}], "floor": 0.0, "warmup_fraction": 0.0, "begin_after_fraction": 0.0}}` |
| grad_acc_steps | 128 |
| val_loss_every_tokens | 52428800 |
| checkpoint_warmup_tokens | 1 |
| checkpoint_per_n_tokens | 0 |
| save_checkpoint | True |
| benchmarks | `{"hellaswag_enabled": true, "hellaswag_frequency": 1, "mmlu_enabled": true, "mmlu_frequency": 1}` |
| mmlu_cache_bin_path | data/mmlu_cache/mmlu_cache.bin |
| mmlu_cache_bin_rebuild | False |
| task_training | False |
| track_last_n_layers | 0 |
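The tokenizer settings above can be spot-checked against the jonathanmiddleton/daisy repo from the Architecture table; whether that repo exposes a transformers-compatible tokenizer is an assumption:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("jonathanmiddleton/daisy")
print(tok.vocab_size, tok.eos_token_id)  # expect 49152, 49131
```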