---
library_name: transformers
tags:
- daisy
- causal-lm
- pretrained
license: apache-2.0
---

# DaisyCore — daisy_milli

## Model Description

DaisyCore is a transformer with 26 layers, 14 attention heads, and a model dimension of 1,792. It uses block-causal sliding-window attention (window size 2,048) with the standard attention implementation.

## Architecture

| Property | Value |
|:---|:---|
| Architecture | DaisyCore |
| Layers | 26 |
| Attention Heads | 14 |
| Model Dimension | 1,792 |
| Head Dimension | 128 |
| Sliding Window Size | 2,048 |
| Max Sequence Length | 131,072 |
| Vocabulary Size | 49,152 |
| Attention Implementation | standard |
| Value Embeddings | True |
| Tied Embeddings | False |
| Skip Mix Mode | linear |
| Tokenizer | `jonathanmiddleton/daisy` |
| Dtype | bfloat16 |
| Parameters (total) | 2,323,120,245 |
| Parameters (non-embedding) | 1,001,914,485 |
| Parameters (embedding) | 1,321,205,760 |

## Training Progress

| Metric | Value |
|:---|:---|
| Checkpoint Step | 1,646 |
| Tokens Processed | 3.45B (3,451,912,192) |
| Target Tokens | 3.45B (3,451,912,192) |
| Progress | 100.0% |
| Best Validation Loss | 1.57501 |
| Evaluations Performed | 17 |
| Saved | 2026-03-02 21:53 UTC |

## Training Configuration

### Optimizers

| Optimizer | Parameter Group | Learning Rate |
|:---|:---|:---|
| AdamW | head_params | 0.003216 |
| AdamW | embed_params | 0.1865 |
| AdamW | scalar_params | 0.02099 |
| Muon | hidden_matrix_params | 0.025 |

### Schedule & Regularization

| Parameter | Value |
|:---|:---|
| LR Scale | 1.0 |
| LR Schedule | n_phase_linear |
| LR Schedule — begin_after_fraction | 0.0 |
| LR Schedule — cooldown_fraction | 0.0 |
| LR Schedule — floor | 0.0 |
| LR Schedule — phases | [{'progress': 0.0, 'scale': 0.185}, {'progress': 1.0, 'scale': 0.1}] |
| LR Schedule — warmup_fraction | 0.0 |
| Gradient Accumulation Steps | 4 |
| Muon Warmup Steps | 300 |
| Seed | 1337 |

### Training Data

| Type | Sequence Length | Path |
|:---|:---|:---|
| fineweb-edu-dedup | 16,384 | `data/fineweb-edu-dedup/fineweb-edu-dedup_jonathanmiddleton_daisy_train_*.bin[000140:000200]` |
| daisy_assistant_v1 | 16,384 | `data/daisy_assistant_v1` |

## All Hyperparameters

| Parameter | Value |
|:---|:---|
| window_size | 2048 |
| vocab_size | 49152 |
| eos_token_id | 49131 |
| num_layers | 26 |
| num_heads | 14 |
| model_dim | 1792 |
| head_dim | 128 |
| max_seq_len | 131072 |
| model_spec | daisy_milli |
| model_class | models.daisy.daisy_core.DaisyCore |
| target_tokens | 3451912192 |
| full_window_target_tokens | 3451912192 |
| torch_coordinate_descent_tuning | False |
| torch_inductor_config_max_autotune | False |
| overfit | False |
| full_windows | True |
| wandb_log | True |
| wandb_project | milli |
| wandb_run_name | milli_v18b_mid_v13 |
| wandb_group | mid |
| init_model | JonathanMiddleton/daisy-milli-base-v18b |
| use_value_embeddings | True |
| use_tied_embeddings | False |
| seed | 1337 |
| task_val_debug_log_samples | False |
| log_interval | 16384 |
| muon_warmup_steps | 300 |
| lr_scale | 1.0 |
| cooldown_fraction | 0.0 |
| lr_schedule | {"name": "n_phase_linear", "config": {"cooldown_fraction": 0.0, "phases": [{"progress": 0.0, "scale": 0.185}, {"progress": 1.0, "scale": 0.1}], "floor": 0.0, "warmup_fraction": 0.0, "begin_after_fraction": 0.0}} |
| grad_acc_steps | 4 |
| val_loss_every_tokens | 209715200 |
| checkpoint_warmup_tokens | 1000000000 |
| checkpoint_per_n_tokens | 0 |
| save_checkpoint | True |
| benchmarks_frequency | 2 |
| mmlu_cache_bin_path | data/mmlu_cache/mmlu_cache.bin |
| mmlu_cache_bin_rebuild | False |
| task_training | False |
| track_last_n_layers | 0 |
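## Notes on the Attention Pattern

The sliding-window attention described in the model card can be pictured as a boolean mask in which query position `i` attends only to key positions `j` with `j <= i` and `i - j < window`. The sketch below is illustrative only: it shows a plain causal sliding window, while DaisyCore's *block-causal* variant additionally restricts attention at document-block boundaries, which is not modeled here. The function name is hypothetical and not part of this repository.

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Boolean attention mask: entry [i][j] is True when query i may attend key j.

    Illustrative sketch of plain causal sliding-window attention.
    DaisyCore uses a block-causal variant that also masks across
    document-block boundaries; that refinement is omitted here.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# With seq_len=6 and window=3, row 5 can see positions 3, 4, 5 only:
# earlier positions have fallen outside the 3-token window.
mask = sliding_window_causal_mask(seq_len=6, window=3)
```

In the real model the window is 2,048 over sequences up to 131,072 tokens, so each token sees at most the preceding 2,048 positions rather than the full prefix.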
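## Notes on the LR Schedule

The `n_phase_linear` schedule is defined by the training code, not documented here; a reasonable reading of the configured phases (`scale` 0.185 at progress 0.0, `scale` 0.1 at progress 1.0) is piecewise-linear interpolation of the LR scale between consecutive phase points, with each optimizer's base learning rate multiplied by that scale. The sketch below encodes that assumption; the function name is hypothetical.

```python
def n_phase_linear_scale(progress: float, phases: list[dict]) -> float:
    """LR scale at training progress in [0, 1].

    Assumption: `n_phase_linear` linearly interpolates `scale`
    between consecutive (progress, scale) phase points and clamps
    outside the configured range. Not the repository's actual code.
    """
    if progress <= phases[0]["progress"]:
        return phases[0]["scale"]
    for lo, hi in zip(phases, phases[1:]):
        if progress <= hi["progress"]:
            t = (progress - lo["progress"]) / (hi["progress"] - lo["progress"])
            return lo["scale"] + t * (hi["scale"] - lo["scale"])
    return phases[-1]["scale"]


# Phases from this run: linear decay of the scale from 0.185 to 0.1.
phases = [{"progress": 0.0, "scale": 0.185}, {"progress": 1.0, "scale": 0.1}]
```

Under this reading, the Muon group's LR would decay from 0.025 × 0.185 at the start of this phase to 0.025 × 0.1 at the end, and likewise for the AdamW groups.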