---
library_name: transformers
tags:
- daisy
- causal-lm
- pretrained
license: apache-2.0
---
# DaisyCore — daisy_milli
## Model Description

DaisyCore is a causal-language-model transformer with 26 layers, 14 attention heads, and a model dimension of 1,792. It uses block-causal sliding-window attention (window size 2,048) with the standard attention implementation.
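To illustrate the attention pattern, here is a minimal sketch of a causal sliding-window mask (`sliding_window_causal_mask` is a hypothetical helper for illustration, not the model's actual implementation; the block-causal variant additionally restricts attention across document boundaries, which this sketch omits):

```python
import numpy as np

def sliding_window_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask: position i may attend to position j iff j <= i
    and i - j < window (causal, limited to the last `window` tokens)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

# Tiny example; the model itself uses window = 2048 over up to 131,072 tokens.
mask = sliding_window_causal_mask(seq_len=6, window=3)
```

With `window=3`, position 5 can attend to positions 3, 4, and 5 but not to position 2 or earlier.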
## Architecture
| Property | Value |
|---|---|
| Architecture | DaisyCore |
| Layers | 26 |
| Attention Heads | 14 |
| Model Dimension | 1,792 |
| Head Dimension | 128 |
| Sliding Window Size | 2,048 |
| Max Sequence Length | 131,072 |
| Vocabulary Size | 49,152 |
| Attention Implementation | standard |
| Value Embeddings | True |
| Tied Embeddings | False |
| Skip Mix Mode | linear |
| Tokenizer | jonathanmiddleton/daisy |
| Dtype | bfloat16 |
| Parameters (total) | 2,323,120,245 |
| Parameters (non-embedding) | 1,001,914,485 |
| Parameters (embedding) | 1,321,205,760 |
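The parameter counts in the table are internally consistent, which can be checked with a few lines of arithmetic (the factor-of-15 decomposition of the embedding count is an observation from the numbers, not a documented property of the architecture):

```python
vocab_size, model_dim = 49_152, 1_792

embedding = 1_321_205_760
non_embedding = 1_001_914_485
total = 2_323_120_245

# Embedding and non-embedding parameters sum to the reported total.
assert embedding + non_embedding == total

# The embedding count is exactly 15 vocab-by-dim matrices -- consistent with
# untied input and output embeddings plus additional value-embedding tables.
assert embedding == 15 * vocab_size * model_dim
```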
## Training Progress
| Metric | Value |
|---|---|
| Checkpoint Step | 1,646 |
| Tokens Processed | 3.45B (3,451,912,192) |
| Target Tokens | 3.45B (3,451,912,192) |
| Progress | 100.0% |
| Best Validation Loss | 1.57501 |
| Evaluations Performed | 17 |
| Saved | 2026-03-02 21:53 UTC |
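The figures above imply an exact token budget per optimizer step; a quick consistency check (the micro-batch size inferred below assumes a single data-parallel rank, which the card does not state):

```python
# Figures from the tables above.
tokens_processed = 3_451_912_192
checkpoint_step = 1_646
seq_len = 16_384
grad_acc_steps = 4

tokens_per_step = tokens_processed // checkpoint_step
assert tokens_per_step * checkpoint_step == tokens_processed
assert tokens_per_step == 2**21  # 2,097,152 tokens per optimizer step

# At a 16,384-token sequence length this is 128 sequences per step,
# i.e. 32 sequences per micro-batch with 4 gradient-accumulation steps.
assert tokens_per_step // seq_len == 128
assert 128 // grad_acc_steps == 32

# Validation ran every 209,715,200 tokens, i.e. every 100 optimizer steps.
assert 209_715_200 // tokens_per_step == 100
```

The run spans 16 full validation intervals plus a final evaluation, which lines up with the 17 evaluations reported.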
## Training Configuration

### Optimizers
| Optimizer | Parameter Group | Learning Rate |
|---|---|---|
| AdamW | head_params | 0.003216 |
| AdamW | embed_params | 0.1865 |
| AdamW | scalar_params | 0.02099 |
| Muon | hidden_matrix_params | 0.025 |
### Schedule &amp; Regularization
| Parameter | Value |
|---|---|
| LR Scale | 1.0 |
| LR Schedule | n_phase_linear |
| LR Schedule — begin_after_fraction | 0.0 |
| LR Schedule — cooldown_fraction | 0.0 |
| LR Schedule — floor | 0.0 |
| LR Schedule — phases | [{'progress': 0.0, 'scale': 0.185}, {'progress': 1.0, 'scale': 0.1}] |
| LR Schedule — warmup_fraction | 0.0 |
| Gradient Accumulation Steps | 4 |
| Muon Warmup Steps | 300 |
| Seed | 1337 |
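A minimal sketch of how an `n_phase_linear` schedule with these phases might be evaluated (a plausible reading of the config shown above, not the actual training code; `n_phase_linear_scale` is a hypothetical helper): the scale is linearly interpolated between `(progress, scale)` points and multiplied into each group's base learning rate.

```python
def n_phase_linear_scale(progress: float, phases: list[dict]) -> float:
    """Linearly interpolate the LR scale between (progress, scale) points."""
    phases = sorted(phases, key=lambda p: p["progress"])
    if progress <= phases[0]["progress"]:
        return phases[0]["scale"]
    for lo, hi in zip(phases, phases[1:]):
        if progress <= hi["progress"]:
            t = (progress - lo["progress"]) / (hi["progress"] - lo["progress"])
            return lo["scale"] + t * (hi["scale"] - lo["scale"])
    return phases[-1]["scale"]

phases = [{"progress": 0.0, "scale": 0.185}, {"progress": 1.0, "scale": 0.1}]
```

Under this reading the scale decays linearly from 0.185 to 0.1 over training (0.1425 at the halfway point), so Muon's hidden-matrix learning rate would be 0.025 × 0.1425 ≈ 0.00356 at mid-run.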
## Training Data
| Type | Sequence Length | Path |
|---|---|---|
| fineweb-edu-dedup | 16,384 | data/fineweb-edu-dedup/fineweb-edu-dedup_jonathanmiddleton_daisy_train_*.bin[000140:000200] |
| daisy_assistant_v1 | 16,384 | data/daisy_assistant_v1 |
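The `[000140:000200]` suffix on the fineweb path appears to select a contiguous range of numbered shards from the glob's sorted matches; a sketch under that assumption (`slice_shards` is a hypothetical helper, and the bracket notation is a convention of the training code, not a filesystem feature):

```python
def slice_shards(files: list[str], spec: str) -> list[str]:
    """Apply a '[A:B]' shard-range spec to a sorted list of shard files."""
    start, stop = (int(s) for s in spec.strip("[]").split(":"))
    return sorted(files)[start:stop]

# Hypothetical shard names; zero-padded indices sort lexicographically.
files = [f"fineweb_{i:06d}.bin" for i in range(250)]
selected = slice_shards(files, "[000140:000200]")
```

For this run the spec selects 60 shards, numbers 000140 through 000199.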
## All Hyperparameters
| Parameter | Value |
|---|---|
| window_size | 2048 |
| vocab_size | 49152 |
| eos_token_id | 49131 |
| num_layers | 26 |
| num_heads | 14 |
| model_dim | 1792 |
| head_dim | 128 |
| max_seq_len | 131072 |
| model_spec | daisy_milli |
| model_class | models.daisy.daisy_core.DaisyCore |
| target_tokens | 3451912192 |
| full_window_target_tokens | 3451912192 |
| torch_coordinate_descent_tuning | False |
| torch_inductor_config_max_autotune | False |
| overfit | False |
| full_windows | True |
| wandb_log | True |
| wandb_project | milli |
| wandb_run_name | milli_v18b_mid_v13 |
| wandb_group | mid |
| init_model | JonathanMiddleton/daisy-milli-base-v18b |
| use_value_embeddings | True |
| use_tied_embeddings | False |
| seed | 1337 |
| task_val_debug_log_samples | False |
| log_interval | 16384 |
| muon_warmup_steps | 300 |
| lr_scale | 1.0 |
| cooldown_fraction | 0.0 |
| lr_schedule | {"name": "n_phase_linear", "config": {"cooldown_fraction": 0.0, "phases": [{"progress": 0.0, "scale": 0.185}, {"progress": 1.0, "scale": 0.1}], "floor": 0.0, "warmup_fraction": 0.0, "begin_after_fraction": 0.0}} |
| grad_acc_steps | 4 |
| val_loss_every_tokens | 209715200 |
| checkpoint_warmup_tokens | 1000000000 |
| checkpoint_per_n_tokens | 0 |
| save_checkpoint | True |
| benchmarks_frequency | 2 |
| mmlu_cache_bin_path | data/mmlu_cache/mmlu_cache.bin |
| mmlu_cache_bin_rebuild | False |
| task_training | False |
| track_last_n_layers | 0 |