JonathanMiddleton committed 0a2699b (verified) · 1 parent: 629533a

Add files using upload-large-folder tool
README.md ADDED
---
library_name: transformers
tags:
- daisy
- causal-lm
- pretrained
license: apache-2.0
---

# DaisyCore — daisy_milli

## Model Description

DaisyCore transformer with 26 layers, 14 attention heads, and a model dimension of 1,792. It uses block-causal sliding-window attention (window size 2,048) with the standard attention implementation.

## Architecture

| Property | Value |
|:---|:---|
| Architecture | DaisyCore |
| Layers | 26 |
| Attention Heads | 14 |
| Model Dimension | 1,792 |
| Head Dimension | 128 |
| Sliding Window Size | 2,048 |
| Max Sequence Length | 131,072 |
| Vocabulary Size | 49,152 |
| Attention Implementation | standard |
| Value Embeddings | True |
| Tied Embeddings | False |
| Skip Mix Mode | linear |
| Tokenizer | `jonathanmiddleton/daisy` |
| Dtype | bfloat16 |
| Parameters (total) | 2,323,120,245 |
| Parameters (non-embedding) | 1,001,914,485 |
| Parameters (embedding) | 1,321,205,760 |

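The parameter counts in the table are internally consistent, and the embedding total is an exact multiple of one vocab-by-dim matrix. A quick check (the 15-matrix split is an inference from the numbers, not something the card states):

```python
# Sanity-check the parameter counts reported in the Architecture table.
vocab_size = 49_152
model_dim = 1_792

# One vocab-by-dim matrix (e.g. the input embedding or the untied LM head).
embed_matrix = vocab_size * model_dim  # 88,080,384

total_params = 2_323_120_245
embedding_params = 1_321_205_760
non_embedding_params = 1_001_914_485

# Embedding + non-embedding should account for every parameter.
assert embedding_params + non_embedding_params == total_params

# The embedding total divides evenly into 15 vocab-by-dim matrices,
# consistent with untied input/output embeddings plus per-layer value
# embeddings (the exact split is not stated in the card).
print(embedding_params // embed_matrix, embedding_params % embed_matrix)  # 15 0
```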
## Training Progress

| Metric | Value |
|:---|:---|
| Checkpoint Step | 2,750 |
| Tokens Processed | 11.53B (11,534,336,000) |
| Target Tokens | 13.94B (13,941,866,496) |
| Progress | 82.7% |
| Best Validation Loss | 1.58289 |
| Evaluations Performed | 55 |
| HellaSwag (acc_norm) | 60.95% |
| MMLU (acc) | 33.43% |
| Saved | 2026-03-09 18:23 UTC |

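The progress figure can be reproduced from the raw token counts; the per-step token budget below is implied by the numbers rather than stated explicitly:

```python
# Reproduce the Training Progress figures from the raw token counts.
tokens_processed = 11_534_336_000
target_tokens = 13_941_866_496
checkpoint_step = 2_750

progress = 100 * tokens_processed / target_tokens
print(f"{progress:.1f}%")  # 82.7%

# Implied token budget per optimizer step: exactly 2**22 (~4.19M) tokens.
tokens_per_step = tokens_processed // checkpoint_step
print(tokens_per_step == 2**22)  # True
```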
## Training Configuration

### Optimizers

| Optimizer | Parameter Group | Learning Rate |
|:---|:---|:---|
| AdamW | head_params | 0.003216 |
| AdamW | embed_params | 0.1865 |
| AdamW | scalar_params | 0.02099 |
| Muon | hidden_matrix_params | 0.025 |

### Schedule & Regularization

| Parameter | Value |
|:---|:---|
| LR Scale | 1.0 |
| LR Schedule | n_phase_linear |
| LR Schedule — begin_after_fraction | 0.0 |
| LR Schedule — cooldown_fraction | 0.0 |
| LR Schedule — floor | 0.0 |
| LR Schedule — phases | [{'progress': 0.0, 'scale': 0.10171}, {'progress': 0.3, 'scale': 0.1}, {'progress': 1.0, 'scale': 0.05}] |
| LR Schedule — warmup_fraction | 0.0 |
| Gradient Accumulation Steps | 4 |
| Muon Warmup Steps | 300 |
| Seed | 1337 |

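The card does not define `n_phase_linear`; a plausible reading of the phases list is piecewise-linear interpolation of an LR multiplier between `(progress, scale)` breakpoints. A hypothetical sketch under that assumption (the real implementation may differ):

```python
# Hypothetical "n_phase_linear" schedule: piecewise-linear interpolation
# between the (progress, scale) breakpoints listed above.
PHASES = [(0.0, 0.10171), (0.3, 0.1), (1.0, 0.05)]

def lr_scale_at(progress: float, phases=PHASES) -> float:
    """Return the LR multiplier at a training progress in [0, 1]."""
    if progress <= phases[0][0]:
        return phases[0][1]
    for (p0, s0), (p1, s1) in zip(phases, phases[1:]):
        if progress <= p1:
            t = (progress - p0) / (p1 - p0)  # position within this phase
            return s0 + t * (s1 - s0)
    return phases[-1][1]

# E.g. the Muon hidden-matrix LR at 65% progress: 0.025 * ~0.075
print(0.025 * lr_scale_at(0.65))
```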
### Training Data

| Type | Sequence Length | Path |
|:---|:---|:---|
| fineweb-edu-shuffled | 16,384 | `data/fineweb-edu-shuffled/train/*.bin` |
| daisypie_chat | 16,384 | `data/daisypie_chat/` |

## All Hyperparameters

| Parameter | Value |
|:---|:---|
| window_size | 2048 |
| vocab_size | 49152 |
| eos_token_id | 49131 |
| num_layers | 26 |
| num_heads | 14 |
| model_dim | 1792 |
| head_dim | 128 |
| max_seq_len | 131072 |
| model_spec | daisy_milli |
| model_class | models.daisy.daisy_core.DaisyCore |
| target_tokens | 13941866496 |
| full_window_target_tokens | 13941866496 |
| torch_coordinate_descent_tuning | False |
| torch_inductor_config_max_autotune | False |
| overfit | False |
| full_windows | True |
| wandb_log | True |
| wandb_project | milli |
| wandb_run_name | milli_v18de_v2 |
| wandb_group | pretrain |
| init_model | JonathanMiddleton/daisy-milli-base-v18d.e-tokens296879128576 |
| use_value_embeddings | True |
| use_tied_embeddings | False |
| seed | 1337 |
| task_val_debug_log_samples | False |
| log_interval | 16384 |
| muon_warmup_steps | 300 |
| lr_scale | 1.0 |
| cooldown_fraction | 0.0 |
| lr_schedule | {"name": "n_phase_linear", "config": {"cooldown_fraction": 0.0, "phases": [{"progress": 0.0, "scale": 0.10171}, {"progress": 0.3, "scale": 0.1}, {"progress": 1.0, "scale": 0.05}], "floor": 0.0, "warmup_fraction": 0.0, "begin_after_fraction": 0.0}} |
| grad_acc_steps | 4 |
| val_loss_every_tokens | 209715200 |
| checkpoint_warmup_tokens | 6000000000 |
| checkpoint_per_n_tokens | 0 |
| save_checkpoint | True |
| benchmarks_frequency | 1 |
| mmlu_cache_bin_path | data/mmlu_cache/mmlu_cache.bin |
| mmlu_cache_bin_rebuild | False |
| task_training | False |
| track_last_n_layers | 0 |
chat_template.jinja ADDED
{#- Daisy Chat Template v2 -#}
{#- Supports: ChatML format, tool calling, multipart content -#}

{#- Macro to render content (string or multipart) -#}
{%- macro render_content(content) -%}
{%- if content is string -%}
{{ content }}
{%- elif content is iterable -%}
{%- for part in content -%}
{%- if part.type == 'text' -%}
{{ part.text }}
{%- elif part.type == 'tool_call' -%}
<|tool_call|>{{ part.text }}<|/tool_call|>
{%- elif part.type == 'tool_result' -%}
<|tool_result|>{{ part.text }}<|/tool_result|>
{%- elif part.type == 'python' -%}
<|python|>{{ part.text }}<|/python|>
{%- elif part.type == 'output' -%}
<|output|>{{ part.text }}<|/output|>
{%- elif part.type == 'think' -%}
<|think|>{{ part.text }}<|/think|>
{%- endif -%}
{%- endfor -%}
{%- else -%}
{{ content }}
{%- endif -%}
{%- endmacro -%}

{#- Main message loop -#}
{%- for message in messages -%}
{%- if message.role == 'system' -%}
<|im_start|>system
{{ message.content }}<|im_end|>
{% elif message.role == 'user' -%}
<|im_start|>user
{{ message.content }}<|im_end|>
{% elif message.role == 'assistant' -%}
<|im_start|>assistant
{% generation %}{{ render_content(message.content) }}<|im_end|>{% endgeneration %}
{% elif message.role == 'tool' -%}
<|tool_result|>{{ message.content }}<|/tool_result|>
{%- endif -%}
{%- endfor -%}

{#- Generation prompt -#}
{%- if add_generation_prompt -%}
<|im_start|>assistant
{% generation %}{% endgeneration %}
{%- endif -%}

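In practice this template is applied via `tokenizer.apply_chat_template`, and the `{% generation %}` tags are a transformers-specific Jinja extension. As a rough illustration only, the ChatML layout it emits for plain string messages can be sketched in plain Python:

```python
# Illustrative sketch of the ChatML layout the template above produces for
# simple string messages. Not the real renderer: the actual template handles
# multipart content and {% generation %} spans via apply_chat_template.
def render_chatml(messages, add_generation_prompt=False):
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = render_chatml(
    [{"role": "system", "content": "You are a helpful assistant."},
     {"role": "user", "content": "Hello!"}],
    add_generation_prompt=True,
)
print(prompt)
```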
checkpoint.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:a1191143bfc626a5b92c4768dab1d22cd33fdbf3bf346381e6dbd721f0e93141
size 16471242599
config.json ADDED
{
  "architectures": [
    "DaisyForCausalLM"
  ],
  "attn_all_layers": true,
  "attn_impl": "standard",
  "bos_token_id": 49131,
  "dtype": "float32",
  "eos_token_id": 49131,
  "eot_token_id": 49134,
  "head_dim": 128,
  "hidden_size": 1792,
  "max_position_embeddings": 131072,
  "model_dim": 1792,
  "model_type": "daisy",
  "num_attention_heads": 14,
  "num_heads": 14,
  "num_hidden_layers": 26,
  "num_key_value_heads": 14,
  "num_layers": 26,
  "padded_embeddings": false,
  "skip_mix_mode": "linear",
  "tokenizer_name": "jonathanmiddleton/daisy",
  "transformers_version": "5.3.0",
  "use_tied_embeddings": false,
  "use_value_embeddings": true,
  "vocab_size": 49152,
  "window_size": 2048
}
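A couple of sanity checks on the config values above, using a hand-copied subset of the file:

```python
# Consistency checks on config.json values quoted above.
config = {
    "hidden_size": 1792,
    "num_attention_heads": 14,
    "head_dim": 128,
    "window_size": 2048,
    "max_position_embeddings": 131072,
}

# Heads times head_dim must tile the hidden size exactly (14 * 128 = 1792).
assert config["num_attention_heads"] * config["head_dim"] == config["hidden_size"]

# The sliding window divides the maximum sequence length evenly.
print(config["max_position_embeddings"] // config["window_size"])  # 64
```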
generation_config.json ADDED
{
  "_from_model_config": true,
  "bos_token_id": 49131,
  "eos_token_id": 49131,
  "output_attentions": false,
  "output_hidden_states": false,
  "transformers_version": "5.3.0"
}
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:bca9450d97735063346401fd1cf88dea6096764cfdd53b0f93e0fa6b47d60447
size 4822418412
tokenizer.json ADDED
(contents not shown: file too large to render)
tokenizer_config.json ADDED
{
  "add_prefix_space": false,
  "backend": "tokenizers",
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "extra_special_tokens": [
    "<|tool_call|>",
    "<|/tool_call|>",
    "<|tool_result|>",
    "<|/tool_result|>",
    "<|python|>",
    "<|/python|>",
    "<|output|>",
    "<|/output|>",
    "<|think|>",
    "<|/think|>",
    "<|system|>",
    "<|user|>",
    "<|assistant|>",
    "<|reserved_0|>",
    "<|reserved_1|>",
    "<|reserved_2|>",
    "<|reserved_3|>"
  ],
  "is_local": false,
  "model_max_length": 131072,
  "pad_token": "<|pad|>",
  "tokenizer_class": "TokenizersBackend",
  "unk_token": null
}
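Several of the extra special tokens come in open/close pairs (`<|tool_call|>`/`<|/tool_call|>`, etc.), matching the markers the chat template wraps around multipart content. A hypothetical helper (not part of the released code) that wraps text in one of those pairs:

```python
# Hypothetical helper: wrap a text span in one of the paired special-token
# markers declared in tokenizer_config.json. Kinds like "system" or "user"
# have single markers only, so they are rejected here.
PAIRED = {"tool_call", "tool_result", "python", "output", "think"}

def wrap(kind: str, text: str) -> str:
    if kind not in PAIRED:
        raise ValueError(f"no paired markers for {kind!r}")
    return f"<|{kind}|>{text}<|/{kind}|>"

print(wrap("think", "check units first"))  # <|think|>check units first<|/think|>
```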