Add files using upload-large-folder tool

Files changed (8)

README.md +130 -0
chat_template.jinja +50 -0
checkpoint.pt +3 -0
config.json +29 -0
generation_config.json +8 -0
model.safetensors +3 -0
tokenizer.json +0 -0
tokenizer_config.json +31 -0

README.md ADDED

	@@ -0,0 +1,130 @@

+---
+library_name: transformers
+tags:
+- daisy
+- causal-lm
+- pretrained
+license: apache-2.0
+---
+# DaisyCore — daisy_milli
+## Model Description
+DaisyCore is a transformer with 26 layers, 14 attention heads, and a model dimension of 1,792. It uses block-causal sliding-window attention (window size 2,048) with the standard attention implementation.
+## Architecture
+| Property | Value |
+|:---|:---|
+| Architecture | DaisyCore |
+| Layers | 26 |
+| Attention Heads | 14 |
+| Model Dimension | 1,792 |
+| Head Dimension | 128 |
+| Sliding Window Size | 2,048 |
+| Max Sequence Length | 131,072 |
+| Vocabulary Size | 49,152 |
+| Attention Implementation | standard |
+| Value Embeddings | True |
+| Tied Embeddings | False |
+| Skip Mix Mode | linear |
+| Tokenizer | `jonathanmiddleton/daisy` |
+| Dtype | bfloat16 |
+| Parameters (total) | 2,323,120,245 |
+| Parameters (non-embedding) | 1,001,914,485 |
+| Parameters (embedding) | 1,321,205,760 |
+## Training Progress
+| Metric | Value |
+|:---|:---|
+| Checkpoint Step | 52,959 |
+| Tokens Processed | 143.26B (143,262,744,576) |
+| Target Tokens | 300.00B (300,000,000,000) |
+| Progress | 47.8% |
+| Best Validation Loss | 2.07058 |
+| Evaluations Performed | 912 |
+| Saved | 2026-03-06 00:15 UTC |
+## Training Configuration
+### Optimizers
+| Optimizer | Parameter Group | Learning Rate |
+|:---|:---|:---|
+| AdamW | head_params | 0.003216 |
+| AdamW | embed_params | 0.1865 |
+| AdamW | scalar_params | 0.02099 |
+| Muon | hidden_matrix_params | 0.025 |
+### Schedule & Regularization
+| Parameter | Value |
+|:---|:---|
+| LR Scale | 1.0 |
+| LR Schedule | n_phase_linear |
+| LR Schedule — begin_after_fraction | 0.0 |
+| LR Schedule — cooldown_fraction | 0.0 |
+| LR Schedule — floor | 0.0 |
+| LR Schedule — phases | [{'progress': 0.0, 'scale': 1.0}, {'progress': 0.36117676, 'scale': 0.20527}, {'progress': 1.0, 'scale': 0.1}] |
+| LR Schedule — warmup_fraction | 0.0 |
+| Gradient Accumulation Steps | 5 |
+| Muon Warmup Steps | 300 |
+| Seed | 1337 |
+### Training Data
+| Type | Sequence Length | Path |
+|:---|:---|:---|
+| fineweb-edu-dedup | 16,384 | `data/fineweb-edu-dedup/fineweb-edu-dedup_jonathanmiddleton_daisy_train_*.bin[000600:005000]` |
+### Checkpoint Provenance
+- **Resumed from**: `JonathanMiddleton/daisy-milli-base-v18d.b`
+## All Hyperparameters
+| Parameter | Value |
+|:---|:---|
+| window_size | 2048 |
+| vocab_size | 49152 |
+| eos_token_id | 49131 |
+| num_layers | 26 |
+| num_heads | 14 |
+| model_dim | 1792 |
+| head_dim | 128 |
+| max_seq_len | 131072 |
+| model_spec | daisy_milli |
+| model_class | models.daisy.daisy_core.DaisyCore |
+| target_tokens | 100000000000 |
+| full_window_target_tokens | 3000000000 |
+| torch_coordinate_descent_tuning | False |
+| torch_inductor_config_max_autotune | False |
+| overfit | False |
+| full_windows | False |
+| wandb_log | True |
+| wandb_project | milli |
+| wandb_run_name | milli_v18d.d |
+| wandb_group | pretrain |
+| resume_checkpoint | JonathanMiddleton/daisy-milli-base-v18d.b |
+| resume_target_tokens_override | 300000000000 |
+| use_value_embeddings | True |
+| use_tied_embeddings | False |
+| seed | 1337 |
+| task_val_debug_log_samples | False |
+| log_interval | 16384 |
+| muon_warmup_steps | 300 |
+| lr_scale | 1.0 |
+| cooldown_fraction | 0.0 |
+| lr_schedule | {"name": "n_phase_linear", "config": {"cooldown_fraction": 0.0, "phases": [{"progress": 0.0, "scale": 1.0}, {"progress": 0.36117676, "scale": 0.20527}, {"progress": 1.0, "scale": 0.1}], "floor": 0.0, "warmup_fraction": 0.0, "begin_after_fraction": 0.0}} |
+| grad_acc_steps | 5 |
+| val_loss_every_tokens | 245760000 |
+| checkpoint_warmup_tokens | 1 |
+| checkpoint_per_n_tokens | 245760000 |
+| save_checkpoint | True |
+| benchmarks_frequency | 2 |
+| mmlu_cache_bin_path | data/mmlu_cache/mmlu_cache.bin |
+| mmlu_cache_bin_rebuild | False |
+| task_training | False |
+| track_last_n_layers | 0 |
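
The `n_phase_linear` schedule in the README lists three (progress, scale) points. The sketch below shows one plausible reading: the learning-rate scale is linearly interpolated between neighbouring points as training progresses. The training code is not part of this commit, so the function name `lr_scale`, the constant `PHASES`, and the interpolation rule itself are assumptions made for illustration.

```python
from bisect import bisect_right

# Phase points copied from the "phases" entry of the README's LR schedule.
PHASES = [
    {"progress": 0.0, "scale": 1.0},
    {"progress": 0.36117676, "scale": 0.20527},
    {"progress": 1.0, "scale": 0.1},
]


def lr_scale(progress: float, phases=PHASES) -> float:
    """Piecewise-linear interpolation of the LR scale between phase points.

    Assumption: this mirrors what `n_phase_linear` does; it is not taken
    from the actual training code.
    """
    # Clamp progress into the range covered by the phase list.
    progress = min(max(progress, phases[0]["progress"]), phases[-1]["progress"])
    xs = [p["progress"] for p in phases]
    # Index of the phase point that closes the current segment.
    i = min(bisect_right(xs, progress), len(phases) - 1)
    lo, hi = phases[i - 1], phases[i]
    t = (progress - lo["progress"]) / (hi["progress"] - lo["progress"])
    return lo["scale"] + t * (hi["scale"] - lo["scale"])


# Progress at this checkpoint, using the token counts reported in the README.
checkpoint_progress = 143_262_744_576 / 300_000_000_000
checkpoint_scale = lr_scale(checkpoint_progress)
print(f"progress={checkpoint_progress:.4f} scale={checkpoint_scale:.4f}")
```

Under this reading, each optimizer's effective learning rate would be its listed base rate multiplied by `lr_scale(progress)` and by the `LR Scale` value of 1.0. Whether progress is measured against the 300B override or the original 100B target is also an assumption here.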

chat_template.jinja ADDED

	@@ -0,0 +1,50 @@

+{#- Daisy Chat Template v2 -#}
+{#- Supports: ChatML format, tool calling, multipart content -#}
+{#- Macro to render content (string or multipart) -#}
+{%- macro render_content(content) -%}
+{%- if content is string -%}
+{{ content }}
+{%- elif content is iterable -%}
+{%- for part in content -%}
+{%- if part.type == 'text' -%}
+{{ part.text }}
+{%- elif part.type == 'tool_call' -%}
+<|tool_call|>{{ part.text }}<|/tool_call|>
+{%- elif part.type == 'tool_result' -%}
+<|tool_result|>{{ part.text }}<|/tool_result|>
+{%- elif part.type == 'python' -%}
+<|python|>{{ part.text }}<|/python|>
+{%- elif part.type == 'output' -%}
+<|output|>{{ part.text }}<|/output|>
+{%- elif part.type == 'think' -%}
+<|think|>{{ part.text }}<|/think|>
+{%- endif -%}
+{%- endfor -%}
+{%- else -%}
+{{ content }}
+{%- endif -%}
+{%- endmacro -%}
+{#- Main message loop -#}
+{%- for message in messages -%}
+{%- if message.role == 'system' -%}
+<|im_start|>system
+{{ message.content }}<|im_end|>
+{% elif message.role == 'user' -%}
+<|im_start|>user
+{{ message.content }}<|im_end|>
+{% elif message.role == 'assistant' -%}
+<|im_start|>assistant
+{% generation %}{{ render_content(message.content) }}<|im_end|>{% endgeneration %}
+{% elif message.role == 'tool' -%}
+<|tool_result|>{{ message.content }}<|/tool_result|>
+{%- endif -%}
+{%- endfor -%}
+{#- Generation prompt -#}
+{%- if add_generation_prompt -%}
+<|im_start|>assistant
+{% generation %}{% endgeneration %}
+{%- endif -%}
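
To make the template's output easier to picture, here is a hand-written Python approximation of what it renders. It is a sketch, not the template engine: the helper names (`render_part`, `render_content`, `render_chat`) are hypothetical, the `{% generation %}` markers are ignored because they only tag spans for masking, and the exact newline placement may differ from the real output depending on the Jinja whitespace settings in use.

```python
# Part types that the template wraps in <|type|> ... <|/type|> markers.
WRAPPED_TYPES = {"tool_call", "tool_result", "python", "output", "think"}


def render_part(part: dict) -> str:
    """Render one item of a multipart content list."""
    if part["type"] == "text":
        return part["text"]
    if part["type"] in WRAPPED_TYPES:
        return f"<|{part['type']}|>{part['text']}<|/{part['type']}|>"
    return ""  # the template emits nothing for unrecognised part types


def render_content(content) -> str:
    """Render assistant content given as a plain string or a list of parts."""
    if isinstance(content, str):
        return content
    if isinstance(content, (list, tuple)):
        return "".join(render_part(p) for p in content)
    return str(content)


def render_chat(messages, add_generation_prompt: bool = False) -> str:
    """Approximate the ChatML-style layout produced by the template."""
    out = []
    for m in messages:
        role = m["role"]
        if role in ("system", "user"):
            out.append(f"<|im_start|>{role}\n{m['content']}<|im_end|>\n")
        elif role == "assistant":
            body = render_content(m["content"])
            out.append(f"<|im_start|>assistant\n{body}<|im_end|>\n")
        elif role == "tool":
            out.append(f"<|tool_result|>{m['content']}<|/tool_result|>")
    if add_generation_prompt:
        out.append("<|im_start|>assistant\n")
    return "".join(out)


example = render_chat(
    [
        {"role": "user", "content": "What is 2 + 2?"},
        {
            "role": "assistant",
            "content": [
                {"type": "think", "text": "Simple arithmetic."},
                {"type": "text", "text": "4"},
            ],
        },
    ],
    add_generation_prompt=True,
)
print(example)
```

Note that only assistant messages go through the multipart renderer; system, user, and tool messages are inserted as-is, matching the template's use of `message.content` for those roles.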

checkpoint.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ae09dbaf90f145ef447f694d79818bc06c458b6fe0f32d5fd81dbe28e31f0c3d
+size 16471242855
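
The three lines above are a Git LFS pointer rather than the checkpoint itself: a spec version, a SHA-256 object ID, and the size in bytes. A minimal sketch of reading such a pointer follows; the function name `parse_lfs_pointer` and the returned field names are hypothetical, chosen only for this illustration.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.splitlines():
        # Tolerate the leading "+" that a diff view adds to each line.
        line = line.lstrip("+").strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(
    "+version https://git-lfs.github.com/spec/v1\n"
    "+oid sha256:ae09dbaf90f145ef447f694d79818bc06c458b6fe0f32d5fd81dbe28e31f0c3d\n"
    "+size 16471242855\n"
)
print(f"{pointer['size_bytes'] / 2**30:.2f} GiB, hashed with {pointer['hash_algo']}")
```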

config.json ADDED

	@@ -0,0 +1,29 @@

+{
+  "architectures": [
+    "DaisyForCausalLM"
+  ],
+  "attn_all_layers": true,
+  "attn_impl": "standard",
+  "bos_token_id": 49131,
+  "dtype": "float32",
+  "eos_token_id": 49131,
+  "eot_token_id": 49134,
+  "head_dim": 128,
+  "hidden_size": 1792,
+  "max_position_embeddings": 131072,
+  "model_dim": 1792,
+  "model_type": "daisy",
+  "num_attention_heads": 14,
+  "num_heads": 14,
+  "num_hidden_layers": 26,
+  "num_key_value_heads": 14,
+  "num_layers": 26,
+  "padded_embeddings": false,
+  "skip_mix_mode": "linear",
+  "tokenizer_name": "jonathanmiddleton/daisy",
+  "transformers_version": "5.3.0",
+  "use_tied_embeddings": false,
+  "use_value_embeddings": true,
+  "vocab_size": 49152,
+  "window_size": 2048
+}
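
Several fields in this config duplicate one another (`hidden_size`/`model_dim`, `num_attention_heads`/`num_heads`, `num_hidden_layers`/`num_layers`), and the README's parameter counts can be cross-checked against `vocab_size` and `hidden_size`. The sketch below runs those checks on values copied from this commit. It only verifies arithmetic: how the embedding parameters are split across individual tables is not stated in the source, so the "15 tables" figure is a derived count, not a description of the architecture.

```python
# Values copied from config.json in this commit.
config = {
    "head_dim": 128,
    "hidden_size": 1792,
    "model_dim": 1792,
    "num_attention_heads": 14,
    "num_heads": 14,
    "num_hidden_layers": 26,
    "num_layers": 26,
    "vocab_size": 49152,
}

# Parameter counts copied from the README's Architecture table.
total_params = 2_323_120_245
non_embedding_params = 1_001_914_485
embedding_params = 1_321_205_760

# Duplicate fields should agree with each other.
assert config["hidden_size"] == config["model_dim"]
assert config["num_attention_heads"] == config["num_heads"]
assert config["num_hidden_layers"] == config["num_layers"]

# The heads should tile the model dimension exactly.
assert config["num_heads"] * config["head_dim"] == config["hidden_size"]

# The reported total should be the sum of its two components.
assert non_embedding_params + embedding_params == total_params

# One vocab-by-dim matrix, and how many of them the embedding count equals.
per_table = config["vocab_size"] * config["hidden_size"]
tables, remainder = divmod(embedding_params, per_table)
print(f"{per_table:,} parameters per table, {tables} tables, remainder {remainder}")
```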

generation_config.json ADDED

	@@ -0,0 +1,8 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 49131,
+  "eos_token_id": 49131,
+  "output_attentions": false,
+  "output_hidden_states": false,
+  "transformers_version": "5.3.0"
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c296f22800259f9c4ed9bca652b00ec2653930886a8d527a78231c90c38568eb
+size 4822418412

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,31 @@

+{
+  "add_prefix_space": false,
+  "backend": "tokenizers",
+  "bos_token": "<|endoftext|>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|endoftext|>",
+  "extra_special_tokens": [
+    "<|tool_call|>",
+    "<|/tool_call|>",
+    "<|tool_result|>",
+    "<|/tool_result|>",
+    "<|python|>",
+    "<|/python|>",
+    "<|output|>",
+    "<|/output|>",
+    "<|think|>",
+    "<|/think|>",
+    "<|system|>",
+    "<|user|>",
+    "<|assistant|>",
+    "<|reserved_0|>",
+    "<|reserved_1|>",
+    "<|reserved_2|>",
+    "<|reserved_3|>"
+  ],
+  "is_local": false,
+  "model_max_length": 131072,
+  "pad_token": "<|pad|>",
+  "tokenizer_class": "TokenizersBackend",
+  "unk_token": null
+}