phgrouptechs committed
Commit 75ddade · verified · 1 parent: 807f6d4

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,4 +1,5 @@
1
  ---
 
2
  library_name: peft
3
  model_name: tutor_model_output
4
  tags:
@@ -7,132 +8,55 @@ tags:
7
  - sft
8
  - transformers
9
  - trl
10
- - language-tutor
11
- - english
12
- - german
13
- - vietnamese
14
- - conversational
15
- - instruct
16
- language:
17
- - vi
18
- - en
19
- - de
20
  licence: license
21
  pipeline_tag: text-generation
22
  ---
23
 
24
- # 🎓 Denglish-8B-Instruct: AI Language Tutor for Vietnamese Learners
25
 
26
- **Denglish-8B-Instruct** is a fine-tuned LoRA adapter based on `unsloth/llama-3-8b-Instruct-bnb-4bit`. It is specifically designed to act as a strict yet friendly AI Language Tutor, assisting Vietnamese students in learning **English** and **German**.
 
27
 
28
- This model excels at identifying grammatical, spelling, and contextual errors from user inputs, explaining the mistakes clearly in **Vietnamese**, and providing perfectly corrected sentences in the target language.
29
 
30
- ## 🚀 Model Details
31
- - **Model Type:** Causal Language Model (Fine-tuned LoRA Adapter)
32
- - **Base Model:** Meta Llama 3 (8B-Instruct 4-bit quantized via Unsloth)
33
- - **Primary Languages:** Vietnamese (Explanations), English (Target), German (Target)
34
- - **Training Framework:** `TRL` (Transformer Reinforcement Learning) & `PEFT`
35
- - **Architecture:** Optimized for multi-modal integrations (Text, OCR/Images, and STT/Voice processing ecosystems).
36
-
37
- ## 💡 Intended Uses & Ecosystem
38
- This model is the core "Brain" of the **Denglish Omnichannel Platform** (integrated via RunPod Serverless, FastAPI, Telegram, and Facebook Messenger).
39
- It is intended to process inputs such as:
40
- 1. **Direct Text:** User types a sentence in English or German.
41
- 2. **Transcribed Audio (Whisper STT):** Correcting conversational mistakes from spoken language.
42
- 3. **Extracted Text from Images (OCR):** Grading handwritten or printed homework.
43
-
44
- ## 🛠️ How to Use (Quick Start)
45
-
46
- Since this is a LoRA adapter, you need to load the base model first and then merge it with this adapter using `peft`.
47
-
48
- ### Prerequisites
49
- ```bash
50
- pip install transformers accelerate bitsandbytes peft
51
- ```
52
- ## Inference Code
53
  ```python
54
- import torch
55
- from transformers import AutoModelForCausalLM, AutoTokenizer
56
- from peft import PeftModel
57
-
58
- # 1. Load Base Model and Tokenizer
59
- base_model_id = "unsloth/llama-3-8b-Instruct-bnb-4bit"
60
- lora_model_id = "phgrouptechs/Denglish-8B-Instruct"
61
-
62
- tokenizer = AutoTokenizer.from_pretrained(base_model_id)
63
- base_model = AutoModelForCausalLM.from_pretrained(
64
- base_model_id,
65
- torch_dtype=torch.bfloat16,
66
- device_map="auto"
67
- )
68
-
69
- # 2. Load the Denglish LoRA Adapter
70
- model = PeftModel.from_pretrained(base_model, lora_model_id)
71
-
72
- # 3. Prepare the Chat Prompt
73
- target_lang = "English" # or "German"
74
- user_mistake = "Hello, my name is John and I is a student."
75
-
76
- system_prompt = (
77
- f"You are a friendly and strict {target_lang} tutor for Vietnamese students. "
78
- f"The user provided a {target_lang} input: '{user_mistake}'. "
79
- f"Task: 1. Correct any grammatical, spelling, or pronunciation mistakes. "
80
- f"2. Explain the corrections clearly in Vietnamese. "
81
- f"3. Provide the perfectly corrected sentence in {target_lang} at the very end."
82
- )
83
-
84
- messages = [
85
- {"role": "system", "content": system_prompt},
86
- {"role": "user", "content": "Hãy chấm bài và sửa lỗi cho tôi."}
87
- ]
88
-
89
- # 4. Generate Response
90
- prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
91
- inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
92
-
93
- outputs = model.generate(
94
- **inputs,
95
- max_new_tokens=400,
96
- temperature=0.3, # Low temperature for accurate grammar corrections
97
- pad_token_id=tokenizer.eos_token_id
98
- )
99
-
100
- ai_response = tokenizer.batch_decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True)[0].strip()
101
- print(ai_response)
102
- ```
103
-
104
- ## 📝 Example Output
105
-
106
- **Input (User):** "Hello, my name is John and I is a student."
107
- **Target Language:** English
108
 
109
- **Output (AI):**
110
-
111
- Máy phát hiện lỗi sử dụng động từ to be. Chủ ngữ "I" phải đi với "am" thay "is".
112
-
113
- Câu đúng: "Hello, my name is John and I am a student."
114
 
115
  ## Training procedure
116
 
117
- [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/phgrouptechs-phgroup-technology-solutions-co-ltd/my-awesome-project/runs/7vqgjc2u)
118
 
119
 
120
 
121
  This model was trained with SFT.
122
 
123
- ## ⚠️ Limitations
124
- * **Quantization Constraints:** The base model is 4-bit quantized. While it is highly efficient, extremely complex logical reasoning might be slightly degraded compared to the FP16 base model.
125
-
126
- * **Language Scope:** The model is highly optimized for English/German to Vietnamese explanations. Using it for other language pairs might yield suboptimal results.
127
-
128
  ### Framework versions
129
 
130
  - PEFT 0.18.1
131
  - TRL: 0.29.0
132
- - Transformers: 5.2.0
133
  - Pytorch: 2.8.0+cu128
134
- - Datasets: 4.6.0
135
  - Tokenizers: 0.22.2
136
 
137
- ## 👨‍💻 Developed by
138
- **PHGROUP TECHNOLOGY SOLUTIONS CO., LTD** - Building AI-driven educational and omnichannel solutions.
 
1
  ---
2
+ base_model: unsloth/llama-3-8b-Instruct-bnb-4bit
3
  library_name: peft
4
  model_name: tutor_model_output
5
  tags:
 
8
  - sft
9
  - transformers
10
  - trl
11
  licence: license
12
  pipeline_tag: text-generation
13
  ---
14
 
15
+ # Model Card for tutor_model_output
16
 
17
+ This model is a fine-tuned version of [unsloth/llama-3-8b-Instruct-bnb-4bit](https://huggingface.co/unsloth/llama-3-8b-Instruct-bnb-4bit).
18
+ It has been trained using [TRL](https://github.com/huggingface/trl).
19
 
20
+ ## Quick start
21
 
22
  ```python
23
+ from transformers import pipeline
24
 
25
+ question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
26
+ generator = pipeline("text-generation", model="None", device="cuda")
27
+ output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
28
+ print(output["generated_text"])
29
+ ```
30
 
31
  ## Training procedure
32
 
33
+ [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/phgrouptechs-phgroup-technology-solutions-co-ltd/my-awesome-project/runs/74x5vj6l)
34
 
35
 
36
 
37
  This model was trained with SFT.
38
 
39
  ### Framework versions
40
 
41
  - PEFT 0.18.1
42
  - TRL: 0.29.0
43
+ - Transformers: 5.3.0
44
  - Pytorch: 2.8.0+cu128
45
+ - Datasets: 4.6.1
46
  - Tokenizers: 0.22.2
47
 
48
+ ## Citations
49
+
50
+
51
+
52
+ Cite TRL as:
53
+
54
+ ```bibtex
55
+ @software{vonwerra2020trl,
56
+ title = {{TRL: Transformers Reinforcement Learning}},
57
+ author = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
58
+ license = {Apache-2.0},
59
+ url = {https://github.com/huggingface/trl},
60
+ year = {2020}
61
+ }
62
+ ```
adapter_config.json CHANGED
@@ -29,13 +29,13 @@
29
  "rank_pattern": {},
30
  "revision": null,
31
  "target_modules": [
32
- "down_proj",
33
  "o_proj",
34
- "k_proj",
35
  "q_proj",
36
- "v_proj",
37
  "gate_proj",
38
- "up_proj"
39
  ],
40
  "target_parameters": null,
41
  "task_type": "CAUSAL_LM",
 
29
  "rank_pattern": {},
30
  "revision": null,
31
  "target_modules": [
32
+ "v_proj",
33
  "o_proj",
34
+ "up_proj",
35
  "q_proj",
36
+ "down_proj",
37
  "gate_proj",
38
+ "k_proj"
39
  ],
40
  "target_parameters": null,
41
  "task_type": "CAUSAL_LM",
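
Reviewer note: the `target_modules` hunk above looks like a reordering rather than a change in which layers carry LoRA weights. A minimal sketch to check that, with both lists copied by hand from the hunk (the helper name `same_modules` is hypothetical, not part of PEFT):

```python
# Module names copied from the removed (-) and added (+) lines of the hunk.
old_target_modules = [
    "down_proj", "o_proj", "k_proj", "q_proj", "v_proj", "gate_proj", "up_proj",
]
new_target_modules = [
    "v_proj", "o_proj", "up_proj", "q_proj", "down_proj", "gate_proj", "k_proj",
]


def same_modules(old, new):
    """Return True when both lists name the same modules, ignoring order."""
    return sorted(old) == sorted(new)


order_only_change = same_modules(old_target_modules, new_target_modules)
print("order-only change:", order_only_change)
```

If the check holds, the config diff by itself says nothing about a change in adapter architecture; the differences between the two runs would have to come from the weights and the training data.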
adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:27e2e812aa91b0af98fa9af3f5cbd95f3212af35d91ec3ab0e8d1cf1f47b5ba6
3
  size 83946192
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:52c5f909945589d0c78975a1cb4af27dcba08206910975f240e0ceb21013a2e2
3
  size 83946192
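
Reviewer note: the `.safetensors`, `optimizer.pt` and `rng_state.pth` entries in this commit are Git LFS pointer files, so the diff shows only a new `oid` with an unchanged `size`. A small sketch of reading such a pointer, assuming the three-line `version` / `oid` / `size` layout shown above (the function name `parse_lfs_pointer` is hypothetical):

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>".
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Old and new pointers for adapter_model.safetensors, copied from the hunk above.
old_pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:27e2e812aa91b0af98fa9af3f5cbd95f3212af35d91ec3ab0e8d1cf1f47b5ba6
size 83946192
"""
new_pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:52c5f909945589d0c78975a1cb4af27dcba08206910975f240e0ceb21013a2e2
size 83946192
"""

old = parse_lfs_pointer(old_pointer)
new = parse_lfs_pointer(new_pointer)
same_size = old["size"] == new["size"]
same_content = old["digest"] == new["digest"]
print("same size:", same_size, "| same content:", same_content)
```

An identical byte size with a different hash is what one would expect when the adapter shape is unchanged and only the trained values differ.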
checkpoint-100/adapter_config.json CHANGED
@@ -29,13 +29,13 @@
29
  "rank_pattern": {},
30
  "revision": null,
31
  "target_modules": [
32
- "down_proj",
33
  "o_proj",
34
- "k_proj",
35
  "q_proj",
36
- "v_proj",
37
  "gate_proj",
38
- "up_proj"
39
  ],
40
  "target_parameters": null,
41
  "task_type": "CAUSAL_LM",
 
29
  "rank_pattern": {},
30
  "revision": null,
31
  "target_modules": [
32
+ "v_proj",
33
  "o_proj",
34
+ "up_proj",
35
  "q_proj",
36
+ "down_proj",
37
  "gate_proj",
38
+ "k_proj"
39
  ],
40
  "target_parameters": null,
41
  "task_type": "CAUSAL_LM",
checkpoint-100/adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:9a2c2b8092d72e36662bbcb939947cb8b00883bb89d3249cfedbc4e6a800463f
3
  size 83946192
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fd3571d376b0d67ebe69939a4324b1619b93944194edad7b9e1b8e8503fac290
3
  size 83946192
checkpoint-100/optimizer.pt CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:189d043694baff6a2630a6b1e5a58f07b1e7a255a848278d3f7622fe030eabe4
3
  size 335817867
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2873fdc2d53f293c0fddd0c6945e25936379b8656d7d9730013ad80daac1db00
3
  size 335817867
checkpoint-100/rng_state.pth CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:656940eec6948746efc59dba0c191ea5ae91cfbd43a4858cae4f839eac52b6a0
3
  size 14645
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:dd42335f66aa8837109aed797819fac4d73aa4b840d682d5e72348336f572739
3
  size 14645
checkpoint-100/trainer_state.json CHANGED
@@ -2,7 +2,7 @@
2
  "best_global_step": null,
3
  "best_metric": null,
4
  "best_model_checkpoint": null,
5
- "epoch": 0.010103561505430665,
6
  "eval_steps": 500,
7
  "global_step": 100,
8
  "is_hyper_param_search": false,
@@ -10,103 +10,103 @@
10
  "is_world_process_zero": true,
11
  "log_history": [
12
  {
13
- "entropy": 1.45259268283844,
14
- "epoch": 0.0010103561505430665,
15
- "grad_norm": 0.2451171875,
16
  "learning_rate": 0.0001964,
17
- "loss": 1.7130912780761718,
18
- "mean_token_accuracy": 0.6436435863375664,
19
- "num_tokens": 28151.0,
20
  "step": 10
21
  },
22
  {
23
- "entropy": 1.3634081721305846,
24
- "epoch": 0.002020712301086133,
25
- "grad_norm": 0.484375,
26
  "learning_rate": 0.00019240000000000001,
27
- "loss": 1.3450921058654786,
28
- "mean_token_accuracy": 0.686880823969841,
29
- "num_tokens": 57374.0,
30
  "step": 20
31
  },
32
  {
33
- "entropy": 1.1857430338859558,
34
- "epoch": 0.0030310684516291994,
35
- "grad_norm": 0.322265625,
36
  "learning_rate": 0.0001884,
37
- "loss": 1.205996608734131,
38
- "mean_token_accuracy": 0.706908255815506,
39
- "num_tokens": 86755.0,
40
  "step": 30
41
  },
42
  {
43
- "entropy": 1.156875231862068,
44
- "epoch": 0.004041424602172266,
45
- "grad_norm": 0.359375,
46
  "learning_rate": 0.0001844,
47
- "loss": 1.1667716026306152,
48
- "mean_token_accuracy": 0.7132074117660523,
49
- "num_tokens": 114130.0,
50
  "step": 40
51
  },
52
  {
53
- "entropy": 1.083250343799591,
54
- "epoch": 0.005051780752715332,
55
- "grad_norm": 0.330078125,
56
  "learning_rate": 0.00018040000000000002,
57
- "loss": 1.1047730445861816,
58
- "mean_token_accuracy": 0.7196864277124405,
59
- "num_tokens": 141768.0,
60
  "step": 50
61
  },
62
  {
63
- "entropy": 1.0874410301446915,
64
- "epoch": 0.006062136903258399,
65
- "grad_norm": 0.265625,
66
  "learning_rate": 0.0001764,
67
- "loss": 1.0894905090332032,
68
- "mean_token_accuracy": 0.7210412979125976,
69
- "num_tokens": 172272.0,
70
  "step": 60
71
  },
72
  {
73
- "entropy": 1.1303296595811845,
74
- "epoch": 0.007072493053801465,
75
- "grad_norm": 0.25390625,
76
  "learning_rate": 0.00017240000000000002,
77
- "loss": 1.1445048332214356,
78
- "mean_token_accuracy": 0.7178041815757752,
79
- "num_tokens": 200909.0,
80
  "step": 70
81
  },
82
  {
83
- "entropy": 1.0938484042882919,
84
- "epoch": 0.008082849204344532,
85
- "grad_norm": 0.271484375,
86
  "learning_rate": 0.0001684,
87
- "loss": 1.098367691040039,
88
- "mean_token_accuracy": 0.7246084690093995,
89
- "num_tokens": 228726.0,
90
  "step": 80
91
  },
92
  {
93
- "entropy": 1.0779876083135604,
94
- "epoch": 0.009093205354887599,
95
- "grad_norm": 0.2216796875,
96
  "learning_rate": 0.0001644,
97
- "loss": 1.0803230285644532,
98
- "mean_token_accuracy": 0.720952507853508,
99
- "num_tokens": 255181.0,
100
  "step": 90
101
  },
102
  {
103
- "entropy": 1.1645614862442017,
104
- "epoch": 0.010103561505430665,
105
- "grad_norm": 0.1826171875,
106
  "learning_rate": 0.00016040000000000002,
107
- "loss": 1.147304630279541,
108
- "mean_token_accuracy": 0.7067203193902969,
109
- "num_tokens": 283616.0,
110
  "step": 100
111
  }
112
  ],
@@ -127,7 +127,7 @@
127
  "attributes": {}
128
  }
129
  },
130
- "total_flos": 2.9285710718828544e+16,
131
  "train_batch_size": 8,
132
  "trial_name": null,
133
  "trial_params": null
 
2
  "best_global_step": null,
3
  "best_metric": null,
4
  "best_model_checkpoint": null,
5
+ "epoch": 7.401168496482224e-05,
6
  "eval_steps": 500,
7
  "global_step": 100,
8
  "is_hyper_param_search": false,
 
10
  "is_world_process_zero": true,
11
  "log_history": [
12
  {
13
+ "entropy": 1.4767830133438111,
14
+ "epoch": 7.401168496482225e-06,
15
+ "grad_norm": 0.578125,
16
  "learning_rate": 0.0001964,
17
+ "loss": 1.6877475738525392,
18
+ "mean_token_accuracy": 0.7061349496245384,
19
+ "num_tokens": 14911.0,
20
  "step": 10
21
  },
22
  {
23
+ "entropy": 1.0930627048015595,
24
+ "epoch": 1.480233699296445e-05,
25
+ "grad_norm": 0.458984375,
26
  "learning_rate": 0.00019240000000000001,
27
+ "loss": 1.0562871932983398,
28
+ "mean_token_accuracy": 0.8108290940523147,
29
+ "num_tokens": 28646.0,
30
  "step": 20
31
  },
32
  {
33
+ "entropy": 0.8788679152727127,
34
+ "epoch": 2.2203505489446674e-05,
35
+ "grad_norm": 0.5859375,
36
  "learning_rate": 0.0001884,
37
+ "loss": 0.8974875450134278,
38
+ "mean_token_accuracy": 0.8296987593173981,
39
+ "num_tokens": 41474.0,
40
  "step": 30
41
  },
42
  {
43
+ "entropy": 0.8145956963300705,
44
+ "epoch": 2.96046739859289e-05,
45
+ "grad_norm": 0.39453125,
46
  "learning_rate": 0.0001844,
47
+ "loss": 0.8066701889038086,
48
+ "mean_token_accuracy": 0.8340015441179276,
49
+ "num_tokens": 54466.0,
50
  "step": 40
51
  },
52
  {
53
+ "entropy": 0.7157480388879776,
54
+ "epoch": 3.700584248241112e-05,
55
+ "grad_norm": 0.326171875,
56
  "learning_rate": 0.00018040000000000002,
57
+ "loss": 0.7251500129699707,
58
+ "mean_token_accuracy": 0.8420351594686508,
59
+ "num_tokens": 66880.0,
60
  "step": 50
61
  },
62
  {
63
+ "entropy": 0.7959431439638138,
64
+ "epoch": 4.440701097889335e-05,
65
+ "grad_norm": 0.326171875,
66
  "learning_rate": 0.0001764,
67
+ "loss": 0.8049167633056641,
68
+ "mean_token_accuracy": 0.8289562940597535,
69
+ "num_tokens": 80036.0,
70
  "step": 60
71
  },
72
  {
73
+ "entropy": 0.8342548221349716,
74
+ "epoch": 5.180817947537557e-05,
75
+ "grad_norm": 0.326171875,
76
  "learning_rate": 0.00017240000000000002,
77
+ "loss": 0.8336853981018066,
78
+ "mean_token_accuracy": 0.8279720038175583,
79
+ "num_tokens": 93357.0,
80
  "step": 70
81
  },
82
  {
83
+ "entropy": 0.7970967918634415,
84
+ "epoch": 5.92093479718578e-05,
85
+ "grad_norm": 0.73046875,
86
  "learning_rate": 0.0001684,
87
+ "loss": 0.7949181079864502,
88
+ "mean_token_accuracy": 0.828959608078003,
89
+ "num_tokens": 106951.0,
90
  "step": 80
91
  },
92
  {
93
+ "entropy": 0.7967441529035568,
94
+ "epoch": 6.661051646834002e-05,
95
+ "grad_norm": 0.34375,
96
  "learning_rate": 0.0001644,
97
+ "loss": 0.8285197257995606,
98
+ "mean_token_accuracy": 0.8272027671337128,
99
+ "num_tokens": 120269.0,
100
  "step": 90
101
  },
102
  {
103
+ "entropy": 0.7741447448730469,
104
+ "epoch": 7.401168496482224e-05,
105
+ "grad_norm": 0.271484375,
106
  "learning_rate": 0.00016040000000000002,
107
+ "loss": 0.7636381626129151,
108
+ "mean_token_accuracy": 0.8373189926147461,
109
+ "num_tokens": 133116.0,
110
  "step": 100
111
  }
112
  ],
 
127
  "attributes": {}
128
  }
129
  },
130
+ "total_flos": 9597079982112768.0,
131
  "train_batch_size": 8,
132
  "trial_name": null,
133
  "trial_params": null
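
Reviewer note: the `learning_rate` lines are unchanged context in this hunk, so both runs appear to share one schedule. The sketch below fits a straight line to the ten logged values. The schedule type and total step count are not stated anywhere in this diff, so anything read off the fit is an inference, not a recorded setting.

```python
# (step, learning_rate) pairs copied from the log_history entries above.
logged = [
    (10, 0.0001964),
    (20, 0.00019240000000000001),
    (30, 0.0001884),
    (40, 0.0001844),
    (50, 0.00018040000000000002),
    (60, 0.0001764),
    (70, 0.00017240000000000002),
    (80, 0.0001684),
    (90, 0.0001644),
    (100, 0.00016040000000000002),
]


def fit_line(points):
    """Ordinary least-squares fit of lr = intercept + slope * step."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(logged)
# Step at which the fitted line would cross zero.
steps_to_zero = -intercept / slope
print(f"slope per step: {slope:.3e}")
print(f"fitted line reaches zero near step {steps_to_zero:.0f}")
```

The values fall on a line that drops by about 4e-7 per step, which would be consistent with a linear decay from roughly 2e-4 over about 500 steps.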
checkpoint-200/adapter_config.json CHANGED
@@ -29,13 +29,13 @@
29
  "rank_pattern": {},
30
  "revision": null,
31
  "target_modules": [
32
- "down_proj",
33
  "o_proj",
34
- "k_proj",
35
  "q_proj",
36
- "v_proj",
37
  "gate_proj",
38
- "up_proj"
39
  ],
40
  "target_parameters": null,
41
  "task_type": "CAUSAL_LM",
 
29
  "rank_pattern": {},
30
  "revision": null,
31
  "target_modules": [
32
+ "v_proj",
33
  "o_proj",
34
+ "up_proj",
35
  "q_proj",
36
+ "down_proj",
37
  "gate_proj",
38
+ "k_proj"
39
  ],
40
  "target_parameters": null,
41
  "task_type": "CAUSAL_LM",
checkpoint-200/adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:09fa87e3d3e86a067d90eaa846ea88c86e8db4d6dfc7c81c48161d222148cc90
3
  size 83946192
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2b38a965a78dd2741df9584b07323c6472be2985ef11caf5e56857e87bb65fb4
3
  size 83946192
checkpoint-200/optimizer.pt CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:3067ac6f15a78cb59a530f24b6f633438906cff514328780a56664d4019d1cac
3
  size 335817867
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7ae2af00d109790f6a0cae3a0a00c8da0f1d9bb6b988a377f14be3b34936563f
3
  size 335817867
checkpoint-200/rng_state.pth CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:f83c2fb90a464d2069f8c9696adef67a1221780665f6aa89b1aee6e5e66a9bb1
3
  size 14645
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:daadb075e2f031fbda74514b09d5bc1b433960d924ec2d86606ab755c3504c6c
3
  size 14645
checkpoint-200/trainer_state.json CHANGED
@@ -2,7 +2,7 @@
2
  "best_global_step": null,
3
  "best_metric": null,
4
  "best_model_checkpoint": null,
5
- "epoch": 0.02020712301086133,
6
  "eval_steps": 500,
7
  "global_step": 200,
8
  "is_hyper_param_search": false,
@@ -10,203 +10,203 @@
10
  "is_world_process_zero": true,
11
  "log_history": [
12
  {
13
- "entropy": 1.45259268283844,
14
- "epoch": 0.0010103561505430665,
15
- "grad_norm": 0.2451171875,
16
  "learning_rate": 0.0001964,
17
- "loss": 1.7130912780761718,
18
- "mean_token_accuracy": 0.6436435863375664,
19
- "num_tokens": 28151.0,
20
  "step": 10
21
  },
22
  {
23
- "entropy": 1.3634081721305846,
24
- "epoch": 0.002020712301086133,
25
- "grad_norm": 0.484375,
26
  "learning_rate": 0.00019240000000000001,
27
- "loss": 1.3450921058654786,
28
- "mean_token_accuracy": 0.686880823969841,
29
- "num_tokens": 57374.0,
30
  "step": 20
31
  },
32
  {
33
- "entropy": 1.1857430338859558,
34
- "epoch": 0.0030310684516291994,
35
- "grad_norm": 0.322265625,
36
  "learning_rate": 0.0001884,
37
- "loss": 1.205996608734131,
38
- "mean_token_accuracy": 0.706908255815506,
39
- "num_tokens": 86755.0,
40
  "step": 30
41
  },
42
  {
43
- "entropy": 1.156875231862068,
44
- "epoch": 0.004041424602172266,
45
- "grad_norm": 0.359375,
46
  "learning_rate": 0.0001844,
47
- "loss": 1.1667716026306152,
48
- "mean_token_accuracy": 0.7132074117660523,
49
- "num_tokens": 114130.0,
50
  "step": 40
51
  },
52
  {
53
- "entropy": 1.083250343799591,
54
- "epoch": 0.005051780752715332,
55
- "grad_norm": 0.330078125,
56
  "learning_rate": 0.00018040000000000002,
57
- "loss": 1.1047730445861816,
58
- "mean_token_accuracy": 0.7196864277124405,
59
- "num_tokens": 141768.0,
60
  "step": 50
61
  },
62
  {
63
- "entropy": 1.0874410301446915,
64
- "epoch": 0.006062136903258399,
65
- "grad_norm": 0.265625,
66
  "learning_rate": 0.0001764,
67
- "loss": 1.0894905090332032,
68
- "mean_token_accuracy": 0.7210412979125976,
69
- "num_tokens": 172272.0,
70
  "step": 60
71
  },
72
  {
73
- "entropy": 1.1303296595811845,
74
- "epoch": 0.007072493053801465,
75
- "grad_norm": 0.25390625,
76
  "learning_rate": 0.00017240000000000002,
77
- "loss": 1.1445048332214356,
78
- "mean_token_accuracy": 0.7178041815757752,
79
- "num_tokens": 200909.0,
80
  "step": 70
81
  },
82
  {
83
- "entropy": 1.0938484042882919,
84
- "epoch": 0.008082849204344532,
85
- "grad_norm": 0.271484375,
86
  "learning_rate": 0.0001684,
87
- "loss": 1.098367691040039,
88
- "mean_token_accuracy": 0.7246084690093995,
89
- "num_tokens": 228726.0,
90
  "step": 80
91
  },
92
  {
93
- "entropy": 1.0779876083135604,
94
- "epoch": 0.009093205354887599,
95
- "grad_norm": 0.2216796875,
96
  "learning_rate": 0.0001644,
97
- "loss": 1.0803230285644532,
98
- "mean_token_accuracy": 0.720952507853508,
99
- "num_tokens": 255181.0,
100
  "step": 90
101
  },
102
  {
103
- "entropy": 1.1645614862442017,
104
- "epoch": 0.010103561505430665,
105
- "grad_norm": 0.1826171875,
106
  "learning_rate": 0.00016040000000000002,
107
- "loss": 1.147304630279541,
108
- "mean_token_accuracy": 0.7067203193902969,
109
- "num_tokens": 283616.0,
110
  "step": 100
111
  },
112
  {
113
- "entropy": 1.126008078455925,
114
- "epoch": 0.011113917655973731,
115
- "grad_norm": 0.2001953125,
116
  "learning_rate": 0.0001564,
117
- "loss": 1.1415093421936036,
118
- "mean_token_accuracy": 0.7101349741220474,
119
- "num_tokens": 312151.0,
120
  "step": 110
121
  },
122
  {
123
- "entropy": 1.091178685426712,
124
- "epoch": 0.012124273806516797,
125
- "grad_norm": 0.1953125,
126
  "learning_rate": 0.00015240000000000002,
127
- "loss": 1.0913947105407715,
128
- "mean_token_accuracy": 0.7238801747560502,
129
- "num_tokens": 340776.0,
130
  "step": 120
131
  },
132
  {
133
- "entropy": 1.2382428109645844,
134
- "epoch": 0.013134629957059864,
135
- "grad_norm": 0.2099609375,
136
  "learning_rate": 0.0001484,
137
- "loss": 1.2411503791809082,
138
- "mean_token_accuracy": 0.697870621085167,
139
- "num_tokens": 371270.0,
140
  "step": 130
141
  },
142
  {
143
- "entropy": 1.1168828099966048,
144
- "epoch": 0.01414498610760293,
145
- "grad_norm": 0.220703125,
146
  "learning_rate": 0.0001444,
147
- "loss": 1.1341249465942382,
148
- "mean_token_accuracy": 0.7141003280878067,
149
- "num_tokens": 400176.0,
150
  "step": 140
151
  },
152
  {
153
- "entropy": 1.114673560857773,
154
- "epoch": 0.015155342258145996,
155
- "grad_norm": 0.2109375,
156
  "learning_rate": 0.0001404,
157
- "loss": 1.1116752624511719,
158
- "mean_token_accuracy": 0.7234076589345932,
159
- "num_tokens": 427204.0,
160
  "step": 150
161
  },
162
  {
163
- "entropy": 1.1378572463989258,
164
- "epoch": 0.016165698408689064,
165
- "grad_norm": 0.1904296875,
166
  "learning_rate": 0.0001364,
167
- "loss": 1.1589903831481934,
168
- "mean_token_accuracy": 0.7053093910217285,
169
- "num_tokens": 458094.0,
170
  "step": 160
171
  },
172
  {
173
- "entropy": 1.110730269551277,
174
- "epoch": 0.01717605455923213,
175
- "grad_norm": 0.1962890625,
176
  "learning_rate": 0.00013240000000000002,
177
- "loss": 1.087682342529297,
178
- "mean_token_accuracy": 0.7177392661571502,
179
- "num_tokens": 487098.0,
180
  "step": 170
181
  },
182
  {
183
- "entropy": 1.0602406531572341,
184
- "epoch": 0.018186410709775197,
185
- "grad_norm": 0.228515625,
186
  "learning_rate": 0.0001284,
187
- "loss": 1.0950936317443847,
188
- "mean_token_accuracy": 0.7214239358901977,
189
- "num_tokens": 516025.0,
190
  "step": 180
191
  },
192
  {
193
- "entropy": 1.1597254037857057,
194
- "epoch": 0.01919676686031826,
195
- "grad_norm": 0.203125,
196
  "learning_rate": 0.00012440000000000002,
197
- "loss": 1.1208978652954102,
198
- "mean_token_accuracy": 0.7143576145172119,
199
- "num_tokens": 544810.0,
200
  "step": 190
201
  },
202
  {
203
- "entropy": 1.0519475936889648,
204
- "epoch": 0.02020712301086133,
205
- "grad_norm": 0.20703125,
206
  "learning_rate": 0.0001204,
207
- "loss": 1.0744948387145996,
208
- "mean_token_accuracy": 0.718455109000206,
209
- "num_tokens": 573002.0,
210
  "step": 200
211
  }
212
  ],
@@ -227,7 +227,7 @@
227
  "attributes": {}
228
  }
229
  },
230
- "total_flos": 6.072716910585446e+16,
231
  "train_batch_size": 8,
232
  "trial_name": null,
233
  "trial_params": null
 
2
  "best_global_step": null,
3
  "best_metric": null,
4
  "best_model_checkpoint": null,
5
+ "epoch": 0.00014802336992964448,
6
  "eval_steps": 500,
7
  "global_step": 200,
8
  "is_hyper_param_search": false,
 
10
  "is_world_process_zero": true,
11
  "log_history": [
12
  {
13
+ "entropy": 1.4767830133438111,
14
+ "epoch": 7.401168496482225e-06,
15
+ "grad_norm": 0.578125,
16
  "learning_rate": 0.0001964,
17
+ "loss": 1.6877475738525392,
18
+ "mean_token_accuracy": 0.7061349496245384,
19
+ "num_tokens": 14911.0,
20
  "step": 10
21
  },
22
  {
23
+ "entropy": 1.0930627048015595,
24
+ "epoch": 1.480233699296445e-05,
25
+ "grad_norm": 0.458984375,
26
  "learning_rate": 0.00019240000000000001,
27
+ "loss": 1.0562871932983398,
28
+ "mean_token_accuracy": 0.8108290940523147,
29
+ "num_tokens": 28646.0,
30
  "step": 20
31
  },
32
  {
33
+ "entropy": 0.8788679152727127,
34
+ "epoch": 2.2203505489446674e-05,
35
+ "grad_norm": 0.5859375,
36
  "learning_rate": 0.0001884,
37
+ "loss": 0.8974875450134278,
38
+ "mean_token_accuracy": 0.8296987593173981,
39
+ "num_tokens": 41474.0,
40
  "step": 30
41
  },
42
  {
43
+ "entropy": 0.8145956963300705,
44
+ "epoch": 2.96046739859289e-05,
45
+ "grad_norm": 0.39453125,
46
  "learning_rate": 0.0001844,
47
+ "loss": 0.8066701889038086,
48
+ "mean_token_accuracy": 0.8340015441179276,
49
+ "num_tokens": 54466.0,
50
  "step": 40
51
  },
52
  {
53
+ "entropy": 0.7157480388879776,
54
+ "epoch": 3.700584248241112e-05,
55
+ "grad_norm": 0.326171875,
56
  "learning_rate": 0.00018040000000000002,
57
+ "loss": 0.7251500129699707,
58
+ "mean_token_accuracy": 0.8420351594686508,
59
+ "num_tokens": 66880.0,
60
  "step": 50
61
  },
62
  {
63
+ "entropy": 0.7959431439638138,
64
+ "epoch": 4.440701097889335e-05,
65
+ "grad_norm": 0.326171875,
66
  "learning_rate": 0.0001764,
67
+ "loss": 0.8049167633056641,
68
+ "mean_token_accuracy": 0.8289562940597535,
69
+ "num_tokens": 80036.0,
70
  "step": 60
71
  },
72
  {
73
+ "entropy": 0.8342548221349716,
74
+ "epoch": 5.180817947537557e-05,
75
+ "grad_norm": 0.326171875,
76
  "learning_rate": 0.00017240000000000002,
77
+ "loss": 0.8336853981018066,
78
+ "mean_token_accuracy": 0.8279720038175583,
79
+ "num_tokens": 93357.0,
80
  "step": 70
81
  },
82
  {
83
+ "entropy": 0.7970967918634415,
84
+ "epoch": 5.92093479718578e-05,
85
+ "grad_norm": 0.73046875,
86
  "learning_rate": 0.0001684,
87
+ "loss": 0.7949181079864502,
88
+ "mean_token_accuracy": 0.828959608078003,
89
+ "num_tokens": 106951.0,
90
  "step": 80
91
  },
92
  {
93
+ "entropy": 0.7967441529035568,
94
+ "epoch": 6.661051646834002e-05,
95
+ "grad_norm": 0.34375,
96
  "learning_rate": 0.0001644,
97
+ "loss": 0.8285197257995606,
98
+ "mean_token_accuracy": 0.8272027671337128,
99
+ "num_tokens": 120269.0,
100
  "step": 90
101
  },
102
  {
103
+ "entropy": 0.7741447448730469,
104
+ "epoch": 7.401168496482224e-05,
105
+ "grad_norm": 0.271484375,
106
  "learning_rate": 0.00016040000000000002,
107
+ "loss": 0.7636381626129151,
108
+ "mean_token_accuracy": 0.8373189926147461,
109
+ "num_tokens": 133116.0,
110
  "step": 100
111
  },
112
  {
113
+ "entropy": 0.72959463596344,
114
+ "epoch": 8.141285346130448e-05,
115
+ "grad_norm": 0.421875,
116
  "learning_rate": 0.0001564,
117
+ "loss": 0.7404542446136475,
118
+ "mean_token_accuracy": 0.8400259047746659,
119
+ "num_tokens": 146103.0,
120
  "step": 110
121
  },
122
  {
123
+ "entropy": 0.777249938249588,
124
+ "epoch": 8.88140219577867e-05,
125
+ "grad_norm": 0.3984375,
126
  "learning_rate": 0.00015240000000000002,
127
+ "loss": 0.7868029117584229,
128
+ "mean_token_accuracy": 0.8342386931180954,
129
+ "num_tokens": 158980.0,
130
  "step": 120
131
  },
132
  {
133
+ "entropy": 0.8305783897638321,
134
+ "epoch": 9.621519045426892e-05,
135
+ "grad_norm": 0.328125,
136
  "learning_rate": 0.0001484,
137
+ "loss": 0.8155685424804687,
138
+ "mean_token_accuracy": 0.8282770067453384,
139
+ "num_tokens": 172414.0,
140
  "step": 130
141
  },
142
  {
143
+ "entropy": 0.8582165241241455,
144
+ "epoch": 0.00010361635895075114,
145
+ "grad_norm": 0.322265625,
146
  "learning_rate": 0.0001444,
147
+ "loss": 0.8684965133666992,
148
+ "mean_token_accuracy": 0.8188153028488159,
149
+ "num_tokens": 186224.0,
150
  "step": 140
151
  },
152
  {
153
+ "entropy": 0.823002302646637,
154
+ "epoch": 0.00011101752744723338,
155
+ "grad_norm": 0.41796875,
156
  "learning_rate": 0.0001404,
157
+ "loss": 0.8199325561523437,
158
+ "mean_token_accuracy": 0.8285818427801133,
159
+ "num_tokens": 199564.0,
160
  "step": 150
161
  },
162
  {
163
+ "entropy": 0.7803006649017334,
164
+ "epoch": 0.0001184186959437156,
165
+ "grad_norm": 0.28125,
166
  "learning_rate": 0.0001364,
167
+ "loss": 0.8177242279052734,
168
+ "mean_token_accuracy": 0.8276909857988357,
169
+ "num_tokens": 212955.0,
170
  "step": 160
171
  },
172
  {
173
+ "entropy": 0.7576605170965195,
174
+ "epoch": 0.00012581986444019783,
175
+ "grad_norm": 0.298828125,
176
  "learning_rate": 0.00013240000000000002,
177
+ "loss": 0.7334442615509034,
178
+ "mean_token_accuracy": 0.8368929207324982,
179
+ "num_tokens": 225983.0,
180
  "step": 170
181
  },
182
  {
183
+ "entropy": 0.8388681739568711,
184
+ "epoch": 0.00013322103293668004,
185
+ "grad_norm": 4.15625,
186
  "learning_rate": 0.0001284,
187
+ "loss": 0.878928279876709,
188
+ "mean_token_accuracy": 0.8206132620573043,
189
+ "num_tokens": 240490.0,
190
  "step": 180
191
  },
192
  {
193
+ "entropy": 0.8390863686800003,
194
+ "epoch": 0.00014062220143316227,
195
+ "grad_norm": 0.25,
196
  "learning_rate": 0.00012440000000000002,
197
+ "loss": 0.8454230308532715,
198
+ "mean_token_accuracy": 0.8245942384004593,
199
+ "num_tokens": 254696.0,
200
  "step": 190
201
  },
202
  {
203
+ "entropy": 0.8603733956813813,
204
+ "epoch": 0.00014802336992964448,
205
+ "grad_norm": 0.2734375,
206
  "learning_rate": 0.0001204,
207
+ "loss": 0.8759581565856933,
208
+ "mean_token_accuracy": 0.8165332227945328,
209
+ "num_tokens": 269719.0,
210
  "step": 200
211
  }
212
  ],
 
227
  "attributes": {}
228
  }
229
  },
230
+ "total_flos": 1.979803013106893e+16,
231
  "train_batch_size": 8,
232
  "trial_name": null,
233
  "trial_params": null
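
Reviewer note: at the same `global_step`, the new run reports a far smaller `epoch` and fewer `num_tokens` than the old one. A rough sketch of what that implies, using only the step-200 values from the hunks above; the variable names are illustrative, and gradient accumulation and dataset size are not shown in this diff.

```python
# Step-200 values copied from the old (-) and new (+) sides of trainer_state.json.
step = 200
old_run = {"epoch": 0.02020712301086133, "num_tokens": 573002.0}
new_run = {"epoch": 0.00014802336992964448, "num_tokens": 269719.0}


def steps_per_epoch(run, global_step):
    """Estimate optimizer steps in one full epoch from the logged epoch fraction."""
    return global_step / run["epoch"]


old_steps_per_epoch = steps_per_epoch(old_run, step)
new_steps_per_epoch = steps_per_epoch(new_run, step)
epoch_length_ratio = new_steps_per_epoch / old_steps_per_epoch

old_tokens_per_step = old_run["num_tokens"] / step
new_tokens_per_step = new_run["num_tokens"] / step

print(f"old run: ~{old_steps_per_epoch:,.0f} steps per epoch")
print(f"new run: ~{new_steps_per_epoch:,.0f} steps per epoch")
print(f"epoch length ratio (new/old): ~{epoch_length_ratio:.0f}x")
print(f"tokens per step: old ~{old_tokens_per_step:.0f}, new ~{new_tokens_per_step:.0f}")
```

A much longer epoch with roughly half the tokens per step would be consistent with a larger training set of shorter examples, though a different batching setup could produce similar numbers. The new log also shows one `grad_norm` outlier of 4.15625 at step 180, against values below 1 elsewhere.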
checkpoint-300/adapter_config.json CHANGED
@@ -29,13 +29,13 @@
29
  "rank_pattern": {},
30
  "revision": null,
31
  "target_modules": [
32
- "down_proj",
33
  "o_proj",
34
- "k_proj",
35
  "q_proj",
36
- "v_proj",
37
  "gate_proj",
38
- "up_proj"
39
  ],
40
  "target_parameters": null,
41
  "task_type": "CAUSAL_LM",
 
29
  "rank_pattern": {},
30
  "revision": null,
31
  "target_modules": [
32
+ "v_proj",
33
  "o_proj",
34
+ "up_proj",
35
  "q_proj",
36
+ "down_proj",
37
  "gate_proj",
38
+ "k_proj"
39
  ],
40
  "target_parameters": null,
41
  "task_type": "CAUSAL_LM",
checkpoint-300/adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:7cbcf3ca8d5775d18c717b9664d07d0f3dbf4ad047e02f29c45b2aafbe03f792
3
  size 83946192
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:85efefbb46bc43fcbac85541bd7b747e61f87e638522522412f091837cc1b8a1
3
  size 83946192
checkpoint-300/optimizer.pt CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:6fe138a3267a60f56d4593f3a156e8d355dda78fbaf390d33617b908b29bccfb
3
  size 335818315
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0df08cfa7716b9bae6ec43b0a8ef4cb39429b50f55a2ab597adea0596080bac6
3
  size 335818315
checkpoint-300/rng_state.pth CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:298e8a5d47da5232d3f20a30a20c275f9abe3afa14bbb395ac3df2d3ab6f5203
3
  size 14645
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d9d64886c1f6b45c33ffbb3c56e1debba8d8c711e548d1360b58267adb2ccdba
3
  size 14645
checkpoint-300/trainer_state.json CHANGED
@@ -2,7 +2,7 @@
2
  "best_global_step": null,
3
  "best_metric": null,
4
  "best_model_checkpoint": null,
5
- "epoch": 0.030310684516291993,
6
  "eval_steps": 500,
7
  "global_step": 300,
8
  "is_hyper_param_search": false,
@@ -10,303 +10,303 @@
10
  "is_world_process_zero": true,
11
  "log_history": [
12
  {
13
- "entropy": 1.45259268283844,
14
- "epoch": 0.0010103561505430665,
15
- "grad_norm": 0.2451171875,
16
  "learning_rate": 0.0001964,
17
- "loss": 1.7130912780761718,
18
- "mean_token_accuracy": 0.6436435863375664,
19
- "num_tokens": 28151.0,
20
  "step": 10
21
  },
22
  {
23
- "entropy": 1.3634081721305846,
24
- "epoch": 0.002020712301086133,
25
- "grad_norm": 0.484375,
26
  "learning_rate": 0.00019240000000000001,
27
- "loss": 1.3450921058654786,
28
- "mean_token_accuracy": 0.686880823969841,
29
- "num_tokens": 57374.0,
30
  "step": 20
31
  },
32
  {
33
- "entropy": 1.1857430338859558,
34
- "epoch": 0.0030310684516291994,
35
- "grad_norm": 0.322265625,
36
  "learning_rate": 0.0001884,
37
- "loss": 1.205996608734131,
38
- "mean_token_accuracy": 0.706908255815506,
39
- "num_tokens": 86755.0,
40
  "step": 30
41
  },
42
  {
43
- "entropy": 1.156875231862068,
44
- "epoch": 0.004041424602172266,
45
- "grad_norm": 0.359375,
46
  "learning_rate": 0.0001844,
47
- "loss": 1.1667716026306152,
48
- "mean_token_accuracy": 0.7132074117660523,
49
- "num_tokens": 114130.0,
50
  "step": 40
51
  },
52
  {
53
- "entropy": 1.083250343799591,
54
- "epoch": 0.005051780752715332,
55
- "grad_norm": 0.330078125,
56
  "learning_rate": 0.00018040000000000002,
57
- "loss": 1.1047730445861816,
58
- "mean_token_accuracy": 0.7196864277124405,
59
- "num_tokens": 141768.0,
60
  "step": 50
61
  },
62
  {
63
- "entropy": 1.0874410301446915,
64
- "epoch": 0.006062136903258399,
65
- "grad_norm": 0.265625,
66
  "learning_rate": 0.0001764,
67
- "loss": 1.0894905090332032,
68
- "mean_token_accuracy": 0.7210412979125976,
69
- "num_tokens": 172272.0,
70
  "step": 60
71
  },
72
  {
73
- "entropy": 1.1303296595811845,
74
- "epoch": 0.007072493053801465,
75
- "grad_norm": 0.25390625,
76
  "learning_rate": 0.00017240000000000002,
77
- "loss": 1.1445048332214356,
78
- "mean_token_accuracy": 0.7178041815757752,
79
- "num_tokens": 200909.0,
80
  "step": 70
81
  },
82
  {
83
- "entropy": 1.0938484042882919,
84
- "epoch": 0.008082849204344532,
85
- "grad_norm": 0.271484375,
86
  "learning_rate": 0.0001684,
87
- "loss": 1.098367691040039,
88
- "mean_token_accuracy": 0.7246084690093995,
89
- "num_tokens": 228726.0,
90
  "step": 80
91
  },
92
  {
93
- "entropy": 1.0779876083135604,
94
- "epoch": 0.009093205354887599,
95
- "grad_norm": 0.2216796875,
96
  "learning_rate": 0.0001644,
97
- "loss": 1.0803230285644532,
98
- "mean_token_accuracy": 0.720952507853508,
99
- "num_tokens": 255181.0,
100
  "step": 90
101
  },
102
  {
103
- "entropy": 1.1645614862442017,
104
- "epoch": 0.010103561505430665,
105
- "grad_norm": 0.1826171875,
106
  "learning_rate": 0.00016040000000000002,
107
- "loss": 1.147304630279541,
108
- "mean_token_accuracy": 0.7067203193902969,
109
- "num_tokens": 283616.0,
110
  "step": 100
111
  },
112
  {
113
- "entropy": 1.126008078455925,
114
- "epoch": 0.011113917655973731,
115
- "grad_norm": 0.2001953125,
116
  "learning_rate": 0.0001564,
117
- "loss": 1.1415093421936036,
118
- "mean_token_accuracy": 0.7101349741220474,
119
- "num_tokens": 312151.0,
120
  "step": 110
121
  },
122
  {
123
- "entropy": 1.091178685426712,
124
- "epoch": 0.012124273806516797,
125
- "grad_norm": 0.1953125,
126
  "learning_rate": 0.00015240000000000002,
127
- "loss": 1.0913947105407715,
128
- "mean_token_accuracy": 0.7238801747560502,
129
- "num_tokens": 340776.0,
130
  "step": 120
131
  },
132
  {
133
- "entropy": 1.2382428109645844,
134
- "epoch": 0.013134629957059864,
135
- "grad_norm": 0.2099609375,
136
  "learning_rate": 0.0001484,
137
- "loss": 1.2411503791809082,
138
- "mean_token_accuracy": 0.697870621085167,
139
- "num_tokens": 371270.0,
140
  "step": 130
141
  },
142
  {
143
- "entropy": 1.1168828099966048,
144
- "epoch": 0.01414498610760293,
145
- "grad_norm": 0.220703125,
146
  "learning_rate": 0.0001444,
147
- "loss": 1.1341249465942382,
148
- "mean_token_accuracy": 0.7141003280878067,
149
- "num_tokens": 400176.0,
150
  "step": 140
151
  },
152
  {
153
- "entropy": 1.114673560857773,
154
- "epoch": 0.015155342258145996,
155
- "grad_norm": 0.2109375,
156
  "learning_rate": 0.0001404,
157
- "loss": 1.1116752624511719,
158
- "mean_token_accuracy": 0.7234076589345932,
159
- "num_tokens": 427204.0,
160
  "step": 150
161
  },
162
  {
163
- "entropy": 1.1378572463989258,
164
- "epoch": 0.016165698408689064,
165
- "grad_norm": 0.1904296875,
166
  "learning_rate": 0.0001364,
167
- "loss": 1.1589903831481934,
168
- "mean_token_accuracy": 0.7053093910217285,
169
- "num_tokens": 458094.0,
170
  "step": 160
171
  },
172
  {
173
- "entropy": 1.110730269551277,
174
- "epoch": 0.01717605455923213,
175
- "grad_norm": 0.1962890625,
176
  "learning_rate": 0.00013240000000000002,
177
- "loss": 1.087682342529297,
178
- "mean_token_accuracy": 0.7177392661571502,
179
- "num_tokens": 487098.0,
180
  "step": 170
181
  },
182
  {
183
- "entropy": 1.0602406531572341,
184
- "epoch": 0.018186410709775197,
185
- "grad_norm": 0.228515625,
186
  "learning_rate": 0.0001284,
187
- "loss": 1.0950936317443847,
188
- "mean_token_accuracy": 0.7214239358901977,
189
- "num_tokens": 516025.0,
190
  "step": 180
191
  },
192
  {
193
- "entropy": 1.1597254037857057,
194
- "epoch": 0.01919676686031826,
195
- "grad_norm": 0.203125,
196
  "learning_rate": 0.00012440000000000002,
197
- "loss": 1.1208978652954102,
198
- "mean_token_accuracy": 0.7143576145172119,
199
- "num_tokens": 544810.0,
200
  "step": 190
201
  },
202
  {
203
- "entropy": 1.0519475936889648,
204
- "epoch": 0.02020712301086133,
205
- "grad_norm": 0.20703125,
206
  "learning_rate": 0.0001204,
207
- "loss": 1.0744948387145996,
208
- "mean_token_accuracy": 0.718455109000206,
209
- "num_tokens": 573002.0,
210
  "step": 200
211
  },
212
  {
213
- "entropy": 1.2084601551294327,
214
- "epoch": 0.021217479161404394,
215
- "grad_norm": 0.1943359375,
216
  "learning_rate": 0.0001164,
217
- "loss": 1.2174930572509766,
218
- "mean_token_accuracy": 0.6999662011861801,
219
- "num_tokens": 602401.0,
220
  "step": 210
221
  },
222
  {
223
- "entropy": 1.1912338614463807,
224
- "epoch": 0.022227835311947462,
225
- "grad_norm": 0.216796875,
226
  "learning_rate": 0.00011240000000000002,
227
- "loss": 1.183759880065918,
228
- "mean_token_accuracy": 0.7107820093631745,
229
- "num_tokens": 629994.0,
230
  "step": 220
231
  },
232
  {
233
- "entropy": 1.0905429303646088,
234
- "epoch": 0.023238191462490527,
235
- "grad_norm": 0.203125,
236
  "learning_rate": 0.00010840000000000002,
237
- "loss": 1.086796474456787,
238
- "mean_token_accuracy": 0.7176417618989944,
239
- "num_tokens": 658686.0,
240
  "step": 230
241
  },
242
  {
243
- "entropy": 1.0157978028059005,
244
- "epoch": 0.024248547613033595,
245
- "grad_norm": 0.23828125,
246
  "learning_rate": 0.0001044,
247
- "loss": 1.012559700012207,
248
- "mean_token_accuracy": 0.7369582027196884,
249
- "num_tokens": 685539.0,
250
  "step": 240
251
  },
252
  {
253
- "entropy": 1.1027084678411483,
254
- "epoch": 0.02525890376357666,
255
- "grad_norm": 0.2314453125,
256
  "learning_rate": 0.0001004,
257
- "loss": 1.1228812217712403,
258
- "mean_token_accuracy": 0.7192230314016342,
259
- "num_tokens": 715224.0,
260
  "step": 250
261
  },
262
  {
263
- "entropy": 1.0666967660188675,
264
- "epoch": 0.026269259914119727,
265
- "grad_norm": 0.2412109375,
266
  "learning_rate": 9.64e-05,
267
- "loss": 1.0753504753112793,
268
- "mean_token_accuracy": 0.719659861922264,
269
- "num_tokens": 745782.0,
270
  "step": 260
271
  },
272
  {
273
- "entropy": 1.034983891248703,
274
- "epoch": 0.027279616064662792,
275
- "grad_norm": 0.2216796875,
276
  "learning_rate": 9.240000000000001e-05,
277
- "loss": 1.032216453552246,
278
- "mean_token_accuracy": 0.7364529073238373,
279
- "num_tokens": 772358.0,
280
  "step": 270
281
  },
282
  {
283
- "entropy": 1.0821890532970428,
284
- "epoch": 0.02828997221520586,
285
- "grad_norm": 0.2294921875,
286
  "learning_rate": 8.840000000000001e-05,
287
- "loss": 1.0743555068969726,
288
- "mean_token_accuracy": 0.7248906105756759,
289
- "num_tokens": 800173.0,
290
  "step": 280
291
  },
292
  {
293
- "entropy": 1.1281798005104064,
294
- "epoch": 0.029300328365748928,
295
- "grad_norm": 0.2216796875,
296
  "learning_rate": 8.44e-05,
297
- "loss": 1.1471177101135255,
298
- "mean_token_accuracy": 0.7095231086015701,
299
- "num_tokens": 826835.0,
300
  "step": 290
301
  },
302
  {
303
- "entropy": 1.0436414241790772,
304
- "epoch": 0.030310684516291993,
305
- "grad_norm": 0.2001953125,
306
  "learning_rate": 8.04e-05,
307
- "loss": 1.0298893928527832,
308
- "mean_token_accuracy": 0.7325254619121552,
309
- "num_tokens": 853598.0,
310
  "step": 300
311
  }
312
  ],
@@ -327,7 +327,7 @@
327
  "attributes": {}
328
  }
329
  },
330
- "total_flos": 9.026935257700762e+16,
331
  "train_batch_size": 8,
332
  "trial_name": null,
333
  "trial_params": null
 
2
  "best_global_step": null,
3
  "best_metric": null,
4
  "best_model_checkpoint": null,
5
+ "epoch": 0.00022203505489446675,
6
  "eval_steps": 500,
7
  "global_step": 300,
8
  "is_hyper_param_search": false,
 
10
  "is_world_process_zero": true,
11
  "log_history": [
12
  {
13
+ "entropy": 1.4767830133438111,
14
+ "epoch": 7.401168496482225e-06,
15
+ "grad_norm": 0.578125,
16
  "learning_rate": 0.0001964,
17
+ "loss": 1.6877475738525392,
18
+ "mean_token_accuracy": 0.7061349496245384,
19
+ "num_tokens": 14911.0,
20
  "step": 10
21
  },
22
  {
23
+ "entropy": 1.0930627048015595,
24
+ "epoch": 1.480233699296445e-05,
25
+ "grad_norm": 0.458984375,
26
  "learning_rate": 0.00019240000000000001,
27
+ "loss": 1.0562871932983398,
28
+ "mean_token_accuracy": 0.8108290940523147,
29
+ "num_tokens": 28646.0,
30
  "step": 20
31
  },
32
  {
33
+ "entropy": 0.8788679152727127,
34
+ "epoch": 2.2203505489446674e-05,
35
+ "grad_norm": 0.5859375,
36
  "learning_rate": 0.0001884,
37
+ "loss": 0.8974875450134278,
38
+ "mean_token_accuracy": 0.8296987593173981,
39
+ "num_tokens": 41474.0,
40
  "step": 30
41
  },
42
  {
43
+ "entropy": 0.8145956963300705,
44
+ "epoch": 2.96046739859289e-05,
45
+ "grad_norm": 0.39453125,
46
  "learning_rate": 0.0001844,
47
+ "loss": 0.8066701889038086,
48
+ "mean_token_accuracy": 0.8340015441179276,
49
+ "num_tokens": 54466.0,
50
  "step": 40
51
  },
52
  {
53
+ "entropy": 0.7157480388879776,
54
+ "epoch": 3.700584248241112e-05,
55
+ "grad_norm": 0.326171875,
56
  "learning_rate": 0.00018040000000000002,
57
+ "loss": 0.7251500129699707,
58
+ "mean_token_accuracy": 0.8420351594686508,
59
+ "num_tokens": 66880.0,
60
  "step": 50
61
  },
62
  {
63
+ "entropy": 0.7959431439638138,
64
+ "epoch": 4.440701097889335e-05,
65
+ "grad_norm": 0.326171875,
66
  "learning_rate": 0.0001764,
67
+ "loss": 0.8049167633056641,
68
+ "mean_token_accuracy": 0.8289562940597535,
69
+ "num_tokens": 80036.0,
70
  "step": 60
71
  },
72
  {
73
+ "entropy": 0.8342548221349716,
74
+ "epoch": 5.180817947537557e-05,
75
+ "grad_norm": 0.326171875,
76
  "learning_rate": 0.00017240000000000002,
77
+ "loss": 0.8336853981018066,
78
+ "mean_token_accuracy": 0.8279720038175583,
79
+ "num_tokens": 93357.0,
80
  "step": 70
81
  },
82
  {
83
+ "entropy": 0.7970967918634415,
84
+ "epoch": 5.92093479718578e-05,
85
+ "grad_norm": 0.73046875,
86
  "learning_rate": 0.0001684,
87
+ "loss": 0.7949181079864502,
88
+ "mean_token_accuracy": 0.828959608078003,
89
+ "num_tokens": 106951.0,
90
  "step": 80
91
  },
92
  {
93
+ "entropy": 0.7967441529035568,
94
+ "epoch": 6.661051646834002e-05,
95
+ "grad_norm": 0.34375,
96
  "learning_rate": 0.0001644,
97
+ "loss": 0.8285197257995606,
98
+ "mean_token_accuracy": 0.8272027671337128,
99
+ "num_tokens": 120269.0,
100
  "step": 90
101
  },
102
  {
103
+ "entropy": 0.7741447448730469,
104
+ "epoch": 7.401168496482224e-05,
105
+ "grad_norm": 0.271484375,
106
  "learning_rate": 0.00016040000000000002,
107
+ "loss": 0.7636381626129151,
108
+ "mean_token_accuracy": 0.8373189926147461,
109
+ "num_tokens": 133116.0,
110
  "step": 100
111
  },
112
  {
113
+ "entropy": 0.72959463596344,
114
+ "epoch": 8.141285346130448e-05,
115
+ "grad_norm": 0.421875,
116
  "learning_rate": 0.0001564,
117
+ "loss": 0.7404542446136475,
118
+ "mean_token_accuracy": 0.8400259047746659,
119
+ "num_tokens": 146103.0,
120
  "step": 110
121
  },
122
  {
123
+ "entropy": 0.777249938249588,
124
+ "epoch": 8.88140219577867e-05,
125
+ "grad_norm": 0.3984375,
126
  "learning_rate": 0.00015240000000000002,
127
+ "loss": 0.7868029117584229,
128
+ "mean_token_accuracy": 0.8342386931180954,
129
+ "num_tokens": 158980.0,
130
  "step": 120
131
  },
132
  {
133
+ "entropy": 0.8305783897638321,
134
+ "epoch": 9.621519045426892e-05,
135
+ "grad_norm": 0.328125,
136
  "learning_rate": 0.0001484,
137
+ "loss": 0.8155685424804687,
138
+ "mean_token_accuracy": 0.8282770067453384,
139
+ "num_tokens": 172414.0,
140
  "step": 130
141
  },
142
  {
143
+ "entropy": 0.8582165241241455,
144
+ "epoch": 0.00010361635895075114,
145
+ "grad_norm": 0.322265625,
146
  "learning_rate": 0.0001444,
147
+ "loss": 0.8684965133666992,
148
+ "mean_token_accuracy": 0.8188153028488159,
149
+ "num_tokens": 186224.0,
150
  "step": 140
151
  },
152
  {
153
+ "entropy": 0.823002302646637,
154
+ "epoch": 0.00011101752744723338,
155
+ "grad_norm": 0.41796875,
156
  "learning_rate": 0.0001404,
157
+ "loss": 0.8199325561523437,
158
+ "mean_token_accuracy": 0.8285818427801133,
159
+ "num_tokens": 199564.0,
160
  "step": 150
161
  },
162
  {
163
+ "entropy": 0.7803006649017334,
164
+ "epoch": 0.0001184186959437156,
165
+ "grad_norm": 0.28125,
166
  "learning_rate": 0.0001364,
167
+ "loss": 0.8177242279052734,
168
+ "mean_token_accuracy": 0.8276909857988357,
169
+ "num_tokens": 212955.0,
170
  "step": 160
171
  },
172
  {
173
+ "entropy": 0.7576605170965195,
174
+ "epoch": 0.00012581986444019783,
175
+ "grad_norm": 0.298828125,
176
  "learning_rate": 0.00013240000000000002,
177
+ "loss": 0.7334442615509034,
178
+ "mean_token_accuracy": 0.8368929207324982,
179
+ "num_tokens": 225983.0,
180
  "step": 170
181
  },
182
  {
183
+ "entropy": 0.8388681739568711,
184
+ "epoch": 0.00013322103293668004,
185
+ "grad_norm": 4.15625,
186
  "learning_rate": 0.0001284,
187
+ "loss": 0.878928279876709,
188
+ "mean_token_accuracy": 0.8206132620573043,
189
+ "num_tokens": 240490.0,
190
  "step": 180
191
  },
192
  {
193
+ "entropy": 0.8390863686800003,
194
+ "epoch": 0.00014062220143316227,
195
+ "grad_norm": 0.25,
196
  "learning_rate": 0.00012440000000000002,
197
+ "loss": 0.8454230308532715,
198
+ "mean_token_accuracy": 0.8245942384004593,
199
+ "num_tokens": 254696.0,
200
  "step": 190
201
  },
202
  {
203
+ "entropy": 0.8603733956813813,
204
+ "epoch": 0.00014802336992964448,
205
+ "grad_norm": 0.2734375,
206
  "learning_rate": 0.0001204,
207
+ "loss": 0.8759581565856933,
208
+ "mean_token_accuracy": 0.8165332227945328,
209
+ "num_tokens": 269719.0,
210
  "step": 200
211
  },
212
  {
213
+ "entropy": 0.76580231487751,
214
+ "epoch": 0.00015542453842612672,
215
+ "grad_norm": 0.240234375,
216
  "learning_rate": 0.0001164,
217
+ "loss": 0.7616221904754639,
218
+ "mean_token_accuracy": 0.8392421275377273,
219
+ "num_tokens": 282621.0,
220
  "step": 210
221
  },
222
  {
223
+ "entropy": 0.7803073287010193,
224
+ "epoch": 0.00016282570692260895,
225
+ "grad_norm": 0.341796875,
226
  "learning_rate": 0.00011240000000000002,
227
+ "loss": 0.7809097766876221,
228
+ "mean_token_accuracy": 0.8302495568990708,
229
+ "num_tokens": 295624.0,
230
  "step": 220
231
  },
232
  {
233
+ "entropy": 0.7702126175165176,
234
+ "epoch": 0.00017022687541909116,
235
+ "grad_norm": 0.251953125,
236
  "learning_rate": 0.00010840000000000002,
237
+ "loss": 0.7757031917572021,
238
+ "mean_token_accuracy": 0.8389965564012527,
239
+ "num_tokens": 308856.0,
240
  "step": 230
241
  },
242
  {
243
+ "entropy": 0.8611143410205842,
244
+ "epoch": 0.0001776280439155734,
245
+ "grad_norm": 0.337890625,
246
  "learning_rate": 0.0001044,
247
+ "loss": 0.8744688034057617,
248
+ "mean_token_accuracy": 0.8146604359149933,
249
+ "num_tokens": 322610.0,
250
  "step": 240
251
  },
252
  {
253
+ "entropy": 0.8659275263547898,
254
+ "epoch": 0.0001850292124120556,
255
+ "grad_norm": 0.326171875,
256
  "learning_rate": 0.0001004,
257
+ "loss": 0.8652327537536622,
258
+ "mean_token_accuracy": 0.819416218996048,
259
+ "num_tokens": 336948.0,
260
  "step": 250
261
  },
262
  {
263
+ "entropy": 0.768859726190567,
264
+ "epoch": 0.00019243038090853784,
265
+ "grad_norm": 0.2890625,
266
  "learning_rate": 9.64e-05,
267
+ "loss": 0.7651469707489014,
268
+ "mean_token_accuracy": 0.8339170336723327,
269
+ "num_tokens": 350252.0,
270
  "step": 260
271
  },
272
  {
273
+ "entropy": 0.8208303570747375,
274
+ "epoch": 0.00019983154940502007,
275
+ "grad_norm": 0.296875,
276
  "learning_rate": 9.240000000000001e-05,
277
+ "loss": 0.8234204292297364,
278
+ "mean_token_accuracy": 0.8225490599870682,
279
+ "num_tokens": 364325.0,
280
  "step": 270
281
  },
282
  {
283
+ "entropy": 0.7798860669136047,
284
+ "epoch": 0.00020723271790150228,
285
+ "grad_norm": 0.3046875,
286
  "learning_rate": 8.840000000000001e-05,
287
+ "loss": 0.7923468112945556,
288
+ "mean_token_accuracy": 0.831676983833313,
289
+ "num_tokens": 378088.0,
290
  "step": 280
291
  },
292
  {
293
+ "entropy": 0.7306642323732376,
294
+ "epoch": 0.00021463388639798452,
295
+ "grad_norm": 0.279296875,
296
  "learning_rate": 8.44e-05,
297
+ "loss": 0.7504455089569092,
298
+ "mean_token_accuracy": 0.8410079121589661,
299
+ "num_tokens": 391023.0,
300
  "step": 290
301
  },
302
  {
303
+ "entropy": 0.8291689246892929,
304
+ "epoch": 0.00022203505489446675,
305
+ "grad_norm": 0.24609375,
306
  "learning_rate": 8.04e-05,
307
+ "loss": 0.8151634216308594,
308
+ "mean_token_accuracy": 0.8278465926647186,
309
+ "num_tokens": 404816.0,
310
  "step": 300
311
  }
312
  ],
 
327
  "attributes": {}
328
  }
329
  },
330
+ "total_flos": 2.9772936498315264e+16,
331
  "train_batch_size": 8,
332
  "trial_name": null,
333
  "trial_params": null
checkpoint-400/adapter_config.json CHANGED
@@ -29,13 +29,13 @@
29
  "rank_pattern": {},
30
  "revision": null,
31
  "target_modules": [
32
- "down_proj",
33
  "o_proj",
34
- "k_proj",
35
  "q_proj",
36
- "v_proj",
37
  "gate_proj",
38
- "up_proj"
39
  ],
40
  "target_parameters": null,
41
  "task_type": "CAUSAL_LM",
 
29
  "rank_pattern": {},
30
  "revision": null,
31
  "target_modules": [
32
+ "v_proj",
33
  "o_proj",
34
+ "up_proj",
35
  "q_proj",
36
+ "down_proj",
37
  "gate_proj",
38
+ "k_proj"
39
  ],
40
  "target_parameters": null,
41
  "task_type": "CAUSAL_LM",
checkpoint-400/adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ed830c28dd9009fd9f8f9ef49fadbe801b9643637319560ff6802f35368f57f7
3
  size 83946192
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:68c2068c6a06bf1f5a8372f814dfda631696defd4498d00cb745365c4084a9ac
3
  size 83946192
checkpoint-400/optimizer.pt CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:a29a5e86755e5f4d7bf3569dfbccdee1d4b290afa30db3b8c00266514c2b8248
3
  size 335818315
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a4a8c480eb6a45d018a29e9fc497907ce5f99ce120d183d6cce629c5abcfe3ba
3
  size 335818315
checkpoint-400/rng_state.pth CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:edf3e1d34f77115ba655f936fb1927d096562299f766f37a65033d66f88d36c4
3
  size 14645
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9ff51dafa0029f4a138f8ab74c1e295d16db5f4d42cfafa0fd3ace6ee42e94f3
3
  size 14645
checkpoint-400/trainer_state.json CHANGED
@@ -2,7 +2,7 @@
2
  "best_global_step": null,
3
  "best_metric": null,
4
  "best_model_checkpoint": null,
5
- "epoch": 0.04041424602172266,
6
  "eval_steps": 500,
7
  "global_step": 400,
8
  "is_hyper_param_search": false,
@@ -10,403 +10,403 @@
10
  "is_world_process_zero": true,
11
  "log_history": [
12
  {
13
- "entropy": 1.45259268283844,
14
- "epoch": 0.0010103561505430665,
15
- "grad_norm": 0.2451171875,
16
  "learning_rate": 0.0001964,
17
- "loss": 1.7130912780761718,
18
- "mean_token_accuracy": 0.6436435863375664,
19
- "num_tokens": 28151.0,
20
  "step": 10
21
  },
22
  {
23
- "entropy": 1.3634081721305846,
24
- "epoch": 0.002020712301086133,
25
- "grad_norm": 0.484375,
26
  "learning_rate": 0.00019240000000000001,
27
- "loss": 1.3450921058654786,
28
- "mean_token_accuracy": 0.686880823969841,
29
- "num_tokens": 57374.0,
30
  "step": 20
31
  },
32
  {
33
- "entropy": 1.1857430338859558,
34
- "epoch": 0.0030310684516291994,
35
- "grad_norm": 0.322265625,
36
  "learning_rate": 0.0001884,
37
- "loss": 1.205996608734131,
38
- "mean_token_accuracy": 0.706908255815506,
39
- "num_tokens": 86755.0,
40
  "step": 30
41
  },
42
  {
43
- "entropy": 1.156875231862068,
44
- "epoch": 0.004041424602172266,
45
- "grad_norm": 0.359375,
46
  "learning_rate": 0.0001844,
47
- "loss": 1.1667716026306152,
48
- "mean_token_accuracy": 0.7132074117660523,
49
- "num_tokens": 114130.0,
50
  "step": 40
51
  },
52
  {
53
- "entropy": 1.083250343799591,
54
- "epoch": 0.005051780752715332,
55
- "grad_norm": 0.330078125,
56
  "learning_rate": 0.00018040000000000002,
57
- "loss": 1.1047730445861816,
58
- "mean_token_accuracy": 0.7196864277124405,
59
- "num_tokens": 141768.0,
60
  "step": 50
61
  },
62
  {
63
- "entropy": 1.0874410301446915,
64
- "epoch": 0.006062136903258399,
65
- "grad_norm": 0.265625,
66
  "learning_rate": 0.0001764,
67
- "loss": 1.0894905090332032,
68
- "mean_token_accuracy": 0.7210412979125976,
69
- "num_tokens": 172272.0,
70
  "step": 60
71
  },
72
  {
73
- "entropy": 1.1303296595811845,
74
- "epoch": 0.007072493053801465,
75
- "grad_norm": 0.25390625,
76
  "learning_rate": 0.00017240000000000002,
77
- "loss": 1.1445048332214356,
78
- "mean_token_accuracy": 0.7178041815757752,
79
- "num_tokens": 200909.0,
80
  "step": 70
81
  },
82
  {
83
- "entropy": 1.0938484042882919,
84
- "epoch": 0.008082849204344532,
85
- "grad_norm": 0.271484375,
86
  "learning_rate": 0.0001684,
87
- "loss": 1.098367691040039,
88
- "mean_token_accuracy": 0.7246084690093995,
89
- "num_tokens": 228726.0,
90
  "step": 80
91
  },
92
  {
93
- "entropy": 1.0779876083135604,
94
- "epoch": 0.009093205354887599,
95
- "grad_norm": 0.2216796875,
96
  "learning_rate": 0.0001644,
97
- "loss": 1.0803230285644532,
98
- "mean_token_accuracy": 0.720952507853508,
99
- "num_tokens": 255181.0,
100
  "step": 90
101
  },
102
  {
103
- "entropy": 1.1645614862442017,
104
- "epoch": 0.010103561505430665,
105
- "grad_norm": 0.1826171875,
106
  "learning_rate": 0.00016040000000000002,
107
- "loss": 1.147304630279541,
108
- "mean_token_accuracy": 0.7067203193902969,
109
- "num_tokens": 283616.0,
110
  "step": 100
111
  },
112
  {
113
- "entropy": 1.126008078455925,
114
- "epoch": 0.011113917655973731,
115
- "grad_norm": 0.2001953125,
116
  "learning_rate": 0.0001564,
117
- "loss": 1.1415093421936036,
118
- "mean_token_accuracy": 0.7101349741220474,
119
- "num_tokens": 312151.0,
120
  "step": 110
121
  },
122
  {
123
- "entropy": 1.091178685426712,
124
- "epoch": 0.012124273806516797,
125
- "grad_norm": 0.1953125,
126
  "learning_rate": 0.00015240000000000002,
127
- "loss": 1.0913947105407715,
128
- "mean_token_accuracy": 0.7238801747560502,
129
- "num_tokens": 340776.0,
130
  "step": 120
131
  },
132
  {
133
- "entropy": 1.2382428109645844,
134
- "epoch": 0.013134629957059864,
135
- "grad_norm": 0.2099609375,
136
  "learning_rate": 0.0001484,
137
- "loss": 1.2411503791809082,
138
- "mean_token_accuracy": 0.697870621085167,
139
- "num_tokens": 371270.0,
140
  "step": 130
141
  },
142
  {
143
- "entropy": 1.1168828099966048,
144
- "epoch": 0.01414498610760293,
145
- "grad_norm": 0.220703125,
146
  "learning_rate": 0.0001444,
147
- "loss": 1.1341249465942382,
148
- "mean_token_accuracy": 0.7141003280878067,
149
- "num_tokens": 400176.0,
150
  "step": 140
151
  },
152
  {
153
- "entropy": 1.114673560857773,
154
- "epoch": 0.015155342258145996,
155
- "grad_norm": 0.2109375,
156
  "learning_rate": 0.0001404,
157
- "loss": 1.1116752624511719,
158
- "mean_token_accuracy": 0.7234076589345932,
159
- "num_tokens": 427204.0,
160
  "step": 150
161
  },
162
  {
163
- "entropy": 1.1378572463989258,
164
- "epoch": 0.016165698408689064,
165
- "grad_norm": 0.1904296875,
166
  "learning_rate": 0.0001364,
167
- "loss": 1.1589903831481934,
168
- "mean_token_accuracy": 0.7053093910217285,
169
- "num_tokens": 458094.0,
170
  "step": 160
171
  },
172
  {
173
- "entropy": 1.110730269551277,
174
- "epoch": 0.01717605455923213,
175
- "grad_norm": 0.1962890625,
176
  "learning_rate": 0.00013240000000000002,
177
- "loss": 1.087682342529297,
178
- "mean_token_accuracy": 0.7177392661571502,
179
- "num_tokens": 487098.0,
180
  "step": 170
181
  },
182
  {
183
- "entropy": 1.0602406531572341,
184
- "epoch": 0.018186410709775197,
185
- "grad_norm": 0.228515625,
186
  "learning_rate": 0.0001284,
187
- "loss": 1.0950936317443847,
188
- "mean_token_accuracy": 0.7214239358901977,
189
- "num_tokens": 516025.0,
190
  "step": 180
191
  },
192
  {
193
- "entropy": 1.1597254037857057,
194
- "epoch": 0.01919676686031826,
195
- "grad_norm": 0.203125,
196
  "learning_rate": 0.00012440000000000002,
197
- "loss": 1.1208978652954102,
198
- "mean_token_accuracy": 0.7143576145172119,
199
- "num_tokens": 544810.0,
200
  "step": 190
201
  },
202
  {
203
- "entropy": 1.0519475936889648,
204
- "epoch": 0.02020712301086133,
205
- "grad_norm": 0.20703125,
206
  "learning_rate": 0.0001204,
207
- "loss": 1.0744948387145996,
208
- "mean_token_accuracy": 0.718455109000206,
209
- "num_tokens": 573002.0,
210
  "step": 200
211
  },
212
  {
213
- "entropy": 1.2084601551294327,
214
- "epoch": 0.021217479161404394,
215
- "grad_norm": 0.1943359375,
216
  "learning_rate": 0.0001164,
217
- "loss": 1.2174930572509766,
218
- "mean_token_accuracy": 0.6999662011861801,
219
- "num_tokens": 602401.0,
220
  "step": 210
221
  },
222
  {
223
- "entropy": 1.1912338614463807,
224
- "epoch": 0.022227835311947462,
225
- "grad_norm": 0.216796875,
226
  "learning_rate": 0.00011240000000000002,
227
- "loss": 1.183759880065918,
228
- "mean_token_accuracy": 0.7107820093631745,
229
- "num_tokens": 629994.0,
230
  "step": 220
231
  },
232
  {
233
- "entropy": 1.0905429303646088,
234
- "epoch": 0.023238191462490527,
235
- "grad_norm": 0.203125,
236
  "learning_rate": 0.00010840000000000002,
237
- "loss": 1.086796474456787,
238
- "mean_token_accuracy": 0.7176417618989944,
239
- "num_tokens": 658686.0,
240
  "step": 230
241
  },
242
  {
243
- "entropy": 1.0157978028059005,
244
- "epoch": 0.024248547613033595,
245
- "grad_norm": 0.23828125,
246
  "learning_rate": 0.0001044,
247
- "loss": 1.012559700012207,
248
- "mean_token_accuracy": 0.7369582027196884,
249
- "num_tokens": 685539.0,
250
  "step": 240
251
  },
252
  {
253
- "entropy": 1.1027084678411483,
254
- "epoch": 0.02525890376357666,
255
- "grad_norm": 0.2314453125,
256
  "learning_rate": 0.0001004,
257
- "loss": 1.1228812217712403,
258
- "mean_token_accuracy": 0.7192230314016342,
259
- "num_tokens": 715224.0,
260
  "step": 250
261
  },
262
  {
263
- "entropy": 1.0666967660188675,
264
- "epoch": 0.026269259914119727,
265
- "grad_norm": 0.2412109375,
266
  "learning_rate": 9.64e-05,
267
- "loss": 1.0753504753112793,
268
- "mean_token_accuracy": 0.719659861922264,
269
- "num_tokens": 745782.0,
270
  "step": 260
271
  },
272
  {
273
- "entropy": 1.034983891248703,
274
- "epoch": 0.027279616064662792,
275
- "grad_norm": 0.2216796875,
276
  "learning_rate": 9.240000000000001e-05,
277
- "loss": 1.032216453552246,
278
- "mean_token_accuracy": 0.7364529073238373,
279
- "num_tokens": 772358.0,
280
  "step": 270
281
  },
282
  {
283
- "entropy": 1.0821890532970428,
284
- "epoch": 0.02828997221520586,
285
- "grad_norm": 0.2294921875,
286
  "learning_rate": 8.840000000000001e-05,
287
- "loss": 1.0743555068969726,
288
- "mean_token_accuracy": 0.7248906105756759,
289
- "num_tokens": 800173.0,
290
  "step": 280
291
  },
292
  {
293
- "entropy": 1.1281798005104064,
294
- "epoch": 0.029300328365748928,
295
- "grad_norm": 0.2216796875,
296
  "learning_rate": 8.44e-05,
297
- "loss": 1.1471177101135255,
298
- "mean_token_accuracy": 0.7095231086015701,
299
- "num_tokens": 826835.0,
300
  "step": 290
301
  },
302
  {
303
- "entropy": 1.0436414241790772,
304
- "epoch": 0.030310684516291993,
305
- "grad_norm": 0.2001953125,
306
  "learning_rate": 8.04e-05,
307
- "loss": 1.0298893928527832,
308
- "mean_token_accuracy": 0.7325254619121552,
309
- "num_tokens": 853598.0,
310
  "step": 300
311
  },
312
  {
313
- "entropy": 1.0947031527757645,
314
- "epoch": 0.03132104066683506,
315
- "grad_norm": 0.232421875,
316
  "learning_rate": 7.64e-05,
317
- "loss": 1.0945837020874023,
318
- "mean_token_accuracy": 0.7231394708156585,
319
- "num_tokens": 884023.0,
320
  "step": 310
321
  },
322
  {
323
- "entropy": 1.1012510120868684,
324
- "epoch": 0.03233139681737813,
325
- "grad_norm": 0.17578125,
326
  "learning_rate": 7.24e-05,
327
- "loss": 1.106449031829834,
328
- "mean_token_accuracy": 0.7150771111249924,
329
- "num_tokens": 910709.0,
330
  "step": 320
331
  },
332
  {
333
- "entropy": 1.0871451586484908,
334
- "epoch": 0.03334175296792119,
335
- "grad_norm": 0.2041015625,
336
  "learning_rate": 6.840000000000001e-05,
337
- "loss": 1.0799496650695801,
338
- "mean_token_accuracy": 0.7183821439743042,
339
- "num_tokens": 938961.0,
340
  "step": 330
341
  },
342
  {
343
- "entropy": 1.0514528155326843,
344
- "epoch": 0.03435210911846426,
345
- "grad_norm": 0.2080078125,
346
  "learning_rate": 6.440000000000001e-05,
347
- "loss": 1.081194019317627,
348
- "mean_token_accuracy": 0.7243827939033508,
349
- "num_tokens": 967584.0,
350
  "step": 340
351
  },
352
  {
353
- "entropy": 1.1092546790838242,
354
- "epoch": 0.035362465269007326,
355
- "grad_norm": 0.201171875,
356
  "learning_rate": 6.04e-05,
357
- "loss": 1.0813716888427733,
358
- "mean_token_accuracy": 0.7228824734687805,
359
- "num_tokens": 995895.0,
360
  "step": 350
361
  },
362
  {
363
- "entropy": 1.0287963569164276,
364
- "epoch": 0.036372821419550394,
365
- "grad_norm": 0.265625,
366
  "learning_rate": 5.6399999999999995e-05,
367
- "loss": 1.0419590950012207,
368
- "mean_token_accuracy": 0.7281966924667358,
369
- "num_tokens": 1024358.0,
370
  "step": 360
371
  },
372
  {
373
- "entropy": 1.121953997015953,
374
- "epoch": 0.037383177570093455,
375
- "grad_norm": 0.26171875,
376
  "learning_rate": 5.2400000000000007e-05,
377
- "loss": 1.0987840652465821,
378
- "mean_token_accuracy": 0.7203631848096848,
379
- "num_tokens": 1052323.0,
380
  "step": 370
381
  },
382
  {
383
- "entropy": 1.110439071059227,
384
- "epoch": 0.03839353372063652,
385
- "grad_norm": 0.291015625,
386
  "learning_rate": 4.8400000000000004e-05,
387
- "loss": 1.0895070075988769,
388
- "mean_token_accuracy": 0.7241304695606232,
389
- "num_tokens": 1079594.0,
390
  "step": 380
391
  },
392
  {
393
- "entropy": 1.070469456911087,
394
- "epoch": 0.03940388987117959,
395
- "grad_norm": 0.3046875,
396
  "learning_rate": 4.44e-05,
397
- "loss": 1.1078936576843261,
398
- "mean_token_accuracy": 0.7139627873897553,
399
- "num_tokens": 1110202.0,
400
  "step": 390
401
  },
402
  {
403
- "entropy": 1.1429108887910844,
404
- "epoch": 0.04041424602172266,
405
- "grad_norm": 0.265625,
406
  "learning_rate": 4.0400000000000006e-05,
407
- "loss": 1.1321399688720704,
408
- "mean_token_accuracy": 0.7121616303920746,
409
- "num_tokens": 1138357.0,
410
  "step": 400
411
  }
412
  ],
@@ -427,7 +427,7 @@
427
  "attributes": {}
428
  }
429
  },
430
- "total_flos": 1.1975538735238349e+17,
431
  "train_batch_size": 8,
432
  "trial_name": null,
433
  "trial_params": null
 
2
  "best_global_step": null,
3
  "best_metric": null,
4
  "best_model_checkpoint": null,
5
+ "epoch": 0.00029604673985928896,
6
  "eval_steps": 500,
7
  "global_step": 400,
8
  "is_hyper_param_search": false,
 
10
  "is_world_process_zero": true,
11
  "log_history": [
12
  {
13
+ "entropy": 1.4767830133438111,
14
+ "epoch": 7.401168496482225e-06,
15
+ "grad_norm": 0.578125,
16
  "learning_rate": 0.0001964,
17
+ "loss": 1.6877475738525392,
18
+ "mean_token_accuracy": 0.7061349496245384,
19
+ "num_tokens": 14911.0,
20
  "step": 10
21
  },
22
  {
23
+ "entropy": 1.0930627048015595,
24
+ "epoch": 1.480233699296445e-05,
25
+ "grad_norm": 0.458984375,
26
  "learning_rate": 0.00019240000000000001,
27
+ "loss": 1.0562871932983398,
28
+ "mean_token_accuracy": 0.8108290940523147,
29
+ "num_tokens": 28646.0,
30
  "step": 20
31
  },
32
  {
33
+ "entropy": 0.8788679152727127,
34
+ "epoch": 2.2203505489446674e-05,
35
+ "grad_norm": 0.5859375,
36
  "learning_rate": 0.0001884,
37
+ "loss": 0.8974875450134278,
38
+ "mean_token_accuracy": 0.8296987593173981,
39
+ "num_tokens": 41474.0,
40
  "step": 30
41
  },
42
  {
43
+ "entropy": 0.8145956963300705,
44
+ "epoch": 2.96046739859289e-05,
45
+ "grad_norm": 0.39453125,
46
  "learning_rate": 0.0001844,
47
+ "loss": 0.8066701889038086,
48
+ "mean_token_accuracy": 0.8340015441179276,
49
+ "num_tokens": 54466.0,
50
  "step": 40
51
  },
52
  {
53
+ "entropy": 0.7157480388879776,
54
+ "epoch": 3.700584248241112e-05,
55
+ "grad_norm": 0.326171875,
56
  "learning_rate": 0.00018040000000000002,
57
+ "loss": 0.7251500129699707,
58
+ "mean_token_accuracy": 0.8420351594686508,
59
+ "num_tokens": 66880.0,
60
  "step": 50
61
  },
62
  {
63
+ "entropy": 0.7959431439638138,
64
+ "epoch": 4.440701097889335e-05,
65
+ "grad_norm": 0.326171875,
66
  "learning_rate": 0.0001764,
67
+ "loss": 0.8049167633056641,
68
+ "mean_token_accuracy": 0.8289562940597535,
69
+ "num_tokens": 80036.0,
70
  "step": 60
71
  },
72
  {
73
+ "entropy": 0.8342548221349716,
74
+ "epoch": 5.180817947537557e-05,
75
+ "grad_norm": 0.326171875,
76
  "learning_rate": 0.00017240000000000002,
77
+ "loss": 0.8336853981018066,
78
+ "mean_token_accuracy": 0.8279720038175583,
79
+ "num_tokens": 93357.0,
80
  "step": 70
81
  },
82
  {
83
+ "entropy": 0.7970967918634415,
84
+ "epoch": 5.92093479718578e-05,
85
+ "grad_norm": 0.73046875,
86
  "learning_rate": 0.0001684,
87
+ "loss": 0.7949181079864502,
88
+ "mean_token_accuracy": 0.828959608078003,
89
+ "num_tokens": 106951.0,
90
  "step": 80
91
  },
92
  {
93
+ "entropy": 0.7967441529035568,
94
+ "epoch": 6.661051646834002e-05,
95
+ "grad_norm": 0.34375,
96
  "learning_rate": 0.0001644,
97
+ "loss": 0.8285197257995606,
98
+ "mean_token_accuracy": 0.8272027671337128,
99
+ "num_tokens": 120269.0,
100
  "step": 90
101
  },
102
  {
103
+ "entropy": 0.7741447448730469,
104
+ "epoch": 7.401168496482224e-05,
105
+ "grad_norm": 0.271484375,
106
  "learning_rate": 0.00016040000000000002,
107
+ "loss": 0.7636381626129151,
108
+ "mean_token_accuracy": 0.8373189926147461,
109
+ "num_tokens": 133116.0,
110
  "step": 100
111
  },
112
  {
113
+ "entropy": 0.72959463596344,
114
+ "epoch": 8.141285346130448e-05,
115
+ "grad_norm": 0.421875,
116
  "learning_rate": 0.0001564,
117
+ "loss": 0.7404542446136475,
118
+ "mean_token_accuracy": 0.8400259047746659,
119
+ "num_tokens": 146103.0,
120
  "step": 110
121
  },
122
  {
123
+ "entropy": 0.777249938249588,
124
+ "epoch": 8.88140219577867e-05,
125
+ "grad_norm": 0.3984375,
126
  "learning_rate": 0.00015240000000000002,
127
+ "loss": 0.7868029117584229,
128
+ "mean_token_accuracy": 0.8342386931180954,
129
+ "num_tokens": 158980.0,
130
  "step": 120
131
  },
132
  {
133
+ "entropy": 0.8305783897638321,
134
+ "epoch": 9.621519045426892e-05,
135
+ "grad_norm": 0.328125,
136
  "learning_rate": 0.0001484,
137
+ "loss": 0.8155685424804687,
138
+ "mean_token_accuracy": 0.8282770067453384,
139
+ "num_tokens": 172414.0,
140
  "step": 130
141
  },
142
  {
143
+ "entropy": 0.8582165241241455,
144
+ "epoch": 0.00010361635895075114,
145
+ "grad_norm": 0.322265625,
146
  "learning_rate": 0.0001444,
147
+ "loss": 0.8684965133666992,
148
+ "mean_token_accuracy": 0.8188153028488159,
149
+ "num_tokens": 186224.0,
150
  "step": 140
151
  },
152
  {
153
+ "entropy": 0.823002302646637,
154
+ "epoch": 0.00011101752744723338,
155
+ "grad_norm": 0.41796875,
156
  "learning_rate": 0.0001404,
157
+ "loss": 0.8199325561523437,
158
+ "mean_token_accuracy": 0.8285818427801133,
159
+ "num_tokens": 199564.0,
160
  "step": 150
161
  },
162
  {
163
+ "entropy": 0.7803006649017334,
164
+ "epoch": 0.0001184186959437156,
165
+ "grad_norm": 0.28125,
166
  "learning_rate": 0.0001364,
167
+ "loss": 0.8177242279052734,
168
+ "mean_token_accuracy": 0.8276909857988357,
169
+ "num_tokens": 212955.0,
170
  "step": 160
171
  },
172
  {
173
+ "entropy": 0.7576605170965195,
174
+ "epoch": 0.00012581986444019783,
175
+ "grad_norm": 0.298828125,
176
  "learning_rate": 0.00013240000000000002,
177
+ "loss": 0.7334442615509034,
178
+ "mean_token_accuracy": 0.8368929207324982,
179
+ "num_tokens": 225983.0,
180
  "step": 170
181
  },
182
  {
183
+ "entropy": 0.8388681739568711,
184
+ "epoch": 0.00013322103293668004,
185
+ "grad_norm": 4.15625,
186
  "learning_rate": 0.0001284,
187
+ "loss": 0.878928279876709,
188
+ "mean_token_accuracy": 0.8206132620573043,
189
+ "num_tokens": 240490.0,
190
  "step": 180
191
  },
192
  {
193
+ "entropy": 0.8390863686800003,
194
+ "epoch": 0.00014062220143316227,
195
+ "grad_norm": 0.25,
196
  "learning_rate": 0.00012440000000000002,
197
+ "loss": 0.8454230308532715,
198
+ "mean_token_accuracy": 0.8245942384004593,
199
+ "num_tokens": 254696.0,
200
  "step": 190
201
  },
202
  {
203
+ "entropy": 0.8603733956813813,
204
+ "epoch": 0.00014802336992964448,
205
+ "grad_norm": 0.2734375,
206
  "learning_rate": 0.0001204,
207
+ "loss": 0.8759581565856933,
208
+ "mean_token_accuracy": 0.8165332227945328,
209
+ "num_tokens": 269719.0,
210
  "step": 200
211
  },
212
  {
213
+ "entropy": 0.76580231487751,
214
+ "epoch": 0.00015542453842612672,
215
+ "grad_norm": 0.240234375,
216
  "learning_rate": 0.0001164,
217
+ "loss": 0.7616221904754639,
218
+ "mean_token_accuracy": 0.8392421275377273,
219
+ "num_tokens": 282621.0,
220
  "step": 210
221
  },
222
  {
223
+ "entropy": 0.7803073287010193,
224
+ "epoch": 0.00016282570692260895,
225
+ "grad_norm": 0.341796875,
226
  "learning_rate": 0.00011240000000000002,
227
+ "loss": 0.7809097766876221,
228
+ "mean_token_accuracy": 0.8302495568990708,
229
+ "num_tokens": 295624.0,
230
  "step": 220
231
  },
232
  {
233
+ "entropy": 0.7702126175165176,
234
+ "epoch": 0.00017022687541909116,
235
+ "grad_norm": 0.251953125,
236
  "learning_rate": 0.00010840000000000002,
237
+ "loss": 0.7757031917572021,
238
+ "mean_token_accuracy": 0.8389965564012527,
239
+ "num_tokens": 308856.0,
240
  "step": 230
241
  },
242
  {
243
+ "entropy": 0.8611143410205842,
244
+ "epoch": 0.0001776280439155734,
245
+ "grad_norm": 0.337890625,
246
  "learning_rate": 0.0001044,
247
+ "loss": 0.8744688034057617,
248
+ "mean_token_accuracy": 0.8146604359149933,
249
+ "num_tokens": 322610.0,
250
  "step": 240
251
  },
252
  {
253
+ "entropy": 0.8659275263547898,
254
+ "epoch": 0.0001850292124120556,
255
+ "grad_norm": 0.326171875,
256
  "learning_rate": 0.0001004,
257
+ "loss": 0.8652327537536622,
258
+ "mean_token_accuracy": 0.819416218996048,
259
+ "num_tokens": 336948.0,
260
  "step": 250
261
  },
262
  {
263
+ "entropy": 0.768859726190567,
264
+ "epoch": 0.00019243038090853784,
265
+ "grad_norm": 0.2890625,
266
  "learning_rate": 9.64e-05,
267
+ "loss": 0.7651469707489014,
268
+ "mean_token_accuracy": 0.8339170336723327,
269
+ "num_tokens": 350252.0,
270
  "step": 260
271
  },
272
  {
273
+ "entropy": 0.8208303570747375,
274
+ "epoch": 0.00019983154940502007,
275
+ "grad_norm": 0.296875,
276
  "learning_rate": 9.240000000000001e-05,
277
+ "loss": 0.8234204292297364,
278
+ "mean_token_accuracy": 0.8225490599870682,
279
+ "num_tokens": 364325.0,
280
  "step": 270
281
  },
282
  {
283
+ "entropy": 0.7798860669136047,
284
+ "epoch": 0.00020723271790150228,
285
+ "grad_norm": 0.3046875,
286
  "learning_rate": 8.840000000000001e-05,
287
+ "loss": 0.7923468112945556,
288
+ "mean_token_accuracy": 0.831676983833313,
289
+ "num_tokens": 378088.0,
290
  "step": 280
291
  },
292
  {
293
+ "entropy": 0.7306642323732376,
294
+ "epoch": 0.00021463388639798452,
295
+ "grad_norm": 0.279296875,
296
  "learning_rate": 8.44e-05,
297
+ "loss": 0.7504455089569092,
298
+ "mean_token_accuracy": 0.8410079121589661,
299
+ "num_tokens": 391023.0,
300
  "step": 290
301
  },
302
  {
303
+ "entropy": 0.8291689246892929,
304
+ "epoch": 0.00022203505489446675,
305
+ "grad_norm": 0.24609375,
306
  "learning_rate": 8.04e-05,
307
+ "loss": 0.8151634216308594,
308
+ "mean_token_accuracy": 0.8278465926647186,
309
+ "num_tokens": 404816.0,
310
  "step": 300
311
  },
312
  {
313
+ "entropy": 0.7772005677223206,
314
+ "epoch": 0.00022943622339094896,
315
+ "grad_norm": 0.326171875,
316
  "learning_rate": 7.64e-05,
317
+ "loss": 0.7859255313873291,
318
+ "mean_token_accuracy": 0.8338077068328857,
319
+ "num_tokens": 418105.0,
320
  "step": 310
321
  },
322
  {
323
+ "entropy": 0.8288773983716965,
324
+ "epoch": 0.0002368373918874312,
325
+ "grad_norm": 0.28125,
326
  "learning_rate": 7.24e-05,
327
+ "loss": 0.8528160095214844,
328
+ "mean_token_accuracy": 0.8213677883148194,
329
+ "num_tokens": 432042.0,
330
  "step": 320
331
  },
332
  {
333
+ "entropy": 0.7887327700853348,
334
+ "epoch": 0.0002442385603839134,
335
+ "grad_norm": 0.326171875,
336
  "learning_rate": 6.840000000000001e-05,
337
+ "loss": 0.7650537014007568,
338
+ "mean_token_accuracy": 0.8351205557584762,
339
+ "num_tokens": 444796.0,
340
  "step": 330
341
  },
342
  {
343
+ "entropy": 0.7681846857070923,
344
+ "epoch": 0.00025163972888039566,
345
+ "grad_norm": 0.287109375,
346
  "learning_rate": 6.440000000000001e-05,
347
+ "loss": 0.7828513145446777,
348
+ "mean_token_accuracy": 0.8325754940509796,
349
+ "num_tokens": 457664.0,
350
  "step": 340
351
  },
352
  {
353
+ "entropy": 0.8200330525636673,
354
+ "epoch": 0.00025904089737687787,
355
+ "grad_norm": 0.26953125,
356
  "learning_rate": 6.04e-05,
357
+ "loss": 0.8019542694091797,
358
+ "mean_token_accuracy": 0.8313791334629059,
359
+ "num_tokens": 470606.0,
360
  "step": 350
361
  },
362
  {
363
+ "entropy": 0.8059133917093277,
364
+ "epoch": 0.0002664420658733601,
365
+ "grad_norm": 0.259765625,
366
  "learning_rate": 5.6399999999999995e-05,
367
+ "loss": 0.7930517673492432,
368
+ "mean_token_accuracy": 0.8300592184066773,
369
+ "num_tokens": 484904.0,
370
  "step": 360
371
  },
372
  {
373
+ "entropy": 0.7620012789964676,
374
+ "epoch": 0.0002738432343698423,
375
+ "grad_norm": 0.306640625,
376
  "learning_rate": 5.2400000000000007e-05,
377
+ "loss": 0.7779502868652344,
378
+ "mean_token_accuracy": 0.8312928855419159,
379
+ "num_tokens": 498302.0,
380
  "step": 370
381
  },
382
  {
383
+ "entropy": 0.7787803679704666,
384
+ "epoch": 0.00028124440286632455,
385
+ "grad_norm": 0.3125,
386
  "learning_rate": 4.8400000000000004e-05,
387
+ "loss": 0.7784494400024414,
388
+ "mean_token_accuracy": 0.8316877603530883,
389
+ "num_tokens": 512369.0,
390
  "step": 380
391
  },
392
  {
393
+ "entropy": 0.7438325166702271,
394
+ "epoch": 0.00028864557136280676,
395
+ "grad_norm": 0.271484375,
396
  "learning_rate": 4.44e-05,
397
+ "loss": 0.7538249015808105,
398
+ "mean_token_accuracy": 0.8402615815401078,
399
+ "num_tokens": 525385.0,
400
  "step": 390
401
  },
402
  {
403
+ "entropy": 0.7514106065034867,
404
+ "epoch": 0.00029604673985928896,
405
+ "grad_norm": 0.263671875,
406
  "learning_rate": 4.0400000000000006e-05,
407
+ "loss": 0.742708683013916,
408
+ "mean_token_accuracy": 0.8387834310531617,
409
+ "num_tokens": 538315.0,
410
  "step": 400
411
  }
412
  ],
 
427
  "attributes": {}
428
  }
429
  },
430
+ "total_flos": 3.968843392293274e+16,
431
  "train_batch_size": 8,
432
  "trial_name": null,
433
  "trial_params": null
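
Note on the `trainer_state.json` hunks above: at the same `global_step` (400), the removed and added logs report very different `epoch` fractions and `num_tokens` totals. A minimal sketch for quantifying that difference follows. The two dictionaries copy the step-400 values from the diff; the helper names `steps_per_epoch` and `tokens_per_step` are hypothetical, and reading the result as "the training set grew" is an assumption, since the diff does not show the dataset or the gradient-accumulation setting.

```python
# Step-400 entries copied from the removed (-) and added (+) log_history above.
old_entry = {"step": 400, "epoch": 0.04041424602172266, "num_tokens": 1138357.0}
new_entry = {"step": 400, "epoch": 0.00029604673985928896, "num_tokens": 538315.0}


def steps_per_epoch(entry: dict) -> float:
    """Estimate optimizer steps in one full epoch from a single log entry."""
    return entry["step"] / entry["epoch"]


def tokens_per_step(entry: dict) -> float:
    """Average number of tokens consumed per optimizer step so far."""
    return entry["num_tokens"] / entry["step"]


# How many times longer one epoch is in the new run than in the old one.
ratio = steps_per_epoch(new_entry) / steps_per_epoch(old_entry)

print(f"old steps/epoch: {steps_per_epoch(old_entry):.1f}")
print(f"new steps/epoch: {steps_per_epoch(new_entry):.1f}")
print(f"epoch length ratio (new/old): {ratio:.1f}")
print(f"old tokens/step: {tokens_per_step(old_entry):.1f}")
print(f"new tokens/step: {tokens_per_step(new_entry):.1f}")
```

Because the loss and accuracy curves come from runs with different epoch lengths and token counts per step, the lower loss in the added log may reflect different data rather than better training.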
checkpoint-500/adapter_config.json CHANGED
@@ -29,13 +29,13 @@
29
  "rank_pattern": {},
30
  "revision": null,
31
  "target_modules": [
32
- "down_proj",
33
  "o_proj",
34
- "k_proj",
35
  "q_proj",
36
- "v_proj",
37
  "gate_proj",
38
- "up_proj"
39
  ],
40
  "target_parameters": null,
41
  "task_type": "CAUSAL_LM",
 
29
  "rank_pattern": {},
30
  "revision": null,
31
  "target_modules": [
32
+ "v_proj",
33
  "o_proj",
34
+ "up_proj",
35
  "q_proj",
36
+ "down_proj",
37
  "gate_proj",
38
+ "k_proj"
39
  ],
40
  "target_parameters": null,
41
  "task_type": "CAUSAL_LM",
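
Note on the `adapter_config.json` hunk above: the removed and added `target_modules` lists contain the same seven module names in a different order, so the hunk looks like a serialization-order change rather than a change in which layers are adapted. A minimal sketch of an order-insensitive check, with the two lists reconstructed from the diff; the function name `same_target_modules` is hypothetical.

```python
def same_target_modules(old_cfg: dict, new_cfg: dict) -> bool:
    """Return True when both configs name the same modules, ignoring order."""
    return set(old_cfg["target_modules"]) == set(new_cfg["target_modules"])


# Lists reconstructed from the removed (-) and added (+) sides of the hunk.
old_cfg = {
    "target_modules": [
        "down_proj", "o_proj", "k_proj", "q_proj",
        "v_proj", "gate_proj", "up_proj",
    ]
}
new_cfg = {
    "target_modules": [
        "v_proj", "o_proj", "up_proj", "q_proj",
        "down_proj", "gate_proj", "k_proj",
    ]
}

print(same_target_modules(old_cfg, new_cfg))
```

Comparing as sets avoids flagging a checkpoint as changed when only the list order differs between two saves.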
checkpoint-500/adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:27e2e812aa91b0af98fa9af3f5cbd95f3212af35d91ec3ab0e8d1cf1f47b5ba6
3
  size 83946192
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:52c5f909945589d0c78975a1cb4af27dcba08206910975f240e0ceb21013a2e2
3
  size 83946192
checkpoint-500/optimizer.pt CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:f3dd3be57b426dd155e5a63405dc86206a544c04adfc987f45277f54953346ad
3
  size 335818315
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:da5c5e2f7ef1148f93c79913486a56785ae2f2161404e9d5a0e62c949a21ac9c
3
  size 335818315
checkpoint-500/rng_state.pth CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1e442bb66fef53d2f7c798e2651c4e40adcb5578f87e46992c7834c9e7c5c12d
3
  size 14645
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6aa888466977fe5448d0a9d30a2861628e2271f7a8c8cc85349b67e3f7cc9da6
3
  size 14645
checkpoint-500/trainer_state.json CHANGED
@@ -2,7 +2,7 @@
2
  "best_global_step": null,
3
  "best_metric": null,
4
  "best_model_checkpoint": null,
5
- "epoch": 0.05051780752715332,
6
  "eval_steps": 500,
7
  "global_step": 500,
8
  "is_hyper_param_search": false,
@@ -10,503 +10,503 @@
10
  "is_world_process_zero": true,
11
  "log_history": [
12
  {
13
- "entropy": 1.45259268283844,
14
- "epoch": 0.0010103561505430665,
15
- "grad_norm": 0.2451171875,
16
  "learning_rate": 0.0001964,
17
- "loss": 1.7130912780761718,
18
- "mean_token_accuracy": 0.6436435863375664,
19
- "num_tokens": 28151.0,
20
  "step": 10
21
  },
22
  {
23
- "entropy": 1.3634081721305846,
24
- "epoch": 0.002020712301086133,
25
- "grad_norm": 0.484375,
26
  "learning_rate": 0.00019240000000000001,
27
- "loss": 1.3450921058654786,
28
- "mean_token_accuracy": 0.686880823969841,
29
- "num_tokens": 57374.0,
30
  "step": 20
31
  },
32
  {
33
- "entropy": 1.1857430338859558,
34
- "epoch": 0.0030310684516291994,
35
- "grad_norm": 0.322265625,
36
  "learning_rate": 0.0001884,
37
- "loss": 1.205996608734131,
38
- "mean_token_accuracy": 0.706908255815506,
39
- "num_tokens": 86755.0,
40
  "step": 30
41
  },
42
  {
43
- "entropy": 1.156875231862068,
44
- "epoch": 0.004041424602172266,
45
- "grad_norm": 0.359375,
46
  "learning_rate": 0.0001844,
47
- "loss": 1.1667716026306152,
48
- "mean_token_accuracy": 0.7132074117660523,
49
- "num_tokens": 114130.0,
50
  "step": 40
51
  },
52
  {
53
- "entropy": 1.083250343799591,
54
- "epoch": 0.005051780752715332,
55
- "grad_norm": 0.330078125,
56
  "learning_rate": 0.00018040000000000002,
57
- "loss": 1.1047730445861816,
58
- "mean_token_accuracy": 0.7196864277124405,
59
- "num_tokens": 141768.0,
60
  "step": 50
61
  },
62
  {
63
- "entropy": 1.0874410301446915,
64
- "epoch": 0.006062136903258399,
65
- "grad_norm": 0.265625,
66
  "learning_rate": 0.0001764,
67
- "loss": 1.0894905090332032,
68
- "mean_token_accuracy": 0.7210412979125976,
69
- "num_tokens": 172272.0,
70
  "step": 60
71
  },
72
  {
73
- "entropy": 1.1303296595811845,
74
- "epoch": 0.007072493053801465,
75
- "grad_norm": 0.25390625,
76
  "learning_rate": 0.00017240000000000002,
77
- "loss": 1.1445048332214356,
78
- "mean_token_accuracy": 0.7178041815757752,
79
- "num_tokens": 200909.0,
80
  "step": 70
81
  },
82
  {
83
- "entropy": 1.0938484042882919,
84
- "epoch": 0.008082849204344532,
85
- "grad_norm": 0.271484375,
86
  "learning_rate": 0.0001684,
87
- "loss": 1.098367691040039,
88
- "mean_token_accuracy": 0.7246084690093995,
89
- "num_tokens": 228726.0,
90
  "step": 80
91
  },
92
  {
93
- "entropy": 1.0779876083135604,
94
- "epoch": 0.009093205354887599,
95
- "grad_norm": 0.2216796875,
96
  "learning_rate": 0.0001644,
97
- "loss": 1.0803230285644532,
98
- "mean_token_accuracy": 0.720952507853508,
99
- "num_tokens": 255181.0,
100
  "step": 90
101
  },
102
  {
103
- "entropy": 1.1645614862442017,
104
- "epoch": 0.010103561505430665,
105
- "grad_norm": 0.1826171875,
106
  "learning_rate": 0.00016040000000000002,
107
- "loss": 1.147304630279541,
108
- "mean_token_accuracy": 0.7067203193902969,
109
- "num_tokens": 283616.0,
110
  "step": 100
111
  },
112
  {
113
- "entropy": 1.126008078455925,
114
- "epoch": 0.011113917655973731,
115
- "grad_norm": 0.2001953125,
116
  "learning_rate": 0.0001564,
117
- "loss": 1.1415093421936036,
118
- "mean_token_accuracy": 0.7101349741220474,
119
- "num_tokens": 312151.0,
120
  "step": 110
121
  },
122
  {
123
- "entropy": 1.091178685426712,
124
- "epoch": 0.012124273806516797,
125
- "grad_norm": 0.1953125,
126
  "learning_rate": 0.00015240000000000002,
127
- "loss": 1.0913947105407715,
128
- "mean_token_accuracy": 0.7238801747560502,
129
- "num_tokens": 340776.0,
130
  "step": 120
131
  },
132
  {
133
- "entropy": 1.2382428109645844,
134
- "epoch": 0.013134629957059864,
135
- "grad_norm": 0.2099609375,
136
  "learning_rate": 0.0001484,
137
- "loss": 1.2411503791809082,
138
- "mean_token_accuracy": 0.697870621085167,
139
- "num_tokens": 371270.0,
140
  "step": 130
141
  },
142
  {
143
- "entropy": 1.1168828099966048,
144
- "epoch": 0.01414498610760293,
145
- "grad_norm": 0.220703125,
146
  "learning_rate": 0.0001444,
147
- "loss": 1.1341249465942382,
148
- "mean_token_accuracy": 0.7141003280878067,
149
- "num_tokens": 400176.0,
150
  "step": 140
151
  },
152
  {
153
- "entropy": 1.114673560857773,
154
- "epoch": 0.015155342258145996,
155
- "grad_norm": 0.2109375,
156
  "learning_rate": 0.0001404,
157
- "loss": 1.1116752624511719,
158
- "mean_token_accuracy": 0.7234076589345932,
159
- "num_tokens": 427204.0,
160
  "step": 150
161
  },
162
  {
163
- "entropy": 1.1378572463989258,
164
- "epoch": 0.016165698408689064,
165
- "grad_norm": 0.1904296875,
166
  "learning_rate": 0.0001364,
167
- "loss": 1.1589903831481934,
168
- "mean_token_accuracy": 0.7053093910217285,
169
- "num_tokens": 458094.0,
170
  "step": 160
171
  },
172
  {
173
- "entropy": 1.110730269551277,
174
- "epoch": 0.01717605455923213,
175
- "grad_norm": 0.1962890625,
176
  "learning_rate": 0.00013240000000000002,
177
- "loss": 1.087682342529297,
178
- "mean_token_accuracy": 0.7177392661571502,
179
- "num_tokens": 487098.0,
180
  "step": 170
181
  },
182
  {
183
- "entropy": 1.0602406531572341,
184
- "epoch": 0.018186410709775197,
185
- "grad_norm": 0.228515625,
186
  "learning_rate": 0.0001284,
187
- "loss": 1.0950936317443847,
188
- "mean_token_accuracy": 0.7214239358901977,
189
- "num_tokens": 516025.0,
190
  "step": 180
191
  },
192
  {
193
- "entropy": 1.1597254037857057,
194
- "epoch": 0.01919676686031826,
195
- "grad_norm": 0.203125,
196
  "learning_rate": 0.00012440000000000002,
197
- "loss": 1.1208978652954102,
198
- "mean_token_accuracy": 0.7143576145172119,
199
- "num_tokens": 544810.0,
200
  "step": 190
201
  },
202
  {
203
- "entropy": 1.0519475936889648,
204
- "epoch": 0.02020712301086133,
205
- "grad_norm": 0.20703125,
206
  "learning_rate": 0.0001204,
207
- "loss": 1.0744948387145996,
208
- "mean_token_accuracy": 0.718455109000206,
209
- "num_tokens": 573002.0,
210
  "step": 200
211
  },
212
  {
213
- "entropy": 1.2084601551294327,
214
- "epoch": 0.021217479161404394,
215
- "grad_norm": 0.1943359375,
216
  "learning_rate": 0.0001164,
217
- "loss": 1.2174930572509766,
218
- "mean_token_accuracy": 0.6999662011861801,
219
- "num_tokens": 602401.0,
220
  "step": 210
221
  },
222
  {
223
- "entropy": 1.1912338614463807,
224
- "epoch": 0.022227835311947462,
225
- "grad_norm": 0.216796875,
226
  "learning_rate": 0.00011240000000000002,
227
- "loss": 1.183759880065918,
228
- "mean_token_accuracy": 0.7107820093631745,
229
- "num_tokens": 629994.0,
230
  "step": 220
231
  },
232
  {
233
- "entropy": 1.0905429303646088,
234
- "epoch": 0.023238191462490527,
235
- "grad_norm": 0.203125,
236
  "learning_rate": 0.00010840000000000002,
237
- "loss": 1.086796474456787,
238
- "mean_token_accuracy": 0.7176417618989944,
239
- "num_tokens": 658686.0,
240
  "step": 230
241
  },
242
  {
243
- "entropy": 1.0157978028059005,
244
- "epoch": 0.024248547613033595,
245
- "grad_norm": 0.23828125,
246
  "learning_rate": 0.0001044,
247
- "loss": 1.012559700012207,
248
- "mean_token_accuracy": 0.7369582027196884,
249
- "num_tokens": 685539.0,
250
  "step": 240
251
  },
252
  {
253
- "entropy": 1.1027084678411483,
254
- "epoch": 0.02525890376357666,
255
- "grad_norm": 0.2314453125,
256
  "learning_rate": 0.0001004,
257
- "loss": 1.1228812217712403,
258
- "mean_token_accuracy": 0.7192230314016342,
259
- "num_tokens": 715224.0,
260
  "step": 250
261
  },
262
  {
263
- "entropy": 1.0666967660188675,
264
- "epoch": 0.026269259914119727,
265
- "grad_norm": 0.2412109375,
266
  "learning_rate": 9.64e-05,
267
- "loss": 1.0753504753112793,
268
- "mean_token_accuracy": 0.719659861922264,
269
- "num_tokens": 745782.0,
270
  "step": 260
271
  },
272
  {
273
- "entropy": 1.034983891248703,
274
- "epoch": 0.027279616064662792,
275
- "grad_norm": 0.2216796875,
276
  "learning_rate": 9.240000000000001e-05,
277
- "loss": 1.032216453552246,
278
- "mean_token_accuracy": 0.7364529073238373,
279
- "num_tokens": 772358.0,
280
  "step": 270
281
  },
282
  {
283
- "entropy": 1.0821890532970428,
284
- "epoch": 0.02828997221520586,
285
- "grad_norm": 0.2294921875,
286
  "learning_rate": 8.840000000000001e-05,
287
- "loss": 1.0743555068969726,
288
- "mean_token_accuracy": 0.7248906105756759,
289
- "num_tokens": 800173.0,
290
  "step": 280
291
  },
292
  {
293
- "entropy": 1.1281798005104064,
294
- "epoch": 0.029300328365748928,
295
- "grad_norm": 0.2216796875,
296
  "learning_rate": 8.44e-05,
297
- "loss": 1.1471177101135255,
298
- "mean_token_accuracy": 0.7095231086015701,
299
- "num_tokens": 826835.0,
300
  "step": 290
301
  },
302
  {
303
- "entropy": 1.0436414241790772,
304
- "epoch": 0.030310684516291993,
305
- "grad_norm": 0.2001953125,
306
  "learning_rate": 8.04e-05,
307
- "loss": 1.0298893928527832,
308
- "mean_token_accuracy": 0.7325254619121552,
309
- "num_tokens": 853598.0,
310
  "step": 300
311
  },
312
  {
313
- "entropy": 1.0947031527757645,
314
- "epoch": 0.03132104066683506,
315
- "grad_norm": 0.232421875,
316
  "learning_rate": 7.64e-05,
317
- "loss": 1.0945837020874023,
318
- "mean_token_accuracy": 0.7231394708156585,
319
- "num_tokens": 884023.0,
320
  "step": 310
321
  },
322
  {
323
- "entropy": 1.1012510120868684,
324
- "epoch": 0.03233139681737813,
325
- "grad_norm": 0.17578125,
326
  "learning_rate": 7.24e-05,
327
- "loss": 1.106449031829834,
328
- "mean_token_accuracy": 0.7150771111249924,
329
- "num_tokens": 910709.0,
330
  "step": 320
331
  },
332
  {
333
- "entropy": 1.0871451586484908,
334
- "epoch": 0.03334175296792119,
335
- "grad_norm": 0.2041015625,
336
  "learning_rate": 6.840000000000001e-05,
337
- "loss": 1.0799496650695801,
338
- "mean_token_accuracy": 0.7183821439743042,
339
- "num_tokens": 938961.0,
340
  "step": 330
341
  },
342
  {
343
- "entropy": 1.0514528155326843,
344
- "epoch": 0.03435210911846426,
345
- "grad_norm": 0.2080078125,
346
  "learning_rate": 6.440000000000001e-05,
347
- "loss": 1.081194019317627,
348
- "mean_token_accuracy": 0.7243827939033508,
349
- "num_tokens": 967584.0,
350
  "step": 340
351
  },
352
  {
353
- "entropy": 1.1092546790838242,
354
- "epoch": 0.035362465269007326,
355
- "grad_norm": 0.201171875,
356
  "learning_rate": 6.04e-05,
357
- "loss": 1.0813716888427733,
358
- "mean_token_accuracy": 0.7228824734687805,
359
- "num_tokens": 995895.0,
360
  "step": 350
361
  },
362
  {
363
- "entropy": 1.0287963569164276,
364
- "epoch": 0.036372821419550394,
365
- "grad_norm": 0.265625,
366
  "learning_rate": 5.6399999999999995e-05,
367
- "loss": 1.0419590950012207,
368
- "mean_token_accuracy": 0.7281966924667358,
369
- "num_tokens": 1024358.0,
370
  "step": 360
371
  },
372
  {
373
- "entropy": 1.121953997015953,
374
- "epoch": 0.037383177570093455,
375
- "grad_norm": 0.26171875,
376
  "learning_rate": 5.2400000000000007e-05,
377
- "loss": 1.0987840652465821,
378
- "mean_token_accuracy": 0.7203631848096848,
379
- "num_tokens": 1052323.0,
380
  "step": 370
381
  },
382
  {
383
- "entropy": 1.110439071059227,
384
- "epoch": 0.03839353372063652,
385
- "grad_norm": 0.291015625,
386
  "learning_rate": 4.8400000000000004e-05,
387
- "loss": 1.0895070075988769,
388
- "mean_token_accuracy": 0.7241304695606232,
389
- "num_tokens": 1079594.0,
390
  "step": 380
391
  },
392
  {
393
- "entropy": 1.070469456911087,
394
- "epoch": 0.03940388987117959,
395
- "grad_norm": 0.3046875,
396
  "learning_rate": 4.44e-05,
397
- "loss": 1.1078936576843261,
398
- "mean_token_accuracy": 0.7139627873897553,
399
- "num_tokens": 1110202.0,
400
  "step": 390
401
  },
402
  {
403
- "entropy": 1.1429108887910844,
404
- "epoch": 0.04041424602172266,
405
- "grad_norm": 0.265625,
406
  "learning_rate": 4.0400000000000006e-05,
407
- "loss": 1.1321399688720704,
408
- "mean_token_accuracy": 0.7121616303920746,
409
- "num_tokens": 1138357.0,
410
  "step": 400
411
  },
412
  {
413
- "entropy": 1.0827387034893037,
414
- "epoch": 0.04142460217226572,
415
- "grad_norm": 0.2294921875,
416
  "learning_rate": 3.6400000000000004e-05,
417
- "loss": 1.0639567375183105,
418
- "mean_token_accuracy": 0.7281183630228043,
419
- "num_tokens": 1163579.0,
420
  "step": 410
421
  },
422
  {
423
- "entropy": 1.0125485062599182,
424
- "epoch": 0.04243495832280879,
425
- "grad_norm": 0.197265625,
426
  "learning_rate": 3.24e-05,
427
- "loss": 1.0153983116149903,
428
- "mean_token_accuracy": 0.7316560536623001,
429
- "num_tokens": 1192631.0,
430
  "step": 420
431
  },
432
  {
433
- "entropy": 1.0439467519521712,
434
- "epoch": 0.043445314473351856,
435
- "grad_norm": 0.2314453125,
436
  "learning_rate": 2.84e-05,
437
- "loss": 1.0409717559814453,
438
- "mean_token_accuracy": 0.7321272224187851,
439
- "num_tokens": 1221925.0,
440
  "step": 430
441
  },
442
  {
443
- "entropy": 1.0967293322086333,
444
- "epoch": 0.044455670623894925,
445
- "grad_norm": 0.2236328125,
446
  "learning_rate": 2.44e-05,
447
- "loss": 1.1045302391052245,
448
- "mean_token_accuracy": 0.721249520778656,
449
- "num_tokens": 1252243.0,
450
  "step": 440
451
  },
452
  {
453
- "entropy": 1.0544108510017396,
454
- "epoch": 0.04546602677443799,
455
- "grad_norm": 0.2734375,
456
  "learning_rate": 2.04e-05,
457
- "loss": 1.050521469116211,
458
- "mean_token_accuracy": 0.7213579922914505,
459
- "num_tokens": 1283834.0,
460
  "step": 450
461
  },
462
  {
463
- "entropy": 1.0683125108480453,
464
- "epoch": 0.046476382924981054,
465
- "grad_norm": 0.267578125,
466
  "learning_rate": 1.6400000000000002e-05,
467
- "loss": 1.0767866134643556,
468
- "mean_token_accuracy": 0.7280572831630707,
469
- "num_tokens": 1310545.0,
470
  "step": 460
471
  },
472
  {
473
- "entropy": 1.06637182533741,
474
- "epoch": 0.04748673907552412,
475
- "grad_norm": 0.203125,
476
  "learning_rate": 1.24e-05,
477
- "loss": 1.0640035629272462,
478
- "mean_token_accuracy": 0.7313684940338134,
479
- "num_tokens": 1338772.0,
480
  "step": 470
481
  },
482
  {
483
- "entropy": 1.1006224006414413,
484
- "epoch": 0.04849709522606719,
485
- "grad_norm": 0.2314453125,
486
  "learning_rate": 8.400000000000001e-06,
487
- "loss": 1.0907609939575196,
488
- "mean_token_accuracy": 0.7184683322906494,
489
- "num_tokens": 1366284.0,
490
  "step": 480
491
  },
492
  {
493
- "entropy": 1.0476179122924805,
494
- "epoch": 0.04950745137661026,
495
- "grad_norm": 0.224609375,
496
  "learning_rate": 4.4e-06,
497
- "loss": 1.0324252128601075,
498
- "mean_token_accuracy": 0.7270208716392517,
499
- "num_tokens": 1395417.0,
500
  "step": 490
501
  },
502
  {
503
- "entropy": 1.1447266846895219,
504
- "epoch": 0.05051780752715332,
505
- "grad_norm": 0.2373046875,
506
  "learning_rate": 4.0000000000000003e-07,
507
- "loss": 1.1333248138427734,
508
- "mean_token_accuracy": 0.7175892472267151,
509
- "num_tokens": 1421503.0,
510
  "step": 500
511
  }
512
  ],
@@ -527,7 +527,7 @@
527
  "attributes": {}
528
  }
529
  },
530
- "total_flos": 1.4920048791728947e+17,
531
  "train_batch_size": 8,
532
  "trial_name": null,
533
  "trial_params": null
 
2
  "best_global_step": null,
3
  "best_metric": null,
4
  "best_model_checkpoint": null,
5
+ "epoch": 0.0003700584248241112,
6
  "eval_steps": 500,
7
  "global_step": 500,
8
  "is_hyper_param_search": false,
 
10
  "is_world_process_zero": true,
11
  "log_history": [
12
  {
13
+ "entropy": 1.4767830133438111,
14
+ "epoch": 7.401168496482225e-06,
15
+ "grad_norm": 0.578125,
16
  "learning_rate": 0.0001964,
17
+ "loss": 1.6877475738525392,
18
+ "mean_token_accuracy": 0.7061349496245384,
19
+ "num_tokens": 14911.0,
20
  "step": 10
21
  },
22
  {
23
+ "entropy": 1.0930627048015595,
24
+ "epoch": 1.480233699296445e-05,
25
+ "grad_norm": 0.458984375,
26
  "learning_rate": 0.00019240000000000001,
27
+ "loss": 1.0562871932983398,
28
+ "mean_token_accuracy": 0.8108290940523147,
29
+ "num_tokens": 28646.0,
30
  "step": 20
31
  },
32
  {
33
+ "entropy": 0.8788679152727127,
34
+ "epoch": 2.2203505489446674e-05,
35
+ "grad_norm": 0.5859375,
36
  "learning_rate": 0.0001884,
37
+ "loss": 0.8974875450134278,
38
+ "mean_token_accuracy": 0.8296987593173981,
39
+ "num_tokens": 41474.0,
40
  "step": 30
41
  },
42
  {
43
+ "entropy": 0.8145956963300705,
44
+ "epoch": 2.96046739859289e-05,
45
+ "grad_norm": 0.39453125,
46
  "learning_rate": 0.0001844,
47
+ "loss": 0.8066701889038086,
48
+ "mean_token_accuracy": 0.8340015441179276,
49
+ "num_tokens": 54466.0,
50
  "step": 40
51
  },
52
  {
53
+ "entropy": 0.7157480388879776,
54
+ "epoch": 3.700584248241112e-05,
55
+ "grad_norm": 0.326171875,
56
  "learning_rate": 0.00018040000000000002,
57
+ "loss": 0.7251500129699707,
58
+ "mean_token_accuracy": 0.8420351594686508,
59
+ "num_tokens": 66880.0,
60
  "step": 50
61
  },
62
  {
63
+ "entropy": 0.7959431439638138,
64
+ "epoch": 4.440701097889335e-05,
65
+ "grad_norm": 0.326171875,
66
  "learning_rate": 0.0001764,
67
+ "loss": 0.8049167633056641,
68
+ "mean_token_accuracy": 0.8289562940597535,
69
+ "num_tokens": 80036.0,
70
  "step": 60
71
  },
72
  {
73
+ "entropy": 0.8342548221349716,
74
+ "epoch": 5.180817947537557e-05,
75
+ "grad_norm": 0.326171875,
76
  "learning_rate": 0.00017240000000000002,
77
+ "loss": 0.8336853981018066,
78
+ "mean_token_accuracy": 0.8279720038175583,
79
+ "num_tokens": 93357.0,
80
  "step": 70
81
  },
82
  {
83
+ "entropy": 0.7970967918634415,
84
+ "epoch": 5.92093479718578e-05,
85
+ "grad_norm": 0.73046875,
86
  "learning_rate": 0.0001684,
87
+ "loss": 0.7949181079864502,
88
+ "mean_token_accuracy": 0.828959608078003,
89
+ "num_tokens": 106951.0,
90
  "step": 80
91
  },
92
  {
93
+ "entropy": 0.7967441529035568,
94
+ "epoch": 6.661051646834002e-05,
95
+ "grad_norm": 0.34375,
96
  "learning_rate": 0.0001644,
97
+ "loss": 0.8285197257995606,
98
+ "mean_token_accuracy": 0.8272027671337128,
99
+ "num_tokens": 120269.0,
100
  "step": 90
101
  },
102
  {
103
+ "entropy": 0.7741447448730469,
104
+ "epoch": 7.401168496482224e-05,
105
+ "grad_norm": 0.271484375,
106
  "learning_rate": 0.00016040000000000002,
107
+ "loss": 0.7636381626129151,
108
+ "mean_token_accuracy": 0.8373189926147461,
109
+ "num_tokens": 133116.0,
110
  "step": 100
111
  },
112
  {
113
+ "entropy": 0.72959463596344,
114
+ "epoch": 8.141285346130448e-05,
115
+ "grad_norm": 0.421875,
116
  "learning_rate": 0.0001564,
117
+ "loss": 0.7404542446136475,
118
+ "mean_token_accuracy": 0.8400259047746659,
119
+ "num_tokens": 146103.0,
120
  "step": 110
121
  },
122
  {
123
+ "entropy": 0.777249938249588,
124
+ "epoch": 8.88140219577867e-05,
125
+ "grad_norm": 0.3984375,
126
  "learning_rate": 0.00015240000000000002,
127
+ "loss": 0.7868029117584229,
128
+ "mean_token_accuracy": 0.8342386931180954,
129
+ "num_tokens": 158980.0,
130
  "step": 120
131
  },
132
  {
133
+ "entropy": 0.8305783897638321,
134
+ "epoch": 9.621519045426892e-05,
135
+ "grad_norm": 0.328125,
136
  "learning_rate": 0.0001484,
137
+ "loss": 0.8155685424804687,
138
+ "mean_token_accuracy": 0.8282770067453384,
139
+ "num_tokens": 172414.0,
140
  "step": 130
141
  },
142
  {
143
+ "entropy": 0.8582165241241455,
144
+ "epoch": 0.00010361635895075114,
145
+ "grad_norm": 0.322265625,
146
  "learning_rate": 0.0001444,
147
+ "loss": 0.8684965133666992,
148
+ "mean_token_accuracy": 0.8188153028488159,
149
+ "num_tokens": 186224.0,
150
  "step": 140
151
  },
152
  {
153
+ "entropy": 0.823002302646637,
154
+ "epoch": 0.00011101752744723338,
155
+ "grad_norm": 0.41796875,
156
  "learning_rate": 0.0001404,
157
+ "loss": 0.8199325561523437,
158
+ "mean_token_accuracy": 0.8285818427801133,
159
+ "num_tokens": 199564.0,
160
  "step": 150
161
  },
162
  {
163
+ "entropy": 0.7803006649017334,
164
+ "epoch": 0.0001184186959437156,
165
+ "grad_norm": 0.28125,
166
  "learning_rate": 0.0001364,
167
+ "loss": 0.8177242279052734,
168
+ "mean_token_accuracy": 0.8276909857988357,
169
+ "num_tokens": 212955.0,
170
  "step": 160
171
  },
172
  {
173
+ "entropy": 0.7576605170965195,
174
+ "epoch": 0.00012581986444019783,
175
+ "grad_norm": 0.298828125,
176
  "learning_rate": 0.00013240000000000002,
177
+ "loss": 0.7334442615509034,
178
+ "mean_token_accuracy": 0.8368929207324982,
179
+ "num_tokens": 225983.0,
180
  "step": 170
181
  },
182
  {
183
+ "entropy": 0.8388681739568711,
184
+ "epoch": 0.00013322103293668004,
185
+ "grad_norm": 4.15625,
186
  "learning_rate": 0.0001284,
187
+ "loss": 0.878928279876709,
188
+ "mean_token_accuracy": 0.8206132620573043,
189
+ "num_tokens": 240490.0,
190
  "step": 180
191
  },
192
  {
193
+ "entropy": 0.8390863686800003,
194
+ "epoch": 0.00014062220143316227,
195
+ "grad_norm": 0.25,
196
  "learning_rate": 0.00012440000000000002,
197
+ "loss": 0.8454230308532715,
198
+ "mean_token_accuracy": 0.8245942384004593,
199
+ "num_tokens": 254696.0,
200
  "step": 190
201
  },
202
  {
203
+ "entropy": 0.8603733956813813,
204
+ "epoch": 0.00014802336992964448,
205
+ "grad_norm": 0.2734375,
206
  "learning_rate": 0.0001204,
207
+ "loss": 0.8759581565856933,
208
+ "mean_token_accuracy": 0.8165332227945328,
209
+ "num_tokens": 269719.0,
210
  "step": 200
211
  },
212
  {
213
+ "entropy": 0.76580231487751,
214
+ "epoch": 0.00015542453842612672,
215
+ "grad_norm": 0.240234375,
216
  "learning_rate": 0.0001164,
217
+ "loss": 0.7616221904754639,
218
+ "mean_token_accuracy": 0.8392421275377273,
219
+ "num_tokens": 282621.0,
220
  "step": 210
221
  },
222
  {
223
+ "entropy": 0.7803073287010193,
224
+ "epoch": 0.00016282570692260895,
225
+ "grad_norm": 0.341796875,
226
  "learning_rate": 0.00011240000000000002,
227
+ "loss": 0.7809097766876221,
228
+ "mean_token_accuracy": 0.8302495568990708,
229
+ "num_tokens": 295624.0,
230
  "step": 220
231
  },
232
  {
233
+ "entropy": 0.7702126175165176,
234
+ "epoch": 0.00017022687541909116,
235
+ "grad_norm": 0.251953125,
236
  "learning_rate": 0.00010840000000000002,
237
+ "loss": 0.7757031917572021,
238
+ "mean_token_accuracy": 0.8389965564012527,
239
+ "num_tokens": 308856.0,
240
  "step": 230
241
  },
242
  {
243
+ "entropy": 0.8611143410205842,
244
+ "epoch": 0.0001776280439155734,
245
+ "grad_norm": 0.337890625,
246
  "learning_rate": 0.0001044,
247
+ "loss": 0.8744688034057617,
248
+ "mean_token_accuracy": 0.8146604359149933,
249
+ "num_tokens": 322610.0,
250
  "step": 240
251
  },
252
  {
253
+ "entropy": 0.8659275263547898,
254
+ "epoch": 0.0001850292124120556,
255
+ "grad_norm": 0.326171875,
256
  "learning_rate": 0.0001004,
257
+ "loss": 0.8652327537536622,
258
+ "mean_token_accuracy": 0.819416218996048,
259
+ "num_tokens": 336948.0,
260
  "step": 250
261
  },
262
  {
263
+ "entropy": 0.768859726190567,
264
+ "epoch": 0.00019243038090853784,
265
+ "grad_norm": 0.2890625,
266
  "learning_rate": 9.64e-05,
267
+ "loss": 0.7651469707489014,
268
+ "mean_token_accuracy": 0.8339170336723327,
269
+ "num_tokens": 350252.0,
270
  "step": 260
271
  },
272
  {
273
+ "entropy": 0.8208303570747375,
274
+ "epoch": 0.00019983154940502007,
275
+ "grad_norm": 0.296875,
276
  "learning_rate": 9.240000000000001e-05,
277
+ "loss": 0.8234204292297364,
278
+ "mean_token_accuracy": 0.8225490599870682,
279
+ "num_tokens": 364325.0,
280
  "step": 270
281
  },
282
  {
283
+ "entropy": 0.7798860669136047,
284
+ "epoch": 0.00020723271790150228,
285
+ "grad_norm": 0.3046875,
286
  "learning_rate": 8.840000000000001e-05,
287
+ "loss": 0.7923468112945556,
288
+ "mean_token_accuracy": 0.831676983833313,
289
+ "num_tokens": 378088.0,
290
  "step": 280
291
  },
292
  {
293
+ "entropy": 0.7306642323732376,
294
+ "epoch": 0.00021463388639798452,
295
+ "grad_norm": 0.279296875,
296
  "learning_rate": 8.44e-05,
297
+ "loss": 0.7504455089569092,
298
+ "mean_token_accuracy": 0.8410079121589661,
299
+ "num_tokens": 391023.0,
300
  "step": 290
301
  },
302
  {
303
+ "entropy": 0.8291689246892929,
304
+ "epoch": 0.00022203505489446675,
305
+ "grad_norm": 0.24609375,
306
  "learning_rate": 8.04e-05,
307
+ "loss": 0.8151634216308594,
308
+ "mean_token_accuracy": 0.8278465926647186,
309
+ "num_tokens": 404816.0,
310
  "step": 300
311
  },
312
  {
313
+ "entropy": 0.7772005677223206,
314
+ "epoch": 0.00022943622339094896,
315
+ "grad_norm": 0.326171875,
316
  "learning_rate": 7.64e-05,
317
+ "loss": 0.7859255313873291,
318
+ "mean_token_accuracy": 0.8338077068328857,
319
+ "num_tokens": 418105.0,
320
  "step": 310
321
  },
322
  {
323
+ "entropy": 0.8288773983716965,
324
+ "epoch": 0.0002368373918874312,
325
+ "grad_norm": 0.28125,
326
  "learning_rate": 7.24e-05,
327
+ "loss": 0.8528160095214844,
328
+ "mean_token_accuracy": 0.8213677883148194,
329
+ "num_tokens": 432042.0,
330
  "step": 320
331
  },
332
  {
333
+ "entropy": 0.7887327700853348,
334
+ "epoch": 0.0002442385603839134,
335
+ "grad_norm": 0.326171875,
336
  "learning_rate": 6.840000000000001e-05,
337
+ "loss": 0.7650537014007568,
338
+ "mean_token_accuracy": 0.8351205557584762,
339
+ "num_tokens": 444796.0,
340
  "step": 330
341
  },
342
  {
343
+ "entropy": 0.7681846857070923,
344
+ "epoch": 0.00025163972888039566,
345
+ "grad_norm": 0.287109375,
346
  "learning_rate": 6.440000000000001e-05,
347
+ "loss": 0.7828513145446777,
348
+ "mean_token_accuracy": 0.8325754940509796,
349
+ "num_tokens": 457664.0,
350
  "step": 340
351
  },
352
  {
353
+ "entropy": 0.8200330525636673,
354
+ "epoch": 0.00025904089737687787,
355
+ "grad_norm": 0.26953125,
356
  "learning_rate": 6.04e-05,
357
+ "loss": 0.8019542694091797,
358
+ "mean_token_accuracy": 0.8313791334629059,
359
+ "num_tokens": 470606.0,
360
  "step": 350
361
  },
362
  {
363
+ "entropy": 0.8059133917093277,
364
+ "epoch": 0.0002664420658733601,
365
+ "grad_norm": 0.259765625,
366
  "learning_rate": 5.6399999999999995e-05,
367
+ "loss": 0.7930517673492432,
368
+ "mean_token_accuracy": 0.8300592184066773,
369
+ "num_tokens": 484904.0,
370
  "step": 360
371
  },
372
  {
373
+ "entropy": 0.7620012789964676,
374
+ "epoch": 0.0002738432343698423,
375
+ "grad_norm": 0.306640625,
376
  "learning_rate": 5.2400000000000007e-05,
377
+ "loss": 0.7779502868652344,
378
+ "mean_token_accuracy": 0.8312928855419159,
379
+ "num_tokens": 498302.0,
380
  "step": 370
381
  },
382
  {
383
+ "entropy": 0.7787803679704666,
384
+ "epoch": 0.00028124440286632455,
385
+ "grad_norm": 0.3125,
386
  "learning_rate": 4.8400000000000004e-05,
387
+ "loss": 0.7784494400024414,
388
+ "mean_token_accuracy": 0.8316877603530883,
389
+ "num_tokens": 512369.0,
390
  "step": 380
391
  },
392
  {
393
+ "entropy": 0.7438325166702271,
394
+ "epoch": 0.00028864557136280676,
395
+ "grad_norm": 0.271484375,
396
  "learning_rate": 4.44e-05,
397
+ "loss": 0.7538249015808105,
398
+ "mean_token_accuracy": 0.8402615815401078,
399
+ "num_tokens": 525385.0,
400
  "step": 390
401
  },
402
  {
403
+ "entropy": 0.7514106065034867,
404
+ "epoch": 0.00029604673985928896,
405
+ "grad_norm": 0.263671875,
406
  "learning_rate": 4.0400000000000006e-05,
407
+ "loss": 0.742708683013916,
408
+ "mean_token_accuracy": 0.8387834310531617,
409
+ "num_tokens": 538315.0,
410
  "step": 400
411
  },
412
  {
413
+ "entropy": 0.7242682158946991,
414
+ "epoch": 0.0003034479083557712,
415
+ "grad_norm": 0.36328125,
416
  "learning_rate": 3.6400000000000004e-05,
417
+ "loss": 0.7231860160827637,
418
+ "mean_token_accuracy": 0.8462171643972397,
419
+ "num_tokens": 550874.0,
420
  "step": 410
421
  },
422
  {
423
+ "entropy": 0.7665889590978623,
424
+ "epoch": 0.00031084907685225343,
425
+ "grad_norm": 0.298828125,
426
  "learning_rate": 3.24e-05,
427
+ "loss": 0.7693154811859131,
428
+ "mean_token_accuracy": 0.8359919935464859,
429
+ "num_tokens": 564118.0,
430
  "step": 420
431
  },
432
  {
433
+ "entropy": 0.7493055462837219,
434
+ "epoch": 0.00031825024534873564,
435
+ "grad_norm": 0.30859375,
436
  "learning_rate": 2.84e-05,
437
+ "loss": 0.7551113128662109,
438
+ "mean_token_accuracy": 0.8372300088405609,
439
+ "num_tokens": 576999.0,
440
  "step": 430
441
  },
442
  {
443
+ "entropy": 0.7846053004264831,
444
+ "epoch": 0.0003256514138452179,
445
+ "grad_norm": 0.26953125,
446
  "learning_rate": 2.44e-05,
447
+ "loss": 0.7938904762268066,
448
+ "mean_token_accuracy": 0.8271657317876816,
449
+ "num_tokens": 590845.0,
450
  "step": 440
451
  },
452
  {
453
+ "entropy": 0.7567790508270263,
454
+ "epoch": 0.0003330525823417001,
455
+ "grad_norm": 0.27734375,
456
  "learning_rate": 2.04e-05,
457
+ "loss": 0.7597766876220703,
458
+ "mean_token_accuracy": 0.8390242576599121,
459
+ "num_tokens": 603943.0,
460
  "step": 450
461
  },
462
  {
463
+ "entropy": 0.7776343286037445,
464
+ "epoch": 0.0003404537508381823,
465
+ "grad_norm": 0.296875,
466
  "learning_rate": 1.6400000000000002e-05,
467
+ "loss": 0.7805155277252197,
468
+ "mean_token_accuracy": 0.8314157396554946,
469
+ "num_tokens": 617297.0,
470
  "step": 460
471
  },
472
  {
473
+ "entropy": 0.7654042065143585,
474
+ "epoch": 0.0003478549193346646,
475
+ "grad_norm": 0.294921875,
476
  "learning_rate": 1.24e-05,
477
+ "loss": 0.7467947483062745,
478
+ "mean_token_accuracy": 0.836009356379509,
479
+ "num_tokens": 630337.0,
480
  "step": 470
481
  },
482
  {
483
+ "entropy": 0.7470057517290115,
484
+ "epoch": 0.0003552560878311468,
485
+ "grad_norm": 0.283203125,
486
  "learning_rate": 8.400000000000001e-06,
487
+ "loss": 0.7188091278076172,
488
+ "mean_token_accuracy": 0.8408551633358001,
489
+ "num_tokens": 643059.0,
490
  "step": 480
491
  },
492
  {
493
+ "entropy": 0.7312727242708206,
494
+ "epoch": 0.000362657256327629,
495
+ "grad_norm": 0.23828125,
496
  "learning_rate": 4.4e-06,
497
+ "loss": 0.7032594203948974,
498
+ "mean_token_accuracy": 0.844664552807808,
499
+ "num_tokens": 655763.0,
500
  "step": 490
501
  },
502
  {
503
+ "entropy": 0.8029953300952911,
504
+ "epoch": 0.0003700584248241112,
505
+ "grad_norm": 0.306640625,
506
  "learning_rate": 4.0000000000000003e-07,
507
+ "loss": 0.786741065979004,
508
+ "mean_token_accuracy": 0.8345526486635209,
509
+ "num_tokens": 669554.0,
510
  "step": 500
511
  }
512
  ],
 
527
  "attributes": {}
528
  }
529
  },
530
+ "total_flos": 4.872149119972147e+16,
531
  "train_batch_size": 8,
532
  "trial_name": null,
533
  "trial_params": null