End of training

README.md CHANGED
@@ -77,7 +77,7 @@ LlamaForCausalLM(
 
 # Resource Usage
 
-- Max Train VRAM Use: 13.
+- Max Train VRAM Use: 13.1269 GB
 - Available VRAM: 23.4329 GB
 - GPUs:
   - 1x NVIDIA GeForce RTX 4090
@@ -107,6 +107,28 @@ LlamaForCausalLM(
         (self_attn): LlamaSdpaAttention(
           (q_proj): Linear(in_features=576, out_features=576, bias=False)
           (k_proj): Linear(in_features=576, out_features=192, bias=False)
+@@ -10,17 +10,16 @@
+          (o_proj): Linear(in_features=576, out_features=576, bias=False)
+          (rotary_emb): LlamaRotaryEmbedding()
+        )
+-       (mlp): LlamaMLP(
++       (mlp): LigerSwiGLUMLP(
+          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
+          (up_proj): Linear(in_features=576, out_features=1536, bias=False)
+          (down_proj): Linear(in_features=1536, out_features=576, bias=False)
+-         (act_fn): SiLU()
+        )
+-       (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
+-       (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
++       (input_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
++       (post_attention_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
+      )
+    )
+-   (norm): LlamaRMSNorm((576,), eps=1e-05)
++   (norm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
+    (rotary_emb): LlamaRotaryEmbedding()
+  )
+  (lm_head): Linear(in_features=576, out_features=49152, bias=False)
 
 ```
 
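The modules swapped in above come from Liger Kernel: `LigerSwiGLUMLP` fuses the same SiLU-gated MLP that `LlamaMLP` computes, and `LigerRMSNorm` matches `LlamaRMSNorm` up to an optional weight offset. As a rough reference sketch (pure Python on single vectors, ignoring the fused-kernel and batching details), the per-vector math is:

```python
import math

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    # SwiGLU MLP as computed by LlamaMLP / LigerSwiGLUMLP:
    # down_proj(silu(gate_proj(x)) * up_proj(x))
    gate = [sum(wi * xi for wi, xi in zip(row, x)) for row in w_gate]
    up = [sum(wi * xi for wi, xi in zip(row, x)) for row in w_up]
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return [sum(wi * hi for wi, hi in zip(row, hidden)) for row in w_down]

def rms_norm(x, weight, eps=1e-5, offset=0.0):
    # RMSNorm: x / sqrt(mean(x^2) + eps), scaled by (offset + weight)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * (offset + w) for v, w in zip(x, weight)]
```

With the sizes in the diff, `gate_proj`/`up_proj` map 576 → 1536 and `down_proj` maps 1536 → 576; the kernel fusion changes memory traffic during training, not the result.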
@@ -114,7 +136,7 @@ LlamaForCausalLM(
 <br/>
 
 # Train Dataset
-Trained on 553,
+Trained on 553,295,062 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.
 
 - Num Samples: `998,000`
 - Subset: `20231101.en`
@@ -150,6 +172,7 @@ The following hyperparameters were used during training:
 - seed: `42`
 - optimizer: `Adam with betas=(0.9,0.999) and epsilon=1e-08`
 - lr_scheduler_type: `polynomial`
+- lr_scheduler_warmup_ratio: `0.1`
 - num_epochs: `1.0`
 - distillation_objective: `DistillationObjective(
     logits_loss_component=LossComponent(
@@ -163,7 +186,7 @@ The following hyperparameters were used during training:
       weight=0
     )
 )`
-- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at
+- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7520ce1738b0>`
 - student_model_name_or_path: `None`
 - student_config_name_or_path: `None`
 - student_model_config: `{'num_hidden_layers': 15}`
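The `DistillationObjective` in the hyperparameters pairs a logits loss with a second component whose `weight=0`; the concrete loss functions are configured in distily and are not visible in this diff. As an illustrative sketch only (forward KL over the vocabulary is a common choice for a logits distillation loss, not necessarily the one used here):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_logits_loss(teacher_logits, student_logits):
    # Forward KL(teacher || student) between the two next-token
    # distributions; zero when the student matches the teacher exactly.
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```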
@@ -187,7 +210,7 @@ The following hyperparameters were used during training:
 - gradient_accumulation_steps: `1`
 - weight_decay: `0.0`
 - max_grad_norm: `1.0`
-- warmup_ratio: `0.
+- warmup_ratio: `0.1`
 - warmup_steps: `0`
 - gradient_checkpointing: `True`
 
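The run's log directory name encodes the schedule: `lr_scheduler_type=polynomial` with `power=0.7`, `lr_end=2e-05`, `learning_rate=0.0001`, and `warmup_ratio=0.1`. A sketch of the resulting learning-rate curve, in the spirit of transformers' `get_polynomial_decay_schedule_with_warmup` (which is what yields the `LambdaLR` object listed in the hyperparameters; step bookkeeping simplified here):

```python
def polynomial_lr(step, total_steps, lr_init=1e-4, lr_end=2e-5,
                  power=0.7, warmup_ratio=0.1):
    # Linear warmup to lr_init, then polynomial decay to lr_end.
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return lr_init * step / max(1, warmup_steps)
    if step > total_steps:
        return lr_end
    remaining = 1 - (step - warmup_steps) / (total_steps - warmup_steps)
    return (lr_init - lr_end) * remaining ** power + lr_end
```

With `power=0.7 < 1` the decay is front-loaded less aggressively than linear early on, and the rate bottoms out at `lr_end=2e-5` rather than zero.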
logs/learning_rate=0.0001, lr_scheduler_kwargs=__power___0.7___lr_end___2e-05_, lr_scheduler_type=polynomial, per_device_train_batch_size=8, warmup_ratio=0.1/events.out.tfevents.1726725569.1c1a426a2fee ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7805bb537c375908961a23e47b045a33754b66eeddf1286b133c5263315c4085
+size 529
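The added file is a Git LFS pointer, not the TensorBoard event data itself: the repository stores only the `version`/`oid`/`size` fields above, while the 529-byte blob lives in LFS storage. A minimal sketch of parsing such a pointer (the helper name `parse_lfs_pointer` is ours, not part of any library):

```python
def parse_lfs_pointer(text):
    # A Git LFS pointer is a tiny "key value" text file; each line holds
    # one field, and the oid combines the hash algorithm and digest.
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "oid_algo": algo,
            "oid": digest, "size": int(fields["size"])}

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:7805bb537c375908961a23e47b045a33754b66eeddf1286b133c5263315c4085
size 529
"""
info = parse_lfs_pointer(pointer)
```

Cloning without `git lfs` installed yields exactly this pointer text in place of the event file.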