End of training

README.md CHANGED
@@ -77,7 +77,7 @@ LlamaForCausalLM(
 
 # Resource Usage
 
-- Max Train VRAM Use: 13.
+- Max Train VRAM Use: 13.1269 GB
 - Available VRAM: 23.4329 GB
 - GPUs:
   - 1x NVIDIA GeForce RTX 4090
@@ -107,6 +107,28 @@ LlamaForCausalLM(
         (self_attn): LlamaSdpaAttention(
           (q_proj): Linear(in_features=576, out_features=576, bias=False)
           (k_proj): Linear(in_features=576, out_features=192, bias=False)
+@@ -10,17 +10,16 @@
+          (o_proj): Linear(in_features=576, out_features=576, bias=False)
+          (rotary_emb): LlamaRotaryEmbedding()
+        )
+-       (mlp): LlamaMLP(
++       (mlp): LigerSwiGLUMLP(
+          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
+          (up_proj): Linear(in_features=576, out_features=1536, bias=False)
+          (down_proj): Linear(in_features=1536, out_features=576, bias=False)
+-         (act_fn): SiLU()
+        )
+-       (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
+-       (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
++       (input_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
++       (post_attention_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
+      )
+    )
+-   (norm): LlamaRMSNorm((576,), eps=1e-05)
++   (norm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
+    (rotary_emb): LlamaRotaryEmbedding()
+  )
+  (lm_head): Linear(in_features=576, out_features=49152, bias=False)
 
 ```
 
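The modules swapped in above come from Liger Kernel: `LigerSwiGLUMLP` fuses the same SiLU-gated MLP that `LlamaMLP` computes, and `LigerRMSNorm` matches `LlamaRMSNorm` up to an optional weight offset. As a rough reference sketch (pure Python on single vectors, ignoring the fused-kernel and batching details), the per-vector math is:

```python
import math

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    # SwiGLU MLP as computed by LlamaMLP / LigerSwiGLUMLP:
    # down_proj(silu(gate_proj(x)) * up_proj(x))
    gate = [sum(wi * xi for wi, xi in zip(row, x)) for row in w_gate]
    up = [sum(wi * xi for wi, xi in zip(row, x)) for row in w_up]
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return [sum(wi * hi for wi, hi in zip(row, hidden)) for row in w_down]

def rms_norm(x, weight, eps=1e-5, offset=0.0):
    # RMSNorm: x / sqrt(mean(x^2) + eps), scaled by (offset + weight)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * (offset + w) for v, w in zip(x, weight)]
```

With the sizes in the diff, `gate_proj`/`up_proj` map 576 → 1536 and `down_proj` maps 1536 → 576; the kernel fusion changes memory traffic during training, not the result.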
@@ -114,7 +136,7 @@ LlamaForCausalLM(
 <br/>
 
 # Train Dataset
-Trained on 553,
+Trained on 553,295,062 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.
 
 - Num Samples: `998,000`
 - Subset: `20231101.en`
@@ -150,6 +172,7 @@ The following hyperparameters were used during training:
 - seed: `42`
 - optimizer: `Adam with betas=(0.9,0.999) and epsilon=1e-08`
 - lr_scheduler_type: `polynomial`
+- lr_scheduler_warmup_ratio: `0.1`
 - num_epochs: `1.0`
 - distillation_objective: `DistillationObjective(
     logits_loss_component=LossComponent(
@@ -163,7 +186,7 @@ The following hyperparameters were used during training:
       weight=0
     )
 )`
-- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at
+- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7520ce1738b0>`
 - student_model_name_or_path: `None`
 - student_config_name_or_path: `None`
 - student_model_config: `{'num_hidden_layers': 15}`
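The `DistillationObjective` in the hyperparameters pairs a logits loss with a second component whose `weight=0`; the concrete loss functions are configured in distily and are not visible in this diff. As an illustrative sketch only (forward KL over the vocabulary is a common choice for a logits distillation loss, not necessarily the one used here):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_logits_loss(teacher_logits, student_logits):
    # Forward KL(teacher || student) between the two next-token
    # distributions; zero when the student matches the teacher exactly.
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```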
@@ -187,7 +210,7 @@ The following hyperparameters were used during training:
 - gradient_accumulation_steps: `1`
 - weight_decay: `0.0`
 - max_grad_norm: `1.0`
-- warmup_ratio: `0.
+- warmup_ratio: `0.1`
 - warmup_steps: `0`
 - gradient_checkpointing: `True`
 
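The run's log directory name encodes the schedule: `lr_scheduler_type=polynomial` with `power=0.7`, `lr_end=2e-05`, `learning_rate=0.0001`, and `warmup_ratio=0.1`. A sketch of the resulting learning-rate curve, in the spirit of transformers' `get_polynomial_decay_schedule_with_warmup` (which is what yields the `LambdaLR` object listed in the hyperparameters; step bookkeeping simplified here):

```python
def polynomial_lr(step, total_steps, lr_init=1e-4, lr_end=2e-5,
                  power=0.7, warmup_ratio=0.1):
    # Linear warmup to lr_init, then polynomial decay to lr_end.
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return lr_init * step / max(1, warmup_steps)
    if step > total_steps:
        return lr_end
    remaining = 1 - (step - warmup_steps) / (total_steps - warmup_steps)
    return (lr_init - lr_end) * remaining ** power + lr_end
```

With `power=0.7 < 1` the decay is front-loaded less aggressively than linear early on, and the rate bottoms out at `lr_end=2e-5` rather than zero.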
logs/learning_rate=0.0001, lr_scheduler_kwargs=__power___0.7___lr_end___2e-05_, lr_scheduler_type=polynomial, per_device_train_batch_size=8, warmup_ratio=0.1/events.out.tfevents.1726725569.1c1a426a2fee ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7805bb537c375908961a23e47b045a33754b66eeddf1286b133c5263315c4085
+size 529
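The added file is a Git LFS pointer, not the TensorBoard event data itself: the repository stores only the `version`/`oid`/`size` fields above, while the 529-byte blob lives in LFS storage. A minimal sketch of parsing such a pointer (the helper name `parse_lfs_pointer` is ours, not part of any library):

```python
def parse_lfs_pointer(text):
    # A Git LFS pointer is a tiny "key value" text file; each line holds
    # one field, and the oid combines the hash algorithm and digest.
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "oid_algo": algo,
            "oid": digest, "size": int(fields["size"])}

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:7805bb537c375908961a23e47b045a33754b66eeddf1286b133c5263315c4085
size 529
"""
info = parse_lfs_pointer(pointer)
```

Cloning without `git lfs` installed yields exactly this pointer text in place of the event file.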