End of training
Files changed:
- README.md: +16 / -13 lines
- adapter_model.bin: +1 / -1 lines

README.md CHANGED:

@@ -51,10 +51,10 @@ lora_fan_in_fan_out:
 
 gradient_accumulation_steps: 4
 micro_batch_size: 2
-num_epochs:
+num_epochs: 3
 optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
-learning_rate: 0.
+learning_rate: 0.00002
 
 train_on_inputs: false
 group_by_length: false

@@ -71,7 +71,7 @@ xformers_attention:
 flash_attention: true
 s2_attention:
 
-
+warmup_ratio: 0.04
 evals_per_epoch: 1
 eval_table_size:
 eval_max_new_tokens: 128

@@ -92,7 +92,7 @@ special_tokens:
 
 This model is a fine-tuned version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) on the None dataset.
 It achieves the following results on the evaluation set:
-- Loss: 0.
+- Loss: 0.0467
 
 ## Model description
 

@@ -111,25 +111,28 @@ More information needed
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
-- learning_rate:
+- learning_rate: 2e-05
 - train_batch_size: 2
 - eval_batch_size: 2
 - seed: 42
 - distributed_type: multi-GPU
-- num_devices:
+- num_devices: 8
 - gradient_accumulation_steps: 4
-- total_train_batch_size:
+- total_train_batch_size: 64
-- total_eval_batch_size:
+- total_eval_batch_size: 16
 - optimizer: Use OptimizerNames.ADAMW_BNB with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
 - lr_scheduler_type: cosine
-- lr_scheduler_warmup_steps:
+- lr_scheduler_warmup_steps: 17
-- num_epochs:
+- num_epochs: 3
 
 ### Training results
 
-| Training Loss | Epoch
-
-| 0.
+| Training Loss | Epoch  | Step | Validation Loss |
+|:-------------:|:------:|:----:|:---------------:|
+| 0.3341        | 0.0067 | 1    | 0.3710          |
+| 0.061         | 0.9966 | 148  | 0.0574          |
+| 0.0413        | 1.9933 | 296  | 0.0476          |
+| 0.0453        | 2.9899 | 444  | 0.0467          |
 
 
 ### Framework versions
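
Several values in the auto-generated hyperparameters block above are derived rather than set directly in the config. A quick sanity check of that arithmetic, assuming the usual convention that the effective train batch size is micro_batch_size * gradient_accumulation_steps * num_devices, that the eval batch size scales only with the number of devices, and that warmup steps come from warmup_ratio * total optimizer steps (variable names are illustrative, and the flooring rule is an assumption):

```python
# Sanity check of the derived hyperparameters, using the config values
# and the step counts from the results table above. Illustrative only;
# this is not code shipped with the model.
micro_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 8
num_epochs = 3
steps_per_epoch = 148   # step count at epoch ~1.0 in the results table
warmup_ratio = 0.04

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
total_eval_batch_size = micro_batch_size * num_devices
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(warmup_ratio * total_steps)

print(total_train_batch_size)  # 64  -> matches total_train_batch_size
print(total_eval_batch_size)   # 16  -> matches total_eval_batch_size
print(total_steps)             # 444 -> matches the final step in the table
print(warmup_steps)            # 17  -> matches lr_scheduler_warmup_steps
```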
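
For reference, a minimal sketch of loading an adapter like this on top of the base model with transformers and peft; the adapter repo id is a placeholder, and the snippet assumes the files in this commit form a standard PEFT LoRA adapter:

```python
# Minimal loading sketch. "your-username/your-lora-adapter" is a placeholder,
# not the actual repository id of this adapter.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.1-8B-Instruct"
adapter_id = "your-username/your-lora-adapter"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(base_model, adapter_id)

# eval_max_new_tokens in the config above is 128, so mirror that here.
inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```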

adapter_model.bin CHANGED:

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:cbcbea02c7645350192e3d11f3dcec3f419bc7f070baa78cceb909228c19b9b4
 size 335706186
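
Because adapter_model.bin is tracked with Git LFS, the committed file is a small pointer: only the oid line changes here, while the size stays at 335706186 bytes. A small sketch for checking a downloaded copy against the new oid (the local path is an assumption):

```python
# Verify a downloaded adapter_model.bin against the LFS pointer oid above.
import hashlib

expected = "cbcbea02c7645350192e3d11f3dcec3f419bc7f070baa78cceb909228c19b9b4"

h = hashlib.sha256()
with open("adapter_model.bin", "rb") as f:          # assumed local path
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)

assert h.hexdigest() == expected, "checksum mismatch"
print("adapter_model.bin matches the committed LFS oid")
```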