---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-1.7B
tags:
- generated_from_trainer
datasets:
- sumuks/essential-web-v1.0-sample-100M-with-cleaned-responses-sft
model-index:
- name: output/1.7B-Instruct-Tuned-New-Data
  results: []
---

[Built with Axolotl](https://github.com/axolotl-ai-cloud/axolotl)
See axolotl config

axolotl version: `0.11.0`

```yaml
base_model: Qwen/Qwen3-1.7B

# plugins:
#   - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin

strict: false

# plugins:
#   - axolotl.integrations.liger.LigerPlugin
# liger_rope: true
# liger_rms_norm: true
# liger_glu_activation: true
# liger_layer_norm: true
# liger_fused_linear_cross_entropy: true

datasets:
  - path: sumuks/essential-web-v1.0-sample-100M-with-cleaned-responses-sft
    type: chat_template
    field_messages: conversations
    split: train

val_set_size: 0.05
dataset_prepared_path: dataset/prepared_dataset_1.7b
train_on_inputs: false
output_dir: ./output/1.7B-Instruct-Tuned-New-Data

chat_template: qwen3
sequence_len: 8192
sample_packing: true
eval_sample_packing: true
# pad_to_sequence_len: true

wandb_project: essential-web-sft
wandb_name: qwen3-1.7b-sft-new-data

gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
flash_attention: true
micro_batch_size: 1

optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-5
num_epochs: 1

load_best_model_at_end: true
metric_for_best_model: loss
greater_is_better: false
early_stopping_patience: 3

bf16: auto
tf32: true
logging_steps: 5

deepspeed: ./configs_prod/zero3.json
save_steps: 500
eval_steps: 500
warmup_ratio: 0.05
# save_first_step: true
```
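The dataset is consumed through Axolotl's `chat_template` loader: each row's `conversations` list is rendered with the built-in `qwen3` chat template, and `train_on_inputs: false` restricts the loss to assistant turns. As a rough illustration of that formatting step (a sketch, not Axolotl's internal code), the snippet below renders one hypothetical conversation record with the base model's tokenizer; the message keys shown are an assumption about the dataset schema.

```python
# Sketch of how a `conversations` record is rendered with the qwen3 chat
# template before tokenization. The role/content keys are assumed, not
# taken from the dataset card.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

# Hypothetical example of one record's `conversations` list.
conversations = [
    {"role": "user", "content": "Summarize the main idea of this passage."},
    {"role": "assistant", "content": "The passage argues that ..."},
]

# Render the full conversation as training text (no generation prompt appended).
text = tokenizer.apply_chat_template(
    conversations,
    tokenize=False,
    add_generation_prompt=False,
)
print(text)
```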

# output/1.7B-Instruct-Tuned-New-Data

This model is a fine-tuned version of [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) on the sumuks/essential-web-v1.0-sample-100M-with-cleaned-responses-sft dataset.
It achieves the following results on the evaluation set:
- Loss: 0.3669

## Model description

More information needed

## Intended uses & limitations

More information needed. A minimal inference sketch is provided at the end of this card.

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 4
- total_train_batch_size: 8 (micro-batch 1 × 4 accumulation steps × 2 GPUs)
- total_eval_batch_size: 2
- optimizer: paged_adamw_8bit with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 164
- training_steps: 3297

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| No log        | 0      | 0    | 0.8829          |
| 0.3689        | 0.1517 | 500  | 0.4088          |
| 0.3919        | 0.3033 | 1000 | 0.3952          |
| 0.386         | 0.4550 | 1500 | 0.3839          |
| 0.409         | 0.6066 | 2000 | 0.3755          |
| 0.3473        | 0.7583 | 2500 | 0.3694          |
| 0.3518        | 0.9099 | 3000 | 0.3669          |

### Framework versions

- Transformers 4.53.1
- PyTorch 2.7.1+cu126
- Datasets 3.6.0
- Tokenizers 0.21.2
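## Inference sketch

A minimal usage sketch with `transformers`, assuming the checkpoint is loaded from a local path or a published repository; the `model_id` below is a placeholder, and the prompt and generation settings are illustrative rather than recommended values.

```python
# Minimal inference sketch for the fine-tuned model.
# NOTE: `model_id` is a placeholder; point it at the actual checkpoint
# directory or Hub repository.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "./output/1.7B-Instruct-Tuned-New-Data"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # training ran in bf16
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Give a one-sentence summary of photosynthesis."}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```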