# moe_d_het
This model is a fine-tuned version of an unspecified base model on the arrow dataset. It achieves the following results on the evaluation set:
- Loss: 4.7312
## Model description
More information needed
## Intended uses & limitations
More information needed
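In the absence of that information, a minimal loading sketch is given below. It assumes the checkpoint is a causal language model (consistent with the language-modeling-style losses under Training results) and uses a hypothetical repo id; substitute the correct Auto class and repo id for the actual model.

```python
# Hedged usage sketch: the repo id is a hypothetical placeholder, and
# AutoModelForCausalLM is an assumption about the task type.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/moe_d_het"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```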
## Training and evaluation data
More information needed
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training (see the `TrainingArguments` sketch after this list):
- learning_rate: 0.0001
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 16
- total_train_batch_size: 1024
- total_eval_batch_size: 64
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-06; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 15297
- mixed_precision_training: Native AMP
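For reference, these values map onto `transformers.TrainingArguments` roughly as in the sketch below. This is an assumption-laden reconstruction, not the actual training script: `output_dir` is a hypothetical placeholder, and the two-device setup comes from the launcher (e.g. `torchrun --nproc_per_node=2`), not from these arguments. Note that 32 per device × 2 GPUs × 16 accumulation steps gives the effective train batch size of 1024 listed above.

```python
# Sketch of TrainingArguments matching the hyperparameters above.
# Only the numeric values come from this card; output_dir is hypothetical.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="moe_d_het",          # hypothetical
    learning_rate=1e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    gradient_accumulation_steps=16,  # 32 x 2 GPUs x 16 = 1024 effective
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-6,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=15297,
    fp16=True,                       # Native AMP; could be bf16 instead
)
```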
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| No log | 0 | 0 | 10.9726 |
| 6.3815 | 0.6537 | 1000 | 6.1101 |
| 5.6893 | 1.0 | 1530 | 5.4428 |
| 5.2263 | 1.3072 | 2000 | 5.1823 |
| 4.916 | 1.9609 | 3000 | 4.8803 |
| 4.916 | 2.0 | 3060 | 4.8665 |
| 4.6588 | 2.6145 | 4000 | 4.7204 |
| 4.4151 | 3.2680 | 5000 | 4.6267 |
| 4.3816 | 3.9217 | 6000 | 4.5479 |
| 4.1873 | 4.5752 | 7000 | 4.5310 |
| 3.9649 | 5.2288 | 8000 | 4.5390 |
| 3.9952 | 5.8825 | 9000 | 4.5147 |
| 3.7926 | 6.5360 | 10000 | 4.5653 |
| 3.5756 | 7.1896 | 11000 | 4.6144 |
| 3.6211 | 7.8432 | 12000 | 4.6217 |
| 3.4485 | 8.4968 | 13000 | 4.6837 |
| 3.4576 | 9.0 | 13770 | 4.6867 |
| 3.3107 | 9.1503 | 14000 | 4.7196 |
| 3.3235 | 9.8040 | 15000 | 4.7310 |
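Validation loss bottoms out around 4.51 near step 9000 (epoch ~5.9) and then drifts upward while training loss keeps falling, suggesting the model begins to overfit in the later epochs; the reported evaluation loss of 4.7312 is consistent with the final logged checkpoint (4.7310 at step 15000).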
### Framework versions
- Transformers 4.53.1
- PyTorch 2.7.0+cu126
- Datasets 3.6.0
- Tokenizers 0.21.1
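As a sanity check, the snippet below compares an installed environment against the versions above. The package names are the standard PyPI distributions; `torch` reports the `+cu126` local version suffix when the CUDA 12.6 wheel is installed.

```python
# Compare installed package versions against those listed in this card.
import importlib.metadata as md

EXPECTED = {
    "transformers": "4.53.1",
    "torch": "2.7.0+cu126",
    "datasets": "3.6.0",
    "tokenizers": "0.21.1",
}

for pkg, want in EXPECTED.items():
    try:
        have = md.version(pkg)
    except md.PackageNotFoundError:
        have = "not installed"
    print(f"{pkg}: installed {have}, card lists {want}")
```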