---
license: gpl-3.0
datasets:
  - Mxode/BiST
language:
  - en
  - zh
pipeline_tag: translation
library_name: transformers
---

# NanoTranslator-Experimental

## Models

All models share the same scale: a 2K vocabulary, 256 hidden size, 2 layers, and tied word embeddings.

| Arch. | Activation | Vocab | Hidden | Intermediate | Layers | Attn Heads | KV Heads | Tie Emb. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Qwen2 | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Mistral | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Gemma | GeGLU (Tanh) | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Gemma2 | GeGLU (Tanh) | 2K | 256 | 768 | 2 | 8 | 4 | True |
| OLMo | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Cohere | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Phi | GeGLU | 2K | 256 | 1024 | 2 | 8 | 4 | True |
| StarCoder2 | GeGLU (Tanh) | 2K | 256 | 768 | 2 | 8 | 4 | True |
| StableLM | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| GPT2 | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| GPT-J | GeGLU | 2K | 256 | 1024 | 2 | 4 | 4 | True |
| GPT-NeoX | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| Bloom | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| MPT | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| RWKV | - | 2K | 256 | 1024 | 2 | - | - | True |
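For a rough sense of scale, the parameter count can be estimated directly from a row of the table. The sketch below uses the LLaMA row; note the assumptions in it (vocab "2K" read as 2048, no biases, norm weights ignored, embeddings tied and counted once) are my reading of the table, not figures from the card.

```python
# Rough parameter estimate for the LLaMA row of the table above.
# Assumptions (not confirmed by the card): "2K" vocab = 2048,
# head_dim = hidden / attn_heads, SwiGLU = gate + up + down projections,
# biases omitted, norm weights ignored, tied embeddings counted once.
vocab, hidden, interm, layers = 2048, 256, 768, 2
attn_heads, kv_heads = 8, 4
head_dim = hidden // attn_heads                 # 32

embed = vocab * hidden                          # tied in/out embedding
q_and_o = 2 * hidden * hidden                   # query + output projections
kv = 2 * hidden * (kv_heads * head_dim)         # GQA: fewer K/V heads
mlp = 3 * hidden * interm                       # SwiGLU: gate, up, down

total = embed + layers * (q_and_o + kv + mlp)
print(f"~{total / 1e6:.1f}M parameters")        # ~2.1M parameters
```

The same arithmetic applied to other rows shows why the GeGLU models with a 1024 intermediate size land slightly larger than the 768-intermediate SwiGLU ones.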

## Experimental Setup

| Hyperparameter | Value |
| --- | --- |
| Batch Size | 1024 |
| Grad Acc Steps | 1 |
| Max LR | 1.5e-3 |
| LR Scheduler | Trapezoidal / Cosine |
| Warmup Ratio | 0.01 |
| Decay Ratio | 0.35 |
| Decay Progress | Exponential |
| Min Decay LR | 0.01 × Max LR |
| Optimizer | AdamW |
| Weight Decay | 0.1 |
| Max Grad Norm | 1.0 |
| Num Epochs | 1 |
| FP16 | True |
| Device | Tesla-V100-SXM2-32GB |
| Seed | 3407 |
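The trapezoidal schedule above (1% linear warmup, a constant plateau, then a decay phase covering the final 35% of training down to 0.01 × Max LR) can be sketched as a step-to-LR function. The exact phase boundaries and the exponential decay shape here are my reading of the table, not code from the experiments.

```python
def trapezoidal_lr(step, total_steps, max_lr=1.5e-3,
                   warmup_ratio=0.01, decay_ratio=0.35,
                   min_lr_ratio=0.01):
    """Linear warmup -> constant plateau -> exponential decay.

    A sketch of one plausible reading of the setup table; the actual
    scheduler used in the experiments may differ in detail.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    decay_steps = int(total_steps * decay_ratio)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:                       # linear warmup
        return max_lr * step / max(1, warmup_steps)
    if step < decay_start:                        # constant plateau
        return max_lr
    # exponential decay from max_lr toward min_lr_ratio * max_lr
    progress = (step - decay_start) / max(1, decay_steps)
    return max_lr * min_lr_ratio ** progress

lr = trapezoidal_lr(10_000, 10_000)               # -> 0.01 * max_lr
```

Unlike cosine decay, the plateau keeps the LR at its maximum for most of training, which is what makes the speed/loss comparison in the next section meaningful.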

## Results

### Trapezoidal vs. Cosine

Final Loss is averaged over the last 10 steps.

| Arch. | Training Speed (it/s) | Total Loss (Trapezoidal) | Total Loss (Cosine) | Final Loss (Trapezoidal) | Final Loss (Cosine) |
| --- | --- | --- | --- | --- | --- |
| LLaMA | 4.35 | 1.5734 | 1.5626 | 1.2784 | 1.2855 |
| Qwen2 | 4.41 | 1.5735 | 1.5565 | 1.2760 | 1.2943 |
| Mistral | 4.44 | 1.5756 | 1.5645 | 1.2787 | 1.3004 |
| Gemma | 1.79 | 1.3894 | 1.3737 | 1.0841 | 1.1010 |
| Gemma2 | 1.59 | 1.3754 | 1.3597 | 1.0601 | 1.0752 |
| OLMo | 5.00 | 1.6011 | 1.5855 | 1.2857 | 1.3039 |
| Cohere | 4.04 | 2.1327 | 2.1152 | 1.6244 | 1.6593 |
| Phi | 5.78 | 1.7525 | 1.7419 | 1.4770 | 1.4876 |
| StarCoder2 | 3.01 | 1.6125 | 1.6498 | 1.3044 | 1.3718 |
| StableLM | 5.06 | 1.5835 | 1.5905 | 1.2662 | 1.2998 |
| GPT2 | 3.53 | 2.1100 | 2.1081 | 1.8236 | 1.8508 |
| GPT-J | 3.06 | 1.7198 | 1.6976 | 1.4503 | 1.4541 |
| GPT-NeoX | 5.06 | 1.7233 | 1.6981 | 1.4400 | 1.4303 |
| Bloom | 3.33 | 1.6910 | 1.6704 | 1.3690 | 1.3774 |
| MPT | 4.39 | 1.6466 | 1.6317 | 1.3443 | 1.3550 |
| RWKV | 0.72 | 3.0151 | 3.0810 | 1.8569 | 1.9628 |
| Avg. | - | 1.755 | 1.749 | 1.389 | 1.413 |

### BF16 & FP16

Final Loss is averaged over the last 10 steps.

| Arch. | Total Loss (FP16) | Total Loss (BF16) | Final Loss (FP16) | Final Loss (BF16) |
| --- | --- | --- | --- | --- |
| LLaMA | 1.5734 | 1.5714 | 1.2784 | 1.2758 |
| Qwen2 | 1.5735 | 1.5675 | 1.2760 | 1.2764 |
| Mistral | 1.5756 | 1.5694 | 1.2787 | 1.2740 |
| OLMo | 1.6011 | 1.6059 | 1.2857 | 1.2901 |
| Cohere | 2.1327 | 2.1112 | 1.6244 | 1.6346 |

## Citation

```bibtex
@misc{NanoExperiment,
    title={NanoExperiment},
    url={https://huggingface.co/Mxode/NanoExperiment-Models},
    author={Mxode},
    month={September},
    year={2024}
}
```