---
license: gpl-3.0
datasets:
- Mxode/BiST
language:
- en
- zh
pipeline_tag: translation
library_name: transformers
---
# NanoTranslator-Experimental

## Models
| Arch. | Act. | Vocab | Hidden | Interm. | Layers | Heads | KV Heads | Tied Emb. |
|---|---|---|---|---|---|---|---|---|
| LLaMA | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Qwen2 | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Mistral | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Gemma | GeGLU(Tanh) | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Gemma2 | GeGLU(Tanh) | 2K | 256 | 768 | 2 | 8 | 4 | True |
| OLMo | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Cohere | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Phi | GeGLU | 2K | 256 | 1024 | 2 | 8 | 4 | True |
| StarCoder2 | GeGLU(Tanh) | 2K | 256 | 768 | 2 | 8 | 4 | True |
| StableLM | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| GPT2 | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| GPT-J | GeGLU | 2K | 256 | 1024 | 2 | 4 | 4 | True |
| GPT-NeoX | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| Bloom | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| MPT | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| RWKV | - | 2K | 256 | 1024 | 2 | - | - | True |
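
As a reading aid, here is a minimal sketch of how the LLaMA row above could map onto a `transformers` config. The field mapping and the `max_position_embeddings` value are assumptions, not the released configuration.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Hypothetical reconstruction of the LLaMA row above: 2K vocab, 256 hidden,
# 768 intermediate, 2 layers, 8 attention heads, 4 KV heads (GQA),
# SwiGLU activation, tied input/output embeddings.
config = LlamaConfig(
    vocab_size=2048,
    hidden_size=256,
    intermediate_size=768,
    num_hidden_layers=2,
    num_attention_heads=8,
    num_key_value_heads=4,
    hidden_act="silu",              # gate activation of the SwiGLU MLP
    tie_word_embeddings=True,
    max_position_embeddings=512,    # not listed in the table; placeholder
)

model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```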
## Experimental Setup

| Hyperparameter | Value |
|---|---|
| Batch Size | 1024 |
| Grad Acc Steps | 1 |
| Max LR | 1.5 * 10^-3 |
| LR Scheduler | Trapezoidal / Cosine |
| Warmup Ratio | 0.01 |
| Decay Ratio | 0.35 |
| Decay Progress | Exponential |
| Min Decay LR | 0.01 * Max LR |
| Optimizer | AdamW |
| Weight Decay | 0.1 |
| Max Grad Norm | 1.0 |
| Num Epochs | 1 |
| FP16 | True |
| Device | Tesla-V100-SXM2-32GB |
| Seed | 3407 |
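
Below is a minimal sketch of the trapezoidal (warmup-stable-decay) schedule as a PyTorch `LambdaLR` multiplier, using the warmup ratio of 0.01, decay ratio of 0.35, exponential decay progress, and a floor of 0.01 × Max LR from the table; the exact curve used in the experiments is an assumption.

```python
from torch.optim.lr_scheduler import LambdaLR

def trapezoidal_lambda(total_steps, warmup_ratio=0.01, decay_ratio=0.35, min_lr_ratio=0.01):
    """LR multiplier: linear warmup -> constant plateau -> exponential decay."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    decay_steps = max(1, int(total_steps * decay_ratio))
    stable_end = total_steps - decay_steps

    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps                      # linear warmup
        if step < stable_end:
            return 1.0                                      # hold at Max LR
        # exponential decay from 1.0 down to min_lr_ratio over the decay phase
        progress = min(1.0, (step - stable_end) / decay_steps)
        return min_lr_ratio ** progress

    return lr_lambda

# usage (hypothetical optimizer and step count):
# scheduler = LambdaLR(optimizer, lr_lambda=trapezoidal_lambda(total_steps=10_000))
```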
## Results

Final Loss below is the average loss over the last 10 training steps.

### Trapezoidal vs. Cosine

| Arch. | Training Speed (it/s) | Total Loss (Trapezoidal) | Total Loss (Cosine) | Final Loss (Trapezoidal) | Final Loss (Cosine) |
|---|---|---|---|---|---|
| LLaMA | 4.35 | 1.5734 | 1.5626 | 1.2784 | 1.2855 |
| Qwen2 | 4.41 | 1.5735 | 1.5565 | 1.2760 | 1.2943 |
| Mistral | 4.44 | 1.5756 | 1.5645 | 1.2787 | 1.3004 |
| Gemma | 1.79 | 1.3894 | 1.3737 | 1.0841 | 1.1010 |
| Gemma2 | 1.59 | 1.3754 | 1.3597 | 1.0601 | 1.0752 |
| OLMo | 5.00 | 1.6011 | 1.5855 | 1.2857 | 1.3039 |
| Cohere | 4.04 | 2.1327 | 2.1152 | 1.6244 | 1.6593 |
| Phi | 5.78 | 1.7525 | 1.7419 | 1.4770 | 1.4876 |
| StarCoder2 | 3.01 | 1.6125 | 1.6498 | 1.3044 | 1.3718 |
| StableLM | 5.06 | 1.5835 | 1.5905 | 1.2662 | 1.2998 |
| GPT2 | 3.53 | 2.1100 | 2.1081 | 1.8236 | 1.8508 |
| GPT-J | 3.06 | 1.7198 | 1.6976 | 1.4503 | 1.4541 |
| GPT-NeoX | 5.06 | 1.7233 | 1.6981 | 1.4400 | 1.4303 |
| Bloom | 3.33 | 1.6910 | 1.6704 | 1.3690 | 1.3774 |
| MPT | 4.39 | 1.6466 | 1.6317 | 1.3443 | 1.3550 |
| RWKV | 0.72 | 3.0151 | 3.0810 | 1.8569 | 1.9628 |
| Avg. | - | 1.755 | 1.749 | 1.389 | 1.413 |
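
For clarity, a short sketch of how the two loss columns can be reduced from a per-step loss log. That Total Loss is the mean over all logged steps is an assumption; only the last-10-steps average is defined by the column name.

```python
def summarize_losses(step_losses):
    """Reduce a per-step training-loss log to the two reported metrics."""
    total_loss = sum(step_losses) / len(step_losses)               # assumed: mean over all steps
    final_loss = sum(step_losses[-10:]) / len(step_losses[-10:])   # last-10-steps average
    return total_loss, final_loss
```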
### BF16 & FP16

| Arch. | Total Loss (FP16) | Total Loss (BF16) | Final Loss (FP16) | Final Loss (BF16) |
|---|---|---|---|---|
| LLaMA | 1.5734 | 1.5714 | 1.2784 | 1.2758 |
| Qwen2 | 1.5735 | 1.5675 | 1.2760 | 1.2764 |
| Mistral | 1.5756 | 1.5694 | 1.2787 | 1.2740 |
| OLMo | 1.6011 | 1.6059 | 1.2857 | 1.2901 |
| Cohere | 2.1327 | 2.1112 | 1.6244 | 1.6346 |
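
A sketch of how the two precision modes could be toggled through `transformers.TrainingArguments`; every other argument here is a placeholder rather than the actual training configuration. Note that BF16 requires Ampere-class or newer GPUs, unlike the V100 used for the FP16 runs.

```python
from transformers import TrainingArguments

# Placeholder arguments; only the precision flags differ between the two runs.
fp16_args = TrainingArguments(
    output_dir="out-fp16",
    fp16=True,                          # mixed precision with float16
    per_device_train_batch_size=1024,
    learning_rate=1.5e-3,
    seed=3407,
)

bf16_args = TrainingArguments(
    output_dir="out-bf16",
    bf16=True,                          # bfloat16; needs Ampere or newer GPUs
    per_device_train_batch_size=1024,
    learning_rate=1.5e-3,
    seed=3407,
)
```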
## Citation

```bibtex
@misc{NanoExperiment,
    title={NanoExperiment},
    url={https://huggingface.co/Mxode/NanoExperiment-Models},
    author={Mxode},
    month={September},
    year={2024}
}
```