# Test-shakespeare
Trained with **transformer-toolkit**.
## Architecture
| param | value |
|---|---|
| `vocab_size` | `32000` |
| `dim` | `512` |
| `n_layers` | `12` |
| `n_heads` | `8` |
| `max_seq` | `512` |
| `attn` | `gqa` |
| `n_kv_heads` | `2` |
| `latent_dim` | `64` |
| `ffn` | `swiglu` |
| `hidden_dim` | `1536` |
| `n_experts` | `8` |
| `top_k` | `2` |
| `moe_aux_weight` | `0.01` |
| `moe_capacity` | `1.0` |
| `moe_n_shared` | `2` |
| `moe_n_routed` | `6` |
| `norm` | `rmsnorm` |
| `eps` | `1e-06` |
| `pos_enc` | `rope` |
| `dropout` | `0.1` |
| `tie_weights` | `False` |
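
A few quantities follow directly from the table. The sketch below works them out in plain Python. `ModelConfig` and `derived` are hypothetical names used only for illustration and are not transformer-toolkit's actual API. Treating the shared and routed expert counts as summing to `n_experts` is an assumption about how those fields relate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical container mirroring a subset of the table above."""
    vocab_size: int = 32000
    dim: int = 512
    n_layers: int = 12
    n_heads: int = 8
    max_seq: int = 512
    n_kv_heads: int = 2
    hidden_dim: int = 1536
    n_experts: int = 8
    top_k: int = 2
    moe_n_shared: int = 2
    moe_n_routed: int = 6
    tie_weights: bool = False


def derived(cfg: ModelConfig) -> dict:
    """Compute values implied by the config using simple arithmetic."""
    # Attention heads must split the model width evenly.
    assert cfg.dim % cfg.n_heads == 0
    # In grouped-query attention, query heads share KV heads in equal groups.
    assert cfg.n_heads % cfg.n_kv_heads == 0

    embedding = cfg.vocab_size * cfg.dim
    return {
        "head_dim": cfg.dim // cfg.n_heads,
        "queries_per_kv_head": cfg.n_heads // cfg.n_kv_heads,
        # Assumption: shared + routed experts are expected to equal n_experts.
        "experts_sum_matches": cfg.moe_n_shared + cfg.moe_n_routed == cfg.n_experts,
        # With untied weights, the input embedding and output head are separate.
        "embedding_params": embedding if cfg.tie_weights else 2 * embedding,
    }


if __name__ == "__main__":
    for name, value in derived(ModelConfig()).items():
        print(f"{name}: {value}")
```

With these values each head is 64 wide and every KV head serves 4 query heads. Because `tie_weights` is `False`, the token embedding and output projection together account for 2 × 32000 × 512 parameters.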
## Metrics
| metric | value |
|---|---|
| `val_loss` | `3.949221694469452` |
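
For a more intuitive reading of the loss, it can be converted to perplexity. The sketch below assumes `val_loss` is the mean per-token cross-entropy in nats, which this card does not state explicitly. `perplexity` is a helper name introduced here for illustration.

```python
import math


def perplexity(mean_cross_entropy_nats: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_cross_entropy_nats)


val_loss = 3.949221694469452  # value reported in the Metrics table
ppl = perplexity(val_loss)
print(f"perplexity: {ppl:.2f}")
```

Under that assumption the reported loss corresponds to a perplexity of roughly 52. If the loss were averaged per character or measured in bits, the conversion would differ.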