---
datasets:
- NeelNanda/c4-code-20k
tags:
- mechanistic_interpretability
---

### GELU_2L512W_C4_Code Model Card

**Model Overview**

- **Model Name:** GELU_2L512W_C4_Code
- **Version:** 201
- **Primary Application:** Code-related tasks
- **Model Architecture:** Transformer-based
- **Activation Function:** GELU (Gaussian Error Linear Unit)
- **Normalization:** Layer Normalization (LN)

**Model Specifications**

- **Number of Layers:** 2
- **Model Dimension (d_model):** 512
- **MLP Dimension (d_mlp):** 2048
- **Head Dimension (d_head):** 64
- **Number of Heads (n_heads):** 8
- **Context Size (n_ctx):** 1024
- **Vocabulary Size (d_vocab):** 48,262
- **Number of Parameters:** 6,291,456 (non-embedding; 2 layers × 12 × d_model² = 6,291,456)

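The architecture above maps directly onto a TransformerLens config. The following is a minimal sketch assuming the `transformer_lens` package; the field names are TransformerLens conventions rather than values taken from this card, and the `from_pretrained` alias is an assumption.

```python
from transformer_lens import HookedTransformer, HookedTransformerConfig

# Architecture mirroring the specifications listed above.
cfg = HookedTransformerConfig(
    n_layers=2,
    d_model=512,
    n_ctx=1024,
    d_head=64,
    n_heads=8,
    d_mlp=2048,
    d_vocab=48262,
    act_fn="gelu",
    normalization_type="LN",
    tokenizer_name="NeelNanda/gpt-neox-tokenizer-digits",
)
model = HookedTransformer(cfg)  # randomly initialized with this architecture

# If this checkpoint is the 2-layer GELU model registered in TransformerLens
# (commonly aliased "gelu-2l"), the trained weights load directly:
# model = HookedTransformer.from_pretrained("gelu-2l")
```
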
**Training Configurations**

- **Dataset:** c4_code
- **Batch Size per Device:** 32
- **Total Batch Size:** 256
- **Batches per Step:** 1
- **Max Steps:** 83,923
- **Warmup Steps:** 1,144
- **Learning Rate Schedule:** Cosine with Warmup
- **Learning Rate (Hidden Layers):** 0.002
- **Learning Rate (Vector):** 0.001
- **Optimizer Betas:** [0.9, 0.99]
- **Weight Decay:** 0.05
- **Gradient Norm Clipping:** 1.0
- **Max Tokens:** 22,000,000,000
- **Warmup Tokens:** 300,000,000
- **Truncate Tokens:** 1,000,000,000,000

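These step counts are consistent with the batch and context sizes: 256 sequences × 1,024 tokens = 262,144 tokens per step, so 22B tokens take ≈ 83,923 steps and 300M warmup tokens take ≈ 1,144 steps. The sketch below reconstructs a linear-warmup cosine-decay schedule from those numbers; it is illustrative, not the exact training code.

```python
import math

TOKENS_PER_STEP = 256 * 1024                     # 262,144 tokens per step
MAX_STEPS = 22_000_000_000 // TOKENS_PER_STEP    # 83,923 steps for 22B tokens
WARMUP_STEPS = 300_000_000 // TOKENS_PER_STEP    # 1,144 steps for 300M tokens

def lr_at(step: int, peak_lr: float = 2e-3) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed floor)."""
    if step < WARMUP_STEPS:
        return peak_lr * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```
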
**Technical Specifications**

- **Number of Devices:** 8
- **Seed:** 259123
- **Use of bfloat16 for MatMul:** True
- **Debug Options:** Disabled
- **Save Checkpoints:** Enabled
- **Tokens per Step:** 262,144
- **Initializer Scales:**
  - Global: 1.0
  - Hidden: 0.02
  - Embed: 0.1
  - Unembed: 0.02
- **Neuron Scale:** 1.0
- **Neuron Temperature:** 1.0
- **Weight Initialization Scheme:** GPT-2
- **Fixed Initialization:** 2L512W_init

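As a rough illustration of how the initializer scales above could enter a GPT-2-style scheme (normally distributed weights with a small fixed standard deviation), consider the sketch below. How the global, hidden, embed, and unembed scales actually combine is an assumption here, not something this card specifies.

```python
import torch

def init_weight(shape, scale: float, global_scale: float = 1.0) -> torch.Tensor:
    # Hypothetical: draw from N(0, 1) and multiply by the listed scales.
    return torch.randn(shape) * scale * global_scale

d_model, d_mlp, d_vocab = 512, 2048, 48262
W_E = init_weight((d_vocab, d_model), scale=0.1)    # embed scale
W_in = init_weight((d_model, d_mlp), scale=0.02)    # hidden scale
W_U = init_weight((d_model, d_vocab), scale=0.02)   # unembed scale
```
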
**Tokenizer**

- **Name:** NeelNanda/gpt-neox-tokenizer-digits

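Both the tokenizer and the dataset are hosted on the Hugging Face Hub and can be inspected directly. A minimal sketch using `transformers` and `datasets` follows; the `train` split and `text` column names are assumptions about the dataset layout.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NeelNanda/gpt-neox-tokenizer-digits")
dataset = load_dataset("NeelNanda/c4-code-20k", split="train")  # split name assumed

# Tokenize one example; the vocabulary should line up with d_vocab = 48,262.
tokens = tokenizer(dataset[0]["text"])["input_ids"]  # "text" column assumed
print(len(tokenizer))
```
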
**Miscellaneous**

- **Layer-wise Learning Rate Decay:** 0.99
- **Log Interval:** 50
- **Control Parameter:** 1.0
- **Shortformer Positional Embedding:** Disabled
- **Attention Only:** False
- **Use Accelerated Computation:** False
- **Layer Normalization Epsilon:** 1e-05

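For reference, layer-wise learning-rate decay with factor 0.99 typically means each layer's rate differs from the next by that factor. The card does not say which end of the network gets the smaller rate, so the sketch below picks the common convention of shrinking rates toward the input.

```python
def layer_lr(base_lr: float, layer: int, n_layers: int = 2, decay: float = 0.99) -> float:
    # Hypothetical convention: earlier (lower-index) layers get smaller rates.
    return base_lr * decay ** (n_layers - 1 - layer)

# With base_lr = 2e-3 and 2 layers: layer 0 -> 1.98e-3, layer 1 -> 2.00e-3.
```
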
**Model Limitations & Ethical Considerations**

- This model was trained on the c4_code dataset and is therefore geared toward code-related text; it may perform poorly on other domains.
- This is a small two-layer model released for mechanistic interpretability research, not a production-grade code model; results will vary with the complexity and specificity of the task.
- Ethical considerations should be taken into account when deploying this model, especially in contexts where automation could significantly impact human labor or decision-making.

**Notes for Users**

- The model's behavior depends on its hyperparameters and the nature of its training data, both documented above.
- Users are encouraged to review the specifications and training configurations above to decide whether the model fits their specific needs.

---

*This model card provides an overview of the GELU_2L512W_C4_Code model. Refer to additional documentation and resources for more comprehensive guidance on deploying and using this model.*