---
datasets:
- NeelNanda/c4-code-20k
tags:
- mechanistic_interpretability
---

### GELU_2L512W_C4_Code Model Card

**Model Overview**

- **Model Name:** GELU_2L512W_C4_Code
- **Version:** 201
- **Primary Application:** Code-related tasks
- **Model Architecture:** Transformer-based
- **Activation Function:** GELU (Gaussian Error Linear Unit)
- **Normalization:** Layer Normalization (LN)

**Model Specifications**

- **Number of Layers:** 2
- **Model Dimension (d_model):** 512
- **MLP Dimension (d_mlp):** 2048
- **Head Dimension (d_head):** 64
- **Number of Heads (n_heads):** 8
- **Context Size (n_ctx):** 1024
- **Vocabulary Size (d_vocab):** 48,262
- **Number of Parameters:** 6,291,456

**Training Configuration**

- **Dataset:** c4_code
- **Batch Size per Device:** 32
- **Total Batch Size:** 256
- **Batches per Step:** 1
- **Max Steps:** 83,923
- **Warmup Steps:** 1,144
- **Learning Rate Schedule:** Cosine with warmup
- **Learning Rate (Hidden Layers):** 0.002
- **Learning Rate (Vector):** 0.001
- **Optimizer Betas:** [0.9, 0.99]
- **Weight Decay:** 0.05
- **Gradient Norm Clipping:** 1.0
- **Max Tokens:** 22,000,000,000
- **Warmup Tokens:** 300,000,000
- **Truncate Tokens:** 1,000,000,000,000

**Technical Specifications**

- **Number of Devices:** 8
- **Seed:** 259123
- **bfloat16 for MatMul:** Enabled
- **Debug Options:** Disabled
- **Save Checkpoints:** Enabled
- **Tokens per Step:** 262,144
- **Initializer Scales:**
  - Global: 1.0
  - Hidden: 0.02
  - Embed: 0.1
  - Unembed: 0.02
- **Neuron Scale:** 1.0
- **Neuron Temperature:** 1.0
- **Weight Initialization Scheme:** GPT-2
- **Fixed Initialization:** 2L512W_init

**Tokenizer**

- **Name:** NeelNanda/gpt-neox-tokenizer-digits

**Miscellaneous**

- **Layer-wise Learning Rate Decay:** 0.99
- **Log Interval:** 50
- **Control Parameter:** 1.0
- **Shortformer Positional Embedding:** Disabled
- **Attention Only:** False
- **Use Accelerated Computation:** False
- **Layer Normalization Epsilon:** 1e-05

**Model Limitations & Ethical Considerations**

- Because this model was trained specifically on code data, it is optimized for code-related tasks and may perform poorly on non-code text.
- As with any language model, results will vary with the complexity and specificity of the task.
- Ethical considerations should be weighed before deployment, especially in contexts where automation could significantly affect human labor or decision-making.

**Notes for Users**

- The model's behavior depends on hyperparameter choices and the specific nature of the input data.
- Users are encouraged to review the specifications and training configuration above to adapt the model to their needs.

---

*This model card provides a detailed overview of the GELU_2L512W_C4_Code model. Refer to additional documentation and resources for comprehensive guidelines and best practices for deploying it.*
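Several of the figures above are arithmetic consequences of the others, which makes a quick consistency check possible. The sketch below assumes the stated parameter count covers only the transformer-block weights (attention and MLP matrices, excluding embedding, unembedding, bias, and LayerNorm parameters) — an assumption, though a common convention for small interpretability models:

```python
# Architecture figures taken from the card.
n_layers, d_model, d_mlp, d_head, n_heads = 2, 512, 2048, 64, 8

# Attention: W_Q, W_K, W_V, W_O, each of size d_model x (n_heads * d_head).
attn_params_per_layer = 4 * d_model * n_heads * d_head   # 1,048,576
# MLP: W_in (d_model x d_mlp) and W_out (d_mlp x d_model).
mlp_params_per_layer = 2 * d_model * d_mlp               # 2,097,152

block_params = n_layers * (attn_params_per_layer + mlp_params_per_layer)
print(block_params)       # 6291456 -- matches the stated parameter count

# Tokens per step = total batch size x context length.
tokens_per_step = 256 * 1024
print(tokens_per_step)    # 262144 -- matches the card

# Total tokens over the full run, just under the 22B "Max Tokens" cap.
total_tokens = 83_923 * tokens_per_step
print(total_tokens)       # 21999910912
```

Note that the run length (83,923 steps) is simply the 22B-token budget divided by 262,144 tokens per step, rounded down.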
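The schedule entry ("cosine with warmup", 1,144 warmup steps out of 83,923) corresponds to the common pattern of linear warmup followed by cosine decay. The card does not specify the decay floor or exact shape used in training, so the function below is a hedged sketch of that pattern rather than the training code:

```python
import math

PEAK_LR = 2e-3        # hidden-layer learning rate from the card
WARMUP_STEPS = 1_144  # warmup steps from the card
MAX_STEPS = 83_923    # max steps from the card

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay toward zero.

    A typical realization of a cosine-warmup schedule; the decay
    floor (zero here) is an assumption, not taken from the card.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

print(lr_at(0))             # 0.0
print(lr_at(WARMUP_STEPS))  # 0.002 (peak)
```

Under this reading, the separate "Warmup Tokens" figure (300,000,000) is consistent with the step-based warmup: 1,144 steps at 262,144 tokens per step is roughly 300M tokens.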