---
datasets:
- NeelNanda/c4-code-20k
tags:
- mechanistic_interpretability
---

### GELU_2L512W_C4_Code Model Card

**Model Overview**

- **Model Name:** GELU_2L512W_C4_Code
- **Version:** 201
- **Primary Application:** Code-related tasks
- **Model Architecture:** Transformer-based
- **Activation Function:** GELU (Gaussian Error Linear Unit)
- **Normalization:** Layer Normalization (LN)

**Model Specifications**

- **Number of Layers:** 2
- **Model Dimension (d_model):** 512
- **MLP Dimension (d_mlp):** 2048
- **Head Dimension (d_head):** 64
- **Number of Heads (n_heads):** 8
- **Context Size (n_ctx):** 1024
- **Vocabulary Size (d_vocab):** 48,262
- **Number of Parameters:** 6,291,456

**Training Configuration**

- **Dataset:** c4_code
- **Batch Size per Device:** 32
- **Total Batch Size:** 256
- **Batches per Step:** 1
- **Max Steps:** 83,923
- **Warmup Steps:** 1,144
- **Learning Rate Schedule:** Cosine with warmup
- **Learning Rate (Hidden Layers):** 0.002
- **Learning Rate (Vector):** 0.001
- **Optimizer Betas:** [0.9, 0.99]
- **Weight Decay:** 0.05
- **Gradient Norm Clipping:** 1.0
- **Max Tokens:** 22,000,000,000
- **Warmup Tokens:** 300,000,000
- **Truncate Tokens:** 1,000,000,000,000

**Technical Specifications**

- **Number of Devices:** 8
- **Seed:** 259123
- **bfloat16 for MatMul:** Enabled
- **Debug Options:** Disabled
- **Save Checkpoints:** Enabled
- **Tokens per Step:** 262,144
- **Initializer Scales:**
  - Global: 1.0
  - Hidden: 0.02
  - Embed: 0.1
  - Unembed: 0.02
- **Neuron Scale:** 1.0
- **Neuron Temperature:** 1.0
- **Weight Initialization Scheme:** GPT-2
- **Fixed Initialization:** 2L512W_init

**Tokenizer**

- **Name:** NeelNanda/gpt-neox-tokenizer-digits

**Miscellaneous**

- **Layer-wise Learning Rate Decay:** 0.99
- **Log Interval:** 50
- **Control Parameter:** 1.0
- **Shortformer Positional Embedding:** Disabled
- **Attention Only:** False
- **Use Accelerated Computation:** False
- **Layer Normalization Epsilon:** 1e-05

**Model Limitations & Ethical Considerations**

- Because this model was trained specifically on code data, it is optimized for code-related tasks and may perform poorly on non-code text.
- As with any language model, results will vary with the complexity and specificity of the task.
- Ethical considerations should be weighed before deployment, especially in contexts where automation could significantly affect human labor or decision-making.

**Notes for Users**

- The model's behavior depends on hyperparameter choices and the specific nature of the input data.
- Users are encouraged to review the specifications and training configuration above to adapt the model to their needs.

---

*This model card provides a detailed overview of the GELU_2L512W_C4_Code model. Refer to additional documentation and resources for comprehensive guidelines and best practices for deploying it.*
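Several of the figures above are arithmetic consequences of the others, which makes a quick consistency check possible. The sketch below assumes the stated parameter count covers only the transformer-block weights (attention and MLP matrices, excluding embedding, unembedding, bias, and LayerNorm parameters) — an assumption, though a common convention for small interpretability models:

```python
# Architecture figures taken from the card.
n_layers, d_model, d_mlp, d_head, n_heads = 2, 512, 2048, 64, 8

# Attention: W_Q, W_K, W_V, W_O, each of size d_model x (n_heads * d_head).
attn_params_per_layer = 4 * d_model * n_heads * d_head   # 1,048,576
# MLP: W_in (d_model x d_mlp) and W_out (d_mlp x d_model).
mlp_params_per_layer = 2 * d_model * d_mlp               # 2,097,152

block_params = n_layers * (attn_params_per_layer + mlp_params_per_layer)
print(block_params)       # 6291456 -- matches the stated parameter count

# Tokens per step = total batch size x context length.
tokens_per_step = 256 * 1024
print(tokens_per_step)    # 262144 -- matches the card

# Total tokens over the full run, just under the 22B "Max Tokens" cap.
total_tokens = 83_923 * tokens_per_step
print(total_tokens)       # 21999910912
```

Note that the run length (83,923 steps) is simply the 22B-token budget divided by 262,144 tokens per step, rounded down.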
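The schedule entry ("cosine with warmup", 1,144 warmup steps out of 83,923) corresponds to the common pattern of linear warmup followed by cosine decay. The card does not specify the decay floor or exact shape used in training, so the function below is a hedged sketch of that pattern rather than the training code:

```python
import math

PEAK_LR = 2e-3        # hidden-layer learning rate from the card
WARMUP_STEPS = 1_144  # warmup steps from the card
MAX_STEPS = 83_923    # max steps from the card

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay toward zero.

    A typical realization of a cosine-warmup schedule; the decay
    floor (zero here) is an assumption, not taken from the card.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

print(lr_at(0))             # 0.0
print(lr_at(WARMUP_STEPS))  # 0.002 (peak)
```

Under this reading, the separate "Warmup Tokens" figure (300,000,000) is consistent with the step-based warmup: 1,144 steps at 262,144 tokens per step is roughly 300M tokens.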