---
datasets:
- NeelNanda/c4-code-20k
tags:
- mechanistic_interpretability
---

### GELU_2L512W_C4_Code Model Card

**Model Overview**

- **Model Name:** GELU_2L512W_C4_Code
- **Version:** 201
- **Primary Application:** Code-related tasks
- **Model Architecture:** Transformer-based
- **Activation Function:** GELU (Gaussian Error Linear Unit)
- **Normalization:** Layer Normalization (LN)

**Model Specifications**

- **Number of Layers:** 2
- **Model Dimension (d_model):** 512
- **MLP Dimension (d_mlp):** 2048
- **Head Dimension (d_head):** 64
- **Number of Heads (n_heads):** 8
- **Context Size (n_ctx):** 1024
- **Vocabulary Size (d_vocab):** 48,262
- **Number of Parameters:** 6,291,456 (non-embedding; 2 layers × 12 × d_model² = 6,291,456)

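The architecture above maps directly onto a TransformerLens config. The following is a minimal sketch assuming the `transformer_lens` package; the field names are TransformerLens conventions rather than values taken from this card, and the `from_pretrained` alias is an assumption.

```python
from transformer_lens import HookedTransformer, HookedTransformerConfig

# Architecture mirroring the specifications listed above.
cfg = HookedTransformerConfig(
    n_layers=2,
    d_model=512,
    n_ctx=1024,
    d_head=64,
    n_heads=8,
    d_mlp=2048,
    d_vocab=48262,
    act_fn="gelu",
    normalization_type="LN",
    tokenizer_name="NeelNanda/gpt-neox-tokenizer-digits",
)
model = HookedTransformer(cfg)  # randomly initialized with this architecture

# If this checkpoint is the 2-layer GELU model registered in TransformerLens
# (commonly aliased "gelu-2l"), the trained weights load directly:
# model = HookedTransformer.from_pretrained("gelu-2l")
```
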
**Training Configurations**

- **Dataset:** c4_code
- **Batch Size per Device:** 32
- **Total Batch Size:** 256
- **Batches per Step:** 1
- **Max Steps:** 83,923
- **Warmup Steps:** 1,144
- **Learning Rate Schedule:** Cosine with Warmup
- **Learning Rate (Hidden Layers):** 0.002
- **Learning Rate (Vector):** 0.001
- **Optimizer Betas:** [0.9, 0.99]
- **Weight Decay:** 0.05
- **Gradient Norm Clipping:** 1.0
- **Max Tokens:** 22,000,000,000
- **Warmup Tokens:** 300,000,000
- **Truncate Tokens:** 1,000,000,000,000

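These step counts are consistent with the batch and context sizes: 256 sequences × 1,024 tokens = 262,144 tokens per step, so 22B tokens take ≈ 83,923 steps and 300M warmup tokens take ≈ 1,144 steps. The sketch below reconstructs a linear-warmup cosine-decay schedule from those numbers; it is illustrative, not the exact training code.

```python
import math

TOKENS_PER_STEP = 256 * 1024                     # 262,144 tokens per step
MAX_STEPS = 22_000_000_000 // TOKENS_PER_STEP    # 83,923 steps for 22B tokens
WARMUP_STEPS = 300_000_000 // TOKENS_PER_STEP    # 1,144 steps for 300M tokens

def lr_at(step: int, peak_lr: float = 2e-3) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed floor)."""
    if step < WARMUP_STEPS:
        return peak_lr * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```
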
**Technical Specifications**

- **Number of Devices:** 8
- **Seed:** 259123
- **Use of bfloat16 for MatMul:** True
- **Debug Options:** Disabled
- **Save Checkpoints:** Enabled
- **Tokens per Step:** 262,144
- **Initializer Scales:**
  - Global: 1.0
  - Hidden: 0.02
  - Embed: 0.1
  - Unembed: 0.02
- **Neuron Scale:** 1.0
- **Neuron Temperature:** 1.0
- **Weight Initialization Scheme:** GPT-2
- **Fixed Initialization:** 2L512W_init

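As a rough illustration of how the initializer scales above could enter a GPT-2-style scheme (normally distributed weights with a small fixed standard deviation), consider the sketch below. How the global, hidden, embed, and unembed scales actually combine is an assumption here, not something this card specifies.

```python
import torch

def init_weight(shape, scale: float, global_scale: float = 1.0) -> torch.Tensor:
    # Hypothetical: draw from N(0, 1) and multiply by the listed scales.
    return torch.randn(shape) * scale * global_scale

d_model, d_mlp, d_vocab = 512, 2048, 48262
W_E = init_weight((d_vocab, d_model), scale=0.1)    # embed scale
W_in = init_weight((d_model, d_mlp), scale=0.02)    # hidden scale
W_U = init_weight((d_model, d_vocab), scale=0.02)   # unembed scale
```
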
**Tokenizer**

- **Name:** NeelNanda/gpt-neox-tokenizer-digits

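Both the tokenizer and the dataset are hosted on the Hugging Face Hub and can be inspected directly. A minimal sketch using `transformers` and `datasets` follows; the `train` split and `text` column names are assumptions about the dataset layout.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NeelNanda/gpt-neox-tokenizer-digits")
dataset = load_dataset("NeelNanda/c4-code-20k", split="train")  # split name assumed

# Tokenize one example; the vocabulary should line up with d_vocab = 48,262.
tokens = tokenizer(dataset[0]["text"])["input_ids"]  # "text" column assumed
print(len(tokenizer))
```
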
**Miscellaneous**

- **Layer-wise Learning Rate Decay:** 0.99
- **Log Interval:** 50
- **Control Parameter:** 1.0
- **Shortformer Positional Embedding:** Disabled
- **Attention Only:** False
- **Use Accelerated Computation:** False
- **Layer Normalization Epsilon:** 1e-05

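For reference, layer-wise learning-rate decay with factor 0.99 typically means each layer's rate differs from the next by that factor. The card does not say which end of the network gets the smaller rate, so the sketch below picks the common convention of shrinking rates toward the input.

```python
def layer_lr(base_lr: float, layer: int, n_layers: int = 2, decay: float = 0.99) -> float:
    # Hypothetical convention: earlier (lower-index) layers get smaller rates.
    return base_lr * decay ** (n_layers - 1 - layer)

# With base_lr = 2e-3 and 2 layers: layer 0 -> 1.98e-3, layer 1 -> 2.00e-3.
```
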
**Model Limitations & Ethical Considerations**

- This model was trained on the c4_code dataset and is therefore geared toward code-related text; it may perform poorly on other domains.
- This is a small two-layer model released for mechanistic interpretability research, not a production-grade code model; results will vary with the complexity and specificity of the task.
- Ethical considerations should be taken into account when deploying this model, especially in contexts where automation could significantly impact human labor or decision-making.

**Notes for Users**

- The model's behavior depends on its hyperparameters and the nature of its training data, both documented above.
- Users are encouraged to review the specifications and training configurations above to decide whether the model fits their specific needs.

---

*This model card provides an overview of the GELU_2L512W_C4_Code model. Refer to additional documentation and resources for more comprehensive guidance on deploying and using this model.*