Update README.md
Browse files
README.md
CHANGED
@@ -8,6 +8,7 @@ tags:
- pytorch
- causal-lm
- code-generation
- The Pile

license: apache-2.0
@@ -29,55 +30,28 @@ This is a preliminary release of an experimental artifact and should be treated
| Hyperparameter        | Value                                                                 |
|-----------------------|-----------------------------------------------------------------------|
| \\(n_{parameters}\\)  | 1,331,810,304                                                         |
| \\(n_{layers}\\)      | 24                                                                    |
| \\(d_{model}\\)       | 2048                                                                  |
| \\(d_{ff}\\)          | 8192                                                                  |
| \\(n_{heads}\\)       | 16                                                                    |
| \\(d_{head}\\)        | 128                                                                   |
| \\(n_{ctx}\\)         | 2048                                                                  |
| \\(n_{vocab}\\)       | 50254                                                                 |
| Positional Encoding   | [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864) |

The model consists of 24 transformer layers with a hidden dimension of 2048, and a feedforward intermediate dimension of 8192. The hidden dimension is split into 16 heads for self-attention, each with a dimension of 128. Rotary Position Embedding (RoPE) is used.

The model is trained with the same tokenizer as [GPT-NeoX-20b](https://arxiv.org/abs/2204.06745), for a vocabulary size of 50254 tokens.
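These hyperparameters can also be read off the published checkpoint. The snippet below is a minimal sketch that assumes the usual GPT-NeoX-style `transformers` config field names; it is illustrative, not a guaranteed schema.

```python
from transformers import AutoConfig, AutoTokenizer

# Load the config and tokenizer for the checkpoint named in this card.
config = AutoConfig.from_pretrained("CarperAI/FIM-1.3b")
tokenizer = AutoTokenizer.from_pretrained("CarperAI/FIM-1.3b")

# Field names below are assumptions based on standard GPT-NeoX-style configs.
print(config.num_hidden_layers)        # expected: 24    (n_layers)
print(config.hidden_size)              # expected: 2048  (d_model)
print(config.num_attention_heads)      # expected: 16    (n_heads)
print(config.max_position_embeddings)  # expected: 2048  (n_ctx)
print(len(tokenizer))                  # expected: ~50254 (n_vocab)
```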
## Training Data
The model was trained on the Pile, an 800GB dataset composed of varied web corpora. The datasheet and paper for the Pile can be found [here](https://arxiv.org/abs/2201.07311) and [here](https://arxiv.org/abs/2101.00027) respectively.
## Training Details
@@ -88,8 +62,7 @@ Following Bavarian et al. 2022, we train the model to additionally perform infil
Middle segments “to infill” were selected uniformly at random from contexts at the character level, and these contexts were then reformatted as:

\<SUF\> {last 1/3rd of the context} \<PRE\> {first 1/3rd of the context} \<MID\> {middle 1/3rd of the context} \<EOD\>
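As a concrete illustration, here is a minimal sketch of how one such training example could be assembled. The uniform character-level split follows the description above; the exact whitespace around the sentinel tokens is an assumption.

```python
import random

def make_fim_example(context: str) -> str:
    """Sketch of the infilling reformat described above: pick two character
    positions uniformly at random, split the context into prefix / middle /
    suffix, and emit the <SUF> ... <PRE> ... <MID> ... <EOD> layout."""
    i, j = sorted(random.sample(range(len(context) + 1), 2))
    prefix, middle, suffix = context[:i], context[i:j], context[j:]
    # Spacing around the sentinel tokens here is illustrative, not canonical.
    return f"<SUF>{suffix}<PRE>{prefix}<MID>{middle}<EOD>"
```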
@@ -118,11 +91,11 @@ model = AutoModelForCausalLM.from_pretrained("CarperAI/FIM-1.3b")
Suppose we have some text that we would like to perform infilling on at a certain “cursor location”.

This would have the form {some prelude text here} \<INFILLING LOCATION\> {some text following cursor}.

To perform infilling generation, place the input text into this format:

\<SUF\> {some text following cursor} \<PRE\> {some prelude text here} \<MID\> ... language model output is generated after the \<MID\> token!
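Putting this together, here is a minimal usage sketch with `transformers`. The prefix/suffix strings are hypothetical, and it assumes the sentinel strings above are the literal spellings expected by the tokenizer.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CarperAI/FIM-1.3b")
model = AutoModelForCausalLM.from_pretrained("CarperAI/FIM-1.3b")

# Hypothetical text around the cursor location we want to infill.
prefix = "def mean(xs):\n    "
suffix = "\n    return total / len(xs)"

# Assemble the prompt in the <SUF> ... <PRE> ... <MID> order described above;
# the infilled text is whatever the model generates after the <MID> token.
prompt = f"<SUF>{suffix}<PRE>{prefix}<MID>"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)

# Keep only the newly generated tokens (the proposed middle segment).
infill = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:])
print(infill)
```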
## Intended Uses and Limitations
@@ -156,3 +129,5 @@ We also perform preliminary investigation on code generation and infilling capab