Update README.md
---
datasets:
- natural_instructions
- the_pile
- cot
- Muennighoff/P3
tags:
- gpt
pipeline_tag: text-generation
inference: true
widget:
- text: "Where is Zurich? Ans:"
- text: "What is the highest mountain? Answer:"
---
We present Together-GPT-J-6B-ProxAdam-50x, a model capable of following human instructions and conducting zero/few-shot inference.
The model was trained in a decentralized fashion with the ProxAdam optimizer, requiring only 2% of the cross-machine communication of vanilla data-parallel training.
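The widget examples suggest a simple question-plus-answer-marker prompt format for zero/few-shot use. A minimal sketch of building such a few-shot prompt (the helper name and demonstration pair are hypothetical, and the format is assumed from the widget examples, not documented):

```python
def build_few_shot_prompt(examples, question, answer_marker="Ans:"):
    """Join (question, answer) demonstrations and a final question into
    one prompt ending with the answer marker for the model to complete."""
    parts = [f"{q} {answer_marker} {a}" for q, a in examples]
    parts.append(f"{question} {answer_marker}")
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    [("Where is Zurich?", "Switzerland")],  # hypothetical demonstration
    "What is the highest mountain?",
)
```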
# Quick Start

```python
from transformers import pipeline

pipe = pipeline(model='togethercomputer/Together-gpt-J-6B-ProxAdam-50x')

pipe("Where is Zurich? Ans:")
```
# Training Data

We fine-tune [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) on NI, P3, COT, and the Pile:

- [Natural-Instructions](https://github.com/allenai/natural-instructions)
- [P3](https://huggingface.co/datasets/Muennighoff/P3)
- [MMLU-COT](https://github.com/jasonwei20/flan-2/blob/main/mmlu-cot.json)
- [the Pile](https://huggingface.co/datasets/the_pile)

The Pile is used to maintain the general ability of GPT-J; the others are instruction-tuning datasets.
# Hyperparameters

We used AdamW with a learning rate of 1e-5 and a global batch size of 64, and trained for 5k steps.
We used mixed-precision training, where activations are kept in FP16 while optimizer states are kept in FP32.
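The FP16-activation / FP32-optimizer-state split can be sketched with NumPy stand-ins for tensors. This is a simplified illustration of the precision handling only (all names are hypothetical, and the update rule is plain SGD with momentum, not the actual optimizer used):

```python
import numpy as np

# Master weights and optimizer state stay in FP32
master_w = np.zeros(4, dtype=np.float32)
momentum = np.zeros_like(master_w)

def train_step(x, grad_fn, lr=1e-5, beta=0.9):
    """One step: compute in FP16, accumulate and update in FP32."""
    global master_w, momentum
    w16 = master_w.astype(np.float16)            # cast weights down for compute
    activation = x.astype(np.float16) * w16      # FP16 forward pass
    grad = grad_fn(activation).astype(np.float32)  # gradients back to FP32
    momentum = beta * momentum + grad
    master_w = master_w - lr * momentum          # FP32 weight update
    return activation
```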
We truncate input sequences to 2048 tokens, and concatenate sequences shorter than 2048 tokens into one long sequence to improve data efficiency.
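The truncation-and-packing scheme above can be sketched as follows (a minimal illustration with dummy token lists; the function name is hypothetical and the real preprocessing may differ, e.g. in how it splits sequences across chunk boundaries):

```python
def pack_sequences(sequences, max_len=2048):
    """Concatenate tokenized sequences into chunks of at most max_len tokens,
    truncating any single sequence that exceeds max_len."""
    chunks, current = [], []
    for seq in sequences:
        seq = seq[:max_len]                     # truncate over-long sequences
        if len(current) + len(seq) > max_len:   # chunk full: start a new one
            chunks.append(current)
            current = []
        current.extend(seq)
    if current:
        chunks.append(current)
    return chunks
```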
# Infrastructure

We used [the Together Research Computer](https://together.xyz/) to conduct training.
Specifically, we used 4 data-parallel workers, each containing 2 \* A100 80GB GPUs.
Together Research Computer connects clusters at Stanford University, ETH Zurich, the Open Science Grid, and the University of Wisconsin-Madison.
|