Update README.md
---
datasets:
- natural_instructions
- the_pile
- cot
- Muennighoff/P3
tags:
- gpt
pipeline_tag: text-generation
inference: true
widget:
- text: "Where is Zurich? Ans:"
- text: "What is the highest mountain? Answer:"
---
We present Together-GPT-J-6B-ProxAdam-50x, a model capable of following human instructions and conducting zero/few-shot inference.
The model was trained in a decentralized fashion with the ProxAdam optimizer, requiring only 2% of the cross-machine communication of vanilla data-parallel training.
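The widget examples suggest a simple question-plus-answer-marker prompt format for zero/few-shot use. A minimal sketch of building such a few-shot prompt (the helper name and demonstration pair are hypothetical, and the format is assumed from the widget examples, not documented):

```python
def build_few_shot_prompt(examples, question, answer_marker="Ans:"):
    """Join (question, answer) demonstrations and a final question into
    one prompt ending with the answer marker for the model to complete."""
    parts = [f"{q} {answer_marker} {a}" for q, a in examples]
    parts.append(f"{question} {answer_marker}")
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    [("Where is Zurich?", "Switzerland")],  # hypothetical demonstration
    "What is the highest mountain?",
)
```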
# Quick Start

```python
from transformers import pipeline

pipe = pipeline(model='togethercomputer/Together-gpt-J-6B-ProxAdam-50x')

pipe("Where is Zurich? Ans:")
```
# Training Data

We fine-tune [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) on NI, P3, COT, and the Pile:

- [Natural-Instructions](https://github.com/allenai/natural-instructions)
- [P3](https://huggingface.co/datasets/Muennighoff/P3)
- [MMLU-COT](https://github.com/jasonwei20/flan-2/blob/main/mmlu-cot.json)
- [the Pile](https://huggingface.co/datasets/the_pile)

The Pile is used to maintain the general ability of GPT-J; the others are instruction-tuning datasets.
# Hyperparameters

We used AdamW with a learning rate of 1e-5 and a global batch size of 64, and trained for 5k steps.
We used mixed-precision training, where activations are kept in FP16 while optimizer states are kept in FP32.
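The FP16-activation / FP32-optimizer-state split can be sketched with NumPy stand-ins for tensors. This is a simplified illustration of the precision handling only (all names are hypothetical, and the update rule is plain SGD with momentum, not the actual optimizer used):

```python
import numpy as np

# Master weights and optimizer state stay in FP32
master_w = np.zeros(4, dtype=np.float32)
momentum = np.zeros_like(master_w)

def train_step(x, grad_fn, lr=1e-5, beta=0.9):
    """One step: compute in FP16, accumulate and update in FP32."""
    global master_w, momentum
    w16 = master_w.astype(np.float16)            # cast weights down for compute
    activation = x.astype(np.float16) * w16      # FP16 forward pass
    grad = grad_fn(activation).astype(np.float32)  # gradients back to FP32
    momentum = beta * momentum + grad
    master_w = master_w - lr * momentum          # FP32 weight update
    return activation
```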
We truncate input sequences to 2048 tokens, and concatenate sequences shorter than 2048 tokens into one long sequence to improve data efficiency.
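The truncation-and-packing scheme above can be sketched as follows (a minimal illustration with dummy token lists; the function name is hypothetical and the real preprocessing may differ, e.g. in how it splits sequences across chunk boundaries):

```python
def pack_sequences(sequences, max_len=2048):
    """Concatenate tokenized sequences into chunks of at most max_len tokens,
    truncating any single sequence that exceeds max_len."""
    chunks, current = [], []
    for seq in sequences:
        seq = seq[:max_len]                     # truncate over-long sequences
        if len(current) + len(seq) > max_len:   # chunk full: start a new one
            chunks.append(current)
            current = []
        current.extend(seq)
    if current:
        chunks.append(current)
    return chunks
```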
# Infrastructure

We used [the Together Research Computer](https://together.xyz/) to conduct training.
Specifically, we used 4 data-parallel workers, each containing 2 \* A100 80GB GPUs.
Together Research Computer connects clusters at Stanford University, ETH Zurich, the Open Science Grid, and the University of Wisconsin-Madison.
|