# Training a multilingual 176 billion parameters model in the open

The training of BigScience's main model started on **March 11, 2022 11:42am PST** and will last 3-4 months on the 416 A100 GPUs of the Jean Zay public supercomputer.

You can follow the training at [https://twitter.com/BigScienceLLM](https://twitter.com/BigScienceLLM).
## Summary of the model, dataset, hardware, training and environmental considerations

### **The model**

- 176B parameters decoder-only architecture (GPT-like)
- 70 layers - 112 attention heads per layer - hidden dimensionality of 14336 - 2048-token sequence length
- ALiBi positional embeddings - GeLU activation function
- **More information**:
  - [Blog post summarizing how the architecture, size, shape, and pre-training duration were selected](https://bigscience.huggingface.co/blog/what-language-model-to-train-if-you-have-two-million-gpu-hours)
  - [More details on the architecture/optimizer](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml)
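As a sanity check, the shapes listed above are enough to roughly reproduce the headline parameter count using the standard GPT-style formulas; this is a back-of-the-envelope sketch, not the layer layout from the actual training code.

```python
# Rough parameter count for a GPT-like decoder from the shapes above.
# The per-layer formulas are the generic transformer ones (an
# assumption), ignoring biases and layer norms.

hidden = 14336      # hidden dimensionality
layers = 70         # decoder layers
vocab = 250_680     # tokenizer vocabulary (from the dataset section)

embedding = vocab * hidden            # token embedding matrix
attention = 4 * hidden * hidden       # Q, K, V and output projections
mlp = 2 * (hidden * 4 * hidden)       # two linear layers with 4x expansion
per_layer = attention + mlp

total = embedding + layers * per_layer
print(f"~{total / 1e9:.1f}B parameters")  # ~176.2B, matching the headline figure
```

The embedding matrix alone contributes about 3.6B parameters here, which is why the large multilingual vocabulary is not free.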
### **The dataset**

- Multilingual: 46 languages; the full list is [here](https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling)
- 341.6 billion tokens (1.5 TB of text data)
- Tokenizer vocabulary: 250,680 tokens
- **More information**:
  - [Blog post detailing the design choices during the dataset creation](https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling)
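The two dataset figures above imply an average token density that is worth a quick check; treating 1.5 TB as 1.5e12 bytes is an approximation, since the exact on-disk size is not given.

```python
# Average bytes of raw text per training token, from the figures above.
data_bytes = 1.5e12   # ~1.5 TB of text (approximate)
tokens = 341.6e9      # training tokens

bytes_per_token = data_bytes / tokens
print(f"~{bytes_per_token:.1f} bytes of text per token")  # ~4.4
```

Roughly 4.4 bytes per token is plausible for a multilingual corpus, where many languages tokenize less densely than English.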
### **The engineering side**

- Number of GPUs used for the training: 384 A100 GPUs with 80 GB of memory each
- One copy of the model takes 48 GPUs (using 60 GB of memory on each GPU)
- Checkpoint size: the bf16 weights alone are 329 GB; the full checkpoint with optimizer states is 2.3 TB
- Training throughput: about 150 TFLOPs
- Estimated training time: 3-4 months, depending on throughput and unexpected events
- **More information**:
  - [Blog post on the hardware/engineering side](https://bigscience.huggingface.co/blog/which-hardware-to-train-a-176b-parameters-model)
  - [Details on the distributed setup used for the training](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml)
  - [Tensorboard updated during the training](https://huggingface.co/bigscience/tr11-176B-ml-logs/tensorboard#scalars&tagFilter=loss)
  - [Details on the obstacles overcome during the preparation on the engineering side (instabilities, optimization of training throughput, so many technical tricks and questions)](https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md)
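The engineering numbers above hang together arithmetically. A sketch, assuming the standard mixed-precision Adam state layout (bf16 weights plus fp32 master weights, momentum, and variance) and reading the quoted 150 TFLOPs as per-GPU throughput; neither assumption is confirmed by this README.

```python
# Back-of-the-envelope checks of the engineering figures above.
params = 176e9

# bf16 weights: 2 bytes per parameter.
bf16_gb = params * 2 / 2**30
print(f"bf16 weights: ~{bf16_gb:.0f} GiB")      # ~328, close to the quoted 329 GB

# Full checkpoint: bf16 weights + fp32 master copy + two fp32 Adam
# states = 14 bytes per parameter (assumed layout).
full_tb = params * (2 + 4 + 4 + 4) / 2**40
print(f"full checkpoint: ~{full_tb:.1f} TiB")   # ~2.2, close to the quoted 2.3 TB

# Data-parallel replicas: 384 GPUs / 48 GPUs per model copy.
print(f"replicas: {384 // 48}")                 # 8

# Optimistic wall-clock time, using the common FLOPs ~= 6 * N * D estimate.
tokens = 341.6e9
flops = 6 * params * tokens
seconds = flops / (384 * 150e12)                # 150 TFLOPs assumed per GPU
print(f"compute-only time: ~{seconds / 86400:.0f} days")
```

The compute-only estimate comes out near 72 days (about 2.4 months); the quoted 3-4 months leaves room for restarts, evaluations, and other overhead.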
### **Environmental considerations**

- [Jean Zay](http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html), the supercomputer we are using for model training, is mostly powered by nuclear energy, which is a low-carbon energy source.
- Significant efforts were made to make sure that the computing infrastructure is as efficient as possible: the heat generated by the hardware even gets used for heating buildings on campus!
- **More information**:
  - We are currently working on a precise estimate of the carbon emitted during all of the steps of model training, including intermediate experiments as well as inference. More soon!