---
license: apache-2.0
---

# Tele-FLM-1T
Tele-FLM-1T (aka FLM-2-1T) is a 1T open-sourced multilingual large language model that features a stable, efficient pre-training paradigm and enhanced factual judgement capabilities.
Built upon the decoder-only transformer architecture, it has been trained on approximately 2.3T tokens.
Tele-FLM-1T, currently the largest model in the Tele-FLM series, is built upon Tele-FLM (52B), which delivers superior performance at its scale, and is in all likelihood capable of handling even harder tasks with better results.
For now, it is still under evaluation due to limited computing resources.
In addition to sharing the model weights, we provide the core designs, engineering practices, and training details, anticipating their benefits for both academic and industrial communities.
## Model Details
- Input and output multiplier
Consequently, Tele-FLM-1T is largely compatible with Llama architecturally.
To maximize convenience for the community, we made minimal adjustments to Llama's code to adapt it to Tele-FLM-1T and released it as open source.
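Given that compatibility, the model should be loadable through the standard Hugging Face transformers API. The sketch below is hedged: the repo id `CofeAI/Tele-FLM-1T` and the `trust_remote_code=True` requirement are assumptions mirroring the Tele-FLM (52B) repo, not details confirmed by this card.

```python
def load_tele_flm_1t(repo_id: str = "CofeAI/Tele-FLM-1T"):
    """Load Tele-FLM-1T via transformers (repo id and flags are assumptions)."""
    # Imported lazily so the sketch can be read without the package installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        trust_remote_code=True,
        device_map="auto",  # requires accelerate; shards a 1T model across devices
    )
    return tokenizer, model
```

Loading a 1T-parameter checkpoint requires a multi-GPU (or offloaded) setup; `device_map="auto"` is one common way to shard it.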
| Models | layer<br>number | attention<br>heads | hidden<br>size | ffn hidden<br>size | vocab<br>size | context<br>length | params<br>count |
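For a back-of-the-envelope check of the parameter counts in this table, the following is a rough generic estimate for a Llama-style decoder-only model with a gated FFN. It ignores norms and biases, the untied-embedding default is an assumption, and it is not the official accounting:

```python
def estimate_params(layers: int, hidden: int, ffn_hidden: int, vocab: int,
                    tied_embeddings: bool = False) -> int:
    """Rough Llama-style decoder-only parameter count (norms/biases ignored)."""
    # Input embedding, plus a separate output projection if embeddings are untied.
    embed = vocab * hidden * (1 if tied_embeddings else 2)
    # Attention: Q, K, V, and output projections, each hidden x hidden.
    attn = 4 * hidden * hidden
    # Gated FFN (e.g. SwiGLU): up, gate, and down projections.
    ffn = 3 * hidden * ffn_hidden
    return embed + layers * (attn + ffn)

# Tiny illustrative values (hypothetical, not Tele-FLM-1T's real configuration):
print(estimate_params(layers=2, hidden=4, ffn_hidden=8, vocab=10))  # 400
```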
All nodes are interconnected via InfiniBand (IB).

### Software
Tele-FLM-1T utilizes 3D parallel training, combining the prevailing methodologies: data parallelism, tensor parallelism, and pipeline parallelism.
The parallel training setup for Tele-FLM-1T is configured as follows: tensor parallel=32, pipeline parallel=28, and data parallel=1.
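As a quick sanity check on that configuration, the three parallelism degrees multiply to give the total number of GPU ranks. A minimal sketch of the arithmetic (not the actual training launcher):

```python
# GPU world size implied by Tele-FLM-1T's reported 3D parallel layout.
tensor_parallel = 32    # ways each layer's weight matrices are sharded
pipeline_parallel = 28  # sequential pipeline stages
data_parallel = 1       # full-model replicas
world_size = tensor_parallel * pipeline_parallel * data_parallel
print(world_size)  # 896
```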
### Related Work
[Tele-FLM (52B)](https://huggingface.co/CofeAI/Tele-FLM)