Chat-UniVi committed
Commit 486dfe8 · verified · 1 Parent(s): afde76f

Update README.md

Files changed (1): README.md +0 -34
README.md CHANGED
@@ -9,13 +9,6 @@ license: apache-2.0
 ## ⚡ Overview
 We introduce three types of zero-computation experts: the zero expert, copy expert, and constant expert, which correspond to the discard, skip, and replace operations, respectively. Moreover, we leverage gating residuals, enabling each token to consider the pathway taken in the previous layer when selecting the appropriate experts.
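As a rough illustration of the three operations, here is a minimal NumPy sketch (the function names and the constant vector `v` are hypothetical stand-ins, not the released implementation):

```python
import numpy as np

def zero_expert(x):
    # Zero expert ("discard"): the token's output is all zeros.
    return np.zeros_like(x)

def copy_expert(x):
    # Copy expert ("skip"): the input passes through unchanged,
    # like a shortcut that costs no computation.
    return x

def constant_expert(x, v):
    # Constant expert ("replace"): the output is a trainable constant
    # vector v (hypothetical parameter here), broadcast to every token.
    return np.broadcast_to(v, x.shape).copy()
```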
 
- <div align=center>
- <img src="figures/fig2.png" width="800px">
- </div>
-
- <div align=center>
- <img src="figures/fig3.png" width="800px">
- </div>
 
 ### Download URL
 <div align=center>
@@ -31,42 +24,19 @@ We introduce three types of zero-computation experts: the zero expert, copy expe
 ### 💡 Low Computing Overhead
 For an MoE++ model, its computational complexity is always **less than** that of MoE models with the same number of parameters.
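A back-of-the-envelope FLOPs count shows why (a sketch with hypothetical dimensions; `1.6` stands for the average number of FFN experts a token actually uses under MoE++, which falls below the nominal top-k because some selections go to near-zero-cost experts):

```python
def ffn_flops_per_token(d_model, d_ff, ffn_experts_used):
    # One FFN expert forward is roughly two matmuls of size
    # d_model x d_ff, i.e. ~4 * d_model * d_ff FLOPs per token.
    return ffn_experts_used * 4 * d_model * d_ff

# Vanilla MoE: every token is processed by top_k = 2 FFN experts.
vanilla = ffn_flops_per_token(4096, 11008, 2)
# MoE++: some slots go to zero-computation experts, so the average
# number of FFN experts per token drops below 2 (1.6 is hypothetical).
moe_pp = ffn_flops_per_token(4096, 11008, 1.6)
assert moe_pp < vanilla
```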
 
- <div align=center>
- <img src="figures/fig4.png" width="800px">
- </div>
 
 ### 🔥 High Performance & High Throughput
 Extensive experimental results demonstrate that MoE++ achieves better performance while delivering 1.1x to 2.1x the expert forward throughput of a vanilla MoE model of the same size, laying a solid foundation for developing advanced and efficient MoE-related models.
 
- <div align=center>
- <img src="figures/fig7.png" width="800px">
- </div>
 
 ### 🤗 Deployment Friendly
 Given that zero-computation experts have negligible parameters, we can deploy all zero-computation experts on each GPU, eliminating the significant communication overhead and expert load imbalance associated with FFN experts distributed across different GPUs.
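One way to picture this placement (a hypothetical helper, not the released code): the parameter-heavy FFN experts are sharded across GPUs, while every zero-computation expert is replicated on every GPU, so routing to them never crosses devices.

```python
def place_experts(num_gpus, ffn_experts, zc_experts):
    # Replicate all zero-computation experts on every GPU (they have
    # ~no parameters); shard FFN experts round-robin across GPUs.
    placement = {gpu: list(zc_experts) for gpu in range(num_gpus)}
    for i, expert in enumerate(ffn_experts):
        placement[i % num_gpus].append(expert)
    return placement
```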
 
 
- ## 🚀 Main Results
- ### Comparisons between MoE++ and Vanilla MoE Models
-
- <div align=center>
- <img src="figures/fig9.png" width="800px">
- </div>
-
- ### Comparisons to LLMs of Equivalent Activated Parameters
-
- <div align=center>
- <img src="figures/fig10.png" width="800px">
- </div>
-
 ## 😍 Why is MoE++ better than MoE?
 ### Flexible Computation Allocation
 MoE++ allows simple tokens to utilize fewer FFN experts, freeing up more FFN experts to focus on challenging tokens. This results in both **Reduced Computation** and **Enhanced Performance**.
 
- <div align=center>
- <img src="figures/fig5.png" width="800px">
- </div>
-
 * Verbs tend to activate a large number of FFN experts. For example, the verb "touch" activates an average of 1.77 FFN experts across all layers, approaching the upper limit of 2. This likely occurs because verbs often convey rich semantic information and frequently interact with nouns to form more complex semantic structures.
 * Nouns typically activate a moderate number of FFN experts, with most nouns averaging between 1.5 and 1.7 FFN expert activations.
 * Simple tokens with little semantic content tend to activate a small number of FFN experts. For example, word fragments such as "pper" and "ather" usually activate fewer than 1.5 FFN experts.
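Statistics like the activation counts above can be gathered with a simple counter over the router's top-k choices (a sketch; it assumes expert indices below `num_ffn_experts` are FFN experts and the rest are zero-computation experts):

```python
import numpy as np

def avg_ffn_experts_per_token(routing_probs, num_ffn_experts, top_k=2):
    # For each token, take the top-k experts by routing probability and
    # count how many are FFN experts (index < num_ffn_experts). The mean
    # over tokens is bounded above by top_k (the "upper limit of 2").
    top = np.argsort(routing_probs, axis=-1)[:, -top_k:]
    return float((top < num_ffn_experts).sum(axis=-1).mean())
```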
@@ -76,10 +46,6 @@ These findings confirm that MoE++ allows simple tokens to utilize fewer FFN expe
 ### Stable Routing
 Gating residuals effectively establish connections between different MoE++ layers and reduce the variance of the routing scores, while leaving their mean and range unchanged. Consequently, gating residuals contribute to the stable routing of heterogeneous expert architectures in MoE++.
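The mechanism can be sketched as follows (hypothetical shapes and names; the idea is that a layer's router logits receive the routing scores of the previous layer before normalization):

```python
import numpy as np

def route_with_gating_residual(router_logits, prev_scores):
    # Gating residual: add the previous layer's routing scores to this
    # layer's router logits, so a token's expert choice depends on the
    # pathway it took in the previous layer.
    scores = router_logits + prev_scores
    # Softmax over the expert dimension gives routing probabilities.
    z = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs, scores  # carry `scores` forward as the next residual
```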
 
- <div align=center>
- <img src="figures/fig6.png" width="800px">
- </div>
-
 
 ## 🤖 API for Base Model Inference
 If you want to load the model from the Hugging Face Hub or from a local path, you can use the following code snippets.
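For example (a minimal sketch using the standard `transformers` API; the repo id below is a placeholder, so substitute the actual hub id or a local directory):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id: replace with the actual hub id or a local path.
model_id = "your-org/moe-plus-plus-base"

# trust_remote_code=True is an assumption here, since MoE++ layers are
# custom modules; drop it if the architecture is supported natively.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```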
 