Update README.md
README.md CHANGED
@@ -75,7 +75,7 @@ However, it is still fundamentally limited by its size for certain tasks. The mo
 
 ### Model Efficiency
 
-The two figures below compare the latency and throughput performance of the Phi-4-mini-reasoning and Phi-4-mini-flash-reasoning models under the vLLM inference framework. All evaluations were performed on a single NVIDIA A100-80GB GPU with tensor parallelism disabled (TP = 1). The Phi-4-mini-flash-reasoning model, which incorporates a decoder-hybrid-decoder architecture with attention and state space model (SSM), exhibits significantly greater computational efficiency—achieving
+The two figures below compare the latency and throughput performance of the Phi-4-mini-reasoning and Phi-4-mini-flash-reasoning models under the vLLM inference framework. All evaluations were performed on a single NVIDIA A100-80GB GPU with tensor parallelism disabled (TP = 1). The Phi-4-mini-flash-reasoning model, which incorporates a decoder-hybrid-decoder architecture with attention and a state space model (SSM), exhibits significantly greater computational efficiency, achieving up to a 10× improvement in throughput when processing user requests with a 2K prompt length and 32K generation length. Furthermore, Phi-4-mini-flash-reasoning demonstrates near-linear growth in latency with respect to the number of tokens generated (up to 32K), in contrast to the quadratic growth observed in Phi-4-mini-reasoning. These findings indicate that Phi-4-mini-flash-reasoning is more scalable and better suited for long-sequence generation tasks.
 
 <div align="left">
 <img src="lat.png" width="300"/>
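The evaluation setup described in the added paragraph (vLLM, single GPU with TP = 1, long generation budgets) could be scripted along these lines. This is a minimal sketch, not the harness used to produce the figures: the timing helper is generic, and the `vllm_generate_fn` helper, the chosen token grid, and the Hugging Face model id `microsoft/Phi-4-mini-flash-reasoning` are assumptions based on vLLM's offline `LLM`/`SamplingParams` API.

```python
import time


def measure_latency(generate_fn, prompt, max_tokens_grid):
    """Time generate_fn(prompt, max_tokens) for each token budget.

    Returns a list of (max_tokens, elapsed_seconds) pairs, which can be
    plotted to compare latency growth across generation lengths.
    """
    results = []
    for max_tokens in max_tokens_grid:
        start = time.perf_counter()
        generate_fn(prompt, max_tokens)
        results.append((max_tokens, time.perf_counter() - start))
    return results


def vllm_generate_fn(model_name):
    """Build a generate callable backed by vLLM (hypothetical helper;
    requires a GPU and `pip install vllm`)."""
    from vllm import LLM, SamplingParams

    # TP = 1, matching the single-GPU setup described in the README.
    llm = LLM(model=model_name, tensor_parallel_size=1)

    def generate(prompt, max_tokens):
        # ignore_eos forces generation to run the full token budget,
        # so each timing reflects the requested generation length.
        llm.generate([prompt], SamplingParams(max_tokens=max_tokens,
                                              ignore_eos=True))

    return generate
```

Usage might look like `measure_latency(vllm_generate_fn("microsoft/Phi-4-mini-flash-reasoning"), prompt_2k, [1024, 4096, 16384, 32768])`, sweeping generation length up to 32K for a fixed 2K-token prompt as in the comparison above.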