DataSnake committed on
Commit
b17b8fb
·
verified ·
1 Parent(s): fa2b71b

Update README.md

Files changed (1)
  1. README.md +9 -5
README.md CHANGED
@@ -84,6 +84,15 @@ As shown in the following graph, the difference in speed shrinks as context leng
84
  ## Long-context Perplexity
85
  For this test, I split sample texts into \\(n\\)-token chunks and computed perplexity scores for each chunk in all four quantized models: the NVFP4 baseline, one with Four Over Six weight selection, one with the attention tensors in FP8, and the hybrid format. I then recorded the average perplexity score for each model at each chunk size. Sample texts for this step were UTF-8 text files taken from Project Gutenberg, listed below.
86
 
87
  |Tokens|NVFP4|Four Over Six|FP8 Attention|Hybrid|
88
  |-:|-:|-:|-:|-:|
89
  |4096|4.2980|4.1049|3.7679|3.6271|
@@ -98,11 +107,6 @@ For this test, I split sample texts into \\(n\\)-token chunks and computed perpl
98
  While perplexity for all quants increases with context length past 8192, the chart is very different from the one for performance and rather informative. Changing between Four Over Six and the default NVFP4 weight selection was a linear change in both the pure NVFP4 model and the one with FP8 attention. The two models with FP8 attention diverge from the two without as context length changes, however, indicating that as the number of tokens attending to each other increases, the benefits of doing attention calculations in higher precision become more pronounced.
99
  ![image/png](perplexity-plot.png)
100
 
101
- ### Sample texts used
102
- - [Pride and Prejudice](https://www.gutenberg.org/cache/epub/42671/pg42671.txt)
103
- - [Frankenstein](https://www.gutenberg.org/cache/epub/84/pg84.txt)
104
- - [Wuthering Heights](https://www.gutenberg.org/cache/epub/768/pg768.txt)
105
- - [Dracula](https://www.gutenberg.org/cache/epub/345/pg345.txt)
106
 
107
  ## Inference
108
  This model requires compressed-tensors 0.14.0 or later and has been tested on both [vLLM](https://github.com/vllm-project/vllm) and [Aphrodite Engine](https://github.com/aphrodite-engine/aphrodite-engine). If using Aphrodite Engine or an older version of vLLM, you'll need to manually update compressed-tensors to 0.14.0 or later, and if you're using Aphrodite Engine 0.10.0 rather than installing it from the latest github commit, you may need to open the file `aphrodite/platforms/interface.py` in your library or venv (if you've followed the [official installation instructions](https://aphrodite.pygmalion.chat/installation/installation/), it will be under `~/venv/aphrodite/lib/python3.12/site-packages`) and comment out or delete lines 487-491.
 
84
  ## Long-context Perplexity
85
  For this test, I split sample texts into \\(n\\)-token chunks and computed perplexity scores for each chunk in all four quantized models: the NVFP4 baseline, one with Four Over Six weight selection, one with the attention tensors in FP8, and the hybrid format. I then recorded the average perplexity score for each model at each chunk size. Sample texts for this step were UTF-8 text files taken from Project Gutenberg, listed below.
86
 
87
+ <details>
88
+ <summary>Sample texts used</summary>
89
+
90
+ - [Pride and Prejudice](https://www.gutenberg.org/cache/epub/42671/pg42671.txt)
91
+ - [Frankenstein](https://www.gutenberg.org/cache/epub/84/pg84.txt)
92
+ - [Wuthering Heights](https://www.gutenberg.org/cache/epub/768/pg768.txt)
93
+ - [Dracula](https://www.gutenberg.org/cache/epub/345/pg345.txt)
94
+ </details>
95
+
96
  |Tokens|NVFP4|Four Over Six|FP8 Attention|Hybrid|
97
  |-:|-:|-:|-:|-:|
98
  |4096|4.2980|4.1049|3.7679|3.6271|
 
107
  While perplexity for all quants increases with context length past 8192, the chart is very different from the one for performance and rather informative. Changing between Four Over Six and the default NVFP4 weight selection was a linear change in both the pure NVFP4 model and the one with FP8 attention. The two models with FP8 attention diverge from the two without as context length changes, however, indicating that as the number of tokens attending to each other increases, the benefits of doing attention calculations in higher precision become more pronounced.
108
  ![image/png](perplexity-plot.png)
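
The chunked evaluation described in the Long-context Perplexity section can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the helper names are hypothetical, `score_chunk` stands in for whatever model call returns per-token natural-log probabilities for one chunk, and dropping a short trailing chunk is an assumption the README does not state.

```python
import math
from typing import Callable, List, Sequence


def split_into_chunks(token_ids: Sequence[int], n: int) -> List[Sequence[int]]:
    """Split token_ids into consecutive n-token chunks.

    Assumption: a trailing chunk shorter than n is dropped.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    full_chunks = len(token_ids) // n
    return [token_ids[i * n:(i + 1) * n] for i in range(full_chunks)]


def chunk_perplexity(log_probs: Sequence[float]) -> float:
    """Perplexity is exp of the mean negative log-likelihood (natural log)."""
    return math.exp(-sum(log_probs) / len(log_probs))


def mean_chunk_perplexity(
    token_ids: Sequence[int],
    n: int,
    score_chunk: Callable[[Sequence[int]], Sequence[float]],
) -> float:
    """Average the per-chunk perplexities for one text at chunk size n.

    score_chunk is a hypothetical callable: given one chunk of token ids,
    it returns the model's log-probability for each token in that chunk.
    """
    chunks = split_into_chunks(token_ids, n)
    if not chunks:
        raise ValueError("text is shorter than one chunk")
    scores = [chunk_perplexity(score_chunk(chunk)) for chunk in chunks]
    return sum(scores) / len(scores)
```

As a sanity check, a dummy scorer that assigns every token probability 1/8 should give a perplexity of 8 regardless of chunk size, for example `mean_chunk_perplexity(list(range(10)), 4, lambda c: [-math.log(8.0)] * len(c))`.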
109
 
110
 
111
  ## Inference
112
  This model requires compressed-tensors 0.14.0 or later and has been tested on both [vLLM](https://github.com/vllm-project/vllm) and [Aphrodite Engine](https://github.com/aphrodite-engine/aphrodite-engine). If using Aphrodite Engine or an older version of vLLM, you'll need to manually update compressed-tensors to 0.14.0 or later, and if you're using Aphrodite Engine 0.10.0 rather than installing it from the latest github commit, you may need to open the file `aphrodite/platforms/interface.py` in your library or venv (if you've followed the [official installation instructions](https://aphrodite.pygmalion.chat/installation/installation/), it will be under `~/venv/aphrodite/lib/python3.12/site-packages`) and comment out or delete lines 487-491.
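
Before loading the model, it may help to confirm that the installed compressed-tensors meets the 0.14.0 minimum mentioned above. The following is a rough sketch using only the standard library; the helper names are hypothetical, and the version parsing is deliberately simple (it compares numeric release components and ignores pre-release or dev suffixes), so `packaging.version` would be a more robust choice if it is available.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release components, e.g. '0.14.0a1' -> (0, 14, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev3'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "0.14.0") -> bool:
    """Return True if the installed release is at least the minimum release."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so '0.14' compares equal to '0.14.0'
    b += (0,) * (width - len(b))
    return a >= b


def installed_compressed_tensors() -> Optional[str]:
    """Return the installed compressed-tensors version, or None if absent."""
    try:
        return metadata.version("compressed-tensors")
    except metadata.PackageNotFoundError:
        return None
```

If `installed_compressed_tensors()` returns `None`, or `meets_minimum` returns `False` for its result, the package needs to be installed or upgraded as described above.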