The following accuracy results were obtained with lm-eval using the Hugging Face (HF) backend:
|
|
```
lm_eval \
--model hf \
--model_args pretrained="nm-testing/Llama-3.1-8B-Instruct-QKV-Cache-FP8",dtype=auto,device_map="auto",max_length=100000 \
--tasks "niah_single_1" \
--write_out \
--batch_size 1 \
--output_path "niah_single_1.json" \
--show_config
```
|
|
Replace `niah_single_1` with `niah_single_2`, `niah_single_3`, `niah_multikey_1`, `niah_multikey_2`, and `niah_multikey_3` to run the remaining tasks.
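Rather than editing the command by hand for each task, all six can be run in one loop. This is a sketch; it assumes `lm_eval` is installed and on `PATH`:

```shell
# Evaluate every NIAH task in sequence with the HF backend.
tasks="niah_single_1 niah_single_2 niah_single_3 niah_multikey_1 niah_multikey_2 niah_multikey_3"

if command -v lm_eval >/dev/null 2>&1; then
  for task in $tasks; do
    lm_eval \
      --model hf \
      --model_args pretrained="nm-testing/Llama-3.1-8B-Instruct-QKV-Cache-FP8",dtype=auto,device_map="auto",max_length=100000 \
      --tasks "$task" \
      --write_out \
      --batch_size 1 \
      --output_path "${task}.json" \
      --show_config
  done
else
  echo "lm_eval not found; tasks that would run: $tasks"
fi
```

Each task writes its own `<task>.json`, so results are not overwritten between runs.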
|
|
|
|
|
|
### Accuracy with HF
<table>
<thead>
<tr>
<th>Category</th>
<th>Task</th>
<th>meta-llama/Llama-3.1-8B-Instruct</th>
<th>nm-testing/Llama-3.1-8B-Instruct-QKV-Cache-FP8</th>
<th>Recovery (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="6"><b>NIAH</b></td>
<td>niah_single_1 (100K)</td>
<td>OOM</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<td>niah_single_2 (100K)</td>
<td>OOM</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<td>niah_single_3 (100K)</td>
<td>OOM</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<td>niah_multikey_1 (100K)</td>
<td>OOM</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<td>niah_multikey_2 (100K)</td>
<td>OOM</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<td>niah_multikey_3 (100K)</td>
<td>OOM</td>
<td>0.0</td>
<td>0.0</td>
</tr>
</tbody>
</table>
|
|
|
|
The following accuracy results were obtained with lm-eval using the vLLM backend:
|
|
```
lm_eval \
--model vllm \
--model_args pretrained="nm-testing/Llama-3.1-8B-Instruct-QKV-Cache-FP8",dtype=auto,add_bos_token=True,max_model_len=131072,tensor_parallel_size=1,gpu_memory_utilization=0.7,enable_chunked_prefill=True,trust_remote_code=True \
--tasks "niah_single_1" \
--write_out \
--batch_size 1 \
--output_path "niah_single_1.json" \
--show_config
```
|
|
Replace `niah_single_1` with `niah_single_2`, `niah_single_3`, `niah_multikey_1`, `niah_multikey_2`, and `niah_multikey_3` to run the remaining tasks.
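As with the HF backend, the vLLM runs can be scripted in one loop rather than edited by hand. This is a sketch; it assumes `lm_eval` with vLLM support is installed and on `PATH`:

```shell
# Evaluate every NIAH task in sequence with the vLLM backend.
tasks="niah_single_1 niah_single_2 niah_single_3 niah_multikey_1 niah_multikey_2 niah_multikey_3"

if command -v lm_eval >/dev/null 2>&1; then
  for task in $tasks; do
    lm_eval \
      --model vllm \
      --model_args pretrained="nm-testing/Llama-3.1-8B-Instruct-QKV-Cache-FP8",dtype=auto,add_bos_token=True,max_model_len=131072,tensor_parallel_size=1,gpu_memory_utilization=0.7,enable_chunked_prefill=True,trust_remote_code=True \
      --tasks "$task" \
      --write_out \
      --batch_size 1 \
      --output_path "${task}.json" \
      --show_config
  done
else
  echo "lm_eval not found; tasks that would run: $tasks"
fi
```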
|
|
|
|
### Accuracy with vLLM
<table>
<thead>
<tr>
<th>Category</th>
<th>Task</th>
<th>meta-llama/Llama-3.1-8B-Instruct</th>
<th>nm-testing/Llama-3.1-8B-Instruct-QKV-Cache-FP8</th>
<th>Recovery (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="6"><b>LongBench V1</b></td>
<td>Gov Report</td>
<td>32.16</td>
<td>20.82</td>
<td>64.73</td>
</tr>
<tr>
<td>2WikimQA</td>
<td>17.35</td>
<td>18.53</td>
<td>106.80</td>
</tr>
<tr>
<td>Qasper</td>
<td>18.22</td>
<td>26.60</td>
<td>145.99</td>
</tr>
<tr>
<td>MultifieldQA</td>
<td>31.11</td>
<td>21.95</td>
<td>70.55</td>
</tr>
<tr>
<td>HotpotQA</td>
<td>16.23</td>
<td>4.86</td>
<td>29.94</td>
</tr>
<tr>
<td>Musique</td>
<td>9.11</td>
<td>0.94</td>
<td>10.31</td>
</tr>
<tr>
<td rowspan="6"><b>NIAH</b></td>
<td>niah_single_1 (4K)</td>
<td>100.0</td>
<td>100.0</td>
<td>100.0</td>
</tr>
<tr>
<td>niah_single_2 (4K)</td>
<td>100.0</td>
<td>100.0</td>
<td>100.0</td>
</tr>
<tr>
<td>niah_single_3 (4K)</td>
<td>100.0</td>
<td>100.0</td>
<td>100.0</td>
</tr>
<tr>
<td>niah_multikey_1 (4K)</td>
<td>100.0</td>
<td>100.0</td>
<td>100.0</td>
</tr>
<tr>
<td>niah_multikey_2 (4K)</td>
<td>100.0</td>
<td>100.0</td>
<td>100.0</td>
</tr>
<tr>
<td>niah_multikey_3 (4K)</td>
<td>100.0</td>
<td>100.0</td>
<td>100.0</td>
</tr>
</tbody>
</table>
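For reference, the Recovery (%) column appears to be the quantized model's score divided by the baseline score, times 100, truncated (not rounded) to two decimal places; the truncation is an assumption inferred from the table values, not something lm-eval reports. A minimal shell helper:

```shell
# recovery QUANTIZED BASELINE -> percentage of baseline accuracy retained,
# truncated to two decimal places (assumed convention of the table above).
recovery() {
  awk -v q="$1" -v b="$2" 'BEGIN { printf "%.2f\n", int(q / b * 10000) / 100 }'
}

recovery 20.82 32.16   # Gov Report -> 64.73
recovery 18.53 17.35   # 2WikimQA   -> 106.80
```

A value above 100 simply means the quantized model scored higher than the baseline on that task.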