---
license: cc-by-nc-4.0
base_model:
- CohereForAI/c4ai-command-r-plus
---

# c4ai-command-r-plus-FP8-KV
## Introduction
This model was created by applying [Quark](https://quark.docs.amd.com/latest/index.html) with calibration samples from the Pile dataset.
## Quantization Strategy
- ***Quantized Layers***: All linear layers excluding "lm_head"
- ***Weight***: FP8 symmetric per-tensor
- ***Activation***: FP8 symmetric per-tensor
- ***KV Cache***: FP8 symmetric per-tensor
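To make the per-tensor scheme concrete, the following is a minimal, hypothetical Python sketch of symmetric per-tensor FP8 (E4M3) scaling. It models only the scale-and-clamp step, ignores the mantissa rounding a real FP8 cast applies, and is not Quark's actual implementation:

```python
def fp8_per_tensor_quantize(x):
    """Symmetric per-tensor scaling into the FP8 E4M3 range.

    Simplified sketch: clamps values to the E4M3 dynamic range but
    skips the mantissa rounding a real FP8 cast would apply.
    """
    FP8_E4M3_MAX = 448.0  # largest finite E4M3 value
    amax = max(abs(v) for v in x)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in x]
    return q, scale

def fp8_per_tensor_dequantize(q, scale):
    """Map scaled values back to the original range."""
    return [v * scale for v in q]

weights = [0.3, -1.7, 0.02, 0.9]
q, scale = fp8_per_tensor_quantize(weights)
restored = fp8_per_tensor_dequantize(q, scale)
```

Because a single scale covers the whole tensor ("per-tensor"), one outlier value stretches the range for every element, which is why calibration data is used to pick representative activation ranges.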
## Quick Start
1. [Download and install Quark](https://quark.docs.amd.com/latest/install.html)
2. Run the quantization script in the example folder using the following command line:
```sh
export MODEL_DIR="[local model checkpoint folder]"  # or CohereForAI/c4ai-command-r-plus
# single GPU
python3 quantize_quark.py \
    --model_dir $MODEL_DIR \
    --output_dir c4ai-command-r-plus-FP8-KV \
    --quant_scheme w_fp8_a_fp8 \
    --kv_cache_dtype fp8 \
    --num_calib_data 128 \
    --model_export quark_safetensors
# If the model is too large for a single GPU, use multiple GPUs instead.
python3 quantize_quark.py \
    --model_dir $MODEL_DIR \
    --output_dir c4ai-command-r-plus-FP8-KV \
    --quant_scheme w_fp8_a_fp8 \
    --kv_cache_dtype fp8 \
    --num_calib_data 128 \
    --model_export quark_safetensors \
    --multi_gpu
```
## Deployment
Quark has its own export format that allows FP8-quantized models to be deployed efficiently with the vLLM backend (vLLM-compatible).
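As a rough illustration, a checkpoint exported this way might be served with vLLM's CLI. This is a sketch only: the exact flags and the level of Quark/FP8 checkpoint support depend on your vLLM version, so check `vllm serve --help` before relying on it.

```sh
# Sketch: assumes a vLLM build with FP8 weight and KV-cache support.
# Flag names follow recent vLLM releases; verify against your installed version.
vllm serve ./c4ai-command-r-plus-FP8-KV \
    --quantization fp8 \
    --kv-cache-dtype fp8
```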
## Evaluation
Quark currently uses perplexity (PPL) as the metric for measuring accuracy loss before and after quantization. The specific PPL algorithm can be found in quantize_quark.py.
The quantization evaluation results are obtained in pseudo-quantization mode, which may differ slightly from the accuracy of actual quantized inference. These results are provided for reference only.
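The metric itself is simple to state: perplexity is the exponential of the average per-token negative log-likelihood. A minimal sketch of the formula (not the quantize_quark.py implementation):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns probability 1/4 to every token has perplexity 4.
ppl = perplexity([math.log(0.25)] * 10)
```

Lower is better: a drop in PPL after quantization (as in the table below) means the pseudo-quantized model assigns, on average, at least as much probability to the reference text as the baseline.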
#### Evaluation scores
<table>
  <tr>
    <td><strong>Benchmark</strong></td>
    <td><strong>c4ai-command-r-plus</strong></td>
    <td><strong>c4ai-command-r-plus-FP8-KV (this model)</strong></td>
  </tr>
  <tr>
    <td>Perplexity-wikitext2</td>
    <td>4.3829</td>
    <td>4.3253</td>
  </tr>
</table>

#### License
Copyright (c) 2018-2024 Advanced Micro Devices, Inc. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.