add serving instructions for SGLang and TRT-LLM
## Inference:
**Engine:** vLLM/SGLang/TensorRT-LLM <br>
**Test Hardware:** B200 <br>

## Post Training Quantization

To serve this checkpoint with [vLLM](https://github.com/vllm-project/vllm), you can run the sample command below:

```sh
python3 -m vllm.entrypoints.openai.api_server --model nvidia/Kimi-K2-Thinking-NVFP4 --trust-remote-code --tensor-parallel-size 4
```
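Once the vLLM server is up, it exposes an OpenAI-compatible API (on port 8000 by default). As a minimal sketch, assuming the server is running locally, a chat-completion request could look like the following (the prompt and `max_tokens` value are illustrative):

```sh
# Hypothetical request; assumes the vLLM server above is running on its default port 8000
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/Kimi-K2-Thinking-NVFP4",
        "messages": [{"role": "user", "content": "What is NVFP4 quantization?"}],
        "max_tokens": 128
      }'
```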

To serve this checkpoint with [SGLang](https://github.com/sgl-project/sglang), you can start the docker `lmsysorg/sglang:latest` and run the sample command below:

```sh
python3 -m sglang.launch_server --model-path nvidia/Kimi-K2-Thinking-NVFP4 --tp 4 --quantization modelopt_fp4 --trust-remote-code
```
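SGLang likewise serves an OpenAI-compatible API, by default on port 30000. Assuming the server above is running locally, a quick sanity check could be:

```sh
# Hypothetical check; 30000 is SGLang's default serving port
curl http://localhost:30000/v1/models
```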

To serve this checkpoint with [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), please follow the instructions in the [deployment guide for Kimi-K2-Thinking on TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/deployment-guide/deployment-guide-for-kimi-k2-thinking-on-trtllm.md).

## Model Limitations:
The base model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic prompts. The model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable output even if the prompt itself does not include anything explicitly offensive.