## Model Details

This is a mixed INT4 quantization of [stepfun-ai/Step-3.5-Flash](https://huggingface.co/stepfun-ai/Step-3.5-Flash), generated by [intel/auto-round](https://github.com/intel/auto-round) in RTN mode with group_size 128 and symmetric quantization. Please follow the license of the original model.
## How To Use

Install vLLM:

```bash
pip install -U vllm==0.18.0 --torch-backend=auto
```
### INT4 Inference

Start a vLLM server:

```bash
vllm serve Intel/Step-3.5-Flash-int4-mixed-AutoRound \
    --served-model-name step3p5-flash-int4-mixed \
    --tensor-parallel-size 2 \
    --enable-expert-parallel \
    --disable-cascade-attn \
    --reasoning-parser step3p5 \
    --tool-call-parser step3p5 \
    --hf-overrides '{"num_nextn_predict_layers": 1}' \
    --speculative_config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}' \
    --trust-remote-code \
    --max-model-len 4096
```

Send a chat completion request:

```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "step3p5-flash-int4-mixed",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write code to fine-tune an LLM."}
    ],
    "temperature": 1,
    "max_tokens": 2048
}'
```
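The same request can be assembled from Python using only the standard library. This is a minimal sketch mirroring the curl payload above; the commented-out send step assumes the server from the previous section is listening on localhost:8000:

```python
import json

# Build the same chat-completions request body as the curl example above.
payload = {
    "model": "step3p5-flash-int4-mixed",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write code to fine-tune an LLM."},
    ],
    "temperature": 1,
    "max_tokens": 2048,
}
body = json.dumps(payload)

# To actually send it, the vLLM server above must be running, e.g.:
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:8000/v1/chat/completions",
#       data=body.encode(), headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read().decode())
```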
## Generate the Model

```bash
auto-round /workspace/models/stepfun-ai/Step-3.5-Flash/ \
    --iters 0 \
    --disable_opt_rtn \
    --scheme W4A16 \
    --ignore_layers "layers.0,layers.1,layers.2,layers.3,layers.4,layers.43,layers.44,layers.45,layers.46" \
    --output_dir /workspace/models/
```
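The `--ignore_layers` string above (layers 0–4 and 43–46 are left un-quantized) can be generated rather than typed by hand; the indices are taken verbatim from the command:

```python
# Layer indices excluded from quantization in the auto-round command above.
ignore_layers = [f"layers.{i}" for i in [*range(0, 5), *range(43, 47)]]
ignore_arg = ",".join(ignore_layers)
print(ignore_arg)
# → layers.0,layers.1,layers.2,layers.3,layers.4,layers.43,layers.44,layers.45,layers.46
```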

## Ethical Considerations and Limitations