## Model Details

This is a mixed INT4 quantization of [stepfun-ai/Step-3.5-Flash](https://huggingface.co/stepfun-ai/Step-3.5-Flash), generated by [intel/auto-round](https://github.com/intel/auto-round) in RTN mode with group_size 128 and symmetric quantization. Please follow the license of the original model.
## How To Use

Install vLLM:

```bash
pip install -U vllm==0.18.0 --torch-backend=auto
```
### INT4 Inference

Start a vLLM server:

```bash
vllm serve Intel/Step-3.5-Flash-int4-mixed-AutoRound \
    --served-model-name step3p5-flash-int4-mixed \
    --tensor-parallel-size 2 \
    --enable-expert-parallel \
    --disable-cascade-attn \
    --reasoning-parser step3p5 \
    --tool-call-parser step3p5 \
    --hf-overrides '{"num_nextn_predict_layers": 1}' \
    --speculative_config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}' \
    --trust-remote-code \
    --max-model-len 4096
```

Send a chat completion request:

```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "step3p5-flash-int4-mixed",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write code to fine-tune an LLM."}
    ],
    "temperature": 1,
    "max_tokens": 2048
}'
```
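The same request can be assembled from Python using only the standard library. This is a minimal sketch mirroring the curl payload above; the commented-out send step assumes the server from the previous section is listening on localhost:8000:

```python
import json

# Build the same chat-completions request body as the curl example above.
payload = {
    "model": "step3p5-flash-int4-mixed",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write code to fine-tune an LLM."},
    ],
    "temperature": 1,
    "max_tokens": 2048,
}
body = json.dumps(payload)

# To actually send it, the vLLM server above must be running, e.g.:
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:8000/v1/chat/completions",
#       data=body.encode(), headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read().decode())
```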
## Generate the Model

```bash
auto-round /workspace/models/stepfun-ai/Step-3.5-Flash/ \
    --iters 0 \
    --disable_opt_rtn \
    --scheme W4A16 \
    --ignore_layers "layers.0,layers.1,layers.2,layers.3,layers.4,layers.43,layers.44,layers.45,layers.46" \
    --output_dir /workspace/models/
```
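The `--ignore_layers` string above (layers 0–4 and 43–46 are left un-quantized) can be generated rather than typed by hand; the indices are taken verbatim from the command:

```python
# Layer indices excluded from quantization in the auto-round command above.
ignore_layers = [f"layers.{i}" for i in [*range(0, 5), *range(43, 47)]]
ignore_arg = ",".join(ignore_layers)
print(ignore_arg)
# → layers.0,layers.1,layers.2,layers.3,layers.4,layers.43,layers.44,layers.45,layers.46
```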

## Ethical Considerations and Limitations