xinhe committed · Commit 8cf2bdf · verified · Parent(s): 6a25a97

Update README.md

Files changed (1): README.md (+12 −8)
## Model Details

This model is a mixed int4 model with group_size 128 and symmetric quantization of [stepfun-ai/Step-3.5-Flash](https://huggingface.co/stepfun-ai/Step-3.5-Flash), generated by [intel/auto-round](https://github.com/intel/auto-round) using RTN mode. Please follow the license of the original model.

## How To Use
Install vLLM first:

```bash
pip install -U vllm==0.18.0 --torch-backend=auto
```

### INT4 Inference

Start a vLLM server:

```bash
vllm serve Intel/Step-3.5-Flash-int4-mixed-AutoRound \
  --served-model-name step3p5-flash-int4-mixed \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --disable-cascade-attn \
  --reasoning-parser step3p5 \
  --tool-call-parser step3p5 \
  --hf-overrides '{"num_nextn_predict_layers": 1}' \
  --speculative_config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}' \
  --trust-remote-code \
  --max-model-len 4096
```

Query the server:

```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "step3p5-flash-int4-mixed",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write code to fine-tune an LLM."}
  ],
  "temperature": 1,
  "max_tokens": 2048
}'
```
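
For programmatic use, the same request body can be assembled in Python and posted to the server's OpenAI-compatible endpoint. A minimal sketch using only the standard library (the helper name `build_chat_request` is illustrative, not part of vLLM; it assumes the server above is running on `localhost:8000`):

```python
import json
import urllib.request

def build_chat_request(user_prompt: str) -> dict:
    """Assemble the same JSON body used in the curl example above."""
    return {
        "model": "step3p5-flash-int4-mixed",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 1,
        "max_tokens": 2048,
    }

payload = build_chat_request("Write code to fine-tune an LLM.")
print(json.dumps(payload, indent=2))

# POST to the running vLLM server (uncomment once the server is up):
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# reply = json.loads(urllib.request.urlopen(req).read())
# print(reply["choices"][0]["message"]["content"])
```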

## Generate the Model

```bash
auto-round /workspace/models/stepfun-ai/Step-3.5-Flash/ --iters 0 --disable_opt_rtn --scheme W4A16 --ignore_layers "layers.0,layers.1,layers.2,layers.3,layers.4,layers.43,layers.44,layers.45,layers.46" --output_dir /workspace/models/
```
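
The `--ignore_layers` argument above keeps the first five and the last four decoder layers (`layers.0`–`layers.4` and `layers.43`–`layers.46`) out of int4 quantization. If you need to regenerate that comma-separated list, a small Python sketch (purely a convenience, not part of auto-round):

```python
# First five and last four decoder layers stay in higher precision.
ignore_layers = ",".join(
    [f"layers.{i}" for i in range(5)] + [f"layers.{i}" for i in range(43, 47)]
)
print(ignore_layers)
# layers.0,layers.1,layers.2,layers.3,layers.4,layers.43,layers.44,layers.45,layers.46
```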

## Ethical Considerations and Limitations