weiweiz1 committed on
Commit 46e6786 · verified · 1 Parent(s): 6e021d3

Update README.md
README.md CHANGED
---
base_model:
- zai-org/GLM-4.7-Flash
---
## Model Details

This model is an int4 model with group_size 128 and symmetric quantization of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), generated by [intel/auto-round](https://github.com/intel/auto-round). Please refer to the "Generate the model" section below for more details.

Please follow the license of the original model.
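To illustrate what this scheme means, here is a minimal, self-contained sketch of symmetric int4 quantization with one scale per group of 128 weights. This is plain round-to-nearest for illustration only, not auto-round's actual signed-gradient rounding algorithm:

```python
import numpy as np

def quantize_symmetric_int4(w: np.ndarray, group_size: int = 128):
    """Toy round-to-nearest symmetric int4 quantizer: one scale per
    contiguous group of `group_size` weights, and no zero-point."""
    groups = w.reshape(-1, group_size)
    # Symmetric scheme: the largest |w| in each group maps to the
    # positive int4 limit 7; integers are clipped to [-8, 7].
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(512).astype(np.float32)
q, scales = quantize_symmetric_int4(w)
print(int(q.min()), int(q.max()))  # integers stay within the int4 range
print(float(np.abs(dequantize(q, scales) - w).max()))  # worst-case error is half a scale step
```

The per-group scale bounds the rounding error to half a quantization step within each group of 128 weights, which is why group-wise schemes track outliers better than a single per-tensor scale.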
## How To Use

### INT4 Inference

#### Transformers (CPU/Intel GPU/CUDA)

**Please make sure you have installed the auto_round package from the correct branch:**

```bash
pip install git+https://github.com/intel/auto-round.git@enable_glm4_moe_lite_quantization
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the quantized model on the available device(s)
model_name = "Intel/GLM-4.7-Flash-int4-AutoRound"
model = AutoModelForCausalLM.from_pretrained(
    model_name, dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build a chat prompt and generate greedily
messages = [{"role": "user", "content": "hello"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
output_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1]:])
print(output_text)
"""
1. **Analyze the user's input:** The user said "hello". This is a standard greeting.

2. **Determine the intent:** The user is initiating a conversation. They want to know if I'm active and ready to help.

3. **Formulate the response:**
    * Acknowledge the greeting.
    * Offer assistance.
    * Keep it friendly and helpful.

4. **Drafting the response (internal monologue/trial):**
    * *Option 1:* Hello. How can I help? (Simple, direct)
    * *Option 2
"""
```
#### vLLM (CPU/Intel GPU/CUDA)

```bash
VLLM_USE_PRECOMPILED=1 pip install git+https://github.com/vllm-project/vllm.git@main
pip install git+https://github.com/huggingface/transformers.git
```

Start a vLLM server:

```bash
vllm serve Intel/GLM-4.7-Flash-int4-AutoRound \
    --host localhost \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name glm-4.7-flash \
    --tensor-parallel-size 4 \
    --port 4321
```
Send a request (note that the `model` field must match `--served-model-name`):

```bash
curl --noproxy '*' http://127.0.0.1:4321/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "glm-4.7-flash",
        "messages": [
            {"role": "user", "content": "hello"}
        ],
        "max_tokens": 256,
        "temperature": 0.6
    }'
```
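Since vLLM exposes an OpenAI-compatible endpoint, the same request can also be built programmatically. A minimal standard-library sketch, assuming the server above is running on port 4321 (the commented lines perform the actual call):

```python
import json
from urllib import request

# Same payload as the curl example; "glm-4.7-flash" matches --served-model-name.
payload = {
    "model": "glm-4.7-flash",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 256,
    "temperature": 0.6,
}

def build_chat_request(url: str = "http://127.0.0.1:4321/v1/chat/completions") -> request.Request:
    return request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request()
print(req.get_full_url(), len(req.data))
# Uncomment once the server is up:
# reply = json.load(request.urlopen(req))
# print(reply["choices"][0]["message"]["content"])
```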
### Generate the model

**Please make sure you have installed the auto_round package from the correct branch:**

```bash
pip install git+https://github.com/intel/auto-round.git@enable_glm4_moe_lite_quantization
```

```bash
auto_round \
    --model=zai-org/GLM-4.7-Flash \
    --scheme "W4A16" \
    --ignore_layers="shared_experts,layers.0.mlp" \
    --format=auto_round \
    --enable_torch_compile \
    --output_dir=./tmp_autoround
```
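As a rough sanity check on what W4A16 with group_size 128 buys, here is some illustrative back-of-the-envelope arithmetic. The parameter count below is a placeholder, and real checkpoints carry extra overhead (unquantized layers such as the ignored experts above, embeddings, metadata):

```python
def approx_weight_bytes(n_params: float, bits: int = 4, group_size: int = 128,
                        scale_bits: int = 16) -> float:
    """Approximate weight storage: packed integer weights plus one
    16-bit scale per group (a symmetric scheme stores no zero-points)."""
    total_bits = n_params * bits + (n_params / group_size) * scale_bits
    return total_bits / 8

n = 10e9  # placeholder parameter count, not GLM-4.7-Flash's actual size
int4_gb = approx_weight_bytes(n) / 1e9
bf16_gb = n * 2 / 1e9  # 16-bit baseline: 2 bytes per weight
print(round(int4_gb, 2), round(bf16_gb, 2), round(bf16_gb / int4_gb, 2))
# prints: 5.16 20.0 3.88
```

The per-group scales cost only 16/128 = 0.125 extra bits per weight, so the compression ratio stays close to the ideal 4x of pure int4.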
## Ethical Considerations and Limitations

The model can produce factually incorrect output and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased, or otherwise offensive outputs.

Therefore, before deploying any applications of the model, developers should perform safety testing.

## Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

Here is a useful link to learn more about Intel's AI software:

- Intel Neural Compressor [link](https://github.com/intel/neural-compressor)

## Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

## Cite

```bibtex
@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}
```

[arxiv](https://arxiv.org/abs/2309.05516) [github](https://github.com/intel/auto-round)