Commit 82eb10c (verified) by ekurtic · 1 parent: 4197cf0

Update README.md

  1. README.md +90 -0
README.md CHANGED
@@ -43,6 +43,96 @@ This optimization reduces the number of bits used to represent weights and activ
  Weight quantization also reduces disk size requirements by approximately 50%.


+ ## Creation
+ <details>
+ This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the command below, which invokes the `quantize.py` script listed after it.
+
+ ```bash
+ python quantize.py --model_path mistralai/Devstral-Small-2507 --calib_size 512 --dampening_frac 0.05
+ ```
+
+ ```python
+ import argparse
+ import os
+ from datasets import load_dataset
+ from transformers import AutoModelForCausalLM
+ from llmcompressor.modifiers.quantization import GPTQModifier
+ from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
+ from llmcompressor.transformers import oneshot
+ from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
+ from mistral_common.protocol.instruct.request import ChatCompletionRequest
+ from mistral_common.protocol.instruct.messages import (
+     SystemMessage, UserMessage
+ )
+
+ def load_system_prompt(repo_id: str, filename: str) -> str:
+     # Reads from the local filesystem, so repo_id must be a directory that contains filename.
+     file_path = os.path.join(repo_id, filename)
+     with open(file_path, "r") as file:
+         system_prompt = file.read()
+     return system_prompt
+
+ parser = argparse.ArgumentParser()
+ parser.add_argument('--model_path', type=str)
+ parser.add_argument('--calib_size', type=int, default=256)
+ parser.add_argument('--dampening_frac', type=float, default=0.1)
+ args = parser.parse_args()
+
+ model = AutoModelForCausalLM.from_pretrained(
+     args.model_path,
+     device_map="auto",
+     torch_dtype="auto",
+     use_cache=False,
+     trust_remote_code=True,
+ )
+
+ # Calibration data: a seeded random subset of Open-Platypus.
+ ds = load_dataset("garage-bAInd/Open-Platypus", split="train")
+ ds = ds.shuffle(seed=42).select(range(args.calib_size))
+
+ SYSTEM_PROMPT = load_system_prompt(args.model_path, "SYSTEM_PROMPT.txt")
+ tokenizer = MistralTokenizer.from_hf_hub("mistralai/Devstral-Small-2507")
+
+ def tokenize(sample):
+     # Encode each instruction as a chat request, with the system prompt first.
+     tmp = tokenizer.encode_chat_completion(
+         ChatCompletionRequest(
+             messages=[
+                 SystemMessage(content=SYSTEM_PROMPT),
+                 UserMessage(content=sample['instruction']),
+             ],
+         )
+     )
+     return {'input_ids': tmp.tokens}
+
+ ds = ds.map(tokenize, remove_columns=ds.column_names)
+
+ recipe = [
+     # Smooth activation outliers into the weights before quantizing.
+     SmoothQuantModifier(
+         smoothing_strength=0.8,
+         mappings=[
+             [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
+             [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
+             [["re:.*down_proj"], "re:.*up_proj"],
+         ],
+     ),
+     # Quantize every Linear layer except lm_head with the W8A8 scheme.
+     GPTQModifier(
+         targets=["Linear"],
+         ignore=["lm_head"],
+         scheme="W8A8",
+         dampening_frac=args.dampening_frac,
+     )
+ ]
+ oneshot(
+     model=model,
+     dataset=ds,
+     recipe=recipe,
+     num_calibration_samples=args.calib_size,
+     max_seq_length=8192,
+ )
+
+ save_path = args.model_path + "-quantized.w8a8"
+ model.save_pretrained(save_path)
+ ```
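The command passed to the script overrides two of its defaults: `--calib_size` (256 by default) and `--dampening_frac` (0.1 by default). The sketch below repeats only the argument-parsing part of the script and feeds it an explicit argument list instead of `sys.argv`, to show which values the run ends up using. The `build_parser` helper is a hypothetical wrapper introduced for illustration and is not part of the original script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Same three flags and defaults as the quantization script.
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", type=str)
    parser.add_argument("--calib_size", type=int, default=256)
    parser.add_argument("--dampening_frac", type=float, default=0.1)
    return parser


# Parse the flags from the README command instead of reading sys.argv.
args = build_parser().parse_args([
    "--model_path", "mistralai/Devstral-Small-2507",
    "--calib_size", "512",
    "--dampening_frac", "0.05",
])

# With no flags at all, the defaults apply and model_path is None.
defaults = build_parser().parse_args([])

print(args.model_path, args.calib_size, args.dampening_frac)
print(defaults.model_path, defaults.calib_size, defaults.dampening_frac)
```

Since `--model_path` has no default, omitting it leaves `args.model_path` as `None`, so the script only works when that flag is supplied.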
+ </details>
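The recipe runs `SmoothQuantModifier` with `smoothing_strength=0.8` before GPTQ. As a rough illustration of what smoothing does, the sketch below applies the per-channel scaling from the SmoothQuant method, `s_j = max|X_j|^a / max|W_j|^(1 - a)`, to a toy matrix product in plain Python. The toy data, the function names, and the list-based matrix multiply are all assumptions made for illustration. In the SmoothQuant method the activation scaling is typically folded into the preceding layer rather than applied at run time, which this sketch does not model.

```python
def smoothquant_scales(acts, weights, alpha=0.8):
    """Per-input-channel scales: max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    scales = []
    for j in range(len(weights)):
        act_max = max(abs(row[j]) for row in acts)
        w_max = max(abs(w) for w in weights[j])
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(row[k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for row in a
    ]


# Toy activations (2 tokens x 3 channels); channel 1 carries large outliers.
acts = [[0.5, 40.0, -1.0], [-0.25, -32.0, 2.0]]
# Toy weights (3 input channels x 2 output channels).
weights = [[0.2, -0.1], [0.05, 0.3], [-0.4, 0.1]]

scales = smoothquant_scales(acts, weights, alpha=0.8)
# Divide each activation channel by its scale, multiply the matching weight row.
smoothed_acts = [[x / s for x, s in zip(row, scales)] for row in acts]
smoothed_weights = [[w * s for w in row] for row, s in zip(weights, scales)]

original = matmul(acts, weights)
smoothed = matmul(smoothed_acts, smoothed_weights)
outlier_before = max(abs(row[1]) for row in acts)
outlier_after = max(abs(row[1]) for row in smoothed_acts)
```

The product is unchanged up to floating-point error, while the outlier channel's magnitude shrinks from 40 to under 2, which makes the activations easier to represent in 8 bits.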
+
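The `W8A8` scheme named in the recipe uses 8-bit weights and 8-bit activations. The sketch below shows generic symmetric round-to-nearest quantization of a single row of values, assuming signed 8-bit integers. It is not the GPTQ algorithm the script uses: GPTQ additionally adjusts the not-yet-quantized weights to compensate for rounding error, guided by the calibration data. The function names and sample values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric quantization: the largest magnitude maps to 127."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the signed 8-bit range.
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate real values."""
    return [x * scale for x in q]


row = [0.12, -0.5, 0.031, 0.27]
q, scale = quantize_int8(row)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))

print(q)
print(max_error <= scale / 2 + 1e-12)
```

Each value is stored in one byte plus a shared scale, and the round-trip error per value is bounded by half the scale.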
  ## Deployment

  This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.