Update README.md

README.md +59 -0

@@ -77,6 +77,65 @@ print(generated_text)
 vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+## Creation
+<details>
+  <summary>Creation details</summary>
+  This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.
+  ```python
+  from datasets import load_dataset
+  from transformers import AutoModelForCausalLM, AutoTokenizer
+  from llmcompressor.modifiers.quantization import QuantizationModifier
+  from llmcompressor.transformers import oneshot
+  # Load model
+  model_stub = "microsoft/phi-4"
+  model_name = model_stub.split("/")[-1]
+  num_samples = 1024
+  max_seq_len = 8192
+  tokenizer = AutoTokenizer.from_pretrained(model_stub)
+  model = AutoModelForCausalLM.from_pretrained(
+      model_stub,
+      device_map="auto",
+      torch_dtype="auto",
+  )
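+  # Render each calibration conversation to plain text with the model's chat template
+  def preprocess_fn(example):
+      return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}
+  # Load and preprocess the calibration dataset
+  ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
+  ds = ds.map(preprocess_fn)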
+  # Configure the quantization algorithm and scheme
+  recipe = QuantizationModifier(
+      targets="Linear",
+      scheme="W4A16",
+      ignore=["lm_head"],
+      dampening_frac=0.01,
+  )
+  # Apply quantization
+  oneshot(
+      model=model,
+      dataset=ds,
+      recipe=recipe,
+      max_seq_length=max_seq_len,
+      num_calibration_samples=num_samples,
+  )
+  # Save to disk in compressed-tensors format
+  save_path = model_name + "-quantized.w4a16"
+  model.save_pretrained(save_path)
+  tokenizer.save_pretrained(save_path)
+  print(f"Model and tokenizer saved to: {save_path}")
+  ```
+</details>
 ## Evaluation