QuantFactory/LongWriter-glm4-9b-GGUF

@@ -30,6 +30,7 @@ This is quantized version of [THUDM/LongWriter-glm4-9b](https://huggingface.co/T
 LongWriter-glm4-9b is trained based on [glm-4-9b](https://huggingface.co/THUDM/glm-4-9b), and is capable of generating 10,000+ words at once.
 A simple demo for deployment of the model:
 ```python
@@ -39,20 +40,39 @@ tokenizer = AutoTokenizer.from_pretrained("THUDM/LongWriter-glm4-9b", trust_remo
 model = AutoModelForCausalLM.from_pretrained("THUDM/LongWriter-glm4-9b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
 model = model.eval()
 query = "Write a 10000-word China travel guide"
-prompt = f"[INST]{query}[/INST]"
-input = tokenizer(prompt, truncation=False, return_tensors="pt").to(device)
-context_length = input.input_ids.shape[-1]
-output = model.generate(
-    **input,
-    max_new_tokens=32768,
-    num_beams=1,
-    do_sample=True,
-    temperature=0.5,
-)[0]
-response = tokenizer.decode(output[context_length:], skip_special_tokens=True)
 print(response)
 ```
-Environment: `transformers==4.43.0`
 License: [glm-4-9b License](https://huggingface.co/THUDM/glm-4-9b-chat/blob/main/LICENSE)

 LongWriter-glm4-9b is trained based on [glm-4-9b](https://huggingface.co/THUDM/glm-4-9b), and is capable of generating 10,000+ words at once.
+Environment: Same environment requirements as [glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat) (`transformers>=4.43.0`).
 A simple demo for deployment of the model:
 ```python
 model = AutoModelForCausalLM.from_pretrained("THUDM/LongWriter-glm4-9b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
 model = model.eval()
 query = "Write a 10000-word China travel guide"
+response, history = model.chat(tokenizer, query, history=[], max_new_tokens=32768, temperature=0.5)
 print(response)
 ```
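Because the model is described as generating 10,000+ words, it can help to check how long a response actually is. The following is a minimal sketch, assuming whitespace-separated words are an adequate proxy for English text; `count_words` and `sample_response` are hypothetical names and not part of the model's API:

```python
def count_words(text: str) -> int:
    """Count whitespace-separated words; a rough proxy for English output length."""
    return len(text.split())


# Hypothetical stand-in for the `response` string returned by model.chat(...)
sample_response = "China is a vast country with a long history."
print(count_words(sample_response))
```

In practice you would pass the real `response` instead of the stand-in string.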
+You can also deploy the model with [vllm](https://github.com/vllm-project/vllm), which can generate 10,000+ words within a minute. Example code:
+```python
+from vllm import LLM, SamplingParams
+model = LLM(
+    model="THUDM/LongWriter-glm4-9b",
+    dtype="auto",
+    trust_remote_code=True,
+    tensor_parallel_size=1,
+    max_model_len=32768,
+    gpu_memory_utilization=1,
+)
+tokenizer = model.get_tokenizer()
+stop_token_ids = [tokenizer.eos_token_id, tokenizer.get_command("<|user|>"), tokenizer.get_command("<|observation|>")]
+generation_params = SamplingParams(
+    temperature=0.5,
+    top_p=0.8,
+    top_k=50,
+    max_tokens=32768,
+    repetition_penalty=1,
+    stop_token_ids=stop_token_ids
+)
+query = "Write a 10000-word China travel guide"
+input_ids = tokenizer.build_chat_input(query, history=[], role='user').input_ids[0].tolist()
+outputs = model.generate(
+    sampling_params=generation_params,
+    prompt_token_ids=[input_ids],
+)
+output = outputs[0]
+print(output.outputs[0].text)
+```
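In the vLLM example, `max_model_len` and `max_tokens` are both 32768. As far as we understand vLLM's behavior, the prompt and the generated tokens share the `max_model_len` window, so the effective generation limit is reduced by the prompt length. A minimal sketch of that arithmetic follows; `generation_budget` is a hypothetical helper, not a vLLM function:

```python
def generation_budget(prompt_len: int, max_model_len: int = 32768, max_tokens: int = 32768) -> int:
    """Return how many tokens can be generated before hitting either limit.

    Assumes prompt tokens and generated tokens share the max_model_len window.
    """
    if prompt_len < 0:
        raise ValueError("prompt_len must be non-negative")
    remaining_context = max_model_len - prompt_len
    # The tighter of the two limits applies; never report a negative budget.
    return max(0, min(max_tokens, remaining_context))


# Example: a 20-token prompt leaves 32748 tokens for generation.
print(generation_budget(prompt_len=20))
```

With the example above, `len(input_ids)` would serve as `prompt_len`.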
 License: [glm-4-9b License](https://huggingface.co/THUDM/glm-4-9b-chat/blob/main/LICENSE)