marksverdhei/GLM-4.7-Flash-FP8

@@ -51,15 +51,23 @@ outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=100))
 print(outputs[0].outputs[0].text)
 ```
-### vLLM Patches Required
-Until upstream support is added, you may need to patch vLLM:
-1. Add `glm4_moe_lite` to MLA detection in `vllm/config/model.py`
-2. Add registry mapping in `vllm/model_executor/models/registry.py`:
-   ```python
-   "Glm4MoeLiteForCausalLM": ("deepseek_v2", "DeepseekV2ForCausalLM"),
-   ```
+### vLLM Fork Required
+Until upstream vLLM adds MLA detection for `glm4_moe_lite`, use our fork:
+```bash
+pip install git+https://github.com/marksverdhei/vllm.git@fix/glm4-moe-mla-detection
+```
+Or clone the fork and install it from source in editable mode:
+```bash
+git clone https://github.com/marksverdhei/vllm.git
+cd vllm
+git checkout fix/glm4-moe-mla-detection
+pip install -e .
+```
+**Fork**: [marksverdhei/vllm](https://github.com/marksverdhei/vllm/tree/fix/glm4-moe-mla-detection)
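After either install route it can help to check which vLLM build the environment actually picked up. The sketch below is not part of the change itself; it is a minimal illustration that reads the standard `direct_url.json` record pip writes for URL and VCS installs (PEP 610). The helper name `describe_install` is hypothetical, and the assumption is that vLLM was installed with a recent pip; a plain index install leaves no such record, so `source` and `revision` stay `None`.

```python
import json
from importlib import metadata


def describe_install(package: str = "vllm") -> dict:
    """Report where an installed package came from (hypothetical helper)."""
    try:
        dist = metadata.distribution(package)
    except metadata.PackageNotFoundError:
        return {"installed": False}

    info = {
        "installed": True,
        "version": dist.version,
        "source": None,      # URL pip installed from, if recorded
        "revision": None,    # branch or tag requested for a VCS install
        "editable": False,   # True for `pip install -e .`
    }
    # direct_url.json only exists for URL/VCS/local installs;
    # read_text returns None when the file is absent.
    raw = dist.read_text("direct_url.json")
    if raw:
        direct = json.loads(raw)
        info["source"] = direct.get("url")
        info["revision"] = direct.get("vcs_info", {}).get("requested_revision")
        info["editable"] = direct.get("dir_info", {}).get("editable", False)
    return info


if __name__ == "__main__":
    print(describe_install("vllm"))
```

For the `pip install git+...` route, `source` should point at the fork and `revision` at `fix/glm4-moe-mla-detection`. For the editable install, `source` is the local checkout path and `editable` is true, so the branch has to be confirmed with `git branch --show-current` inside the clone instead.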
 ## License