Why does the model size appear to be 1B?
Just curious about the model size shown on the model card. How could a 30B model be condensed down to 1B? Is there a mistake?
This is a display bug in Hugging Face Spaces related to quantized models.
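If you want to check the real size yourself, here is a minimal sketch of where the displayed number likely comes from. As far as I know, the size badge is computed by summing tensor element counts from the safetensors metadata, and 4-bit weights are packed several-to-one into wider integer dtypes, so the naive count lands far below the true 30B. The "model.safetensors" path below is a placeholder (quantized checkpoints are usually sharded):

from math import prod
from safetensors import safe_open

# Sum the raw element counts recorded in the safetensors header.
# For 4-bit checkpoints the packed weight tensors hold several logical
# weights per stored element, so this total undercounts the real
# parameter count.
stored = 0
with safe_open("model.safetensors", framework="pt") as f:  # placeholder path
    for name in f.keys():
        stored += prod(f.get_slice(name).get_shape())

print(f"stored elements: {stored / 1e9:.2f}B (packed; real count is larger)")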
Thank you. Could you also explain why this step is needed after running the model with vLLM:
Generate the model
Please make sure you have installed the auto_round package from the correct branch:
pip install git+https://github.com/intel/auto-round.git@enable_glm4_moe_lite_quantization
auto_round \
    --model=zai-org/GLM-4.7-Flash \
    --scheme "W4A16" \
    --ignore_layers="shared_experts,layers.0.mlp" \
    --format=auto_round \
    --enable_torch_compile \
    --output_dir=./tmp_autoround
We have already run the model with vLLM, so why do we need this step? Sorry if this is an inconvenient question; I'm not familiar with auto_round. Thanks for your guidance.
This is because GLM-4.7 is a very new model, and support for it in libraries like Transformers may not have been fully in place when we uploaded this model. If you can use the quantized model without any issues, you can ignore these steps. Otherwise, if you run into problems, you can try the methods we provide or report the issue on our GitHub page.
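For completeness, if you ever need to re-run the quantization step yourself, it can also be driven from Python instead of the CLI. A minimal sketch, assuming auto_round's AutoRound(model, tokenizer, ...) constructor; check your installed version for the exact signature, and note the CLI's --ignore_layers option is omitted here:

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "zai-org/GLM-4.7-Flash"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# bits=4 with 16-bit activations corresponds to the CLI's "W4A16" scheme.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("./tmp_autoround", format="auto_round")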
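And as a quick sanity check that the quantized checkpoint loads and generates under vLLM, here is a minimal sketch; the model path is a placeholder for wherever the quantized weights live, either a local directory or a Hub repo id:

from vllm import LLM, SamplingParams

# Placeholder path: point this at the quantized checkpoint.
llm = LLM(model="./tmp_autoround")
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)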