Why does the model size appear to be 1B?
Just curious about the model size shown on the model card. How could a 30B model be condensed down to 1B? Is there a mistake?
This is a display bug in Hugging Face Spaces related to quantized models.
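If you want to check the real size yourself, here is a minimal sketch of where the displayed number likely comes from. As far as I know, the size badge is computed by summing tensor element counts from the safetensors metadata, and 4-bit weights are packed several-to-one into wider integer dtypes, so the naive count lands far below the true 30B. The "model.safetensors" path below is a placeholder (quantized checkpoints are usually sharded):

from math import prod
from safetensors import safe_open

# Sum the raw element counts recorded in the safetensors header.
# For 4-bit checkpoints the packed weight tensors hold several logical
# weights per stored element, so this total undercounts the real
# parameter count.
stored = 0
with safe_open("model.safetensors", framework="pt") as f:  # placeholder path
    for name in f.keys():
        stored += prod(f.get_slice(name).get_shape())

print(f"stored elements: {stored / 1e9:.2f}B (packed; real count is larger)")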
Thank you. Could you also explain why this step is needed after running the model with vLLM:
Generate the model
Please make sure you have installed the auto_round package from the correct branch:
pip install git+https://github.com/intel/auto-round.git@enable_glm4_moe_lite_quantization
auto_round \
    --model=zai-org/GLM-4.7-Flash \
    --scheme "W4A16" \
    --ignore_layers="shared_experts,layers.0.mlp" \
    --format=auto_round \
    --enable_torch_compile \
    --output_dir=./tmp_autoround
We have already run the model with vLLM, so why do we need this step? Sorry if this is an inconvenient question; I'm not familiar with auto_round. Thanks for your guidance.
This is because GLM-4.7 is a very new model, and support for it in libraries like Transformers may not have been fully in place when we uploaded this model. If you can use the quantized model without any issues, you can ignore these steps. Otherwise, if you run into problems, you can try the methods we provide or report the issue on our GitHub page.
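For completeness, if you ever need to re-run the quantization step yourself, it can also be driven from Python instead of the CLI. A minimal sketch, assuming auto_round's AutoRound(model, tokenizer, ...) constructor; check your installed version for the exact signature, and note the CLI's --ignore_layers option is omitted here:

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "zai-org/GLM-4.7-Flash"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# bits=4 with 16-bit activations corresponds to the CLI's "W4A16" scheme.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("./tmp_autoround", format="auto_round")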
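And as a quick sanity check that the quantized checkpoint loads and generates under vLLM, here is a minimal sketch; the model path is a placeholder for wherever the quantized weights live, either a local directory or a Hub repo id:

from vllm import LLM, SamplingParams

# Placeholder path: point this at the quantized checkpoint.
llm = LLM(model="./tmp_autoround")
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)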