Can the M2 Ultra Mac Pro with 192GB memory run this model?

#1
by HanningLiu - opened
MLX Community org

OK, it cannot run. I got this answer after asking an LLM:

Based on the information I found, the M2 Ultra Mac Pro with 192GB would NOT be able to run this GLM-4.7-8bit-gs32 model. Here's why:

  Model Specifications

  GLM-4.7 is a massive model with approximately 355 billion parameters (355B), making it one of the largest open-source models available.

  Memory Requirements Breakdown:

  | Quantization       | Memory Required |
  |--------------------|-----------------|
  | 4-bit              | ~165-205 GB     |
  | 8-bit (your model) | ~330-410 GB     |
  | Full BF16          | ~1.28 TB        |

  Why 192GB isn't enough

  The 8-bit quantized version you're looking at requires approximately:
  - ~355GB for weights alone (355B parameters × 1 byte per parameter)
  - Plus additional memory for KV cache, activations, and runtime overhead
  - Total: ~360-400GB+

  This exceeds your 192GB unified memory capacity.
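The arithmetic behind the breakdown above can be sketched in a few lines (weights only; KV cache and runtime overhead come on top):

```python
def weights_gb(params_billion: float, bits: int) -> float:
    """Memory for the weights alone: parameters x bits per weight / 8 bits per byte."""
    return params_billion * bits / 8

print(weights_gb(355, 8))  # -> 355.0 GB, already over 192 GB before any overhead
print(weights_gb(355, 4))  # -> 177.5 GB, which would just fit in 192 GB
```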

  Your Options

  1. Use a more aggressively quantized version: The 4-bit GGUF version (~165-205GB) would fit in your 192GB memory
  2. Offload to disk: You could run the model with partial CPU/offloading, though it would be significantly slower
  3. Consider a smaller model: Models like Llama 3.1 70B or Qwen 2.5 72B would run very comfortably on your system

  The model card even mentions "512 GB Mac Studio" as the target hardware for this quantization level, confirming that 192GB would not be sufficient for this 8-bit version.

  Would you like help finding a more suitable quantization of GLM-4.7 or an alternative model that would work well with your 192GB Mac Pro?
HanningLiu changed discussion status to closed
MLX Community org
β€’
edited Jan 2

Yes, even the 4-bit quant I created at https://huggingface.co/mlx-community/GLM-4.7-4bit is about 200 GB in size. This means the largest you will be able to run in 192 GB is a 3-bit quant. It should be about 155 GB.
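Those sizes follow from simple arithmetic. A back-of-the-envelope sketch, assuming MLX's affine quantization stores a 16-bit scale and a 16-bit bias per group of 64 weights (so each weight effectively costs its bit width plus 0.5 extra bits):

```python
def quant_size_gb(params_billion: float, bits: int, group_size: int = 64) -> float:
    """Approximate model size: weights plus per-group scale/bias metadata."""
    effective_bits = bits + 32 / group_size  # 16-bit scale + 16-bit bias per group
    return params_billion * effective_bits / 8

print(round(quant_size_gb(355, 4)))  # -> 200, matching the ~200 GB 4-bit quant
print(round(quant_size_gb(355, 3)))  # -> 155, the predicted 3-bit size
```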

MLX Community org
β€’
edited Jan 2

I read somewhere that GLM 4.7 is better than GLM 4.6 as long as you stay at 4-bit or higher. At 3-bit, evidently, the older GLM 4.6 is better: GLM 4.7 seems to be more sensitive to quantization than GLM 4.6.
Would you be interested in a 3-bit quant?

Although it is actually really easy to create your own versions with MLX.

MLX Community org

Although it is actually really easy to create your own versions with MLX.

@bibproj Could you please share a link to a tutorial? Would love to learn more about it... 🫡

MLX Community org

@McG-221

Hi Barney. I'll assume you are currently using something like mlx-my-repo to create quants of smaller models. That works until it doesn't: it tends to end up with errors on larger or newer models. I'll further assume you use something like LM Studio or Ollama to download and run the created quants. This is a very good starting point, but with time you start running into its limitations and have to learn a few more tools.

If some of my assumptions are wrong, just jump into the list below at the point where you actually are.

  1. On your Mac, open the Terminal app. Enlarge the window as the default size is probably too small.

  2. You will need to install Python. There are many ways to do this; my method is to use brew.
    If you do not have brew installed yet:

  • Type xcode-select --install (to install the tools that brew needs)
  • Type (or just copy and paste) /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)" to install brew
  • To ensure your Mac can always find it, type
(echo; echo 'eval "$(/opt/homebrew/bin/brew shellenv)"') >> /Users/yourusername/.zprofile
eval "$(/opt/homebrew/bin/brew shellenv)"
  • Now install python 3.11 using brew, with brew install python@3.11 (Version 3.11 seems to work fine for everything.)
  3. Now you will need to install MLX. Type pip3.11 install -U mlx mlx-lm llm-mlx mlx-vlm mlx-metal to install the latest versions of mlx-lm (for text LLMs) and mlx-vlm (for vision LLMs).

  4. You also need something to download the models with. I use hfdownloader (found at https://github.com/bodaay/HuggingFaceModelDownloader). Type bash <(curl -sSL https://g.bodaay.io/hfd) -i to install this. Read the front page of the GitHub repository to get an idea of how to use the browser UI.

OK, that was the difficult part. Now you have all the tools you need to get going.


Let's assume you want to create the 3-bit quant as described in the earlier messages. The original full-quality model is found at https://huggingface.co/zai-org/GLM-4.7

  1. Create a folder where you will place all your downloaded models by typing mkdir models (models is the name of your folder). Go to the folder by typing cd models

  2. Download the original full-quality model into that folder using hfdownloader. Let's assume hfdownloader put it into a folder called zai-org_GLM-4.7 (close enough to the original zai-org/GLM-4.7 repository on Hugging Face).

  3. Now create your 3-bit quant. Type
    mlx_lm.convert --hf-path zai-org_GLM-4.7 --mlx-path GLM-4.7-mlx-3bit -q --q-bits 3
    Your 3-bit quant will be created in the folder called GLM-4.7-mlx-3bit you specified.

  4. You can now use your 3-bit quant by typing
    mlx_lm.generate --model GLM-4.7-mlx-3bit --prompt "Write a simple to-do program."

  5. If you want to use your quant in LM Studio, you can link to it from LM Studio with something like
    ln -s "$(pwd)/GLM-4.7-mlx-3bit" /Users/youruser/.lmstudio/models/zai-org/GLM-4.7-mlx-3bit
    (use an absolute path as the link source, or the symlink will not resolve). You will have to search online for some more details on which models folder LM Studio uses on your machine.

MLX Community org

You can also upload your 3-bit quant to the mlx-community so that others can benefit from all your hard work.

On Hugging Face, at the top right, click on your profile picture. In the dropdown menu, select New Model.

MLX Community org

Remember to occasionally update MLX to the latest versions by typing pip3.11 install -U mlx mlx-lm llm-mlx mlx-vlm mlx-metal (The -U is for update)

MLX Community org

@bibproj You were right with all your assumptions, I'll give it a try on the weekend. Thanks a lot, much appreciated! πŸ™Œ

MLX Community org

Thank you for the discussion! I tried the MiniMax-M2.1-4bit model, and it takes about 128 GB of memory, which leaves me enough room for other tasks. I think I will use it!

MLX Community org
β€’
edited Jan 6

@bibproj @nightmedia I've completed most of the above steps successfully, but now I wonder about mlx_lm.convert -- its --help output is missing a complete list of the available bit widths. So... Q2-Q8 (but no Q7?) and then mixed quants... a few are named. Is there a complete list of which quantizations are available? I searched via Google, but found no results on how to actually use this efficiently (other than the 4-bit standard)... 😳

Thanks in advance! πŸ‘‹

MLX Community org

Very good!

Actually the --help is quite complete. MLX does NOT have all the variations that GGUF files have with llama.cpp.

Here are some of the ones I use:

mlx_lm.convert --hf-path MODEL --mlx-path MODEL-mlx-bf16 --dtype bfloat16
You can use float16, bfloat16, and float32 with --dtype. I use bfloat16 if I have small models that I want to run at maximum quality.

mlx_lm.convert --hf-path MODEL --mlx-path MODEL-mlx-4bit -q --q-bits 4
mlx_lm.convert --hf-path MODEL --mlx-path MODEL-mlx-8bit -q --q-bits 8
You can use 2, 3, 4, 5, 6, and 8 bits with --q-bits
You are correct that there is no Q7!

mlx_lm.convert --hf-path MODEL --mlx-path MODEL-mlx-mixed_4_6 -q --quant-predicate mixed_4_6
With --quant-predicate you can use mixed_2_6, mixed_3_4, mixed_3_6, and mixed_4_6.
I sometimes use mixed_4_6 to create a slightly better 4-bit quant. It mixes in a few % of 6-bit into your 4-bit quant.

Nowadays I can mix my own quality (like a DQ5_K_M or a DQ3_K_M), but for that you have to modify a few lines of the above mixed MLX code. I always document this code in the models where I used it.
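For the curious: in mlx-lm a quant predicate is essentially a function that decides the quantization settings per layer, so a custom mix like the one hinted at above boils down to a few lines. The signature and layer names below are assumptions modeled on mlx-lm's built-in mixed_* predicates, not the exact API; check the mlx-lm source before using this for real:

```python
# Hypothetical per-layer rule: 5-bit for attention projections, 3-bit elsewhere.
def my_predicate(path, module=None, config=None):
    """Return quantization settings for the layer at the given weight path."""
    if "self_attn" in path:
        return {"bits": 5, "group_size": 64}
    return {"bits": 3, "group_size": 64}

print(my_predicate("model.layers.0.self_attn.q_proj"))  # attention -> 5-bit
print(my_predicate("model.layers.0.mlp.gate_proj"))     # everything else -> 3-bit
```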

mlx_lm.convert --hf-path MODEL --mlx-path MODEL-mlx-8bit-gs32 -q --q-bits 8 --q-group-size 32
Finally you can also add --q-group-size 32 to get a slightly more accurate model. I normally use this with 8-bit quants to try and improve it to the last bits. Excuse the pun. I then normally add -gs32 at the end of the quant's name to indicate that I have used this. There are users who especially go looking for such quants.
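To see what gs32 costs: each group of weights stores its own scale and bias, so halving the group size doubles that metadata. A sketch, assuming a 16-bit scale and 16-bit bias per group:

```python
def effective_bits(bits: int, group_size: int) -> float:
    """Bits actually spent per weight, including per-group scale/bias metadata."""
    return bits + 32 / group_size  # 16-bit scale + 16-bit bias per group

print(effective_bits(8, 64))  # default group size -> 8.5 bits per weight
print(effective_bits(8, 32))  # -gs32 -> 9.0 bits per weight, roughly 6% larger
```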

This is it. You have now graduated to the next level. πŸ˜„

Also consider that you can encode with --quant-mode mxfp4, which does really well on GLMs. I have metrics on the model cards to show what to expect:

https://huggingface.co/nightmedia/unsloth-GLM-4.5-Air-mxfp4-mlx

It does take forever, a couple hours at least, and most of the RAM

MLX Community org

Thank you @nightmedia !

By the way, MLX now also has mxfp8

MLX Community org

Yabba Dabba Doo! πŸ₯³
Thanks for all your great info, very much appreciated and easy to understand for a layperson! Now I even know what the "mysterious gs32" stands for 🙈
I'm currently using the HF Downloader; it works, but is very slow -- is this just my internet again, or shouldn't I be using the default settings (3 concurrent files with 8 connections)? Any tips?

Thanks again and have a great day! πŸ™Œ

MLX Community org

It does take forever, a couple hours at least, and most of the RAM

"How much RAM?"

"Yes." πŸ™ˆ

MLX Community org

@nightmedia

I use LLMs for translations. Looking at your list of recommendations at https://huggingface.co/nightmedia/unsloth-GLM-4.5-Air-mxfp4-mlx, where would translations best fit into that list?

Do you have any thoughts on where mxfp8 would work the best?

awwww, how did I miss that? supported, no less... I am so falling behind the times with all the recent merges.

Thanks for the pointer, will do some experiments.

In the GLM context however, and generally with larger MoEs (I recently did a qx53g quant of the 80B-Decensored), the lower quants perform better than q8 or higher.

In my particular case, I quantize data stores to 3-bit and attention paths to 5-bit, and that is sufficient to power the model well above q8 levels. I tried quanting GLM at smaller than 3-bit; it lost too much to still be usable. The vibe is cleaner too: less confusion, fewer hallucinations, the model is focused on the task like a bulldog on a steak.

If you want a clean, no-frills quant, q6 with group size 32 is usually the best-tuned option that is also small enough to fit, although I agree that MLX has evolved the mxfp4 implementation nicely.

In earlier mxfp4 versions it clobbered everything to 4-bit, but newer versions leave some layers at 8-bit, and that really shows.

MLX Community org

In earlier mxfp4 versions it clobbered everything to 4-bit, but newer versions leave some layers at 8-bit, and that really shows.

Thanks for that scoop! πŸ“°πŸ˜‰ Already asked myself that question, now I know!

MLX Community org

@McG-221

You can try hf download instead of hfdownloader to see if it works better for you.

See https://huggingface.co/docs/huggingface_hub/main/guides/cli for how to use it.

Yeah, got to keep those packages up to date, sometimes daily... :)

For translations you'd want a high winogrande score. Some decent piqa would help too; I tune my models specifically for that.

The GLM is really good with piqa/winogrande 0.767,0.728 for mxfp4, very few models approach that. Where it lacks a bit is arc, but that's a systemic issue, and that's what I attempt to solve with my merges. For example, I barely scratch that with https://huggingface.co/nightmedia/Qwen3-30B-A3B-Element6-1M-qx64-hi-mlx

mxfp4    0.556,0.693,0.878,0.750,0.428,0.797,0.690
qx64-hi  0.579,0.756,0.880,0.756,0.458,0.802,0.701
qx86-hi  0.568,0.737,0.880,0.760,0.450,0.803,0.714

The arc numbers on GLM are 0.416,0.440. It shows :)

MLX Community org

Thank you for your pointers!

I seem to get the best translations from quants of DeepSeek. The new Mistral Large 3 seems even better for European languages, but the model is not supported in MLX as yet.

For coding GLM is really good.

MLX Community org

Yabba Dabba Doo! πŸ₯³
Thanks for all your great info, ...

Pleasure. Happy to help. πŸ˜„ πŸ‘

Exactly, and you pointed to a very important distinction: the metrics show how well the model reasons and performs on generic tests. For specific content like a language corpus, if the model does not have it in its matrix and the arc numbers are high, it will hallucinate the most likely result. Sometimes this is funny :)
