Community MLX conversions for EngGPT2-16B-A3B on Apple Silicon

#3 · opened by robertobissanti

Hi,

first of all, thank you for releasing EngGPT2-16B-A3B.

I have been experimenting with the model on Apple Silicon and created a set of community MLX conversions for local inference with mlx-lm.

The converted models are available here:

Since enggpt_moe is currently a custom architecture not supported by upstream mlx-lm, I also created a patched fork with preliminary support for this architecture:

https://github.com/robertobissanti/mlx-lm-enggpt

The implementation adds support for the main architectural features I found in the model:

  • model_type = enggpt_moe
  • head_dim = 128
  • Q/K RMSNorm inside attention
  • GQA with 32 attention heads and 4 KV heads
  • MoE layers with 64 experts
  • top-8 routing
  • SwiGLU experts
  • routing logic closer to the original PyTorch implementation: softmax over all experts, top-k selection, and optional top-k probability renormalization (a small sketch of this step follows below)
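
For clarity, the routing bullet corresponds to roughly the following logic. This is only an illustrative sketch in mlx.core; the function and variable names are mine and not the exact code in the fork:

import mlx.core as mx

def route_tokens(router_logits: mx.array, top_k: int = 8, renormalize: bool = True):
    """Toy routing step: softmax over all experts, pick top-k, optionally renormalize."""
    # router_logits: (num_tokens, num_experts), e.g. (N, 64) for this model
    probs = mx.softmax(router_logits, axis=-1)                  # softmax over all 64 experts
    order = mx.argsort(-probs, axis=-1)                         # experts sorted by probability
    expert_idx = order[..., :top_k]                             # indices of the top-8 experts
    expert_w = mx.take_along_axis(probs, expert_idx, axis=-1)   # their routing weights
    if renormalize:
        expert_w = expert_w / mx.sum(expert_w, axis=-1, keepdims=True)
    return expert_w, expert_idx

# tiny smoke test with random logits for 4 tokens and 64 experts
logits = mx.random.normal((4, 64))
w, idx = route_tokens(logits)
print(w.shape, idx.shape)  # (4, 8) (4, 8)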

I tested the conversions locally on a Mac Studio M1 Ultra with 64 GB unified memory.

Approximate local results:

| Version | Disk size | Generation speed | Peak memory |
|---|---|---|---|
| MLX unquantized | ~29 GB | ~64–68 tok/s | ~31.5 GB |
| MLX 4-bit | ~15 GB | ~90–94 tok/s | ~15.8 GB |
| MLX 8-bit | ~20 GB | ~75–80 tok/s | ~21.3 GB |

The 4-bit version appears to be the most practical one for local Apple Silicon inference.

A typical command is:

python -m mlx_lm generate \
  --model ./EngGPT2-16B-A3B-MLX-4bit \
  --prompt "Explain briefly what a Mixture of Experts model is." \
  --trust-remote-code \
  --chat-template-config '{"enable_thinking": false}' \
  --temp 0.1 \
  --max-tokens 160

The model can also be served through the mlx-lm OpenAI-compatible local server:

python -m mlx_lm server \
  --model ./EngGPT2-16B-A3B-MLX-4bit \
  --host 127.0.0.1 \
  --port 8080 \
  --trust-remote-code \
  --chat-template-args '{"enable_thinking": false}' \
  --temp 0.1 \
  --top-p 1.0 \
  --max-tokens 512
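
Once the server is running, it can be queried through its OpenAI-compatible /v1/chat/completions endpoint. The snippet below is only an illustrative client call; the "model" value in the payload is just an identifier, since the server already has the local model loaded:

import json
import urllib.request

# Illustrative request against the local server started above.
payload = {
    "model": "EngGPT2-16B-A3B-MLX-4bit",
    "messages": [
        {"role": "user", "content": "Explain briefly what a Mixture of Experts model is."}
    ],
    "temperature": 0.1,
    "max_tokens": 160,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
    print(body["choices"][0]["message"]["content"])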

I would be very happy if you could take a look and let me know whether the conversion and the architecture mapping look correct from your side.

In particular, feedback would be very useful on:

  • whether the MoE routing logic matches your intended implementation;
  • whether Q/K normalization and head_dim = 128 are handled correctly;
  • whether there are any special inference details that should be documented;
  • whether the model cards should include additional license or usage notes.

This is an experimental community conversion, and I have clearly documented that it currently requires a patched mlx-lm fork until upstream support for enggpt_moe is available.

Thanks again for the model and for any feedback you may be willing to provide.

Roberto Bissanti


Hi Roberto,

Thank you very much for your interest in EngGPT2 and for sharing this impressive MLX conversion. We truly appreciate your work and will definitely give it a try as soon as we have the opportunity.

Best,
EngGPT-Team

Thank you, EngGPT-Team!

I’m really happy to contribute to the Italian AI ecosystem.

The MLX conversion is just the first step: I’m currently working on the GGUF version to bring EngGPT2 to llama.cpp and Ollama. Making these models easy to run locally is key to their adoption.

Looking forward to the PR merge!

Best,
Roberto

Hi EngGPT2 team,

thanks again for your attention and for mentioning the community MLX conversions on LinkedIn. I really appreciated it.

As a follow-up, I created a small public manager repo to make the Apple Silicon setup easier to reproduce on another Mac:

https://github.com/robertobissanti/enggpt2-mlx-manager

The script handles:

  • installing the patched mlx-lm fork with enggpt_moe support
  • downloading the selected MLX variant
  • starting an OpenAI-compatible MLX server
  • starting Open WebUI via Docker
  • selecting 4bit, 8bit, or 16bit
  • passing runtime options like thinking mode and temperature

Example:

./enggpt2-manager.command --install --model 4bit
./enggpt2-manager.command --start --model 4bit --think --temp 0.1

The patched MLX fork is here:

https://github.com/robertobissanti/mlx-lm-enggpt

I pinned the installer to a clean commit of the fork so the setup is reproducible and does not depend on local development files.
This remains a community/unofficial utility, but I hope it can help users who want to try EngGPT2 locally on Apple Silicon with MLX and Open WebUI.
If you prefer any change in wording, attribution, metadata, or links, I am happy to adjust it.

Best,
Roberto
