Community MLX conversions for EngGPT2-16B-A3B on Apple Silicon

#3 · opened by robertobissanti

Hi,

first of all, thank you for releasing EngGPT2-16B-A3B.

I have been experimenting with the model on Apple Silicon and created a set of community MLX conversions for local inference with mlx-lm.

The converted models are available here:

Since enggpt_moe is currently a custom architecture not supported by upstream mlx-lm, I also created a patched fork with preliminary support for this architecture:

https://github.com/robertobissanti/mlx-lm-enggpt

The implementation adds support for the main architectural features I found in the model:

  • model_type = enggpt_moe
  • head_dim = 128
  • Q/K RMSNorm inside attention
  • GQA with 32 attention heads and 4 KV heads
  • MoE layers with 64 experts
  • top-8 routing
  • SwiGLU experts
  • routing logic closer to the original PyTorch implementation: softmax over all experts, top-k selection, and optional top-k probability renormalization (a small sketch of this step follows below)
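
For clarity, the routing bullet corresponds to roughly the following logic. This is only an illustrative sketch in mlx.core; the function and variable names are mine and not the exact code in the fork:

import mlx.core as mx

def route_tokens(router_logits: mx.array, top_k: int = 8, renormalize: bool = True):
    """Toy routing step: softmax over all experts, pick top-k, optionally renormalize."""
    # router_logits: (num_tokens, num_experts), e.g. (N, 64) for this model
    probs = mx.softmax(router_logits, axis=-1)                  # softmax over all 64 experts
    order = mx.argsort(-probs, axis=-1)                         # experts sorted by probability
    expert_idx = order[..., :top_k]                             # indices of the top-8 experts
    expert_w = mx.take_along_axis(probs, expert_idx, axis=-1)   # their routing weights
    if renormalize:
        expert_w = expert_w / mx.sum(expert_w, axis=-1, keepdims=True)
    return expert_w, expert_idx

# tiny smoke test with random logits for 4 tokens and 64 experts
logits = mx.random.normal((4, 64))
w, idx = route_tokens(logits)
print(w.shape, idx.shape)  # (4, 8) (4, 8)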

I tested the conversions locally on a Mac Studio M1 Ultra with 64 GB unified memory.

Approximate local results:

| Version | Disk size | Generation speed | Peak memory |
|---|---|---|---|
| MLX unquantized | ~29 GB | ~64–68 tok/s | ~31.5 GB |
| MLX 4-bit | ~15 GB | ~90–94 tok/s | ~15.8 GB |
| MLX 8-bit | ~20 GB | ~75–80 tok/s | ~21.3 GB |

The 4-bit version appears to be the most practical one for local Apple Silicon inference.

A typical command is:

python -m mlx_lm generate \
  --model ./EngGPT2-16B-A3B-MLX-4bit \
  --prompt "Explain briefly what a Mixture of Experts model is." \
  --trust-remote-code \
  --chat-template-config '{"enable_thinking": false}' \
  --temp 0.1 \
  --max-tokens 160

The model can also be served through the mlx-lm OpenAI-compatible local server:

python -m mlx_lm server \
  --model ./EngGPT2-16B-A3B-MLX-4bit \
  --host 127.0.0.1 \
  --port 8080 \
  --trust-remote-code \
  --chat-template-args '{"enable_thinking": false}' \
  --temp 0.1 \
  --top-p 1.0 \
  --max-tokens 512
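
Once the server is running, it can be queried through its OpenAI-compatible /v1/chat/completions endpoint. The snippet below is only an illustrative client call; the "model" value in the payload is just an identifier, since the server already has the local model loaded:

import json
import urllib.request

# Illustrative request against the local server started above.
payload = {
    "model": "EngGPT2-16B-A3B-MLX-4bit",
    "messages": [
        {"role": "user", "content": "Explain briefly what a Mixture of Experts model is."}
    ],
    "temperature": 0.1,
    "max_tokens": 160,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
    print(body["choices"][0]["message"]["content"])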

I would be very happy if you could take a look and let me know whether the conversion and the architecture mapping look correct from your side.

In particular, feedback would be very useful on:

  • whether the MoE routing logic matches your intended implementation;
  • whether Q/K normalization and head_dim = 128 are handled correctly;
  • whether there are any special inference details that should be documented;
  • whether the model cards should include additional license or usage notes.

This is an experimental community conversion, and I have clearly documented that it currently requires a patched mlx-lm fork until upstream support for enggpt_moe is available.

Thanks again for the model and for any feedback you may be willing to provide.

Roberto Bissanti


Hi Roberto,

Thank you very much for your interest in EngGPT2 and for sharing this impressive MLX conversion. We truly appreciate your work and will definitely give it a try as soon as we have the opportunity.

Best,
EngGPT-Team

Thank you, EngGPT-Team!

I’m really happy to contribute to the Italian AI ecosystem.

The MLX conversion is just the first step: I’m currently working on the GGUF version to bring EngGPT2 to llama.cpp and Ollama. Making these models easy to run locally is key to their adoption.

Looking forward to the PR merge!

Best,
Roberto

Hi EngGPT2 team,

thanks again for your attention and for mentioning the community MLX conversions on LinkedIn. I really appreciated it.

As a follow-up, I created a small public manager repo to make the Apple Silicon setup easier to reproduce on another Mac:

https://github.com/robertobissanti/enggpt2-mlx-manager

The script handles:

  • installing the patched mlx-lm fork with enggpt_moe support
  • downloading the selected MLX variant
  • starting an OpenAI-compatible MLX server
  • starting Open WebUI via Docker
  • selecting 4bit, 8bit, or 16bit
  • passing runtime options like thinking mode and temperature

Example:

./enggpt2-manager.command --install --model 4bit
./enggpt2-manager.command --start --model 4bit --think --temp 0.1

The patched MLX fork is here:

https://github.com/robertobissanti/mlx-lm-enggpt

I pinned the installer to a clean commit of the fork so the setup is reproducible and does not depend on local development files.
This remains a community/unofficial utility, but I hope it can help users who want to try EngGPT2 locally on Apple Silicon with MLX and Open WebUI.
If you prefer any change in wording, attribution, metadata, or links, I am happy to adjust it.

Best,
Roberto
