Any tips on best way to run across 2 x M3 (512GB) Ultras?

#1
by refreshai - opened

There are various ways to utilise the two M3 Ultras, e.g. EXO or MLX Distributed. Does anyone have direct feedback on the best approach, with some details on the config / setup process?

One thing that comes to mind is the 6 x TB5 ports, which can be aggregated.

You can physically connect multiple Thunderbolt 5 ports on Studio A directly to the same number of ports on Studio B, then aggregate them to combine the bandwidth - theoretically up to 6 x 120 Gbps = 720 Gbps.

This can be configured in macOS.
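For reference, macOS exposes link aggregation through BSD `bond` interfaces. A minimal sketch; whether a `bond` will accept Thunderbolt bridge members is an untested assumption here, and the interface names `en5` / `en6` are placeholders (check `ifconfig -a` for the real ones):

```shell
# Create a bond interface and attach two member links (interface names are
# placeholders - run `ifconfig -a` to find the actual Thunderbolt interfaces).
sudo ifconfig bond0 create
sudo ifconfig bond0 bonddev en5
sudo ifconfig bond0 bonddev en6
sudo ifconfig bond0 inet 10.0.0.1 netmask 255.255.255.0 up
```

One caveat: classic link aggregation hashes traffic per flow, so a single TCP stream still rides one physical link; the aggregated bandwidth only helps if the workload opens multiple parallel connections.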

The real question is whether this is going to have a significant effect on tokens / sec. My gut feeling is that the answer is no, since we are most likely compute bound in the first place.

For instance, if a 4-bit quant running on a single M3 Ultra gets, say, 20 t/s, and we then "run" an 8-bit quant across two M3 Ultras with zero network delay (an extreme hypothetical), we are highly unlikely to get more than 20 t/s, since the larger quant needs more compute and memory traffic per token.
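To put rough numbers on that intuition, here's a back-of-envelope sketch. It assumes decode is memory-bandwidth bound (each token streams all weights), and the 200B model size and 819 GB/s M3 Ultra bandwidth figure are illustrative assumptions, not benchmarks:

```python
# Back-of-envelope decode speed: during autoregressive decode each token must
# stream the model weights through memory, so for a dense model
#   tokens/sec ~= effective memory bandwidth / model size in bytes.
# All numbers are illustrative assumptions, not measurements.

M3_ULTRA_BW_GBS = 819  # advertised M3 Ultra memory bandwidth, GB/s

def est_tps(model_gb: float, bandwidth_gbs: float) -> float:
    """Estimated decode tokens/sec for a memory-bandwidth-bound dense model."""
    return bandwidth_gbs / model_gb

# Hypothetical 200B-parameter dense model:
gb_4bit = 200 * 4 / 8   # 100 GB
gb_8bit = 200 * 8 / 8   # 200 GB

print(est_tps(gb_4bit, M3_ULTRA_BW_GBS))      # one box, 4-bit: ~8.2 t/s
print(est_tps(gb_8bit, M3_ULTRA_BW_GBS))      # two boxes, 8-bit, pipeline split
                                              # (boxes work in turn): ~4.1 t/s
print(est_tps(gb_8bit, 2 * M3_ULTRA_BW_GBS))  # two boxes, 8-bit, tensor parallel
                                              # (boxes work concurrently): ~8.2 t/s
```

In this toy model the 8-bit quant across two machines at best matches the 4-bit quant on one (tensor parallel), and with a pipeline split it is actually slower per token.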

So what is the minimum network requirement between the two M3s to avoid bottlenecking the compute?
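For a rough sense of scale (a sketch with assumed numbers: hidden size 16384, fp16 activations), the bandwidth needed per token in a pipeline split is tiny; what eats into the per-token budget is round-trip latency on each hop:

```python
# What crosses the wire per decoded token in a two-box pipeline split:
# a single hidden-state vector at the cut point. Numbers are illustrative
# assumptions, not measurements.
hidden_size = 16384   # assumed model dimension
act_bytes = 2         # fp16 activations
tps = 20              # target decode speed

per_token_bytes = hidden_size * act_bytes   # 32 KiB per token
bw_mbit = per_token_bytes * tps * 8 / 1e6   # required bandwidth, Mbit/s
budget_ms = 1000 / tps                      # per-token time budget, ms

print(per_token_bytes, round(bw_mbit, 2), budget_ms)
```

Under these assumptions the wire only needs a few Mbit/s, far below even one TB5 link, but every link crossing spends part of the 50 ms per-token budget, which is why interconnect latency, not aggregate bandwidth, is the number to watch.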

This is an interesting subject.

Has anyone investigated this in the real world - with physical connections / machines?

MLX Community org
edited 28 days ago

For scaling up TPS, latency is the bottleneck, not bandwidth.
This works just fine with EXO: you can scale single-request TPS up to 3.2x with 4 nodes.
We have benchmarked this.
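A toy model of why latency, not bandwidth, caps single-request scaling (the 50 ms compute time and 2 ms per-hop latency are assumed figures for illustration, not EXO's benchmark numbers):

```python
# Toy scaling model: per-token time on N nodes = compute/N plus a fixed
# communication latency per inter-node hop. All figures are assumed.
def speedup(n_nodes: int, compute_ms: float = 50.0, hop_ms: float = 2.0) -> float:
    t1 = compute_ms
    tn = compute_ms / n_nodes + (n_nodes - 1) * hop_ms
    return t1 / tn

for n in (1, 2, 4):
    print(n, round(speedup(n), 2))
```

With these assumptions 4 nodes give roughly a 2.7x speedup rather than 4x; shaving hop latency moves that number up, while adding link bandwidth barely changes it.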
