Apple boosts LLM performance thanks to optimized M5 design

by Milan
November 20, 2025
Apple M5 (Image: Apple)

In its latest Machine Learning Research blog post, Apple shows how much the new M5 chip improves the execution of local LLMs. In a direct comparison with the M4, the chip achieves noticeably higher speeds. The focus is on two things: how quickly local language models produce the first token, and how efficiently they generate the tokens that follow. The report provides concrete metrics and explains why the M5 has the edge in both areas.

To put these results into perspective, it helps to look at MLX. Apple released the framework a few years ago to make machine learning natively accessible on Apple Silicon. MLX is open source and built as an array framework with an API modeled on NumPy, and it takes advantage of Apple Silicon's unified memory architecture. Operations can therefore move seamlessly between the CPU and GPU without copying data. MLX also includes packages for neural networks, optimization, automatic differentiation, and computation graph optimization.
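
To illustrate the unified memory model, here is a minimal sketch using the mlx.core array API (assuming the mlx package is installed); the matrix multiplication is just an arbitrary example operation:

```python
# Minimal MLX sketch: arrays live in unified memory, so the same data can be
# used by CPU and GPU kernels without an explicit copy. Requires `pip install mlx`.
import mlx.core as mx

a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))

# The same arrays can be consumed by either device; only the compute target changes.
c_gpu = mx.matmul(a, b, stream=mx.gpu)   # run on the GPU
c_cpu = mx.matmul(a, b, stream=mx.cpu)   # run on the CPU, no data movement needed

# MLX evaluates lazily: eval() forces the computation.
mx.eval(c_gpu, c_cpu)
print(mx.allclose(c_gpu, c_cpu))
```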

A key component is MLX LM. This package runs Hugging Face models locally and covers both text generation and fine-tuning. MLX LM supports quantization, which reduces a model's memory footprint and speeds up inference, making large models viable even on devices with less RAM. Apple's comparison between the M4 and M5 is built on this stack.
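
As a rough sketch of what running a Hugging Face model locally with MLX LM looks like (the model identifier below is a placeholder, not one of the checkpoints Apple benchmarked):

```python
# Minimal MLX LM sketch (requires `pip install mlx-lm`); the model id is a
# placeholder for any MLX-converted checkpoint on the Hugging Face Hub.
from mlx_lm import load, generate

# Downloads (or reuses) the weights and tokenizer from the Hugging Face Hub.
model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

prompt = "Explain unified memory in one sentence."
text = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=True)
print(text)
```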

Background information on MLX and MLX LM

MLX offers a flexible system that covers numerical simulations, scientific computations, and machine learning. For language models, MLX LM provides the appropriate tools. These allow large models to be loaded, executed, and fine-tuned, with quantization playing a key role. Quantization reduces both memory requirements and computational load, and accelerates inference.
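
A hedged sketch of how a BF16 checkpoint might be quantized for MLX LM, using the convert() helper that mlx-lm ships; the exact parameter names and defaults may differ between versions, and the model path is a placeholder:

```python
# Hypothetical quantization sketch using mlx-lm's convert() helper; parameter
# names and defaults may vary across versions of the package.
from mlx_lm import convert

convert(
    hf_path="mlx-community/some-bf16-model",  # placeholder Hugging Face path
    mlx_path="./model-4bit",                  # where the quantized weights are written
    quantize=True,                            # enable weight quantization
    q_bits=4,                                 # 4-bit weights, as in Apple's tests
    q_group_size=64,                          # group size for the quantization scales
)
```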

MLX takes full advantage of the Apple Silicon platform's unified memory architecture. Support for BF16, mixed-precision formats, and automatic differentiation ensures efficient model execution, and the entire memory pool is available for running large models. This is crucial for LLMs, because generating each subsequent token is limited by memory bandwidth rather than raw compute.

M5 compared to the M4

Apple tested several models to highlight the differences between the two chips. These include:

  • Qwen 1.7B in BF16
  • Qwen 8B in BF16
  • Qwen 8B with 4-bit quantization
  • Qwen 14B with 4-bit quantization
  • Qwen 30B, a Mixture-of-Experts model with 3B active parameters, in 4-bit
  • GPT-OSS 20B in MXFP4

All benchmarks used a prompt size of 4096 tokens. The measurement includes both the time to generate the first token and the speed at which 128 additional tokens are produced.
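
The following is a rough sketch of how such a measurement could be reproduced locally; it is not Apple's benchmark harness, assumes mlx-lm's stream_generate() generator (whose return type varies between versions), and uses a placeholder model and prompt instead of the 4096-token prompts from the blog post:

```python
# Rough timing sketch (not Apple's benchmark code): measures time to first
# token and the rate of the following tokens with mlx-lm's stream_generate().
import time
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")  # placeholder

prompt = "..."  # Apple's benchmarks used ~4096-token prompts here
start = time.perf_counter()
first_token_time = None
tokens = 0

for _ in stream_generate(model, tokenizer, prompt=prompt, max_tokens=128):
    if first_token_time is None:
        first_token_time = time.perf_counter() - start  # prefill / time to first token
    tokens += 1

total = time.perf_counter() - start
print(f"time to first token: {first_token_time:.2f} s")
print(f"generation speed: {(tokens - 1) / (total - first_token_time):.1f} tokens/s")
```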

The results show that the M5 is significantly faster at generating the first token. This step is compute-bound and benefits from the redesigned GPU with neural accelerators, which provide dedicated hardware for the matrix multiplications that dominate LLM workloads. For subsequent tokens, however, memory bandwidth matters more. The M5 offers 153 GB per second, while the M4 reaches 120 GB per second, an increase of 28 percent. Overall, this translates into a 19 to 27 percent improvement when generating additional tokens.

A MacBook Pro with 24 GB of RAM can comfortably hold both an 8B model in BF16 and a 30B MoE model in 4-bit. Inference stays below 18 GB for both model architectures and therefore runs stably.
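
A back-of-the-envelope check, assuming the weights dominate the footprint and that 4-bit quantization costs roughly an extra half bit per weight for the quantization scales, shows why both models fit:

```python
# Rough weight-memory estimates; real footprints also include the KV cache,
# activations and framework overhead, so these are lower bounds.
GiB = 1024 ** 3

def weight_gib(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / GiB

print(f"8B model in BF16 (16 bits): ~{weight_gib(8, 16):.1f} GiB")
print(f"30B MoE model in 4-bit:     ~{weight_gib(30, 4.5):.1f} GiB")  # ~0.5 extra bit for scales
```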

Image generation in comparison

Apple also measured image generation performance. Here, the difference is even more pronounced. The M5 performs these tasks more than 3.8 times faster than the M4. The higher bandwidth and optimized units of the new GPU are particularly beneficial in generative image processing.

M5 significantly increases the efficiency of local AI

The M5 shows clear performance gains compared to the M4 in the local execution of large language models. Through improved neural accelerators, higher memory bandwidth, and an optimized GPU, Apple increases the efficiency of LLM inference across the entire Apple Silicon platform. MLX and MLX LM play a central role in this, as they enable the execution of large models in the first place and further accelerate them through quantization.

The results demonstrate that Apple is specifically aligning its hardware with machine learning. LLMs run faster, require less waiting time for the first token, and benefit from a more stable memory connection. The M5 also clearly outperforms the M4 in image generation. This strengthens Apple's use of local AI and expands the possibilities across all devices with Apple Silicon. (Image: Apple)

Tags: Apple Silicon, Developer, Mac