An AMD GPU Beat My Mac on Llama 8B. The Same GPU Lost on Phi-3.

Same hardware. Same benchmark. Opposite winner depending on model size. Cross-platform inference numbers on Mac M2 Pro Metal, Linux RTX 2080 Ti CUDA, and Windows RX 6600 XT Vulkan, plus the hardware tier ceiling none of them clear.

Tuesday, June 2, 2026

TL;DR

The AMD RX 6600 XT beat the Mac M2 Pro on Llama 8B (+80%) but lost on Phi-3 mini, so the right hardware depends on how the model's size compares with cache. Local inference hardware selection is model-dependent: the Mac wins on small models that fit in cache, discrete GPUs win on price-per-token for 7-8B models, and none of the three machines is viable for 70B+.


I wrote a post yesterday about why GPUs barely help small text embeddings at batch=1. Different workload, same machines. This time I ran a local LLM inference benchmark across the same three boxes. The result complicated my hardware mental model in a way I think is worth sharing.

The setup

Three machines.

A Mac M2 Pro with 16 GB of unified memory, running Metal through llama-cpp-python.

A Linux desktop with an Intel 13700K, 62 GB of RAM, and an RTX 2080 Ti with 11 GB of VRAM. CUDA 13.
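For readers who want to reproduce this kind of comparison, here is a minimal sketch of a tokens-per-second harness. It is not the harness used for the numbers in this post. The function names (`tokens_per_second`, `llama_cpp_generate`) and the model path are hypothetical, and the timing covers the whole call, prompt processing included, so it understates pure decode speed on long prompts.

```python
import time
from statistics import median
from typing import Callable


def tokens_per_second(
    generate: Callable[[str, int], int],
    prompt: str,
    max_tokens: int = 128,
    runs: int = 3,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return the median throughput (tokens/s) over `runs` completions.

    `generate(prompt, max_tokens)` runs one completion and returns the
    number of tokens it produced. `clock` is injectable for testing.
    """
    rates = []
    for _ in range(runs):
        start = clock()
        produced = generate(prompt, max_tokens)
        elapsed = clock() - start
        rates.append(produced / elapsed)
    return median(rates)


def llama_cpp_generate(model_path: str) -> Callable[[str, int], int]:
    """Wrap a llama-cpp-python model as a `generate` callable.

    Assumption: `model_path` points at a local GGUF file of your choosing.
    """
    # Imported lazily so the harness itself runs without llama-cpp-python.
    from llama_cpp import Llama

    # n_gpu_layers=-1 asks the backend (Metal, CUDA, Vulkan) to offload all layers.
    llm = Llama(model_path=model_path, n_gpu_layers=-1, verbose=False)

    def generate(prompt: str, max_tokens: int) -> int:
        out = llm(prompt, max_tokens=max_tokens)
        return out["usage"]["completion_tokens"]

    return generate
```

On each machine you would call something like `tokens_per_second(llama_cpp_generate("model.gguf"), "Explain caching.")` with the same model file and prompt, then compare the medians. Taking the median of a few runs keeps a single slow warm-up run from skewing the result.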
