34 Tokens Per Second on a 48 GB MacBook

A follow-up to A 70 GB model on a 48 GB MacBook. Yesterday the number was 1 tok/s. The kernel got fused. Here is what changed.

Yesterday I wrote about running a 35B mixture-of-experts model on a MacBook Pro at about 1 token per second.

The useful part of that result was never the speed. It was that the MoE compression path worked end to end. The model fit, the routing worked, the output was coherent, and the bottleneck was no longer abstract. It had a shape.

The conclusion at the end of that post was specific: if this was going to become usable, the six MLX operations in the decode path had to collapse into one fused Metal kernel.

That kernel now exists.

On the same hardware class, the same compression method, and the same 3B-active Qwen A3B architecture, the number moved from 1 tok/s to 33.8 tok/s.

Same MacBook. Same underlying idea. Different kernel shape.

What changed

The earlier version walked through the decode path as a chain of separate operations:

take → unpack → centroid lookup → norm scale → reshape → einsum

That worked, but it was the wrong shape for this workload.

Each stage wrote intermediate values to unified memory and then read them back again. On a large MoE model, single-token decode is already a memory-sensitive path. Splitting it into six separate launches turned memory traffic into the dominant cost. mx.compile helped only slightly because the issue was not just launch overhead. Too much data was moving in and out between kernels.
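To make the memory traffic concrete, here is a toy, pure-Python stand-in for that staged chain (the names and shapes are illustrative, not the real MLX graph). Each stage materializes a full intermediate before the next stage reads it; on the GPU, each of those intermediates round-trips through unified memory:

```python
def staged_decode_row(packed, codebook, norm, x):
    """Illustrative sketch of the six-stage decode path for one row.
    Every stage below produces a full intermediate list, standing in
    for a tensor written to and re-read from unified memory."""
    # 1. take: select the packed bytes for the routed row
    row = list(packed)                        # intermediate 1
    # 2. unpack: expand 3-bit indices out of uint8 storage (little-endian)
    bits, nbits, idx = 0, 0, []
    for byte in row:
        bits |= byte << nbits
        nbits += 8
        while nbits >= 3:
            idx.append(bits & 0b111)          # intermediate 2
            bits >>= 3
            nbits -= 3
    # 3. centroid lookup: indices -> codebook values
    w = [codebook[i] for i in idx]            # intermediate 3
    # 4. norm scale
    w = [v * norm for v in w]                 # intermediate 4
    # 5./6. reshape + einsum collapse to a dot product at batch size 1
    return sum(wi * xi for wi, xi in zip(w, x))
```

The point of the sketch is not the arithmetic, which is trivial, but the four named intermediates: that is the traffic the fused kernel removes.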

The fused kernel in v0.10.0 does that work in one pass, specialized for batch size 1 decode:

  1. Load the packed 3-bit indices for the selected row from uint8 memory.
  2. Unpack the indices in registers.
  3. Look them up in the codebook.
  4. Apply the learned norm.
  5. Accumulate directly into the output dot product against the rotated input.
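In the same toy terms, the fused shape does all five steps in a single loop: the unpacked indices stay in registers, and nothing is written back until the final accumulation (again an illustrative pure-Python sketch, not the actual Metal kernel):

```python
def fused_decode_row(packed, codebook, norm, x):
    """One pass: unpack 3-bit indices, look up centroids, apply the
    norm, and accumulate straight into the output dot product.
    No intermediate list is ever materialized."""
    acc = 0.0
    bits, nbits, col = 0, 0, 0
    for byte in packed:
        bits |= byte << nbits
        nbits += 8
        while nbits >= 3 and col < len(x):
            idx = bits & 0b111                     # unpacked "in registers"
            bits >>= 3
            nbits -= 3
            acc += codebook[idx] * norm * x[col]   # direct accumulation
            col += 1
    return acc
```

Same inputs, same result; the only thing that changed is that the four intermediates collapse into a running scalar per output element.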

That is the change.

The math is the same. The compression is the same. The difference is that the intermediate tensors mostly disappear. What used to be six memory-heavy stages becomes one sweep through packed weights and one accumulation path to output.

On this workload, that is the difference between a proof and a usable system.

The measurement

The current benchmark is on Qwen3-Coder-30B-A3B-Instruct on an M4 Pro MacBook Pro with 48 GB of unified memory.

Prompt:

Write a short Python function that returns the n-th Fibonacci number

Results:

| Path | TTFT | Wall time (64 tok) | Decode tok/s |
| --- | --- | --- | --- |
| v0.10.0 fused Metal GEMV kernel | 0.33 s | 2.2 s | 33.80 |
| fp32 einsum fallback | 17.3 s | 94.7 s | 0.81 |

The fallback path is the same general shape as the one from the earlier post. The fused path is what replaced it.
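As a quick sanity check, the decode rate in the first row is consistent with the other two columns, assuming the 64 tokens include the first one:

```python
# Back-of-the-envelope check on the fused-path row of the table.
ttft = 0.33        # s, time to first token
wall = 2.2         # s, total wall time for 64 tokens
tokens = 64

decode_rate = (tokens - 1) / (wall - ttft)
print(round(decode_rate, 1))   # ≈ 33.7 tok/s, close to the reported 33.80
```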

The first number was "it fits, but you would not use it."
This one is different. 33.8 tok/s is fast enough to feel interactive.

One caveat about the comparison

The earlier post used Qwen3.5-35B-A3B, which has 256 experts per layer. This follow-up uses Qwen3-Coder-30B-A3B, which has 128 experts per layer.

They are in the same architecture family. Both are top-8 routed A3B MoE models. But they are not the same model.

That matters.

The 30B coder variant also carries less resident state, about 11 GB instead of the earlier 19 GB, which reduces memory pressure and helps the final number. So if the question is "how much of the gain came from fusion, and how much came from using the smaller model," the precise answer still needs one more benchmark.

I have not yet re-run the fused kernel on the 256-expert 35B.

What I can say today is narrower and still useful:

The bottleneck really was the unfused memory path. Fusion really does change it.

What 34 tok/s changes

The earlier result proved the compression path.

This one makes the model usable.

At 33.8 tok/s, the first token arrives in about a third of a second. That is fast enough for interactive chat. More importantly, it is fast enough for agent loops to start making sense locally. A tool call that burns 50 or 60 response tokens is no longer an event. It is just part of the loop.

That was the next thing I wanted to test.

So I pointed opencode at http://localhost:8080/v1, gave it a small task, and let it run against the local model.

The task was simple:

List files in scripts/, and give a one-sentence summary of scripts/mac-baseline.sh.

Getting that loop stable on a 48 GB MacBook took a bit more work than the kernel itself.

Three things had to be fixed:

  1. Provider mismatch. opencode's default openai provider uses /v1/responses, not /v1/chat/completions. Switching to @ai-sdk/openai-compatible fixed the wire protocol.
  2. Concurrent requests. mlx_lm.server handles requests on separate threads. opencode sends parallel calls. On a 30B model, the second request could push prefill high enough to crash the Metal command buffer. A single threading.Lock around the POST handler fixed that.
  3. Memory retention across requests. MLX keeps attention scratch around aggressively enough that an opencode session could leak substantial extra memory on top of the pinned weights. Setting explicit wired and total memory limits, then calling mx.clear_cache() after each request, kept the process bounded.

All three fit in about forty lines around mlx_lm.server, in scripts/mac-serve-tq3.py.
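The serialization fix from item 2 can be sketched in a few lines of generic Python (the handler and cleanup hook here are illustrative stand-ins, not the actual mlx_lm.server code):

```python
import threading

# One lock serializes generation so two prefills never run on the GPU
# at once. In the real wrapper, run_generation would call into mlx_lm
# and cleanup would call mx.clear_cache() to release attention scratch
# between requests.
_generate_lock = threading.Lock()

def handle_post(run_generation, cleanup=lambda: None):
    """Serialize concurrent POSTs; run cleanup after every request,
    whether generation succeeded or raised."""
    with _generate_lock:
        try:
            return run_generation()
        finally:
            cleanup()
```

The lock trades parallelism for stability, which is the right trade here: on a 48 GB machine there is no memory headroom for two concurrent prefills anyway.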

With that in place, the agent loop ran cleanly. It listed the files, read the shell script, and summarized it correctly down to the operational detail that it uses nohup and is safe to re-run.

That is the sentence that matters here.

A real agent loop, against a 30B mixture-of-experts model, on a MacBook Pro, using tools, completing a multi-step task.

Yesterday that sentence was still ahead of the implementation. Now it is just a description.

Quality

Speed by itself is not a product. It is only interesting if quality at this bit budget still holds up.

The comparison I cared about was straightforward:

The 20-scenario evaluation ran with a Llama-3.3-70B judge, and the scores landed here:

| Config | Bit budget | Overall score |
| --- | --- | --- |
| TurboQuant TQ3 | 3-bit | 4.62 / 5 |
| MLX 4-bit | 4-bit | 4.16 / 5 |
| MLX 3-bit | 3-bit | 2.29 / 5 |

The important comparison is the first row against the third.

At the same bit budget, TurboQuant's calibration-free 3-bit path beat MLX's stock 3-bit linear quant by more than two full points on a five-point scale.

Against MLX 4-bit, which uses a third more memory, TQ3 still scored higher.

Nineteen of twenty scenarios scored 4.0 or above. The one notable failure was humor-wit, where the model fell into a repetitive pun loop and did not recover well. That is a useful failure mode to see, but it does not look like a quantization collapse.

There is still one latency caveat on the quality side.

Average turn latency was about 17.3 s, versus roughly 5 s for MLX 4-bit. That is the remaining prefill tax. The fused kernel makes single-token decode fast, but prefill still touches every weight through a slower path. That is the next obvious place to work.

The CUDA parallel

This is not really a MacBook story anymore.

The same kernel shape is what the CUDA path wants too: batch size 1, fused dequant, norm application, and matmul in one sweep.

That is already the direction in turboquant-vllm, where the CUDA bs=1 GEMV path shipped in v0.9.0 for A100, L40S, and RTX 6000 Ada. The scalar quantization path is also in review upstream as vllm-project/vllm#39970.

That convergence is not accidental.

On both Metal and CUDA, the pressure comes from the same place. Single-token decode spends its time following dependent loads through memory. If you want the speed back, you remove intermediate traffic and do the work in one kernel.

Different backend, same answer.

What comes next

There are four obvious next steps:

  1. Run the fused kernel on the 256-expert 35B. That will separate "smaller model" from "better kernel" more cleanly.
  2. Fix prefill. Decode is now fast. Prefill is still the slower path, and that will matter at long context.
  3. Handle bs > 1. The current kernel is specialized for single-user decode. Batch serving needs a different shape.
  4. Keep pushing the same fused path upstream. HIGGS-scalar does not care whether the shader speaks Metal or PTX. The implementation details differ. The architectural requirement does not.

All the code is in varjoranta/turboquant-vllm. The fused Metal path is in turboquant_vllm/mlx_ops.py, the loader is in mlx_loader.py, and the MoE integration is in mlx_model.py. This release is v0.10.0. The matching CUDA bs=1 GEMV path is v0.9.0.

The first post proved the MoE path was real.
This one made it usable.
