The hook
Same machine. Roughly 33% more tokens per second.
A couple weeks ago a Llama.cpp update landed that moved local inference enough that I noticed without measuring. Then I measured.
Before the update, my go-to model peaked around 57 to 65 tokens per second and decayed as context grew. Long sessions would drift into the mid-50s, which is right around the point where agent loops start to feel like watching a kettle.
After the update, the same model holds 80 to 90 tokens per second across the whole session, with the worst dip deep into context landing near 78. Even the bad moments are faster than every good moment on the old build.
That's the story. If you've bounced off local models because they felt sluggish during real work, it's worth trying again this week.
Two models resident at once, routed by model_id. The right-hand panel is where the throughput actually comes from.
Why the speed bump matters
Agentic work compounds every token.
I'm not chatting with my local model. I'm running loops. Tool call, review, edit, retry, summarize. A single feature iteration might fire 30 messages across two models. At 57 tok/s and degrading, that kind of loop stalls. You feel every pause. You stop trusting the flow to finish, and you end up babysitting it.
At 80-plus sustained, the loop gets out of your way. The model didn't get smarter this month. The wait cost dropped far enough that I stopped opting out of agent workflows, which is a different kind of improvement.
The two speed tiers
The smaller the model, the bigger the payoff.
The 80-plus I mentioned is my big-model floor. On the small end, the same update pays off even harder. A 2B dense model on this machine now clears 160 tokens per second. That's fast enough that the reply lands the moment you stop typing.
The routing lesson writes itself. Small models handle intake, classification, short replies, anything where the user is waiting on a fast answer. The MoE handles the work that actually needs reasoning. Here are two back-to-back measurements from the LM Studio status bar, same machine, same week.
167 tok/s on a small dense model. Short, structured replies feel effectively instant at this speed.
84.84 tok/s on the MoE for a ~3,300-token answer. Long-form reasoning, still comfortably above the point where agent loops feel alive.
The shape of local has changed. Smaller footprints, better outputs, the kind of jump that's hard to see month to month but obvious when you line up the years.
What to run, by machine
Pick a model that matches the RAM you actually have.
Most of the advice online ignores this. The right model depends on your hardware, and trying to stretch past it is how you end up with 8 tok/s and swap thrash.
Modest hardware, the MacBook Air M-series tier with 16 to 24 GB of unified memory: run one small dense model. I'd pick from Qwen 3.5 4B, Gemma 4 4B, or Nemotron 5 Nano 4B. These are genuinely good for intake, classification, summarization, and short creative work. Coding is okay, not their strength.
Mid-range, 32 to 64 GB of RAM with a dedicated GPU or an M-series Pro or Max: step up to an 8B. Gemma 4 8B, Llama 3.6 8B, or Qwen 3.5 8B all feel noticeably stronger on reasoning and code while staying fast.
Serious workstation, 64 GB-plus unified memory or a high-VRAM GPU: this is where the interesting move is. Skip dense models and run an MoE, specifically Qwen 3.6 35B-A3B. 35 billion total parameters, only about 3 billion active per token. You pay the memory bill of a big model and the speed bill of a small one.
14 models, most of them quantized two or three different ways. I pin one daily driver per machine and treat the rest as task-specific.
LM Studio settings
Turn these on. Defaults leave speed on the table.
The Developer tab's Server Settings panel is the difference between a model that feels fast and one that doesn't. Here's what I actually toggle.
- Flash AttentionFree win
- Lower memory, faster attention. No quality tradeoff on Apple silicon or recent CUDA. Leave this on.
- mmapFree win
- Faster model load, lighter RAM pressure. Makes swapping between models bearable instead of a 30-second stall each time.
- Unified KV CacheMulti-model only
- Cleaner memory accounting when you're running more than one model. If you only load one, it doesn't change much.
- K/V Cache QuantizationTradeoff
- Trade a sliver of quality for real memory savings.
Q8is almost free and should be your default.Q4is a judgment call worth testing on your own prompts. - GPU OffloadSize to fit
- Push as many layers onto the GPU as your VRAM allows. Partial offload is fine. Don't try to fit everything if it means swapping to disk, which is far worse than CPU inference.
- Evaluation Batch SizeBump it
- The default is usually too conservative. Setting it to
2048moves prompt processing noticeably on most machines. - Context lengthRight-size
- Don't set it higher than you use.
32Khandles most work.65Kif you're running long agent sessions. Bigger windows cost memory even when they're empty.
These aren't magic. They're reasonable defaults for serious work, and LM Studio ships with them off or low for safety. Flip them.
The MoE angle
Big model, small active path.
Mixture of Experts is the release I've been waiting on for local. The intuition is simple: it's a big model that only lights up the parts it needs for each token. You get the breadth of a 35B parameter model with the per-token compute of something much smaller.
In practice, Qwen 3.6 35B-A3B feels like a dense 8B in throughput and something in the 24 to 30B class in reasoning. That's a first for me on local. The Discover page pitch flags "stability and real-world utility" and "agentic coding," which is exactly the use case that used to send me back to a cloud API.
It's not free. You still need the RAM to hold all 35B parameters resident. But if you've got 64 GB unified memory or more, this is the one to try.
Staff Pick, and deservedly so. This is the model I'd download first on any workstation that can hold it.
Same prompts, same machine, two Llama.cpp builds back to back. Both lines slope down as context grows (the KV cache doing its usual work), but the new build sustains a markedly higher floor.
Local is fast enough now to run the loops a cloud model can't afford.
— The moment it clicked
The takeaway
Local crossed a line. Pick a model, turn the knobs, try it this week.
A year ago I'd have told you to keep the serious work on a frontier API and treat local as a nice offline backup. That answer changed two weeks ago.
If you've got modest hardware, a 4B gets you further than you'd expect. If you've got mid-range hardware, an 8B is a real daily driver. If you've got a workstation, Qwen 3.6 35B-A3B is the model to try first.
Then open the Developer tab, flip those settings, and run an agent loop. Not a single-turn chat. The whole thing. That's where the speed bump actually pays you back.
Share this post
Keep reading
More worth your time
Blog post · AI, Building, Thoughts
Why I default to local models for bulk work
Cloud APIs are great for deep reasoning. For everything else, a model on my own machine does the job for free.
Read post →
Blog post · AI, Building, Process
The bottom-up edit rule
When a model queues five edits against one file, working top-down is a bug. Here's the order that fixed it.
Read post →Stay in the loop
New posts, same voice.
Get a short email when I publish something new. No weekly digests, no link dumps — just the essays.