
How I'd set up LM Studio today

A recent Llama.cpp update pushed me from 60 tokens per second to 80-plus on the same machine. Here's what I'd run and what I'd turn on.

LM Studio's Developer tab with two models loaded at once, next to the Server Settings panel showing Flash Attention, mmap, K/V cache quantization, and GPU offload toggles.
Two models loaded, one LM Studio instance, and the settings panel that decides how fast they actually run.

The hook

Same machine. Roughly 33% more tokens per second.

A couple of weeks ago, a Llama.cpp update landed that sped up local inference enough that I noticed without measuring. Then I measured.

Before the update, my go-to model peaked around 57 to 65 tokens per second and decayed as context grew. Long sessions would drift into the mid-50s, which is right around the point where agent loops start to feel like watching a kettle.

After the update, the same model holds 80 to 90 tokens per second across the whole session, with the worst dip deep into context landing near 78. Even the bad moments are faster than every good moment on the old build.

That's the story. If you've bounced off local models because they felt sluggish during real work, it's worth trying again this week.

LM Studio Developer tab with Qwen 3.6 35B-A3B and a smaller 4B model both resident, with the Server Settings panel open on the right showing Flash Attention, mmap, GPU offload, K/V cache quantization, and evaluation batch size.

Two models resident at once, routed by model_id. The right-hand panel is where the throughput actually comes from.

Why the speed bump matters

Agentic work compounds every token.

I'm not chatting with my local model. I'm running loops. Tool call, review, edit, retry, summarize. A single feature iteration might fire 30 messages across two models. At 57 tok/s and degrading, that kind of loop stalls. You feel every pause. You stop trusting the flow to finish, and you end up babysitting it.

At 80-plus sustained, the loop gets out of your way. The model didn't get smarter this month. The wait cost dropped far enough that I stopped opting out of agent workflows, which is a different kind of improvement.
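To put a number on that wait cost, here's a back-of-envelope sketch. The per-message token count is an assumption for illustration; the 57 and 80 tok/s figures are the before/after measurements above.

```python
# Rough wait-time math for a 30-message agent loop.
# tokens_per_message is an assumed average, not a measurement.
messages = 30
tokens_per_message = 400

def loop_seconds(tok_per_s: float) -> float:
    """Total generation time for the whole loop at a given speed."""
    return messages * tokens_per_message / tok_per_s

before = loop_seconds(57)  # ~210 s of pure generation per iteration
after = loop_seconds(80)   # ~150 s
print(f"before: {before:.0f}s  after: {after:.0f}s  saved: {before - after:.0f}s")
```

Roughly a minute back per iteration, and that's before counting the compounding effect of you not context-switching away while you wait.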

The two speed tiers

The smaller the model, the bigger the payoff.

The 80-plus I mentioned is my big-model floor. On the small end, the same update pays off even harder. A 2B dense model on this machine now clears 160 tokens per second. That's fast enough that the reply lands the moment you stop typing.

The routing lesson writes itself. Small models handle intake, classification, short replies, anything where the user is waiting on a fast answer. The MoE handles the work that actually needs reasoning. Here are two back-to-back measurements from the LM Studio status bar, same machine, same week.
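That routing lesson can be sketched in code. LM Studio's local server speaks the OpenAI-compatible chat completions API (port 1234 by default) and routes requests by the model field. The model ids and the classification heuristic below are placeholders of my own invention; substitute ids from your own My Models list.

```python
# Minimal two-tier router against LM Studio's OpenAI-compatible server.
# Model ids below are hypothetical placeholders, not real download names.
import json
import urllib.request

LMSTUDIO = "http://localhost:1234/v1/chat/completions"

SMALL = "qwen3.5-4b"       # assumed id: fast intake/classification tier
BIG = "qwen3.6-35b-a3b"    # assumed id: MoE reasoning tier

def pick_model(prompt: str) -> str:
    """Crude routing heuristic: short, plain prompts go to the small
    dense model; long or code-heavy prompts go to the MoE."""
    needs_reasoning = len(prompt) > 300 or "```" in prompt or "refactor" in prompt.lower()
    return BIG if needs_reasoning else SMALL

def ask(prompt: str) -> str:
    """One completion against the local server, routed by model id."""
    body = json.dumps({
        "model": pick_model(prompt),
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        LMSTUDIO, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

In a real setup you'd route on intent rather than prompt length, but the shape is the same: one server, two resident models, one field deciding which tier answers.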

LM Studio status bar showing a completed response at 167.06 tokens per second over 1202 tokens, 0.76 seconds to first token, stop reason: EOS token found.

167 tok/s on a small dense model. Short, structured replies feel effectively instant at this speed.

LM Studio status bar showing a completed response at 84.84 tokens per second over 3354 tokens, 1.81 seconds to first token, stop reason: EOS token found.

84.84 tok/s on the MoE for a ~3,300-token answer. Long-form reasoning, still comfortably above the point where agent loops feel alive.

LM Studio chat window showing a table titled 'efficiency evolution' that compares 2024 local models against 2026 local models on intelligence per GB of RAM and practical use cases.

The shape of local has changed. Smaller footprints, better outputs, the kind of jump that's hard to see month to month but obvious when you line up the years.

What to run, by machine

Pick a model that matches the RAM you actually have.

Most of the advice online ignores this. The right model depends on your hardware, and trying to stretch past it is how you end up with 8 tok/s and swap thrash.

Modest hardware, the MacBook Air M-series tier with 16 to 24 GB of unified memory: run one small dense model. I'd pick from Qwen 3.5 4B, Gemma 4 4B, or Nemotron 5 Nano 4B. These are genuinely good for intake, classification, summarization, and short creative work. Coding is okay, not their strength.

Mid-range, 32 to 64 GB of RAM with a dedicated GPU or an M-series Pro or Max: step up to an 8B. Gemma 4 8B, Llama 3.6 8B, or Qwen 3.5 8B all feel noticeably stronger on reasoning and code while staying fast.

Serious workstation, 64 GB-plus unified memory or a high-VRAM GPU: this is where the interesting move is. Skip dense models and run an MoE, specifically Qwen 3.6 35B-A3B. 35 billion total parameters, only about 3 billion active per token. You pay the memory bill of a big model and the speed bill of a small one.
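The memory bill is worth sizing before you download. A rough sketch, assuming a Q4-class GGUF quant at about 4.5 bits per parameter (the bits-per-param figure is an approximation, and real files carry some overhead on top):

```python
# Back-of-envelope memory for a 35B-total / ~3B-active MoE, quantized.
# 4.5 bits/param is an assumed Q4-class figure, not an exact file size.
def model_gb(params_b: float, bits_per_param: float) -> float:
    """Approximate resident size in GB for a quantized model."""
    return params_b * 1e9 * bits_per_param / 8 / 1e9

total = model_gb(35, 4.5)   # ~19.7 GB must stay resident
active = model_gb(3, 4.5)   # ~1.7 GB actually touched per token
print(f"resident: {total:.1f} GB, per-token working set: ~{active:.1f} GB")
```

That gap between resident and active is the whole MoE trade: you budget RAM for ~20 GB but pay per-token compute closer to a 3B model.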

The LM Studio My Models library showing 14 local models across multiple quantizations, including Qwen, Gemma, Llama, and Nemotron variants.

14 models, most of them quantized two or three different ways. I pin one daily driver per machine and treat the rest as task-specific.

LM Studio settings

Turn these on. Defaults leave speed on the table.

The Developer tab's Server Settings panel is the difference between a model that feels fast and one that doesn't. Here's what I actually toggle.

Flash Attention: free win. Lower memory, faster attention. No quality tradeoff on Apple silicon or recent CUDA. Leave this on.

mmap: free win. Faster model load, lighter RAM pressure. Makes swapping between models bearable instead of a 30-second stall each time.

Unified KV Cache: multi-model only. Cleaner memory accounting when you're running more than one model. If you only load one, it doesn't change much.

K/V Cache Quantization: tradeoff. Trade a sliver of quality for real memory savings. Q8 is almost free and should be your default. Q4 is a judgment call worth testing on your own prompts.

GPU Offload: size to fit. Push as many layers onto the GPU as your VRAM allows. Partial offload is fine. Don't try to fit everything if it means swapping to disk, which is far worse than CPU inference.

Evaluation Batch Size: bump it. The default is usually too conservative. Setting it to 2048 moves prompt processing noticeably on most machines.

Context length: right-size. Don't set it higher than you use. 32K handles most work. 65K if you're running long agent sessions. Bigger windows cost memory even when they're empty.
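The K/V cache settings above share one formula, which is why context length and cache quantization both matter: cache bytes scale linearly with context and with bytes per element. The architecture numbers here are assumptions for a typical GQA 8B-class model, not any specific release.

```python
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim * context * bytes/elem.
# Layer/head/dim values are assumed for a generic GQA 8B-class model.
def kv_cache_gib(layers=32, kv_heads=8, head_dim=128,
                 context=32_768, bytes_per_elem=2) -> float:
    """Approximate KV cache footprint in GiB."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 2**30

f16 = kv_cache_gib(bytes_per_elem=2)   # ~4.0 GiB at 32K context, f16 cache
q8 = kv_cache_gib(bytes_per_elem=1)    # ~2.0 GiB with a Q8 K/V cache
short = kv_cache_gib(context=8_192)    # ~1.0 GiB if you right-size context
```

Halving the cache precision or quartering the window each claw back gigabytes, and the cache is allocated up front, which is why an oversized empty window still costs you.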

These aren't magic. They're reasonable defaults for serious work, and LM Studio ships with them off or low for safety. Flip them.

The MoE angle

Big model, small active path.

Mixture of Experts is the architecture I've been waiting on for local. The intuition is simple: it's a big model that only lights up the parts it needs for each token. You get the breadth of a 35B parameter model with the per-token compute of something much smaller.

In practice, Qwen 3.6 35B-A3B feels like a dense 8B in throughput and something in the 24 to 30B class in reasoning. That's a first for me on local. The Discover page pitch flags "stability and real-world utility" and "agentic coding," which is exactly the use case that used to send me back to a cloud API.

It's not free. You still need the RAM to hold all 35B parameters resident. But if you've got 64 GB unified memory or more, this is the one to try.

LM Studio Discover page highlighting Qwen 3.6 35B-A3B with a Staff Pick badge, showing model description about stability, real-world utility, and agentic coding.

Staff Pick, and deservedly so. This is the model I'd download first on any workstation that can hold it.

Line chart showing token generation speed over a long session. Before the Llama.cpp update, speed starts around 65 tokens per second and declines monotonically to 55 at 32K of context. After the update, speed starts near 90 tokens per second and declines more gradually to 78 at 32K. The after line sits roughly 33 percent higher than the before line across the entire session. A sidebar note points out that 2B dense models run near 160 tokens per second on the same hardware, off the top of this chart.

Same prompts, same machine, two Llama.cpp builds back to back. Both lines slope down as context grows (the KV cache doing its usual work), but the new build sustains a markedly higher floor.

Local is fast enough now to run the loops a cloud model can't afford.

— The moment it clicked

The takeaway

Local crossed a line. Pick a model, turn the knobs, try it this week.

A year ago I'd have told you to keep the serious work on a frontier API and treat local as a nice offline backup. That answer changed two weeks ago.

If you've got modest hardware, a 4B gets you further than you'd expect. If you've got mid-range hardware, an 8B is a real daily driver. If you've got a workstation, Qwen 3.6 35B-A3B is the model to try first.

Then open the Developer tab, flip those settings, and run an agent loop. Not a single-turn chat. The whole thing. That's where the speed bump actually pays you back.
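That multi-turn shape can be sketched as a loop. The `complete` callable here is a stand-in for whatever client you point at the local server; the structure, not the client, is the point: context accumulates turn over turn, which is exactly where pre-update builds used to sag.

```python
# A minimal multi-turn benchmark loop, rather than a single-turn chat.
# `complete` stands in for a call to the local server; tasks are illustrative.
from typing import Callable

def run_loop(complete: Callable[[list], str], tasks: list[str]) -> list[str]:
    """Feed each task into one growing conversation so the KV cache
    fills up the way a real agent session fills it up."""
    history: list[dict] = []
    replies: list[str] = []
    for task in tasks:
        history.append({"role": "user", "content": task})
        reply = complete(history)
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies
```

Time the last few replies against the first few. If the floor holds, you've got the new build working; if it sags, check GPU offload and context length before blaming the model.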
