Blog · AI, Building, Thoughts

Why I default to local models for bulk work

Cloud APIs are great for deep reasoning. For everything else, a model on my own machine does the job for free.

Backdraft editor showing a design-to-code workflow.
The kind of workflow that runs on a mix of local and cloud models, split by task.

The default

Classify. Summarize. Extract.

Most of what I ask an LLM to do isn't reasoning. It's classify, summarize, extract. Tag this transcript. Pull out the three key points. Decide if this email is personal or business. These are volume tasks — I run them thousands of times in a pipeline, and the cost adds up fast when every call goes to an API.

So I stopped sending them out. LM Studio runs on my laptop. A small model — Qwen 2B, Gemma 2B, something in that weight class — handles this kind of bulk work fine. Latency is lower because there's no network round-trip. Cost is zero. And when the power goes out or the API rate-limits me, the pipeline keeps running.

Backdraft canvas editing code in real time.

A local model does the grunt work on every keystroke. Only the hard requests escalate.

If the task is repetitive and the structure is fixed, a small local model will do it cheaper, faster, and just as well.

— A rule I've stopped questioning

Where cloud still wins

Reserve the big models for the hard parts.

The cloud models earn their keep on deep synthesis. Planning a multi-step workflow. Debugging a gnarly failure. Writing a case study from interview transcripts. These are the places where a frontier model actually thinks differently — not just faster, but better. That's where I reach for Claude or GPT, because the work genuinely benefits from the extra horsepower.

Think of it like a team. The local model is the intern who handles intake. It reads everything, files it, flags what matters. The cloud model is the senior hire who only gets the hard problems.

Overwatch mode showing every breakpoint at once.

Layout pass — intake happens locally.

Canvas with code panel open.

Component edits — mid-tier model, still local if possible.

Backdraft dashboard and community gallery.

Deep synthesis — this is where the cloud model earns its pay.

Three phases, three model tiers. The router decides which brain gets the work.

The practical setup

How it routes in practice.

My agents default to the local endpoint first. If the task fits a schema — classify, extract, summarize — it stays local. If the router detects something open-ended (multi-step planning, ambiguous instructions, creative writing), it escalates to Claude or GPT.

The beautiful part: this split is invisible to the user. I just talk to the assistant. It figures out which brain does the work.

KO II sampler 3D render.

Physical tools on the desk.

OP-1 Field synth 3D render.

Digital tools on the laptop.

Same principle for hardware — right tool for the task, not the most expensive one.

Share this post

Share on LinkedIn Share on X