Blog
Essays and breakdowns on building products, design, AI orchestration, and everything in between. Sometimes polished, sometimes a work-in-progress.
We chased million-token context windows for years. The rot didn't get fixed. It just moved somewhere quieter.
Every time a regulated firm pastes sensitive data into a cloud chatbot, it isn't using AI. It's leaking.
Your phone rings, it's your daughter's voice, and she's panicking. Except she never called.
Thirteen words decode how AI really works, from token to scaling laws. No math degree required.
Every modern vision model already chunks images to understand them. Local models need that chunking made explicit, because they can't paper over a missed detail the way a frontier model can.
When your local model's tool call drops a required parameter and a long file almost gets thrown away.
When a model queues five edits against one file, working top-down is a bug. Here's the order that fixed it.
My agent's context window kept jumping from 22% to 60% in a single turn. The leak wasn't where I was looking.
A recent llama.cpp update pushed me from 60 tokens per second to 80-plus on the same machine. Here's what I'd run and what I'd turn on.