The context
Frontier models can muscle through sloppy tools. Local models can't.
A frontier model with a 200K-token window and a few trillion parameters of horsepower can absorb a lot of badly-designed tooling. If line numbers shift between edits, it notices and re-reads the file. If a tool returns ambiguous state, it reasons about what probably happened. Headroom covers for a lot of sharp edges.
A 26B model running on a laptop doesn't have that headroom. 32K tokens, not 200K. Re-reading a 400-line file twice during a review cycle eats 15% of the window. Reasoning its way around "what probably happened" costs turns the model can't afford to spend before it forgets the task. Smaller models are also less forgiving of ambiguity in tool contracts. They follow the contract more literally, which is usually a feature, right up until the contract has a flaw.
So on Jarvis v2, the environment does the work the model can't. Every tool call has to produce predictable state. Every multi-step flow has to work deterministically, without the model having to remember what shifted underneath it. This post is about one flow where I got that contract wrong, and what it took to notice.
The setup
Read once, edit many.
Jarvis's big brain has a read_file tool that returns content with line numbers, and a replace_lines tool that takes start_line, end_line, and a replacement body. Normal flow: read the file once, queue several edits against that snapshot, apply them as a batch.
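The contract can be sketched in a few lines. This is an illustrative mock, not Jarvis's actual implementation; the function names come from the post, the internals (a list-of-lines model, 1-based inclusive ranges) are my assumptions:

```python
def read_file(path: str) -> list[str]:
    """Return the file as a list of lines; the model sees 1-based numbers."""
    with open(path) as f:
        return f.read().splitlines()

def replace_lines(lines: list[str], start_line: int, end_line: int,
                  body: str) -> list[str]:
    """Replace lines start_line..end_line (1-based, inclusive) with body.

    Note: the output may have a different line count than the input range,
    which is exactly what shifts every line number below the edit.
    """
    return lines[:start_line - 1] + body.splitlines() + lines[end_line:]
```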
One edit works fine. Two edits usually work. Five edits against a real file is where it falls apart, and for a while I couldn't figure out why.
The failure mode
Line numbers shift the second you apply the first edit.
Say the model reads component.tsx and gets a 140-line snapshot back. The reviewer flagged three things: a function signature, a call site that depends on it, and some prop defaults further down. The big brain queues all three edits against the one snapshot before applying any of them.
What the model sends to the executor before any apply step runs. Every call references line numbers from the same read_file snapshot:

```text
read_file(path="component.tsx")
→ returned a 140-line snapshot with line numbers

1. replace_lines(start_line=45, end_line=48,
     body="export function UserCard({ id, ...rest })")
   fix the function signature (4 lines in, 6 lines out)

2. replace_lines(start_line=82, end_line=85,
     body="<UserCard id={user.id} />")
   update the call site that uses it

3. replace_lines(start_line=120, end_line=123,
     body="const DEFAULTS = { variant: 'compact' }")
   refresh the prop defaults
```
All three reference line numbers from that same read_file snapshot. None of them know about each other.
Apply call 1. The replacement is six lines instead of four, so the file is now 142 lines. Line 82 is now line 84. Line 120 is now line 122.
Call 2 fires next. It targets line 82, but that content isn't there anymore. Either you hit a content-mismatch error, or worse, you silently overwrite the wrong thing because the lines near 82 look close enough to the snapshot to pass validation. Call 3 fails the same way.
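The failure is easy to reproduce in miniature. A toy sketch (illustrative numbers and a naive list-of-lines executor, both assumptions): one edit that grows the file by two lines, then a second edit whose snapshot-era line number now points two lines too low.

```python
def replace_lines(lines, start, end, body):
    # Naive 1-based inclusive replace, as a stand-in executor.
    return lines[:start - 1] + body.splitlines() + lines[end:]

snapshot = [f"old {i}" for i in range(1, 11)]   # a 10-line "file"
edits = [
    (2, 3, "new A1\nnew A2\nnew A3\nnew A4"),   # 2 lines in, 4 out: +2 shift
    (6, 6, "new B"),                            # meant to hit "old 6"
]

lines = list(snapshot)
for start, end, body in edits:                  # naive top-down apply
    lines = replace_lines(lines, start, end, body)

# After the +2 shift, "line 6" is actually "old 4". The intended target
# survives untouched and an innocent bystander gets clobbered.
assert "old 6" in lines      # never replaced
assert "old 4" not in lines  # silently overwritten
```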
One bad edit turned the review cycle into a doom loop. Reviewer flags three issues. Big brain queues three fixes. Two land in the wrong place. Reviewer sees new issues (because the file is now half-broken), loops again. Work that should take one pass was taking four.
The order you apply changes in matters as much as what the changes are.
— Lesson I kept relearning, one edit at a time
What I tried first
The wrong fixes.
- Re-read the file before every edit. Works. Also burns tokens on a redundant snapshot for every single tool call, which piles up fast on large files.
- Make edits atomic, one per tool call, agent decides when to batch. Works. Also slows down any legitimate multi-fix refactor and adds round-trip latency.
- Switch to diff-style patches. Works on paper. Harder to prompt for correctly, especially when the model is reasoning about what to change rather than literally writing a diff.
All three are real answers. They just all have tradeoffs I didn't love for a flow that wanted to be fast.
The actual fix
Sort edits by start_line descending, apply bottom-up.
An edit at line 120 only shifts lines below line 120. Everything above is untouched. Apply that edit first, and the snapshot stays valid for every remaining edit that targets earlier lines.
Same three calls as before. Same snapshot. Different apply order.
Reordered at apply time by descending start_line. The model can queue edits in any order it likes; the executor sorts them before running:

```text
1. replace_lines(start_line=120, end_line=123,
     body="const DEFAULTS = { variant: 'compact' }")     [applied]
   bottom edit first
   → Lines 1–119 unchanged. Targets above still valid.

2. replace_lines(start_line=82, end_line=85,
     body="<UserCard id={user.id} />")                   [applied]
   middle edit, snapshot still accurate
   → Target content matches snapshot exactly.

3. replace_lines(start_line=45, end_line=48,
     body="export function UserCard({ id, ...rest })")   [applied]
   top edit, nothing below ever touched it
   → Target content matches snapshot exactly.
```
No re-reads. No atomic calls. No diff format. Just a sort on the edit queue before apply.
Same three edits. Top-down lands 1 of 3 correctly. Bottom-up lands 3 of 3.
QA
The test that caught it.
The original symptom wasn't obvious. The TUI was logging edit applied events that looked fine. I only caught it when I diffed the output against what I expected and saw edits landing in the wrong places.
I wrote a fixture. A 150-line file and a queue of five edits at spaced-out line ranges. Ran it top-down, diffed the output, counted hits.
- Top-down: 2 of 5 edits landed correctly. The other three either errored or overwrote nearby lines.
- Bottom-up: 5 of 5 landed, every run.
This is the kind of test that doesn't come out of reading the code. You have to put the system through a realistic load and look at what it actually produced, not what it logged.
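A fixture along those lines is short to write. This is my reconstruction of the shape of the test, not the actual Jarvis fixture: a 150-line file, five spaced-out edits with mismatched in/out sizes, applied bottom-up, then a hit count against the output rather than the logs.

```python
def replace_lines(lines, start, end, body):
    return lines[:start - 1] + body.splitlines() + lines[end:]

def apply_bottom_up(lines, edits):
    for start, end, body in sorted(edits, key=lambda e: e[0], reverse=True):
        lines = replace_lines(lines, start, end, body)
    return lines

# 150-line fixture with five non-overlapping edits of varying sizes
# (the specific ranges and bodies are invented for illustration).
fixture = [f"line {i}" for i in range(1, 151)]
edits = [
    (10, 12, "edit-1"),                 # 3 lines in, 1 out
    (40, 40, "edit-2a\nedit-2b"),       # 1 line in, 2 out
    (75, 78, "edit-3"),
    (101, 101, "edit-4"),
    (140, 145, "edit-5a\nedit-5b\nedit-5c"),
]

result = apply_bottom_up(list(fixture), edits)

# Count hits on the actual output: every replacement must appear, and no
# line inside a replaced range may survive.
hits = sum(body.splitlines()[0] in result for _, _, body in edits)
assert hits == 5
for start, end, _ in edits:
    for n in range(start, end + 1):
        assert f"line {n}" not in result
```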
The one-line fix
A rule in the system prompt.
The fix was a single rule added to the big brain's system prompt (jarvis/llm/prompts.py:30):
"When you queue multiple replace_lines edits against one read_file snapshot, sort them by start_line descending and apply bottom-up. Line-number shifts from earlier edits won't invalidate later ones this way."
I also added the sort on the apply side, so the model can queue edits in any order and the executor reorders before running. Belt and suspenders. The prompt tells the model the right way. The apply-side sort makes sure it can't get it wrong.
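The apply-side guard also buys order-independence: any queue order the model emits produces the identical file. A sketch of that property, again with the toy executor standing in for the real one:

```python
import random

def replace_lines(lines, start, end, body):
    return lines[:start - 1] + body.splitlines() + lines[end:]

def apply(lines, edits):
    # Executor-side guard: reorder before running, regardless of how the
    # model queued the edits.
    for start, end, body in sorted(edits, key=lambda e: e[0], reverse=True):
        lines = replace_lines(lines, start, end, body)
    return lines

snapshot = [f"line {i}" for i in range(1, 51)]
edits = [(5, 6, "a"), (20, 20, "b\nb2"), (44, 45, "c")]

baseline = apply(list(snapshot), edits)
for _ in range(10):
    random.shuffle(edits)
    # Any queue order produces the identical file.
    assert apply(list(snapshot), edits) == baseline
```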
Single-edit cases are unaffected. Multi-edit review cycles now land every change on the first try.
The takeaway
Most LLM tool-call bugs look like prompt bugs until you look closer.
I spent an hour blaming the model for sloppy edits before I tested the assumption that any edit order would work. The model was fine. The tool contract I gave it wasn't.
When a tool call is technically correct but the outcome is wrong, look at the side effects of other calls in the same batch. Check if one call is changing state that a later call assumes. That's where the bug usually lives, and it won't show up in any single call's logs.