Blog · AI, Building, vision-models
Mosaic: a pre-focus layer for local vision models
Every modern vision model already chunks images to understand them. Local models need that chunking made explicit, because they can't paper over a missed detail the way a frontier model can.
Image models already chunk. Mosaic just does it out loud.
A 4B model on my laptop looked at a Tokyo street photo and read the surname on a pawn-shop sign as 橋本. The actual kanji is 楠本. Wrong character, wrong reading, confidently delivered.
That kind of miss is what Mosaic is for.
Mosaic is a small CLI I wrote that chunks an image into a grid, walks each tile alongside its neighbors, and writes a final review with the whole picture back in view. A pre-focus layer for the model underneath.
The trick isn't novel. Every modern vision model already chunks images internally. ViT proved you could feed an image to a transformer as a sequence of patches. LLaVA-NeXT splits high-resolution photos into sub-images before the encoder sees them. Qwen2-VL packs arbitrary resolution into one sequence with a 2D positional scheme.
The consensus across modern VLMs is the same. To see the picture, see it in pieces.
The reason to surface the chunking at the application layer is the kind of model I'm running. A 4B on a laptop has a smaller attention budget to absorb a high-resolution image whole. Whatever it misses inside the encoder is just gone. Moving the splitting one layer up gives me a place to compensate.
One full pass on a Tokyo street photo at light density (4×2 grid, 8 tiles). The first two tiles run at real pace; the rest cycle through the same focal-plus-neighbor pattern rapidly so the full flow lands in under thirty seconds.
Inside the encoder
Modern VLMs split before they understand.
The architecture-level pattern is settled. Tile the image, encode each piece, glue the pieces back together with positional information, and let attention sort it out from there.
ViT-B/16 chops a 224×224 image into 196 patches plus a CLS token. LLaVA-NeXT's AnyRes splits high-resolution photos into a grid of sub-images before any of them reach the encoder, then concatenates the token sequences. Qwen2-VL drops absolute position embeddings, adds 2D RoPE, and packs variable-resolution images into one sequence with adjacent token blocks compressed 4-to-1.
Pixtral, Molmo, LLaVA-UHD, every recent open VLM does some version of this. They all tile, encode, concatenate. The seams just get hidden better as the model gets bigger.
Diagram by [Cosmia Nebula](https://commons.wikimedia.org/wiki/File:Vision_Transformer.svg), after Zhang, Lipton, Li & Smola's _Dive into Deep Learning_, via Wikimedia Commons. CC BY-SA 4.0.
Where local breaks down
Local models can't paper over the chunking.
If chunking is universal, why surface it at the application layer?
Because the bigger the model, the more invisibly the chunking works. A frontier-grade VLM with a long context and a deep encoder can absorb a high-resolution image, hold every tile's embedding in attention, and stitch a coherent answer out of the whole sequence. You drop a phone photo into ChatGPT and you don't think about how the patches got there.
Local models aren't there yet. Two specific limits matter.
Attention budget per call. A 4B model splitting its attention across hundreds of image tokens and a long instruction starts losing fidelity on small text and dense regions. The model still produces a confident answer. It just papers over the parts it can't read clearly, or hallucinates them outright.
Context rot in the middle. "Lost in the Middle" shows accuracy drops by roughly 20 points for content stuck in the middle of a long context, and the effect persists in models marketed as long-context. The same dynamic applies to images packed into a long sequence. The middle tiles are the easiest for the model to skim past.
Frontier models also feel this. They just have enough capacity left over to compensate. A small local model doesn't.
Mosaic forces the model to look at every tile in isolation, then again paired with each of its neighbors. Each call is short. Each call has the focal tile near the front of the prompt. Each call gives the model minimal context to lose track of.
The pipeline
Read the whole thing. Then the parts. Then the whole thing again.
Mosaic runs in three stages.
Stage 1, orientation. The model sees the full image at 1280px max and writes a short summary: scene type, layout, regions worth inspecting closely. This becomes a text prefix passed to every subsequent call so the model knows where each tile sits in the larger picture.
Stage 2, per-tile inspection. The image is split into a grid (2×2, 3×3, 4×4, or 5×5 depending on density). For each tile, the model is called once with the tile alone, then once more with the tile paired with each of its neighbors. A corner tile gets two paired calls. A center tile gets four. A small text-only synthesis call merges the per-tile observations into one tile report.
Stage 3, synthesis. The model sees the full image again, plus every tile report from stage 2, and writes the final review.
The paired-neighbor calls turn out to matter. A tile alone tells you what's in that crop. A tile paired with the tile to its left tells you what continues across the boundary. That's how you catch a sign whose text spans two tiles, or a shop name written across an awning that runs from one column into the next.
Same image, four model sizes
Nobody read the surname. The bigger models got closer.
No local model exactly read the surname on the pawn-shop sign. Every model got closer with chunking than without — a one-shot pass on the same image produces nothing readable at all. The bigger the model, the closer it hovered. The smallest model started hallucinating once density crossed a threshold.
I ran the street photo through Qwen 3.5 2B, Gemma 4 E4B, Gemma 4 26B-A4B, and Qwen 3.6 35B-A3B at light density (4×2 grid, 8 tiles, 38 calls per run). Same prompt, same pipeline, same image. Total runtime ranged from 1.5 minutes on the 2B up to 9.6 on the 35B.
Then I went back to the source photo and read the signage myself, character by character. The vertical wooden sign on the left building reads 並木 (Namiki). The bright white sign upper right reads roughly 質 / 買取 / 楠本 with a phone number, looks like a pawn shop named Kusumoto. There's no Shibuya Station signage anywhere in the frame, no "Kinbi," no "Konbee."
| Model | Surname read | Verdict |
|---|---|---|
| Qwen 2B (quick density) | 並 |
Honest partial. First kanji of 並木 |
| Qwen 2B (light density) | KONBEE, KINBI, "Shibuya Station South Exit" | Pure hallucination. None of those exist in the image |
| Gemma E4B (light) | 橋本 (Hashimoto) |
Wrong kanji entirely. Different radicals |
| Gemma 26B-A4B (light) | 桶本 (Okemoto) |
Second kanji 本 matches; first is wrong |
| Qwen 35B-A3B (light) | 楠木 (Kusunoki) |
First kanji 楠 correct. Closest reading any model produced |
Gemma E4B did read 買取 correctly elsewhere on the same sign, so the failure is specifically on the surname, not on every character in the frame.
Why partial wins still matter
Close but wrong is still useful.
If no model nailed the surname, was this a failure?
No. Two reasons.
A one-shot pass would have produced nothing. I tested it. Drop the same image into the same models with no chunking, just "describe everything in this image," and the surname doesn't appear at all. The chunking elevated 2B from "there is text on a sign" to 並. It elevated 35B from generic "Japanese signage" to 楠木. Half a reading is a real lead. Zero is not.
Cross-model disagreement is signal. Three models gave three different readings of the same sign. That disagreement tells you the sign is hard. The image is dim. The text is small. The camera angle distorts the strokes. A downstream pipeline (search, human review, a second opinion from a different model family) has somewhere to focus instead of nowhere.
The right framing isn't "local models can OCR Japanese signage." It's "local models converge on enough ground truth that downstream verification becomes tractable."
Density
Match density to model size, or the small ones start hallucinating.
The right density depends on the model.
2×2 quick for sub-3B models. Four tiles is enough to surface coarse detail without overloading attention. The model stays grounded. This is the right floor for any model that hasn't been trained specifically to handle dense visual grids.
3×3 light for the 4-10B class. Eight to nine tiles. Enough to catch street-level signage on a typical photo, while still keeping each call's attention focused.
4×4 dense for 20B and up. Sixteen tiles surfaces fine-grained text, license plates, the kind of detail that needs sustained attention to extract. Don't run a 4B at this density. It will confidently make things up.
What's next
Phase-dependent prompts: zoom, then ask one thing at a time.
The 35B's near-miss on 楠 points at the next axis worth tuning. The model wasn't asked to read text. It was asked to inspect the tile. The text reading came out as a side effect of "describe what's here."
Mosaic already splits the image. The prompt still asks for everything at once. The next version splits the prompt too. Same tile, three calls instead of one: one for text, one for subjects, one for lighting and image quality. Each call narrows the model's attention to a single dimension.
The text-only call would say something like:
Transcribe any text visible in this tile. Romanize and translate non-Latin scripts. If a character is partial or ambiguous, mark it as unclear instead of guessing.
Same insight that made tiling work in the first place, applied to the prompt instead of the image. More time. Less drift per call. For a small model running locally, that's the right trade.
Speed isn't the goal. Quality at local scale is.
If big models need to chunk, small ones need it more.
The chunking has always been there. Inside ViT, inside LLaVA-NeXT, inside Qwen2-VL. Modern vision models see the image in pieces because seeing the whole thing at once has never worked.
On a 4B running on a laptop, the compensation runs out fast. The image arrives at the encoder, the model splits it into patches, and whatever didn't survive that split is gone. There's no extra attention surface to recover it.
Mosaic puts the splitting back in the open. It costs more time to run. What it gives back is a small local model meeting the image halfway, with enough ground truth on the table for a downstream pass (a phase-dependent reread, a different model, a human) to take it the rest of the way. That's the quality the local-model use case actually needs.
Share this post
Keep reading
More worth your time
Blog post · AI, Building, Process
Catching the 7,000-character write
When your local model's tool call drops a required parameter and a long file almost gets thrown away.
Read post →
Blog post · AI, Building, Process
The bottom-up edit rule
When a model queues five edits against one file, working top-down is a bug. Here's the order that fixed it.
Read post →Stay in the loop
New posts, same voice.
Get a short email when I publish something new. No weekly digests, no link dumps — just the essays.