Fitted AI Video Series Case Study

The Concept

A character cycles through dozens of outfits in ten seconds flat.

The idea was simple: take a person standing in an outfit, then rapidly swap through dozens of outfit combinations while keeping the person, their pose, and the background perfectly consistent. Each clothing change lands with an audible shutter click. The outfits cycle faster and faster until they snap back to the final look. It's the kind of ad that stops a thumb mid-scroll.

The concept came directly from the Fitted AI app. If the app can generate any outfit on any person, then a video showing that capability in rapid-fire succession is the strongest possible proof. No voiceover needed. No explanation. Just watch it work.

The catch is that this kind of video doesn't exist as a standard workflow. There's no "outfit swap video" tool. Every frame had to be generated individually by an AI model, then reassembled into a video, then composited over the original footage to maintain consistency. The entire pipeline had to be built from scratch.

All three videos in the series.

Grid of clothing items from the digital closet — tops, bottoms, sweaters, jerseys

A selection of the tops and bottoms from the digital closet. The pipeline alternates between swapping the top and swapping the bottom, cycling through the curated set to create dozens of unique outfit combinations per video.

The Pipeline

A Python script that ran for hours, one frame at a time.

The process started with the original video. I wrote a Python script that split the footage into individual frames, then isolated the middle section where the outfit changes would happen. The opening and closing frames stay untouched so the video has clean bookends.
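A minimal sketch of the range split described above, assuming frames are addressed by zero-based index. The function name `split_frame_ranges`, the `intro` and `outro` parameters, and the example numbers are illustrative assumptions, not details of the original script.

```python
from typing import NamedTuple


class FrameRanges(NamedTuple):
    intro: range   # untouched opening frames
    target: range  # frames that would be sent to the image model
    outro: range   # untouched closing frames


def split_frame_ranges(total_frames: int, intro: int, outro: int) -> FrameRanges:
    """Partition frame indices into opening bookend, swap range, closing bookend."""
    if total_frames < 0 or intro < 0 or outro < 0:
        raise ValueError("frame counts must be non-negative")
    if intro + outro > total_frames:
        raise ValueError("bookends are longer than the clip")
    target_end = total_frames - outro
    return FrameRanges(
        intro=range(0, intro),
        target=range(intro, target_end),
        outro=range(target_end, total_frames),
    )


# Hypothetical numbers: a 240-frame clip with 20 untouched frames at each end.
ranges = split_frame_ranges(total_frames=240, intro=20, outro=20)
print(len(ranges.intro), len(ranges.target), len(ranges.outro))  # prints: 20 200 20
```

Keeping the bookends as explicit ranges makes it easy to copy those frames through unchanged when the video is reassembled.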

Frame-by-frame AI generation

Each frame in the target range was sent through Google's Nano Banana image generation API. The script swapped one clothing item at a time — the top would hold for six to eight frames, then the bottom would swap and hold for the same duration, then the top again. This alternating pattern is critical. If every item changed on every frame, the result would be an incomprehensible blur. By holding each piece for several frames before switching, the viewer's eye can register each outfit before the next change hits. The generated image was upscaled to match the source resolution, and each completed frame became the input reference for the next, creating smooth transitions rather than random jumps.
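The alternating hold pattern can be sketched as a schedule that is built before any API call is made. This is an assumption-laden illustration: `build_swap_schedule`, the `Swap` record, the seeded random hold length, and the item names are all hypothetical, and the actual script may have chosen hold lengths differently.

```python
import random
from itertools import cycle
from typing import List, NamedTuple, Sequence


class Swap(NamedTuple):
    frame: int  # first frame on which the new item appears
    slot: str   # "top" or "bottom"
    item: str   # identifier of the clothing item to swap in
    hold: int   # number of frames the item stays before the next swap


def build_swap_schedule(
    tops: Sequence[str],
    bottoms: Sequence[str],
    total_frames: int,
    min_hold: int = 6,
    max_hold: int = 8,
    seed: int = 0,
) -> List[Swap]:
    """Alternate top and bottom swaps, holding each for min_hold..max_hold frames."""
    if not tops or not bottoms:
        raise ValueError("need at least one top and one bottom")
    if min_hold < 1 or max_hold < min_hold:
        raise ValueError("invalid hold range")
    rng = random.Random(seed)  # seeded so a rerun reproduces the same schedule
    pools = {"top": cycle(tops), "bottom": cycle(bottoms)}
    schedule: List[Swap] = []
    frame = 0
    for slot in cycle(("top", "bottom")):
        if frame >= total_frames:
            break
        # Trim the last hold so the schedule ends exactly at total_frames.
        hold = min(rng.randint(min_hold, max_hold), total_frames - frame)
        schedule.append(Swap(frame, slot, next(pools[slot]), hold))
        frame += hold
    return schedule


# Hypothetical item names, 30 frames of swaps.
for swap in build_swap_schedule(["tee", "sweater"], ["jeans", "shorts"], total_frames=30):
    print(swap)
```

Only one slot changes at each swap point, so every frame differs from its predecessor by at most one clothing item.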

Cost and reliability

The script would run for hours per video. Each run cost between fifteen and thirty-five dollars in API credits depending on how many frames were processed. The success rate sat around 90%. The remaining 10% were failures from intermittent API errors, not problems with the generation itself. The Google image generation API would occasionally time out or return malformed responses. I couldn't prevent those failures, but I built the script to log every failed frame so I could rerun just the missing ones without reprocessing the entire batch.
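One way the log-and-rerun behavior could look, as a rough sketch. The JSON log format, the function names `run_frames` and `rerun_failed`, and the stand-in `fake_generate` are assumptions for illustration; the real API call is deliberately replaced by a fake that fails once on two frames.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List


def run_frames(frames: Iterable[int], generate: Callable[[int], None], log_path: Path) -> List[int]:
    """Call generate() for each frame and record the indices that failed."""
    failed: List[int] = []
    for index in frames:
        try:
            generate(index)
        except Exception as exc:  # e.g. timeouts or malformed responses
            failed.append(index)
            print(f"frame {index} failed: {exc}")
    log_path.write_text(json.dumps(failed))  # overwrite with the current failures
    return failed


def rerun_failed(generate: Callable[[int], None], log_path: Path) -> List[int]:
    """Retry only the frames listed in the failure log."""
    if not log_path.exists():
        return []
    pending = json.loads(log_path.read_text())
    return run_frames(pending, generate, log_path)


# Demo with a stand-in generator that fails once on frames 3 and 7 (hypothetical).
flaky = {3, 7}


def fake_generate(index: int) -> None:
    if index in flaky:
        flaky.discard(index)  # succeed on the next attempt
        raise TimeoutError("simulated API timeout")


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "failed_frames.json"
    first_pass = run_frames(range(10), fake_generate, log)
    second_pass = rerun_failed(fake_generate, log)
    print(first_pass, second_pass)  # prints two failure messages, then: [3, 7] []
```

Because the log is rewritten after every pass, repeating `rerun_failed` until it returns an empty list fills in the missing frames without touching the ones that already succeeded.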

Why it had to be scripted

Doing this manually would have been impossible. Hundreds of frames per video, three videos, each frame requiring an API call, an upscale, and a quality check. That's over six hundred individual AI generations plus post-processing. The script automated the entire chain: generate, upscale, verify, log, advance to next frame.
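The generate, upscale, verify, log, advance chain might be structured roughly as below. The stage functions are passed in as callables because their real implementations are not shown in this write-up; `run_chain` and its parameters are hypothetical names, and keeping the last good reference after a failure is an assumption about how chaining recovers, not a documented behavior.

```python
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

Image = TypeVar("Image")


def run_chain(
    frame_indices: Iterable[int],
    first_reference: Image,
    generate: Callable[[Image, int], Image],
    upscale: Callable[[Image], Image],
    verify: Callable[[Image], bool],
) -> Tuple[Dict[int, Image], List[int]]:
    """Process frames in order, feeding each accepted frame into the next call."""
    reference = first_reference
    outputs: Dict[int, Image] = {}
    failed: List[int] = []
    for index in frame_indices:
        try:
            candidate = upscale(generate(reference, index))
        except Exception:
            failed.append(index)  # log and move on
            continue
        if not verify(candidate):
            failed.append(index)  # rejected by the quality check
            continue
        outputs[index] = candidate
        reference = candidate  # frame chaining: advance the reference
    return outputs, failed


# Demo with strings standing in for images; frame 2 fails the check.
outputs, failed = run_chain(
    frame_indices=[1, 2, 3],
    first_reference="src",
    generate=lambda ref, i: f"{ref}+{i}",
    upscale=lambda img: img,
    verify=lambda img: not img.endswith("+2"),
)
print(outputs, failed)  # prints: {1: 'src+1', 3: 'src+1+3'} [2]
```

Injecting the stages keeps the loop itself testable without network access or image libraries.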

Six frames from the video sequence showing the subject across the full take

Eight frames sampled every 12th frame across the generated sequence. The original outfit is on the far left. Each subsequent frame shows the pipeline swapping one item at a time — the top holds for several frames, then the bottom swaps, then the top again — creating a perceptible rhythm of change rather than an incomprehensible blur.

After Effects composition showing multi-layer masking and timeline

The After Effects composition with eight mask layers isolating face, hair, and background regions. The AI-generated video is overlaid on the original footage, with inverted masks ensuring only the clothing regions show AI output.

Compositing

Layered masking to kill the AI jitter.

Raw AI-generated frames have a problem: jitter. Even with frame chaining, the model introduces subtle variations in areas that should stay perfectly still. The face shifts slightly. The hair moves. The background wobbles. In a still image these variations are invisible. In video, played back at 24 or 30 frames per second, they create an unsettling flicker that immediately reads as artificial.

Why full-image swaps, not region masking

My first approach was to use SAM (Segment Anything Model) to isolate just the clothing regions and only swap those areas. In theory, this would keep the face and background pristine while only replacing the outfit. In practice, the masked regions created hard edges and the AI couldn't generate convincing clothing within the irregular mask boundaries. The outfit looked pasted on. I abandoned that approach and went with full-image swaps instead, letting the model regenerate the entire frame and relying on it to respect the original pose. This produced much more natural-looking results, but it meant the face, hair, and background all had slight AI variations frame to frame.

Solving jitter in After Effects

The solution was compositing in post. I took the AI-generated video and overlaid it on the original footage in After Effects, then built eight mask layers to isolate specific body regions. The face and hair were masked to use the original footage, not the AI output. The background was locked to the original. Only the torso and clothing regions used the AI-generated frames. This selective compositing eliminated the jitter everywhere it mattered while preserving the outfit changes where they needed to happen.
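The compositing itself was done by hand in After Effects, but the underlying per-pixel operation can be written out as a small sketch. Here a "protect" mask marks where the original footage must win (face, hair, background), and its inverse lets the AI frame through. The `composite` function, the nested-list grayscale images, and the sample values are illustrative assumptions.

```python
from typing import List

Grid = List[List[float]]  # tiny grayscale image as rows of pixel values


def composite(original: Grid, generated: Grid, protect_mask: Grid) -> Grid:
    """Blend per pixel: mask 1.0 keeps the original, 0.0 shows the AI frame."""
    if not (len(original) == len(generated) == len(protect_mask)):
        raise ValueError("inputs must have the same number of rows")
    out: Grid = []
    for o_row, g_row, p_row in zip(original, generated, protect_mask):
        if not (len(o_row) == len(g_row) == len(p_row)):
            raise ValueError("inputs must have the same row length")
        out.append([p * o + (1.0 - p) * g for o, g, p in zip(o_row, g_row, p_row)])
    return out


original = [[10.0, 10.0, 10.0], [10.0, 10.0, 10.0]]
generated = [[90.0, 90.0, 90.0], [90.0, 90.0, 90.0]]
# Top row fully protected; bottom row has one clothing pixel and one feathered edge.
protect = [[1.0, 1.0, 1.0], [1.0, 0.0, 0.5]]

print(composite(original, generated, protect))
# prints: [[10.0, 10.0, 10.0], [10.0, 90.0, 50.0]]
```

Fractional mask values correspond to feathered mask edges, which is what keeps the boundary between original and generated pixels from reading as a hard seam.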

Post-processing

With the masks in place, I added additional processing to sell the final result. Sharpening to counteract the slight softness that AI upscaling introduces. Lens effects to give the footage a consistent camera feel. Film grain and noise to mask any remaining artifacts at the boundaries between masked and unmasked regions. These aren't heavy-handed effects. They're subtle enough that you don't notice them, but they're doing real work to bridge the gap between AI-generated and camera-original footage.

Audio and final assembly

The final assembly happened in Premiere. I composed the audio track with music, timed the shutter click sound effects to land on each outfit change, and added a closing frame with the Fitted AI logo animation. The audio design is a bigger part of the ad's impact than it might seem. The rapid-fire clicking creates a rhythm that sells the speed of the outfit changes and gives the viewer an auditory anchor for what they're seeing.
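The clicks were timed in Premiere, but the placement follows from simple arithmetic: a swap that first appears on frame n lands at n divided by the frame rate. A minimal sketch, where `click_times` and the example frame numbers are hypothetical:

```python
from typing import Iterable, List


def click_times(change_frames: Iterable[int], fps: float) -> List[float]:
    """Convert the frame index of each outfit change into seconds from clip start."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [round(frame / fps, 3) for frame in change_frames]


# Hypothetical swap frames in a 24 fps clip.
print(click_times([20, 27, 33, 41], fps=24))  # prints: [0.833, 1.125, 1.375, 1.708]
```

Feeding the first frame of each swap from the generation schedule into a function like this would give one timestamp per shutter click.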

Key Features

Python script that splits video into frames, runs each through AI generation with frame chaining, remasks, upscales, and reassembles. Fully automated.

Roughly 90% of frames succeeded on the first pass. Intermittent API failures accounted for the remaining 10%. Failed frames were logged and rerun individually without reprocessing the full batch.

Full-image AI swaps overlaid on original footage with eight hand-drawn masks isolating face, hair, and background. Eliminates jitter while preserving outfit changes.

Each 200-frame run cost between fifteen and thirty-five dollars in API credits and several hours of processing time.

Clothing items photographed front and back, auto-categorized by vision models into tops, bottoms, dresses, outerwear, and shoes.

From batch generation and frame chaining to audio remixing and clothing categorization, each script solved a specific stage of the pipeline.