Frames Per Second (Part 3): Turning a Tiny Diffusion Model into a Traveling Photobooth
Image credit: Nano Banana Pro
Introduction
I’m on a personal mission to recreate the Nano Banana experience on the edge…or at least get as close as physics and open-source tools will allow. In my first blogpost, I explained why Apple’s CoreML Stable Diffusion (SD) is my best bet. In the last post, I broke down how Apple squeezed a 6GB model down to 1.5GB while still delivering sub-10-second generation on iOS.
But there’s a catch: Apple’s implementation breaks my use case. If I try a simple img2img portrait edit, the subject’s identity collapses. As seen in Figure 1, once I make the denoising strength high enough to change the background, my subject morphs into a loosely related stranger who just happens to be wearing a similar outfit.

If I rely solely on Apple’s CoreML Stable Diffusion model, I run into an impossible trade-off:
- Low strength (0.3-0.5): Character consistency is maintained, but the background barely changes
- High strength (0.7-0.9): Background transforms perfectly to align with the given text prompt; however, the person pictured becomes unrecognizable.
This is hardly a surprise, since (a) the original Nano Banana model (released in August 2025) broke the internet precisely for its ability to maintain character consistency, and (b) we’re working with a hyper-optimized version of a 2022 model. The deeper problem is that Stable Diffusion can’t distinguish between “keep this” and “change that”: it tries to transform every pixel equally, so it ends up doing two conflicting jobs at once: preserving user identity and dramatically transforming the background.
The lesson? Stop asking Stable Diffusion to multitask. I need to handle identity preservation and scene transformation separately. This blogpost shows how to do this on a shoestring compute budget.
A Three-Stage Approach
I’m a fan of simple solutions, especially under a tight runtime budget. So, I started with the simplest move possible: I isolated the subject and focused Stable Diffusion’s efforts on background generation. This created a lightweight, three-stage pipeline:
Segment. I use Apple’s Vision framework to perform person segmentation. This yields (a) a cutout of the person with a transparent background, and (b) an inverted mask marking which pixels need regeneration.
Generate. I feed the inverted mask and text prompt into Stable Diffusion’s img2img pipeline. SD regenerates only the masked background regions while leaving the subject’s pixels untouched.
Composite. I then layer the original subject cutout over the newly generated background. In order to deliver a photo booth-like user experience, I also added optional Instagram-style filters to make the final outputs more shareable.

The final result is a lean, fully on-device conditional image generation pipeline that runs in ~27 seconds on average, putting me safely below my 60-second limit.
Now, let’s dive into the details. To demonstrate each stage’s output, we’ll successfully transport Sabrina Carpenter from the concert stage to a Winter Wonderland.

Stage 1: Person Segmentation (~1s)
First, I extract the subject from the background using Apple’s Vision framework, specifically the VNGeneratePersonSegmentationRequest API. This built-in segmentation model ships with iOS and is already optimized for the Neural Engine.
Deploying Apple’s off-the-shelf solution lets me focus on the core problem without getting distracted by additional deployment overhead. Apple has already hyper-optimized this image segmentation model for its Apple Neural Engine (ANE) hardware accelerator, which means that even when I set the preferred quality level to high, the segmentation model still returns a result within ~1 second. I’ve allotted about 67% of my inference-time budget to Stable Diffusion (40 seconds) and the remainder to everything else, so keeping the segmentation step at ~1 second leaves me with plenty of breathing room.
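In simplified form, the segmentation call looks roughly like the sketch below. This is a hedged illustration built on the standard Vision and Core Image APIs; the inversion step and the omission of mask resizing are my assumptions about the plumbing, not a verbatim excerpt from the app.
```swift
import Vision
import CoreImage
import CoreImage.CIFilterBuiltins
import CoreVideo

// Hedged sketch of Stage 1: person segmentation plus mask inversion.
// `inputCGImage` is a placeholder for the original photo; resizing the mask
// back to the input resolution is omitted for brevity.
func backgroundMask(for inputCGImage: CGImage) throws -> CIImage? {
    let request = VNGeneratePersonSegmentationRequest()
    request.qualityLevel = .accurate                      // "high" quality, still ~1s on the ANE
    request.outputPixelFormat = kCVPixelFormatType_OneComponent8

    let handler = VNImageRequestHandler(cgImage: inputCGImage, options: [:])
    try handler.perform([request])

    guard let maskBuffer = request.results?.first?.pixelBuffer else { return nil }
    let personMask = CIImage(cvPixelBuffer: maskBuffer)   // bright pixels = person

    // Invert so white marks the background pixels Stable Diffusion should regenerate.
    let invert = CIFilter.colorInvert()
    invert.inputImage = personMask
    return invert.outputImage
}
```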
Figure 3 shows an example of this model’s outputs, where it separates the subject from her surroundings.
Now, let’s whisk Sabrina Carpenter off the Coachella stage and drop her straight into a glittery Winter Wonderland for a festive, snow-dusted performance.
Stage 2: Conditional Background Generation
This step is where the magic happens, and where most of my runtime budget disappears. I feed the background mask from Stage 1 (see Figure 4C) and the text prompt into Stable Diffusion’s img2img pipeline. The mask acts like a stencil: white regions get regenerated, black pixels (the subject) stay untouched. Everything gets resized to 512×512 before inference, since that’s SD’s native training resolution.
For denoising strength, I stayed within the commonly recommended 0.65–0.85 range: low enough to preserve subject boundaries, high enough to meaningfully transform the background. I used the standard 25 DPM-Solver steps and kept the guidance scale at its default of 7.5.
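To make those settings concrete, here is a minimal sketch of how the call can be wired up with Apple’s ml-stable-diffusion Swift package. Treat it as an illustration under assumptions: resourceURL and startingImage are placeholders, and the mask-aware handling of subject pixels is omitted.
```swift
import CoreML
import StableDiffusion

// Hedged sketch of the Stage 2 call, assuming apple/ml-stable-diffusion's Swift API.
// `resourceURL` and `startingImage` are placeholders for the compiled model folder
// and the 512x512 input image from Stage 1.
let mlConfig = MLModelConfiguration()
mlConfig.computeUnits = .cpuAndNeuralEngine

let pipeline = try StableDiffusionPipeline(resourcesAt: resourceURL,
                                           controlNet: [],
                                           configuration: mlConfig,
                                           reduceMemory: true)
try pipeline.loadResources()

var config = StableDiffusionPipeline.Configuration(
    prompt: "A glittery winter wonderland with snow, twinkling lights, warm glow")
config.startingImage = startingImage                 // img2img starting point
config.strength = 0.75                               // inside the 0.65-0.85 sweet spot
config.stepCount = 25                                // DPM-Solver steps
config.guidanceScale = 7.5
config.schedulerType = .dpmSolverMultistepScheduler

let images = try pipeline.generateImages(configuration: config) { _ in true }
```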
Prompt Engineering
Prompt engineering took longer than I’d like to admit. I wanted to create a visually striking background, so I started with maximalist prompts ("A winter wonderland with snow-covered pine trees, twinkling fairy lights, ice sculptures, frosted windows..."). CoreML Stable Diffusion got overwhelmed and returned incoherent mush. Then I went ultra-minimal ("A winter scene") and got a bleak, featureless white void.
The sweet spot was photography-style phrasing with a few concrete details, like "A glittery winter wonderland with snow, twinkling lights, warm glow": enough direction to steer the model, but not so much that it gets overwhelmed. Along the way, I learned:
- Stable Diffusion trims anything past ~75 tokens
- Evocative scene vibes are better than itemized lists
- Lighting cues, like “warm orange glow” vs. “blue hour twilight”, can set the entire mood
Now, let’s see what all that work actually produces. Here’s the raw background Stable Diffusion generated before the subject gets composited back in (Figure 5).
Thread Safety and Process Survival
Running a Stable Diffusion pipeline on-device means juggling two hard problems:
- Thread safety. Segmentation, SD inference, and UI updates all touch the same shared state, creating the perfect incubator for race conditions.
- Process survival. I need to keep the UI responsive while SD runs for ~27 seconds in the background. At the same time, iOS locks the screen after 30 seconds of inactivity and suspends the app, which kills image generation.
In short, I had to choose between concurrency and chaos. I enabled Swift 6’s strict concurrency to catch threading bugs at compile time rather than dealing with surprises in production. With strict concurrency, everything needs explicit actor boundaries: the UI state (@Published properties, view model updates) runs on the main thread, while Stable Diffusion inference runs on background threads so it doesn’t freeze the entire app.
```swift
// Simplified coordinator pattern
func generateBackground() {
    isProcessing = true                  // MainActor UI update
    Task.detached {                      // Background thread for heavy work
        let result = await pipeline.generate(...)
        await MainActor.run {            // Back to MainActor for UI
            self.outputImage = result
            self.isProcessing = false
        }
    }
}
```
Figure 6. Actor coordination pattern in Swift. The thread hopping pattern runs inference on a background thread, then returns to the main thread (@MainActor) for UI updates.
I also registered the generation work as a background task so that image generation continues even if the screen locks. Without it, we’d be left with half a cottage and no Winter Wonderland magic. Once the final image is composited, the background task is released.
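A minimal sketch of that registration, assuming UIKit’s background-task API (the type and task names here are illustrative, not the exact code in the app):
```swift
import UIKit

// Hedged sketch: one way to keep generation alive if the screen locks,
// using UIKit's background-task API.
@MainActor
final class GenerationTaskGuard {
    private var taskID: UIBackgroundTaskIdentifier = .invalid

    func begin() {
        taskID = UIApplication.shared.beginBackgroundTask(withName: "SDGeneration") { [weak self] in
            // Expiration handler: iOS is about to reclaim our extra time.
            self?.end()
        }
    }

    func end() {
        guard taskID != .invalid else { return }
        UIApplication.shared.endBackgroundTask(taskID)
        taskID = .invalid
    }
}
```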
Stage 3: Compositing & Style Filters (<1s)
With the background generated, I’m ready to layer the isolated subject (Fig. 4a) into the new scene. I use Core Graphics, Apple’s low-level 2D rendering framework, to composite the two layers. This step is fast, clean, and essentially free in terms of runtime.
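Conceptually, the composite is just two draw calls. The sketch below uses UIKit’s UIGraphicsImageRenderer, a thin convenience over Core Graphics, with background and subjectCutout as placeholder images of the same size.
```swift
import UIKit

// Hedged sketch of Stage 3 compositing. The cutout's alpha channel does the masking;
// `background` and `subjectCutout` are assumed to be same-sized UIImages.
func composite(subjectCutout: UIImage, over background: UIImage) -> UIImage {
    let renderer = UIGraphicsImageRenderer(size: background.size)
    return renderer.image { _ in
        let frame = CGRect(origin: .zero, size: background.size)
        background.draw(in: frame)        // new Stable Diffusion background
        subjectCutout.draw(in: frame)     // subject layered on top, alpha preserves edges
    }
}
```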
Originally, I planned to chain multiple Stable Diffusion generations together to create a flexible style transfer experience, allowing the user to further personalize their images, but each extra pass incurs another ~25 seconds. No one is going to wait 2+ minutes to try on a different look.
So I switched to Core Image filters, which run near-instantly. I added four curated styles plus an intensity slider, letting users experiment in real time and turning the whole system into a fully on-device, pocket-sized photobooth. Figure 8 highlights a few results, any of which could slide neatly onto Sabrina’s holiday-themed merch.
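As an illustration of the kind of filter stack involved, here is one possible “warm vintage” look built from Core Image’s built-in filters, with the intensity slider implemented as a cross-fade between the original and the styled output. The specific filters and parameters are assumptions for demonstration, not the exact four styles in the app.
```swift
import CoreImage
import CoreImage.CIFilterBuiltins

// Hedged sketch of one "warm vintage" style; filters and parameters are illustrative.
// The intensity slider is a cross-fade between the original and the styled image.
func applyWarmVintage(to input: CIImage, intensity: Float) -> CIImage {
    let sepia = CIFilter.sepiaTone()
    sepia.inputImage = input
    sepia.intensity = 0.8

    let vignette = CIFilter.vignette()
    vignette.inputImage = sepia.outputImage
    vignette.intensity = 1.0
    vignette.radius = 1.5

    guard let styled = vignette.outputImage else { return input }

    // 0 = untouched photo, 1 = full filter strength.
    let blend = CIFilter.dissolveTransition()
    blend.inputImage = input
    blend.targetImage = styled
    blend.time = intensity
    return blend.outputImage ?? input
}
```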
Of course, this system isn’t a one-hit wonder. Figure 9 shows the same pipeline dropping Sabrina underneath a Christmas tree, into San Diego’s Balboa Park, and even into a groovy reimagining of La Jolla Cove, proving that this pocket-sized photobooth travels just as well as she does.
Lessons from the Edge
When I started this project, I knew bringing the Nano Banana experience to mobile would be tough; I just didn’t realize how tough. I learned three valuable lessons along the way:
1. At small scales, hardware-aware wins
Users won’t accept bad images just because the model runs fast. Once we shrink diffusion, quality depends heavily on hardware we don’t control or fully understand. Apple’s vertical integration makes some optimizations look effortless, but replicating them from the outside is anything but.
2. If your prompt loses focus, Stable Diffusion will too
I learned quickly that mixing themes (e.g., Winter Wonderland + robots) just produces incoherent mush. Chaining prompts or tweaking denoising strength didn’t help either: high strength erased the scene, while low strength led to incoherent, blurry transformations.
Blending multiple semantic concepts is a completely different problem, one tackled by disentangled-control methods like ControlNet and IP-Adapter. But those techniques rely on extra conditioning modules that add hundreds of megabytes and several seconds of latency. That’s fine on a workstation, disastrous for a sub-30-second mobile experience.
3. Architecture solves capability gaps
Splitting the process into Segment → Generate → Composite avoided the pitfalls of an all-in-one model. Segmentation preserved identity, SD produced high-quality backgrounds, and fast filters enabled rapid style iteration. Even if on-device disentangled-control were possible, the gains for the end user would be minimal. The modular workflow already delivers strong speed and consistency. This model has its limitations, but good system design works around them.
Of course, as the base model improves, so does the space for smarter systems built around its new limits.
Conclusion
I accomplished my original goal of building a Nano-Banana–style photobooth that runs entirely on-device. Along the way, I also ended up with a compact computer-vision playground (Figure 10) ready for whatever comes next.
Future Directions
From here, my options are boundless, but they include: automating the tedious prompt-engineering process with reinforcement learning techniques like DDPO, recreating some aspects of identity-preserving style transfer on-device, or fine-tuning my tiny Stable Diffusion model with LoRA adapters so it knows why Sabrina Carpenter is “Man’s Best Friend”. I now have a solid foundation and am free to explore what’s genuinely interesting.
The Joy of Building Small
A GPU cluster would brute-force most of the problems I hit, but edge constraints push me to invent better solutions. With tiny, local models, I skip cloud overhead, iterate faster, and avoid unwanted surprise bills. Starting at the bottom of the scale curve is liberating: any capability I unlock here will only get stronger as the model scales. And in the end, the setup stays small, but the possibilities don’t.
I’m excited to keep iterating on this work. If you want to dig into the details, the full implementation is on GitHub. Questions, feedback, or wild ideas? Drop a comment or reach out on LinkedIn. I always enjoy meeting people working at the edge of what’s possible — on-device or in the cloud.
