Frames Per Second (Part 3): Turning a Tiny Diffusion Model into a Traveling Photobooth
Image credit: Nano Banana Pro
Introduction
I’m on a personal mission to recreate the Nano Banana experience on the edge…or at least get as close as physics and open-source tools will allow. In my first blogpost, I explained why Apple’s CoreML Stable Diffusion (SD) is my best bet. In the last post, I broke down how Apple squeezed a 6GB model down to 1.5GB while still delivering sub-10-second generation on iOS.
But there’s a catch: Apple’s implementation breaks my use case. If I try a simple img2img portrait edit, the subject’s identity collapses. As seen in Figure 1, once I make the denoising strength high enough to change the background, my subject morphs into a loosely related stranger who just happens to be wearing a similar outfit.

If I rely solely on Apple’s CoreML Stable Diffusion model, I run into an impossible trade-off:
- Low strength (0.3-0.5): Character consistency is maintained, but the background barely changes
- High strength (0.7-0.9): Background transforms perfectly to align with the given text prompt; however, the person pictured becomes unrecognizable.
This is hardly a surprise, since (a) the original Nano Banana model (released in August 2025) broke the internet precisely for its ability to maintain character consistency, and (b) we’re working with a hyper-optimized version of a 2022 model. The deeper problem is that Stable Diffusion can’t distinguish between “keep this” and “change that”: it tries to transform every pixel equally, so it ends up doing two conflicting jobs at once: preserving user identity and dramatically transforming the background.
The lesson? Stop asking Stable Diffusion to multitask. I need to handle identity preservation and scene transformation separately. This blogpost shows how to do this on a shoestring compute budget.
A Three-Stage Approach
I’m a fan of simple solutions, especially under a tight runtime budget. So, I started with the simplest move possible: I isolated the subject and focused Stable Diffusion’s efforts on background generation. This created a lightweight, three-stage pipeline:
Segment. I use Apple’s Vision framework to perform person segmentation. This yields (a) a cutout of the person with a transparent background, and (b) an inverted mask marking which pixels need regeneration.
Generate. I feed the inverted mask and text prompt into Stable Diffusion’s img2img pipeline. SD regenerates only the masked background regions while leaving the subject’s pixels untouched.
Composite. I then layer the original subject cutout over the newly generated background. In order to deliver a photo booth-like user experience, I also added optional Instagram-style filters to make the final outputs more shareable.

The final result is a lean, fully on-device conditional image generation pipeline that runs in ~27 seconds on average, putting me safely below my 60-second limit.
Now, let’s dive into the details. To demonstrate each stage’s output, we’ll successfully transport Sabrina Carpenter from the concert stage to a Winter Wonderland.

Stage 1: Person Segmentation (~1s)
First, I extract the subject from the background using Apple’s Vision framework, specifically the VNGeneratePersonSegmentationRequest API. This built-in segmentation model ships with iOS and is already optimized for the Neural Engine.
Deploying Apple’s off-the-shelf solution lets me focus on the core problem without getting distracted by additional deployment overhead. Apple has already hyper-optimized this image segmentation model for its Apple Neural Engine (ANE) hardware accelerator, which means that even when I set the preferred quality level to high, the segmentation model still returns a result within ~1 second. I’ve allotted about 67% of my inference-time budget to Stable Diffusion (40 seconds) and the remainder to everything else, so keeping the segmentation step at ~1 second leaves me with plenty of breathing room.
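In simplified form, the segmentation call looks roughly like the sketch below. This is a hedged illustration built on the standard Vision and Core Image APIs; the inversion step and the omission of mask resizing are my assumptions about the plumbing, not a verbatim excerpt from the app.
```swift
import Vision
import CoreImage
import CoreImage.CIFilterBuiltins
import CoreVideo

// Hedged sketch of Stage 1: person segmentation plus mask inversion.
// `inputCGImage` is a placeholder for the original photo; resizing the mask
// back to the input resolution is omitted for brevity.
func backgroundMask(for inputCGImage: CGImage) throws -> CIImage? {
    let request = VNGeneratePersonSegmentationRequest()
    request.qualityLevel = .accurate                      // "high" quality, still ~1s on the ANE
    request.outputPixelFormat = kCVPixelFormatType_OneComponent8

    let handler = VNImageRequestHandler(cgImage: inputCGImage, options: [:])
    try handler.perform([request])

    guard let maskBuffer = request.results?.first?.pixelBuffer else { return nil }
    let personMask = CIImage(cvPixelBuffer: maskBuffer)   // bright pixels = person

    // Invert so white marks the background pixels Stable Diffusion should regenerate.
    let invert = CIFilter.colorInvert()
    invert.inputImage = personMask
    return invert.outputImage
}
```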
Figure 3 shows an example of this model’s outputs, where it separates the subject from her surroundings.
Now, let’s whisk Sabrina Carpenter off the Coachella stage and drop her straight into a glittery Winter Wonderland for a festive, snow-dusted performance.
Stage 2: Conditional Background Generation
This step is where the magic happens, and where most of my runtime budget disappears. I feed the background mask from Stage 1 (see Figure 4C) and the text prompt into Stable Diffusion’s img2img pipeline. The mask acts like a stencil: white regions get regenerated, black pixels (the subject) stay untouched. Everything gets resized to 512×512 before inference, since that’s SD’s native training resolution.
For denoising strength, I stayed within the commonly recommended 0.65–0.85 range: low enough to preserve subject boundaries, high enough to meaningfully transform the background. I used the standard 25 DPM-Solver steps and kept the guidance scale at its default of 7.5.
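To make those settings concrete, here is a minimal sketch of how the call can be wired up with Apple’s ml-stable-diffusion Swift package. Treat it as an illustration under assumptions: resourceURL and startingImage are placeholders, and the mask-aware handling of subject pixels is omitted.
```swift
import CoreML
import StableDiffusion

// Hedged sketch of the Stage 2 call, assuming apple/ml-stable-diffusion's Swift API.
// `resourceURL` and `startingImage` are placeholders for the compiled model folder
// and the 512x512 input image from Stage 1.
let mlConfig = MLModelConfiguration()
mlConfig.computeUnits = .cpuAndNeuralEngine

let pipeline = try StableDiffusionPipeline(resourcesAt: resourceURL,
                                           controlNet: [],
                                           configuration: mlConfig,
                                           reduceMemory: true)
try pipeline.loadResources()

var config = StableDiffusionPipeline.Configuration(
    prompt: "A glittery winter wonderland with snow, twinkling lights, warm glow")
config.startingImage = startingImage                 // img2img starting point
config.strength = 0.75                               // inside the 0.65-0.85 sweet spot
config.stepCount = 25                                // DPM-Solver steps
config.guidanceScale = 7.5
config.schedulerType = .dpmSolverMultistepScheduler

let images = try pipeline.generateImages(configuration: config) { _ in true }
```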
Prompt Engineering
Prompt engineering took longer than I’d like to admit. I wanted to create a visually striking background, so I started with maximalist prompts ("A winter wonderland with snow-covered pine trees, twinkling fairy lights, ice sculptures, frosted windows..."). CoreML Stable Diffusion got overwhelmed and returned incoherent mush. Then I went ultra-minimal ("A winter scene") and got a bleak, featureless white void.
The sweet spot was photography-style phrasing with a few concrete details, like "A glittery winter wonderland with snow, twinkling lights, warm glow": enough direction to steer the model, but not so much that it gets overwhelmed. Along the way, I learned:
- Stable Diffusion trims anything past ~75 tokens
- Evocative scene vibes are better than itemized lists
- Lighting cues, like “warm orange glow” vs. “blue hour twilight”, can set the entire mood
Now, let’s see what all that work actually produces. Here’s the raw background Stable Diffusion generated before the subject gets composited back in (Figure 5).
Thread Safety and Process Survival
Running a Stable Diffusion pipeline on-device means juggling two hard problems:
- Thread safety. Segmentation, SD inference, and UI updates all touch the same shared state, creating the perfect incubator for race conditions.
- Process survival. I need to keep the UI responsive while SD runs for ~27 seconds in the background. At the same time, iOS locks the screen after 30 seconds of inactivity and suspends the app, which kills image generation.
In short, I had to choose between concurrency and chaos. I enabled Swift 6’s strict concurrency to catch threading bugs at compile time rather than dealing with surprises in production. With strict concurrency, everything needs explicit actor boundaries: the UI state (@Published properties, view model updates) runs on the main thread, while Stable Diffusion inference runs on background threads so it doesn’t freeze the entire app.
```swift
// Simplified coordinator pattern
func generateBackground() {
    isProcessing = true                  // MainActor UI update
    Task.detached {                      // Background thread for heavy work
        let result = await pipeline.generate(...)
        await MainActor.run {            // Back to MainActor for UI
            self.outputImage = result
            self.isProcessing = false
        }
    }
}
```
Figure 6. Actor coordination pattern in Swift. The thread hopping pattern runs inference on a background thread, then returns to the main thread (@MainActor) for UI updates.
I also registered the generation work as a background task so that image generation continues even if the screen locks. Without it, we’d be left with half a cottage and no Winter Wonderland magic. Once the final image is composited, the background task is released.
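A minimal sketch of that registration, assuming UIKit’s background-task API (the type and task names here are illustrative, not the exact code in the app):
```swift
import UIKit

// Hedged sketch: one way to keep generation alive if the screen locks,
// using UIKit's background-task API.
@MainActor
final class GenerationTaskGuard {
    private var taskID: UIBackgroundTaskIdentifier = .invalid

    func begin() {
        taskID = UIApplication.shared.beginBackgroundTask(withName: "SDGeneration") { [weak self] in
            // Expiration handler: iOS is about to reclaim our extra time.
            self?.end()
        }
    }

    func end() {
        guard taskID != .invalid else { return }
        UIApplication.shared.endBackgroundTask(taskID)
        taskID = .invalid
    }
}
```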
Stage 3: Compositing & Style Filters (<1s)
With the background generated, I’m ready to layer the isolated subject (Fig. 4a) into the new scene. I use Core Graphics, Apple’s low-level 2D rendering framework, to composite the two layers. This step is fast, clean, and essentially free in terms of runtime.
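Conceptually, the composite is just two draw calls. The sketch below uses UIKit’s UIGraphicsImageRenderer, a thin convenience over Core Graphics, with background and subjectCutout as placeholder images of the same size.
```swift
import UIKit

// Hedged sketch of Stage 3 compositing. The cutout's alpha channel does the masking;
// `background` and `subjectCutout` are assumed to be same-sized UIImages.
func composite(subjectCutout: UIImage, over background: UIImage) -> UIImage {
    let renderer = UIGraphicsImageRenderer(size: background.size)
    return renderer.image { _ in
        let frame = CGRect(origin: .zero, size: background.size)
        background.draw(in: frame)        // new Stable Diffusion background
        subjectCutout.draw(in: frame)     // subject layered on top, alpha preserves edges
    }
}
```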
Originally, I planned to chain multiple Stable Diffusion generations together to create a flexible style transfer experience, allowing the user to further personalize their images, but each extra pass incurs another ~25 seconds. No one is going to wait 2+ minutes to try on a different look.
So I switched to Core Image filters, which run near-instantly. I added four curated styles plus an intensity slider, letting users experiment in real time and turning the whole system into a fully on-device, pocket-sized photobooth. Figure 8 highlights a few results, any of which could slide neatly onto Sabrina’s holiday-themed merch.
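As an illustration of the kind of filter stack involved, here is one possible “warm vintage” look built from Core Image’s built-in filters, with the intensity slider implemented as a cross-fade between the original and the styled output. The specific filters and parameters are assumptions for demonstration, not the exact four styles in the app.
```swift
import CoreImage
import CoreImage.CIFilterBuiltins

// Hedged sketch of one "warm vintage" style; filters and parameters are illustrative.
// The intensity slider is a cross-fade between the original and the styled image.
func applyWarmVintage(to input: CIImage, intensity: Float) -> CIImage {
    let sepia = CIFilter.sepiaTone()
    sepia.inputImage = input
    sepia.intensity = 0.8

    let vignette = CIFilter.vignette()
    vignette.inputImage = sepia.outputImage
    vignette.intensity = 1.0
    vignette.radius = 1.5

    guard let styled = vignette.outputImage else { return input }

    // 0 = untouched photo, 1 = full filter strength.
    let blend = CIFilter.dissolveTransition()
    blend.inputImage = input
    blend.targetImage = styled
    blend.time = intensity
    return blend.outputImage ?? input
}
```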
Of course, this system isn’t a one-hit wonder. Figure 9 shows the same pipeline dropping Sabrina underneath a Christmas tree, into San Diego’s Balboa Park, and even into a groovy reimagining of La Jolla Cove, proving that this pocket-sized photobooth travels just as well as she does.
Lessons from the Edge
When I started this project, I knew bringing the Nano Banana experience to mobile would be tough; I just didn’t realize how tough. I learned three valuable lessons along the way:
1. At small scales, hardware-aware wins
Users won’t accept bad images just because the model runs fast. Once we shrink diffusion, quality depends heavily on hardware we don’t control or fully understand. Apple’s vertical integration makes some optimizations look effortless, but replicating them from the outside is anything but.
2. If your prompt loses focus, Stable Diffusion will too
I learned quickly that mixing themes (e.g., Winter Wonderland + robots) just produces incoherent mush. Chaining prompts or tweaking denoising strength didn’t help either: high strength erased the scene, while low strength led to incoherent, blurry transformations.
Blending multiple semantic concepts is a completely different problem, one tackled by disentangled-control methods like ControlNet and IP-Adapter. But those techniques rely on extra conditioning modules that add hundreds of megabytes and several seconds of latency. That’s fine on a workstation, disastrous for a sub-30-second mobile experience.
3. Architecture solves capability gaps
Splitting the process into Segment → Generate → Composite avoided the pitfalls of an all-in-one model. Segmentation preserved identity, SD produced high-quality backgrounds, and fast filters enabled rapid style iteration. Even if on-device disentangled-control were possible, the gains for the end user would be minimal. The modular workflow already delivers strong speed and consistency. This model has its limitations, but good system design works around them.
Of course, as the base model improves, so does the space for smarter systems built around its new limits.
Conclusion
I accomplished my original goal of building a Nano-Banana–style photobooth that runs entirely on-device. Along the way, I also ended up with a compact computer-vision playground (Figure 10) ready for whatever comes next.
Future Directions
From here, my options are boundless, but they include: automating the tedious prompt-engineering process with reinforcement learning techniques like DDPO, recreating some aspects of identity-preserving style transfer on-device, or fine-tuning my tiny Stable Diffusion model with LoRA adapters so it knows why Sabrina Carpenter is “Man’s Best Friend”. I now have a solid foundation and am free to explore what’s genuinely interesting.
The Joy of Building Small
A GPU cluster would brute-force most of the problems I hit, but edge constraints push me to invent better solutions. With tiny, local models, I skip cloud overhead, iterate faster, and avoid unwanted surprise bills. Starting at the bottom of the scale curve is liberating: any capability I unlock here will only get stronger as the model scales. And in the end, the setup stays small, but the possibilities don’t.
I’m excited to keep iterating on this work. If you want to dig into the details, the full implementation is on GitHub. Questions, feedback, or wild ideas? Drop a comment or reach out on LinkedIn. I always enjoy meeting people working at the edge of what’s possible — on-device or in the cloud.
