Frames Per Second: Low-Latency Conditional Image Generation on a 2GB Memory Budget

Image credit: Nano Banana Pro

Running state-of-the-art computer vision entirely on-device isn’t easy — and that’s exactly why I had to try it. During last year’s adventure in embedding multimodal LLMs on edge devices, the visceral impact of images really stood out to me. Unlike sequential language, imagery is something the human brain processes almost instantly, which makes its effects uniquely potent.

In recent years, the gravitational pull of Large Language Models (LLMs) has dominated the AI space, thanks to the compounding force of attention and scaling laws. That focus began to shift when Google DeepMind released Nano Banana in August 2025, turning the painstaking process of photo editing into a simple text prompt. Three months later, Nano Banana Pro demonstrated a massive leap in generative computer vision by rendering perfectly legible text pixel by pixel. The implication? The power to instantly generate designer-quality infographics and slide decks.

Seeing that level of capability running smoothly in the cloud made me wonder: how powerful could a tiny conditional image-generation model be while still fitting into my iPhone (and not crashing it)?

Figure 1. My hyper-optimized three-stage pipeline uses Apple’s Core ML Stable Diffusion model for on-device conditional image generation. By restricting the heavy diffusion step to background regeneration, the system preserves identity consistency despite the tiny model’s quality constraints. The result is lightweight, fully on-device image generation averaging ~27 seconds end-to-end.
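
For readers who want a feel for how that diffusion stage can be driven, here is a minimal sketch using the Swift package from apple/ml-stable-diffusion. The resource path, prompt, and tuning values are illustrative, and initializer arguments vary slightly across package versions, so treat this as the shape of the approach rather than the app’s actual code.

```swift
import CoreML
import CoreGraphics
import StableDiffusion  // Swift package from apple/ml-stable-diffusion

/// Illustrative driver for the background-regeneration stage.
/// `resourcesURL` points at the compiled Core ML resources bundled with the app.
func regenerateBackground(resourcesURL: URL,
                          startingImage: CGImage,
                          prompt: String) throws -> CGImage? {
    // Prefer the Neural Engine so the GPU and CPU stay free for the rest of the app.
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndNeuralEngine

    // reduceMemory loads and unloads submodels around each stage,
    // which is what helps keep peak usage inside a ~2 GB budget.
    let pipeline = try StableDiffusionPipeline(resourcesAt: resourcesURL,
                                               configuration: config,
                                               reduceMemory: true)
    try pipeline.loadResources()

    // Image-to-image: a moderate strength keeps the overall composition
    // while letting the model repaint the scene around the subject.
    var generation = StableDiffusionPipeline.Configuration(prompt: prompt)
    generation.startingImage = startingImage
    generation.strength = 0.7
    generation.stepCount = 20        // fewer steps = lower latency on-device
    generation.guidanceScale = 7.5
    generation.seed = 42             // fixed seed for reproducible benchmarks

    let images = try pipeline.generateImages(configuration: generation) { _ in true }
    return images.first ?? nil
}
```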

That curiosity quickly launched this hands-on experiment. I challenged myself to run high-capacity conditional image generation entirely on my iPhone’s hardware. Unfortunately, generative quality often breaks down at smaller scales: a tiny model is more likely to turn you into a distant cousin than your doppelgänger. To solve this, I engineered a three-stage workflow that preserves subject identity while regenerating complex backgrounds and enforcing visual consistency (with some fun filters included). I initially benchmarked Segmind’s Tiny SD model (~1 GB) for its small memory footprint, but it couldn’t generate high-quality outputs under my strict timing and hardware constraints.
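
To make the identity-preservation idea concrete, the sketch below shows one way to isolate the subject on-device with Apple’s Vision person-segmentation request and composite them back over the regenerated background using Core Image. The function and its parameters are my own illustration under those assumptions, not an excerpt from the app.

```swift
import Vision
import CoreVideo
import CoreImage
import CoreImage.CIFilterBuiltins

/// Illustrative identity-preservation step: segment the person in the original
/// photo, then composite them over the freshly generated background so the
/// tiny diffusion model never has to reproduce the subject's face.
func compositeSubject(original: CIImage, generatedBackground: CIImage) throws -> CIImage {
    // 1. Ask Vision for a person-segmentation mask (iOS 15+).
    let request = VNGeneratePersonSegmentationRequest()
    request.qualityLevel = .accurate
    request.outputPixelFormat = kCVPixelFormatType_OneComponent8

    let handler = VNImageRequestHandler(ciImage: original)
    try handler.perform([request])

    guard let maskBuffer = request.results?.first?.pixelBuffer else {
        return generatedBackground  // no person found; fall back to the generated scene
    }

    // 2. Scale the mask up to the working resolution.
    var mask = CIImage(cvPixelBuffer: maskBuffer)
    let scaleX = original.extent.width / mask.extent.width
    let scaleY = original.extent.height / mask.extent.height
    mask = mask.transformed(by: CGAffineTransform(scaleX: scaleX, y: scaleY))

    // 3. Blend: keep the original pixels where the mask is white (the subject),
    //    and the diffused background everywhere else.
    let blend = CIFilter.blendWithMask()
    blend.inputImage = original
    blend.backgroundImage = generatedBackground
    blend.maskImage = mask
    return blend.outputImage ?? generatedBackground
}
```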

The result is a fully on-device image transformation playground that tests the best of open-source conditional image generation.

Figure 2. My custom iOS app. (a) The original image: Sabrina Carpenter performing at the 2024 Governors Ball in Queens, New York. (b) My stable diffusion pipeline transports her to the interior of Balboa Park in San Diego, California; one final filter applies a stylized aesthetic, transforming the result into an album-cover candidate.

For the full story, including technical details, check out my corresponding “Frames Per Second” blog series:

The complete source code, benchmarks, and project notes are available on GitHub. Sometimes it only takes a few frames per second to generate the image you want: no cloud, no fuss, no hassle.