Image credit: Nano Banana Pro

Running state-of-the-art computer vision entirely on-device isn’t easy — and that’s exactly why I had to try it. During last year’s adventure in embedding multimodal LLMs on edge devices, the visceral impact of images really stood out to me. Unlike sequential language, the human brain processes imagery almost instantly, making its effects uniquely potent.
In recent years, the gravitational pull of Large Language Models (LLMs) has dominated the AI space, thanks to the compounding force of attention and scaling laws. That focus began to shift when Google DeepMind released Nano Banana in August 2025 and turned the painstaking process of photo editing into a simple text prompt. Three months later, Nano Banana Pro demonstrated a massive leap in generative computer vision by rendering perfectly legible text pixel by pixel. The implication? The power to instantly generate designer-quality infographics and slide decks.
Seeing that level of capability running smoothly in the cloud made me wonder: how powerful could a tiny conditional image-generation model be while still fitting into my iPhone (and not crashing it)?

That curiosity quickly launched this hands-on experiment. I challenged myself to run high-capacity, conditional image generation entirely on my iPhone’s hardware. Unfortunately, generative quality often breaks down at smaller scales: a tiny model is more likely to turn you into a distant cousin than your doppelgänger. To solve this, I engineered a two-step workflow that preserves subject identity while regenerating complex backgrounds and enforcing visual consistency (with some fun filters included). I initially benchmarked Segmind’s Tiny (~1GB) Stable Diffusion model for its small memory footprint, but it couldn’t generate high-quality outputs under my strict timing and hardware constraints.
The result is a fully on-device image transformation playground that tests the best of open-source conditional image generation.
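To make the two-step workflow concrete, here is a minimal off-device sketch. It assumes Hugging Face diffusers, the rembg segmentation library, and an example inpainting checkpoint; none of these are necessarily the on-device stack, which is covered in the blog series below. Step one isolates the subject with a segmentation mask, and step two hands everything outside that mask to an inpainting model, so the background is regenerated while the subject’s pixels stay untouched.

```python
# A minimal off-device prototype of the two-step workflow:
#   1) segment the subject so its pixels are preserved,
#   2) inpaint everything else so the background is regenerated.
# Library choices (rembg, diffusers) and the model ID are illustrative
# assumptions, not the exact on-device pipeline.
import torch
from PIL import Image
from rembg import remove
from diffusers import StableDiffusionInpaintPipeline

# Step 1: cut out the subject and turn its alpha channel into a mask.
photo = Image.open("selfie.jpg").convert("RGB").resize((512, 512))
cutout = remove(photo)                      # RGBA image, transparent background
alpha = cutout.split()[-1]                  # subject alpha channel ("L" mode)
# Inpainting convention: white = repaint, black = keep, so invert the subject.
bg_mask = alpha.point(lambda a: 255 if a < 128 else 0)

# Step 2: regenerate only the masked (background) region from a prompt.
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

result = pipe(
    prompt="a neon-lit city street at night, cinematic lighting",
    image=photo,
    mask_image=bg_mask,
    num_inference_steps=30,
).images[0]
result.save("regenerated.png")              # subject intact, background replaced
```

The same mask-then-inpaint split is what keeps the subject recognizable on the phone, no matter how small the diffusion model doing the repainting is.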
For the full story, including technical details, check out my corresponding “Frames Per Second” blog series:
- Part 1: The Hunt for a Tiny, High-Quality Diffusion Model
- Part 2: Quantization, Kernels, and the Path to On-Device Diffusion
- Part 3: Turning a Tiny Diffusion Model into a Traveling Photobooth
The complete source code, benchmarks, and project notes are available on GitHub. Sometimes it only takes a few frames per second to generate the image you want: no cloud, no fuss, no hassle.