Frames Per Second (Part 1): The Hunt for a Tiny, High-Quality Diffusion Model

Image credit: Nano Banana Pro

Introduction

I have a problem: I love testing out and applying the latest ML research, but I really dislike managing my own cloud infrastructure. That’s why I ended up embedding a multimodal LLM on various edge devices last year. However, generated text doesn’t deliver the same immediate, visceral impact that high-quality images do, compelling me to switch domains.

Unfortunately, deploying a state-of-the-art (SOTA) diffusion model on the edge is far harder than deploying its LLM counterpart. LLMs work in a discrete output space (i.e., tokens), so they tolerate the noise introduced by simple compression algorithms relatively well. In contrast, diffusion models operate on a continuous, dense latent space, where the same amount of noise degrades model performance far more severely. Attempting to shrink a 4GB model to fit the standard 2GB iOS memory budget is a brutal performance problem. For better or worse, this is my favorite type of problem to solve.

Problem Constraints

To keep things interesting, I decided to target deployment directly for my iPhone 16 (~8GB of RAM). If this model can run effectively on my phone, I’ll always have a tiny, powerful image generator right in my pocket. However, this choice immediately imposed a very strict iOS memory budget: apps face a dynamic limit of roughly 2-4 GB, and exceeding it triggers an EXC_RESOURCE RESOURCE_TYPE_MEMORY exception that terminates the app.

Of course, compressing the model is only half the battle. If it’s too slow, any reasonable user will just quit the app, rendering the entire point moot. Hence, I set a 60-second end-to-end limit for the pipeline, allotting 40 seconds for model inference.

This high bar for speed and precision demanded a pipeline built around conditional image generation (Image-to-Image or Img2Img). This is essential because it:

  1. Provides superior creative control and output fidelity.
  2. Elevates the ML-side challenge by requiring hands-on control of the model’s neural network sub-components.
  3. Delivers a more engaging user experience by actively transforming the source photo.

The final, non-negotiable rule was that the output images needed to be of reasonable quality; that is, the model must generate easily identifiable objects that align with the provided text prompt.

Beyond the Noise: Unpacking the Architecture of Diffusion Models

The path to modern image generation was surprisingly quick. OpenAI released DALL-E in January 2021, Stable Diffusion democratized the field in August 2022, and suddenly everyone had access to conditional image synthesis.

This new era of vision models is powered by diffusion, in which the model learns to systematically destroy images and then reverse the process.

Figure 1. A visualization of the diffusion process. Noise is progressively added to an image (forward process) or removed from it (reverse process) (image credit).

We start with the original image $x_0$ and progressively corrupt it by adding Gaussian noise over $T$ timesteps. At each step, we keep some fraction of the previous image and add fresh noise:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I)$$

where $\beta_t$ controls the noise intensity. This defines a Markov chain, a sequence of random states where each state $x_t$ depends only on the immediately previous state $x_{t-1}$, nothing earlier.

This Markov property ensures that while each image follows a different random path to noise, all images at timestep $t$ share identical noise statistics (the same signal-to-noise ratio). This predictable structure lets us train a neural network to predict the noise $\epsilon$ added at any timestep, which we can then subtract to reverse the corruption. Once trained, the model generates images by starting with pure noise and iteratively denoising over 20-50 steps, guided by a text prompt.

During training, we need noisy images at various timesteps to teach the model this denoising function. Stepping through $x_1 \rightarrow x_2 \rightarrow \cdots \rightarrow x_t$ sequentially for every training sample would be computationally infeasible. Fortunately, a key mathematical property of Markov chains with Gaussian transitions is that the entire sequence collapses into a closed-form solution:

$$ x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon $$

where $\epsilon \sim \mathcal{N}(0, I)$ is random noise and $\bar{\alpha}_t = \prod_{i=1}^{t}(1 - \beta_i)$ accumulates all the noise scaling factors up to time $t$. This reparameterization trick lets us jump directly to any timestep in one shot, making model training computationally feasible.
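To make the closed form concrete, here is a minimal PyTorch sketch of how it is used in training: jump straight to a random timestep, then ask the network to predict the injected noise. The linear schedule and the model’s call signature are illustrative assumptions, not Stable Diffusion’s exact configuration.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule (illustrative)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta_i)

def noisy_sample(x0: torch.Tensor, t: torch.Tensor):
    """Jump straight to x_t via x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)     # broadcast over (B, C, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps, eps

def training_loss(model, x0: torch.Tensor):
    """One DDPM-style training step: predict the noise injected at a random timestep."""
    t = torch.randint(0, T, (x0.shape[0],))
    x_t, eps = noisy_sample(x0, t)
    eps_pred = model(x_t, t)                    # model call signature is an assumption
    return F.mse_loss(eps_pred, eps)
```

In real Stable Diffusion training the schedule and noise predictor are, of course, the pretrained components described next, but the mechanics are the same.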

Model Architecture

The original diffusion paper introduced the denoising process, but it operated directly on pixels, making it computationally expensive. Stable Diffusion changed the game by running diffusion in a compressed latent space, which dramatically reduces computational costs while maintaining quality.

Figure 2. A visualization of how Stable Diffusion's three neural components work together to do text-to-image generation. In image-to-image generation, a vision encoder also maps the original image to a latent space representation (image credit).

Stable Diffusion is a composite of four neural networks (see the loading sketch after this list):

  1. Text encoder. We use a Contrastive Language-Image Pre-training (CLIP) network to convert text prompts into $77 \times 768$ embedding vectors. These embeddings semantically link language to visual concepts, allowing our model to “understand” our text inputs.
  2. Image encoder. We use a Variational Autoencoder (VAE) encoder to compress images from pixel space ($3 \times H \times W$) to a compact latent space ($4 \times \frac{H}{8} \times \frac{W}{8}$). This 8× downsampling along each spatial dimension (roughly a 48× reduction in the number of values) is the key efficiency innovation, since it lets us denoise compressed representations rather than full-resolution images.
  3. Image denoiser. The U-Net is the core component that implements the reverse diffusion process. It accepts noisy latents, the current timestep, and text embeddings (via cross-attention) to predict what noise to remove.
    • This is the model’s heaviest component, consuming roughly 85% of Stable Diffusion’s total memory footprint. The immense size is mandatory because the U-Net must model a universal, continuous denoising function spanning all timesteps and image content. This dense predictive complexity is precisely why the U-Net resists compression.
  4. Image decoder. The VAE decoder decompresses our latents back to high-resolution pixel-space images. This reconstruction is the final output that the model returns.
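To show how modular this architecture is in practice, here is a minimal sketch using Hugging Face’s diffusers library to pull the components out of an SD 1.x pipeline. The checkpoint ID and dtype are assumptions; any SD 1.x checkpoint exposes the same parts.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

text_encoder = pipe.text_encoder  # CLIP: prompt tokens -> 77 x 768 embeddings
unet = pipe.unet                  # noise predictor (the heavy component)
vae = pipe.vae                    # encoder/decoder between pixels and latents
tokenizer = pipe.tokenizer        # text -> token IDs for the CLIP encoder
scheduler = pipe.scheduler        # defines the timestep/noise schedule

# Rough parameter counts show where the memory budget actually goes.
for name, module in [("text_encoder", text_encoder), ("unet", unet), ("vae", vae)]:
    n_params = sum(p.numel() for p in module.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```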

Even with these efficiency gains, deploying Stable Diffusion on mobile devices requires further aggressive but non-destructive U-Net compression. This architecture supports two distinct generation modes, each with different performance characteristics.

Text-to-Image vs. Image-to-Image Generation

Stable Diffusion supports two generation modes, each starting from a different point in the noise spectrum:

  1. Text-to-Image (T2I) starts with pure random noise and relies solely on the text prompt to guide generation. This maximizes the model’s creative freedom, but means we get little control over the final image’s structure or composition.
  2. Image-to-Image (Img2Img) adds noise to an existing image’s latent representation, then denoises it while being guided by both the original image structure and a text prompt. This trades creative flexibility for precise control over image composition.
    • The strength parameter sets how aggressively the model transforms its input. At $0.0$, the model returns the original image untouched. At $1.0$, the input is completely replaced with noise, making the process effectively equivalent to Text-to-Image generation (the sketch below shows how strength maps to a starting timestep).
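Here is a rough sketch of that strength-to-timestep mapping. It mirrors the diffusers Img2Img implementation in spirit, though exact details vary by scheduler, and it assumes 0 < strength ≤ 1.

```python
import torch

def img2img_start(scheduler, init_latents: torch.Tensor, strength: float,
                  num_inference_steps: int):
    """Return the noised starting latents and the remaining denoising timesteps."""
    scheduler.set_timesteps(num_inference_steps)

    # Skip the first (1 - strength) fraction of the schedule: strength=1.0 starts
    # from (almost) pure noise like T2I; strength near 0 barely perturbs the input.
    init_steps = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_steps, 0)
    timesteps = scheduler.timesteps[t_start:]

    # Noise the input latents to the starting timestep, then denoise from there.
    noise = torch.randn_like(init_latents)
    noisy_latents = scheduler.add_noise(init_latents, noise, timesteps[:1])
    return noisy_latents, timesteps
```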

These two modes establish the foundational mechanics of image creation, but what happens when you scale that process to an unbelievable level of fidelity and control? That’s where Nano Banana Pro enters the frame.

Nano Banana: Today’s State of the Art

In August 2025, Google DeepMind released Nano Banana, a Gemini model that natively generates interleaved text and images. It excels at image editing thanks to its strong character consistency across different edits. And unlike most image-generation models, which benefit from long, detailed prompts, Nano Banana performs well with simple instructions.

Figure 3. An example of the viral "Nano Banana" trend, where users generated figurine representations of themselves and their favorite film characters, including James Bond (image credit).

These capabilities were striking enough to spark a viral and global social media trend, where users turned themselves into figurines (Figure 3).

Three months later, Nano Banana Pro arrived. It extended its predecessor’s multi-character consistency to handle scenes with 10+ people (Figure 4) and dramatically improved text rendering, enabling designer-level infographics to be generated in minutes.

Figure 4. Nano Banana Pro demonstrates an uncanny level of character consistency with its ability to merge 14 distinct, cute, and fuzzy characters into a cohesive scene (image credit).

Figure 5 shows one such example: “Best Chocolate Around the World: A Global Taste Odyssey,” which illustrates how cocoa is grown, processed, and enjoyed across regions.

Figure 5. Nano Banana Pro can generate beautifully illustrated, ultra-detailed infographics on demand. Consider this delicious example about where cocoa beans are grown and how they're turned into chocolate.

However, both models remain closed-source and proprietary, so we can only infer how they work. We know Nano Banana Pro performs some form of planning-style reasoning during image generation, because it can solve university-level physics and chemistry problems by generating neatly written, correct solutions directly onto a blank exam page. It is, frankly, impressively capable. And since it is built on Gemini, it’s also probably too large to run on edge devices, even with aggressive model compression and optimization.

Figure 6. Nano Banana Pro was able to successfully generate the correct solutions, including doodles, for university-level Physics and Chemistry exam questions (image credit).

This leaves us with a clear goal: replicate as much of this functionality as possible using open-source, on-device alternatives. Unfortunately, current mobile-friendly diffusion models more closely resemble 2021–2022 Stable Diffusion systems.


Since Nano Banana Pro’s advanced capabilities only emerge at massive scale, we have to accept two harsh realities:

  1. We have to rely on open-source, convertible models that often lag 6-12 months behind industry SOTA; and
  2. The model we choose won’t have the same magical coherence as its cloud-scale counterparts.

Simply put, I need to select a model that’s small enough to run on-device but still capable enough to produce usable results. We can translate our earlier problem constraints into the model specifications shown in Table 1.

| Requirement | Specification | Reasoning |
| --- | --- | --- |
| Size | Memory footprint <2 GB | iOS apps face strict memory limits (~2-4 GB); allocating ~2 GB primarily to the model helps prevent crashes |
| Performance | Inference <40 seconds | Fits within the 60-second end-to-end pipeline budget and leaves room for pre/post-processing |
| Hardware support | Apple Neural Engine (ANE) optimization required | Standard Metal GPU processing would be too slow; we need to leverage the iPhone's built-in AI accelerator |
| Methodology | Separate component access (CLIP, U-Net, VAE encoder/decoder) | Conditional image-to-image generation is inherently modular; components must be accessed separately for Img2Img tasks |
| Quality | Maintain human subject identity with high fidelity | Core product requirement: failure to maintain character consistency leads to a poor user experience |
Table 1. Model specifications for my edge conditional image generation application

Tiny SD: Starting as Small as Possible

To establish a minimum viable quality baseline, I targeted the smallest available contender: Segmind’s Tiny SD. As a 55% parameter reduction of Stable Diffusion, it is among the most aggressively compressed models from an established maintainer. Since it only consumed half of my tight 2GB iOS memory ceiling, it was the perfect, low-risk candidate to stress-test the absolute lower bound of acceptable quality and performance.

My next move was optimizing Tiny SD for speed. I used CoreML, Apple’s dedicated framework for integrating machine learning models into apps, to convert the weights into an Apple Neural Engine (ANE)-optimized format. The ANE is the dedicated hardware accelerator built into Apple Silicon, specifically designed to run on-device neural network inference with superior power efficiency. In other words, if this works, I’ll be able to conditionally generate images without draining my phone’s battery.
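As a rough illustration of the conversion step, here is what converting a single SD-style component (the VAE decoder, which becomes a suspect later) with coremltools might look like when targeting the ANE. The checkpoint, wrapper, and input shape are illustrative assumptions, not my exact pipeline.

```python
import torch
import coremltools as ct
from diffusers import AutoencoderKL

# Load just the VAE from an SD 1.x checkpoint (repo ID is an assumption).
vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
).eval()

class DecoderWrapper(torch.nn.Module):
    """Wrap AutoencoderKL.decode so tracing sees a plain tensor-in/tensor-out module."""
    def __init__(self, vae):
        super().__init__()
        self.vae = vae
    def forward(self, latent):
        return self.vae.decode(latent, return_dict=False)[0]

example_latent = torch.randn(1, 4, 64, 64)   # SD-style latents for 512x512 output
traced = torch.jit.trace(DecoderWrapper(vae).eval(), example_latent)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="latent", shape=example_latent.shape)],
    compute_precision=ct.precision.FLOAT16,    # the ANE executes FP16
    compute_units=ct.ComputeUnit.CPU_AND_NE,   # prefer the Neural Engine
    minimum_deployment_target=ct.target.iOS16,
)
mlmodel.save("VAEDecoder.mlpackage")
```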

In my initial test, I wanted to validate the model’s basic text-to-image (T2I) generation. To keep this assessment fair, I used the same text prompt Segmind provided in their model card: "Portrait of a pretty girl". Despite my best efforts to meet the 40-second deadline (via 25-30 diffusion steps), the model failed quality control. Instead of images, I was left with psychedelic noise and low-fidelity artifacts (Figure 7).

Figure 7. The samples illustrate Tiny SD's quality collapse. (a) Default prompt at the recommended guidance scale (7.5). (b) Enhanced prompt with aggressive guidance (11.0) to force better prompt adherence. Both outputs were generated within the 25-30 step limit and exhibit severe artifacts and image distortions.

Despite the artifacts, rough semantic alignment remains: Figure 7(a) shows a framed “portrait” of a woman and Figure 7(b) renders a woman with “flowing hair.” Crucially, Figure 8 confirms the original Tiny SD model produces coherent, acceptable (if blurry) outputs. This performance gap strongly suggests our CoreML pipeline is sound, but the model weights are being corrupted during the conversion or loading process.

Figure 8. Official examples of Tiny SD outputs. These samples confirm the original Tiny SD is capable of generating coherent, if slightly blurry, portraits of people (e.g., center-left image), establishing an acceptable quality baseline prior to CoreML conversion (source).
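If you want to reproduce that baseline on a workstation, a quick sanity check with the unconverted checkpoint looks roughly like this (the Hugging Face repo ID and device choice are assumptions; the settings mirror my on-device test):

```python
import torch
from diffusers import StableDiffusionPipeline

# Segmind's published Tiny SD checkpoint, run through the standard diffusers pipeline.
pipe = StableDiffusionPipeline.from_pretrained("segmind/tiny-sd")
pipe = pipe.to("mps" if torch.backends.mps.is_available() else "cpu")

image = pipe(
    "Portrait of a pretty girl",   # same prompt as the on-device test
    num_inference_steps=30,        # matches the 25-30 step budget
    guidance_scale=7.5,            # the recommended guidance scale
).images[0]
image.save("tiny_sd_reference.png")
```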

So, where did the weights go wrong? The corruption must have originated from one of these three technical suspects:

  1. Inference step requirements. Distilled models often require higher step counts for convergence than their parent models, a detail missing from the Tiny SD documentation. Our current 25-30 steps may be too few.

  2. CoreML quantization precision loss. CoreML’s model packaging applies weight quantization (typically FP16 or mixed precision) that could compound errors in an already-distilled model, potentially degrading performance below acceptable limits.

  3. VAE decoder corruption during CoreML conversion. The VAE decoder is the model component most sensitive to weight corruption. It is a critical single point of failure because it performs the final, irreversible $64\times$ spatial upsampling. CoreML conversion might corrupt its transposed convolution weights, and (unlike the self-correcting U-Net) even the slightest VAE decoder corruption turns perfect latents into unusable outputs.

Pinpointing the exact cause of corruption would require a costly series of controlled ablation studies on the teacher PyTorch model: testing step counts, comparing FP32 vs. FP16 precision, and measuring degradation at each CoreML conversion stage.
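For reference, the simplest of those ablations looks something like the sketch below: simulate FP16 weight storage on the PyTorch VAE decoder and measure the drift before CoreML is even involved. This only isolates weight precision; CoreML's actual conversion also changes activation precision and operator implementations, and the thresholds you'd accept are a judgment call.

```python
import copy
import torch
from diffusers import AutoencoderKL

# Checkpoint ID is an assumption; any SD 1.x VAE works the same way.
vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
).eval()

# Simulate FP16 weight storage while keeping compute in FP32,
# so the comparison isolates weight-precision loss.
vae_fp16w = copy.deepcopy(vae)
for p in vae_fp16w.parameters():
    p.data = p.data.half().float()

latent = torch.randn(1, 4, 64, 64)
with torch.no_grad():
    reference = vae.decode(latent, return_dict=False)[0]
    degraded = vae_fp16w.decode(latent, return_dict=False)[0]

err = (reference - degraded).abs()
print(f"max abs error:  {err.max().item():.5f}")
print(f"mean abs error: {err.mean().item():.5f}")
```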

However, this investigation isn’t needed. The test already establishes Tiny SD’s non-viability: I need CoreML/ANE compilation and a small number of diffusion steps to meet my strict sub-40-second latency budget, and Tiny SD can’t deliver under those constraints. It’s time to pivot to a model explicitly designed with Apple silicon in mind.

CoreML Stable Diffusion

The search for a viable replacement led me to Apple’s CoreML Stable Diffusion. This is a professionally tuned implementation of Runway ML’s stable-diffusion-v1-5 (1.3B parameters) that’s compressed into ~1.5GB.

What makes this model viable where Tiny SD collapsed? Its key advantage is co-design with Apple’s hardware team. This grants engineers access to proprietary optimizations—such as deep operator fusion and memory-layout tuning—to produce a calibrated FP16 model with peak ANE performance unavailable through generic conversions. Crucially, this implementation is also battle-tested for iOS deployment, eliminating the risk of weight corruption I previously faced.

Figure 9. Prototype testing. A generated image from the text prompt "A beautiful landscape with mountains and a lake, golden hour lighting" using Apple’s CoreML Stable Diffusion model. The output is a high-quality, realistic rendering of a rustic mountain range at sunset.

The trade-off is simple: I accept a ~1.5GB footprint (still well within budget) for a solution that guarantees production quality. Sometimes the “smallest” solution isn’t the best one. Hardware-aware optimizations at a reasonable scale beat model over-compression.


Conclusion

In this post, we explored the tight constraints required to deliver a Nano Banana-like experience on the edge. Our initial exploration led us to select Apple’s CoreML Stable Diffusion model due to its aggressive hardware co-design.

What’s next?

Before we can attempt to replicate the Nano Banana experience on device, we first need to understand what problems Apple’s CoreML optimizations truly solved and which technical challenges remain. My next blog post covers exactly how Apple safely compressed a diffusion model that is so easy to corrupt.

Bella Nicholson
Machine Learning Engineer