Frames Per Second (Part 2): Quantization, Kernels, and the Path to On-Device Diffusion

Image credit: Nano Banana Pro


Introduction

After tackling text with my edge multi-modal LLM project last year, I’ve become fascinated with the image side of foundation models. The media frenzy ignited by Google DeepMind’s initial Nano Banana release acts as a testament to how image generation just hits differently than text.

This raises the ultimate question: can we deliver a Nano Banana-like experience on the edge? The answer isn’t simple. Diffusion models are brutally sensitive to numerical noise. In my last blog post, my naive port of Tiny SD failed spectacularly, yielding noisy, psychedelic outputs.

As a result, I’ve turned my attention to Apple’s CoreML Stable Diffusion model, betting its proprietary, hardware-aware design will save this project. How did Apple’s engineers successfully squeeze a 6GB model into a sub-2GB, sub-10-second iOS package while maintaining quality? Their secret lies in the careful orchestration of quantization, hardware co-design, and architectural compromises.

Figure 1. Prototype testing. A generated image from the text prompt "A beautiful landscape with mountains and a lake, golden hour lighting" using Apple’s CoreML Stable Diffusion model. The output is a high-quality, realistic rendering of a rustic mountain range at sunset.

This post dissects exactly how Apple pulled off that optimization miracle. By examining which layers got quantized, how Neural Engine constraints shaped the architecture, and where the remaining quality trade-offs live, we’ll be ready to build our edge conditional image generation experience in this series’ concluding post.

Apple’s Optimization Strategy

Let’s start with the numbers. Apple took Runway ML’s 6GB Stable Diffusion implementation, compressed it by over 70%, and deployed it on their proprietary Apple Neural Engine (ANE) hardware. The ANE is the third processor in Apple Silicon, sitting alongside the standard CPU and GPU. It’s a dedicated Neural Processing Unit (NPU) purpose-built for the high-throughput matrix operations that define neural network inference.

Figure 2. An image generated by Runway ML's 6GB Stable Diffusion model, which served as the base model for Apple's CoreML implementation (image credit).
Figure 3. Sample images from Apple's official CoreML Stable Diffusion demo. There's some quality degradation compared to Fig. 2, but most of the model's quality is preserved (image credit).

This hyper-specialized hardware enabled Apple to achieve blisteringly fast results: each denoising step takes only ~0.37–0.39 seconds, and high-quality images are generated within 20 steps (documented here). Of course, Apple didn’t use a single optimization technique to achieve these impressive runtimes; rather, they employed an entire optimization playbook:

  1. Model precision reduction. Apple applies a two-step precision reduction: (a) an initial quantization from float32 to float16 that halves memory consumption, followed by (b) aggressive palettization to 6-bit weights that yields another ~40-50% reduction. That’s how we go from a 6GB model to a 1.5GB one.

  2. Neural Engine optimization. CoreML’s ANE-specific compilation pipeline fuses common machine learning operators and optimizes tensor memory layouts for the ANE’s specialized compute units. Apple’s own benchmarks show that ANE-optimized models achieve up to 10× speedup with 14× reduction in peak memory consumption compared to unoptimized implementations on the iPhone 13.

  3. Model chunking. The pipeline is split into four independently loadable components (.mlmodelc files). iOS dynamically swaps these components in and out when the reduceMemory option is enabled, keeping peak memory usage below the 2GB limit. The trade-off is increased end-to-end latency due to this just-in-time loading overhead, as seen in their official benchmarks.

  4. Attention implementation. Apple uses a SPLIT_EINSUM attention variant. It breaks multi-head attention into explicit single-head functions and relies on einsum operations to avoid the reshape and transpose steps that trigger memory copies. Because the ANE excels at fixed 4D tensor layouts, keeping data in this shape is crucial. Combined with Apple’s other optimizations, this approach delivered up to 10× faster inference and 14× lower peak memory on their distilbert benchmark.

  5. Scheduler Optimization. The original model used the PNDM (Pseudo Numerical Methods for Diffusion Models) scheduler, which requires 50+ denoising steps. Apple swapped this for the modern DPMSolverMultistepScheduler, which uses estimates from recent steps to predict the denoising direction, allowing it to learn from recent history and make smarter jumps. This drastically reduces the required denoising steps from 50+ to 20-25 without sacrificing image quality.

Each optimization technique targets a different performance bottleneck, whether it be memory footprint, inference latency, or peak memory usage. Let’s now examine each technique in detail.

Model Compression

Apple engineers pulled off an impressive 70% size reduction through a two-phase compression strategy. They started with a straightforward precision reduction from float32 to float16 and followed up with palettization to jump from float16 to just 6-bit.

Phase 1: Quantization

Model parameters in standard PyTorch models are stored as 32-bit floats (float32), so the 1.3B-parameter Stable Diffusion consumes 5.2GB of RAM. In quantization, we lower the bit-precision used to store each weight, sacrificing some information for a smaller memory footprint.
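
To make the arithmetic explicit, the memory footprint is just the parameter count times the bytes per parameter:

$$ 1.3 \times 10^9 \times 4 \text{ bytes (float32)} \approx 5.2 \text{ GB}, \qquad 1.3 \times 10^9 \times 2 \text{ bytes (float16)} \approx 2.6 \text{ GB} $$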

The first step is to halve the precision to float16. CoreML achieved this using simple datatype casting (no algorithmic quantization, clustering, or lookup tables), which instantly halves memory with minimal accuracy loss. This simple technique works because float16 is considered the “safe” floor for neural networks. This precision level retains enough dynamic range (the span of representable values) and precision to accurately encode most model weights.
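
As a rough sketch of what that cast looks like in practice, here is how a toy PyTorch module could be converted to a float16 Core ML model with coremltools. TinyBlock and the output file name are placeholders for illustration, not pieces of Apple’s actual pipeline:

import torch
import coremltools as ct

# Stand-in for a real component like the U-Net or text encoder.
class TinyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(320, 320)

    def forward(self, x):
        return self.proj(x)

example = torch.rand(1, 77, 320)
traced = torch.jit.trace(TinyBlock().eval(), example)

# compute_precision=FLOAT16 stores the weights in half precision -- the simple
# datatype cast described above, with no clustering or lookup tables involved.
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=example.shape)],
    compute_precision=ct.precision.FLOAT16,
    convert_to="mlprogram",
)
mlmodel.save("tiny_block_fp16.mlpackage")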

This reduces memory consumption from 5.2GB to 2.6GB, which is a great start but still above my sub-2GB target. If we naively lower the precision any further, we risk catastrophic information loss and would get corrupted outputs like those from my Tiny SD port. In other words, we need a cleverer compression technique.

Phase 2: Palettization

Apple wanted to build a generalizable framework for compressing any Stable Diffusion checkpoint, including community fine-tunes. Hence, they needed a flexible approach that didn’t depend on access to the original training data. This immediately rules out state-of-the-art methods like AWQ or GPTQ, which require calibration data to analyze activations and identify the most salient (important) weights.

Palettization offers a perfect alternative. It achieves aggressive compression without calibration data by using simple, interpretable k-means clustering. The technique actually originates from color quantization in computer graphics, where millions of possible colors in an image are mapped to a small, fixed set of representative values.

Figure 4. An example of color quantization, where the original photograph's palette was reduced to seven distinct colors (image credit).

The same logic can be applied to model weights. Here’s how it works:

  1. Analyze the distribution of weights in each layer
  2. Cluster similar weights using k-means to create a “palette” of representative values (e.g., 64 values for 6-bit palettization, since $2^6 = 64$)
  3. Replace each original weight with an index pointing to its nearest palette entry
  4. Store the compact palette plus the many small indices instead of full-precision weights
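
Here is a minimal NumPy/scikit-learn sketch of that procedure. It only illustrates the algorithm on a random weight matrix; Apple’s actual implementation lives inside coremltools and operates on Core ML weight blobs rather than raw NumPy arrays:

import numpy as np
from sklearn.cluster import KMeans

def palettize(weights, nbits=6):
    """Cluster a layer's weights into 2**nbits representative values."""
    n_entries = 2 ** nbits                        # 64 palette entries for 6-bit
    flat = weights.reshape(-1, 1).astype(np.float32)
    km = KMeans(n_clusters=n_entries, n_init=10, random_state=0).fit(flat)
    palette = km.cluster_centers_.flatten()       # the lookup table
    indices = km.labels_.astype(np.uint8)         # per-weight palette indices (packed to 6 bits in practice)
    return palette, indices.reshape(weights.shape)

def depalettize(palette, indices):
    """Reconstruct approximate weights from the palette plus indices."""
    return palette[indices]

# Toy example: a random 256x256 "layer"
w = np.random.randn(256, 256).astype(np.float32)
palette, idx = palettize(w, nbits=6)
w_hat = depalettize(palette, idx)
print("palette size:", palette.size, "max abs error:", np.abs(w - w_hat).max())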

The bit-width determines how many palette entries you get, and thus your compression ratio. Table 1 summarizes the trade-offs in palette entry number selection.

| Bit Width | Palette Entries | Compression vs Float16 | Quality Impact |
|---|---|---|---|
| 8-bit | 256 | ~2× | Minimal quality loss |
| 6-bit | 64 | ~2.67× | Acceptable quality trade-off |
| 4-bit | 16 | ~4× | Noticeable degradation |
| 2-bit | 4 | ~8× | Severe quality issues |
Table 1. Palettization bit-width options and their associated trade-offs.

For our selected CoreML Stable Diffusion variant, Apple used 6-bit palettization to achieve a final ~1.5GB model size. For even larger models like SDXL (6.94 GB), Apple used mixed-bit palettization for stronger model compression. Here, we assign different bit-widths (1, 2, 4, 6, or 8 bits) to different layers based on a sensitivity analysis.
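
If I am reading the coremltools 7+ documentation correctly, applying 6-bit palettization to an already-converted float16 model looks roughly like the sketch below. The model paths are placeholders:

import coremltools as ct
import coremltools.optimize.coreml as cto

# Load a previously converted float16 model (placeholder path).
mlmodel = ct.models.MLModel("unet_fp16.mlpackage")

# 6-bit k-means palettization applied to every supported op's weights.
op_config = cto.OpPalettizerConfig(mode="kmeans", nbits=6)
config = cto.OptimizationConfig(global_config=op_config)

compressed = cto.palettize_weights(mlmodel, config)
compressed.save("unet_6bit.mlpackage")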

Neural Engine Optimization

Apple’s Neural Engine (ANE) debuted in 2017 inside the iPhone X’s A11 Bionic chip to power Face ID. The TrueDepth camera projects over 30,000 infrared dots to map your face, and processing that stream in real time was too slow and too power-hungry for the GPU, which pushed Apple to develop its own NPU.

That first-gen ANE delivered 0.6 teraflops of float16 compute. By 2018, with the release of the A12 chip, Apple opened the Neural Engine to developers through Core ML, and today it’s baked into every modern iOS device.

Figure 5. The evolution of the Apple Neural Engine from 2017 to 2021. The 16-core Neural Engine on the A15 Bionic chip (iPhone 13 Pro) has a peak throughput 26 times higher than its original counterpart (image credit).

But where does Core ML fit into all this? Since ANE is proprietary hardware, there’s no public API to program it directly. Its architecture, instruction set, and compiler are all trade secrets. With no official documentation on ANE-supported operations or optimization methods, most developer knowledge comes from trial-and-error and reverse engineering. Core ML is the only way iOS developers can access the Neural Engine.

It consists of two parts:

  1. coremltools is an open source Python package that converts models from frameworks like PyTorch and TensorFlow into Core ML’s optimized format
  2. The on-device Core ML framework that loads these compiled models and executes them.

When you convert a model with coremltools, it figures out which operations can run on the ANE versus the CPU or GPU, applies optimizations, and compiles the model into an efficient format. At runtime, Core ML then routes each operation to the right compute unit to maximize performance and minimize power use.

CoreML gives you three compute-unit configurations for running neural networks on device:

  1. CPU Only. This is the slowest but safest option. According to the ONNX Runtime documentation, CPU-only mode is mainly available for debugging and validation, since it avoids precision differences and guarantees predictable results. Community benchmarks suggest it runs approximately 7-8x slower than optimal configurations, making it impractical for real-time generation.

  2. CPU and GPU. This combination is capable but not recommended. GPUs were originally built for desktops with generous power budgets, so they’re workable but not ideal for running heavy models on mobile devices. It’s typically used on Macs with powerful GPUs or as a fallback for older devices without a Neural Engine.

  3. CPU and ANE. This is Apple’s recommended configuration for deploying intensive models on iPhones and iPads. The ANE was specifically designed for ML inference workloads and delivers comparable performance to the GPU at a fraction of the power consumption.

In our case, I configured coreml-stable-diffusion-v1-5-palettized to run primarily on the Neural Engine with CPU fallback for unsupported operations. This hybrid approach maximizes performance where it counts while maintaining graceful degradation for edge cases.
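
The compute units are chosen when the model is loaded. When sanity-checking a converted model from Python on a Mac, the same choice can be expressed through coremltools; a minimal sketch, with a placeholder file name:

import coremltools as ct

# Ask Core ML to schedule the model on the Neural Engine, with CPU fallback
# for any operations the ANE doesn't support.
model = ct.models.MLModel(
    "Unet.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)

# Other options: ct.ComputeUnit.CPU_ONLY, ct.ComputeUnit.CPU_AND_GPU, ct.ComputeUnit.ALL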

Model Chunking

iOS enforces stricter memory constraints than macOS. As noted in Apple’s CoreML optimization guide, the ANE’s specialized architecture comes with strict model size limits, and iOS imposes tighter per-file memory-mapping limits than macOS does. While the exact limit is undocumented, Apple’s compression targets for mobile deployment suggest it’s around 1GB. That means attempting to load our U-Net component, which is roughly 1.5GB in float16 precision, triggers a memory allocation failure on iPhone, even though it would run perfectly on a Mac.

This makes model chunking essential for mobile deployment. The idea is simple: we split huge weight files into smaller slices that fit within iOS’s memory limits, and let the runtime load each slice on demand. Apple’s ml-stable-diffusion repo handles this automatically with the --chunk-unet conversion flag, which divides the U-Net weights into multiple files that stay well under the limit. These chunks are stored in the .mlmodelc format, a pre-compiled, ANE-optimized layout that improves loading time.

The beauty of Apple’s setup is that developers never have to think about model chunking. Core ML handles this behind the scenes. While there’s a small cost to pulling in multiple files, we wouldn’t be able to run Stable Diffusion on iOS without this approach.

Attention Variants

Apple’s CoreML conversion tools offer two attention implementations that compute identical mathematical operations but differ critically in their kernel implementation.

1. The Original Attention Mechanism

This implementation uses the standard batched multi-head attention formula:

# Shape: [batch, seq_len, heads * head_dim]
Q, K, V = linear_projections(x)

# Reshape to [batch, heads, seq_len, head_dim]
Q = Q.reshape(batch, seq_len, heads, head_dim).transpose(1, 2)
K = K.reshape(batch, seq_len, heads, head_dim).transpose(1, 2)
V = V.reshape(batch, seq_len, heads, head_dim).transpose(1, 2)

# Batched matrix multiplication across all heads
# [batch, heads, seq_len, head_dim]
attention = softmax(Q @ K.transpose(-2, -1) / sqrt(head_dim), dim=-1) @ V

# Reshape back
output = attention.transpose(1, 2).reshape(batch, seq_len, heads * head_dim)

Figure 6. Pseudocode of the original attention mechanism.

This works well on CPUs and GPUs, which handle dynamic reshaping efficiently. It’s faster on Macs with powerful GPUs (M1 Pro/Max/Ultra), where memory bandwidth isn’t the primary bottleneck.

2. Split Einsum Attention

ANE penalizes non-contiguous memory access, which makes the reshape/transpose operations shown in Figure 6 computationally expensive. Fortunately, we can rewrite matrix multiplication as a series of Einstein summations (einsums), as shown in Equation 1, to better utilize the ANE. $$ C_{ik} = \sum_{j} A_{ij} B_{jk}, \quad \text{i.e. } C = AB \tag{1} $$
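
As a quick sanity check that Equation 1 is just a matrix multiply written index-wise, the two forms agree in PyTorch:

import torch

A = torch.randn(4, 8)
B = torch.randn(8, 3)

C_matmul = A @ B                              # standard matrix product
C_einsum = torch.einsum('ij,jk->ik', A, B)    # Equation 1 written as an einsum

assert torch.allclose(C_matmul, C_einsum)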

By keeping tensors in a fixed layout and expressing the computation as einsum operations, we avoid generating unnecessary memory copies. The implementation looks something like this:

# Shape: [batch, seq_len, heads * head_dim]
Q, K, V = linear_projections(x)

# Split into explicit per-head tensors (no reshape)
Q_heads = [Q[:, :, i*head_dim:(i+1)*head_dim] for i in range(heads)]
K_heads = [K[:, :, i*head_dim:(i+1)*head_dim] for i in range(heads)]
V_heads = [V[:, :, i*head_dim:(i+1)*head_dim] for i in range(heads)]

# Compute attention per head using einsum (preserves 3D tensor layout)
outputs = []
for Q_h, K_h, V_h in zip(Q_heads, K_heads, V_heads):
    # [batch, seq_len, seq_len]
    scores = torch.einsum('bqd,bkd->bqk', Q_h, K_h) / sqrt(head_dim)
    attn = torch.softmax(scores, dim=-1)
    out = torch.einsum('bqk,bkd->bqd', attn, V_h)  # [batch, seq_len, head_dim]
    outputs.append(out)

# Concatenate (cheap operation)
output = torch.cat(outputs, dim=-1)  # [batch, seq_len, heads * head_dim]

Figure 7. Pseudocode of the einsum attention variant.

The trade-off is lower parallelism, since we’re using explicit per-head loops rather than batched operations. This hurts GPU performance, but ANE performance is bottlenecked by memory bandwidth rather than raw compute, so the memory savings outweigh the cost of reduced parallelism.

Scheduler Optimization

Diffusion models turn random static into art by clearing away noise. The scheduler (also called a sampler or solver) is the control algorithm that orchestrates the denoising loop: it calls the U-Net at each timestep to predict the noise, then uses its mathematical update rule to move the image toward a cleaner state. In other words, the scheduler determines how many diffusion steps are needed for high-quality image generation. If we select a more efficient scheduler, we can improve inference time without degrading quality.

The original Stable Diffusion models used the PNDM (Pseudo Numerical Methods for Diffusion Models) scheduler, which applies a linear multi-step method:

$$ x_{t-1} = x_t + \sum_{i=0}^{k-1} \alpha_i \cdot \epsilon_\theta(x_{t-i}, t-i) \tag{2} $$

where $x_t$ is the current noisy image at timestep $t$, $\epsilon_\theta$ predicts what noise to remove, and $\alpha_i$ are coefficients that weight predictions from the past $k$ steps. As seen in Equation 2, PNDM treats each timestep as a discrete prediction problem, where each step uses local information (the last few predictions). In this context, larger jumps (more noise removal per step) risk error accumulation, which lowers image quality. PNDM tends to require ~50 diffusion steps to yield acceptable outputs.

The DPMSolverMultistepScheduler treats denoising as a continuous process rather than discrete jumps. Since noise is added gradually during training, it can be removed along a smooth, continuous path that can be written as an Ordinary Differential Equation (ODE):

$$ \frac{dx_t}{dt} = f(t) x_t + g(t) \epsilon_\theta(x_t, t) \tag{3} $$

This makes diffusion a continuous process and allows the DPM Solver to take larger, more informed steps through the denoising trajectory. As a result, 25 steps with DPM Solver produces quality comparable to 50 steps with PNDM, offering a 2× speedup.

This made the DPM Solver an obvious choice for Apple’s CoreML implementation, where every second of latency matters. The step count creates a direct quality-speed trade-off:

| Strategy | Steps | Runtime | Quality Impact | Use Case |
|---|---|---|---|---|
| Aggressive | 15-20 | 10-15 seconds | Noticeable artifacts, loss of fine details | Quick previews, concept iteration |
| Balanced | 20-30 | 15-25 seconds | High-quality results, minimal artifacts | Production deployment |
| Conservative | 50+ | 35+ seconds | Marginal improvement over 25 steps | Not worth the extra latency on mobile |

Table 2. Trade-off between diffusion steps and image quality when using the DPMSolverMultistepScheduler.

After extensive testing, I settled on 25 steps as my default to properly balance my need for quality and speed.
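
For reference, here’s roughly what the scheduler swap looks like in the Hugging Face diffusers API, which Apple’s Python conversion pipeline builds on. The model ID and device are assumptions; adjust for your setup:

import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# Load SD 1.5 and replace the default PNDM scheduler with the multistep DPM-Solver.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("mps")  # Apple Silicon; use "cuda" on an NVIDIA GPU

image = pipe(
    "A beautiful landscape with mountains and a lake, golden hour lighting",
    num_inference_steps=25,  # the 25-step default discussed above
).images[0]
image.save("landscape.png")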


Conclusion

Apple’s CoreML Stable Diffusion represents a masterclass in optimization engineering. With aggressive quantization, ANE-friendly attention kernels, and smart scheduling, Apple squeezed a 6GB model into a 1.5GB package that can generate an image in under 10 seconds on an iPhone. It’s a technical flex that’s hard to overstate.

But here’s an uncomfortable truth: optimization doesn’t expand capabilities. Apple solved how to run Stable Diffusion on mobile, not whether it’s good enough. Strip away the speedups, and we’re left with a 2022-era model:

  • Poor identity preservation. Try img2img at high denoising strength and watch faces dissolve into uncanny abstractions. The model simply can’t maintain a coherent identity while transforming the image.

  • Prompt adherence is weak. Compared to SDXL or Flux, SD 1.5 treats our carefully crafted prompt more like a vague suggestion than a clear set of instructions.

As a result, I can’t just port Apple’s approach.

What’s next?

My goal is conditional edge image generation with an explicit need for character consistency. If Apple’s optimizations give us the blueprint for mobile deployment, what architecture actually delivers my required capabilities?

That’s what I’ll cover in my next and final blogpost. My goal isn’t just to run fast; it’s to run fast and look good doing it.

Bella Nicholson
Machine Learning Engineer