Shrinking the Impossible (Part 1): Optimizing Foundation Models for Edge Devices with MLC

Introduction
Ever since ChatGPT went mainstream, I’ve been captivated by the rapid advancements in large language models (LLMs). As a machine learning engineer, I’ve been eagerly awaiting my chance to experiment with these groundbreaking models. Yet, the reality of deploying and managing the required infrastructure — and its massive cost — always made me pause.
Project Inspiration
This June at the Google I/O Connect Event in Berlin, I realized my vision of working with server-free LLMs wasn’t as far-fetched as I’d thought. Google showcased Gemini Nano, a powerful LLM integrated directly into their latest Android devices. While it wasn’t accessible to developers yet, it was a glimpse of what might soon be possible.

Inspired by this progress, I set out to test the limits of edge LLMs. Could I, as a solo enthusiast, deploy an LLM on an edge device like my iPhone? To make the challenge even more intriguing, I decided to aim for a multi-modal LLM. After all, who wouldn’t want a private chatbot that understands your phone’s photo gallery and keeps your secrets safe?
In this 4-part blog series, I document my experiment to prove that you don’t need a sprawling server farm or a high-end workstation to dive into the latest AI technology. With a bit of machine learning knowledge, solid documentation, and plenty of determination, you can get started with state-of-the-art models without a painful cloud bill (looking at you, AWS).
Why Edge Foundation Models Matter
Foundation models have traditionally been too massive to run on edge devices like smartphones, IoT gadgets, or embedded systems. As a result, they’re typically hosted on centralized servers, which introduces several challenges:
Cost barriers. Deploying and serving large-scale models (think 10B+ parameters) in the cloud is prohibitively expensive, often costing millions in infrastructure and energy. This creates a significant barrier for students, hobbyists (like me), and smaller organizations looking to experiment with AI. By running models locally on edge devices, the need for expensive server infrastructure disappears, democratizing access to this cutting-edge technology.
Unreliable performance. Cloud-based inference depends on a steady internet connection to send data to servers and retrieve results. This back-and-forth can cause frustrating delays, especially in areas with poor connectivity. Edge models, which run directly on local devices, bypass these issues. They deliver faster responses and work well even without an internet connection.
Security concerns. Cloud-based systems require sending data to remote servers, which comes with inherent risks. For users, their personal chat data could be exposed in a security breach or misused without consent. Businesses, meanwhile, must navigate strict regulations like GDPR or HIPAA when transferring sensitive data off-device. By processing data locally, edge models eliminate these risks, ensuring that your personal information stays private.
In short, edge foundation models break down cost barriers, improve reliability, and address privacy concerns. They make AI more accessible for curious minds and businesses alike while offering end users more peace of mind.
Challenges of Running Foundation Models on Edge Devices
By now, you might be thinking, Edge foundation models sound amazing! Why isn’t everyone using them? Well, as with most things in life (and AI), there’s a catch. Running foundation models on edge devices isn’t exactly a walk in the park. Let me walk you through some of the challenges, starting with my own cautionary tale.
My Gemma Debacle
About six months ago, I got my hands on Google’s shiny new instruction-tuned Gemma 2B model. Gemma, for those unfamiliar, is the “baby” version of DeepMind’s Gemini family — a lightweight, open-weight LLM designed for resource-constrained environments.


Basically, Gemma 2B was designed for laptops, desktops, and modest cloud setups. It sounded perfect for my trusty MacBook Air M2 (8GB of RAM).
Spoiler alert: it wasn’t.
I excitedly set up Gemma and attempted to serve my first request. My MacBook? It practically waved a white flag and crashed halfway through.
Let’s do the math to see why this happened. The Gemma 2B model has (you guessed it) 2 billion parameters. Using the standard float32 data type, the model parameters alone would require $2B\text{ parameters} * \frac{4 \text{ bytes}}{\text{parameter}} = 8 \text{ billion bytes} = 8 \text{ GB of RAM}$.
But that’s not all my machine needs to handle:
- I need extra memory for activations (the intermediate calculations during inference).
- The operating system (in my case, macOS) also needs a hefty chunk of RAM to do its thing.
In short, my poor MacBook was way out of its depth. Even with more efficient data types like float16 or bfloat16 (which halve memory usage), the combined memory demands of the model, activations, and system processes were just too much. Now, imagine trying to squeeze this kind of workload onto a smartphone with even less RAM. You’d be lucky if your phone didn’t catch fire (kidding…mostly).
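To put rough numbers on this, here’s a quick back-of-the-envelope helper. It’s only a sketch that counts the weights themselves; activations, the KV cache (more on that later), and the operating system all add to the total.

```python
# Back-of-the-envelope memory estimate for a model's weights alone.
# Real usage is higher: activations, the KV cache, and the OS all add to it.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate RAM needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("float32", "float16", "int4"):
    print(f"Gemma 2B weights in {dtype}: ~{weight_memory_gb(2e9, dtype):.1f} GB")
# float32 -> ~8.0 GB, float16 -> ~4.0 GB, int4 -> ~1.0 GB
```

Even the float16 figure leaves little headroom on an 8GB machine once everything else is accounted for, which is exactly the wall I hit.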
The Memory Struggle Is Real
Edge devices are, by design, resource-constrained. They’re great for portability, but they aren’t built to handle the sheer memory and compute demands of large language models. Even lightweight models like Gemma, which aim to close this gap, can still overwhelm devices with limited RAM or processing power.
But don’t despair! Engineers and researchers are tackling these challenges head-on. By using model compression techniques like quantization, pruning, and distillation, they’ve managed to shrink memory and compute requirements significantly. Add to this a new wave of hardware optimization techniques, and edge deployment is more feasible than ever.
Now, innovative tools are building on these advancements to make edge LLMs not just possible, but practical for a wide range of devices. Curious about how these breakthroughs are unfolding in real-world applications? Let me introduce you to one powerful solution: the MLC LLM framework.
MLC LLM: A Quantum Leap in Deploying Edge Foundation Models
Fast forward to November 2024: I decided to try the same task as before, but with the Machine Learning Compiler (MLC) LLM Engine. This time, I deployed a pre-quantized version of the Gemma 2B model onto an edge device — specifically, an iOS app. The results blew me away.
For starters, I encountered zero performance issues on my MacBook Air. The pre-quantized and hardware-optimized Gemma model ran smoothly and efficiently, without any of the lag or crashes I had faced six months earlier.
But here’s where things really got exciting: the quality of the responses. They were practically indistinguishable from the likes of massive, cloud-based LLMs in an everyday conversation. Curious to see how well this mini-model handled other languages, I threw some Spanish and German at it. To my non-discerning eye, the results looked spot-on. (I’d love to hear what native speakers think, though.)


Now, you might be wondering: How did MLC manage to pull this off? Let’s take a step back and dive into the tech behind this feat.
How does MLC LLM work?
At a high level, the MLC LLM project makes it possible to embed smaller LLMs (under 10B parameters) on edge devices through a streamlined, three-step process:
- Quantize the LLM’s weights as the model is downloaded, which prevents the kind of out-of-memory crash I ran into earlier.
- Embed the quantized model with hardware-specific optimizations applied during the model compilation stage.
- Provide a simple, pre-built user interface to interact with your newly embedded foundation model.

MLC offers a user-friendly, open-source chat application for both Android and iOS. Alternatively, it exposes an OpenAI-compatible Python API, making it easy to integrate the optimized LLM into your existing projects.
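Here’s a minimal sketch of that Python route, based on MLC LLM’s documented OpenAI-style MLCEngine interface. The model identifier below is an assumption (one of the pre-quantized Gemma builds on the MLC AI HuggingFace account), so double-check the exact name against the current docs.

```python
# Minimal sketch of MLC LLM's OpenAI-compatible Python API.
# The model identifier is an assumption; check MLC AI's HuggingFace account
# for the current name of the pre-quantized Gemma build you want.
from mlc_llm import MLCEngine

model = "HF://mlc-ai/gemma-2b-it-q4f16_1-MLC"
engine = MLCEngine(model)  # downloads, caches, and compiles the model on first use

# Chat completion, streamed token by token, mirroring OpenAI's client interface.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "Explain edge LLMs in one sentence."}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content or "", end="", flush=True)

engine.terminate()
```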
Quantization
MLC LLM caches the quantized model weights and the compiled model library locally, which means you only need to download and quantize the model once. After that, the quantized model is ready to run on your device without requiring repeated downloads. This saves both time and bandwidth, making the process smoother and more efficient.
What is Quantization?
In simple terms, quantization is the process of reducing the precision of the numbers that represent a model’s parameters. The goal? Shrink the model’s memory footprint while keeping its performance as close to the original as possible. The real magic happens when you see the cost savings—quantization can cut your cloud compute bills by half, a quarter, or even a sixth, without any noticeable drop in performance. For massive models like LLMs, those savings can really add up.
Take the example of Yurts, a contractor for the U.S. government. They slashed their monthly cloud computing bill from USD 24,000 to USD 4,000 for a 70B-parameter LLM by using a quantization method called Activation-aware Weight Quantization (AWQ). Pretty impressive, right?
Quantization Methods
When it comes to quantizing a model, there are a few common methods, but the two main approaches are:
1. Post-Training Quantization (PTQ)
After a model is trained, you can apply quantization to reduce the bit-width of its weights. The best part? It’s quick, easy, and requires minimal changes to the original model, while still offering significant memory savings.
One common PTQ technique is grouped quantization, where the model’s weights are divided into small groups (for example, blocks of consecutive weights within a layer) and each group is quantized with its own scaling factor. This makes the process more tailored and efficient, and grouped schemes continue to evolve as a way to balance performance and memory efficiency.
Some weight groups are more sensitive to quantization errors and need higher precision (more bits) to maintain accuracy. Others can handle lower precision without a noticeable hit to performance.
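To make this concrete, here’s a toy NumPy sketch of symmetric 4-bit grouped quantization with one scale per group. It’s a deliberate simplification of what production schemes (like the q4f16_1 mode we’ll meet below) actually do; real implementations also handle zero-points, bit-packing, and fast dequantization kernels.

```python
import numpy as np

def quantize_grouped_int4(weights: np.ndarray, group_size: int = 32):
    """Toy symmetric 4-bit quantization: one scale per group of `group_size` weights."""
    w = weights.reshape(-1, group_size)                   # split weights into groups
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0   # map each group to the int range -7..7
    q = np.clip(np.round(w / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate float weights from the 4-bit codes and per-group scales."""
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, scales = quantize_grouped_int4(w)
print("max reconstruction error:", np.abs(dequantize(q, scales) - w).max())
```

Because every group gets its own scale, one outlier weight only hurts the precision of its own small group rather than the whole tensor.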
With the rise of foundation models, more specific implementations of grouped quantization have emerged. For an in-depth look, check out “The case for 4-bit precision: k-bit Inference Scaling Laws” and “LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models”.
More recently, techniques like Activation-aware Weight Quantization (AWQ) have taken this dynamic quantization approach even further. AWQ uses activation statistics to pinpoint the most important individual weights and ensures they aren’t over-quantized, allowing for better compression without sacrificing performance.
2. Quantization-Aware Training
This method goes a step further by training the model with lower precision in mind from the start. By optimizing the model for reduced precision during training, you often get better results than you would with post-training quantization. Essentially, it allows the model to “learn” how to perform well with less precision, resulting in better overall performance. However, as we focus on deploying pre-trained models, we won’t explore this method further.
Quantizing Transformers
When quantizing Transformers, it’s not just the weights that need attention—activations play a big role too. Activations are the intermediate values generated during the model’s forward pass as it processes the input data. In a Transformer, these are the values produced at each layer as it handles each token. Just like with weights, activations can also be compressed during quantization, which further reduces memory usage.
But memory management doesn’t end with weights and activations. For Transformers, there’s also the key-value (KV) cache — this stores the context of the input sequence as the model processes longer inputs. As the model processes longer and longer inputs, it needs more memory to store the increasing number of keys and values. To keep things efficient, MLC LLM provides additional memory optimization techniques, like sliding windows, which help manage memory usage even when dealing with longer sequences.
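To see why the KV cache matters, here’s a rough back-of-the-envelope estimate of its size. The architecture numbers are placeholders rather than Gemma’s exact configuration, and the formula ignores refinements like grouped-query attention and cache paging.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size: one key and one value vector per layer, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Placeholder architecture: 18 layers, 8 KV heads of dimension 256, float16 values.
for seq_len in (1_024, 8_192):
    gb = kv_cache_bytes(18, 8, 256, seq_len) / 1e9
    print(f"{seq_len:>5} tokens -> ~{gb:.2f} GB of KV cache")
# Grows linearly with sequence length, which is why sliding-window schemes cap it.
```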

Out of the Box MLC LLM Solutions
As you can probably guess, the MLC Engine only implements post-training quantization (since we have no control over an open-source LLM’s training process). In particular, MLC LLM implements the grouped quantization methods shown below.
Method Name | Weight Quantization | Activation Quantization | Version No. | Stable? |
---|---|---|---|---|
q0f16 | None | 16 bits | - | Yes |
q0f32 | None | 32 bits | - | Yes |
q3f16_1 | 3 bits | 16 bits | 1 | Yes |
q4f16_1 | 4 bits | 16 bits | 1 | Yes |
q4f32_1 | 4 bits | 32 bits | 1 | Yes |
q4f16_awq | 4 bits | 16 bits | 1 | No |
The MLC Engine also offers an AWQ implementation (called q4f16_awq), but it’s currently unstable, so use it at your own risk.
Of course, the folks behind MLC have already quantized most of the popular open-source LLMs. You can download these pre-quantized model weights from the official MLC AI HuggingFace account.
If you want to quantize a new model, then there’s a little more work involved. MLC currently supports quantization of these model types: baichuan, bert, chatglm3, cohere, eagle, gemma, gemma2, gpt2, gpt_bigcode, gpt_neox, internlm, internlm2, llama, llava, medusa, minicpm, mistral, mixtral, orion, phi, phi3, phi3v, qwen, qwen2, qwen2_moe, rwkv5, rwkv6, stable_lm, and starcoder2.
So, if you want to quantize one of these model types yourself, all you have to do is run a few simple commands.
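Here’s a rough sketch of that workflow, driving MLC LLM’s convert_weight and gen_config commands from a short Python script. The local model directory, output path, and --conv-template value are placeholders; check the MLC LLM documentation for the exact flags that match your model and version.

```python
# Sketch of MLC LLM's documented quantization workflow, driven via subprocess.
# All paths and the conversation-template name are placeholders for illustration.
import subprocess

MODEL_DIR = "./dist/models/gemma-2b-it"       # local HuggingFace checkout (placeholder)
OUT_DIR = "./dist/gemma-2b-it-q4f16_1-MLC"    # where the quantized artifacts will land

# 1. Convert and quantize the weights (4-bit weights, float16 activations).
subprocess.run(
    ["mlc_llm", "convert_weight", MODEL_DIR,
     "--quantization", "q4f16_1", "-o", OUT_DIR],
    check=True,
)

# 2. Generate the chat config that records the quantization scheme and chat template.
subprocess.run(
    ["mlc_llm", "gen_config", MODEL_DIR,
     "--quantization", "q4f16_1", "--conv-template", "gemma_instruction",
     "-o", OUT_DIR],
    check=True,
)
```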
Custom Solutions
If you want to quantize an unsupported model type, you’ll need to extend MLC LLM’s source code. This involves inferring your target model’s architecture from its source config.json file on HuggingFace and wrapping its original Python definition (e.g., from the transformers Python library) with MLC LLM’s wrappers. I ended up having to do this to support multi-modal functionality in the third blog post in this series.
Hardware Optimizations
Quantization is just one part of MLC’s bag of tricks. The other? Squeezing every last drop of performance out of your hardware through smart optimizations. See, your LLM might start as high-level Python code, but that code doesn’t interact directly with your device’s hardware. There’s a crucial middle step where MLC translates the model into something your CPU or GPU can actually understand—and it does this in the most efficient way possible.
Just-in-Time Model Compilation
Just-in-Time (JIT) model compilation is the secret sauce behind MLC’s stellar efficiency. Instead of pre-compiling everything in advance or running the model eagerly line-by-line, JIT optimizes your model right before it executes, ensuring it’s perfectly suited to your specific hardware.
JIT strikes a balance between two compiler approaches:
- Interpreted execution processes code step-by-step as it runs. This makes the code super flexible and easy to debug, but leaves no room for optimizations. In other words, it’s painfully slow.
- Ahead-of-Time (AOT) compilation pre-compiles everything into a fixed version before execution. This is much faster, but comes with a catch: we assume a one-size-fits-all solution. If the model encounters unexpected conditions or hardware variations, AOT’s rigid approach can leave performance on the table because it can’t take full advantage of the specific device running the code.
JIT avoids these pitfalls by waiting until runtime to optimize. It tailors the model’s code to your hardware and execution context just before runtime, ensuring maximum efficiency. Here’s how this process works:
- Tracing or scripting. First, the engine analyzes your model’s high-level code and maps out its computation graph and operations. Think of it as creating a blueprint for what the model will do.
- Optimization. Next, the engine gets to work refining that blueprint. It fuses operations, removes redundancies, and inlines functions, streamlining execution wherever possible. (It’s like an architect revising a blueprint for a more efficient construction process.)
- Low-level code generation. Once the optimizations are done, the graph is compiled into low-level machine code tailored to your specific hardware—whether that’s a CPU, GPU, or something fancier.
- Execution. Finally, the optimized code is executed, running faster and using less memory thanks to all the pre-launch optimizations.
MLC uses JIT model compilation to get the most out of your edge device’s limited resources. And the best part? This process is abstracted away into a few simple CLI commands.
MLC LLM Implementation
Deep neural networks are computationally demanding. Hence, most deep learning frameworks include built-in JIT compilation extensions. For example, Accelerated Linear Algebra (XLA), the backbone of JAX, offers cross-framework JIT support. Looking specifically at PyTorch, torch.compile provides a general-purpose solution that supports both training and inference.
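For a feel of how this looks in practice, here’s a minimal torch.compile sketch: the model is traced and optimized lazily on its first call, and later calls reuse the compiled artifact.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# torch.compile returns a wrapped module; compilation happens lazily on the
# first forward pass, specialized to the shapes and device it actually sees.
compiled_model = torch.compile(model)

x = torch.randn(8, 512)
with torch.no_grad():
    out = compiled_model(x)  # first call: trace, optimize, generate code, then run
    out = compiled_model(x)  # later calls reuse the compiled artifact
print(out.shape)             # torch.Size([8, 10])
```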
However, MLC takes it a step further by leveraging Apache’s Tensor Virtual Machine (TVM) for even deeper hardware-level optimizations.

TVM works by providing a Python API for tensor operations like matrix multiplications, sums, and type conversions. It also makes it a breeze to port models from PyTorch. Once we have the model in TVM, we can compile it down into optimized low-level code for the target hardware.
Here’s how exactly TVM supercharges model optimization:
- Operation fusion. TVM combines smaller operations (like element-wise additions or multiplications) into larger, more efficient ones. For example, instead of calculating ReLU(x) followed by Add(x, y), TVM can fuse them into a single, efficient kernel, saving memory and time (see the Python analogy after this list).
- Memory layout optimization. TVM fine-tunes memory access patterns to align with the hardware’s strengths. For example, GPUs perform better when accessing data in large, coalesced blocks, while CPUs benefit from loop optimizations that prevent cache misses.
- Kernel selection and tuning. A “kernel” is a specialized function designed to perform specific operations, like matrix multiplication. TVM either selects the best pre-tuned kernels or auto-tunes them for maximum performance on the target hardware.
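Here’s that ReLU-then-Add example as a loose Python analogy (not how TVM actually generates code, just an illustration of why fusion avoids materializing an intermediate tensor in memory):

```python
import numpy as np

def unfused(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    t = np.maximum(x, 0.0)   # kernel 1: ReLU writes a full intermediate array
    return t + y             # kernel 2: Add re-reads that intermediate from memory

def fused(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    out = np.empty_like(x)
    # One pass over the data: each element's ReLU result lives only in a local
    # variable, so no intermediate array is ever written out.
    for i in range(x.size):
        t = max(x.flat[i], 0.0)
        out.flat[i] = t + y.flat[i]
    return out

x, y = np.random.randn(4), np.random.randn(4)
assert np.allclose(unfused(x, y), fused(x, y))
```

In a real fused kernel the single loop is compiled to run at native speed; the point here is only that the intermediate ReLU result never becomes a full tensor in memory.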
These optimizations make it possible to (hypothetically) fit a 7B+ parameter model onto an iPhone. But of course, there’s a trade-off: the more optimizations we apply, the less flexible the model becomes. Debugging also gets trickier — any issues that arise are often low-level errors, especially when input sizes change.
Despite these challenges, the benefits far outweigh the costs. Without TVM, deploying models on edge devices would be much more difficult.
Conclusion
In the past six months, the AI research community has made groundbreaking strides in optimizing foundation models for edge devices. Back in June 2024, my personal machine crashed when I tried to run the Gemma 2B model locally — without quantization or hardware optimizations. But thanks to the rapid progress in this field, even solo enthusiasts like myself can now, as of November 2024, easily deploy the same model (or even larger ones) locally—without needing to become compiler engineers.
In this blog post, I’ve introduced the Machine Learning Compiler (MLC) as a powerful new tool to make this possible. I’ve also walked you through its inner workings and provided essential background knowledge to help you get started.
What’s next?
In my next blog post, we’ll dive into how we can extend the MLC Engine to support embedding LLMs that aren’t natively supported. After all, our goal is to deploy a multi-modal LLM on an iPhone.