Shrinking the Impossible: Deploying My Own Multi-Modal Edge Foundation Model

Image credit: Generated using FLUX

Generative AI tooling has become a staple for everyday tasks — from creating presentation visuals to finding the right code syntax. But I’ve always felt a little too uneasy trusting cloud-hosted LLMs with my private chats.

As an ML engineer, I assumed the only alternative was spinning up an expensive private LLM in the cloud, until the Google I/O Connect event (June 2024). That’s where Google revealed their smallest LLM to date, Gemini Nano, running directly on the Pixel 8. Seeing an ultra-private AI assistant running on-device was inspiring, but for solo developers like me, the open-source tooling wasn’t quite ready yet.

Fast-forward six months, and the landscape has changed. Thanks to projects like the Machine Learning Compilation (MLC) framework, solo developers can now optimize and deploy powerful LLMs on edge devices. Rather than sticking to a unimodal LLM (a well-worn path), I set my sights higher: deploying a multi-modal LLM on the smallest edge devices possible.
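
To give a feel for what that looks like in practice, here is a minimal sketch of driving an MLC-compiled model from Python through MLC LLM’s OpenAI-style engine. The model identifier and quantization suffix are assumptions on my part; swap in whichever MLC-packaged weights you actually build.

```python
from mlc_llm import MLCEngine

# Assumed model string: MLC-packaged Gemma 2B instruct weights with 4-bit
# quantization, pulled from the mlc-ai Hugging Face organization.
MODEL = "HF://mlc-ai/gemma-2b-it-q4f16_1-MLC"

# MLCEngine loads the compiled model for the local device
# (CUDA, Metal, Vulkan, ...) and exposes an OpenAI-style chat API.
engine = MLCEngine(MODEL)

# Stream a chat completion token by token, entirely on-device.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "In one sentence, what is an edge LLM?"}],
    model=MODEL,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content or "", end="", flush=True)
print()

engine.terminate()
```

The same compiled artifacts can also be served from MLC LLM’s iOS, Android, and WebGPU runtimes, which is what makes laptop-and-smaller deployment practical.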

The end result? I successfully embedded the multilingual Gemma 2B and the multi-modal LLaVA-OneVision Qwen2 0.5B models on my laptop (among other devices). Take a look for yourself:

As shown, my embedded model can comfortably discuss the content of an image and give some interesting synopses. Pretty impressive, right? Here’s a look at how it all works:

At roughly 800M parameters, LLaVA-OneVision Qwen2 0.5B is ideal for edge deployment, but it struggles with complex user instructions. So I used Google’s ultra-efficient Gemma 2B model as a fallback.
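
The switching logic itself is simple. Below is a simplified, hypothetical sketch of the idea; the helper callables (`ask_llava`, `ask_gemma`) and the complexity heuristic are stand-ins, not the actual implementation from the repo.

```python
from typing import Callable, Optional


def route_prompt(
    prompt: str,
    image_url: Optional[str],
    ask_llava: Callable[[str, str], str],  # hypothetical: query LLaVA-OneVision Qwen2 0.5B
    ask_gemma: Callable[[str], str],       # hypothetical: query Gemma 2B
) -> str:
    """Send simple image questions to the tiny multi-modal model;
    fall back to Gemma 2B for complex or text-only instructions."""

    def looks_complex(text: str) -> bool:
        # Crude stand-in heuristic: long or multi-step requests go to Gemma.
        return len(text.split()) > 60 or "step by step" in text.lower()

    if image_url is not None and not looks_complex(prompt):
        return ask_llava(prompt, image_url)
    return ask_gemma(prompt)
```

In practice the heuristic could be anything from prompt length to a lightweight classifier; the point is that image-grounded questions stay with the 0.5B multi-modal model while heavier instructions fall back to Gemma 2B.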

For the full story, including technical details, check out my corresponding “Shrinking the Impossible” blog series:

The source code is also available for your viewing pleasure on GitHub. Consider it your guide to shrinking impossibly large LLMs down to something that fits inside your pocket.