Shrinking the Impossible (Part 2): Teaching Chatbots to See with LLaVA, CLIP, and SigLIP

Table of Contents
- Introduction
- LLaVA: Chatbots That Can “See”
- Machine Learning Compiler Implementation
- Conclusion
Introduction
In my last blog post, I introduced you to the fascinating world of edge foundation models. I was dreaming big, imagining a foundation model that could “see” — well, at least understand the photos I share with it. Let’s be honest, I’ll probably need some help when I’m lost in a new city on my next vacation. A model that understands both text and images? Way more useful when I’m wandering around than a single-modal chatbot!
Recently, things have been getting pretty exciting in the world of multi-modal models. Beyond just text and images, ChatGPT can now “see, hear, and speak” — processing text, images, and audio. There’s also Gemini, Google’s latest powerhouse, which can process everything from text to images to audio — and even long movies, thanks to its multi-million token context window. Sounds pretty impressive, right? But here’s the catch: these models are so large and computationally demanding that it’s nearly impossible to run them on edge devices (like phones and laptops).
In this blog post, we’ll explore some of the latest advancements in small, efficient multi-modal models that can actually be deployed on edge devices. We’ll introduce the LLaVA model family, which combines vision encoders and text decoders to provide a general-purpose multi-modal chatbot.

We’ll also take a closer look at the architecture behind LLaVA — specifically the vision encoders and text decoders that make it work. This includes the popular CLIP and SigLIP training frameworks.
Understanding how these models work lays the groundwork for my next blog post.
LLaVA: Chatbots That Can “See”
The LLaVA model family is a collection of vision-language models that use pre-trained Vision Encoders to give pre-trained Large Language Models (LLMs) the ability to understand images. Why all the pre-training? Well, pre-trained models help keep training costs low while still leveraging the latest in foundation model technology.
This combination of pre-trained models has made LLaVA models increasingly popular for multi-modal tasks. These models — some named more creatively than others — are open-sourced in various sizes, ranging from 0.5B to 13B parameters.

Don’t forget that the Machine Learning Compiler (MLC) Engine (introduced in my last blog post) supports quantizing LLaVA models. In other words, the smallest LLaVA models are promising candidates for my edge multi-modal ambitions.
Visual Instruction Tuning of Text Decoders
Before we jump into how LLaVA works its magic, let’s take a quick look at how plain text-based LLMs operate. A typical Large Language Model (LLM) begins by breaking your input text into discrete units called tokens using a tokenizer. These tokens are then transformed into high-dimensional embeddings — essentially numerical representations the model can “understand”. The embeddings pass through multiple layers of self-attention mechanisms and feed-forward neural networks, which process the information and predict the next token in the sequence. This process continues iteratively until the model generates the desired output text.
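To make this pipeline concrete, here is a minimal sketch using the Hugging Face Transformers library. The model name is only an illustrative choice; any small causal language model would do.

```python
# A minimal sketch of the text-only pipeline: tokenize, embed, predict, decode.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-0.5B-Instruct"  # illustrative choice of a small LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 1) Tokenizer: text -> discrete token ids
inputs = tokenizer("Where should I go in Amsterdam?", return_tensors="pt")

# 2) The model embeds the tokens and passes them through its self-attention
#    and feed-forward layers, predicting one next token at a time.
output_ids = model.generate(**inputs, max_new_tokens=50)

# 3) Detokenize: token ids -> output text
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```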
But here’s the challenge: when it comes to multi-modal tasks, like combining images and text, your traditional text-based LLM hits a wall — it simply can’t “see”. To fix this, we bring in vision encoders. Vision encoders translate images into embeddings, a deep learning model’s “native language”. Afterwards, a text decoder translates these embeddings into an output text based on both the image and the text input. We align the feature space of the vision encoder with the LLM by adding a trainable linear projection layer on top of the vision embeddings. By fine-tuning the text decoder on a multi-modal dataset, the model learns to generate text that is relevant to the input image.
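As a rough illustration of this alignment step, here is a schematic PyTorch sketch (not LLaVA’s actual code); the embedding dimensions and patch count are assumptions made for the example.

```python
# Schematic sketch: project image features into the LLM's embedding space
# and splice them into the sequence of text token embeddings.
import torch
import torch.nn as nn

vision_dim = 1024   # assumed vision encoder output dimension
llm_dim = 2048      # assumed LLM embedding dimension

# Trainable linear projection that aligns the two feature spaces
projector = nn.Linear(vision_dim, llm_dim)

image_features = torch.randn(1, 576, vision_dim)  # e.g. 576 image patch embeddings
text_embeddings = torch.randn(1, 32, llm_dim)     # embedded prompt tokens

image_tokens = projector(image_features)
decoder_inputs = torch.cat([image_tokens, text_embeddings], dim=1)
print(decoder_inputs.shape)  # torch.Size([1, 608, 2048]) -> one sequence for the text decoder
```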

This approach is called Visual Instruction Tuning. If you’re familiar with Instruction Tuning, it’s the same idea with a multi-modal twist. In regular instruction tuning, you give the model a text instruction (“Summarize this paragraph”) and train it to produce the desired text output. Visual Instruction Tuning swaps that text instruction for an image. The goal? Train the model to generate text that describes or explains the image.
For example, LLaVA models are trained on datasets that pair images with captions or multi-modal Q&A examples. This forces them to become fluent in both visual and linguistic cues. In these fine-tuning setups, we commonly keep the vision encoder fixed and only fine-tune the text decoder and projection layer on the multi-modal dataset. This way, we can leverage the pre-trained vision encoder to understand images without the need for additional training data, while also benefiting from the power of the pre-trained LLM to generate human-like text.
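In code, this training setup boils down to freezing one module and leaving the others trainable. The sketch below assumes three stand-in submodules (`vision_encoder`, `projector`, `text_decoder`) rather than a specific LLaVA implementation.

```python
# Sketch of visual instruction tuning: freeze the vision encoder,
# fine-tune only the projection layer and the text decoder.
import torch

def configure_optimizer(vision_encoder, projector, text_decoder, lr=2e-5):
    for p in vision_encoder.parameters():
        p.requires_grad = False   # keep the pre-trained vision encoder fixed
    trainable = list(projector.parameters()) + list(text_decoder.parameters())
    for p in trainable:
        p.requires_grad = True    # adapt the projection and the LLM to multi-modal data
    return torch.optim.AdamW(trainable, lr=lr)
```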
Vision Encoders
Essentially, Visual Instruction Tuning swaps out text instructions for images and teaches the model to generate text based on what it “sees.” But there’s a big question here: how do we get an image (essentially a 2D grid of pixels) to become compatible with a token-consuming text-based model? While Convolutional Neural Networks (CNNs) have traditionally excelled at extracting features from images, Vision Transformers have shown promising results in processing images as sequences of tokens, similar to text.
Here’s how it works: we divide up an image into smaller, fixed-size patches rather than processing it all at once. Each patch is then flattened and mapped into a high-dimensional feature space — a numerical representation that captures the patch’s visual characteristics. To make sure the model doesn’t lose track of where these patches belong in the image, we add positional encodings. A Vision Transformer then consumes these location-annotated image patches, processing them through multiple layers of self-attention and feed-forward neural networks.
This process allows the model to understand the relationships between different parts of the image and distill its content into a sequence of semantically meaningful embeddings. This sequence of embeddings is particularly compatible with classical LLMs because it mirrors the token-like structure LLMs expect for text. Thus, the combination of Vision Transformers with Text Decoders gives us high-functioning multi-modal models.
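The patching step is easy to sketch in a few lines of PyTorch. The image size, patch size, and embedding dimension below are illustrative choices, not the settings of a specific model.

```python
# Sketch of the Vision Transformer input pipeline: split the image into
# fixed-size patches, embed each patch, and add positional encodings.
import torch
import torch.nn as nn

image_size, patch_size, embed_dim = 224, 16, 768
num_patches = (image_size // patch_size) ** 2   # 14 x 14 = 196 patches

# A strided convolution flattens each 16x16x3 patch and maps it to embed_dim
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
# Learnable positional encodings tell the model where each patch came from
pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

images = torch.randn(2, 3, image_size, image_size)        # a batch of 2 RGB images
patches = patch_embed(images).flatten(2).transpose(1, 2)  # (2, 196, 768)
vit_input = patches + pos_embed                            # ready for the self-attention layers
print(vit_input.shape)  # torch.Size([2, 196, 768])
```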
If you want to learn about Vision Transformers in more detail, I recommend taking a closer look at the University of Amsterdam’s Deep Learning Tutorials. For now, let’s focus on the training frameworks for vision encoders, as they are crucial for the performance of the final LLaVA model.
Training the Vision Encoder
The quality of the final multi-modal LLaVA model heavily depends on the quality of its vision encoder. This is why it is crucial to train the vision encoder on a diverse and large-scale dataset to ensure that it can understand a wide range of images. Two popular training frameworks for vision encoders are CLIP and SigLIP. We describe both of them in a bit more detail here, since we need this knowledge to properly deploy a LLaVA model on an edge device.
CLIP
Contrastive Language-Image Pre-training (CLIP) is a pre-training framework that teaches the vision encoder to understand images by associating them with text descriptions. Here’s the gist: CLIP trains the model to predict which text description belongs to a given image, and vice versa, which image belongs to a given text description. By doing this, the model essentially learns a shared “language” that lets it understand both images and text.
CLIP has been shown to achieve state-of-the-art performance on a wide range of vision tasks, making it a popular choice for vision encoders in multi-modal models, in particular due to its alignment of the vision and text feature spaces.
A key aspect of CLIP is its use of contrastive learning, where the model is trained to maximize the similarity between positive pairs (image-text pairs that belong together) and minimize the similarity between negative pairs (image-text pairs that do not belong together).
For example, the vision encoder learns to pull together positive pairs (like a picture of a dog with the caption “a dog”) and push apart negative pairs (like a picture of a dog with the caption “a cat”).
Similarity is measured by taking dot products between the image and text representations and normalizing them with a softmax over the batch. This allows the model to learn a discriminative representation space that captures the semantic content of images and text.
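For intuition, here is a compact sketch of this contrastive objective. It follows the standard formulation with a fixed temperature; the actual CLIP implementation learns the temperature and includes further details omitted here.

```python
# Sketch of the CLIP loss: a batch-level softmax over image-text similarities,
# applied once across texts (per image) and once across images (per text).
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)   # cosine similarity via dot products
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0))           # positives sit on the diagonal

    loss_per_image = F.cross_entropy(logits, targets)     # softmax over all texts
    loss_per_text = F.cross_entropy(logits.t(), targets)  # softmax over all images
    return (loss_per_image + loss_per_text) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))  # a toy batch of 8 pairs
```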
While this approach is intuitive, it doesn’t scale well. For large representation spaces, we need large batch sizes to effectively cover a wide variety of (negative) image-text pairs and improve the training process. This raises challenges in CLIP, in particular with the softmax. To calculate the CLIP loss, we need to apply the batch-level softmax twice: once to normalize the pairwise similarity scores across all images for training the text encoder, and once across all texts for training the image encoder. These passes over the full batch can be computationally expensive, especially when the batch is sharded across multiple devices.
In a distributed training setup, we often use data parallelism, where each device processes a part of the batch. Usually, each device can run its forward and backward pass independently and only the final gradients are communicated; CLIP, however, already has to exchange the softmax statistics for the loss across all devices. This creates an efficiency bottleneck, especially when we scale to many devices. This bottleneck prevents CLIP from being scaled efficiently, and it’s the exact problem SigLIP solves.
SigLIP
The Sigmoid Loss for Language Image Pre-Training (SigLIP) replaces the softmax in CLIP with a sigmoid loss, which is applied element-wise to the dot products between the image and text representations. The loss objective is then closer to standard predictive learning with binary cross entropy, training the positive pairs to be $1$ while negative ones are pushed closer to $0$. This allows the model to train on a large batch size without the need for full-batch softmax computation, making it more efficient and scalable.
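The corresponding sigmoid objective can be sketched as follows. The temperature and bias are simplified to fixed values here, whereas SigLIP learns them during training.

```python
# Sketch of the SigLIP loss: element-wise binary cross-entropy on the same
# similarity matrix, with no softmax over the batch.
import torch
import torch.nn.functional as F

def siglip_loss(image_emb, text_emb, temperature=10.0, bias=-10.0):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() * temperature + bias
    labels = torch.eye(logits.size(0))   # positives (diagonal) -> 1, negatives -> 0
    return F.binary_cross_entropy_with_logits(logits, labels)

loss = siglip_loss(torch.randn(8, 512), torch.randn(8, 512))
```

Because every image-text pair contributes an independent binary term, the batch can be sharded across devices without synchronizing any normalization statistics.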
SigLIP has been shown to achieve similar or even better performance than CLIP, in particular for smaller datasets of image-caption pairs. Furthermore, SigLIP is more computationally efficient at scale, making it a popular choice for training strong vision encoders in multi-modal models.
Inference
Once the vision encoder is trained, we can use it to generate embeddings for images. However, both CLIP and SigLIP train single-vector representations: each image is summarized by one embedding vector. This is not ideal for multi-modal models, as we want to generate a sequence of embeddings that can be fed into the text decoder. Furthermore, a single-vector representation may not capture the full content of the image.
To address this, we can use the internal representations of the Vision Encoder, which is commonly a Vision Transformer, to extract a sequence of embeddings for the image. This allows us to capture a more detailed representation of the image, which can be used by the text decoder to generate more accurate and detailed text descriptions. This is a key aspect of the LLaVA model family, which combines the power of Vision Transformers with Large Language Models to generate human-like text based on images.
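With Hugging Face Transformers, extracting such a sequence of patch embeddings from a CLIP-trained vision encoder looks roughly like the sketch below; the checkpoint name and image path are just examples, and LLaVA models add further details (such as selecting a specific hidden layer) that are skipped here.

```python
# Sketch: extract a sequence of patch embeddings (not just one pooled vector)
# from a CLIP vision encoder.
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

checkpoint = "openai/clip-vit-large-patch14"   # example checkpoint
processor = CLIPImageProcessor.from_pretrained(checkpoint)
vision_encoder = CLIPVisionModel.from_pretrained(checkpoint)

image = Image.open("amsterdam.jpg")            # any local image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

outputs = vision_encoder(pixel_values)
patch_embeddings = outputs.last_hidden_state   # (1, num_patches + 1, hidden_dim)
pooled_embedding = outputs.pooler_output       # the single-vector summary
# LLaVA-style models project `patch_embeddings` and feed them to the text decoder.
```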
In the next blog post, we will see that there are actually several variations on how a Vision Transformer can be implemented in the CLIP and SigLIP frameworks (e.g. using a separate class embedding, pooling, etc.). These variations can have a significant impact on the performance of the final LLaVA model and require adjusting the model architecture accordingly. Hence, it is important to carefully consider them when implementing your own LLaVA model.
Machine Learning Compiler Implementation
Now that we are familiar with the LLaVA model family and the vision encoders used in these models, let’s discuss how we can deploy these models on edge devices. As we have introduced in Part 1 of this blog post series, we use the Machine Learning Compiler (MLC) project for on-edge deployment. MLC aims to optimize and compile deep learning models for edge devices, enabling efficient and fast inference on resource-constrained hardware. MLC supports a wide range of deep learning models, including LLaVA models, making it a powerful tool for deploying multi-modal models on edge devices.
The LLaVA Model Family on MLC
Out of the box, MLC LLM supports quantization of the LLaVA family of vision encoders and text decoders. For this, the LLaVA implementation from the Hugging Face Transformers library has been integrated into the MLC framework using the TVM stack. At the time of writing (November 2024), the default supported text decoders are Llama and Mistral, paired with a CLIP-trained vision encoder. However, these models are often around 7B parameters and larger, making them unsuitable for deployment on small edge devices (and even my M2 MacBook Air). This is why we need to consider smaller models, which are more suitable for our travel recommendation chatbot.
LLaVA OneVision
One of the smallest LLaVA models is the LLaVA OneVision Qwen2 0.5B, which, as the name suggests, uses the Qwen2 language model with 0.5B parameters.

The LLaVA OneVision Qwen2 0.5B model is particularly suitable for deployment on edge devices, as it is small and lightweight. While we can’t expect the same performance that larger models offer, this small LLaVA OneVision model is a good starting point for our travel recommendation chatbot and allows a fast iteration cycle for model development.
In contrast to the original LLaVA models, the LLaVA OneVision model uses a SigLIP-pretrained vision encoder, which has been shown to yield stronger multi-modal LLM performance among open vision encoders. While, in theory, there are no differences between SigLIP- and CLIP-trained encoders during inference, their Hugging Face Transformers implementations differ slightly. This is why we need to first port the LLaVA OneVision model and its SigLIP vision encoder to the MLC framework. Once again, I’ll walk you through how to do this in the next blog post.
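Before porting, it is useful to inspect what the checkpoint actually contains. The sketch below assumes the Hugging Face checkpoint id `llava-hf/llava-onevision-qwen2-0.5b-ov-hf` and a recent Transformers version that includes the LLaVA OneVision implementation; double-check both on the Hub.

```python
# Sketch: inspect the LLaVA OneVision 0.5B checkpoint to confirm that it
# combines a SigLIP vision encoder with a Qwen2 text decoder.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
print(config.vision_config.model_type)  # expected: "siglip_vision_model"
print(config.text_config.model_type)    # expected: "qwen2"
```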
Conclusion
Multi-modal foundation models allow us to apply state-of-the-art models to new and more complex applications. We’ve introduced how we can use a pre-trained vision encoder and text decoder architecture to build a general-purpose vision-language chatbot. The Machine Learning Compiler Project currently supports deploying LLaVA vision-to-text models on edge devices. However, due to our resource-constrained local device, we will use the smallest LLaVA OneVision model with only 0.5B parameters, which needs to be manually ported to the MLC framework. Thus, we’ve taken some time to understand the differences between LLaVA model implementations and what they mean at training and inference time.
What’s next?
This blog post provides you with the background knowledge needed to deploy a multi-modal vision-to-text model on edge devices. In the next blog post of this series, I’ll walk you through how to put this knowledge into practice.