What you need to know about multimodal language models

This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.

OpenAI has released GPT-4, the latest edition of its flagship large language model (LLM).

And though few details are available, what we do know is that it will be a “multimodal” LLM, according to a Microsoft executive who spoke at a company event last week.

Basically, multimodal LLMs combine text with other kinds of information, such as images, videos, audio, and other sensory data. Multimodality can solve some of the problems of the current generation of LLMs. Multimodal language models will also unlock new applications that were impossible with text-only models.

We don’t yet know how close multimodal LLMs will bring us to artificial general intelligence (as some have suggested). But what seems certain is that multimodal language models are becoming the next frontier of competition between tech giants battling for domination of the generative AI market.

The limits of text-only LLMs

At the heart of LLMs such as ChatGPT and GPT-3 is the transformer architecture, a type of deep neural network that is well suited to processing sequential data. Transformers are especially useful because they can be trained through unsupervised learning on very large datasets of unlabeled text. They’re also scalable, which means their performance improves as they grow larger. They even show emergent abilities at very large scales, accomplishing tasks that were not possible with smaller models.

The first generation of LLMs was mostly focused on textual data. This focus has had great benefits for applications such as assisting in writing articles and emails or even helping in writing software code. But it has also shed light on the limits of text-only LLMs.

While language is one of the important features of human intelligence, it is only the tip of the iceberg. Our cognitive abilities depend deeply on perception and abilities that we mostly use unconsciously, such as our past experiences and knowledge of how the world works. Even before we learn to speak, we learn about object persistence and gravity and develop a world model. We learn about animate and inanimate objects, agents, intents, and goals.

Language builds on top of those skills. It becomes a compressed means of passing on information and omits most of these abilities that we all share.

Text-only language models have been trained on nothing but text, which makes them inconsistent on tasks that require common sense and basic world knowledge. Expanding the training corpus sometimes helps, but it always leaves holes that crop up unexpectedly. This is where multimodality can help, at least to some extent.

What is a multimodal language model?

PaLM-E multimodal language model (source: Google)

A multimodal LLM is trained on several types of data. This helps the underlying transformer model to find and approximate the relations between different modalities. Research and experiments show that multimodal LLMs can avoid (or at least reduce) some of the problems found in text-only models.  

Not only do multimodal LLMs improve on language tasks, but they can also accomplish new tasks, such as describing images and videos or generating commands for robots.

Deep learning models are usually designed to receive only one type of input, such as images, tabular data, or text. So how can you mix multiple data types? To understand this, let’s briefly review the input mechanism of transformer models.

Transformers take “embeddings” as input. Embeddings are vectors that are numerical representations of the input data. When you provide a classic LLM with a string of text, a word embedding model transforms the text into multi-dimensional vectors. For example, the embeddings of Davinci, the largest GPT-3 model, have 12,288 dimensions. The dimensions of the vector approximate different semantic and grammatical features of the token.
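To make the idea concrete, here is a minimal sketch of an embedding lookup. The tiny vocabulary, the 4-dimensional vectors, and the random values are all illustrative assumptions; a real model learns its embedding table during training and uses far wider vectors (12,288 dimensions in the case of Davinci).

```python
import random

EMBED_DIM = 4  # toy size; chosen only to keep the example readable
vocab = {"a": 0, "cat": 1, "sat": 2, "<unk>": 3}  # hypothetical vocabulary

rng = random.Random(0)  # fixed seed so the toy table is reproducible
# One row of EMBED_DIM numbers per vocabulary entry.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in vocab
]


def embed(text):
    """Map whitespace-separated tokens to their embedding vectors."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]
    return [embedding_table[i] for i in ids]


vectors = embed("A cat sat")
print(len(vectors), "tokens,", len(vectors[0]), "dimensions each")
```

Real LLMs split text into subword tokens rather than whole words, but the lookup step works the same way: a token ID selects one row of the table.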

The transformer model uses attention layers to process the word embeddings it receives as input and determine how they relate to each other. It then uses this information to predict the next token in the sequence. That token can then be translated back to its original form.
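The sketch below illustrates that loop with a single attention head in plain Python. It is a simplification under stated assumptions: queries, keys, and values are the raw embeddings (real transformers apply learned projections and stack many layers), and the 2-dimensional table is made up for the example.

```python
import math


def softmax(xs):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def causal_self_attention(embeddings):
    """Single-head attention without learned projections (a simplification)."""
    d = len(embeddings[0])
    outputs = []
    for i, query in enumerate(embeddings):
        # Causal mask: position i may only look at positions 0..i.
        visible = embeddings[: i + 1]
        weights = softmax([dot(query, key) / math.sqrt(d) for key in visible])
        # Output is the attention-weighted average of the visible vectors.
        outputs.append(
            [sum(w * v[k] for w, v in zip(weights, visible)) for k in range(d)]
        )
    return outputs


def predict_next_token(embeddings, output_table):
    """Score every vocabulary row against the last position's output."""
    last = causal_self_attention(embeddings)[-1]
    probs = softmax([dot(last, row) for row in output_table])
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Hypothetical 2-dimensional embeddings for a 3-token vocabulary.
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_id, probs = predict_next_token([table[0], table[1]], table)
print("predicted token id:", token_id)
```

The predicted ID would then be mapped back to a token string through the vocabulary, which is the "translated back to its original form" step described above.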

Multimodal LLMs use modules that encode not just text but other types of data into the same encoding space. This enables the model to compute all kinds of data through a single mechanism.
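One common way to do this, sketched here under assumptions, is to project the output of an image encoder into vectors of the same width as the text embeddings. The feature values, the projection weights, and the dimensions below are invented for illustration; in a real system the encoder and the projection are learned.

```python
TEXT_DIM = 4  # hypothetical embedding width of the language model


def linear(features, weights):
    """Multiply a feature vector by a weight matrix (one row per output)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


# Hypothetical image-encoder output: two patches with 3 features each.
image_patches = [[0.2, 0.5, 0.1], [0.9, 0.3, 0.4]]

# Hypothetical projection from 3 image features to TEXT_DIM values.
projection = [
    [0.1, 0.0, 0.2],
    [0.0, 0.3, 0.0],
    [0.5, 0.1, 0.0],
    [0.0, 0.0, 0.4],
]

# Hypothetical text embeddings for the words before and after the image.
text_embeddings = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
image_embeddings = [linear(patch, projection) for patch in image_patches]

# One sequence: every element has TEXT_DIM values, whatever its origin.
sequence = text_embeddings[:1] + image_embeddings + text_embeddings[1:]
print([len(vector) for vector in sequence])
```

Because every element of the sequence has the same width, the attention layers can process image-derived and text-derived vectors without distinguishing between them.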

There are multiple ways to develop multimodal LLMs, and several research papers explore them. Two that recently caught my eye were Kosmos-1 by Microsoft and PaLM-E by Google. (There is also Visual ChatGPT, but I will probably address that in a separate post.) Interestingly, Google and Microsoft also seem to be at the forefront of the battle for LLMs and generative AI.


Kosmos-1

Kosmos-1 was introduced in a paper titled “Language Is Not All You Need: Aligning Perception with Language Models.”

Microsoft researchers describe Kosmos-1 as “a Multimodal Large Language Model (MLLM) that can perceive general modalities, follow instructions (i.e., zero-shot learning), and learn in context (i.e., few-shot learning). The goal is to align perception with LLMs, so that the models are able to see and talk.”

Kosmos-1 uses the transformer architecture as a general-purpose interface for processing multiple types of sequential data. The model uses a single embedding module to encode text and other modalities. The main LLM is a standard transformer decoder with some enhancements for training stability and long-context modeling. With 1.6 billion parameters, Kosmos-1 is much smaller than other LLMs and visual reasoning models.

Kosmos-1 by Microsoft

The researchers trained the model from scratch with different types of examples, including single-modal data (e.g., text), paired data (e.g., images and captions), and interleaved multimodal data (e.g., text documents interleaved with images). They trained the model on next-token prediction like other LLMs. They then tuned it with a special dataset that improves its instruction-following capabilities.
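For readers unfamiliar with the next-token prediction objective, here is a generic sketch of it, not the Kosmos-1 training code. The token IDs, the vocabulary size, and the `uniform_model` stand-in are assumptions made for the example.

```python
import math


def next_token_pairs(token_ids):
    """Pair each prefix of the sequence with the token that follows it."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


def average_loss(token_ids, predict_probs):
    """Mean negative log-likelihood of each true next token."""
    pairs = next_token_pairs(token_ids)
    losses = [-math.log(predict_probs(context)[target]) for context, target in pairs]
    return sum(losses) / len(losses)


VOCAB_SIZE = 5  # hypothetical vocabulary size


def uniform_model(context):
    """Stand-in for a real model: every token is equally likely."""
    return [1.0 / VOCAB_SIZE] * VOCAB_SIZE


token_ids = [0, 3, 1, 4]  # hypothetical token IDs
print(next_token_pairs(token_ids))
print(average_loss(token_ids, uniform_model))
```

Training lowers this loss by adjusting the model so that it assigns a higher probability to the token that actually comes next. An untrained model that guesses uniformly scores the natural logarithm of the vocabulary size.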

Microsoft researchers tested Kosmos-1 on several tasks including standard language understanding and generation, non-verbal IQ tests, image captioning, visual question answering, and image classification. Kosmos-1 showed remarkable improvement over other state-of-the-art models on several tasks, including processing text embedded in images, classifying images, and answering questions about the contents of web pages.

One of the interesting findings is non-verbal reasoning with Raven IQ tests, where the LLM predicts the next item in a sequence of images. These kinds of tasks require abstraction and reasoning capabilities. Randomly choosing answers results in 17 percent accuracy. Kosmos-1 reached 22 to 26 percent accuracy without seeing any Raven examples during training. This is a significant improvement, but still far below average human performance. “KOSMOS-1 demonstrates the potential of MLLMs to perform zero-shot nonverbal reasoning by aligning perception with language models,” the researchers write.
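A quick back-of-the-envelope check puts those numbers in perspective. The sketch assumes six candidate answers per puzzle, which is what a chance level of roughly 17 percent implies; the article itself does not state the number of options.

```python
def chance_accuracy(num_candidates):
    """Expected accuracy (in percent) when guessing uniformly at random."""
    return 100.0 / num_candidates


NUM_CANDIDATES = 6  # assumption: inferred from the 17 percent chance level
chance = chance_accuracy(NUM_CANDIDATES)

for reported in (22, 26):  # the accuracy range quoted above
    margin = reported - chance
    print(f"{reported}% accuracy is {margin:.1f} points above chance ({chance:.1f}%)")
```

In other words, the model beats random guessing by single-digit percentage points, which is why the result is encouraging but still far from human performance.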

Examples of tasks by Kosmos-1

Another important finding is the ability to transfer knowledge across modalities. Interestingly, when provided with interleaved multimodal input, Kosmos-1 uses cross-modal information to improve its responses.

Overall, Kosmos-1 shows that multimodality enables LLMs to achieve more with less, allowing smaller models to solve complicated tasks. In the future, the researchers will experiment with larger versions of Kosmos-1 and add other modalities such as speech. They will also test Kosmos-1 as an interface for other types of tasks, such as controlling text-to-image generation.


PaLM-E

PaLM-E, developed by researchers at Google and TU Berlin, is an “embodied multimodal language model.” The paper describes an embodied LLM as a model that directly incorporates “continuous inputs from sensor modalities of an embodied agent and thereby enable the language model itself to make more grounded inferences for sequential decision making in the real world.” For example, the model can integrate a robot’s sensor data to answer questions about the real world or to carry out natural language commands.

The model’s inputs are multimodal sentences that interleave text, visual data, and state estimation. The output of the model can be plain-text answers to questions or a sequence of textual decisions translatable into commands for a robot to execute. The model has been designed for embodied tasks such as robotic object manipulation and mobile robot task planning. However, it is also competent at non-embodied tasks such as visual question answering and normal language generation.
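A rough sketch of how such a multimodal sentence could be assembled is shown below. This is not PaLM-E's actual code: the `<name>` placeholder syntax, the `build_multimodal_sentence` helper, the toy text embedder, and the vectors are all hypothetical, chosen only to show text and sensor data sharing one input sequence.

```python
import re


def build_multimodal_sentence(prompt, text_embed, slots):
    """Replace <name> placeholders with encoder vectors; embed the rest as text."""
    sequence = []
    # The capturing group keeps the placeholders in the split result.
    for piece in re.split(r"(<[^<>]+>)", prompt):
        if not piece.strip():
            continue
        if piece.startswith("<") and piece.endswith(">"):
            sequence.extend(slots[piece[1:-1]])  # vectors from a sensor encoder
        else:
            sequence.extend(text_embed(token) for token in piece.split())
    return sequence


def toy_text_embed(token):
    """Hypothetical stand-in for a learned word embedding."""
    return [float(len(token)), 0.0]


# Hypothetical encoder outputs: one vector for the image, two for robot state.
slots = {
    "img_1": [[0.1, 0.9]],
    "state": [[0.5, 0.5], [0.2, 0.3]],
}

prompt = "Given <img_1> and <state> pick up the block"
sentence = build_multimodal_sentence(prompt, toy_text_embed, slots)
print(len(sentence), "vectors in the multimodal sentence")
```

The language model then receives one sequence of same-sized vectors and continues it with text, which a downstream controller can translate into robot commands.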

The researchers used the pre-trained PaLM model and combined it with models trained to encode data from different modalities into the embedding space of the main LLM. They tested the model across a variety of robot tasks, including task and motion planning. PaLM-E was able to accomplish novel tasks. “PaLM-E can generalize zero-shot to tasks involving novel object pairs and to tasks involving objects that were unseen in either the original robot dataset or the finetuning datasets,” the researchers write.

The model also showed promising results in mobile manipulation, where a robot must move around an environment and perform tasks such as picking up objects and carrying them to a destination. The researchers’ experiments also show that training the model on a mixture of tasks and embodiments improves performance on each individual task.

PaLM-E in action (source: Google)

One of the important benefits of PaLM-E is the transferability of knowledge. Thanks to its representative knowledge, PaLM-E was able to solve robotic tasks with very few training examples. Data efficiency is very important for robotics, where training data is scarce and hard to collect.

What are the limits of multimodal LLMs?

In humans, multimodality is deeply integrated into the body, perceptual capabilities, sensorimotor systems, and the entire nervous system. Our brain grows with the rest of our body. Language is built on top of a vast corpus of knowledge we obtain as children. In contrast, multimodal LLMs either try to learn language and perception at the same time or bring together pre-trained components. While this type of architecture and training regime can speed up the development of the model and lends itself to scaling, it can also produce incompatibilities with human intelligence, which can manifest themselves as bizarre behavior.

Multimodal LLMs are making progress on some of the important problems of current language models and deep learning systems. It remains to be seen whether they will solve the deeper problems in bridging the gap between AI and human intelligence.

