Apple's AI research division has introduced another large language model (LLM), called MM1. It is a multimodal LLM (MLLM), meaning it can process not only text but also images. According to a paper presented last week by a team led by Brandon McKinzie and Zhe Gan, the model scales up to 30 billion parameters. That would make MM1 significantly more compact than large models such as GPT-4 or Google Gemini, and therefore less power-hungry. Nevertheless, MM1 is intended to remain competitive thanks to careful pre-training.
SOTA despite its small size
According to the researchers, a careful mix of image-caption, interleaved image-text, and text-only data is crucial for large-scale multimodal pre-training that achieves “state-of-the-art (SOTA) results in multiple benchmarks.” As examples of MM1's capabilities, the Apple researchers show how the model infers the current temperature conditions at various locations from images of them, including an explanation of how it arrives at its conclusions. So-called multi-step reasoning across several images is also possible, in the form of a chain of thought.
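The idea of mixing data sources during pre-training can be illustrated with a minimal sampler sketch. Note that the mixing weights below are hypothetical placeholders for illustration, not the actual proportions Apple used for MM1:

```python
import random

# Hypothetical data sources and mixing weights for multimodal pre-training.
# The numbers are illustrative only; MM1's actual ratios are in the paper.
DATA_MIX = {
    "image_caption": 0.45,            # (image, caption) pairs
    "interleaved_image_text": 0.45,   # documents interleaving images and text
    "text_only": 0.10,                # plain text corpora
}

def sample_source(rng: random.Random) -> str:
    """Pick a data source according to the mixing weights."""
    sources = list(DATA_MIX)
    weights = [DATA_MIX[s] for s in sources]
    return rng.choices(sources, weights=weights, k=1)[0]

# Simulate which source each of 10,000 training batches would come from.
rng = random.Random(0)
counts = {s: 0 for s in DATA_MIX}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
```

In a real training pipeline, the sampled source name would decide which dataset the next batch is drawn from; tuning these weights is exactly the kind of design decision the paper argues should be documented.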
As part of the work on MM1, the team also investigated which components of the model architecture are particularly effective, ranging from seemingly simple factors such as the resolution of the training images to the complexity of the visual encoder MM1 uses. “We hope that the lessons we learned (in creating MM1) will help the AI community develop strong models that go beyond a single specific model architecture or data strategy,” write McKinzie and his co-authors.
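An ablation over components like image resolution and encoder capacity is typically organized as a configuration grid that is trained and evaluated exhaustively. A minimal sketch follows; the specific resolutions and encoder names are illustrative assumptions, not Apple's actual search space:

```python
from itertools import product

# Hypothetical ablation grid over MLLM design choices; the concrete
# values below are illustrative, not taken from the MM1 paper.
image_resolutions = [224, 336, 448]            # input image side length in pixels
vision_encoders = ["ViT-B", "ViT-L", "ViT-H"]  # visual encoders of increasing capacity

configs = [
    {"image_resolution": res, "vision_encoder": enc}
    for res, enc in product(image_resolutions, vision_encoders)
]
# 3 resolutions x 3 encoders = 9 configurations to train and compare.
```

Each configuration would then be trained under a fixed budget and scored on the same benchmarks, so that the effect of each component can be isolated.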
Apple researchers want more openness
In the paper, the Apple team also voices criticism of the field: “Most works on multimodal large language models, both open and closed, reveal next to nothing about the process they went through to arrive at their algorithmic design decisions.” This applies in particular to the important pre-training phase. “To advance research in this area, we believe it is essential to distill principles and lessons for building such models.”
Such principles, the team argues, go beyond the concrete implementation of individual components. “Therefore, in this paper we document the MLLM creation process and attempt to formulate design lessons that will hopefully be useful to the AI community.” Further information about MM1 can also be found in a separate article by our colleagues at The Decoder.
(bsc)