Research Index

Research Decoded/Gemini Team, Google (2023)

Gemini: Native Multimodal Models

Team, G., et al. (2023). Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.


In late 2023, Google introduced 'Gemini,' a family of models designed from the ground up to be 'natively multimodal.' While previous 'multimodal' models often consisted of separate vision and language components that were bolted together after training, Gemini was trained simultaneously across text, images, audio, video, and code. This allowed the model to reason across different types of information with a fluidity that mimics human perception. It was a shift from modular multimodality to a single, integrated architecture that treats all data types as first-class citizens.

The Native Multimodal Backbone

The fundamental technical shift in Gemini was the move away from 'modular' multimodality—where a separate vision encoder is grafted onto a language model—toward a natively multimodal architecture. Instead of converting images or audio into text descriptions first, Gemini uses a single, massive Transformer-based decoder that processes all data types as sequences of tokens within the same latent space. Images are tiled and encoded into visual tokens, while audio and video are sampled and mapped into the same sequence as text. This allows the model to perform cross-modal self-attention, where a single text token can attend directly to a specific visual region or a subtle audio cue. It revealed that true multimodal intelligence is an emergent property of a single, unified reasoning engine rather than a collection of specialized modules.
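The mechanics of cross-modal self-attention can be illustrated with a toy sketch. This is not Gemini's implementation—the dimensions, projections, and per-modality embeddings below are all illustrative assumptions—but it shows the core idea: once every modality is projected into the same latent space and concatenated into one sequence, nothing architecturally separates a text position from an image or audio position in the attention computation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # shared model width (illustrative)

# Hypothetical per-modality embeddings: each modality is projected into
# the SAME latent space, then concatenated into one flat token sequence.
text_tokens  = rng.normal(size=(4, d))   # e.g. a short text prompt
image_tokens = rng.normal(size=(6, d))   # tiled image patches
audio_tokens = rng.normal(size=(3, d))   # audio frame features

seq = np.concatenate([image_tokens, audio_tokens, text_tokens])  # (13, d)

def self_attention(x):
    """Single-head scaled dot-product self-attention over one sequence."""
    Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

out, attn = self_attention(seq)

print(attn.shape)          # (13, 13)
# The final text position's attention row spans image and audio positions:
print(attn[-1, :6].sum())  # attention mass the last text token puts on image tiles
```

Because the attention matrix covers all 13 positions uniformly, a text token attends to visual and audio tokens exactly as it would to other text—this is what makes the "single reasoning engine" framing concrete.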

The Tokenization of Reality

The key to this integration lies in Gemini's tokenization process. For visual data, the model does not rely on a fixed-resolution encoder; instead, it uses a variable-resolution approach that preserves the aspect ratio and fine-grained details of an image. Audio is sampled at 16kHz and converted into a sequence of tokens, while video is treated as a series of image frames interleaved with precise timestamps. This ensures that temporal dynamics and spatial relationships are preserved as first-class citizens in the model's context. By treating a glass breaking in a video and the word 'break' in a sentence as equivalent units of information, the model can reason across domains with a fluidity that was previously impossible. It suggested that the most effective architectures are those that can ingest the raw complexity of the world without forcing it through a linguistic bottleneck.
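The pipeline above can be sketched in a few lines. The tile size, frame length, and timestamp format here are illustrative assumptions, not Gemini's actual parameters; the point is the shape of the process: images are tiled without being squashed to a square, 16kHz audio is cut into short frames, and video frames are interleaved with timestamp markers so temporal order survives in a flat sequence.

```python
import numpy as np

PATCH = 32            # tile size (illustrative assumption)
SAMPLE_RATE = 16_000  # 16 kHz audio, as described in the text
FRAME_MS = 25         # audio frame length (illustrative assumption)

def tile_image(img):
    """Split an image into fixed-size tiles without resizing to a square,
    preserving the original aspect ratio and fine detail."""
    h, w = img.shape[:2]
    return [img[r:r+PATCH, c:c+PATCH]
            for r in range(0, h, PATCH)
            for c in range(0, w, PATCH)]

def frame_audio(waveform):
    """Cut a 16 kHz waveform into short frames; each frame becomes a token."""
    hop = SAMPLE_RATE * FRAME_MS // 1000  # 400 samples per 25 ms frame
    return [waveform[i:i+hop] for i in range(0, len(waveform) - hop + 1, hop)]

def video_to_sequence(frames, fps):
    """Interleave sampled frames with timestamp markers so temporal
    dynamics are preserved in the flat token sequence."""
    seq = []
    for i, frame in enumerate(frames):
        seq.append(("<ts>", round(i / fps, 2)))  # hypothetical timestamp token
        seq.extend(("<img_tile>", t.shape) for t in tile_image(frame))
    return seq

img = np.zeros((64, 96, 3))    # a 64x96 image -> 2x3 grid of tiles
audio = np.zeros(SAMPLE_RATE)  # one second of audio -> 40 frames
print(len(tile_image(img)), len(frame_audio(audio)))  # 6 40
```

In a real system each tile or frame would then be encoded into continuous embeddings rather than symbolic tuples, but the sequencing logic—spatial tiles plus timestamped frames in one stream—is the part the article describes.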

Scaling on TPUv4 and TPUv5e

The training of Gemini required a massive leap in infrastructure, utilizing Google's custom TPUv4 and TPUv5e accelerators across multiple data centers. To handle the scale of trillions of tokens and billions of parameters, the researchers implemented a combination of model, data, and pipeline parallelism, ensuring that the workload was distributed with minimal communication overhead. A critical technical challenge was the management of hardware reliability; at this scale, silent data corruption and chip failures are inevitable. Google developed automated recovery systems that could detect a failing unit and restore the training state to a healthy subset of the fleet within minutes. This level of automated fault tolerance allowed the training process to remain stable over months of continuous operation. It proved that the success of a foundation model is as much a feat of systems engineering as it is a breakthrough in machine learning logic.
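The recovery pattern can be reduced to a toy loop. This is a deliberately simplified sketch—real systems keep redundant in-memory replicas of model state and detect faults with hardware-level checks, none of which is modeled here—but it shows the essential contract: snapshot periodically, and on a fault, roll back to the last healthy snapshot instead of restarting from scratch.

```python
import copy
import random

def train_with_recovery(steps, checkpoint_every=100, fail_prob=0.01):
    """Toy training loop with simulated hardware faults.

    Periodically snapshots state; on a simulated failure (standing in
    for a chip fault or silent data corruption), restores the last
    good snapshot rather than restarting the whole run.
    """
    state = {"step": 0, "loss": 1.0}
    snapshot = copy.deepcopy(state)
    rng = random.Random(42)
    recoveries = 0
    while state["step"] < steps:
        if rng.random() < fail_prob:          # simulated fault
            state = copy.deepcopy(snapshot)   # restore last healthy state
            recoveries += 1
            continue
        state["step"] += 1
        state["loss"] *= 0.999                # stand-in for a real update
        if state["step"] % checkpoint_every == 0:
            snapshot = copy.deepcopy(state)   # record a known-good state
    return state, recoveries

final, n_recoveries = train_with_recovery(1000)
print(final["step"], n_recoveries)  # reaches step 1000 despite rollbacks
```

The design trade-off is the same one the article alludes to: more frequent checkpoints mean less work lost per fault but more snapshot overhead, and at months-long training scales that balance dominates overall throughput.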

Reasoning Beyond Text

The motivation behind Gemini was to demonstrate that native multimodality leads to superior performance on complex tasks that require both visual and logical reasoning. Gemini Ultra became the first model to outperform human experts on the MMLU benchmark, but its real breakthrough was in its 'cross-modal reasoning'—the ability to look at a chart, understand the underlying data, and then write code to reproduce it. This revealed that a model's 'intelligence' is amplified when it can see the world through multiple lenses at once. It suggested that the future of AI is not in building better 'chatbots' but in building systems that can perceive and act in the world as holistically as humans do. It raises the question of whether the next leap in intelligence will come from even larger models or from a deeper integration of sensory inputs that more closely mimic the human experience.
