Tuna: A Native Multimodal Model with Continuous Visual Representation

Published on January 06, 2026 | Translated from Spanish
Figure: Diagram of the Tuna model architecture, showing visual data flowing through a VAE encoder and a representation encoder into a unified feature space, with example comprehension and generation outputs.

The field of multimodal artificial intelligence is evolving toward more integrated and efficient systems. Traditionally, models for understanding and generating visual content have operated separately, creating inefficiencies and information loss. We present Tuna, a revolutionary approach that builds a continuous visual representation space within a single native system, enabling comprehensive and coherent processing of images and videos. 🚀

The Unified Architecture: The Heart of Tuna

The core innovation of Tuna lies in its native architecture. Instead of using independent encoders for different tasks, Tuna sequentially chains a VAE (variational autoencoder) encoder with a pre-trained representation encoder. This produces a unified feature space that serves as a lingua franca for interpreting and recreating visual content. This internal coherence eliminates the translation problems between disparate representation formats that are a common bottleneck in systems with decoupled components. As a result, information flows more smoothly and quality improves markedly in both analysis and synthesis tasks. 🧠
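To make the chaining concrete, here is a minimal sketch of the idea in PyTorch. The class name, layer sizes, and wiring are illustrative assumptions on our part, not Tuna's published implementation.

```python
# Illustrative sketch only: names, shapes, and layer sizes are assumptions,
# not Tuna's actual implementation.
import torch
import torch.nn as nn

class UnifiedVisualEncoder(nn.Module):
    """Chain a VAE encoder with a pre-trained representation encoder so that
    comprehension and generation share one continuous feature space."""

    def __init__(self, vae_encoder: nn.Module, repr_encoder: nn.Module):
        super().__init__()
        self.vae_encoder = vae_encoder    # pixels -> continuous VAE latents
        self.repr_encoder = repr_encoder  # pre-trained; latents -> unified features

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        latents = self.vae_encoder(pixels)     # (B, C_latent, h, w)
        features = self.repr_encoder(latents)  # (B, C_feat, h', w')
        return features                        # read by both comprehension and generation heads


# Toy stand-ins for the real (much larger) encoders, just to show the data flow.
toy_vae = nn.Sequential(nn.Conv2d(3, 16, kernel_size=4, stride=4), nn.GELU())
toy_repr = nn.Sequential(nn.Conv2d(16, 64, kernel_size=2, stride=2), nn.GELU())

encoder = UnifiedVisualEncoder(toy_vae, toy_repr)
unified = encoder(torch.randn(2, 3, 256, 256))
print(unified.shape)  # torch.Size([2, 64, 32, 32])
```

The point of the sketch is simply that a single forward pass produces the features every downstream module reads from, rather than each task maintaining its own encoder.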

Key advantages of the unified space:
  • Elimination of format mismatches: Because no independent encoders are involved, the incompatibilities that degrade performance in traditional approaches disappear.
  • Comprehensive processing: The same representation space handles both images and videos (see the sketch after this list), simplifying the model architecture.
  • Efficiency in data flow: Internal coherence allows more direct, lossless information exchange between system modules.
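The post does not spell out how videos enter this same space. A minimal sketch, under the assumption that a clip simply reuses the image encoder frame by frame (continuing with `encoder` from the snippet above), could look like this:

```python
# Assumption for illustration: the post does not specify Tuna's video handling,
# so here a clip just reuses the image encoder on each frame.
import torch

def encode_clip(encoder: torch.nn.Module, clip: torch.Tensor) -> torch.Tensor:
    """clip: (batch, time, 3, H, W) -> features: (batch, time, C, h, w)."""
    b, t = clip.shape[:2]
    frames = clip.flatten(0, 1)        # fold time into the batch dimension
    feats = encoder(frames)            # same unified feature space as still images
    return feats.unflatten(0, (b, t))  # restore the time dimension

# e.g. encode_clip(encoder, torch.randn(2, 8, 3, 256, 256)) -> (2, 8, 64, 32, 32)
```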
More broadly, the finding that joint training is beneficial (detailed in the next section) suggests a promising path for developing more generalist artificial intelligences.

Results, Scalability, and Mutual Benefit

Extensive evaluations on standard benchmarks confirm Tuna's superiority. The model sets new records in image and video comprehension, content generation, and image editing. These advances not only validate the unified design but also demonstrate its scalability: performance improves systematically as more powerful pre-trained representation encoders are integrated. This underscores the crucial role of these components in the multimodal ecosystem. 📈

Performance and approach highlights:
  • State-of-the-art performance: Achieves top results in comprehension and generation, demonstrating the effectiveness of the unified paradigm.
  • Proven scalability: The model directly benefits from advances in base encoders, ensuring its future relevance.
  • Synergistic joint training: A crucial finding is that, within this unified framework, training on comprehension and generation data together makes the two tasks reinforce each other rather than interfere or compete for capacity (a toy sketch of such a training step follows this list).
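As a toy illustration of that synergy, the step below sums an understanding loss and a generation loss computed over the same shared representation; the method names, batch fields, and weighting are hypothetical, not the recipe reported for Tuna.

```python
# Hypothetical joint-training step: the model API, batch fields, and loss
# weighting are assumptions for illustration, not Tuna's published recipe.
import torch

def joint_training_step(model, batch, optimizer, gen_weight: float = 1.0):
    """One step over a mixed batch of comprehension and generation data.

    `model` is assumed to expose two heads over the shared visual features:
    `understanding_loss(...)` (e.g. next-token prediction on text) and
    `generation_loss(...)` (e.g. reconstruction of continuous visual latents).
    """
    optimizer.zero_grad()
    loss_und = model.understanding_loss(batch["images"], batch["questions"], batch["answers"])
    loss_gen = model.generation_loss(batch["prompts"], batch["target_images"])
    loss = loss_und + gen_weight * loss_gen  # both objectives share one backward pass
    loss.backward()
    optimizer.step()
    return {"understanding": loss_und.item(), "generation": loss_gen.item()}
```

The design point is that both gradients flow back through the same unified visual encoder, which is where the reported mutual benefit would accrue.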

The Future of Multimodal AI

Tuna represents a significant step toward more generalist and cohesive AI models. Its architecture suggests that the future lies not in siloed "understand" and "create" components, but in a fluid conversation within a single system. By unifying visual representation, Tuna not only overcomes technical limitations but also paves the way for artificial intelligences that interact with the visual world in a more natural and holistic way. The paradigm of continuous representation could be the key to the next generation of creative and analytical tools. ✨