Gemini Omni: edit videos by speaking, like ordering a coffee without milk

Google has introduced Gemini Omni, a model that transforms one video into another through natural-language instructions given as a conversation. Unlike the earlier Veo, this system edits the original frames while preserving scene coherence and the characters' actions. It currently generates clips of up to 10 seconds with sound, although the company already plans to extend that limit.

Physics and historical context in every frame 🧠

The model draws on the Gemini ecosystem to generate scenes that take historical and scientific context into account. It reproduces phenomena such as gravity and fluid dynamics with precision, so you can, for example, swap the background of a medieval fight for a space storm without the characters floating off like balloons. It can also create personalized digital avatars, using the system's broad knowledge to keep the visuals logically consistent.

Every YouTuber's dream: editing without opening After Effects 🎬

Now any mortal can say "swap that cat for a dancing dinosaur" and the video will obey. The downside is that if you ask for an 11-second clip, Gemini will look at you with digital disdain and remind you that it's still in beta. But hey, while you wait, you can create an avatar that does things you would never do, like cleaning the house. Human laziness has finally found its tool.