Gemini Omni Flash — multimodal video generation, now on Kolbo.AI
Google announced Gemini Omni Flash at I/O 2026, and it is now fully integrated into Kolbo.AI. This is the first model in the Omni family — and the reason it stands apart from every other video model currently available is simple: it accepts any combination of inputs you throw at it, and it actually understands what all of them mean together.
Text. Image. Video. Voice. All at once, all in the same generation.
What Is Gemini Omni Flash?
Most AI video models work with one or two input types. You write a prompt, or you upload an image, and the model generates something that vaguely relates to what you gave it. The logic connecting your input to the output is often opaque — results feel random even when they look good.
Gemini Omni Flash is designed differently. It is a true multimodal model, meaning it processes text, image, video, and voice together to form a single understanding of what you are asking for. The "Omni" name is not just branding; it reflects the model's architecture. And the "Flash" tier means it delivers that capability at high speed.
It is available in Kolbo.AI inside the Elements tool, in image-to-video mode.
Four Input Types, One Model
Text: Direct, Specific, No Prompt Engineering Required
You can drive the entire generation from a written description. Gemini Omni Flash has strong real-world knowledge, so you do not need to explain physics, anatomy, or narrative logic — the model already understands them. Describe a scene and it generates one that behaves the way that scene should behave.
Write "a street photographer walking through a rainy market at dusk" and you get wet reflections on the pavement, natural ambient noise in the motion, and a gait that feels unhurried. The model fills in what you did not say because it understands what you meant.
Image: Your Visual, Brought to Life
Feed it a still image and describe how the scene should move. A product shot becomes a rotating reveal. A portrait becomes a subtle, breathing, expressive moment. A sketch becomes an animated sequence.
The model does not just add motion to a frame — it interprets the image for spatial depth, lighting direction, and subject relationships, then animates it in a way that respects all of that. The result moves like the image was always meant to.
Video: Continue, Remix, or Transform
Upload an existing video clip as a reference. Gemini Omni Flash can use it as a motion reference, a stylistic baseline, or a scene to continue. Describe what changes you want — or what you want to preserve — and the model handles the transformation.
This is where conversational refinement becomes powerful. Generate a first pass, describe what is close but not quite right, and iterate. You are not rebuilding the prompt from scratch each time. The model holds context between steps, so adjustments compound rather than restart.
Voice: Performance That Carries Into the Video
Attach a voice clip (your own, a character's, or a reference performance) and the model reads it as direction. The expression, pacing, and emotional tone of the voice are reflected in the visual output. A character on screen can feel connected to the voice you gave them, not just lip-synced to it.
For presenter videos, character shorts, or any content where the speaker's presence matters, this changes what AI video generation is capable of producing.
Conversational Editing: Refine Step by Step
One of the things that makes Gemini Omni Flash genuinely different in practice is how iteration works.
Most video generation workflows are one-shot. You write a prompt, generate, evaluate, rewrite the prompt, generate again. Each generation is independent. What you learned from the last one does not carry forward automatically.
Gemini Omni Flash supports conversational refinement. Generate a clip, then describe what to adjust — keep the motion, change the lighting; keep the character, shift the environment; extend the moment by a beat. The model holds the context of what was already working and applies the adjustment without losing it.
For creators who iterate their way to a result rather than describing the final output upfront, this is a meaningfully faster workflow.
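To make the difference between one-shot prompting and conversational refinement concrete, here is a minimal Python sketch. It is not the Kolbo.AI or Gemini API: the RefinementSession class and its methods are hypothetical names invented for illustration. It only shows the general idea that each adjustment is added on top of what came before, rather than replacing the prompt.

```python
from dataclasses import dataclass, field


@dataclass
class RefinementSession:
    """Hypothetical record of a conversational editing session (illustration only)."""

    base_prompt: str
    adjustments: list[str] = field(default_factory=list)

    def refine(self, instruction: str) -> str:
        """Record one adjustment and return the full context for the next pass."""
        self.adjustments.append(instruction)
        return self.context()

    def context(self) -> str:
        """Join the original prompt with every adjustment made so far, in order."""
        return "\n".join([self.base_prompt, *self.adjustments])


session = RefinementSession(
    "a street photographer walking through a rainy market at dusk"
)
session.refine("keep the motion, change the lighting")
final_context = session.refine("extend the moment by a beat")
print(final_context)
```

In a one-shot workflow, only the latest prompt would reach the model. Here the original description and both adjustments travel together, which is the behavior the section above describes.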
Real-World Logic Baked In
The model's knowledge of physics, biology, narrative structure, and cultural context is not a separate layer; it is part of how the model interprets inputs. A burning candle flickers in the direction of drafts. A crowd reacts in anticipation of something before it happens. A character's expression matches the emotional register of the scene.
This is what separates outputs that feel intentional from outputs that look technically correct but feel off. Gemini Omni Flash closes that gap significantly.
Digital Avatars and Presenter Video
Gemini Omni Flash has strong support for generating presenter-style and digital avatar content. Attach a voice, provide image or video reference for the character's appearance, and describe the performance. Voice, expression, and on-screen action stay connected — the output reads as a coherent performance rather than stitched-together parts.
For product videos, explainers, social content, or any format where a human presence drives the narrative, this is a direct and practical capability.
Where to Find It
Gemini Omni Flash is live now. Open the Elements tool in your Kolbo workspace, switch to image-to-video mode, and select Gemini Omni Flash from the model picker.
No extra setup. Your credits work as usual.
Try Gemini Omni Flash →
Best,
Zohar
Founder, Kolbo.AI



