Gemini Omni and 3.5 land with multimodal demos
Google's new flagships are sold on photos, slides and video. The brands cited in answers will be the ones whose visual assets are machine-readable.
Key takeaways
- Google launched Gemini Omni and Gemini 3.5 with nine multimodal demos, not benchmarks.
- The unit of retrieval is moving from text passages to images, charts, slides and video frames.
- Industrial groups with deep visual archives have an asymmetric advantage if assets are captioned and public.
- Multilaterals risk losing citation credit on their own charts when PDFs lack persistent captions and DOIs.
- Gemini rarely shows sources in voice and app modes; track whether your framing appears in answers, not just whether you are linked.
Google has rolled out two new flagships, Gemini Omni and Gemini 3.5, and the company's own blog leads with nine demos rather than a benchmark table. That choice is the story. The Google AI blog frames both models around live multimodal tasks: reading a whiteboard, parsing a chart in a PDF, narrating what a phone camera sees, stitching reasoning across video frames. Benchmarks come later, if at all.
For anyone tracking how brands surface in AI answers, the format of the announcement matters more than the model names. Google is signalling that the unit of retrieval is shifting from the text passage to the multimodal artefact.
The demo reel is the spec sheet
The reel has nine vignettes, almost all of them video-first or image-first. A user points a camera at a broken appliance and asks for a fix. Another drops in a research paper and asks Gemini to summarise the figures, not the abstract. A third walks through a slide deck and gets a verbal critique. The model is being sold on its ability to read what is shown, not only what is written.
Two implications follow. First, Google is conceding that the interesting queries in 2026 are not typed. They are spoken, pointed at, or uploaded. Second, the retrieval surface that feeds those answers has to include assets the open web has historically under-indexed: diagrams, slides, product photography, recorded talks, datasheets.
That is awkward for the standard B2B content stack, which is still optimised for the 1,200-word blog post and the gated PDF.
What gets cited when the query is a photograph
Citation behaviour in multimodal mode is not the same as in text mode. When Gemini answers a typed question about, say, Basel III capital requirements, it pulls from a familiar set: regulators, the FT, consultancies, a handful of bank research notes. When the same model answers a photographed question (a screenshot of a term sheet, a picture of a construction site, a frame from a webinar) the retrieval set narrows sharply to whatever it can match visually and semantically at once.
In practice this rewards three things: source material that is visually distinctive and consistently branded; documents whose figures and charts carry machine-readable captions and alt text; and video assets with accurate transcripts indexed against timestamps. None of that is new advice. It is now load-bearing.
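One way to check the caption and alt-text point on your own pages is a small audit script. The sketch below uses only Python's standard-library HTML parser; the class name, the function name and the sample markup are illustrative assumptions, and it says nothing about how Gemini itself reads a page.

```python
from html.parser import HTMLParser


class CaptionAudit(HTMLParser):
    """Collect <img> tags without alt text and <figure> blocks without <figcaption>."""

    def __init__(self):
        super().__init__()
        self.missing_alt = []          # src values of images lacking usable alt text
        self.uncaptioned_figures = 0   # count of <figure> elements with no <figcaption>
        self._figure_stack = []        # one flag per open <figure>: caption seen yet?

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and not (attrs.get("alt") or "").strip():
            # Note: purely decorative images with alt="" are flagged too.
            self.missing_alt.append(attrs.get("src") or "(no src)")
        elif tag == "figure":
            self._figure_stack.append(False)
        elif tag == "figcaption" and self._figure_stack:
            self._figure_stack[-1] = True

    def handle_endtag(self, tag):
        if tag == "figure" and self._figure_stack:
            if not self._figure_stack.pop():
                self.uncaptioned_figures += 1


def audit(html: str) -> CaptionAudit:
    """Parse an HTML string and return the populated audit object."""
    parser = CaptionAudit()
    parser.feed(html)
    parser.close()
    return parser


# Hypothetical page fragment: one well-described figure, one bare image.
sample = """
<figure><img src="kiln.jpg" alt="Rotary kiln cross-section"><figcaption>Fig. 1</figcaption></figure>
<figure><img src="plant.jpg"></figure>
"""
report = audit(sample)
print(report.missing_alt, report.uncaptioned_figures)
```

Run across a public asset library, a report like this gives a rough count of images that carry no text a model could attach to them.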
Industrial groups have an asymmetric advantage here and most are not using it. Holcim, Siemens, Schneider Electric and their peers sit on decades of technical drawings, product imagery and field footage that no competitor and no LLM training set fully covers. If those assets are published with structured captions and accessible transcripts, they become the default visual citation for a category. If they stay locked inside sales portals, a generic stock image wins the answer.
The multilateral problem
Multilaterals and policy institutions face the opposite bind. UNDRR, the World Bank, the IMF and the OECD produce the canonical charts on disaster risk, poverty, inflation and trade. Those charts already appear in AI answers, often without attribution, because they live inside long PDFs that models extract but rarely cite cleanly.
Gemini's new emphasis on parsing PDFs visually raises the stakes. A model that can read a figure on page 47 of a World Bank working paper will quote that figure. Whether it names the World Bank depends on whether the figure carries a persistent caption, a DOI, and a citation string the model recognises as authoritative. Institutions that treat their charts as design objects rather than citable data points will keep losing credit to the aggregators who repackage them.
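A persistent caption, a DOI and a citation string can travel with a chart as structured metadata. The sketch below builds a schema.org ImageObject record as JSON-LD; the property names are standard schema.org vocabulary, but the function name, the URL, the DOI and the report title are placeholders invented for illustration, and whether any given model uses this markup for attribution is an assumption, not something Google has documented.

```python
import json


def figure_jsonld(image_url, caption, publisher, doi, report_title, credit):
    """Describe one published chart as a schema.org ImageObject (JSON-LD)."""
    return {
        "@context": "https://schema.org",
        "@type": "ImageObject",
        "contentUrl": image_url,
        "caption": caption,
        "creditText": credit,  # the citation string you want repeated
        "creator": {"@type": "Organization", "name": publisher},
        "identifier": {
            "@type": "PropertyValue",
            "propertyID": "DOI",
            "value": doi,
        },
        "isPartOf": {
            "@type": "Report",
            "name": report_title,
            "url": f"https://doi.org/{doi}",
        },
    }


# All values below are hypothetical placeholders, not real publications.
record = figure_jsonld(
    image_url="https://example.org/charts/figure-4.png",
    caption="Figure 4. Illustrative caption describing what the chart shows.",
    publisher="Example Institution",
    doi="10.0000/example.2026.001",
    report_title="Example Working Paper",
    credit="Example Institution (2026), Example Working Paper, Figure 4.",
)
print(json.dumps(record, indent=2))
```

The resulting JSON would typically sit in a `<script type="application/ld+json">` element on the chart's landing page, next to the same caption in visible text.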
Financial services: the screenshot problem
Banks and asset managers have spent a decade building locked research portals. Gemini Omni does not care. When a user photographs a Bloomberg terminal or screenshots a research note and asks the model to interpret it, the answer is generated from whatever the model can see in the image plus whatever public commentary exists around the same topic.
That public commentary is now the brand surface. Firms that publish accessible explainers, glossaries and chart libraries on the open web will be quoted as the interpretive layer over screenshots they did not produce. Firms that keep everything behind a login will watch competitors annotate their own data.
What the demos do not show
None of the nine vignettes shows a citation. The model answers; sources are implied. This is consistent with how Gemini has handled attribution across 2025: present in Search's AI Overviews, sparse in the Gemini app, almost absent in voice mode. The product direction is towards confident synthesis, not visible sourcing.
For brands, that means the question "did Gemini cite us?" is becoming harder to answer and less useful to ask. The better question is whether the model's answer carries language, framing or figures that originated with you. Measuring that requires prompt-level testing against your own corpus, not a glance at a citation dashboard.
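A crude first pass at that kind of test is to measure how much of an answer's phrasing also appears in your own published text. The sketch below scores word three-gram overlap; the function names, the choice of three-word windows and the sample strings are all assumptions for illustration, and the answer text is assumed to have been collected separately by running your prompts against the model.

```python
import re


def shingles(text: str, n: int = 3) -> set:
    """Return the set of lowercase word n-grams in a text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def phrasing_overlap(answer: str, corpus: list, n: int = 3) -> float:
    """Share of the answer's word n-grams that also occur in any corpus document."""
    answer_grams = shingles(answer, n)
    if not answer_grams:
        return 0.0
    corpus_grams = set().union(*(shingles(doc, n) for doc in corpus))
    return len(answer_grams & corpus_grams) / len(answer_grams)


# Hypothetical corpus line and hypothetical model answer.
own_corpus = ["Low-carbon cement cuts kiln emissions by design"]
model_answer = "Low carbon cement cuts costs"
score = phrasing_overlap(model_answer, own_corpus)
print(round(score, 2))  # 0.67: two of the answer's three trigrams appear in the corpus
```

Exact n-gram matching misses paraphrase, so a low score does not prove the framing came from elsewhere; it is a floor, and tracking it per prompt over time is more informative than any single reading.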
The nine demos are a product launch. They are also a quiet instruction to anyone who wants to be seen inside the next generation of answers: publish for the camera, caption for the model, and stop assuming the text passage is the unit that wins.