AI agents confirm what they know, rarely truly search
Leading AI search agents mostly validate pretraining memory rather than research the live web, reshaping which brands earn citations and when.
Key takeaways
- AI search agents largely confirm pretraining memory rather than genuinely research the web.
- On a 90-day recency benchmark, the leading agents collapsed and the rankings reshuffled.
- Incumbent brands with deep indexed archives win citations by default.
- Dated, specific, recent content is the main opening for newer entrants.
- Procurement teams are benchmarking enterprise AI search on the wrong metric.
On LiveBrowseComp, a benchmark restricted to events from the last 90 days, the leading AI search agents collapse. The Decoder reports that researchers at the Harbin Institute of Technology built the test precisely to strip models of their pretraining crutch, and the rankings duly reshuffled once GPT-5.4, Kimi K2.6 and their peers could no longer answer from memory. The agents were not researching the web. They were checking it.
This matters because the industry has spent two years selling "agentic search" as a qualitative break from retrieval. The Harbin work suggests the break is narrower than advertised. On standard benchmarks, where questions concern facts the model has already digested, the browsing step functions as confirmation theatre: a quick fetch, a citation, an answer the model would have produced anyway. Remove the memory, and the apparatus wobbles.
What "confirmation" actually means for citations
If an agent's default mode is to validate a prior, the pages it cites are not the pages that informed the answer. They are the pages that matched the answer. That is a meaningfully different selection problem from classical search, and it has direct consequences for which brands appear in LLM outputs.
Three points follow.
First, recency is undervalued by the models themselves, not just by their training cutoffs. An agent that prefers confirmation will gravitate to canonical, older, heavily linked sources, because those are the ones most likely to agree with what it already believes. New explainers, fresh research notes and updated guidance lose to Wikipedia, to the Financial Times archive, to a 2022 McKinsey PDF. For any institution publishing time-sensitive analysis, that is a tax on freshness.
Second, the citation slot rewards semantic alignment with the model's prior, not informational lift. A multilateral publishing a position that contradicts the prevailing pretraining consensus (say, UNDRR on disaster loss accounting, or CGAP on agent banking economics) will be cited less often than a blander source that says what the model expects. The penalty for being early or contrarian is invisibility.
Third, the gap between benchmark performance and real-world performance is now quantified, and it is large. When Harbin's team forced genuine browsing, the leaderboard reordered. That means the models marketed as best at search are partly best at remembering, and the procurement teams at banks, insurers and industrial groups currently choosing an enterprise AI search layer are optimising against the wrong metric.
The brand implication
For B2B brands competing for citation share, the practical read is uncomfortable. Producing more content does not help if the model's first instinct is to confirm rather than discover. What helps is becoming the source the model already believes. That is a pretraining problem, not a publishing problem, and it favours incumbents with deep archives, dense inbound linking and long-standing presence in the corpora the major labs scrape.
Financial services firms with decades of indexed research (JPMorgan, BlackRock, the Bank for International Settlements) are structurally advantaged. So are the UN system's older agencies, whose documents have been crawled, cross-referenced and quoted for twenty years. The losers are newer entrants, rebranded institutions, and anyone whose authoritative content sits behind logins or in PDFs the 2023 to 2024 training runs missed. They will be under-cited even when they are right, because the agent is not really looking.
There is a narrow opening. The same Harbin result implies that on genuinely recent questions (those inside the 90-day window) the confirmation strategy fails and the agent has to browse properly. Brands that dominate fresh, specific, dated content (quarterly data releases, named research notes with publication dates in the title, indexed press statements) can win the citations the incumbents cannot fake from memory. That is where the marginal visibility is, and it is the slice of the query mix growing fastest as users push agents toward live questions.
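For teams that want to act on this, one way to make a publication date unambiguous to a crawler is to state it in structured metadata as well as in the title. The sketch below is illustrative only: it assumes schema.org `Article` markup is the chosen format, the function names are hypothetical, and the 90-day figure simply mirrors the benchmark window described above. Nothing here is drawn from the Harbin researchers' method, and nothing guarantees that any given agent reads this markup.

```python
import json
from datetime import date, timedelta

# Assumption: mirrors the 90-day window the article attributes to the benchmark.
RECENCY_WINDOW_DAYS = 90


def is_within_window(published: date, today: date,
                     window_days: int = RECENCY_WINDOW_DAYS) -> bool:
    """Return True if `published` is at most `window_days` before `today`
    and not in the future."""
    age = today - published
    return timedelta(0) <= age <= timedelta(days=window_days)


def article_jsonld(headline: str, published: date) -> str:
    """Serialise minimal schema.org Article metadata with an explicit
    ISO 8601 publication date (hypothetical helper, for illustration)."""
    metadata = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "datePublished": published.isoformat(),
    }
    return json.dumps(metadata, indent=2)


if __name__ == "__main__":
    # Hypothetical research note with the date in both title and metadata.
    note_date = date(2025, 1, 15)
    print(article_jsonld("Q4 data release, 15 January 2025", note_date))
    print(is_within_window(note_date, today=date(2025, 3, 1)))
```

The window check is the kind of test an editorial team might run over its own archive to see how much of its output would even be eligible for a recent-events question; the threshold is a parameter, not a recommendation.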
What the labs will do next
Expect the major vendors to respond by tuning agents to browse more aggressively on questions flagged as time-sensitive, and to weight recency more explicitly in retrieval. OpenAI has already shipped iterations in this direction; Anthropic's research previews suggest similar work. The benchmark will move. The behaviour will lag.
In the meantime, the citation economy inside LLM answers is closer to a recency arbitrage than the labs admit. Brands that publish dated, specific, machine-readable analysis on questions younger than the training cutoff will be over-represented relative to their authority. Brands relying on evergreen thought leadership will be under-represented relative to theirs. The agents are not searching. They are remembering. Plan accordingly.