Pretraining exposure dictates which brands LLMs favour
A 7.4-trillion-token analysis shows LLM brand preferences track training-data frequency, not market reality. Corpus presence is now a CMO problem.
Key takeaways
- Pretraining token frequency, not real-world popularity, predicts which brands LLMs surface in answers.
- Rebrands, spinouts, and post-2022 scale-ups are systematically under-represented in current model outputs.
- Multilaterals with smaller footprints get crowded out by larger sister agencies on topics they actually own.
- Wikipedia entries, open-access research, and structured knowledge bases now matter more than gated press releases.
- Expect pretraining-exposure scores to become a tracked CMO metric within twelve months.
What happened
According to a paper posted on arXiv, a new study using the fully open OLMo models and their Dolma pretraining corpus has done what almost no prior popularity-bias paper could. It measures, token by token, how often 2,000 entities actually appear across 7.4 trillion tokens of training data, then tests whether that exposure, rather than real-world fame, explains which brands, people, places, and products the model favours when asked.
The answer is exposure. The researchers compared pretraining frequency against Wikipedia pageviews and two elicited popularity signals from the LLMs themselves (direct scalar ratings and pairwise comparisons). Pretraining exposure correlated strongly with both the model's stated preferences and Wikipedia traffic. Translation: when an LLM tells a user that Brand A is more prominent or more trustworthy than Brand B, it is largely reporting how often Brand A appeared in its training corpus, not how relevant Brand A is in the live market.
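To make the comparison above concrete, here is a minimal sketch of how exposure can be checked against a model's ratings and against pageviews. It assumes a Spearman rank correlation, which the summary above does not confirm as the study's statistic, and every number in it is an invented placeholder rather than data from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical figures for five unnamed brands (NOT taken from the study).
token_counts = [1_200_000, 85_000, 430_000, 9_500, 2_700_000]   # corpus mentions
model_ratings = [7.0, 5.5, 8.1, 3.2, 9.0]                       # elicited 1-10 scores
pageviews = [900_000, 150_000, 120_000, 20_000, 2_000_000]      # Wikipedia traffic

print("exposure vs model rating:", round(spearman(token_counts, model_ratings), 3))
print("exposure vs pageviews:  ", round(spearman(token_counts, pageviews), 3))
```

A rank-based measure is used here because mention counts span several orders of magnitude, so the ordering of brands is more informative than the raw gaps between them.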
The study covers five entity types: Person, Location, Organization, Art, Product. Organizations and Products are the categories every B2B marketer cares about. And both behave the same way: token frequency in pretraining predicts model output.
Why it matters for your brand
The implication for enterprise brand-building is uncomfortable. For two decades, marketing leaders have optimised for two things: search rank and analyst recognition. Neither maps cleanly to pretraining exposure. A Gartner Magic Quadrant placement is one document. A decade of forum threads, GitHub issues, Wikipedia edits, podcast transcripts, and news mentions is millions of documents. The LLM has read all of it and weighted accordingly.
For financial services, this is a structural problem. Mid-size asset managers, regional banks, and specialist insurers tend to have thin pretraining footprints relative to the global incumbents. A wealth client asking ChatGPT for "the most respected ESG-focused asset managers in Europe" will be answered through the lens of which firms were most discussed in the 2019 to 2023 web, not which firms have the strongest current product. If your firm rebranded, spun out, or scaled aggressively after 2022, the model effectively does not know you exist at the weight you deserve.
For multilaterals and UN-system bodies, the same asymmetry plays out between sister agencies and is just as damaging. The World Bank, IMF, and WHO have enormous pretraining footprints. Smaller agencies (UNDRR, UNCDF, specialised programmes inside UNDP) get crowded out of LLM answers even when they are the actual mandate holder on a topic. A policy researcher asking an LLM "who leads on disaster risk reduction financing" should get UNDRR and CGAP. Often the answer is the World Bank instead, because the World Bank dominates the token count. The fix is not better SEO. It is sustained presence in the corpora that future models will train on: policy archives, indexed publications, Wikipedia entries with citations, and syndicated commentary in outlets the crawlers prioritise.
For major industrial groups (cement, steel, chemicals, logistics), the study explains a pattern we have seen in client audits: legacy brand names outperform current ones inside LLM answers. A company that renamed itself in 2021 still gets surfaced under its 2010 name because the pretraining corpus has fifteen years of the old name and three years of the new one. If you are planning a rebrand, a merger, or a portfolio consolidation, assume the LLM will lag the market by three to five years unless you actively seed the new entity into high-frequency corpora.
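A rough way to see the legacy-name effect in your own material is to count how often each name appears in a sample of public text. The sketch below is illustrative only: "Oldco Cement" and "Newco Materials" are hypothetical names, the three documents are invented, and simple whole-phrase matching is an assumption, not the method used in the study.

```python
import re
from collections import Counter


def count_mentions(documents, names):
    """Count case-insensitive whole-phrase mentions of each name."""
    patterns = {
        name: re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for name in names
    }
    totals = Counter({name: 0 for name in names})
    for text in documents:
        for name, pattern in patterns.items():
            totals[name] += len(pattern.findall(text))
    return totals


# Invented sample standing in for scraped public text (hypothetical company).
sample_documents = [
    "Oldco Cement opened a plant in 2012. Oldco Cement expanded in 2015.",
    "Newco Materials, formerly Oldco Cement, reported annual results.",
    "Analysts still refer to oldco cement in forum threads.",
]

counts = count_mentions(sample_documents, ["Oldco Cement", "Newco Materials"])
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name}: {n} mentions ({n / total:.0%} of name mentions)")
```

Run against a larger sample of indexed pages, the share held by the new name gives a crude signal of how far the rebrand has penetrated the kind of text a future model is likely to see.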
For philanthropic and policy institutions, the pretraining-exposure finding reframes thought leadership ROI. A single op-ed in the FT is worth less than a long tail of think-tank papers, working papers, and conference proceedings that get scraped, mirrored, and cited across the open web. The Gates Foundation is in every model. A newer foundation with a $2bn endowment may not be. The asset that closes the gap is not paid media; it is structured, durable, machine-readable output.
The strategic shift for content teams is to stop thinking about "campaigns" and start thinking about "corpus contribution." Every piece of content has two audiences now: the human reader this quarter, and the model trainer two years from now. Press releases that sit behind a wire service paywall do not get into Dolma or its successors. Wikipedia entries, open-access research, indexed white papers, transcripts of named-executive appearances, and entries in structured knowledge bases (Crunchbase, OpenCorporates, Wikidata) do. The brands that will dominate LLM answers in 2027 are the ones investing in that infrastructure now.
The signal in context
This paper is among the first to test directly a hypothesis the GEO and AEO community has been operating on for eighteen months: that LLM brand preferences are an artefact of training data composition, not a reflection of market reality. Earlier studies could only correlate model outputs with proxies like Wikipedia traffic or Google Trends. Because OLMo and Dolma are fully open, the researchers could measure the actual independent variable. The result confirms that "popularity" inside an LLM is a measurement of historical web presence, frozen at the training cutoff, filtered by what the crawlers prioritised.
That has two consequences worth watching. First, every model release resets the leaderboard slightly. Brands that invested in open, indexable presence between training cutoffs will gain ground; brands that relied on gated content, paid placements, or social posts behind logged-in walls will lose it. Second, the gap between "real-world leader" and "LLM-cited leader" is now measurable. Expect competitive intelligence vendors to start selling pretraining-exposure scores within twelve months, and expect CMOs to be asked about that score in board meetings shortly after.