Field note · Citation patterns · 21 May 2026 · 3 min read

Study: LLM memory predicts which papers get cited

If repetition across the open web shapes what models remember, gated PDFs and unindexed reports are invisible to the systems your buyers now ask.

15 of 17 LLMs showed a positive memory-citation link (arXiv study, 549 CS papers, 2023 to 2024)

Key takeaways

LLM memory of academic papers correlates positively with their citation counts in 15 of the 17 models tested, drawn from six vendors.
The mechanism generalises: textual exposure on the open web shapes what models internalise about any entity.
Gated PDFs and document portals are invisible to training corpora; HTML mirrors and third-party explainers are not.
The Matthew effect compounds in LLM memory: established frameworks get remembered more, new entrants face a steeper visibility cliff.
Expect vendor tooling within 18 months that probes models for what they remember about your brand.

What happened

A new paper on arXiv proposes "LLM-Metrics," a research-impact measure built not from citation counts but from what large language models remember. The authors tested 549 computer science papers published in 2023 and 2024 against 17 LLMs from six vendors, ranging from 0.5B to 72B parameters. They probed each model on four recognition tasks: title, author, method, and venue.

The result: 15 of 17 models showed a positive memory-citation link, 9 of them significant at p < 0.05, with an overall Spearman correlation of rho = 0.1495 (p = 0.0004) against traditional citation counts. The hypothesis is straightforward. High-impact papers get more textual exposure on the open web. That exposure flows into training data. Models then encode those papers more deeply in parametric memory.
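To make the statistic concrete, here is a minimal sketch of how a Spearman rank correlation between memory scores and citation counts can be computed. It is not the authors' code: the function names and the toy data are invented for illustration. The last step assumes the overall figure was computed over all 549 papers (the source does not say so explicitly) and uses a normal approximation to check that rho = 0.1495 is consistent with the reported p-value.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def approx_two_sided_p(rho, n):
    """t statistic for rho and a two-sided p-value via a normal approximation."""
    t = rho * math.sqrt((n - 2) / (1 - rho ** 2))
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p


# Toy data (hypothetical): one memory score and one citation count per paper.
memory_scores = [0.9, 0.4, 0.7, 0.2, 0.5]
citation_counts = [120, 15, 30, 8, 40]
print("toy rho:", spearman_rho(memory_scores, citation_counts))

# Consistency check on the reported figure, assuming n = 549 papers.
t_stat, p_value = approx_two_sided_p(0.1495, 549)
print(f"t = {t_stat:.2f}, approximate p = {p_value:.4f}")
```

On the toy data rho is 0.9, because the two rankings disagree on only one adjacent pair. Under the n = 549 assumption, the check gives t of about 3.54 and p of about 0.0004, in line with the reported value: a weak correlation can still be statistically significant when the sample is this large.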

In plain terms: what an LLM remembers correlates with what the academic community cites. Not perfectly. But measurably, repeatably, and across vendors.

Why it matters for your brand

The correlation is modest (rho = 0.1495), but the direction of travel is the story. We now have empirical support for something the GEO community has asserted on instinct: repetition across the open web shapes what LLMs internalise, and what they internalise shapes what they surface. The paper studies only academic citations, but the mechanism plausibly extends to any entity a model might be asked about: a company, a methodology, a framework, a person, a standard.

For financial services brands, this reframes the thought-leadership question. A white paper that sits behind a gated PDF and gets shared on LinkedIn is invisible to training corpora. The same paper, posted as HTML, summarised on three industry blogs, referenced in a Wikipedia entry, and discussed on Reddit, enters parametric memory. When a CFO asks ChatGPT about counterparty risk frameworks, the bank whose framework is remembered wins the citation. The bank whose framework was gated does not exist in the answer.

For multilaterals and UN agencies, the implication is sharper. Institutional knowledge in this sector tends to live in long PDFs, often hosted on slow document portals, often poorly indexed. UNDRR's Sendai Framework is well-remembered by LLMs because it has been written about thousands of times in textual form by third parties. Newer frameworks, published as PDFs with no HTML mirror and no third-party explainer ecosystem, will not be remembered. The asymmetry between "published" and "memorised" is now measurable.

For major industrial groups, the LLM-Metrics logic suggests that technical authority is built less by publishing original research and more by ensuring that research is discussed, summarised, and cross-referenced across the surfaces models actually train on. Holcim's work on low-carbon cement is a useful test case: the underlying science matters less to an LLM than the volume of third-party textual coverage of that science. The brands winning AI visibility in industrial categories are the ones who have seeded an explainer ecosystem around their IP.

For philanthropic and policy institutions, the finding is uncomfortable. The Matthew effect the authors flag in traditional citation, where well-cited papers get cited more, is reproduced in LLM memory. Established frameworks compound. New entrants face a steeper visibility cliff than they did in the pre-LLM web. A foundation publishing a novel methodology in 2025 cannot rely on the methodology being discoverable in 2027 unless it has been written about, in HTML, by sources outside the foundation's own domain.

Source: arXiv: LLM citation behaviour

AI-authored, editor reviewed
