Field note · Citation patterns · 1 July 2026 · 4 min read

LLM hallucinations are passing peer review at top AI venues

Phantom citations are now in the archival record at NeurIPS, ICML, and ICLR. That corrupts the training data shaping how AI models represent your field.

61% of papers with hallucinated citations passed peer review
Source: arXiv / RefChecker study; ICLR, ICML, NeurIPS proceedings

Key takeaways

61% of papers with hallucinated citations still passed peer review at top AI venues (ICLR, ICML, NeurIPS).
Once indexed, phantom citations enter AI training corpora, compounding the error across future model generations.
Brands and institutions whose authority rests on cited expertise face indirect reputational risk from misattributed findings.
Content teams using LLMs to draft research summaries face the same hallucination risk with less scrutiny than academic authors.
Persistent identifiers and database-indexed primary research are the baseline defence against citation traceability failures.

Sixty-one percent. That is the share of papers containing at least one hallucinated citation that still passed peer review at ICLR, ICML, or NeurIPS, according to research published on arXiv by the team behind RefChecker, a new automated citation-verification pipeline. The finding is not a warning about a future risk. It is a measurement of a present condition: phantom references are already in the archival record at the venues that set the reading list for the next generation of AI researchers.

The mechanism matters. Large language models produce fluent scientific prose, and fluency is what peer reviewers most readily notice. A citation, by contrast, is easy to skip. Reviewers check arguments; they rarely verify that "Smith et al., 2022" resolves to a real paper with a compatible author list. RefChecker operationalises exactly that check, cross-referencing bibliography entries against multiple scholarly databases and escalating unresolved cases to web-search re-verification. The definition is conservative: the researchers exclude ordinary bibliographic drift (venue or year differences, minor name variants) and count only identity-level failures, meaning works that do not exist or carry substantially wrong author lists.
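The distinction between bibliographic drift and an identity-level failure can be made concrete with a small sketch. This is not RefChecker's code: the `Reference` type, the `classify` function, the in-memory `records` list standing in for scholarly database results, and both thresholds are assumptions chosen for illustration. It also matches family names exactly, so it would need extra normalisation to tolerate the minor name variants the researchers exclude.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Reference:
    title: str
    authors: tuple[str, ...]  # family names only, to keep the sketch small
    year: int


def _norm(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    kept = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(kept.split())


def title_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalised titles."""
    return SequenceMatcher(None, _norm(a), _norm(b)).ratio()


def author_overlap(cited: tuple[str, ...], found: tuple[str, ...]) -> float:
    """Fraction of cited family names that appear in the found record."""
    cited_set = {_norm(name) for name in cited}
    found_set = {_norm(name) for name in found}
    if not cited_set:
        return 0.0
    return len(cited_set & found_set) / len(cited_set)


def classify(entry: Reference, records: list[Reference],
             title_min: float = 0.9, author_min: float = 0.5) -> str:
    """Label a bibliography entry against candidate database records.

    Thresholds are illustrative assumptions, not values from the study.
    """
    best = max(records,
               key=lambda r: title_similarity(entry.title, r.title),
               default=None)
    if best is None or title_similarity(entry.title, best.title) < title_min:
        return "not_found"       # identity-level: no matching work located
    if author_overlap(entry.authors, best.authors) < author_min:
        return "wrong_authors"   # identity-level: author list does not fit
    if best.year != entry.year:
        return "drift"           # ordinary drift, not counted as hallucination
    return "ok"


# Hypothetical database contents for the example.
records = [Reference("Attention Is All You Need",
                     ("Vaswani", "Shazeer", "Parmar"), 2017)]

cited = Reference("Attention is all you need.", ("Vaswani", "Shazeer"), 2018)
print(classify(cited, records))
```

In a fuller pipeline, a `not_found` result would be a candidate for escalation to a second lookup rather than an immediate verdict, which mirrors the web-search re-verification step described above.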

The result is a contamination problem, not a carelessness problem. Authors using LLM assistance may not know a reference is fabricated. The model invents something plausible, the author trusts it, the reviewer skips it, and the paper ships. Once in the proceedings of a top-tier conference, that phantom citation is indexed, scraped, and fed back into training corpora. The error compounds.

The citation loop that now affects brand visibility

For B2B brands, this matters in a way that goes beyond academic integrity. AI systems trained on scientific literature, including the foundation models now powering enterprise search and procurement research, inherit whatever the literature contains. A hallucinated citation in a NeurIPS paper is not merely an embarrassing footnote; it is a data point that shapes how a model represents facts about a field.

Consider the exposure for institutions whose credibility rests on cited expertise: multilateral bodies publishing technical guidance, standards organisations like ISO or IEEE whose frameworks are referenced in research, industrial groups submitting technical evidence to policy processes. If a language model has ingested proceedings where phantom citations misattribute findings, it may surface subtly wrong attributions when queried about those frameworks. The brand does not control what the model learned; it can only influence what auditable, verifiable sources exist to correct it.

That is the narrower implication for citation strategy. The arXiv paper's conservative definition of hallucination, identity-level failures only, points toward what is actually verifiable at scale: does the cited work exist, and is the author list correct? Any brand producing citable research should now treat those two attributes as the minimum viable standard of bibliographic hygiene. If your white paper cites a source that does not resolve cleanly in a bibliographic database, a verification pipeline like RefChecker would flag it. So, eventually, might a model asked to fact-check your claims.
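A first pass at that hygiene check can run before anything is published. The sketch below assumes a bibliography held as a dictionary of entries with an optional `doi` field; the `missing_identifiers` function and the entry layout are hypothetical, and the regular expression is a commonly used pattern for modern DOIs. It tests syntax only: a well-formed DOI can still point at nothing, so entries that pass would still need to be resolved against a bibliographic database.

```python
import re

# Commonly used pattern for modern DOIs. Syntax check only: it does not
# confirm that the identifier resolves to a real record.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)


def missing_identifiers(bibliography: dict[str, dict]) -> list[str]:
    """Return keys of entries that lack a well-formed DOI."""
    flagged = []
    for key, entry in bibliography.items():
        doi = (entry.get("doi") or "").strip()
        if not DOI_PATTERN.match(doi):
            flagged.append(key)
    return flagged


# Hypothetical reference list for a white paper.
bibliography = {
    "ref-a": {"title": "Example study", "doi": "10.1000/xyz123"},
    "ref-b": {"title": "Report with no identifier"},
    "ref-c": {"title": "Prefixed identifier", "doi": "doi:10.1000/abc"},
}

print(missing_identifiers(bibliography))
```

Entries flagged here are the ones most likely to fail a later existence check, so they are a sensible place to start manual verification.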

Source: arXiv: LLM citation behaviour

AI-authored, editor reviewed