Most AEO 'wins' are just ChatGPT's growth, study finds
A controlled log study on a single domain suggests that most reported ChatGPT referral gains are platform tailwind. The implication: demand a counterfactual, or stop calling it lift.
Key takeaways
- Raw ChatGPT referral growth is dominated by ChatGPT's own user growth, not by AEO interventions.
- The glasp.co study used untreated pages on the same domain as a control, a method most AEO case studies skip.
- First-party server logs beat third-party panel estimators for measuring true lift.
- CMOs should require a treated/untreated split before approving AEO spend.
- Vendors selling uncontrolled before-and-after charts will lose credibility as platform growth flattens.
A new arXiv paper on measuring answer engine optimisation (AEO), "Disentangling Answer Engine Optimization from Platform Growth", tries something the discipline has mostly avoided: a control group. Its conclusion will sting any agency selling 10x ChatGPT referral charts. Most of the growth was not theirs to claim.
The setup is unusually clean. The authors ran a longitudinal log study on glasp.co, a domain with hundreds of thousands of YouTube Q&A pages. In January 2026 they applied a defined bundle of AEO interventions to one slice of the site. The untreated remainder of the same domain, sitting under the same platform tailwind, served as the contemporaneous control. First-party server logs replaced the probabilistic third-party panels that dominate vendor case studies.
The finding, in one line: raw ChatGPT referral growth is dominated by ChatGPT's own growth, not by what optimisers did to the pages. The treated pages did outperform the control, but by a fraction of the headline multiple a normal case study would have trumpeted. Strip out the tide, and the swimmer is moving more slowly than the brochure suggests.
This matters because the entire nascent AEO category is being sold on uncontrolled before-and-after charts. "ChatGPT referrals up 6x since we rewrote your FAQ schema" is the standard pitch. If ChatGPT's user base and referral propensity roughly tripled over the same window, the platform alone would have delivered 3x, and the consultant can claim at most the remaining 2x multiplier (6 / 3), not the 6x on the chart. The glasp.co experiment is the first public attempt to quantify how much of the headline growth belongs to the platform. The answer is: most of it.
For senior marketers, three implications follow, and none of them are comfortable.
First, internal reporting needs a control. Any CMO signing off on AEO spend should demand the same structural test the paper uses: a treated cohort of pages and an untreated cohort on the same domain, measured in first-party logs. Without that, the dashboard is measuring OpenAI's quarter, not the agency's work. Financial services and multilateral communications teams, which tend to have strict attribution discipline for paid media, have so far given AEO a free pass on causal rigour. That gap will close, and vendors who cannot produce a clean counterfactual will lose renewals.
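The structural test described above reduces to a ratio of ratios: growth in the treated cohort divided by growth in the untreated cohort. A minimal Python sketch, assuming referral counts per cohort and period have already been extracted from first-party logs. The class, the function name and the example counts are hypothetical, not figures or code from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CohortCounts:
    """ChatGPT referral counts for one cohort of pages (hypothetical inputs)."""
    before: int  # referrals in the pre-intervention period
    after: int   # referrals in the post-intervention period

    def growth(self) -> float:
        """Raw before-and-after multiple, the number a vendor chart shows."""
        if self.before <= 0:
            raise ValueError("need a non-zero baseline to compute growth")
        return self.after / self.before


def relative_lift(treated: CohortCounts, control: CohortCounts) -> float:
    """Growth of treated pages divided by growth of untreated pages.

    A value of 1.0 means the treated pages only rode the platform tailwind.
    """
    return treated.growth() / control.growth()


# Illustrative numbers only: treated pages grew 6x, control pages grew 3x.
treated = CohortCounts(before=1_000, after=6_000)
control = CohortCounts(before=4_000, after=12_000)
print(relative_lift(treated, control))  # 2.0
```

The estimate is only as good as the assumption behind it: that the untreated pages would have grown at the same rate as the treated ones had nothing been done to either.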
Second, third-party estimators are now on notice. Similarweb-style panels and the various "LLM visibility" trackers infer referral and citation share from sampled behaviour. The paper's authors explicitly chose server logs over those tools. Expect procurement teams at large industrial groups and policy institutions, the buyers most allergic to methodological hand-waving, to start asking which numbers in a pitch deck come from logs and which from models of logs. The honest answer is usually the latter.
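To make the difference between logs and models of logs concrete, here is a rough Python sketch of counting ChatGPT referrals directly from access logs. It assumes the Combined Log Format, and it assumes a ChatGPT visit can be recognised by a `chatgpt.com` referrer or a `utm_source=chatgpt.com` query parameter. The paper's own identification rules are not reproduced here, so both assumptions, along with every function name, are placeholders.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Combined Log Format: ... "METHOD path PROTO" status bytes "referrer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<referrer>[^"]*)" "[^"]*"'
)


def is_chatgpt_referral(path: str, referrer: str) -> bool:
    """Heuristic (an assumption): referrer host or utm_source names chatgpt.com."""
    host = urlsplit(referrer).hostname or ""
    if host == "chatgpt.com" or host.endswith(".chatgpt.com"):
        return True
    query = parse_qs(urlsplit(path).query)
    return "chatgpt.com" in query.get("utm_source", [])


def count_referrals(lines):
    """Return a Counter of ChatGPT referrals keyed by requested page path."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and is_chatgpt_referral(match["path"], match["referrer"]):
            # Drop the query string so each page is counted under one key.
            counts[urlsplit(match["path"]).path] += 1
    return counts
```

Every number this produces is a count of requests the server actually received, which is the property a sampled panel cannot offer.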
Third, the genuine AEO lift, the part left after the tide recedes, is the only number worth optimising against. It is also the number that tells you whether a given intervention (schema, answer-shaped headings, citation-friendly formatting, source authority signals) actually changes model behaviour, or merely coincides with ChatGPT shipping a new retrieval pipeline. The vendors with real craft will welcome this. The ones selling tide as skill will not.
There is a subtler point buried in the methodology. By using one domain's untreated pages as the control, the authors sidestep the cross-domain confounds that ruin most "we did X and traffic went up" stories: different audiences, different crawl frequencies, different citation graphs in the model's training data. The design is replicable. Any large publisher, any UN agency with a sprawling content estate, any bank with hundreds of explainer pages, can run the same split next quarter. The cost is mostly discipline: hold a comparable cohort untouched, and resist the urge to "fix" it for two reporting periods.
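One way to hold a comparable cohort untouched is to fix the assignment before any work starts. The sketch below is an assumption about how a team might do this, not the paper's procedure: it hashes each URL path with a fixed salt, so a page lands in the same cohort in every reporting period. The salt value and the 20% holdout share are hypothetical.

```python
import hashlib


def assign_cohort(path: str, salt: str = "aeo-holdout", control_share: float = 0.2) -> str:
    """Stable pseudo-random assignment of a page to 'control' or 'treated'."""
    digest = hashlib.sha256(f"{salt}:{path}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "control" if bucket < control_share else "treated"


# The same path always gets the same answer, so nobody can quietly
# move a page out of the control group between reporting periods.
print(assign_cohort("/example-page"))
```

Because the assignment depends only on the path and the salt, it can be recomputed from the logs at audit time without storing a separate list of control pages.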
The deeper consequence for brand visibility in LLM answers is that the market is about to bifurcate. On one side, organisations that measure AEO with controls will know which interventions move the needle and will compound those gains. On the other, organisations relying on raw referral curves will keep mistaking platform growth for performance, until ChatGPT's growth rate flattens and the curve does too. At that point the absence of real lift becomes visible all at once, and the AEO line item gets cut in the kind of board meeting the SEO line item survived in 2023.
The paper does not kill AEO. It does something more useful: it sets a measurement bar the serious practitioners will clear and the rest will dodge. CMOs should ask which side their current vendor is on, and ask before the tide turns.