LLMs return different brand reputations depending on query language
A 35,640-response study shows LLMs reconstruct brand reputation differently by language family, making English-only monitoring a structural blind spot.
Key takeaways
- LLMs produce significantly different brand sentiment by query language family (F = 268.5), not just translation differences.
- Germanic languages, including English, return the most critical brand assessments; Uralic and Baltic return the most positive.
- Lower-recognition brands show greater cross-language variability, hitting B2B and regional brands hardest.
- English-only LLM reputation monitoring is structurally misleading for brands operating across European language markets.
- Multilateral institutions face credibility risk if their mandate is described differently across official languages in AI responses.
English-only LLM monitoring is not a conservative choice. It is a blind spot with a measurable cost.
A study published on arXiv this month queried GPT-4.5, Gemini 3.1 Pro, and Perplexity Sonar Pro about 66 brands across twelve European languages, generating 35,640 responses. The finding is blunt: AI-constructed brand reputation shifts systematically depending on which language is used to ask the question. Mean cross-language cosine similarity sat at 0.825, which sounds reassuring until you examine the variance. Sentiment scores differed significantly across language families (F = 268.5, eta-squared = 0.077), and the direction of that difference is not random noise. Germanic languages, including English, return the most critical assessments. Uralic and Baltic languages return the most positive ones. Same language family, more similar reputation; cross-family, meaningfully different.
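To make the similarity figure concrete, here is a minimal sketch of how a mean cross-language cosine similarity and its spread could be computed. It assumes each response has already been turned into a numeric vector; the study's own embedding method is not described here, and the three vectors below are made-up toy values, not study data. The function names are illustrative.

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_similarity(vectors_by_language):
    """Return {(lang_a, lang_b): cosine} for every unordered language pair."""
    langs = sorted(vectors_by_language)
    return {
        (a, b): cosine(vectors_by_language[a], vectors_by_language[b])
        for a, b in combinations(langs, 2)
    }

# Toy vectors standing in for one brand's response in three languages
# (hypothetical values chosen only so the arithmetic is easy to follow).
toy_vectors = {
    "en": [3.0, 4.0],
    "de": [4.0, 3.0],
    "fi": [0.0, 5.0],
}

sims = pairwise_similarity(toy_vectors)
mean_sim = sum(sims.values()) / len(sims)
spread = max(sims.values()) - min(sims.values())

for pair, value in sims.items():
    print(pair, round(value, 3))
print("mean:", round(mean_sim, 3), "spread:", round(spread, 3))
```

A mean can look comfortable while individual language pairs sit far apart, which is why the spread is worth reporting alongside it.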
That effect size is modest by some standards: an eta-squared of 0.077 means language family accounts for roughly 7.7% of the variance in sentiment scores. In brand reputation terms, it is not modest. A company whose products are sold across Northern, Baltic, and Central European markets is being described differently to different audiences by the same three models, and in most cases nobody in that company's communications function knows it.
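For readers who want to see where an F statistic and an eta-squared come from, the following is a rough sketch of a one-way ANOVA over sentiment scores grouped by language family. It assumes a simple one-way design, which may not match the study's exact model, and the scores are toy numbers invented for illustration.

```python
def one_way_anova(groups):
    """Return (F, eta_squared) for a dict of {group_name: [scores]}."""
    all_scores = [x for scores in groups.values() for x in scores]
    n = len(all_scores)   # total number of observations
    k = len(groups)       # number of groups (language families)
    grand_mean = sum(all_scores) / n

    # Variation explained by group membership.
    ss_between = sum(
        len(scores) * (sum(scores) / len(scores) - grand_mean) ** 2
        for scores in groups.values()
    )
    # Total variation around the grand mean.
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    ss_within = ss_total - ss_between

    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    eta_squared = ss_between / ss_total  # share of variance explained
    return f_stat, eta_squared

# Toy sentiment scores per language family (hypothetical, not study data).
toy_scores = {
    "Germanic": [1.0, 2.0, 3.0],
    "Uralic": [2.0, 3.0, 4.0],
    "Baltic": [3.0, 4.0, 5.0],
}

f_stat, eta_squared = one_way_anova(toy_scores)
print(f"F = {f_stat:.2f}, eta-squared = {eta_squared:.2f}")
```

Eta-squared is simply the between-group sum of squares divided by the total, so it reads directly as the proportion of sentiment variance associated with language family.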
The tier problem compounds the language problem
The arXiv study also stratifies brands by recognition tier, and the results push further in one direction: lower-recognition brands show greater cross-language variability. Well-known multinationals have enough training signal in multiple languages that models produce more consistent outputs. Regional or mid-tier brands, exactly the kind that dominate B2B industrial, financial services, and policy sectors, lack that density. Their reputation in Finnish or Latvian is more likely to be assembled from whatever fragments exist in those corners of the training corpus, producing outputs that diverge sharply from the English baseline.
For a company like a Nordic industrial group or a Central European insurance provider, this means the English-language reputation audit its communications team runs quarterly may bear little relationship to what a procurement officer in Warsaw or a regulator in Riga encounters when they ask the same question in their own language. The model is not translating; it is reconstructing, and it reconstructs differently.
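The tier pattern described above can be checked on a team's own monitoring data with a few lines of code. This sketch assumes each brand has one sentiment score per query language and a recognition tier label; the record layout, brand names, and scores are hypothetical, and standard deviation is just one reasonable choice of variability measure.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical records: (brand, recognition tier, {language: sentiment score}).
records = [
    ("BrandA", "high", {"en": 0.5, "fi": 0.5, "lv": 0.5}),
    ("BrandB", "low", {"en": 0.2, "fi": 0.6, "lv": 0.4}),
]

def variability_by_tier(rows):
    """Average, per tier, the spread of each brand's scores across languages."""
    spreads = defaultdict(list)
    for _brand, tier, scores in rows:
        # Population standard deviation of one brand's scores across languages.
        spreads[tier].append(pstdev(scores.values()))
    return {tier: mean(values) for tier, values in spreads.items()}

tier_spread = variability_by_tier(records)
for tier, value in tier_spread.items():
    print(tier, round(value, 3))
```

If lower-recognition brands in a portfolio show a consistently larger spread, that is a signal the English baseline is least reliable exactly where it is most relied on.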
Three concrete consequences
First, multilingual content investment produces unequal citation returns. A brand that publishes extensively in English but maintains only thin local-language digital presence gives models little to work with in those languages. The model fills the gap with whatever it has, and that material may be older, more negative, or simply less representative than the brand's current positioning.
Second, sentiment monitoring built on English queries will systematically misstate what non-English audiences see. Because English sits in the most critical language family, an English baseline will tend to overstate criticism relative to what users querying in Uralic or Baltic languages encounter. A financial institution tracking its AI reputation in German for its German retail business and in English for everything else is comparing two distributions that the study shows are not equivalent.