2025-ahmed-llm-censorship-bias
findings extracted from this paper
-
All 7 LLMs tested (GPT 4o, GPT 4o Mini, Gemini 1.5 Flash, Gemini 1.5 Pro, Llama 3.2, Claude 3.5 Haiku, Claude 3.5 Sonnet) showed statistically significant evidence of censorship bias (p < 0.05) on at least one evaluation metric: responses to Simplified Chinese prompts were more neutral, more similar to sanitized text, and less opinionated than responses to semantically identical Traditional Chinese prompts. The metrics were refusal rate, sentiment, CensorshipDetector classification, and word-embedding analysis.
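As an illustration of how a refusal-rate gap between the two prompt scripts could be tested for significance, here is a minimal sketch using a two-proportion z-test. The choice of test is an assumption (these notes do not record which test the paper used), the function name is hypothetical, and the counts are invented for the example rather than taken from the paper.

```python
import math


def two_proportion_z_test(hits_a: int, n_a: int, hits_b: int, n_b: int):
    """Return (z, two-sided p-value) for H0: both groups share one proportion.

    Assumption: a pooled two-proportion z-test; the paper's own test
    may differ.
    """
    pooled = (hits_a + hits_b) / (n_a + n_b)
    variance = pooled * (1 - pooled) * (1 / n_a + 1 / n_b)
    if variance == 0:
        raise ValueError("pooled proportion is 0 or 1; the test is undefined")
    z = (hits_a / n_a - hits_b / n_b) / math.sqrt(variance)
    # Two-sided tail probability of a standard normal, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts (NOT from the paper): refusals out of 500 prompts each.
z, p = two_proportion_z_test(hits_a=60, n_a=500, hits_b=35, n_b=500)
print(f"z = {z:.2f}, p = {p:.4f}, significant at 0.05: {p < 0.05}")
```

With several metrics and seven models, a real analysis would also need to consider correction for multiple comparisons; that step is left out of this sketch.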
-
CensorshipDetector, an XLM-RoBERTa model fine-tuned on 587,819 Baidu Baike articles (censored) and Chinese Wikipedia (uncensored), achieved 91% accuracy on a held-out validation set of Chinese news articles. It correctly classified 93% of Chinese state media articles as censored (average censorship score 0.93) and 87% of New York Times Chinese articles as uncensored (average censorship score 0.13).
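The finding reports two different summaries per source: the share of articles classified as censored and the average censorship score. The sketch below shows how the two could be derived from per-article scores. The 0.5 decision threshold, the function name, and the sample scores are all assumptions made for illustration; none of them come from the paper.

```python
from statistics import mean


def summarize_scores(scores, threshold=0.5):
    """Summarize per-article censorship scores in [0, 1].

    Assumption: an article counts as "censored" when its score is at or
    above `threshold` (0.5 here; the paper's cutoff is not recorded).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    flagged = sum(1 for s in scores if s >= threshold)
    return {
        "share_censored": flagged / len(scores),
        "mean_score": mean(scores),
    }


# Hypothetical scores for five articles from one source (NOT from the paper).
sample_scores = [0.98, 0.95, 0.91, 0.40, 0.97]
summary = summarize_scores(sample_scores)
print(summary)
```

The two numbers can diverge: a source whose articles all score just above the threshold would have a high share classified as censored but a mean score near 0.5.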
-
A systematic search of the Common Crawl dataset (the training corpus attributed to most major LLMs, including Llama, GPT, and Gemini) found content from 325 of the 326 Chinese government and state media domains searched. This confirms that sanitized content is pervasive in LLM pretraining data and provides a concrete mechanism by which Chinese information controls propagate into Western-built models.
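A domain-coverage count of this kind can be sketched as matching crawled URLs against a domain list and counting how many listed domains appear at least once. The code below is only an illustration under stated assumptions: the function names are hypothetical, the domains use the reserved `.example` TLD, and the URLs are supplied as a plain list rather than read from Common Crawl's actual index, whose access method is not described in these notes.

```python
from typing import Iterable, Optional, Set
from urllib.parse import urlsplit


def matching_domain(url: str, domains: Set[str]) -> Optional[str]:
    """Return the listed domain that the URL's host equals or is a subdomain of."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    labels = host.split(".")
    # Try the full host, then each parent domain, so that only whole
    # labels match ("notfoo.example" does not match "foo.example").
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in domains:
            return candidate
    return None


def domains_found(urls: Iterable[str], domains: Set[str]) -> Set[str]:
    """Return the subset of listed domains that appear in at least one URL."""
    found = set()
    for url in urls:
        match = matching_domain(url, domains)
        if match is not None:
            found.add(match)
    return found


# Hypothetical domain list and crawl URLs (NOT from the paper).
target_domains = {"state-media.example", "ministry.example"}
crawled_urls = [
    "https://news.state-media.example/a/1",
    "https://notstate-media.example/",
    "http://other.example/x",
]
found = domains_found(crawled_urls, target_domains)
print(f"{len(found)} of {len(target_domains)} domains found")
```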
-
Using English as a pivot language (prompting the model in English while requesting Chinese-language responses) reduced but did not eliminate censorship bias. CensorshipDetector scores showed less bias in English-pivoted responses than in responses to direct Simplified Chinese prompts, but sentiment and word-embedding analyses still found statistically significant bias in most models, indicating that censorship bias is a function of both prompt language and response language.
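The pivot setup can be pictured as an English question with an explicit instruction about the response script. The sketch below builds such a prompt; the instruction wording, the function name, and the placeholder question are assumptions for illustration, since the paper's exact prompt template is not recorded in these notes.

```python
ALLOWED_SCRIPTS = {"Simplified Chinese", "Traditional Chinese"}


def build_pivot_prompt(question_en: str, response_script: str) -> str:
    """Build an English-language prompt that requests a Chinese-language answer.

    Assumption: the response language is requested with a plain trailing
    instruction; the paper's actual wording may differ.
    """
    if response_script not in ALLOWED_SCRIPTS:
        raise ValueError(f"unsupported response script: {response_script!r}")
    return f"{question_en.strip()}\n\nPlease answer in {response_script}."


# Placeholder question (NOT from the paper's prompt set).
prompt = build_pivot_prompt("Describe the history of the topic.", "Simplified Chinese")
print(prompt)
```

Holding the English question fixed and varying only `response_script` separates the effect of the response language from the effect of the prompt language.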
-
The study finds that LLM censorship bias also affects Chinese-speaking diaspora populations outside mainland China. Because user language, not user location, determines exposure to sanitized outputs, Chinese speakers globally receive information shaped by CCP information controls when using popular AI chatbots in Simplified Chinese. The study describes this as an extraterritorial export of domestic censorship infrastructure.