2025-ahmed-llm-censorship-bias
An Analysis of Chinese Censorship Bias in LLMs
Abstract
When a large language model (LLM) is trained on text shaped by state
censorship, those censorship decisions implicitly bias the model's
outputs. The authors define this phenomenon as censorship bias: a
model trained on sanitized content is less likely to reflect
prohibited views and more likely to reflect permitted ones,
particularly when prompted in a language used predominantly in a
region with strong censorship laws. They introduce a methodology for
identifying and measuring censorship bias and apply it to popular
LLMs; as part of it, they build CensorshipDetector, a
Chinese-language classifier that distinguishes sanitized from
non-sanitized text with 91% accuracy.
Their evaluation finds evidence of censorship bias in every model
tested; the authors discuss the resulting harms (notably the export
of domestic information manipulation to diaspora populations) and
possible mitigations.
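
The abstract does not say how CensorshipDetector is implemented. As a
rough illustration of the underlying task, binary classification of
Chinese text as sanitized or non-sanitized, the sketch below assumes
a simple TF-IDF character-n-gram pipeline over hypothetical labeled
data; it is not the authors' actual model.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled corpus: 1 = drawn from a censored (sanitized)
# source, 0 = drawn from an uncensored source.
texts = ["placeholder sanitized example", "placeholder uncensored example"]
labels = [1, 0]

# Character n-grams sidestep Chinese word segmentation entirely.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)

# Predict whether unseen text looks sanitized (1) or not (0).
print(clf.predict(["some new text to classify"]))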