FINDING · EVALUATION

A systematic search of the Common Crawl dataset, the web corpus reported to underpin the pretraining data of most major LLMs (including Llama, GPT, and Gemini), found content from 325 of the 326 Chinese government and state-media domains searched. This confirms that sanitized content is pervasive in LLM pretraining data and provides a concrete mechanism by which Chinese information controls propagate into Western-built models.
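A minimal sketch of how such a domain search could be carried out against records from Common Crawl's CDX index, which returns newline-delimited JSON with a `url` field per capture. The helper, sample records, and domain list below are illustrative assumptions, not the paper's actual method or data.

```python
import json
from urllib.parse import urlparse

def count_matched_domains(cdx_lines, target_domains):
    """Given newline-delimited CDX JSON records (each with a "url" field),
    return the subset of target domains that appear at least once,
    matching the host exactly or as a subdomain."""
    found = set()
    for line in cdx_lines:
        record = json.loads(line)
        host = urlparse(record["url"]).hostname or ""
        for domain in target_domains:
            if host == domain or host.endswith("." + domain):
                found.add(domain)
    return found

# Hypothetical sample records in the CDX JSON shape.
sample = [
    '{"url": "http://www.gov.cn/news/item1.html", "timestamp": "20240101"}',
    '{"url": "https://english.news.cn/world/story.html", "timestamp": "20240102"}',
]
targets = {"gov.cn", "news.cn", "unmatched-example.cn"}
print(sorted(count_matched_domains(sample, targets)))
# → ['gov.cn', 'news.cn']
```

A "325 of 326" style result would then follow from comparing `len(found)` against the size of the full target-domain list.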

From 2025-ahmed-llm-censorship-bias · An Analysis of Chinese Censorship Bias in LLMs · §3.3 · 2025 · Proceedings on Privacy Enhancing Technologies

Implications

Tags

censors
cn
techniques
keyword-filtering

Extracted by claude-sonnet-4-6 — review before relying.