2024-ruo-lost
findings extracted from this paper
-
Across five popular translation services available in China (Alibaba, Baidu, Tencent, Youdao, and Microsoft Bing), researchers discovered 11,634 unique censorship rules in total. Every service — including the American-operated Bing Translate — implemented automatic censorship that silently omits content, with only Alibaba displaying any notification ('Query csi check not pass') to the user.
-
Alibaba and Bing Translate scan only the user's input text for censorship triggers, not the translation output, while Baidu and Tencent apply the same censorship rules to both input and output. Youdao censors input and output using different rule sets. Because Chinese-language rules dominate every service's blocklist, non-Chinese input rarely matches a rule. As a result, users translating from a non-Chinese language into Chinese with Alibaba or Bing, which never scan the Chinese output, experience materially less censorship than users of the other services.
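The difference between input-only scanning and input-plus-output scanning can be illustrated with a minimal sketch. It assumes rules are plain substrings; the function `is_censored`, the one-rule blocklist, and the example texts are hypothetical and are not taken from the paper or from any service's implementation.

```python
def is_censored(source, translation, input_rules, output_rules):
    """Return True if any rule string occurs in the text it is applied to."""
    hit_input = any(rule in source for rule in input_rules)
    hit_output = any(rule in translation for rule in output_rules)
    return hit_input or hit_output

# Hypothetical one-rule blocklist; real blocklists are far larger.
rules = ["习近平"]

# English source text and its Chinese translation.
source, translation = "Xi Jinping", "习近平"

# Input-only scanning (Alibaba/Bing-style): no rules applied to the output.
input_only = is_censored(source, translation, rules, [])
# Same rules on input and output (Baidu/Tencent-style).
input_and_output = is_censored(source, translation, rules, rules)

print(input_only, input_and_output)  # prints: False True
```

Under these assumptions, the English-to-Chinese query passes an input-only scanner but is caught by a scanner that also checks the output. A Youdao-style configuration would pass two different lists as `input_rules` and `output_rules`.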
-
Among 286 randomly sampled censorship rules across all five services, only one rule targeted erotic content, while the vast majority targeted political dissidents, CCP leaders, Tiananmen Square, Falun Gong, and government criticism. The paper interprets this near-total absence of pornography censorship as evidence that the censors did not anticipate their rules being audited, or are no longer interested in concealing the overtly political agenda of Chinese information control.
-
On Tencent Translate, 15 distinct representations of Xi Jinping's name each triggered censorship of the translator's entire output rather than just the offending sentence. They include romanizations (xijinping, XiJinping, XIJINPING, xIDaDa, xidada), Chinese-character forms (习近平, 习大大, 习主席, 习书记, 习总书记, 近平习, 反习大大), and reversed romanized forms (JinpingXi, jinpingxi). Between 4% and 5% of Tencent's discovered rules were inconsistently enforced, which the paper attributes either to load-balanced servers implementing different rule sets or to rapid rule churn.
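To make the distinction between whole-output and sentence-level censorship concrete, here is a small sketch. The trigger list is a subset of the variants listed above, matched as case-sensitive substrings; the function names, the naive sentence splitter, and the stand-in `translate` callable are all assumptions for illustration and do not describe Tencent's actual system.

```python
import re

# Hypothetical blocklist: a few of the name variants listed above.
TRIGGERS = ["习近平", "习大大", "xijinping", "JinpingXi"]

def has_trigger(text):
    return any(t in text for t in TRIGGERS)

def split_sentences(text):
    # Naive splitter: break after ASCII or CJK sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p for p in parts if p]

def censor_whole_output(text, translate):
    # Behavior described above: one trigger blanks the entire output.
    return "" if has_trigger(text) else translate(text)

def censor_per_sentence(text, translate):
    # Alternative for contrast: drop only the offending sentences.
    kept = [s for s in split_sentences(text) if not has_trigger(s)]
    return translate(" ".join(kept))

identity = lambda s: s  # stand-in for a real translation call
text = "Hello. 习近平 visited. Bye."
print(repr(censor_whole_output(text, identity)))  # prints: ''
print(repr(censor_per_sentence(text, identity)))  # prints: 'Hello. Bye.'
```

The first function models the reported behavior, in which unrelated sentences are lost along with the triggering one; the second shows what a narrower, sentence-level filter would have returned.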
-
Evidence from Youdao Translate suggests that it deploys a machine-learning or NLP-based classifier alongside keyword rules. The measured rules included repeated components (e.g., 螺+螺+螺+螺+螺+螺+蟢+D+哒+大) and nonsensical multi-token sequences that no human rule author would write, yet these rules consistently triggered censorship. Youdao yielded 9,414 unique rules from the general test set, the most of any service, and also produced the most structurally anomalous rule patterns.
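The `+`-joined rule notation can be read as a list of components that must all be present in a text. The sketch below parses such a rule and checks it against a string. Treating a repeated component as a required minimum number of occurrences is an assumption made here for illustration, not a claim about how Youdao or the paper evaluates rules; `parse_rule` and `rule_matches` are hypothetical helper names.

```python
from collections import Counter

def parse_rule(rule):
    """Split a '+'-joined rule into components with their multiplicities."""
    return Counter(rule.split("+"))

def rule_matches(rule, text):
    # Assumption: every component must occur at least as many times
    # as it is repeated in the rule (non-overlapping occurrences).
    return all(text.count(comp) >= n for comp, n in parse_rule(rule).items())

rule = "螺+螺+螺+螺+螺+螺+蟢+D+哒+大"
print(parse_rule(rule)["螺"])                         # prints: 6
print(rule_matches(rule, "螺" * 6 + "蟢D哒大"))        # prints: True
print(rule_matches(rule, "螺" * 5 + "蟢D哒大"))        # prints: False
```

A rule with six copies of one rare character is hard to explain as a hand-written keyword combination, which is consistent with the classifier hypothesis described above.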