2019-xiong-efficient
findings extracted from this paper
-
The component-aware binary splitting algorithm (CompAwareBinSplit) requires on average 35.47 messages per article to isolate a sensitive keyword combination, about 10.3% of the 342.72 required by the previously used algorithm. It is also the only evaluated algorithm that correctly handles overlapping keyword components and multiple co-occurring combinations.
-
WeChat, Alibaba Wangwang, Zhihu, and Sina Weibo all implement keyword combination filtering — messages are blocked only when every component of a blacklisted combination appears simultaneously, regardless of order. This allows censors to target sensitive contexts (e.g., 习近平 + 三连任 [Xi Jinping + three consecutive terms]) without filtering neutral mentions of individual terms.
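The blocking rule described above can be sketched as a simple predicate. This is a minimal illustration, assuming plain substring matching against a locally known blacklist; the function name `is_blocked` and the blacklist structure are hypothetical, and the platforms' actual server-side implementations are not public.

```python
from typing import FrozenSet, Iterable


def is_blocked(message: str, blacklist: Iterable[FrozenSet[str]]) -> bool:
    """Return True if any blacklisted combination is fully present.

    Assumption: a combination is an unordered set of components, and a
    component counts as present when it occurs as a substring anywhere
    in the message.
    """
    return any(
        all(component in message for component in combination)
        for combination in blacklist
    )


# Hypothetical one-entry blacklist built from the example in the notes.
blacklist = [frozenset({"习近平", "三连任"})]

both = is_blocked("习近平 ... 三连任", blacklist)      # both components present
reversed_order = is_blocked("三连任 ... 习近平", blacklist)  # order does not matter
single = is_blocked("习近平 visited today", blacklist)  # neutral single mention
```

Under these assumptions, `both` and `reversed_order` are True while `single` is False, which matches the behavior of filtering the sensitive context without blocking individual terms.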
-
The previously used bisection algorithm required an average of 342.72 messages per news article to isolate a triggering keyword combination, and produced incorrect results in 44% of test cases — primarily because the Unilateral Elimination Flaw caused it to miss components that appeared multiple times in an article.
-
Server-side keyword enumeration on Chinese platforms has become increasingly uneconomical: platforms now require non-virtual phone numbers for account registration, and test accounts are banned after sending a threshold volume of sensitive content. The paper's 5,521-article dataset and 1,956 confirmed keyword combinations were collected via sample testing between September 2017 and October 2018, with registration costs being the primary limiting factor for research scale.
-
WeChat censors messages even when keyword components overlap within the message text. For example, the combination 帶來 + 調整 + 整體 + 領域 [bring + adjust + overall + domain] triggers filtering in the fused form 帶來abc調整體xyz領域, where 調整 and 整體 share the character 整. No previously published algorithm correctly identified overlapping components; only CompAwareBinSplit resolves this by advancing the search window from index i+1 rather than past the full matched span.
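The effect of advancing from i+1 can be illustrated with a local scan over the fused example. This is only a sketch of the overlap issue, not CompAwareBinSplit itself, which works against a black-box censorship oracle rather than a known component list. It assumes i is the start index of a matched component; `find_spans` and its `allow_overlap` flag are hypothetical names.

```python
from typing import List, Sequence, Tuple


def find_spans(
    text: str, components: Sequence[str], allow_overlap: bool = True
) -> List[Tuple[int, int, str]]:
    """Scan text left to right and record (start, end, component) matches.

    With allow_overlap=True the scan resumes at i + 1 after a match, so a
    component that starts inside the previous match is still found. With
    allow_overlap=False it resumes past the full matched span.
    """
    spans = []
    i = 0
    while i < len(text):
        # First component (in list order) that starts at position i, if any.
        match = next((c for c in components if text.startswith(c, i)), None)
        if match is None:
            i += 1
            continue
        spans.append((i, i + len(match), match))
        i = i + 1 if allow_overlap else i + len(match)
    return spans


components = ["帶來", "調整", "整體", "領域"]
fused = "帶來abc調整體xyz領域"

with_overlap = [c for _, _, c in find_spans(fused, components, allow_overlap=True)]
without_overlap = [c for _, _, c in find_spans(fused, components, allow_overlap=False)]
```

Here `with_overlap` contains all four components, while `without_overlap` misses 整體 because the scan jumps past 調整 and lands on 體.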