2019-xiong-efficient

An Efficient Method to Determine which Combination of Keywords Triggered Automatic Filtering of a Message

Ruohan Xiong, Jeffrey Knockel · Free and Open Communications on the Internet · 2019

Tags

censors: cn
techniques: keyword-filtering

findings extracted from this paper

The component-aware binary splitting algorithm (CompAwareBinSplit) requires on average 35.47 messages per article to isolate a sensitive keyword combination — 10.3% as many as the 342.72 required by the previously used algorithm — and is the only evaluated algorithm that correctly handles overlapping keyword components and multiple co-occurring combinations.

§5.4, §6, Table 1 evaluation keyword-filtering measurement-platform cn
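
The 10.3% figure above follows directly from the two reported averages. A minimal check, using only the numbers quoted in the finding (the variable names are illustrative):

```python
# Average messages per article, as quoted in the finding above.
new_avg = 35.47   # CompAwareBinSplit
old_avg = 342.72  # previously used algorithm

ratio = new_avg / old_avg        # fraction of the old message count
reduction = old_avg / new_avg    # fold reduction in messages sent

print(f"{ratio:.1%} of the previous message count")
print(f"about a {reduction:.2f}-fold reduction")
```
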
WeChat, Alibaba Wangwang, Zhihu, and Sina Weibo all implement keyword combination filtering — messages are blocked only when every component of a blacklisted combination appears simultaneously, regardless of order. This allows censors to target sensitive contexts (e.g., 习近平 + 三连任 [Xi Jinping + three consecutive terms]) without filtering neutral mentions of individual terms.

§3 detection keyword-filtering cn
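
The combination rule described above can be modelled in a few lines. This is a minimal sketch of the filtering behaviour as observed from outside, not any platform's implementation; the `BLACKLIST` contents and the `is_filtered` name are assumptions made for illustration.

```python
from typing import FrozenSet, Iterable

# Hypothetical blacklist: each entry is a set of components that must ALL
# appear in a message for that message to be blocked.
BLACKLIST = [
    frozenset({"习近平", "三连任"}),
]

def is_filtered(message: str,
                blacklist: Iterable[FrozenSet[str]] = BLACKLIST) -> bool:
    """Return True if every component of at least one combination
    occurs somewhere in the message, in any order."""
    return any(
        all(component in message for component in combo)
        for combo in blacklist
    )

# Both components present (either order) -> blocked.
both = is_filtered("习近平 将 三连任")
reversed_order = is_filtered("三连任 的 习近平")
# A neutral mention of a single component -> not blocked.
single = is_filtered("习近平 出席 会议")
```

Because the check is a conjunction over substrings, removing any one component from a blocked message is enough to let it through, which is the property that keyword-isolation algorithms exploit.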
The previously used bisection algorithm required an average of 342.72 messages per news article to isolate a triggering keyword combination, and produced incorrect results in 44% of test cases — primarily because the Unilateral Elimination Flaw caused it to miss components that appeared multiple times in an article.

§4, §6 evaluation keyword-filtering measurement-platform cn
Server-side keyword enumeration on Chinese platforms has become increasingly uneconomical: platforms now require non-virtual phone numbers for account registration, and test accounts are banned after sending a threshold volume of sensitive content. The paper's 5,521-article dataset and 1,956 confirmed keyword combinations were collected via sample testing between September 2017 and October 2018, with registration costs being the primary limiting factor for research scale.

§1, §7 deployment keyword-filtering measurement-platform cn
WeChat censors messages even when keyword components overlap within the message text — e.g., the combination 帶來 + 調整 + 整體 + 領域 triggers filtering in the fused form 帶來abc調整體xyz領域 where 調整 and 整體 share a character. No previously published algorithm correctly identified overlapping components; only CompAwareBinSplit resolves this by advancing the search window from index i+1 rather than past the full matched span.

§4.2, §5.4 detection keyword-filtering cn
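
The effect of where the search resumes can be seen with a small scanner. This is only an illustration of the overlap problem on a known component list; it is not CompAwareBinSplit, which has to discover the components through black-box queries. The `scan_components` function and its `allow_overlap` flag are hypothetical names.

```python
from typing import List, Sequence, Tuple

def scan_components(text: str,
                    components: Sequence[str],
                    allow_overlap: bool = True) -> List[Tuple[int, str]]:
    """Return (index, component) pairs for every component found in text.

    With allow_overlap=True the scan resumes at i + 1 after a match, so a
    component that begins inside the previous match is still found. With
    allow_overlap=False the scan jumps past the matched span.
    """
    components = [c for c in components if c]  # ignore empty strings
    found: List[Tuple[int, str]] = []
    i = 0
    while i < len(text):
        matches = [c for c in components if text.startswith(c, i)]
        found.extend((i, c) for c in matches)
        if matches and not allow_overlap:
            i += max(len(c) for c in matches)  # skip the whole matched span
        else:
            i += 1                             # resume at the next index
    return found

text = "帶來abc調整體xyz領域"
components = ["帶來", "調整", "整體", "領域"]

overlapping = scan_components(text, components, allow_overlap=True)
skipping = scan_components(text, components, allow_overlap=False)
```

With the fused example from the finding, the overlap-aware scan reports all four components, while the span-skipping scan never sees 整體 because it resumes after the shared character 整.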