2015-hiruncharoenvate-algorithmically
findings extracted from this paper
-
Blocking all homophones of the 422 censored keywords would generate approximately 47,000 false-positive weibos per day per keyword, or roughly 20 million false positives daily in total. That is approximately 20% of Sina Weibo's daily message volume, so blanket homophone blocklisting is operationally infeasible without massive collateral censorship of innocent traffic.
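
The arithmetic behind this finding can be checked with a short sketch. The keyword count, per-keyword rate, and 20% share come from the finding itself; the function names are illustrative, and the daily message volume is only back-calculated from the stated share rather than quoted from the paper.

```python
# Figures taken from the finding above.
CENSORED_KEYWORDS = 422
FALSE_POSITIVES_PER_KEYWORD = 47_000  # false-positive weibos per day per keyword
REPORTED_SHARE = 0.20                 # stated share of daily message volume


def total_false_positives(keywords: int, per_keyword: int) -> int:
    """Daily false positives if homophones of every keyword are blocked."""
    return keywords * per_keyword


def implied_daily_volume(false_positives: int, share: float) -> float:
    """Daily message volume implied by the stated share (a back-calculation)."""
    return false_positives / share


total = total_false_positives(CENSORED_KEYWORDS, FALSE_POSITIVES_PER_KEYWORD)
volume = implied_daily_volume(total, REPORTED_SHARE)

print(f"false positives per day: {total:,}")
print(f"implied daily volume:    {volume:,.0f}")
```

The product is 19,834,000, which rounds to the quoted 20 million; a 20% share then implies a daily volume on the order of 100 million weibos.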
-
Homophone-transformed weibos remained on Sina Weibo for an average of 3.94 hours (σ=5.51) before removal, versus 1.3 hours (σ=1.25) for unaltered, originally censored posts, a threefold difference (W=1830, p<0.01). Ultimate censorship rates did not differ significantly between the two conditions.
-
Falling back to human review to defeat the homophone technique would cost the Sina Weibo censorship apparatus more than 15 additional human-hours per day per censored keyword. The estimate assumes an efficient censorship worker reading approximately 50 weibos per minute (Zhu et al. 2013), applied to the ~47,000 daily false-positive matches per keyword. The burden scales with the number of simultaneously banned keywords, which may number in the thousands.
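
A minimal sketch of the human-hours estimate follows. The match volume and reading speed are the figures in the finding; the function names are illustrative, and the 1,000-keyword case is a hypothetical chosen only to show how the burden scales, not a number from the paper.

```python
FALSE_POSITIVES_PER_KEYWORD = 47_000  # daily false-positive matches per keyword
WEIBOS_PER_MINUTE = 50                # efficient worker's reading speed (Zhu et al. 2013)


def review_hours_per_keyword(matches_per_day: int, weibos_per_minute: int) -> float:
    """Human-hours per day needed to read every false-positive match for one keyword."""
    minutes = matches_per_day / weibos_per_minute
    return minutes / 60


def review_hours_total(keywords: int, hours_per_keyword: float) -> float:
    """Total daily human-hours, assuming the cost is the same for every keyword."""
    return keywords * hours_per_keyword


hours = review_hours_per_keyword(FALSE_POSITIVES_PER_KEYWORD, WEIBOS_PER_MINUTE)
hypothetical_total = review_hours_total(1_000, hours)  # 1,000 keywords is an assumption

print(f"hours per keyword per day: {hours:.2f}")
print(f"hours per day for 1,000 keywords (hypothetical): {hypothetical_total:,.0f}")
```

Reading 47,000 weibos at 50 per minute takes 940 minutes, or about 15.67 hours, which matches the "more than 15" human-hours stated above.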
-
Replacing censored keywords with algorithmically generated homophones increased the initial publication rate on Sina Weibo from 90.79% for unaltered posts to 94.74% for transformed posts (χ²=6.219, p=0.01). This indicates that the technique helps posts pass automatic keyword matching at the publication gate, even though they are ultimately censored at similar rates.
-
Native Chinese-speaking Amazon Mechanical Turk workers understood the content of 605 out of 608 homophone-transformed posts (99.51%), and only 2.85% of all impressions (52/1,824) reported difficulty. Workers who could not identify the keywords were significantly more likely to report confusion (p<0.001 for original keywords, p=0.03 for transformed keywords).
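
The percentages in this finding can be reproduced from the raw counts, as in the sketch below. The counts are those given above; the helper name is illustrative, and the impressions-per-post ratio is an inference from the counts rather than something stated in these notes.

```python
POSTS_UNDERSTOOD = 605
POSTS_TOTAL = 608
IMPRESSIONS_WITH_DIFFICULTY = 52
IMPRESSIONS_TOTAL = 1_824


def percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100 * part / whole


understood_pct = percent(POSTS_UNDERSTOOD, POSTS_TOTAL)
difficulty_pct = percent(IMPRESSIONS_WITH_DIFFICULTY, IMPRESSIONS_TOTAL)
# Inferred, not stated in the notes: how many impressions each post received.
impressions_per_post = IMPRESSIONS_TOTAL / POSTS_TOTAL

print(f"posts understood:       {understood_pct:.2f}%")
print(f"impressions w/ trouble: {difficulty_pct:.2f}%")
print(f"impressions per post:   {impressions_per_post:g}")
```

The counts are consistent with each of the 608 posts being shown to three workers, since 1,824 is exactly 3 × 608.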