2011-espinoza-automated
findings extracted from this paper
-
During a two-month run in 2011 that coincided with the Jasmine Revolution protests, the keyword blacklist that China's backbone routers apply to HTTP GET requests showed no additions or removals at daily, weekly, or even monthly granularity. Numerous current-event terms that triggered search engine censorship produced no RST responses to GET requests, indicating that the two censorship mechanisms are updated on entirely different timescales.
-
A maximum entropy named entity extraction (NEE) model trained on Chinese-language Wikipedia achieved 89.63% recall and 83.44% specificity for person names, 96.3% recall and 69.80% specificity for place names, and 87.56% recall and 88.40% specificity for organization names. Although precision for person names was only 0.42%, the system reduces the number of words requiring censorship probes by nearly an order of magnitude while retaining nearly 90% of actual named entities.
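The combination of high recall, high specificity, and very low precision can look contradictory, so a small worked example may help. The sketch below computes the three metrics from confusion-matrix counts. The counts are hypothetical and chosen only to show the effect of rare positives; they are not taken from the paper and do not reproduce its figures.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return (recall, specificity, precision) from confusion-matrix counts."""
    recall = tp / (tp + fn)        # share of true names that were kept
    specificity = tn / (tn + fp)   # share of non-names that were discarded
    precision = tp / (tp + fp)     # share of kept candidates that are true names
    return recall, specificity, precision


# Hypothetical counts (assumption, not from the paper):
# 1,000 true names hidden among 1,000,000 candidate words.
tp, fn = 900, 100
fp, tn = 150_000, 849_000

recall, specificity, precision = confusion_metrics(tp, fp, tn, fn)

# Factor by which the candidate list shrinks before probing.
reduction = (tp + fp + tn + fn) / (tp + fp)

print(f"recall={recall:.2%} specificity={specificity:.2%} "
      f"precision={precision:.2%} reduction={reduction:.1f}x")
```

Because true names are a tiny fraction of all candidates, even a modest false-positive rate swamps the true positives, so precision stays below 1% while the probe list still shrinks several-fold and most real names survive.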
-
To measure Chinese search engine censorship independently of backbone GET request filtering, the authors split each search engine HTTP GET request across multiple TCP packets, so that the server would reassemble the full query while routers performing single-packet keyword inspection would never see a complete match. This technique allowed ground-truth measurement of search engine responses without interference from backbone RST injection.
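The splitting idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the example host `search.invalid`, the query parameter `wd`, and the placeholder keyword are all assumptions. It cuts the request in the middle of the first occurrence of the keyword only, and writing segments separately with `TCP_NODELAY` makes separate packets likely but does not guarantee them.

```python
import socket
import time
from typing import List


def split_across_keyword(request: bytes, keyword: bytes) -> List[bytes]:
    """Cut the request inside the first occurrence of keyword.

    Returns the request unchanged (as one segment) if the keyword is
    absent or too short to be split.
    """
    idx = request.find(keyword)
    if idx == -1 or len(keyword) < 2:
        return [request]
    cut = idx + len(keyword) // 2
    return [request[:cut], request[cut:]]


def send_segments(sock: socket.socket, segments: List[bytes],
                  pause: float = 0.2) -> None:
    """Write each segment separately, hoping for one TCP packet per write.

    Assumption: disabling Nagle's algorithm and pausing between writes is
    usually enough to keep segments in separate packets; the OS may still
    coalesce or fragment them.
    """
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    for segment in segments:
        sock.sendall(segment)
        time.sleep(pause)


# Hypothetical request; "keyword" stands in for a sensitive term.
keyword = b"keyword"
request = b"GET /s?wd=keyword HTTP/1.1\r\nHost: search.invalid\r\n\r\n"
segments = split_across_keyword(request, keyword)
```

No single segment contains the full keyword, while concatenating the segments in order yields the original request, which is what the receiving server's TCP stack reconstructs.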
-
A controlled probe of two Chinese search engines found that the query 'fuck' triggered a legal notice that results had been removed, while 'fuck you' did not, suggesting that search engine censorship suppresses websites where a sensitive term appears prominently rather than matching exact byte strings in the query itself. The paper concludes this mechanism is topical and website-removal-based, not a static keyword blacklist.
-
During the 2011 Jasmine Revolution, words such as 'Jasmine Flower,' terms linked to Liu Xiaobo's Nobel Prize, and numeric references to presidential rent criticism triggered Chinese search engine censorship (results-removed warnings) but produced no HTTP GET request RST injections. This demonstrates that search engine filtering and backbone keyword filtering are independently operated layers that diverge sharply for rapidly evolving current-event content.