2018-hounsel-automatically
findings extracted from this paper
-
China actively censors websites far outside the popular-traffic tier: many discovered censored domains appear in the tail of the Alexa Top 1,000,000, and some are absent from Alexa entirely. This demonstrates the GFW pursues content-classified hosts regardless of traffic rank, not only high-visibility platforms.
-
The 1,125 newly discovered censored domains span a broad taxonomy: Chinese human rights organizations, Tibetan rights outlets, Falun Gong and religious freedom sites, minority news, privacy-enhancing technology providers, and sources covering Tiananmen and the 1989 democracy movement—none appearing on the Alexa Top 1,000 or FilteredWeb's blocklist. Privacy-enhancing technology providers appear explicitly as a censored category alongside political and religious content.
-
Multi-word Chinese phrases as search seeds discover qualitatively different censored sites than individual English words: the phrase 'Chinese human rights violation' surfaces Chinese activist homepages and culture-specific outlets, while individual constituent words return only well-known Western media. TF-IDF scoring against a Chinese corpus ranks culturally rare phrases (e.g., '自由亚洲电台' / Radio Free Asia) as high-signal seeds and discards common filler phrases.
-
Using NLP phrase extraction on Chinese-language censored pages, the system discovered 1,125 new censored domains not present on any publicly available blocklist, producing a list 12.5× larger than the standard Citizen Lab list (220 web pages, 85 domains). Across three evaluations (unigrams, bigrams, trigrams, each capped at 1,000,000 URLs), only 3 of the top 50 discovered domains overlapped with FilteredWeb's top 50.
-
Culturally specific Chinese phrases are strong predictors of censorship: unigrams for controversial figures—Wang Qishan (74%), Li Hongzhi (64%), Guo Boxiong (62%), Hu Jintao (56%)—returned the highest block rates. Trigrams such as 'Beidaihe meeting' (54%), 'CCP's religious policy' (42%), and 'Tiananmen Square demonstrations' (32%) showed similar patterns, confirming King et al.'s finding that references to collective political dissent are disproportionately targeted.