2024-tang-automatic
findings extracted from this paper
-
The automated probe list generation system discovered 45.79 potentially blocked domains per 1,000 domains crawled, compared to 4.11 for FilteredWeb — over 10× higher efficacy. It uncovered 1,490 potentially blocked domains in crawls of just 71,960 URLs, versus 1,255 blocked domains found by Hounsel et al. in crawls of 1,000,000 URLs, with 1,473 of the 1,490 domains not overlapping with prior work.
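
The arithmetic behind this comparison can be checked with a short sketch. The function name `hits_per_thousand` is an illustrative assumption, not something from the paper. Note that the 45.79 and 4.11 figures are reported per 1,000 domains, while the raw counts above are given per URL crawled, so the per-URL rates computed here are a different quantity and are not expected to match the per-domain figures.

```python
def hits_per_thousand(hits: int, crawled: int) -> float:
    """Potentially blocked hits per 1,000 items crawled."""
    return 1000 * hits / crawled

# Ratio of the per-1,000-domain rates reported in the finding above.
ratio = 45.79 / 4.11

# Per-1,000-URL rates derived from the raw counts (a different
# denominator than the per-domain figures, so values differ).
this_work_per_k_urls = hits_per_thousand(1_490, 71_960)
hounsel_per_k_urls = hits_per_thousand(1_255, 1_000_000)

# Share of discovered domains that do not overlap with prior work.
novel_share = 1_473 / 1_490

print(f"efficacy ratio (per-domain rates): {ratio:.2f}x")
print(f"this work, per 1,000 URLs:         {this_work_per_k_urls:.2f}")
print(f"Hounsel et al., per 1,000 URLs:    {hounsel_per_k_urls:.3f}")
print(f"non-overlapping share:             {novel_share:.1%}")
```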
-
GFW verification tests confirmed over 90% of OONI-detected DNS anomalies as true blocks: 429/457 domains in Beijing and 422/461 in Shanghai. In total, 527 unique domains were confirmed censored via DNS, HTTP, and HTTPS filters. An additional 718 domains were suspected to be blocked through IP-address-level blocking of their hosting servers rather than through domain-level entries.
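
As a quick check of the "over 90%" claim, the per-city confirmation rates can be computed from the counts above. This is only a sketch: the dictionary layout and variable names are assumptions made for illustration.

```python
# (confirmed true blocks, OONI-detected DNS anomalies) per vantage point,
# using the counts quoted in the finding above.
confirmed = {
    "Beijing": (429, 457),
    "Shanghai": (422, 461),
}

# Fraction of OONI DNS anomalies that verification confirmed as blocks.
rates = {city: ok / total for city, (ok, total) in confirmed.items()}

for city, rate in rates.items():
    print(f"{city}: {rate:.1%} of DNS anomalies confirmed")
```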
-
Only 36.66% of the 139,957 source list URLs (51,313) survived sanitization as live, meaningful pages. Of those removed, 18,911 URLs were dropped for lack of content and many more for dead links, underscoring how rapidly manually curated probe lists decay. In Beijing and Shanghai, over 20% of known domains were consistently inaccessible, versus fewer than 4.5% at all other vantage points. Over 68% of known domains remained blocked, suggesting censored topics stay sensitive even as URLs go stale.
-
Among inaccessible URLs that also triggered OONI anomalies, approximately 58% were generated by the Top2Vec-Trends pipeline (combining Top2Vec topic modeling with Google Trends keyword expansion), while LDA-TFIDF and Top2Vec alone each accounted for only 13–14%. BERTopic-generated pages were least effective at producing censored candidates.
-
VPS-based vantage points in Singapore and India detected censorship patterns similar to 'free' locations, failing to observe blocking known to be enforced by local ISPs following government directives. This occurred because ISP-level censorship is implemented per-carrier rather than centrally, and the VPS provider's ISP did not enforce those blocks — confirmed by re-testing from a residential IP that did observe the expected blocks.
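
The residential re-test described above amounts to a set comparison between vantage points in the same country. The following is a minimal sketch of that comparison. The function name `missed_by_vps` and the `.example` domains are hypothetical placeholders, not data or tooling from the paper.

```python
def missed_by_vps(vps_blocked: set[str], residential_blocked: set[str]) -> list[str]:
    """Domains blocked from a residential IP but reachable from the VPS.

    A non-empty result suggests the VPS provider's ISP does not enforce
    blocks that a consumer ISP in the same country does.
    """
    return sorted(residential_blocked - vps_blocked)

# Hypothetical measurement results for one country (placeholder domains).
vps_blocked = {"blocked-c.example"}
residential_blocked = {
    "blocked-a.example",
    "blocked-b.example",
    "blocked-c.example",
}

missed = missed_by_vps(vps_blocked, residential_blocked)
print(missed)
```

If `missed` is non-empty, the VPS vantage point is under-reporting censorship for that country, and its results should not be treated as representative of residential users.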