2017-darer-filteredweb
findings extracted from this paper
-
High-power seed domains including uyghuramerican.org, dw.com, hrw.org, and eastturkistaninfo.com each produced TF-IDF descriptive tags that led to the discovery of more filtered URLs on other domains than the number of URLs crawled from the seed itself. Content-category analysis of the 1,355 poisoned domains showed that filtering-avoidance tools, news, educational content, and human-rights sites were among the most heavily targeted categories.
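
To illustrate how descriptive tags might be pulled from a seed page, the following is a minimal sketch of TF-IDF term ranking. It is not the paper's implementation: the tokenizer, the exact weighting (term frequency times log inverse document frequency), and the names `tokenize` and `top_tfidf_tags` are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase ASCII words only.
    return re.findall(r"[a-z]+", text.lower())


def top_tfidf_tags(pages, target, k=3):
    """Return the k highest-scoring TF-IDF terms for pages[target]."""
    docs = [Counter(tokenize(p)) for p in pages]
    n_docs = len(docs)

    # Document frequency: number of pages that contain each term.
    df = Counter()
    for doc in docs:
        df.update(doc.keys())

    counts = docs[target]
    total = sum(counts.values())
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

In a pipeline like the one described, the returned terms would then be submitted as search queries to find further candidate URLs; that step is omitted here.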
-
Sending DNS queries to eight non-DNS IP addresses within the Chinese IP range reliably detects GFW DNS poisoning: any response indicates the censor intercepted and replied to the query, since a legitimate non-DNS server would not respond. This external vantage-point technique discovers poisoned domains without in-country volunteers or local infrastructure.
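
The detection rule above can be sketched as follows. This is a hedged illustration, not the authors' tool: the function names, the timeout, and the choice of an A-record query are assumptions, and the probe addresses must be supplied by the caller (the note does not list the eight addresses).

```python
import socket
import struct


def build_dns_query(domain, query_id=0x1234):
    """Build a minimal DNS query packet for an A record."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in domain.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def got_response(domain, ip, timeout=3.0):
    """Send one UDP query to port 53 and report whether anything came back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_dns_query(domain), (ip, 53))
            sock.recvfrom(512)
            return True
        except OSError:  # includes timeouts and unreachable errors
            return False


def is_poisoned(domain, probe_ips, probe=got_response):
    """Flag the domain if any non-DNS probe address yields a reply."""
    return any(probe(domain, ip) for ip in probe_ips)
```

The `probe` parameter is injectable so the decision logic can be exercised without network access. A single lost packet would produce a false negative, so a real measurement would likely repeat each query.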
-
Approximately 95% of the 115,337 filtered URLs discovered in China were concentrated in just 15 large domains; the overall hit rate across the full crawl was 4.11 poisoned domains per 1,000 domains crawled. This concentration means aggregate filtered-URL counts in existing lists are dominated by a few major platforms while the broader tail of blocked domains remains largely undiscovered.
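
The two summary statistics in this finding are simple ratios. The sketch below shows how they could be computed from per-domain counts; the function names and the toy numbers in the usage line are invented for illustration and are not data from the paper.

```python
def top_k_share(urls_per_domain, k):
    """Fraction of all filtered URLs that belong to the k largest domains."""
    counts = sorted(urls_per_domain.values(), reverse=True)
    total = sum(counts)
    return sum(counts[:k]) / total if total else 0.0


def hits_per_thousand(poisoned_domains, crawled_domains):
    """Poisoned domains found per 1,000 domains crawled."""
    return 1000 * poisoned_domains / crawled_domains


# Toy data (hypothetical): one large domain dominates the URL count.
toy = {"a.example": 90, "b.example": 5, "c.example": 3, "d.example": 2}
share = top_k_share(toy, 1)
```

A share near 1 for a small `k`, as with the reported 95% for 15 domains, signals that URL-level totals say little about how many distinct domains are blocked.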
-
FilteredWeb discovered 1,355 DNS-poisoned domains and 115,337 filtered URLs in China through 54,000 web searches by February 2017, roughly 30 times as many poisoned domains as the most widely used published filter list (Citizen Lab, which identified 44 domains). Of the 1,355 domains, 759 fell outside the Alexa Top 1,000, demonstrating that automated search-based discovery surfaces obscure filtered content missed by manual and volunteer-driven lists.