2021-rambert-chinese
findings extracted from this paper
-
Only 8% of the keywords censored by Chinese chat clients (WeChat, Sina Weibo; ~63,200 terms in total) are also censored by GFW packet inspection, which indicates that the two blocklists are maintained independently. The GFW packet-inspection blocklist derived from the chat keywords contains up to 1,221 distinct censored keywords for outbound traffic. Just 68 keyword components account for all terms censored from Beijing, and 「六四」(June Fourth) alone is responsible for more than half of them.
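The component attribution above can be illustrated with a minimal sketch. It assumes a component "accounts for" a term when it occurs in the term as a substring. The function name and the toy term list are hypothetical and are not data from the paper.

```python
from collections import Counter


def component_coverage(terms, components):
    """Count, for each component, how many terms contain it as a substring.

    Returns (counts, unexplained), where unexplained lists the terms that
    contain none of the components.
    """
    counts = Counter()
    unexplained = []
    for term in terms:
        hits = [c for c in components if c in term]
        if not hits:
            unexplained.append(term)
        for c in hits:
            counts[c] += 1
    return counts, unexplained


# Hypothetical toy data: only the component 「六四」 comes from the notes.
terms = ["六四事件", "纪念六四", "六四", "示例词语"]
components = ["六四", "示例"]

counts, unexplained = component_coverage(terms, components)
share = counts["六四"] / len(terms)  # fraction of terms explained by 「六四」
```

Run against the real term list, the same counting would show how few components are needed and how dominant a single component is.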
-
The GFW inspects only two locations within an HTTP request for censored keywords: the path component of the request line and the Host header. Keywords are matched in both UTF-8 and GB 18030 encodings, after %-decoding. Cookie headers, custom headers (e.g., X-Tension), and POST body fields are not monitored. Even in the monitored positions, only approximately 75% of requests containing a censored keyword actually trigger a TCP RST disconnection.
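A probe generator for the positions listed above might look like the following sketch. It is not the paper's measurement tool: `build_probe`, the default host `example.com`, the cookie and form field name `k`, and the choice to place the keyword in a subdomain label for the Host case are all assumptions made for illustration.

```python
from urllib.parse import quote

# Positions the notes report as inspected; the others are reported as ignored.
MONITORED = {"path", "host"}


def build_probe(keyword, position, host="example.com", encoding="utf-8"):
    """Build raw HTTP request bytes with the %-encoded keyword in one position.

    encoding is "utf-8" or "gb18030", the two encodings named in the notes.
    """
    token = quote(keyword.encode(encoding), safe="")
    method, path, body = "GET", "/", ""
    headers = {"Host": host}
    if position == "path":
        path = "/" + token
    elif position == "host":
        headers["Host"] = f"{token}.{host}"  # assumption: keyword as a label
    elif position == "cookie":
        headers["Cookie"] = f"k={token}"
    elif position == "custom_header":
        headers["X-Tension"] = token
    elif position == "body":
        method, body = "POST", f"k={token}"
        headers["Content-Type"] = "application/x-www-form-urlencoded"
        headers["Content-Length"] = str(len(body))
    else:
        raise ValueError(f"unknown position: {position}")
    lines = [f"{method} {path} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    return ("\r\n".join(lines) + "\r\n\r\n" + body).encode("ascii")


probe = build_probe("六四", "path")
```

Because only about 75% of matching requests are reset, each probe would need to be repeated several times before a position is classified as unmonitored.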
-
After a censored connection, 50–75% of subsequent connections from the same client IP to the same server IP and port are blocked for 90 seconds, even when they contain no censored keywords (the "penalty box"). The penalty box is strictly scoped to the (client IP, server IP, server port) triple; other ports on the same server IP, and other server IPs, are unaffected. The GFW inspects HTTP traffic for keywords on every TCP port, not just port 80.
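The penalty-box behavior described above can be modeled with a small simulation. This is a sketch, not a description of the GFW's implementation: the class name, the default `block_probability` of 0.6 (an arbitrary value inside the reported 50–75% range), and the use of caller-supplied timestamps are assumptions. Whether later connections extend the timer is not stated in the notes and is not modeled.

```python
import random

PENALTY_SECONDS = 90  # duration reported in the notes


class PenaltyBox:
    """Toy model of a penalty box keyed by (client IP, server IP, server port)."""

    def __init__(self, block_probability=0.6, rng=None):
        self.block_probability = block_probability  # assumed value
        self.rng = rng or random.Random()
        self._expires = {}

    def record_censored(self, client_ip, server_ip, server_port, now):
        """Start a penalty period for this triple at time `now` (seconds)."""
        key = (client_ip, server_ip, server_port)
        self._expires[key] = now + PENALTY_SECONDS

    def in_penalty(self, client_ip, server_ip, server_port, now):
        """Return True while the triple is inside its penalty period."""
        expiry = self._expires.get((client_ip, server_ip, server_port))
        return expiry is not None and now < expiry

    def is_blocked(self, client_ip, server_ip, server_port, now):
        """Decide whether a keyword-free connection would be blocked."""
        if not self.in_penalty(client_ip, server_ip, server_port, now):
            return False
        return self.rng.random() < self.block_probability


box = PenaltyBox(block_probability=1.0)  # 1.0 makes the example deterministic
box.record_censored("10.0.0.1", "203.0.113.5", 80, now=0)
```

Keying the dictionary on the full triple is what keeps other ports and other servers unaffected in this model.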
-
The GFW enforces SNI-based blocking on every TCP port (not just 443): a known-censored hostname (e.g., facebook.com, zh.wikipedia.org) in the TLS ClientHello triggers TCP RST injection and a penalty box. The SNI blocklist is separate from the HTTP keyword blocklist; subdomains built from censored keywords and placed in the SNI did not trigger censorship. No evidence was found of indiscriminate HTTPS decryption or certificate substitution.
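To show what an on-path observer can read without decrypting anything, here is a minimal sketch that encodes and decodes the `server_name` extension carried in cleartext in a TLS ClientHello, following the layout in RFC 6066. The helper names are hypothetical, and the parser assumes a single host_name entry.

```python
import struct


def sni_extension(hostname):
    """Encode a server_name extension holding one host_name entry."""
    name = hostname.encode("ascii")
    # ServerName: name_type (0 = host_name), 2-byte length, name bytes
    entry = struct.pack("!BH", 0, len(name)) + name
    # ServerNameList: 2-byte length, then the entries
    name_list = struct.pack("!H", len(entry)) + entry
    # Extension: type (0 = server_name), 2-byte length, then the data
    return struct.pack("!HH", 0, len(name_list)) + name_list


def parse_sni(extension):
    """Return the hostname from a single-entry server_name extension, or None."""
    ext_type, _ext_len = struct.unpack("!HH", extension[:4])
    if ext_type != 0:
        return None
    name_type, name_len = struct.unpack("!BH", extension[6:9])
    if name_type != 0:
        return None
    return extension[9:9 + name_len].decode("ascii")


ext = sni_extension("facebook.com")
```

Nothing in this encoding depends on the destination port, which is consistent with the same hostname being matched on any TCP port.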
-
The GFW maintains two HTTP keyword sublists: 15 terms censored unconditionally, and approximately 60–63 additional terms censored only when the English word "search" also appears in the request URL. No other English word among the 10,000 most common, no Chinese search synonym (搜索, 查找, 关键词), and no common URL parameter abbreviation ("q", "kw", "s") replicates this expanded-censorship trigger.
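The two-sublist rule can be written as a short decision function. This is a sketch under stated assumptions: the URL is taken as already decoded, matching is plain case-sensitive substring search, and the keywords `KW_A` and `KW_B` are hypothetical placeholders rather than terms from either sublist.

```python
def is_censored(url, unconditional, search_only):
    """Apply the two-sublist rule to a decoded request URL.

    unconditional: keywords that trigger censorship on their own.
    search_only: keywords that trigger only if "search" is also in the URL.
    """
    if any(keyword in url for keyword in unconditional):
        return True
    if "search" in url:
        return any(keyword in url for keyword in search_only)
    return False


# Hypothetical placeholder keywords, not real blocklist entries.
unconditional = {"KW_A"}
search_only = {"KW_B"}

result = is_censored("/search?x=KW_B", unconditional, search_only)
```

In this model a parameter name such as "q" or "kw" has no effect; only the literal string "search" unlocks the second sublist.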