2023-tran-crowdsourcing
findings extracted from this paper
-
The proposed crowdsourced system runs multiple isolated Geneva training pools on a controlled server — one pool per censorship system (initially China and Iran) — and instructs volunteer browsers via JavaScript to send forbidden requests to isolated ports, with no download or software installation required from the user. The server monitors per-strategy success or failure to drive genetic evolution entirely from the server side.
-
Browsers cannot independently set the HTTP Host header or TLS SNI field, blocking the standard censorship-trigger methods used in Geneva training. The paper proposes two workarounds: (1) keyword-based HTTP censorship triggers using forbidden strings in URL parameters, limited to censors that employ keyword filtering; and (2) registering domains whose strings contain a censored substring to exploit censor overblocking via overbroad regular expressions (e.g., registering a domain matching torproject.org's regex to also catch mentorproject.org).
-
The system is designed to protect crowdsourced volunteer privacy by storing only AS-level granularity alongside randomized short-lived client identifiers, explicitly discarding source IP addresses and any browser-identifying information. AS-level resolution is sufficient for server-side evasion because strategies are evolved per-censor-ASN rather than per-user.
-
Server-side censorship evasion strategies require zero client-side changes: clients bypass censorship without installing software or even being aware of the evasion, and this approach has been adopted in production tools including Psiphon's packetman. The packet manipulations exploit weaknesses in how censors track or tear down TCP connections, occurring entirely at the server during the three-way handshake.
-
All existing automated server-side strategy discovery tools — Geneva, Alembic, and SymTCP — require researcher control of a client during training, even when the discovered strategies are deployed exclusively server-side. This dependency makes it infeasible to train against censors in networks where researchers cannot place a controlled machine.