FINDING · DETECTION
ICLab's semi-automated block page discovery combines clustering of HTML tag-frequency vectors with locality-sensitive hashing (LSH) of page text. It identified 48 previously unknown block page signatures from 13 countries: 15 via structural clustering across 5 countries and 33 via textual similarity clustering across 8 countries. The system seeds from 308 manually verified regular expressions and sorts candidate clusters by URL-to-country ratio (largest ratio discovered: 286) to prioritize them for manual review, so that detection no longer depends on brittle hand-maintained regex lists alone.
From 2020-niaki-iclab — ICLab: A Global, Longitudinal Internet Censorship Measurement Platform · §IV-C · 2020 · Symposium on Security & Privacy
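To make the two similarity signals concrete, here is a minimal sketch, not ICLab's actual implementation. It assumes tag-frequency vectors are compared with cosine similarity and that the text LSH is MinHash over word 3-shingles; the summary names neither choice, and the function names (`tag_vector`, `cosine`, `minhash`, `estimated_jaccard`) are hypothetical.

```python
import hashlib
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags; page text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_vector(html):
    """Return a Counter mapping tag name -> number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def cosine(a, b):
    """Cosine similarity of two sparse tag-count vectors (assumed metric)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def _hash(seed, gram):
    # One seeded 64-bit hash per MinHash row, derived from SHA-256.
    digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(text, num_hashes=64, k=3):
    """MinHash signature over word k-shingles (assumed LSH family)."""
    words = re.findall(r"\w+", text.lower())
    # Texts shorter than k words fall back to a single shingle.
    grams = {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}
    return [min(_hash(seed, g) for g in grams) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature rows, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Two block pages that differ only in a court reference or ISP name have identical tag vectors and mostly overlapping shingles, so both scores stay high where an exact-match regex would fail. Any similarity threshold for grouping pages would need tuning against verified samples.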
Implications
- Censors routinely deploy block pages with minor textual variations (different legal citations, ISP names, court references) that defeat exact-match regexes. Circumvention clients that infer a 'blocked' state should therefore use structural and textual similarity (tag frequencies, fuzzy text hashing) rather than string matching, so that they do not miss censor-served error pages.
- The URL-to-country ratio signal (many URLs mapping to the same response across few countries) can cheaply identify when a server IP has been redirected to a censor-controlled block page host, making it a useful lightweight probe for circumvention infrastructure.
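The ratio sort can be sketched as follows. This is one reading of the signal, assuming the ratio is the number of distinct URLs that returned a given response cluster divided by the number of distinct countries where it was seen. The `(cluster_id, url, country)` tuple layout and the function name are hypothetical.

```python
from collections import defaultdict


def rank_by_url_country_ratio(observations):
    """Rank response clusters by distinct URLs / distinct countries.

    observations: iterable of (cluster_id, url, country) tuples
    (hypothetical layout, one tuple per measurement).
    Returns a list of (ratio, cluster_id), highest ratio first.
    """
    urls = defaultdict(set)
    countries = defaultdict(set)
    for cluster_id, url, country in observations:
        urls[cluster_id].add(url)
        countries[cluster_id].add(country)
    ranked = [(len(urls[c]) / len(countries[c]), c) for c in urls]
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked
```

A cluster served for many unrelated URLs in only one or two countries rises to the top of the list, which is where a reviewer would look first for a new block page.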
Extracted by claude-sonnet-4-6 — review before relying.