2022-waheed-darwin-s

Darwin's Theory of Censorship: Analysing the Evolution of Censored Topics with Dynamic Topic Models

Asim Waheed, Sara Qunaibi, Diogo Barradas, Zachary Weinberg · Workshop on Privacy in the Electronic Society · 2022

Tags

censors: generic
techniques: keyword-filtering

Findings extracted from this paper

Dynamic LDA (D-LDA) detected event-driven shifts in Indian censorship without prior knowledge: the word 'violence' disappeared from the 'Riots in India' topic cluster between months 6 and 14 of the measurement period, and 'killing' did not appear until month 16, consistent with the absence of actual riots during that window. Similarly, the 'Danish cartoonist' topic shifted from cartoon-focused discourse to broader Islamic-rights framing ('freedom,' 'speech') approximately 18 months into the measurement period.

§3.2, Figure 3 · evaluation · measurement-platform · keyword-filtering · in
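
The kind of word-level shift described above can be read off a topic's top words per time slice. The following is a minimal sketch, not the authors' code: the function name `presence_spans` and the toy word lists are hypothetical and are not data from the paper. It reports the runs of months in which a word appears among a topic's top words.

```python
from typing import List, Tuple


def presence_spans(top_words_by_month: List[List[str]], word: str) -> List[Tuple[int, int]]:
    """Return inclusive, 1-indexed (first_month, last_month) runs in which
    `word` is among the topic's top words."""
    spans: List[Tuple[int, int]] = []
    start = None  # first month of the run currently being tracked, if any
    for month, words in enumerate(top_words_by_month, start=1):
        present = word in words
        if present and start is None:
            start = month                      # a new run begins
        elif not present and start is not None:
            spans.append((start, month - 1))   # the run ended last month
            start = None
    if start is not None:                      # run still open at the end
        spans.append((start, len(top_words_by_month)))
    return spans


# Hypothetical toy data (NOT from the paper): top words of one topic, per month.
toy_topic = [
    ["riot", "violence"],
    ["riot", "violence"],
    ["riot", "police"],
    ["riot", "police"],
    ["riot", "violence"],
    ["riot", "killing"],
]

print(presence_spans(toy_topic, "violence"))  # [(1, 2), (5, 5)]
print(presence_spans(toy_topic, "killing"))   # [(6, 6)]
```

A gap between two runs, or a late first run, is the signal that would then be compared against real-world events.
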
Data gaps severely degrade D-LDA accuracy: erasing every other month reduced the corpus from 4,577 to 1,919 documents and caused the model to lose detection of 'Religion-motivated killing,' 'Religious websites,' 'Muslim Violence,' and 'Homicide' topics entirely. Erasing one in three months (1,479 documents) caused further topic loss, and even removing one random month altered topic evolution trajectories. For 25% of pages, the gap between Wayback Machine snapshots and ICLab observations exceeds one year.

§3.3, Figures 4–5, Tables 2–3 · evaluation · measurement-platform · in
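
A gap ablation of this kind can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the helper names `erase_every_kth_month` and `count_docs`, the choice of which months to drop (every k-th in chronological order), and the toy corpus are all hypothetical.

```python
from typing import Dict, List


def erase_every_kth_month(docs_by_month: Dict[str, List[str]], k: int) -> Dict[str, List[str]]:
    """Return a copy of the corpus with every k-th month removed.

    Keys are assumed to be "YYYY-MM" strings, so sorting them
    lexicographically also sorts them chronologically.
    """
    if k < 2:
        raise ValueError("k must be at least 2, otherwise every month is erased")
    months = sorted(docs_by_month)
    return {
        month: docs_by_month[month]
        for position, month in enumerate(months, start=1)
        if position % k != 0  # drop the k-th, 2k-th, ... month
    }


def count_docs(docs_by_month: Dict[str, List[str]]) -> int:
    """Total number of documents across all months."""
    return sum(len(docs) for docs in docs_by_month.values())


# Hypothetical toy corpus (NOT the paper's data): document IDs per month.
toy_corpus = {
    "2016-01": ["a", "b"],
    "2016-02": ["c"],
    "2016-03": ["d", "e", "f"],
    "2016-04": ["g"],
    "2016-05": ["h"],
    "2016-06": ["i", "j"],
}

every_other = erase_every_kth_month(toy_corpus, 2)
one_in_three = erase_every_kth_month(toy_corpus, 3)
print(count_docs(toy_corpus), count_docs(every_other), count_docs(one_in_three))  # 10 6 5
```

Refitting the topic model on each reduced corpus and comparing the recovered topics against the full-corpus run is what would reveal which topics are lost.
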
Dynamic LDA applied to ICLab longitudinal data for India (2016–2020) successfully identified 14 distinct censored topic clusters (including religious conflict, piracy, educational fraud, and political dissent) from 677 overtly censored URLs out of 6,012 tested (11.3% overtly censored at least once). The model required monthly time-slice granularity; daily and weekly granularities produced unstable results because document counts swung wildly from one slice to the next.

§3.1, Table 1 · evaluation · measurement-platform · keyword-filtering · in
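
To make the time-slice granularity concrete, the sketch below buckets observation dates into consecutive monthly slices and counts the documents in each. It is an assumption-laden illustration: the function name `monthly_time_slices` and the sample dates are hypothetical, and the source does not say which implementation the authors used. Per-slice counts of this shape are what dynamic topic model implementations typically consume, for example the `time_slice` argument of gensim's `LdaSeqModel`.

```python
from collections import Counter
from datetime import date
from typing import List, Tuple


def monthly_time_slices(dates: List[date]) -> List[Tuple[str, int]]:
    """Count documents per calendar month, from the earliest to the latest
    month observed. Months with no documents are reported with a count of 0
    so that gaps stay visible."""
    if not dates:
        return []
    counts = Counter((d.year, d.month) for d in dates)
    year, month = min(counts)   # earliest (year, month) bucket
    last = max(counts)          # latest (year, month) bucket
    slices: List[Tuple[str, int]] = []
    while (year, month) <= last:
        slices.append((f"{year:04d}-{month:02d}", counts.get((year, month), 0)))
        # advance to the next calendar month
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return slices


# Hypothetical observation dates (NOT from the paper).
sample_dates = [date(2016, 11, 3), date(2016, 11, 20), date(2017, 1, 5)]
print(monthly_time_slices(sample_dates))
# [('2016-11', 2), ('2016-12', 0), ('2017-01', 1)]
```

Running the same bucketing by day or week on sparse data makes the large swings in per-slice counts easy to see; empty slices would likely need to be merged or dropped before fitting a model.
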
India's censorship apparatus, while less aggressive than China's, legally mandates ISP-level blocking capability and has deployed it regularly. Of 6,012 URLs in ICLab's India test list observed since 2016, only 677 (11.3%) were ever overtly censored (block-page redirect); the majority of anomalies were covert (connection disruption mimicking network faults) and were excluded from analysis because they are ambiguous. Censored topics include not only political dissent but also copyright enforcement, indicating that infrastructure originally deployed for political control is routinely repurposed.

§2, §3.1 · policy · measurement-platform · keyword-filtering · in