2024-gao-extended
findings extracted from this paper
-
CenDTect (Tsai et al., NDSS 2024) uses decision trees and a novel clustering method on Censored Planet plus OONI data to identify blocking policies and provide interpretable insights at local and country levels. A separate approach (Duncan & Chen, 2023) applies sequence-to-sequence models and CNN image classification — treating network reachability data as grayscale images — to distinguish censored from uncensored content.
-
Brown et al. (2023) combined supervised ML models trained on expert-labeled data with unsupervised models establishing a baseline of 'normal' behavior to detect DNS-based censorship from Satellite and OONI datasets, achieving high true-positive rates for both known and new DNS censorship instances. The hybrid supervised/unsupervised approach is proposed as a template for the LLM-based system.
-
The proposed LLM-based censorship detection system plans to use ICLab as the primary dataset for its semantic richness across all network-stack levels, then cross-reference with OONI and Censored Planet to reduce false negatives. The paper explicitly notes ICLab lacks the scale and geographic coverage of OONI/Censored Planet but offers richer per-measurement context suited to LLM feature learning.
-
The daily volume of network reachability data collected by censorship monitoring platforms such as ICLab, OONI, and Censored Planet surpasses the 16 GB Books Corpus and English Wikipedia that BERT was trained on. This scale mismatch motivates applying LLMs — which thrive on large unlabeled corpora — to censorship measurement data rather than hand-labeling for rule-based systems.
-
Rule-based censorship detection systems rely on predefined regular expressions designed by human experts and fail to adapt to evolving censor techniques, leading to false negatives and poor scalability as data volume grows. In contrast, learning-based models are described as thriving on large data volumes and offering contextual understanding that rule-based systems lack.