2024-gao-extended

Extended Abstract: Leveraging Large Language Models to Identify Internet Censorship through Network Data

Tianyu Gao, Ping Ji · Free and Open Communications on the Internet · 2024


Tags

censors: generic
techniques: ml-classifier

findings extracted from this paper

CenDTect (Tsai et al., NDSS 2024) uses decision trees and a novel clustering method on Censored Planet and OONI data to identify blocking policies and provide interpretable insights at local and country levels. A separate approach (Duncan & Chen, 2023) applies sequence-to-sequence models and CNN image classification, treating network reachability data as grayscale images, to distinguish censored from uncensored content.

§2 Related Works evaluation measurement-platform ml-classifier generic
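
To make the image-based idea above concrete, the following is a minimal sketch of one way raw measurement bytes could be laid out as a grayscale grid for a CNN. The byte-per-pixel encoding, the function name `to_grayscale`, and the row width are assumptions for illustration; the summary does not describe the encoding Duncan & Chen actually used.

```python
def to_grayscale(payload: bytes, width: int = 8) -> list:
    """Pack bytes row by row into a grid of 0-255 pixel intensities.

    Hypothetical encoding: one byte per pixel, last row zero-padded.
    """
    # Pad with zero bytes so the length is a multiple of the row width.
    padded = payload + bytes(-len(payload) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


# Toy input: the first bytes of an HTTP response status line.
image = to_grayscale(b"HTTP/1.1 403", width=8)
for row in image:
    print(row)
```

Each inner list is one image row; a real pipeline would also need to fix the image height so every sample has the same shape.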
Brown et al. (2023) combined supervised ML models trained on expert-labeled data with unsupervised models establishing a baseline of 'normal' behavior to detect DNS-based censorship from Satellite and OONI datasets, achieving high true-positive rates for both known and new DNS censorship instances. The hybrid supervised/unsupervised approach is proposed as a template for the LLM-based system.

§2 Related Works evaluation dns-poisoning measurement-platform ml-classifier generic
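
The hybrid idea above can be sketched in a few lines. This is a toy illustration, not the models from Brown et al.: a nearest-centroid classifier stands in for the supervised model, a per-feature z-score check stands in for the unsupervised baseline of normal behavior, and the two features (share of answers matching a control resolver by IP and by ASN) and all data values are hypothetical.

```python
from math import dist
from statistics import mean, pstdev


class NearestCentroid:
    """Supervised stand-in: assign the label of the closest class centroid."""

    def fit(self, X, y):
        self.centroids = {
            label: tuple(mean(col) for col in zip(*[x for x, l in zip(X, y) if l == label]))
            for label in set(y)
        }
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda label: dist(x, self.centroids[label]))


class ZScoreBaseline:
    """Unsupervised stand-in: flag points far from the unlabeled 'normal' data."""

    def fit(self, X):
        cols = list(zip(*X))
        self.mu = [mean(c) for c in cols]
        self.sigma = [pstdev(c) or 1e-9 for c in cols]  # avoid division by zero
        return self

    def is_anomalous(self, x, threshold=3.0):
        return any(abs(v - m) / s > threshold for v, m, s in zip(x, self.mu, self.sigma))


# Hypothetical features per DNS measurement: (ip_match_ratio, asn_match_ratio).
labeled_X = [(1.0, 1.0), (0.9, 1.0), (0.0, 0.0), (0.1, 0.0)]
labeled_y = ["ok", "ok", "censored", "censored"]
unlabeled = [(1.0, 1.0), (0.9, 0.9), (1.0, 0.9), (0.95, 1.0), (0.9, 1.0)]

clf = NearestCentroid().fit(labeled_X, labeled_y)
baseline = ZScoreBaseline().fit(unlabeled)


def detect(x):
    """Flag a measurement if either model considers it suspicious."""
    return clf.predict(x) == "censored" or baseline.is_anomalous(x)


print(detect((0.0, 0.1)))    # resembles the labeled censored examples
print(detect((0.7, 0.95)))   # classifier says "ok", baseline flags the deviation
print(detect((0.95, 1.0)))   # typical measurement
```

The second sample shows why the two models are combined: it does not resemble any labeled censorship example, yet it deviates from the learned baseline, which is the kind of previously unseen case the unsupervised half is meant to catch.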
The authors plan to build the proposed LLM-based censorship detection system on ICLab as the primary dataset, chosen for its semantic richness across all network-stack levels, and then cross-reference with OONI and Censored Planet to reduce false negatives. The paper explicitly notes that ICLab lacks the scale and geographic coverage of OONI and Censored Planet but offers richer per-measurement context suited to LLM feature learning.

§4 Proposed Future Works evaluation measurement-platform ml-classifier generic
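
The cross-referencing step above could look roughly like the sketch below. The paper summary does not specify a join key or record schema, so the `(country, domain)` key, the simplified record tuples, and the function name `cross_reference` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical, simplified records: (platform, country, domain, anomaly_flagged).
records = [
    ("iclab", "XX", "example.org", False),
    ("ooni", "XX", "example.org", True),
    ("censored_planet", "XX", "example.org", True),
    ("iclab", "YY", "example.net", False),
    ("ooni", "YY", "example.net", False),
]


def cross_reference(records, primary="iclab"):
    """Find (country, domain) pairs the primary platform marked clean
    but at least one other platform flagged: possible false negatives."""
    by_key = defaultdict(dict)
    for platform, country, domain, flagged in records:
        # Toy simplification: a later record for the same platform overwrites an earlier one.
        by_key[(country, domain)][platform] = flagged

    candidates = []
    for key, flags in by_key.items():
        disagreeing = sorted(p for p, f in flags.items() if p != primary and f)
        if flags.get(primary) is False and disagreeing:
            candidates.append((key, disagreeing))
    return candidates


print(cross_reference(records))
```

Pairs returned here are only candidates for review; disagreement between platforms can also come from differences in vantage points or measurement time.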
The daily volume of network reachability data collected by censorship monitoring platforms such as ICLab, OONI, and Censored Planet surpasses the 16 GB of text (BooksCorpus plus English Wikipedia) that BERT was trained on. This scale motivates applying LLMs, which thrive on large unlabeled corpora, to censorship measurement data rather than hand-labeling it for rule-based systems.

§2 Related Works evaluation measurement-platform ml-classifier generic
Rule-based censorship detection systems rely on predefined regular expressions designed by human experts and fail to adapt to evolving censorship techniques, leading to false negatives and poor scalability as data volume grows. In contrast, learning-based models are described as thriving on large data volumes and offering contextual understanding that rule-based systems lack.

§1 Introduction evaluation measurement-platform ml-classifier generic
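
The weakness of rule-based detection described above can be shown with a small sketch. The two regular expressions and the sample response bodies are hypothetical and stand in for an expert-curated signature set; they are not taken from any real platform's rules.

```python
import re

# Hypothetical blockpage signatures; real rule sets are curated by human experts.
BLOCKPAGE_RULES = [
    re.compile(r"access to this site has been blocked", re.IGNORECASE),
    re.compile(r"<title>\s*blocked by order of", re.IGNORECASE),
]


def is_blockpage(body: str) -> bool:
    """Return True if any predefined signature matches the response body."""
    return any(rule.search(body) for rule in BLOCKPAGE_RULES)


# A blockpage that matches a known signature.
known = "<html><title>Blocked by order of the regulator</title></html>"
# The same kind of page after the censor rewords it.
reworded = "<html><title>This page is unavailable in your region</title></html>"

print(is_blockpage(known))
print(is_blockpage(reworded))
```

The reworded page is missed until someone writes a new rule for it, which is the false-negative and maintenance problem the finding attributes to rule-based systems.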