2025-sivan-sevilla-probing
findings extracted from this paper
-
152 of 5,478 crawled domains (approximately 2.8%) deployed active bot-detection measures (CAPTCHA delivery or perimeter protection) that blocked automated OpenWPM crawling entirely. The authors note that this disproportionately excludes untrustworthy sites, biasing the training dataset toward well-resourced trustworthy outlets and limiting recall on the untrustworthy class.
-
A Random Forest classifier trained solely on structural features of third-party request trees achieves a ROC AUC of 0.81 and 72% balanced accuracy across 4,660 news domains with ≥50 daily observations. Performance degrades to ROC AUC 0.78 and 0.68 when domains are required to have ≥100 and ≥150 daily observations respectively, a drop driven by reduced training-set size rather than feature quality.
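As a reminder of what the two reported metrics measure, here is a minimal pure-Python sketch. The labels, scores, 0.5 threshold, and the convention that 1 means untrustworthy are made-up illustration values, not data or settings from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


def roc_auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (assumed): 2 untrustworthy (1) and 6 trustworthy (0) sites.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # hypothetical 0.5 threshold

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy, balanced_accuracy(y_true, y_pred), roc_auc(y_true, scores))
```

On this toy set plain accuracy is 0.75 while balanced accuracy is about 0.67, because the minority class is only half recovered. This is why balanced accuracy is the more informative figure when one class dominates.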
-
The five most important predictive features are: (1) average children per non-leaf tree node, (2) 7-day rolling average of maximum tree breadth, (3) 7-day rolling average of average breadth, (4) average children per parent, and (5) 7-day rolling average of third-party requests. Temporal stability features (rolling means and daily deltas) rank ahead of most static snapshot features, indicating that behavioral consistency over time is more discriminative than point-in-time structure.
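The sketch below shows one plausible way to compute features of this kind from a request tree. The exact definitions are assumptions, since this note does not spell them out: breadth is taken as the number of nodes per tree level (root included), and the rolling average is a trailing mean over up to seven daily values. All domain names are hypothetical.

```python
from collections import deque


def tree_features(children, root):
    """Structural features of one request tree given as {node: [child, ...]}."""
    level, breadths = [root], []
    n_children = n_parents = 0
    while level:
        breadths.append(len(level))  # assumed: breadth = nodes at this depth
        next_level = []
        for node in level:
            kids = children.get(node, [])
            if kids:  # only non-leaf nodes count as parents
                n_parents += 1
                n_children += len(kids)
            next_level.extend(kids)
        level = next_level
    return {
        "max_breadth": max(breadths),
        "avg_breadth": sum(breadths) / len(breadths),
        "avg_children_per_parent": n_children / n_parents if n_parents else 0.0,
    }


def rolling_mean(values, window=7):
    """Trailing mean over up to `window` most recent daily values."""
    buf, out = deque(maxlen=window), []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out


# Hypothetical tree: the visited site loads three third parties,
# two of which trigger further requests.
tree = {
    "news.example": ["a.example", "b.example", "c.example"],
    "a.example": ["d.example", "e.example"],
    "b.example": ["f.example"],
}
features = tree_features(tree, "news.example")
daily_max_breadth = [3, 5, 4]  # made-up daily values
smoothed = rolling_mean(daily_max_breadth, window=2)
print(features, smoothed)
```

A daily delta feature could be derived the same way by differencing consecutive daily values before or after smoothing. Which variant the authors used is not stated here.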
-
Of 8,004 unique third-party domains identified across 3,410 crawled news sites, 997 appear exclusively on untrustworthy websites and 2,992 appear exclusively on trustworthy ones. Domains disproportionately associated with untrustworthy sites include Yandex, Zemanta, and PayPal; domains exclusive to trustworthy sites are predominantly small-to-medium advertising and analytics actors rather than major platform giants.
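The exclusivity counts amount to set differences over per-class unions of third-party domains. If the three groups partition the 8,004 domains, subtraction would leave 8,004 - 997 - 2,992 = 4,015 domains seen on both classes. A minimal sketch with hypothetical site names, labels, and domains:

```python
def exclusive_domains(site_third_parties, labels):
    """Split third-party domains into trustworthy-only, untrustworthy-only, shared."""
    seen = {"trustworthy": set(), "untrustworthy": set()}
    for site, domains in site_third_parties.items():
        seen[labels[site]].update(domains)
    only_trust = seen["trustworthy"] - seen["untrustworthy"]
    only_untrust = seen["untrustworthy"] - seen["trustworthy"]
    shared = seen["trustworthy"] & seen["untrustworthy"]
    return only_trust, only_untrust, shared


# Hypothetical crawl output: site -> third-party domains observed.
site_third_parties = {
    "daily.example": {"cdn.example", "ads-a.example"},
    "gazette.example": {"cdn.example", "metrics.example"},
    "rumor.example": {"cdn.example", "ads-b.example"},
}
labels = {
    "daily.example": "trustworthy",
    "gazette.example": "trustworthy",
    "rumor.example": "untrustworthy",
}

only_trust, only_untrust, shared = exclusive_domains(site_third_parties, labels)
print(sorted(only_trust), sorted(only_untrust), sorted(shared))
```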
-
Trustworthy news sites show markedly more complex third-party structures than untrustworthy ones, with roughly twice the value on each measure: mean MaxBreadth 39.22 vs 19.63, mean ThirdPartyRequests 137.45 vs 74.12, and mean unique third-party domains 44.15 vs 20.31. This finding runs counter to prior work (Han et al. 2022), and the authors attribute it to untrustworthy sites being under-resourced and optimized for content spread rather than user experience.