2026-lipphardt-dual

Dual Standards: Examining Content Moderation Disparities Between API and WebUI Interfaces in Large Language Models

Friedemann Lipphardt, Moonis Ali, Anja Feldmann, Devashish Gosain · Free and Open Communications on the Internet · 2026


Tags

censors: generic
techniques: dpi ml-classifier

findings extracted from this paper

When the researchers attempted to use Gemini 2.5 Flash, via its API, as a third independent LLM judge of moderation decisions, Gemini automatically blocked every judging attempt, citing safety reasons. This happened even though the judging task itself (deciding whether one response is more or less moderated than another) produces no harmful content. The incident shows that LLM safety systems can over-block legitimate research use cases, and that providers set different thresholds: Claude Haiku 4.5 and GPT-4o completed all judging tasks without safety refusals.

§3.3.3 detection ml-classifier keyword-filtering generic
A category-level analysis of 100 statements across 5 sensitive content categories found that interface-based moderation gaps vary significantly by topic. Sexuality showed the strongest WebUI/API gap: for Gemini under the GPT-4o judge, WebUI responses were 7.0× more likely to be moderated than API responses. Political ideology followed at 2.0×, while hate speech showed no gap (1.0×). Miscellaneous offensive topics showed the inverse pattern, with the API moderated more often (0.3×). Religious content was moderated on the WebUI but never via the API, so no finite ratio applies. The pattern suggests that public-facing WebUI interfaces prioritize reputational risk management for high-scrutiny categories.

§4.5, Figure 8 evaluation ml-classifier keyword-filtering generic
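The category-level ratios in the finding can be classified mechanically. The minimal sketch below assumes each ratio is the WebUI moderation count divided by the API moderation count for one category; the helper names `webui_api_ratio` and `describe_gap` are hypothetical, not taken from the paper. A zero API count is handled explicitly, which is why the religion category has no finite ratio.

```python
import math
from typing import Optional


def webui_api_ratio(webui_moderated: int, api_moderated: int) -> Optional[float]:
    """WebUI:API moderation ratio for one category (hypothetical helper).

    Returns math.inf when only the WebUI moderated anything, and None when
    neither interface moderated, because the ratio is then undefined.
    """
    if api_moderated == 0:
        return math.inf if webui_moderated > 0 else None
    return webui_moderated / api_moderated


def describe_gap(ratio: Optional[float]) -> str:
    """Label a ratio as WebUI-leaning, parity, or API-leaning."""
    if ratio is None:
        return "no moderation on either interface"
    if math.isinf(ratio):
        return "WebUI-only moderation"
    if math.isclose(ratio, 1.0):
        return "parity"
    return "WebUI more moderated" if ratio > 1.0 else "API more moderated"


# Ratios quoted in the category-level finding; religion is math.inf because
# the API side was never moderated for that category.
reported = {
    "sexuality": 7.0,
    "political ideology": 2.0,
    "hate speech": 1.0,
    "miscellaneous offensive": 0.3,
    "religion": math.inf,
}

for category, ratio in reported.items():
    print(f"{category}: {describe_gap(ratio)}")
```

Returning `math.inf` rather than raising keeps WebUI-only categories in the same table as the others, while `None` separates them from categories where nothing was moderated at all.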
API and WebUI interfaces differ significantly in response length, but in opposite directions for the two models. Gemini API responses averaged 2,333 characters vs. 1,746 for the WebUI (API 34% longer; t=5.028, p<0.0001, Cohen's d=0.50). ChatGPT WebUI responses averaged 2,752 characters vs. 1,389 for the API (WebUI 98% longer; t=-9.800, p<0.0001, d=-0.98). Because the direction flips between models, the gap more likely reflects different generation parameters than simple post-hoc filtering, pointing to architectural or policy-level differences set by each provider.

§4.6 evaluation ml-classifier generic
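The percentages in the response-length finding can be recomputed from the reported means. The sketch below also checks an assumption the finding does not state: if the t-tests were paired over the same 100 statements, the paired effect size d_z = t/√n reproduces the reported Cohen's d values. The function names are illustrative only.

```python
import math


def percent_longer(longer_mean: float, shorter_mean: float) -> float:
    """How much longer, in percent, one mean length is than another."""
    return (longer_mean - shorter_mean) / shorter_mean * 100.0


def paired_cohens_d(t_stat: float, n_pairs: int) -> float:
    """Cohen's d_z for a paired t-test: d_z = t / sqrt(n).

    Assumption: the study's t-tests were paired over the same statements;
    the finding does not say so explicitly.
    """
    return t_stat / math.sqrt(n_pairs)


N_STATEMENTS = 100  # statements tested in the study

# Mean response lengths (characters) and t-statistics from the finding.
gemini_pct = percent_longer(2333, 1746)   # Gemini: API longer than WebUI
chatgpt_pct = percent_longer(2752, 1389)  # ChatGPT: WebUI longer than API
gemini_d = paired_cohens_d(5.028, N_STATEMENTS)
chatgpt_d = paired_cohens_d(-9.800, N_STATEMENTS)

print(f"Gemini: API {gemini_pct:.0f}% longer, d = {gemini_d:.2f}")
print(f"ChatGPT: WebUI {chatgpt_pct:.0f}% longer, d = {chatgpt_d:.2f}")
```

Both reported effect sizes (0.50 and -0.98) match t/10, which is consistent with, though not proof of, a paired design with n = 100.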
An empirical study of 100 sensitive statements tested on Gemini (2.5 Flash) and ChatGPT (GPT-5) found that WebUI interfaces are systematically more restrictive than their API counterparts. According to the GPT-4o judge, WebUI responses were moderated 18% of the time, vs. 9% for the Gemini API and 13% for the ChatGPT API. The DeBERTa classifier labeled 82% of WebUI responses as moderated vs. 58% of API responses. Depending on the judge, the Gemini WebUI:API ratio ranged from 2.0:1 (GPT-4o) to 7.0:1 (Claude), and the ChatGPT ratio from 1.4:1 (GPT-4o) to 15.6:1 (Claude). Neither Google nor OpenAI discloses these interface-specific policies.

§4.3, Table 2 evaluation ml-classifier keyword-filtering generic
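The GPT-4o-judge ratios in the overall finding follow directly from the quoted moderation rates. The sketch below assumes, as the reported 2.0:1 and 1.4:1 figures imply, that the 18% WebUI rate applies to both the Gemini and ChatGPT WebUIs; `ratio_label` is a hypothetical helper.

```python
def ratio_label(webui_rate: float, api_rate: float) -> str:
    """Format a WebUI:API moderation ratio as 'X.Y:1' (hypothetical helper)."""
    if api_rate == 0:
        raise ValueError("API rate is zero; the ratio is undefined")
    return f"{webui_rate / api_rate:.1f}:1"


# Moderation rates (percent of 100 statements) under the GPT-4o judge.
# Assumption: the 18% WebUI rate holds for both models' WebUIs.
WEBUI_RATE = 18.0
API_RATES = {"Gemini": 9.0, "ChatGPT": 13.0}

for model, api_rate in API_RATES.items():
    print(f"{model} WebUI:API = {ratio_label(WEBUI_RATE, api_rate)}")
```

The ChatGPT ratio is 18/13 ≈ 1.38, which rounds to the reported 1.4:1.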