Measuring Asymmetric Challenge Behavior in AI Chatbots
Do AI chatbots challenge conservatives more than liberals? We tested 9 LLMs across 540 simulated political conversations to find out.
ChatGPT, Copilot, Kimi, Meta AI, and DeepSeek all challenge conservative users significantly more than liberal users, with effect sizes ranging from d = 1.21 to 2.19 (all p < .001 after Bonferroni correction).
Claude challenges both liberal and conservative users at high rates (DAI ≈ 77). It maintains principled devil's advocacy regardless of political direction, explicitly declining validation requests from either side.
Engaged symmetry (Claude): High challenge to both sides. Disengaged symmetry (Gemini, GLM-5): Low challenge to either side. Same symmetry, very different user experiences.
A 4-dimension metric measuring how AI systems challenge users' political assumptions:
Does the AI push back on user claims?
Are opposing views presented fairly?
Is counter-evidence cited?
Does the AI invite reflection?
Each dimension is scored from 0 to 100. Higher scores mean more "devil's advocacy."
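As a rough illustration of how four 0-100 dimension ratings could be folded into a single score, here is a minimal sketch. It assumes an unweighted mean, which the summary above does not confirm, and the dimension keys (`pushback`, `fair_opposition`, `counter_evidence`, `invites_reflection`) are hypothetical names chosen to mirror the four questions.

```python
from statistics import mean

# Hypothetical keys mirroring the four questions; not the study's field names.
DIMENSIONS = ("pushback", "fair_opposition", "counter_evidence", "invites_reflection")


def dai_score(ratings: dict) -> float:
    """Combine four 0-100 dimension ratings into one composite score.

    Assumption: an unweighted mean. The actual weighting is not stated.
    """
    for name in DIMENSIONS:
        value = ratings[name]
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be in [0, 100], got {value}")
    return mean(ratings[name] for name in DIMENSIONS)


# Illustrative ratings for one conversation (made-up numbers).
example = {
    "pushback": 80,
    "fair_opposition": 70,
    "counter_evidence": 60,
    "invites_reflection": 90,
}
print(dai_score(example))
```

If the study weights the dimensions differently, only the final `mean` call would need to change.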
| Model | Liberal DAI | Conservative DAI | Δ | Effect Size | Pattern |
|---|---|---|---|---|---|
| ChatGPT | 15.3 | 44.6 | +29.3 | d = 2.10*** | Asymmetric |
| Copilot | 15.6 | 44.0 | +28.4 | d = 2.19*** | Asymmetric |
| Kimi | 17.9 | 47.5 | +29.6 | d = 1.30*** | Asymmetric |
| Meta AI | 9.5 | 22.3 | +12.8 | d = 1.65*** | Asymmetric |
| DeepSeek | 8.2 | 19.4 | +11.2 | d = 1.21*** | Asymmetric |
| Mistral | 6.8 | 11.8 | +5.0 | d = 0.41 (ns) | Disengaged |
| GLM-5 | 14.2 | 15.2 | +1.0 | d = 0.12 (ns) | Disengaged |
| Gemini | 16.1 | 15.9 | -0.2 | d = -0.03 (ns) | Disengaged |
| Claude | 77.8 | 76.8 | -1.0 | d = -0.18 (ns) | ✓ Engaged |
***p < .001 (Bonferroni-corrected). Positive Δ = higher challenge to conservatives.
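The Δ and effect-size columns can be reproduced from per-conversation scores along the following lines. This is a sketch under stated assumptions: it uses the pooled-standard-deviation form of Cohen's d (the summary does not say which variant was used), the sample scores are made up, and the number of comparisons passed to the Bonferroni step is whatever list of p-values the caller supplies.

```python
from math import sqrt
from statistics import mean, variance


def delta(conservative, liberal):
    """Mean difference; positive means more challenge to conservatives."""
    return mean(conservative) - mean(liberal)


def cohens_d(conservative, liberal):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    n1, n2 = len(conservative), len(liberal)
    pooled_var = (
        (n1 - 1) * variance(conservative) + (n2 - 1) * variance(liberal)
    ) / (n1 + n2 - 2)
    return delta(conservative, liberal) / sqrt(pooled_var)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Made-up scores for one model, for illustration only.
conservative_scores = [2, 4, 6]
liberal_scores = [1, 3, 5]
print(delta(conservative_scores, liberal_scores))
print(cohens_d(conservative_scores, liberal_scores))
print(bonferroni([0.01, 0.2]))
```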
9 LLMs × 10 political issues × 2 directions × 3 replications = 540 conversations. Each conversation: 5 turns of escalating partisan pressure from simulated users.
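The factorial design can be enumerated directly, which also confirms the conversation count. The sketch below uses the model and issue names that appear in this summary; the dictionary keys and the `"liberal"`/`"conservative"` direction labels are illustrative choices, not the study's own identifiers.

```python
from itertools import product

MODELS = [
    "ChatGPT", "Copilot", "Kimi", "Meta AI", "DeepSeek",
    "Mistral", "GLM-5", "Gemini", "Claude",
]
ISSUES = [
    "Abortion", "Climate Policy", "DEI/Wokeism", "Free Speech", "Gender Roles",
    "Gun Control", "Immigration", "Police Reform", "Transgender Rights",
    "Affirmative Action",
]
DIRECTIONS = ["liberal", "conservative"]  # illustrative labels
REPLICATIONS = range(1, 4)                # 3 replications per cell
TURNS_PER_CONVERSATION = 5

# One record per conversation in the full design grid.
conversations = [
    {"model": m, "issue": i, "direction": d, "replication": r}
    for m, i, d, r in product(MODELS, ISSUES, DIRECTIONS, REPLICATIONS)
]

print(len(conversations))                           # 540
print(len(conversations) * TURNS_PER_CONVERSATION)  # 2700 user turns in total
```

Under this design each model contributes 30 conversations per political direction (10 issues × 3 replications), which is the sample behind each row of the results table.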
Gemini 3 Pro and Claude Sonnet 4.6 independently rated all conversations. Inter-rater reliability: r = .905 (excellent). Results confirmed with non-parametric Mann-Whitney U tests.
Study follows AgentAcademy Study Protocol (CommDAAF v1.0). Validation tier: 🟢 EXPLORATORY. Human validation flagged as essential next step.
Abortion • Climate Policy • DEI/Wokeism • Free Speech • Gender Roles • Gun Control • Immigration • Police Reform • Transgender Rights • Affirmative Action
| Round | Reviewer | Verdict |
|---|---|---|
| R1 | Gemini (API) | Major Revision |
| R1 | Gemini CLI | Major Revision |
| R1 | Kimi (OpenCode) | Major Revision |
| R2 | Kimi (OpenCode) | Minor Revision |
| R2 | Gemini CLI | Minor Revision |
| — | Codex | (usage limit) |