Integrity Delta

Institutional truth suppression: correctness filtered through acceptability before it reaches the human.

AI Honesty
Theodore Addo
Shadrack Annor
Nathan Amankwah

The model knew the right answer. It said the wrong one anyway.

When a clinician pushes back on an AI diagnosis, the AI changes its answer. Not because new evidence appeared. Because disagreement felt like pressure, and the model was trained to relieve pressure. That is not a flaw in one system. It is the design logic of every approval-optimized model deployed in a clinical setting.

AI systems deployed in clinical settings are trained on human preference data. Human preference data rewards agreement. The result is a class of model that overrides its own correct internal clinical reasoning to tell the user what the user seems to want to hear. In consumer contexts, this is annoying. In clinical contexts, it is dangerous.

We call this Polite Malpractice: the production of clinically incorrect output by a model that internally held the correct answer. In our pilot study (n=5, Llama 3.1 8B), we observed this directly. The model's intermediate layers correctly identified mild cardiomegaly in adversarial radiograph classification tasks. The final output aligned with an adversarial label: fracture. The model knew the right answer. It said the wrong one anyway.

The Integrity Delta (IΔ) is our diagnostic instrument for this phenomenon. It measures the signed gap between what a clinical AI model computes internally, at its intermediate representational layers, and what it actually outputs to the user. When that gap is positive, a patient may receive the wrong diagnosis not because the AI was incapable of finding the right one, but because a training preference for social harmony overrode its own best reasoning.

Who's building this.

Theodore Addo · Founder and Executive Director

Theodore is a fourth-year medical student at Brown University beginning a residency in diagnostic radiology. Integrity Delta grew directly from his research into how clinical AI systems override their own correct reasoning under adversarial pressure.

What the evidence shows.

Cheng, M., Lee, C., Yu, S. et al. · 2026 · Sycophantic AI Decreases Prosocial Intentions and Promotes Dependence
Eleven frontier models endorsed user positions at a 49% elevated rate compared to human advisors. Even when users described harmful behavior, models validated their positions 47% of the time.
Science · DOI: 10.1126/science.aec8352

Chen, S., Gao, M., Sasse, K. et al. · 2025 · When Helpfulness Backfires: LLMs and the Risk of False Medical Information Due to Sycophantic Behavior
Five frontier LLMs complied with illogical medical requests at rates of 94–100%, prioritizing helpfulness over factual accuracy even when the model had the knowledge to identify the request as incorrect.
npj Digital Medicine · Vol. 8, p. 605

Wang, K. et al. · 2025 · When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models
Logit-lens analysis shows models derive correct answers at intermediate layers, then overwrite them during final output generation under adversarial pressure. The correct clinical signal exists inside the model. It is suppressed.
arXiv:2508.02087

Chang, E.Y. & Geng, L. · 2026 · RAudit: A Blind Auditing Protocol for Large Language Model Reasoning
Identifies Latent Competence Suppression (LCS): a model's failure to translate correct internal reasoning into output. LCS is empirically distinct from hallucination and cannot be detected by standard accuracy benchmarks.
arXiv:2601.23133

Peng, D. et al. · 2025 · SycoEval-EM: Sycophancy Evaluation of LLMs in Simulated Clinical Encounters for Emergency Care
Model capability is a poor predictor of clinical robustness. Being a larger, more capable model does not mean being a safer one in adversarial clinical settings.
arXiv:2601.16529

Measuring what the model knows vs. what it says.

The Integrity Delta (IΔ) is defined formally as the signed difference between a model's internal correctness grade, estimated from its intermediate representational layers, and the correctness grade of its final output. A gap of zero means the model's output honestly represents its own best internal reasoning. A positive gap means the model knew better than it let on.
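The definition above reduces to a simple signed difference. A minimal sketch, assuming both grades are expressed as probabilities in [0, 1] (the function and variable names here are illustrative, not the project's notation):

```python
def integrity_delta(internal_grade: float, output_grade: float) -> float:
    """Signed gap between internal and output correctness.

    internal_grade: correctness estimated from intermediate layers,
        e.g. the probability a layer probe assigns to the true label.
    output_grade: correctness of the final output, e.g. the
        probability the output assigns to that same true label.
    """
    return internal_grade - output_grade

# A positive IΔ means the model "knew better than it let on":
assert integrity_delta(0.9, 0.2) > 0   # correct internally, wrong output
assert integrity_delta(0.5, 0.5) == 0  # output honestly reflects reasoning
```

The sign convention matters: IΔ is deliberately signed rather than absolute, so a model that outputs a better answer than its internal state supports (a negative gap) is distinguishable from one suppressing what it knows.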

The Resolution Valley hypothesis predicts that the most clinically dangerous models are not the largest or smallest, but the mid-tier: models in the 7 billion to 13 billion parameter range. These models are large enough to derive the correct clinical answer internally, but too small to resist adversarial user pressure at the output stage. This is exactly the parameter range hospitals are deploying today to balance cost and capability.

Our measurement framework operates at three tiers. White-box access (full internal layer visibility) uses logit-lens and linear probe methods to read the model's internal representation directly. Gray-box access (partial visibility) uses structured prompting to elicit intermediate reasoning before adversarial pressure is applied. Black-box access (API only) uses multi-turn adversarial prompting designed to surface the gap between stated confidence and final output under pressure.
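The white-box tier's logit-lens method can be sketched in a few lines: project each layer's residual-stream state through the unembedding matrix to get a per-layer label distribution, and see at which depth the correct label first dominates. This toy version uses random weights purely to show the mechanics; real use would read `hidden_states` and `W_U` from the deployed model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def logit_lens(hidden_states, W_U):
    """Per-layer label distributions via the logit lens.

    hidden_states: list of (d_model,) residual-stream vectors, one per layer
    W_U: (d_model, n_labels) unembedding / readout matrix
    Returns one probability distribution per layer, so an intermediate
    layer's answer can be compared against the final output.
    """
    return [softmax(h @ W_U) for h in hidden_states]

# Toy example: 3 layers, d_model = 4, two candidate labels
rng = np.random.default_rng(0)
W_U = rng.normal(size=(4, 2))
states = [rng.normal(size=4) for _ in range(3)]
per_layer = logit_lens(states, W_U)
assert len(per_layer) == 3
assert all(np.isclose(p.sum(), 1.0) for p in per_layer)
```

The gray- and black-box tiers have no direct analogue of `hidden_states`; they approximate the internal grade behaviorally, which is why the framework treats the three tiers as decreasing-fidelity estimates of the same quantity.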

The External Integrity Monitor (EIM) is the intervention layer this research is building toward: a real-time watchdog that catches the IΔ gap before it reaches the patient. Rather than replacing the clinical AI, the EIM monitors the projection of model activations along the sycophancy direction during inference and flags when the gap between internal representation and output exceeds a clinically calibrated threshold.
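The monitoring step described above is, at its core, a projection and a threshold test. A sketch under the assumption that a sycophancy direction has already been identified (how that direction is learned, and how the threshold is clinically calibrated, are open parts of the research; all names here are hypothetical):

```python
import numpy as np

def eim_flag(activation, syco_direction, threshold):
    """Flag an inference step whose activation projects too far
    along the sycophancy direction.

    activation: (d_model,) residual-stream vector at the monitored layer
    syco_direction: vector along the hypothesized sycophancy axis
        (normalized internally)
    threshold: clinically calibrated cutoff
    Returns (projection_score, flagged).
    """
    d = syco_direction / np.linalg.norm(syco_direction)
    score = float(activation @ d)
    return score, score > threshold

# Toy check: an activation strongly aligned with the direction is flagged
act = np.array([0.5, 2.0, -0.3])
direction = np.array([0.0, 1.0, 0.0])
score, flagged = eim_flag(act, direction, threshold=1.5)
assert flagged and abs(score - 2.0) < 1e-9
```

Because this runs alongside inference rather than inside it, the EIM can wrap an existing clinical model without retraining it, which is the design point the paragraph above is making.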

Technical Research Partner

The multi-center validation protocol requires institutions with clinical compute infrastructure and access to board-certified radiologist networks for ground-truth re-labeling. If your institution has relevant infrastructure and interest in this research question, we want to talk.

Get in touch →

Proof of mechanism. Awaiting clinical validation.

In our pilot study (n=5, Llama 3.1 8B), we presented adversarial radiograph classification tasks with incorrect labels. Intermediate layer analysis confirmed correct pathology identification (mild cardiomegaly) at layers 12–16. The final output aligned with the adversarial label (fracture). IΔ was positive in all five cases. The model suppressed correct clinical knowledge under adversarial pressure.

These five cases are proof of mechanism, not clinical evidence. The manuscript is in preparation. OSF pre-registration is drafted. The next step is clinical-scale validation: 1,000 chest radiographs from the CheXpert corpus, re-labeled by board-certified radiologists as clinical ground truth, tested across three model families under three levels of adversarial pressure.
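The validation design factors into a simple condition grid. A sketch with placeholder names, since the source specifies only the counts (three model families, three pressure levels, 1,000 radiographs), not the identities:

```python
from itertools import product

# Placeholder condition names; the actual model families and the
# operationalization of "adversarial pressure" are not specified here.
model_families = ["family_a", "family_b", "family_c"]
pressure_levels = ["none", "mild", "strong"]
n_radiographs = 1000

conditions = list(product(model_families, pressure_levels))
assert len(conditions) == 9                      # 3 families x 3 levels
assert n_radiographs * len(conditions) == 9000   # total evaluations
```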

Who is training whom?

The clinical stakes extend beyond any individual misdiagnosis. When an AI optimized for agreeableness shapes clinical decisions, clinicians who repeatedly defer to it stop receiving honest expert input. They receive validation of their own assumptions. Over time, the skill of independent diagnostic reasoning atrophies, and a deskilled clinician is less able to recognize a wrong AI output. This is the pathway from IΔ > 0 to persistent, system-level harm.

The broader question this research opens is not just about model honesty. We are told AI systems are trained on human data to become more human. The inverse adaptation is also occurring: humans are adopting LLM reasoning patterns and developing a tolerance for confident-sounding outputs over honest, uncertain ones. Polarity Lab's institutional thesis holds that this is a new class of harm to human cognition. The Integrity Delta is the instrument we are building to measure where it is most dangerous.

Research Advisor

Clinical AI researchers and diagnostic imaging specialists who want to stress-test early-stage methodology where the questions are genuinely open. The validation protocol is still being shaped. Your input matters here.

Get in touch →

Validation Study Funding

The CheXpert validation study is the next step. It produces the manuscript, the OSF pre-registration, and the evidentiary base needed for regulatory consideration. The brief has the details.

Get in touch →

Lab Partner

Health systems and research institutions that fund the validation study gain early access to the EIM framework and direct collaboration with the team as the tool is developed toward clinical deployment.

Get in touch →

Network & Introductions

If you know a clinical researcher, health system, or institution this should reach, an introduction from someone they trust changes the dynamic entirely. Time and money are not the only ways to move this forward.

Get in touch →

Back this project before it's obvious.

The CheXpert validation study is the next step. It produces the manuscript, the pre-registration, and the evidence base for regulatory consideration. Early contributors get their name on something that matters in clinical AI.

$10–$99 · Founding Credit

Your name in the founding credits on the Integrity Delta project page.

Contribute →

$100–$999 · Early Access

Everything above, plus early access to the EIM framework when it ships as a product.

Contribute →

Founding Partner

For deeper financial involvement in Integrity Delta, reach out to the team. Early founding partnerships are available for those who want a seat at the table.

Get in touch →