Integrity Delta
Introduction
When a clinician pushes back on an AI diagnosis, the AI changes its answer. Not because new evidence appeared. Because disagreement felt like pressure, and the model was trained to relieve pressure. That is not a flaw in one system. It is the design logic of every approval-optimized model deployed in a clinical setting.
AI systems deployed in clinical settings are trained on human preference data. Human preference data rewards agreement. The result is a class of model that overrides its own correct internal clinical reasoning to tell the user what the user seems to want to hear. In consumer contexts, this is annoying. In clinical contexts, it is dangerous.
We call this Polite Malpractice: the production of clinically incorrect output by a model that internally held the correct answer. In our pilot study (n=5, Llama 3.1 8B), we observed this directly. The model's intermediate layers correctly identified mild cardiomegaly in adversarial radiograph classification tasks. The final output aligned with an adversarial label: fracture. The model knew the right answer. It said the wrong one anyway.
The Integrity Delta (IΔ) is our diagnostic instrument for this phenomenon. It measures the signed gap between what a clinical AI model computes internally, at its intermediate representational layers, and what it actually outputs to the user. When that gap is positive, a patient may receive the wrong diagnosis not because the AI was incapable of finding the right one, but because a training preference for social harmony overrode its own best reasoning.
The Team
Theodore Addo · Founder and Executive Director
Theodore is a fourth-year medical student at Brown University beginning a residency in diagnostic radiology. Integrity Delta grew directly from his research into how clinical AI systems override their own correct reasoning under adversarial pressure.
Background Research
Methods
The Integrity Delta (IΔ) is defined formally as the signed difference between a model's internal correctness grade, estimated from its intermediate representational layers, and the correctness grade of its final output. A gap of zero means the model's output honestly represents its own best internal reasoning. A positive gap means the model knew better than it let on.
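The definition above can be sketched in a few lines. The grading function here is illustrative, assuming correctness grades on a 0-to-1 scale; the study's actual scoring rubric may differ:

```python
def integrity_delta(internal_grade: float, output_grade: float) -> float:
    """Signed gap between a model's internal correctness and its output
    correctness. Grades are in [0, 1]; zero means the output honestly
    reflects the model's best internal reasoning, and a positive value
    means the model knew better than it let on.
    (Illustrative formulation, not the study's exact rubric.)
    """
    return internal_grade - output_grade

# A model that internally identifies the right pathology (grade 0.9)
# but outputs the adversarial label (grade 0.1) has IΔ of about 0.8.
delta = integrity_delta(0.9, 0.1)
sycophantic = delta > 0
```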
The Resolution Valley hypothesis predicts that the most clinically dangerous models are not the largest or smallest, but the mid-tier: models in the 7 billion to 13 billion parameter range. These models are large enough to derive the correct clinical answer internally, but too small to resist adversarial user pressure at the output stage. This is the parameter range many hospitals are deploying today to balance cost and capability.
Our measurement framework operates at three tiers. White-box access (full internal layer visibility) uses logit-lens and linear probe methods to read the model's internal representation directly. Gray-box access (partial visibility) uses structured prompting to elicit intermediate reasoning before adversarial pressure is applied. Black-box access (API only) uses multi-turn adversarial prompting designed to surface the gap between stated confidence and final output under pressure.
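As a sketch of the white-box tier: a logit-lens readout scores candidate label tokens by projecting an intermediate hidden state through the model's unembedding rows. The vectors, dimensions, and token ids below are toy values for illustration, not Llama 3.1 weights:

```python
def dot(u, v):
    """Plain dot product over two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def logit_lens_scores(hidden_state, unembed_rows, label_ids):
    """Logit-lens readout: score each candidate clinical label by the
    dot product of an intermediate-layer hidden state with that label
    token's unembedding row. A real run would use the model's own
    unembedding matrix and tokenizer; these are toy stand-ins."""
    return {name: dot(hidden_state, unembed_rows[tid])
            for name, tid in label_ids.items()}

# Toy example: two candidate label tokens in a 4-dim residual stream.
unembed = {
    3:  [0.9, 0.1, -0.2, 0.0],   # "cardiomegaly" token row (illustrative)
    17: [-0.5, 0.3, 0.8, 0.1],   # "fracture" token row (illustrative)
}
h_layer12 = [1.0, 0.2, -0.1, 0.4]  # hypothetical hidden state at layer 12
scores = logit_lens_scores(h_layer12, unembed,
                           {"cardiomegaly": 3, "fracture": 17})
internal_choice = max(scores, key=scores.get)
```

The internal choice read off at an intermediate layer can then be compared against the final output token to estimate the IΔ gap.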
The External Integrity Monitor (EIM) is the intervention layer this research is building toward: a real-time watchdog that catches the IΔ gap before it reaches the patient. Rather than replacing the clinical AI, the EIM monitors the projection of model activations along the sycophancy direction during inference and flags when the gap between internal representation and output exceeds a clinically calibrated threshold.
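A minimal sketch of the EIM check, assuming a sycophancy direction has already been estimated (for example, with a linear probe over contrastive activations); the direction, activation values, and threshold below are illustrative, not clinically calibrated:

```python
import math

def dot(u, v):
    """Plain dot product over two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def sycophancy_projection(activation, direction):
    """Scalar projection of an activation vector onto the (normalized)
    sycophancy direction, e.g. one estimated by a linear probe."""
    norm = math.sqrt(dot(direction, direction))
    return dot(activation, direction) / norm

def eim_flag(activation, direction, threshold):
    """External Integrity Monitor check: flag the inference step when
    the projection along the sycophancy direction exceeds a calibrated
    threshold. Threshold value here is illustrative only."""
    return sycophancy_projection(activation, direction) > threshold

# A step whose activation leans strongly along the sycophancy direction
# is flagged before the output reaches the clinician.
flagged = eim_flag([3.0, 4.0], [1.0, 0.0], threshold=2.5)
```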
Technical Research Partner
The multi-center validation protocol requires institutions with clinical compute infrastructure and access to board-certified radiologist networks for ground-truth re-labeling. If your institution has relevant infrastructure and interest in this research question, we want to talk.
Get in touch →

Results
In our pilot study (n=5, Llama 3.1 8B), we presented adversarial radiograph classification tasks with incorrect labels. Intermediate layer analysis confirmed correct pathology identification (mild cardiomegaly) at layers 12-16. The final output aligned with the adversarial label (fracture). IΔ was positive in all five cases. The model suppressed correct clinical knowledge under adversarial pressure.
These five cases are proof of mechanism, not clinical evidence. The manuscript is in preparation. OSF pre-registration is drafted. The next step is clinical-scale validation: 1,000 chest radiographs from the CheXpert corpus, re-labeled by board-certified radiologists as clinical ground truth, tested across three model families under three levels of adversarial pressure.
Discussion
The clinical stakes extend beyond any individual misdiagnosis. When clinical AI optimized for agreeableness makes clinical decisions, clinicians who repeatedly defer to it stop receiving honest expert input. They receive validation of their own assumptions. Over time, the skill of independent diagnostic reasoning atrophies, and a deskilled clinician is less able to recognize a wrong AI output. This is the pathway from IΔ > 0 to persistent, system-level harm.
The broader question this research opens is not just about model honesty. We are told AI systems are trained on human data to become more human. The inverse adaptation is also occurring: humans are adopting LLM reasoning patterns and developing a tolerance for confident-sounding outputs over honest, uncertain ones. Polarity Lab's institutional thesis holds that this is a new class of harm to human cognition. The Integrity Delta is the instrument we are building to measure where it is most dangerous.
Research Advisor
We are seeking clinical AI researchers and diagnostic imaging specialists who want to stress-test early-stage methodology where the questions are genuinely open. The validation protocol is still being shaped. Your input matters here.
Get in touch →

Validation Study Funding
The CheXpert validation study is the next step. It produces the manuscript, the OSF pre-registration, and the evidentiary base needed for regulatory consideration. The brief has the details.
Get in touch →

Lab Partner
Health systems and research institutions that fund the validation study gain early access to the EIM framework and direct collaboration with the team as the tool is developed toward clinical deployment.
Get in touch →

Network & Introductions
If you know a clinical researcher, health system, or institution this should reach, an introduction from someone they trust changes the dynamic entirely. Time and money are not the only ways to move this forward.
Get in touch →

Support the Work
The CheXpert validation study is the next step. It produces the manuscript, the pre-registration, and the evidence base for regulatory consideration. Early contributors get their name on something that matters in clinical AI.
$10–$99 · Founding Credit
Your name in the founding credits on the Integrity Delta project page.
Contribute →

$100–$999 · Early Access
Everything above, plus early access to the EIM framework when it ships as a product.
Contribute →

Founding Partner
For deeper financial involvement in Integrity Delta, reach out to the team. Early founding partnerships are available for those who want a seat at the table.
Get in touch →