Abstract
We propose a novel inference-time out-of-domain (OOD) detection algorithm for specialized large language models (LLMs). Despite achieving state-of-the-art performance on in-domain tasks through fine-tuning, specialized LLMs remain unreliable when presented with OOD inputs, posing risks in critical applications. Our method leverages the Inductive Conformal Anomaly Detection (ICAD) framework, using a new non-conformity measure based on the model's dropout tolerance. Motivated by recent findings on polysemanticity and redundancy in LLMs, we hypothesize that in-domain inputs exhibit higher dropout tolerance than OOD inputs. We aggregate dropout tolerance across multiple layers via a valid ensemble approach, improving detection while maintaining theoretical false alarm bounds from ICAD. Experiments with medical-specialized LLMs show that our approach detects OOD inputs better than baseline methods, with AUROC improvements of 2% to 37% when treating OOD datapoints as positives and in-domain test datapoints as negatives.
Motivation
Large Language Models (LLMs) fine-tuned for specialized domains achieve state-of-the-art performance on in-domain tasks. However, they can be unreliable when they encounter out-of-domain (OOD) inputs. This vulnerability can lead to incorrect or hallucinated responses, posing significant risks in critical applications such as clinical decision support.
For instance, MentaLLaMA, a model specialized in mental health, tends to incorrectly associate OOD queries with mental health topics. Similarly, EYE-LLaMA, an ophthalmology-focused model, often hallucinates answers when faced with non-ophthalmic questions. Ideally, these models would abstain from answering rather than risk hallucinating. This work addresses the crucial challenge of detecting OOD inputs for specialized LLMs, enhancing their safety and reliability in real-world deployments.

Specialized LLMs like MentaLLaMA and EYE-LLaMA work well on in-domain questions but are prone to making mistakes on OOD queries.
Methodology
We propose a novel inference-time OOD detection method based on the concept of dropout tolerance. Our central hypothesis is that specialized LLMs are more robust to neuron dropout for in-domain (iD) inputs than for OOD inputs, due to the redundant representation of iD concepts (polysemanticity). We use this dropout tolerance as a non-conformity measure (NCM) within the Inductive Conformal Anomaly Detection (ICAD) framework.
- Dropout Tolerance as NCM: We define dropout tolerance as the minimum fraction of neurons that must be dropped from a layer to change the model's original prediction. A lower tolerance (and thus higher non-conformity score) suggests an OOD input.
- Conformal Anomaly Detection: The ICAD framework converts this score into a p-value for a new input by ranking its non-conformity against scores pre-computed on a calibration set of in-domain data. An input whose p-value falls below a user-chosen threshold ε is flagged as OOD, with a theoretical guarantee that the false alarm rate is bounded by ε.
- Ensemble Approach: Instead of relying on a single layer, we compute p-values from multiple layers (e.g., early, middle, and late stages of the model). These p-values are combined with a valid merging function (such as the arithmetic mean) into a single, more robust p-value while preserving the false alarm guarantee; a minimal sketch of the full pipeline follows this list.
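To make the pipeline concrete, here is a minimal Python sketch, not the paper's implementation: `predict` and `layer_width` are hypothetical stand-ins for re-running the LLM with selected neurons of one layer zeroed out, the random search over dropped subsets (and its trial count) is just one simple way to estimate the tolerance, and twice the arithmetic mean is one standard valid merging rule.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_tolerance(predict, layer_width, x, layer, m=30, trials=5):
    """Non-conformity score for input x at a given layer.

    `predict(x, layer, dropped)` re-runs the model with the activations at
    indices `dropped` of `layer` zeroed out (hypothetical hook). Returns the
    negated minimum count of dropped neurons needed to change the answer
    (the paper normalizes to a fraction; within a fixed layer the rank-based
    p-value is unaffected by this monotone rescaling), so that lower
    tolerance means higher non-conformity.
    """
    original = predict(x, layer, np.array([], dtype=int))
    for k in range(1, m + 1):                # grow the dropout budget
        for _ in range(trials):              # try random subsets of size k
            dropped = rng.choice(layer_width(layer), size=k, replace=False)
            if predict(x, layer, dropped) != original:
                return -k
    return -(m + 1)                          # answer stable up to budget m

def icad_p_value(score, calib_scores):
    """ICAD p-value: fraction of calibration scores at least as extreme."""
    n = len(calib_scores)
    return (sum(s >= score for s in calib_scores) + 1) / (n + 1)

def merged_p_value(p_values):
    """Twice the arithmetic mean of p-values is itself a valid p-value
    (Vovk and Wang, 2020), so the eps false-alarm bound is preserved."""
    return min(1.0, 2.0 * float(np.mean(p_values)))

def is_ood(predict, layer_width, x, calib, layers=(7, 15, 22), eps=0.05):
    """Flag x as OOD when the merged multi-layer p-value drops below eps."""
    ps = [icad_p_value(dropout_tolerance(predict, layer_width, x, l),
                       calib[l])
          for l in layers]
    return merged_p_value(ps) < eps
```

Here `calib[layer]` holds non-conformity scores pre-computed on held-out in-domain data; by exchangeability, an in-domain input's merged p-value falls below ε with probability at most ε.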

Overview of our approach: Dropout tolerance is measured and compared against a calibration set distribution within the ICAD framework to detect OOD inputs with bounded false alarms.
Results
We evaluated our approach on two medical-specialized LLMs, EYE-LLaMA (ophthalmology) and MentaLLaMA (mental health), using COVID-QA and MedMCQA as OOD datasets. Our method, which combines dropout tolerance from three layers (7, 15, and 22), consistently outperforms all baselines, including methods using a base non-conformity score, a single layer's p-value, and ensemble majority voting.

AUROC results show consistent improvement over baselines across both models and OOD datasets. Our method achieves an AUROC of 0.96 for MentaLLaMA on MedMCQA.
Key results include AUROC improvements of up to 37% over baselines. The ensemble approach proves superior to single-layer methods, demonstrating the benefit of aggregating information across different stages of the model's inference process. Furthermore, our experiments empirically validate the theoretical false alarm rate guarantees of the ICAD framework.

Comparison of ROC curves illustrates the superior performance of our ensemble approach over single-layer methods.

Empirical false alarm rates are consistently below the user-defined threshold ε, validating the theoretical guarantees of the ICAD framework.
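The check itself is simple. A sketch, assuming `id_test_p_values` holds the merged p-values computed for held-out in-domain test inputs (hypothetical name):

```python
import numpy as np

def empirical_false_alarm_rate(id_test_p_values, eps):
    """Fraction of in-domain test inputs wrongly flagged as OOD at
    threshold eps; the ICAD guarantee says this should be <= eps."""
    return float(np.mean(np.asarray(id_test_p_values) < eps))

# Sweeping eps traces the curve: a valid detector stays on or below
# the diagonal, e.g. for eps in (0.01, 0.05, 0.1, 0.2).
```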
Ablation Studies
We conducted several ablation studies to analyze the components of our algorithm:
- Choice of Merging Function: We tested four valid p-value merging functions (Bonferroni, harmonic mean, geometric mean, and arithmetic mean; see the sketch after this list). While all performed comparably for MentaLLaMA, the arithmetic mean provided the best and most consistent results for EYE-LLaMA.
- Impact of Maximum Dropped Neurons (m): We found that increasing the maximum number of neurons to drop, m, raises the likelihood of changing the model's response. We chose m=30 as an effective balance between detection capability and computational efficiency.
- Query Difficulty Analysis: An interesting finding emerged when analyzing query types. Multiple-choice questions (MCQs) were significantly "easier" to alter, requiring fewer dropped neurons to change the model's prediction compared to more complex, subjective questions. This suggests that the model's robustness is lower for constrained-choice tasks, which aligns with our stronger performance on the MCQ-based MedMCQA dataset.
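For reference, a hedged sketch of the four merging rules: the validity constants below follow Vovk and Wang (2020), and the paper's exact normalization may differ.

```python
import numpy as np

def bonferroni(ps):
    """K * min(p): the classic Bonferroni merger."""
    return min(1.0, len(ps) * min(ps))

def arithmetic(ps):
    """Twice the arithmetic mean is a valid merged p-value."""
    return min(1.0, 2.0 * float(np.mean(ps)))

def geometric(ps):
    """e times the geometric mean is a valid merged p-value."""
    return min(1.0, float(np.e * np.exp(np.mean(np.log(ps)))))

def harmonic(ps):
    """e * ln(K) times the harmonic mean is valid for K >= 2."""
    k = len(ps)
    hm = k / float(np.sum(1.0 / np.asarray(ps)))
    return min(1.0, float(np.e * np.log(k) * hm))
```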

Results with different merging functions show that the arithmetic mean consistently outperforms the other ensembling methods.

Ablation on query difficulty: Responses to MCQs are altered with fewer dropped neurons than responses to subjective, descriptive queries.
BibTeX
@inproceedings{Gupta2025Polysemantic,
  title={Polysemantic Dropout: Conformal OOD Detection for Specialized LLMs},
  author={Gupta, Ayush and Kaur, Ramneet and Roy, Anirban and Cobb, Adam D. and Chellappa, Rama and Jha, Susmit},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2025},
  url={https://arxiv.org/abs/2509.04655}
}