On the Sensitivity of Instruction-tuned LLMs to Harmful Sentences in Long Inputs

Event: PRODUCT_LAUNCH | 2026-05-27 | Impact: MEDIUM

arXiv:2510.05864v2 (Announce Type: replace)

Abstract: Large language models (LLMs) increasingly operate on long inputs, yet their behavior when harmful sentences are sparsely embedded within such inputs remains poorly understood. We present a sensitivity analysis that probes how LLMs extract harmful sentences embedded in long inputs. We construct long inputs by combining neutral and harmful sentences, and systematica