Where Code Meets Natural Language: Taxonomy-Driven Information Flow Analysis for LLM-Integrated Applications

ArXiv CS.AI, 2026-05-27. Authors: Zihao Xu, Xiao Cheng, Ruijie Meng, Yuekang Li

Abstract

arXiv:2603.28345v2. LLM API calls are becoming a ubiquitous program construct, yet they create a boundary that no existing program analysis can cross: runtime values enter a natural-language prompt, undergo opaque processing inside the LLM, and re-emerge as code, SQL, JSON, or text that the program consumes. Every analysis that tracks data across function boundaries, including taint analysis, program slicing, dependency analysis, and change-impact analysis, relies on dataflow summaries of callee behavior. LLM calls have no such summaries, breaking all of these analyses at what we call the NL/PL boundary. We present the first information flow method to bridge this boundary. Grounded in quantitative information flow theory, our taxonomy defines 24 labels along two orthogonal dimensions: information preservation level (from lexically preserved to fully blocked) and output modality (natural language, structured format, executable artifact).
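To make the idea of a dataflow summary at the NL/PL boundary concrete, the following is a minimal sketch, not the authors' implementation. It assumes a label is a pair (preservation level, output modality) attached to each prompt placeholder, and that a taint analysis treats any level other than fully blocked as propagating taint. The names `FlowLabel`, `output_tainted`, `user_name`, and `style_hint` are hypothetical. Only the two preservation endpoints named in the abstract are modeled; the intermediate levels are not given there and are left out.

```python
from dataclasses import dataclass
from enum import Enum


class Preservation(Enum):
    # Only the two endpoints named in the abstract; the intermediate
    # levels of the taxonomy are not listed there and are omitted here.
    LEXICALLY_PRESERVED = "lexically_preserved"
    FULLY_BLOCKED = "fully_blocked"


class Modality(Enum):
    NATURAL_LANGUAGE = "natural_language"
    STRUCTURED_FORMAT = "structured_format"
    EXECUTABLE_ARTIFACT = "executable_artifact"


@dataclass(frozen=True)
class FlowLabel:
    """Hypothetical summary of how one prompt placeholder flows to the output."""
    preservation: Preservation
    modality: Modality


def output_tainted(tainted_vars: set[str], summary: dict[str, FlowLabel]) -> bool:
    """Return True if any tainted placeholder may reach the LLM output.

    Assumption: every level except FULLY_BLOCKED propagates taint.
    """
    return any(
        var in tainted_vars and label.preservation is not Preservation.FULLY_BLOCKED
        for var, label in summary.items()
    )


# Hypothetical call site:
#   prompt = f"Write SQL to find user {user_name}. Style: {style_hint}"
summary = {
    "user_name": FlowLabel(Preservation.LEXICALLY_PRESERVED, Modality.EXECUTABLE_ARTIFACT),
    "style_hint": FlowLabel(Preservation.FULLY_BLOCKED, Modality.EXECUTABLE_ARTIFACT),
}

user_name_reaches_sql = output_tainted({"user_name"}, summary)
style_hint_reaches_sql = output_tainted({"style_hint"}, summary)
print(user_name_reaches_sql, style_hint_reaches_sql)
```

With such a summary in place, a conventional taint analysis could treat the LLM call like any other callee: a tainted `user_name` would be reported as reaching the generated SQL, while a tainted `style_hint` would not.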