Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

ArXiv CS.AI, 2026-05-28. Authors: Andrea Gurioli, Davide D'Ascenzo, Federico Pennino, Maurizio Gabbrielli, Stefano Zacchiroli

Abstract

arXiv:2605.28510v1, announce type: cross

Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce training examples verbatim and without authorship attribution, raising legal and ethical concerns around plagiarism and license compliance. Classical fingerprint-based plagiarism detectors, such as Winnowing, remain highly effective, but inspection requires comparing code fragments against the entire training set, and their linear-time search makes them impractical for the billion-scale corpora used to train modern code LLMs. To bridge this gap, we introduce SOURCETRACKER, a 300M-parameter encoder tailored for code retrieval, together with HYBRIDSOURCETRACKER (HST), a hybrid two-stage provenance-tracking pipeline. HST first narrows the search to a small set of candidate snippets via vector search, then re-ranks those candidates using Winnowing on exact fingerprints.
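The two-stage idea (vector search for candidates, then Winnowing-based re-ranking) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: `toy_embed` is a hashed character-trigram vector standing in for the SOURCETRACKER encoder, the brute-force cosine scan stands in for a real vector index, the `k` and `window` values are arbitrary, and scoring candidates by Jaccard similarity of fingerprint sets is one plausible choice among several.

```python
import math
import zlib
from collections import Counter


def kgram_hashes(text, k=5):
    """Hash every k-character substring with a deterministic CRC32."""
    normalized = "".join(text.split())  # simple normalization: drop all whitespace
    return [zlib.crc32(normalized[i:i + k].encode())
            for i in range(len(normalized) - k + 1)]


def winnow(text, k=5, window=4):
    """Winnowing fingerprints: the minimum hash value of each sliding window."""
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    if len(hashes) < window:
        return {min(hashes)}
    return {min(hashes[i:i + window])
            for i in range(len(hashes) - window + 1)}


def fingerprint_overlap(a, b):
    """Jaccard similarity between two fingerprint sets (assumed scoring rule)."""
    fa, fb = winnow(a), winnow(b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


def toy_embed(text, dim=64):
    """Hypothetical stand-in for a learned code encoder:
    hashed bag of character trigrams, L2-normalized."""
    counts = Counter(zlib.crc32(text[i:i + 3].encode()) % dim
                     for i in range(len(text) - 2))
    vec = [float(counts.get(j, 0)) for j in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def trace(query, corpus, top_k=2):
    """Stage 1: keep the top_k documents by cosine similarity.
    Stage 2: re-rank those candidates by fingerprint overlap."""
    q = toy_embed(query)
    scored = [(sum(x * y for x, y in zip(q, toy_embed(doc))), doc)
              for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    candidates = [doc for _, doc in scored[:top_k]]
    reranked = [(fingerprint_overlap(query, doc), doc) for doc in candidates]
    reranked.sort(key=lambda pair: pair[0], reverse=True)
    return reranked
```

Calling `trace(snippet, corpus)` returns `(overlap, document)` pairs, best match first. In this sketch the expensive fingerprint comparison runs only on the `top_k` candidates rather than on the whole corpus, which is the cost reduction the abstract describes; a production system would presumably replace the linear cosine scan with an approximate nearest-neighbor index.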
