arxiv:2605.28510

Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

Published on May 27

· Submitted by

Authors:

Abstract

A hybrid approach combining vector search and fingerprinting enables scalable and precise tracking of code provenance generated by large language models.

AI-generated summary

Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce training examples verbatim and without authorship attribution, raising legal and ethical concerns around plagiarism and license compliance. Classical fingerprint-based plagiarism detectors based on fingerprinting, such as Winnowing, remain highly effective, yet the inspection requires comparing fragments of code to the entire training set, and their linear-time search makes them impractical for the billion-scale corpora used to train modern code LLMs. To bridge this gap, we introduce SOURCETRACKER, a 300M-parameter encoder tailored for code retrieval, together with a hybrid two-stage provenance-tracking pipeline HYBRIDSOURCETRACKER (HST). HST first narrows down a small set of candidate snippets via vector search, then re-ranks those candidates using Winnowing on exact fingerprints. We train and evaluate our system on a 10M-snippet subset of the THESTACKV2 dataset, with both verbatim and adapted snippets that emulate realistic identifier renaming. On an in vitro 100k-snippet search space with adapted queries, our hybrid approach reaches a mean reciprocal rank on par with Winnowing for 30-token fragments. Then, starting from windows >= 60 tokens, it consistently over-performs by up to 5.4% while preserving logarithmic-time query complexity. In a complementary evaluation using an LLM-based judge, we find that many retrieved snippets not labeled as ground truth are still highly similar to the expected sources, particularly with longer context windows, and thus remain useful for end users. Overall, our results demonstrate that integrating vector search with fingerprinting enables scalable, high-precision provenance tracking for code produced by LLMs.

View arXiv page View PDF Add to collection

Community

andreagurioli1995

Paper submitter about 9 hours ago

In this paper, we tackle the problem of tracing where LLM-generated code may come from.
We focus on cases where code models reproduce snippets that are identical or very close to training examples.
To make this practical at large scale, we introduce SOURCETRACKER, a code retrieval encoder.
We also propose HybridSourceTracker, which combines vector search with classical Winnowing fingerprints.
This lets us first retrieve a small set of likely sources and then verify them more precisely.
We evaluate the approach on TheStackV2 and show that it remains accurate while scaling efficiently.
Overall, we show that provenance tracking for generated code can be made practical for real-world LLM systems.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2605.28510

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.28510 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.28510 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.28510 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.