Papers
arxiv:2601.22655

The Semantic Trap: Do Fine-tuned LLMs Learn Vulnerability Root Cause or Just Functional Pattern?

Published on Feb 2
Authors:
,
,
,
,
,

Abstract

Fine-tuned large language models for software vulnerability detection often rely on functional patterns rather than true security reasoning, as demonstrated by systematic evaluation revealing persistent failures in distinguishing vulnerable code from patched versions.

AI-generated summary

LLMs demonstrate promising performance in software vulnerability detection after fine-tuning. However, it remains unclear whether these gains reflect a genuine understanding of vulnerability root causes or merely an exploitation of functional patterns. In this paper, we identify a critical failure mode termed the "semantic trap," where fine-tuned LLMs achieve high detection scores by associating certain functional domains with vulnerability likelihood rather than reasoning about the underlying security semantics. To systematically evaluate this phenomenon, we propose TrapEval, a comprehensive evaluation framework designed to disentangle vulnerability root cause from functional pattern. TrapEval introduces two complementary datasets derived from real-world open-source projects: V2N, which pairs vulnerable code with unrelated benign code, and V2P, which pairs vulnerable code with its corresponding patched version, forcing models to distinguish near-identical code that differs only in subtle security-critical logic. Using TrapEval, we fine-tune five representative state-of-the-art LLMs across three model families and evaluate them under cross-dataset testing, semantic-preserving perturbations, and varying degrees of semantic gap measured by CodeBLEU. Our empirical results reveal that, despite improvements in metrics, fine-tuned LLMs consistently struggle to distinguish vulnerable code from its patched counterpart, exhibit severe robustness degradation under minor semantic-preserving transformations, and rely heavily on functional-context shortcuts when the semantic gap is small. These findings provide strong evidence that current fine-tuning practices often fail to impart true vulnerability reasoning. Our findings serve as a wake-up call: high benchmark scores on traditional datasets may be illusory, masking the model's inability to understand the true causal logic of vulnerabilities.

Community

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2601.22655
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2601.22655 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2601.22655 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.22655 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.