CodeV-SVA: Training Specialized LLMs for Hardware Assertion Generation via RTL-Grounded Bidirectional Data Synthesis

Introduction

Note:

(1) CodeV-SVA-no-think-no-dut-8B removes RTL code from the prompt to better align with real-world usage scenarios. The model is still under development.

(2) This model is trained without thinking trajectories. It should be deployed in Qwen3's non-thinking mode (set enable_thinking=False).

We introduce CodeV-SVA, a family of large language models designed to translate natural-language verification properties into SystemVerilog Assertions (SVAs).

Open-Source Plan:

  • Model ✓
  • Evaluation code ✓
  • Paper
  • Dataset
  • Data synthesis and training code

Evaluation Results

NL2SVA-Human (w/o DUT):

Model                          Func.@1   Func.@16   Func.@32
CodeV-SVA-8B                   0.408     0.680      0.685
CodeV-SVA-no-think-no-dut-8B   0.416     0.700      0.726

This evaluation removes the RTL code (DUT) from the original NL2SVA-Human benchmark. Some information needed to generate a correct SVA is only present in the RTL, so removing it significantly degrades model performance; we do so solely to better match our intended usage scenario. We will further refine the benchmark in future work.
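Func.@k in the table above is presumably the standard unbiased pass@k estimator applied to functional correctness over n sampled completions; this card does not define the metric, so the following is a sketch under that assumption:

```python
from math import comb

def func_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k-style estimator: probability that at least one of k
    samples drawn (without replacement) from n completions, of which c are
    functionally correct, is correct."""
    if n - c < k:
        # Fewer incorrect completions than draws: a correct one is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: with 32 samples and 13 correct, Func.@1 is simply 13/32.
print(round(func_at_k(32, 13, 1), 3))  # 0.406
```

With k = n the estimator reduces to "at least one of all samples is correct", which is why Func.@32 upper-bounds Func.@1 for the same sample set.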

Models

Model                          Download
CodeV-SVA-8B                   🤗 HuggingFace
CodeV-SVA-no-think-8B          🤗 HuggingFace
CodeV-SVA-no-think-no-dut-8B   🤗 HuggingFace
CodeV-SVA-14B                  🤗 HuggingFace

Usage

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL_DIR = "/path/to/CodeV-SVA-no-think-no-dut-8B/"

def get_nl2sva_human_prompt_no_dut(problem: str) -> str:
    """Build the English NL2SVA prompt (no DUT/RTL context)."""
    prompt = "Create a SVA assertion that checks: "
    prompt += problem + "\n"

    return prompt

def get_nl2sva_human_prompt_no_dut_chinese(problem: str) -> str:
    """Build the Chinese NL2SVA prompt (no DUT/RTL context); the instruction
    translates to "Write an SVA assertion to verify the following property:"."""
    prompt = "编写一个 SVA 断言来验证以下性质:"
    prompt += problem + "\n"

    return prompt

if __name__ == "__main__":
    system_prompt = "You are an AI assistant tasked with formal verification of register transfer level (RTL) designs.\nYour job is to translate a description of an assertion to concrete SystemVerilog Assertion (SVA) implementation.\n"

    model = LLM(
        MODEL_DIR, 
        tensor_parallel_size=4
    )

    sampling_params = SamplingParams(
        temperature=0.8,
        top_p=0.95,
        max_tokens=32768,
        n=1
    )

    problem_en = "upd_rdy must be a one cycle pulse except next cycle upd_vld is valid"
    dialog = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": get_nl2sva_human_prompt_no_dut(problem_en)}
    ] 
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    prompt = tokenizer.apply_chat_template(dialog, tokenize=False, add_generation_prompt=True, enable_thinking=False)
    print(f"### Prompt (en):\n{prompt}")
    responses = model.generate(prompt, sampling_params)
    for output in responses[0].outputs:
        print(f"### Response (en):\n{output.text}")

    # Translation: "upd_rdy must be a single-cycle pulse, unless upd_vld is valid in the next cycle."
    problem_zh = "upd_rdy 必须是单个时钟周期的脉冲,除非在下一个周期中 upd_vld 为有效"
    dialog = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": get_nl2sva_human_prompt_no_dut_chinese(problem_zh)}
    ] 
    prompt = tokenizer.apply_chat_template(dialog, tokenize=False, add_generation_prompt=True, enable_thinking=False)
    print(f"### Prompt (zh):\n{prompt}")
    responses = model.generate(prompt, sampling_params)
    for output in responses[0].outputs:
        print(f"### Response (zh):\n{output.text}")
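The script above prints raw completions; before passing an answer to a simulator or formal tool, the SVA usually has to be extracted from the free-form response. A minimal post-processing sketch (the fence tag and the fallback pattern are assumptions about the output format, not documented model behavior):

```python
import re

def extract_sva(response_text: str) -> str:
    """Return the first fenced code block if present, otherwise fall back to
    the first `assert property (...);` statement, otherwise the raw text."""
    # Assumed format: answer wrapped in a ```systemverilog / ```verilog fence.
    fence = re.search(r"```(?:systemverilog|verilog|sv)?\s*\n(.*?)```",
                      response_text, re.DOTALL)
    if fence:
        return fence.group(1).strip()
    # Fallback: grab a bare assert-property statement from the prose.
    stmt = re.search(r"assert\s+property\s*\(.*?\)\s*;", response_text, re.DOTALL)
    return stmt.group(0) if stmt else response_text.strip()

demo = "Here is the assertion:\n```systemverilog\nassert property (@(posedge clk) a |-> b);\n```"
print(extract_sva(demo))  # assert property (@(posedge clk) a |-> b);
```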

Citation

@misc{CodeV-SVA,
    title={CodeV-SVA: Training Specialized LLMs for Hardware Assertion Generation via RTL-Grounded Bidirectional Data Synthesis}, 
    author={Yutong Wu and Chenrui Cao and Pengwei Jin and Di Huang and Rui Zhang and Xishan Zhang and Zidong Du and Qi Guo and Xing Hu},
    year={2025},
    howpublished={\url{https://huggingface.co/wyt2000/CodeV-SVA-14B}},
}
Base model: Qwen/Qwen3-8B (itself derived from Qwen/Qwen3-8B-Base)