from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard 
    task0 = Task("pass_rate", "acc", "Pass Rate (%)")
    task1 = Task("l1", "acc", "L1 Syntactic (%)")
    task2 = Task("l2", "acc", "L2 Semantic (%)")
    task3 = Task("l3s", "acc", "L3S Solvable (%)")
    task4 = Task("l3u", "acc", "L3U Unsolvable (%)")

NUM_FEWSHOT = 0  # Number of few-shot examples used in evaluation
# ---------------------------------------------------



# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">How Do LLMs Fail and Generalize in RTL Coding for Hardware Design?</h1>"""

CONCLUSION_TEXT = """
Evaluations on the VerilogEval Human benchmark reveal a strict empirical ceiling: frontier models plateau at a 90.8% initial pass rate.
The solvability taxonomy shows that L3U (Unsolvable) errors dominate across all model families, revealing persistent knowledge gaps that inference-time scaling cannot close.
Our analysis also exposes a striking surface-convergence gap: optimization drastically reduces syntax errors but concurrently increases functional testbench failures.
Ultimately, register transfer level (RTL) coding capacity relies heavily on pretraining knowledge.
Integrating reward and policy modelling (e.g., GRPO) during the post-training phase amplifies existing competencies by teaching models to compile; L3S errors remain addressable via best-of-N sampling, while L3U errors require fundamental model improvement.

"""

LLM_BENCHMARKS_TEXT = """
## About the Taxonomy
Our four-level error taxonomy evaluates LLM-generated RTL code based on successive stages of the EDA pipeline:
- **L1 Syntactic**: The source string is rejected by the HDL parser. No AST can be constructed.
- **L2 Semantic**: The source string parses into a valid AST but violates at least one static semantic constraint (e.g., detected during elaboration, linting, or synthesis).
- **L3S Functional-Solvable**: The synthesized design fails to meet the specification, but the model has solved the problem in at least one other rollout (addressable via inference-time scaling / best-of-N sampling).
- **L3U Functional-Unsolvable**: The synthesized design fails to meet the specification, and the model cannot solve the problem in any rollout (requires fundamental model improvement).
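
As a rough sketch (a hypothetical helper, not the actual evaluation pipeline), a sample's taxonomy level can be read off from the successive pipeline checks above:

```python
def classify(parses: bool, elaborates: bool, passes_tb: bool,
             solved_in_any_rollout: bool) -> str:
    # Illustrative mapping from EDA pipeline outcomes to taxonomy levels.
    if not parses:
        return "L1"    # rejected by the HDL parser; no AST constructed
    if not elaborates:
        return "L2"    # valid AST, but a static semantic constraint is violated
    if passes_tb:
        return "pass"  # meets the design specification
    # Functional failure: split by whether any rollout ever solved the problem.
    return "L3S" if solved_in_any_rollout else "L3U"
```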

"""


CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@article{liu2026rtlerror,
  title={How Large Language Models Fail and Generalize to Learn RTL Coding for Digital Circuit Design},
  author={Liu, Guan-Ting and Yang, Chao-Han Huck and Deng, Chenhui and Yu, Zhongzhi and Khailany, Brucek and Wang, Yu-Chiang Frank},
  journal={Under Review},
  year={2026}
}
"""