from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    task0 = Task("pass_rate", "acc", "Pass Rate (%)")
    task1 = Task("l1", "acc", "L1 Syntactic (%)")
    task2 = Task("l2", "acc", "L2 Semantic (%)")
    task3 = Task("l3s", "acc", "L3S Solvable (%)")
    task4 = Task("l3u", "acc", "L3U Unsolvable (%)")
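The `(benchmark, metric, col_name)` triples above are how raw result files get mapped to display columns. A minimal, self-contained sketch of that mapping — assuming (this is an assumption, not code from this Space) each results JSON nests scores as `{benchmark: {metric: value}}`:

```python
# Illustrative sketch only: mirrors the Task/Tasks definitions above and shows
# how the triples are typically consumed. The JSON layout is an assumption.
from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str  # task_key in the results JSON
    metric: str     # metric_key in the results JSON
    col_name: str   # column name shown in the leaderboard


class Tasks(Enum):
    task0 = Task("pass_rate", "acc", "Pass Rate (%)")
    task1 = Task("l1", "acc", "L1 Syntactic (%)")


def to_leaderboard_row(results: dict) -> dict:
    # Rename raw result keys to the human-readable column names.
    return {t.value.col_name: results[t.value.benchmark][t.value.metric]
            for t in Tasks}


row = to_leaderboard_row({"pass_rate": {"acc": 90.8}, "l1": {"acc": 2.1}})
# row == {"Pass Rate (%)": 90.8, "L1 Syntactic (%)": 2.1}
```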
NUM_FEWSHOT = 0  # Change with your few shot


# ---------------------------------------------------
# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">How LLMs Fail and Generalize in RTL Coding for Hardware Design?</h1>"""
CONCLUSION_TEXT = f"""
Evaluations on the VerilogEval Human benchmark reveal a strict empirical ceiling, with frontier models plateauing at a 90.8% initial pass rate.
The solvability taxonomy shows that L3U (Unsolvable) errors dominate across all model families, revealing persistent knowledge gaps that inference-time scaling cannot address.
Our analysis exposes a striking surface convergence gap: optimization drastically reduces syntax errors but concurrently increases functional testbench failures.
Ultimately, register transfer level (RTL) coding capacity relies heavily on pretraining knowledge.
Integrating reward and policy modelling (e.g., GRPO) during the post-training phase amplifies existing competencies by teaching models to produce compilable code; L3S errors remain addressable via best-of-N sampling, while L3U errors require fundamental model improvement.
"""
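The distinction drawn above — L3S errors recoverable by best-of-N sampling versus L3U errors that no amount of resampling fixes — is usually quantified with the standard unbiased pass@k estimator (the formula from the Codex evaluation literature; its use here is an assumption, not code from this Space):

```python
# Hypothetical helper: unbiased pass@k, i.e. the probability that at least one
# of k samples drawn (without replacement) from n rollouts with c correct ones
# passes: 1 - C(n-c, k) / C(n, k).
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n rollouts of which c passed the testbench."""
    if n - c < k:
        # Fewer failures than samples: at least one success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# An L3S problem solved in 2 of 10 rollouts: pass@1 is low, pass@5 recovers it.
print(round(pass_at_k(10, 2, 1), 2))  # 0.2
print(round(pass_at_k(10, 2, 5), 2))  # 0.78
# An L3U problem (c == 0) stays at 0.0 for every k <= n: resampling cannot help.
print(pass_at_k(10, 0, 5))  # 0.0
```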
LLM_BENCHMARKS_TEXT = f"""
## About the Taxonomy
Our four-level error taxonomy evaluates LLM-generated RTL code based on successive stages of the EDA pipeline:
- **L1 Syntactic**: The source string is rejected by the HDL parser. No AST can be constructed.
- **L2 Semantic**: The source string parses into a valid AST but violates at least one static semantic constraint (e.g., detected during elaboration, linting, or synthesis).
- **L3S Functional-Solvable**: The synthesized design fails to meet the specification, but the model has demonstrated the ability to solve the problem in at least one other rollout (addressable via inference-time scaling / best-of-N sampling).
- **L3U Functional-Unsolvable**: The synthesized design fails to meet the specification, and the model cannot solve the problem in any rollout (requires fundamental model improvement).
"""
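The taxonomy above amounts to labeling each rollout by the first pipeline stage it fails, then splitting L3 by problem-level solvability. A minimal sketch of that decision procedure (function and label names are illustrative assumptions, not the actual evaluation code):

```python
# Illustrative classifier for the four-level taxonomy described above.
def classify(parsed: bool, elaborated: bool, passed_tb: bool) -> str:
    """Label a single rollout by the first EDA pipeline stage it fails."""
    if not parsed:
        return "L1"   # rejected by the HDL parser, no AST constructed
    if not elaborated:
        return "L2"   # valid AST, but a static semantic check fails
    return "PASS" if passed_tb else "L3"  # functional testbench outcome


def refine_l3(rollout_labels: list) -> list:
    """Split L3 into L3S/L3U per problem: L3S iff some rollout passed."""
    solvable = "PASS" in rollout_labels
    return [("L3S" if solvable else "L3U") if lab == "L3" else lab
            for lab in rollout_labels]


# Four rollouts of one problem, failing at successive pipeline stages:
labels = [classify(*r) for r in [(False, False, False),
                                 (True, False, False),
                                 (True, True, False),
                                 (True, True, True)]]
print(refine_l3(labels))  # ['L1', 'L2', 'L3S', 'PASS']
```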
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@article{liu2026rtlerror,
  title={How Large Language Models Fail and Generalize to Learn RTL Coding for Digital Circuit Design},
  author={Liu, Guan-Ting and Yang, Chao-Han Huck and Deng, Chenhui and Yu, Zhongzhi and Khailany, Brucek and Wang, Yu-Chiang Frank},
  journal={Under Review},
  year={2026}
}
"""