Danny Liu committed
Commit · 9a205f0
Parent(s): 1ac3f2c
fix content issue

- app.py +2 -1
- src/about.py +1 -8
app.py CHANGED

@@ -41,7 +41,8 @@ with demo:
 
     gr.Image("taxonomy_overview.png", elem_id="taxonomy-img", show_label=False, show_download_button=False)
     gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
-
+
+    gr.Markdown("## Benchmark")
     gr.Markdown("### Model evaluation on VerilogEval-Human V1 benchmark (156 problems, 10 rollouts each)")
     leaderboard = init_leaderboard(LEADERBOARD_DF)
     gr.Markdown(CONCLUSION_TEXT, elem_classes="markdown-text")
src/about.py CHANGED

@@ -26,19 +26,15 @@ NUM_FEWSHOT = 0 # Change with your few shot
 # Your leaderboard name
 TITLE = """<h1 align="center" id="space-title">How LLMs Fail and Generalize in RTL Coding for Hardware Design?</h1>"""
 
-# What does your leaderboard evaluate?
-INTRODUCTION_TEXT = """
-"""
-
 CONCLUSION_TEXT = """
 Evaluations on the VerilogEval Human benchmark reveal a strict empirical ceiling, with frontier models plateauing at a 90.8% initial pass rate.
 The solvability taxonomy exposes that L3U (Unsolvable) errors dominate across all model families, revealing persistent knowledge gaps that inference-time scaling cannot address.
 Our analysis exposes a striking surface convergence gap: optimization drastically reduces syntax errors but concurrently increases functional testbench failures.
 Ultimately, register transfer level (RTL) coding capacity relies heavily upon pretraining knowledge.
 Integrating reward and policy modelling (i.e., GRPO) during the post-training phase amplifies existing competencies by teaching models to compile, while L3S errors (addressable via best-of-N sampling) coexist with L3U errors (requiring model improvement).
+
 """
 
-# Which evaluations are you running? how can people reproduce what you have?
 LLM_BENCHMARKS_TEXT = f"""
 ## About the Taxonomy
 Our four-level error taxonomy evaluates LLM-generated RTL code based on successive stages of the EDA pipeline:

@@ -47,11 +43,8 @@ Our four-level error taxonomy evaluates LLM-generated RTL code based on successive stages of the EDA pipeline:
 - **L3S Functional-Solvable**: The synthesized model fails to meet the design specification, but the model has demonstrated the ability to solve the problem in at least one other rollout (addressable via inference-time scaling / best-of-N sampling).
 - **L3U Functional-Unsolvable**: The synthesized model fails to meet the design specification, and the model cannot solve the problem in any rollout (requires fundamental model improvement).
 
-## Benchmark
 """
 
-EVALUATION_QUEUE_TEXT = """
-"""
 
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 CITATION_BUTTON_TEXT = r"""@article{liu2026rtlerror,