Danny Liu committed on
Commit 9a205f0 · 1 Parent(s): 1ac3f2c

fix content issue

Files changed (2)
  1. app.py +2 -1
  2. src/about.py +1 -8
app.py CHANGED
@@ -41,7 +41,8 @@ with demo:
 
     gr.Image("taxonomy_overview.png", elem_id="taxonomy-img", show_label=False, show_download_button=False)
     gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
-
+
+    gr.Markdown("## Benchmark")
     gr.Markdown("### Model evaluation on VerilogEval-Human V1 benchmark (156 problems, 10 rollouts each)")
     leaderboard = init_leaderboard(LEADERBOARD_DF)
     gr.Markdown(CONCLUSION_TEXT, elem_classes="markdown-text")
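The net effect of this hunk is that the "## Benchmark" heading is now emitted by app.py as its own component rather than carried inside the `LLM_BENCHMARKS_TEXT` string. A gradio-free sketch of the resulting section order (the helper and placeholder strings are illustrative, not the actual app code):

```python
# Illustrative sketch of the markdown section order app.py renders after
# this commit; in the real app each entry is a gr.Markdown(...) call
# inside the `with demo:` block.
def page_sections(llm_benchmarks_text, conclusion_text):
    return [
        llm_benchmarks_text,  # taxonomy text, no longer ending in "## Benchmark"
        "## Benchmark",       # heading added by this commit as a standalone component
        "### Model evaluation on VerilogEval-Human V1 benchmark (156 problems, 10 rollouts each)",
        conclusion_text,
    ]

sections = page_sections(
    "## About the Taxonomy\n...",
    "Evaluations on the VerilogEval Human benchmark ...",
)
```

Keeping the heading in app.py rather than in the shared text constant lets the layout own its own structure, which is what the matching removal in src/about.py below completes.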
src/about.py CHANGED
@@ -26,19 +26,15 @@ NUM_FEWSHOT = 0 # Change with your few shot
 # Your leaderboard name
 TITLE = """<h1 align="center" id="space-title">How LLMs Fail and Generalize in RTL Coding for Hardware Design?</h1>"""
 
-# What does your leaderboard evaluate?
-INTRODUCTION_TEXT = """
-"""
-
 CONCLUSION_TEXT = """
 Evaluations on the VerilogEval Human benchmark reveal a strict empirical ceiling, with frontier models plateauing at a 90.8% initial pass rate.
 The solvability taxonomy exposes that L3U (Unsolvable) errors dominate across all model families, revealing persistent knowledge gaps that inference-time scaling cannot address.
 Our analysis exposes a striking surface convergence gap: optimization drastically reduces syntax errors but concurrently increases functional testbench failures.
 Ultimately, register transfer level (RTL) coding capacity relies heavily upon pretraining knowledge.
 Integrating reward and policy modelling (i.e., GRPO) during the post-training phase amplifies existing competencies by teaching models to compile, while L3S errors (addressable via best-of-N sampling) coexist with L3U errors (requiring model improvement).
+
 """
 
-# Which evaluations are you running? how can people reproduce what you have?
 LLM_BENCHMARKS_TEXT = f"""
 ## About the Taxonomy
 Our four-level error taxonomy evaluates LLM-generated RTL code based on successive stages of the EDA pipeline:
@@ -47,11 +43,8 @@ Our four-level error taxonomy evaluates LLM-generated RTL code based on successive stages of the EDA pipeline:
 - **L3S Functional-Solvable**: The synthesized model fails to meet the design specification, but the model has demonstrated the ability to solve the problem in at least one other rollout (addressable via inference-time scaling / best-of-N sampling).
 - **L3U Functional-Unsolvable**: The synthesized model fails to meet the design specification, and the model cannot solve the problem in any rollout (requires fundamental model improvement).
 
-## Benchmark
 """
 
-EVALUATION_QUEUE_TEXT = """
-"""
 
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 CITATION_BUTTON_TEXT = r"""@article{liu2026rtlerror,
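The L3S/L3U split described in `LLM_BENCHMARKS_TEXT` depends only on whether any of a problem's rollouts passes its testbench. A minimal sketch of that classification rule (the function name is hypothetical, not from this repo):

```python
def classify_functional_failure(rollout_passed):
    """Classify a problem whose sampled RTL fails its testbench.

    rollout_passed: list of booleans, one per rollout (True = passed).
    Returns "L3S" if at least one rollout solved the problem
    (addressable via best-of-N sampling), else "L3U"
    (no rollout solved it; requires fundamental model improvement).
    """
    return "L3S" if any(rollout_passed) else "L3U"
```

With the benchmark's 10 rollouts per problem, a single passing rollout among the ten is enough to mark a functional failure as solvable.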