Spaces: nvidia/LLM_RTL_Errors_Explainer

Danny Liu committed 4 days ago

Commit 26417d8 (1 parent: 77ec02d)

Rearrange text per user request

Files changed (2)

app.py

@@ -5,7 +5,7 @@ import pandas as pd
 from src.about import (
     CITATION_BUTTON_LABEL,
     CITATION_BUTTON_TEXT,
-    INTRODUCTION_TEXT,
+    CONCLUSION_TEXT,
     LLM_BENCHMARKS_TEXT,
     TITLE,
 )
@@ -38,13 +38,13 @@ def init_leaderboard(dataframe):
 demo = gr.Blocks(css=custom_css)
 with demo:
     gr.HTML(TITLE)
-    gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")
     gr.Image("taxonomy_overview.png", elem_id="taxonomy-img", show_label=False, show_download_button=False)
     gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
     gr.Markdown("### Model evaluation on VerilogEval-Human V1 benchmark (156 problems, 10 rollouts each)")
     leaderboard = init_leaderboard(LEADERBOARD_DF)
+    gr.Markdown(CONCLUSION_TEXT, elem_classes="markdown-text")
     gr.Markdown("### Transition Matrices")
     gr.Markdown("The transition matrices below show how errors evolve during the SFT and RL phases, revealing the surface convergence gap where optimization reduces syntax errors but increases functional testbench failures.")
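The Space text describes transition matrices that track how errors evolve across a training phase. The commit does not show how those matrices are built, so the following is only a minimal sketch of one plausible tabulation. The `transition_matrix` helper, the `PASS` label, and the sample data are all hypothetical and are not taken from the repository.

```python
from collections import Counter

# Hypothetical label set: a pass outcome plus the four taxonomy levels named in the Space text.
LABELS = ["PASS", "L1", "L2", "L3S", "L3U"]


def transition_matrix(before, after, labels=LABELS):
    """Count how many samples moved from each 'before' label (row) to each 'after' label (column)."""
    if len(before) != len(after):
        raise ValueError("before and after must be aligned sample by sample")
    counts = Counter(zip(before, after))
    return [[counts[(src, dst)] for dst in labels] for src in labels]


# Made-up outcomes for five samples, before and after a training phase.
before = ["L1", "L1", "L2", "L3U", "PASS"]
after = ["PASS", "L3U", "L3U", "L3U", "PASS"]
matrix = transition_matrix(before, after)
```

In this toy data, one of the two former L1 (syntax) failures becomes a pass and the other becomes an L3U failure, which is the kind of shift from syntax errors to functional failures that the description calls the surface convergence gap.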

src/about.py

@@ -28,9 +28,9 @@ TITLE = """<h1 align="center" id="space-title">How LLMs Fail and Generalize in R
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
-Translating sequential programming priors into the parallel temporal logic of hardware design remains a crucial bottleneck for large language models.
-We introduce a four-level error taxonomy—**L1 syntactic**, **L2 semantic**, **L3S functional-solvable**, and **L3U functional-unsolvable**—where the L3 split is determined by problem-level solvability: whether the model can solve the problem in any rollout.
+"""
+CONCLUSION_TEXT = """
 Evaluations on the VerilogEval Human benchmark reveal a strict empirical ceiling, with frontier models plateauing at a 90.8% initial pass rate.
 The solvability taxonomy exposes that L3U (Unsolvable) errors dominate across all model families, revealing persistent knowledge gaps that inference-time scaling cannot address.
 Our analysis exposes a striking surface convergence gap: optimization drastically reduces syntax errors but concurrently increases functional testbench failures.
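The introduction text touched by this hunk defines the L3 split by problem-level solvability: a functional failure counts as L3S if the model solves the problem in any rollout, and as L3U otherwise. A minimal sketch of that rule follows. The function name, the raw `"L3"` label, and the input layout are assumptions made for illustration, not the repository's code.

```python
def split_functional_failures(rollouts):
    """Relabel raw functional failures as L3S or L3U per problem.

    rollouts: dict mapping a problem id to its list of rollout outcomes,
    each one of "PASS", "L1", "L2", or "L3" (hypothetical raw labels).
    """
    labeled = {}
    for problem, outcomes in rollouts.items():
        # A problem is solvable if at least one rollout passed.
        solvable = any(outcome == "PASS" for outcome in outcomes)
        tag = "L3S" if solvable else "L3U"
        labeled[problem] = [tag if outcome == "L3" else outcome for outcome in outcomes]
    return labeled


# Made-up rollouts: p1 passes once, p2 never passes.
example = {
    "p1": ["L3", "PASS", "L1"],
    "p2": ["L3", "L3", "L2"],
}
result = split_functional_failures(example)
# result["p1"] == ["L3S", "PASS", "L1"]
# result["p2"] == ["L3U", "L3U", "L2"]
```

Because the tag depends on all rollouts of a problem, the same functional failure can move from L3U to L3S when more rollouts are sampled and one of them passes.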