Danny Liu committed on
Commit · b627e0e
Parent(s): 26417d8
Remove duplicate sentences and Benchmark section

src/about.py +0 -4

src/about.py CHANGED
@@ -33,7 +33,6 @@ INTRODUCTION_TEXT = """
 
 CONCLUSION_TEXT = """
 Evaluations on the VerilogEval Human benchmark reveal a strict empirical ceiling, with frontier models plateauing at a 90.8% initial pass rate.
 The solvability taxonomy exposes that L3U (Unsolvable) errors dominate across all model families, revealing persistent knowledge gaps that inference-time scaling cannot address.
-Our analysis exposes a striking surface convergence gap: optimization drastically reduces syntax errors but concurrently increases functional testbench failures.
 Ultimately, register transfer level (RTL) coding capacity relies heavily upon pretraining knowledge.
 Integrating reward and policy modelling (e.g., GRPO) during the post-training phase amplifies existing competencies by teaching models to compile, while L3S errors (addressable via best-of-N sampling) coexist with L3U errors (requiring model improvement).
 """
@@ -46,9 +45,6 @@ Our four-level error taxonomy evaluates LLM-generated RTL code based on successi
 - **L2 Semantic**: The source string parses into a valid AST but violates at least one static semantic constraint (e.g., detected during elaboration, linting, or synthesis).
 - **L3S Functional-Solvable**: The synthesized model fails to meet the design specification, but the model has demonstrated the ability to solve the problem in at least one other rollout (addressable via inference-time scaling / best-of-N sampling).
 - **L3U Functional-Unsolvable**: The synthesized model fails to meet the design specification, and the model cannot solve the problem in any rollout (requires fundamental model improvement).
-
-## Benchmark
-We evaluate models on the **VerilogEval Human** benchmark, which tests the ability of LLMs to generate correct Verilog code from natural language specifications.
 """
 
 EVALUATION_QUEUE_TEXT = """
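The taxonomy described in the diff above can be sketched as a small classifier. This is an illustrative sketch only, not the leaderboard's actual evaluation code: the checker names (`parses`, `elaborates`, `passes_testbench`) are hypothetical stand-ins for a Verilog parser, an elaboration/lint/synthesis pass, and a functional testbench.

```python
def classify_rollouts(rollouts, parses, elaborates, passes_testbench):
    """Label each generated RTL sample by the deepest stage it clears.

    L1  : fails to parse (syntax error)
    L2  : parses but fails elaboration/lint/synthesis (semantic error)
    L3S : functionally wrong, but some sibling rollout passes (solvable)
    L3U : functionally wrong in every rollout (unsolvable)
    PASS: meets the design specification
    """
    # L3S vs. L3U hinges on whether ANY rollout for this problem passes.
    any_pass = any(
        parses(r) and elaborates(r) and passes_testbench(r) for r in rollouts
    )
    labels = []
    for r in rollouts:
        if not parses(r):
            labels.append("L1")
        elif not elaborates(r):
            labels.append("L2")
        elif passes_testbench(r):
            labels.append("PASS")
        else:
            # Functional failure: solvable iff a sibling rollout succeeded.
            labels.append("L3S" if any_pass else "L3U")
    return labels


# Toy usage with stub checkers (real checkers would invoke EDA tools):
labels = classify_rollouts(
    ["bad syntax", "bad types", "wrong logic", "good"],
    parses=lambda r: r != "bad syntax",
    elaborates=lambda r: r != "bad types",
    passes_testbench=lambda r: r == "good",
)
# → ['L1', 'L2', 'L3S', 'PASS']
```

Note how the L3S/L3U split is a property of the whole rollout set, not of a single sample; with no passing rollout, the "wrong logic" sample above would be labeled L3U instead.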