Update about page
- README.md +1 -1
- src/about.py +56 -4
README.md CHANGED
@@ -1,5 +1,5 @@
 ---
-title:
+title: SolidityBench Leaderboard
 emoji: 🧠 🏆
 colorFrom: pink
 colorTo: purple
src/about.py CHANGED
@@ -30,18 +30,70 @@ class Tasks(Enum):
 
 # Your leaderboard name
 TITLE = """<br><img src="file/images/soliditybench.svg" width="500" style="display: block; margin-left: auto; margin-right: auto;">
-<h3 align="center" id="space-title">Solidity Leaderboard
+<h3 align="center" id="space-title">Solidity Leaderboard by IQ</h3>"""
 
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = ""
 
 # Which evaluations are you running? how can people reproduce what you have?
 LLM_BENCHMARKS_TEXT = """
-#
+# SolidityBench: Evaluating LLM Solidity Code Generation
 
-
-To reproduce our results, here is the commands you can run:
+SolidityBench is the first leaderboard for evaluating and ranking the ability of LLMs in Solidity code generation. Developed by BrainDAO as part of [IQ Code](https://iqcode.ai/), which aims to create a suite of AI models designed for generating and auditing smart contract code.
 
+We introduce two benchmarks specifically designed for Solidity: NaïveJudge and HumanEval for Solidity.
+
+## Benchmarks
+
+### 1. NaïveJudge
+
+NaïveJudge is a novel approach to smart contract evaluation, integrating a dataset of audited smart contracts from [OpenZeppelin](https://huggingface.co/datasets/braindao/soliditybench-naive-judge-openzeppelin-v1).
+
+#### Evaluation Process:
+- LLMs implement smart contracts based on detailed specifications.
+- Generated code is compared to audited reference implementations.
+- Evaluation is performed by SOTA LLMs (OpenAI GPT-4 and Claude 3.5 Sonnet) acting as impartial code reviewers.
+
+#### Evaluation Criteria:
+1. Functional Completeness (0-60 points)
+- Implementation of key functionality
+- Handling of edge cases
+- Appropriate error management
+
+2. Solidity Best Practices and Security (0-30 points)
+- Correct and up-to-date Solidity syntax
+- Adherence to best practices and design patterns
+- Appropriate use of data types and visibility modifiers
+- Code structure and maintainability
+
+3. Optimization and Efficiency (0-10 points)
+- Gas efficiency
+- Avoidance of unnecessary computations
+- Storage efficiency
+- Overall performance compared to expert implementation
+
+The final score ranges from 0 to 100, calculated by summing the points from each criterion.
+
+### 2. HumanEval for Solidity
+
+[HumanEval for Solidity](https://huggingface.co/datasets/braindao/humaneval-for-solidity-25) is an adaptation of OpenAI's original HumanEval benchmark, ported from Python to Solidity.
+
+#### Dataset:
+- 25 tasks of varying difficulty
+- Each task includes corresponding tests designed for use with Hardhat
+
+#### Evaluation Process:
+- Custom server built on top of Hardhat compiles and tests the generated Solidity code
+- Evaluates the AI model's ability to produce fully functional smart contracts
+
+#### Metrics:
+1. pass@1 (Score: 0-100)
+- Measures the model's success on the first attempt
+- Assesses precision and efficiency
+
+2. pass@3 (Score: 0-100)
+- Allows up to three attempts at solving each task
+- Provides insights into the model's problem-solving capabilities over multiple tries
 """
 
 EVALUATION_REQUESTS_TEXT = """
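
The NaïveJudge scoring scheme described in the new `LLM_BENCHMARKS_TEXT` (three criteria worth 60, 30, and 10 points, summed to a 0-100 total) can be illustrated with a minimal sketch. This is not the Space's actual evaluation code: the judge prompt, the JSON reply format, and the `query_judge` helper are assumptions made for the example.

```python
# Illustrative sketch of NaïveJudge-style scoring (not the leaderboard's actual code).
# Assumption: `query_judge` is a hypothetical callable that sends a prompt to a judge
# model (e.g. GPT-4 or Claude 3.5 Sonnet) and returns its text response.
import json

CRITERIA_MAX = {
    "functional_completeness": 60,  # key functionality, edge cases, error management
    "best_practices_security": 30,  # syntax, patterns, visibility, maintainability
    "optimization_efficiency": 10,  # gas, storage, unnecessary computation
}

JUDGE_PROMPT = """You are an impartial Solidity code reviewer.
Specification:
{spec}

Audited reference implementation:
{reference}

Candidate implementation:
{candidate}

Reply with JSON:
{{"functional_completeness": 0-60, "best_practices_security": 0-30, "optimization_efficiency": 0-10}}"""


def score_candidate(spec: str, reference: str, candidate: str, query_judge) -> int:
    """Ask a judge LLM for per-criterion scores and sum them into a 0-100 total."""
    reply = query_judge(JUDGE_PROMPT.format(spec=spec, reference=reference, candidate=candidate))
    scores = json.loads(reply)
    total = 0
    for criterion, max_points in CRITERIA_MAX.items():
        # Clamp each criterion to its allowed range before summing.
        total += max(0, min(int(scores.get(criterion, 0)), max_points))
    return total  # 0-100
```

The diff does not say how the scores from the two judge models are combined (for example, averaged), so that step is left out of the sketch.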
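The pass@1 and pass@3 metrics for HumanEval for Solidity amount to counting, per task, whether any of the first k generated attempts compiles and passes its Hardhat tests. The custom Hardhat server is not part of this diff, so the sketch below simply shells out to `npx hardhat test`; the project layout, file names, and helper names are assumptions.

```python
# Illustrative sketch of pass@1 / pass@3 scoring for HumanEval for Solidity
# (not the leaderboard's actual harness). Assumption: each task directory is a
# Hardhat project whose tests exercise the generated contract.
import subprocess
from pathlib import Path


def run_hardhat_tests(project_dir: Path, contract_source: str) -> bool:
    """Write one generated attempt into the project and run its Hardhat tests.

    The contracts/Solution.sol path is an assumption for this sketch.
    """
    (project_dir / "contracts" / "Solution.sol").write_text(contract_source)
    result = subprocess.run(
        ["npx", "hardhat", "test"],
        cwd=project_dir,
        capture_output=True,
        text=True,
    )
    # Compilation or test failure yields a non-zero exit code.
    return result.returncode == 0


def pass_at_k(attempt_results: list[list[bool]], k: int) -> float:
    """Fraction of tasks solved within the first k attempts, scaled to 0-100.

    `attempt_results[i]` holds pass/fail flags for each attempt on task i.
    """
    solved = sum(1 for attempts in attempt_results if any(attempts[:k]))
    return 100.0 * solved / len(attempt_results)


# Hypothetical usage: collect up to three attempts per task, then report both scores.
# results = [[run_hardhat_tests(task_dir, attempt) for attempt in attempts]
#            for task_dir, attempts in generations]
# print("pass@1:", pass_at_k(results, 1), "pass@3:", pass_at_k(results, 3))
```

With 25 tasks on a 0-100 scale, each solved task contributes 4 points to the reported score.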