```

# Evaluation Metrics

To evaluate the performance of our fine-tuned LLM specialized in Solidity smart contract generation, we used **[Slither](https://github.com/crytic/slither)**, a widely used static analysis framework for Solidity. In addition, we combined automated LLM-based assessments with expert human evaluations for a comprehensive benchmark.
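As a sketch of how such Slither findings might be aggregated (our exact reporting pipeline is not shown here), the following tallies detector results by severity. The `sample_report` is hypothetical but follows the `results.detectors` shape of Slither's `--json` output:

```python
from collections import Counter

def tally_findings(report: dict) -> Counter:
    """Count Slither detector findings by severity (impact) level."""
    detectors = report.get("results", {}).get("detectors", [])
    return Counter(d.get("impact", "Unknown") for d in detectors)

# Hypothetical report in the shape produced by `slither Contract.sol --json -`;
# a real run would json.load() the file Slither writes.
sample_report = {
    "results": {
        "detectors": [
            {"check": "reentrancy-eth", "impact": "High", "confidence": "Medium"},
            {"check": "timestamp", "impact": "Low", "confidence": "Medium"},
            {"check": "naming-convention", "impact": "Informational", "confidence": "High"},
        ]
    }
}

print(dict(tally_findings(sample_report)))  # → {'High': 1, 'Low': 1, 'Informational': 1}
```

Summing counts per severity level across all generated contracts gives the vulnerability figures reported below.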

We focused on six key evaluation criteria:

- **Security Vulnerabilities**
We analyzed each contract for known security vulnerabilities using Slither’s built-in detectors. We recorded the number and severity of the vulnerabilities detected, providing a measure of the security quality of the model’s outputs.

- **Average Lines of Code (LOC)**
Captures the average number of lines per generated contract, excluding blank lines but including comments. This metric reflects code verbosity or conciseness, and helps gauge implementation completeness versus potential redundancy.

- **Correctness (OpenAI Evaluation)**
Evaluates how accurately the generated contract matches the intended prompt using GPT-4o Mini. Prompts and outputs are scored against a structured rubric, providing a scalable LLM-based perspective on prompt alignment.

- **Correctness (Human Evaluation)**
Involves manual review by a blockchain expert to assess how well the output satisfies the original prompt and category. This provides human-validated insight into the practical applicability and quality of the generated code.
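The Average Lines of Code metric above can be sketched in a few lines, counting non-blank lines (comments included) and averaging over a batch of generated contracts; the sample contract is illustrative:

```python
def loc(source: str) -> int:
    """Lines of code: blank lines excluded, comment lines counted (per the metric above)."""
    return sum(1 for line in source.splitlines() if line.strip())

def average_loc(contracts: list[str]) -> float:
    """Average LOC across a batch of generated contracts."""
    return sum(loc(c) for c in contracts) / len(contracts) if contracts else 0.0

contract_a = """\
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

contract Counter {
    uint256 public count;

    function increment() external {
        count += 1;
    }
}
"""

print(loc(contract_a))  # → 8 (two blank lines excluded, the // comment counts)
```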

These metrics collectively provide a multi-dimensional view of the model’s effectiveness, spanning correctness, efficiency, security, and usability. They are designed to reflect both automated benchmarks and real-world developer expectations.

# Summary