```

# Evaluation Metrics

To evaluate the performance of our fine-tuned LLM specialized in Solidity smart contract generation, we used **[Slither](https://github.com/crytic/slither)**, a widely used static analysis framework for Solidity. In addition, we combined automated LLM-based assessments with expert human evaluations for a comprehensive benchmark.
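As a sketch of how such Slither findings might be aggregated (our exact reporting pipeline is not shown here), the following tallies detector results by severity. The `sample_report` is hypothetical but follows the `results.detectors` shape of Slither's `--json` output:

```python
from collections import Counter

def tally_findings(report: dict) -> Counter:
    """Count Slither detector findings by severity (impact) level."""
    detectors = report.get("results", {}).get("detectors", [])
    return Counter(d.get("impact", "Unknown") for d in detectors)

# Hypothetical report in the shape produced by `slither Contract.sol --json -`;
# a real run would json.load() the file Slither writes.
sample_report = {
    "results": {
        "detectors": [
            {"check": "reentrancy-eth", "impact": "High", "confidence": "Medium"},
            {"check": "timestamp", "impact": "Low", "confidence": "Medium"},
            {"check": "naming-convention", "impact": "Informational", "confidence": "High"},
        ]
    }
}

print(dict(tally_findings(sample_report)))  # → {'High': 1, 'Low': 1, 'Informational': 1}
```

Summing counts per severity level across all generated contracts gives the vulnerability figures reported below.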

We focused on six key evaluation criteria:

- **Security Vulnerabilities**
We analyzed each contract for known security vulnerabilities using Slither’s built-in detectors. We recorded the number and severity of the vulnerabilities detected, providing a measure of the security quality of the model’s outputs.

- **Average Lines of Code (LOC)**
Captures the average number of lines per generated contract, excluding blank lines but including comments. This metric reflects code verbosity or conciseness, and helps gauge implementation completeness versus potential redundancy.

- **Correctness (OpenAI Evaluation)**
Evaluates how accurately the generated contract matches the intended prompt using GPT-4o Mini. Prompts and outputs are scored against a structured rubric, providing a scalable LLM-based perspective on prompt alignment.

- **Correctness (Human Evaluation)**
Involves manual review by a blockchain expert to assess how well the output satisfies the original prompt and category. This provides human-validated insight into the practical applicability and quality of the generated code.
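The Average Lines of Code metric above can be sketched in a few lines, counting non-blank lines (comments included) and averaging over a batch of generated contracts; the sample contract is illustrative:

```python
def loc(source: str) -> int:
    """Lines of code: blank lines excluded, comment lines counted (per the metric above)."""
    return sum(1 for line in source.splitlines() if line.strip())

def average_loc(contracts: list[str]) -> float:
    """Average LOC across a batch of generated contracts."""
    return sum(loc(c) for c in contracts) / len(contracts) if contracts else 0.0

contract_a = """\
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

contract Counter {
    uint256 public count;

    function increment() external {
        count += 1;
    }
}
"""

print(loc(contract_a))  # → 8 (two blank lines excluded, the // comment counts)
```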

These metrics collectively provide a multi-dimensional view of the model’s effectiveness, spanning correctness, efficiency, security, and usability. They are designed to reflect both automated benchmarks and real-world developer expectations.

# Summary