---
language: en
license: mit
tags:
- code
- solidity
- smart-contracts
- security
- vulnerability-detection
- blockchain
- ethereum
- defi
datasets:
- smartbugs-curated
- solidifi-benchmark
- defihacklabs
- not-so-smart-contracts
metrics:
- f1
base_model: microsoft/codebert-base
---
# trustchainai-codebert
**Fine-tuned CodeBERT for Solidity smart contract vulnerability detection.**
Part of the [TrustChainAI](https://github.com/emekaphilian/TrustChainAI) project, an AI-powered smart contract auditor with explainability and ethics monitoring, built to make blockchain security accessible to African and emerging-market Web3 ecosystems.
---
## Model Performance
| Metric | Score |
|---|---|
| **F1 (weighted, test set)** | **98.6%** |
| Eval Loss | 0.0428 |
| Test Samples | 1,032 contracts |
| Classes | 14 labels (13 vulnerability categories plus `safe`) |
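
The headline metric is a weighted F1: per-class F1 scores averaged with each class weighted by its number of true samples. The sketch below assumes the standard definition (the same one scikit-learn uses for `average="weighted"`); the project's own evaluation code is not shown in this card, and the toy labels are illustrative only.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is >= n >= 1 here.
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += f1 * n / total
    return score


# Toy labels for illustration; these are not model outputs.
y_true = ["safe", "safe", "reentrancy", "reentrancy"]
y_pred = ["safe", "reentrancy", "reentrancy", "reentrancy"]
print(round(weighted_f1(y_true, y_pred), 3))
```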
---
## How to Use
```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="emekaphilians/trustchainai-codebert",
)

contract = """
pragma solidity ^0.8.0;
contract Vulnerable {
    mapping(address => uint) public balances;

    function withdraw() external {
        uint amt = balances[msg.sender];
        (bool ok,) = msg.sender.call{value: amt}("");
        balances[msg.sender] = 0;
    }
}
"""

# Truncate by tokens (the model's 512-token limit), not by characters.
result = classifier(contract, truncation=True, max_length=512)
print(result)
# Example output: [{'label': 'reentrancy', 'score': 0.997}]
```
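
The model was trained with a maximum length of 512 tokens, so it never sees anything past that window in a longer contract. One possible workaround, which this card does not describe and which is only a sketch, is to split the token IDs into overlapping windows, classify each window separately, and report the most severe or most confident result. The helper name `token_windows`, the stride of 128, and the window size of 510 (512 minus the two special tokens a RoBERTa-style tokenizer adds) are illustrative assumptions. In practice the token IDs would come from the model's tokenizer rather than from `range`.

```python
def token_windows(token_ids, max_len=510, stride=128):
    """Split token ids into overlapping windows of at most max_len tokens.

    Consecutive windows share `stride` tokens so that a vulnerable pattern
    cut by one window boundary still appears whole in a neighbouring window.
    """
    if max_len <= stride:
        raise ValueError("max_len must be larger than stride")
    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += step
    return windows


# Stand-in for real token ids: a 1,000-token contract.
fake_ids = list(range(1000))
for window in token_windows(fake_ids):
    print(window[0], window[-1], len(window))
```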
---
## Label Schema
| ID | Label | Description |
|---|---|---|
| 0 | safe | No vulnerability detected |
| 1 | reentrancy | Reentrancy attack (DAO-style) |
| 2 | integer_overflow | Arithmetic overflow / underflow |
| 3 | access_control | Unprotected ownership or selfdestruct |
| 4 | tx_origin_phishing | tx.origin used for authentication |
| 5 | dos_gas | Unbounded loop / gas exhaustion |
| 6 | unchecked_call | External call return value ignored |
| 7 | front_running_mev | Mempool-visible state / transaction-ordering dependence (TOD) |
| 8 | timestamp_dependence | block.timestamp manipulation |
| 9 | proxy_storage_collision | Delegatecall storage slot collision |
| 10 | flash_loan_oracle | Oracle price manipulation via flash loan |
| 11 | flash_loan_single_block | Single-block liquidity attack |
| 12 | misnamed_constructor | Pre-Solidity-0.5 constructor naming bug |
| 13 | other | Multi-class or miscellaneous vulnerability |
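
For code that works with raw class indices (for example, after an `argmax` over logits), the table above can be mirrored as a plain mapping. This sketch assumes the model's configuration uses the same ID order as the table; the authoritative mapping is whatever `model.config.id2label` reports for the downloaded checkpoint.

```python
# Mirrors the Label Schema table; verify against model.config.id2label.
ID2LABEL = {
    0: "safe",
    1: "reentrancy",
    2: "integer_overflow",
    3: "access_control",
    4: "tx_origin_phishing",
    5: "dos_gas",
    6: "unchecked_call",
    7: "front_running_mev",
    8: "timestamp_dependence",
    9: "proxy_storage_collision",
    10: "flash_loan_oracle",
    11: "flash_loan_single_block",
    12: "misnamed_constructor",
    13: "other",
}

# Reverse lookup from label name to class index.
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

print(ID2LABEL[1], LABEL2ID["unchecked_call"])
```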
---
## Training Data
Assembled from four open-source datasets plus synthetic augmentation, using the [prepare_datasets.py](https://github.com/emekaphilian/TrustChainAI/blob/main/TrustChainAi/scripts/prepare_datasets.py) pipeline:
| Source | Contracts |
|---|---|
| SmartBugs Curated | 143 |
| SolidiFI Benchmark | 1,700 |
| DeFiHackLabs | 729 |
| Not-So-Smart Contracts | 25 |
| Synthetic augmentation | 3,600 |
| **Total (after dedup)** | **6,879** |
Split: 70% train / 15% val / 15% test (stratified by label).
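
A stratified split keeps each label's share roughly equal across train, validation, and test. The sketch below shows one way to do a 70/15/15 stratified split in plain Python; it is not taken from `prepare_datasets.py`, and the function name, seed, and rounding rule are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists, split separately per label."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to test
    return train, val, test


# Toy label list: 100 "safe" contracts and 20 "reentrancy" contracts.
toy_labels = ["safe"] * 100 + ["reentrancy"] * 20
train_idx, val_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))
```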
---
## Training Details
| Parameter | Value |
|---|---|
| Base model | microsoft/codebert-base |
| Epochs | 5 (best checkpoint at epoch 2) |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Optimizer | AdamW (weight decay 0.01) with 100 warmup steps |
| Max token length | 512 |
| Mixed precision | fp16 |
| Hardware | Google Colab T4 GPU |
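
To put the 100 warmup steps in context, the hyperparameters above can be turned into approximate step counts. The arithmetic below assumes a single GPU with no gradient accumulation and a training set of about 70% of the 6,879 contracts; the exact training-set size depends on the stratified split. The `warmup_lr` helper assumes linear warmup and ignores whatever decay schedule follows it, which this card does not specify.

```python
import math

TOTAL_CONTRACTS = 6879  # from the Training Data table
TRAIN_FRACTION = 0.70
BATCH_SIZE = 16
EPOCHS = 5
WARMUP_STEPS = 100
PEAK_LR = 2e-5

# Approximate: the exact count depends on per-label rounding in the split.
train_size = int(TOTAL_CONTRACTS * TRAIN_FRACTION)
steps_per_epoch = math.ceil(train_size / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def warmup_lr(step):
    """Learning rate under linear warmup (assumed), capped at the peak."""
    return PEAK_LR * min(1.0, step / WARMUP_STEPS)


print(train_size, steps_per_epoch, total_steps)
print(f"warmup covers {WARMUP_STEPS / total_steps:.1%} of training")
```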
---
## Intended Use
- Pre-deployment security screening of Solidity smart contracts
- Automated vulnerability triage for DeFi protocols
- Research baseline for smart contract security ML benchmarks
- Integration into the TrustChainAI multi-agent audit pipeline
## Out-of-Scope Use
- This model is **not a substitute** for a full professional security audit on high-value contracts
- Performance on Vyper, Yul, or non-EVM contracts is untested
- The `tx_origin_phishing` class has limited real training samples (28); treat predictions for this class with extra caution
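
One way to act on these caveats in a triage pipeline is to route uncertain predictions to a human auditor. The sketch below is a hypothetical policy, not part of TrustChainAI: the function name, the 0.90 threshold, and the rule that low-support classes are always escalated are assumptions chosen for illustration. It expects the `{'label': ..., 'score': ...}` dictionaries returned by the text-classification pipeline.

```python
# Classes with few real training samples, per this model card.
LOW_SUPPORT_LABELS = {"tx_origin_phishing"}


def needs_manual_review(prediction, min_score=0.90):
    """Return True when a prediction should be escalated to a human auditor.

    `min_score` is an illustrative threshold, not a calibrated value.
    """
    if prediction["label"] in LOW_SUPPORT_LABELS:
        return True
    return prediction["score"] < min_score


examples = [
    {"label": "reentrancy", "score": 0.997},
    {"label": "tx_origin_phishing", "score": 0.99},
    {"label": "safe", "score": 0.62},
]
for pred in examples:
    print(pred["label"], needs_manual_review(pred))
```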
---
## Limitations & Bias
- Synthetic augmentation was used for 9 of 13 classes to compensate for dataset scarcity. Synthetic contracts may not fully capture real-world obfuscation patterns.
- The `tx_origin_phishing` class had only 28 real-world training samples, so predictions for this class may be less reliable in practice.
- Training data skews toward older Solidity vulnerability patterns (pre-0.8). Newer attack vectors may be underrepresented.
---
## Citation
```bibtex
@misc{trustchainai2025,
author = {Emeka Philian},
title = {TrustChainAI: AI-Powered Smart Contract Auditor},
year = {2025},
url = {https://github.com/emekaphilian/TrustChainAI}
}
```
---
## Links
- 🔗 GitHub: [emekaphilian/TrustChainAI](https://github.com/emekaphilian/TrustChainAI)
- 🤗 Profile: [emekaphilians](https://huggingface.co/emekaphilians)
- 📄 Architecture: [docs/ARCHITECTURE.md](https://github.com/emekaphilian/TrustChainAI/blob/main/docs/ARCHITECTURE.md)