---
language: en
license: mit
tags:
- code
- solidity
- smart-contracts
- security
- vulnerability-detection
- blockchain
- ethereum
- defi
datasets:
- smartbugs-curated
- solidifi-benchmark
- defihacklabs
- not-so-smart-contracts
metrics:
- f1
base_model: microsoft/codebert-base
---

# trustchainai-codebert

**Fine-tuned CodeBERT for Solidity smart contract vulnerability detection.**

Part of the [TrustChainAI](https://github.com/emekaphilian/TrustChainAI) project, an AI-powered smart contract auditor with explainability and ethics monitoring, built to make blockchain security accessible to African and emerging-market Web3 ecosystems.

---

## Model Performance

| Metric | Score |
|---|---|
| **F1 (weighted, test set)** | **98.6%** |
| Eval Loss | 0.0428 |
| Test Samples | 1,032 contracts |
| Classes | 13 vulnerability categories, plus `safe` |

---

## How to Use

```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="emekaphilians/trustchainai-codebert"
)

contract = """
pragma solidity ^0.8.0;

contract Vulnerable {
    mapping(address => uint) public balances;

    function withdraw() external {
        uint amt = balances[msg.sender];
        // External call happens before the balance is zeroed (reentrancy).
        (bool ok,) = msg.sender.call{value: amt}("");
        balances[msg.sender] = 0;
    }
}
"""

# Truncate to the model's 512-token limit; slicing the string would cut
# by characters, not tokens.
result = classifier(contract, truncation=True, max_length=512)
print(result)
# Example output: [{'label': 'reentrancy', 'score': 0.997}]
```

---

## Label Schema

| ID | Label | Description |
|---|---|---|
| 0 | safe | No vulnerability detected |
| 1 | reentrancy | Reentrancy attack (DAO-style) |
| 2 | integer_overflow | Arithmetic overflow / underflow |
| 3 | access_control | Unprotected ownership or selfdestruct |
| 4 | tx_origin_phishing | tx.origin used for authentication |
| 5 | dos_gas | Unbounded loop / gas exhaustion |
| 6 | unchecked_call | External call return value ignored |
| 7 | front_running_mev | Mempool-visible state / transaction-ordering dependence (TOD) |
| 8 | timestamp_dependence | block.timestamp manipulation |
| 9 | proxy_storage_collision | Delegatecall storage slot collision |
| 10 | flash_loan_oracle | Oracle price manipulation via flash loan |
| 11 | flash_loan_single_block | Single-block liquidity attack |
| 12 | misnamed_constructor | Pre-Solidity-0.5 constructor naming bug |
| 13 | other | Multi-class or miscellaneous vulnerability |

---

## Training Data

Assembled from four open-source datasets, plus synthetic augmentation, using the [prepare_datasets.py](https://github.com/emekaphilian/TrustChainAI/blob/main/TrustChainAi/scripts/prepare_datasets.py) pipeline:

| Source | Contracts |
|---|---|
| SmartBugs Curated | 143 |
| SolidiFI Benchmark | 1,700 |
| DeFiHackLabs | 729 |
| Not-So-Smart Contracts | 25 |
| Synthetic augmentation | 3,600 |
| **Total (after dedup)** | **6,879** |

Split: 70% train / 15% val / 15% test (stratified by label).

---

## Training Details

| Parameter | Value |
|---|---|
| Base model | microsoft/codebert-base |
| Epochs | 5 (best checkpoint at epoch 2) |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Optimizer | AdamW (weight decay 0.01, warmup 100 steps) |
| Max token length | 512 |
| Mixed precision | fp16 |
| Hardware | Google Colab T4 GPU |

---

## Intended Use

- Pre-deployment security screening of Solidity smart contracts
- Automated vulnerability triage for DeFi protocols
- Research baseline for smart contract security ML benchmarks
- Integration into the TrustChainAI multi-agent audit pipeline

## Out-of-Scope Use

- This model is **not a substitute** for a full professional security audit on high-value contracts
- Performance on Vyper, Yul, or non-EVM contracts is untested
- The `tx_origin_phishing` class has limited real training samples (28); treat predictions for this class with extra caution

---

## Limitations & Bias

- Synthetic augmentation was used for 9 of 13 classes to compensate for dataset scarcity. Synthetic contracts may not fully capture real-world obfuscation patterns.
- The `tx_origin_phishing` class had only 28 real-world training samples; model confidence for this class may be lower in practice.
- Training data skews toward older Solidity vulnerability patterns (pre-0.8). Newer attack vectors may be underrepresented.

---

## Citation

```bibtex
@misc{trustchainai2025,
  author = {Emeka Philian},
  title = {TrustChainAI: AI-Powered Smart Contract Auditor},
  year = {2025},
  url = {https://github.com/emekaphilian/TrustChainAI}
}
```

---

## Links

- 🔗 GitHub: [emekaphilian/TrustChainAI](https://github.com/emekaphilian/TrustChainAI)
- 🤗 Profile: [emekaphilians](https://huggingface.co/emekaphilians)
- 📄 Architecture: [docs/ARCHITECTURE.md](https://github.com/emekaphilian/TrustChainAI/blob/main/docs/ARCHITECTURE.md)
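
---

## Example: Flagging Predictions for Manual Review

The caution about the `tx_origin_phishing` class under Limitations & Bias can be turned into a small post-processing step. The sketch below is one possible approach, not part of the TrustChainAI pipeline: the `triage` helper, the `LOW_SUPPORT_LABELS` set, and the `min_score` threshold of 0.90 are all hypothetical names and values chosen for illustration. It assumes predictions arrive as a list of `{'label', 'score'}` dictionaries, the shape returned by a `text-classification` pipeline.

```python
from typing import Dict, List

# Classes this model card describes as having few real training samples.
LOW_SUPPORT_LABELS = {"tx_origin_phishing"}


def triage(predictions: List[Dict], min_score: float = 0.90) -> List[Dict]:
    """Add a `needs_review` flag to each prediction.

    `min_score` is an illustrative threshold, not a value from the model card.
    """
    annotated = []
    for pred in predictions:
        low_support = pred["label"] in LOW_SUPPORT_LABELS
        low_confidence = pred["score"] < min_score
        # Flag anything from a low-support class or below the threshold.
        annotated.append({**pred, "needs_review": low_support or low_confidence})
    return annotated


# Hand-written dictionaries in the pipeline's output shape (not real model output).
sample = [
    {"label": "reentrancy", "score": 0.997},
    {"label": "tx_origin_phishing", "score": 0.98},
    {"label": "dos_gas", "score": 0.55},
]
for row in triage(sample):
    print(row["label"], row["needs_review"])
```

With these sample inputs, only the `reentrancy` row passes without a review flag: `tx_origin_phishing` is flagged regardless of its score, and `dos_gas` falls below the threshold. A suitable threshold would need to be tuned on held-out data.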