	---
	language: en
	license: mit
	tags:
	- code
	- solidity
	- smart-contracts
	- security
	- vulnerability-detection
	- blockchain
	- ethereum
	- defi
	datasets:
	- smartbugs-curated
	- solidifi-benchmark
	- defihacklabs
	- not-so-smart-contracts
	metrics:
	- f1
	base_model: microsoft/codebert-base
	---

	# trustchainai-codebert

	Fine-tuned CodeBERT for Solidity smart contract vulnerability detection.

	Part of the [TrustChainAI](https://github.com/emekaphilian/TrustChainAI) project, an AI-powered smart contract auditor with explainability and ethics monitoring. It is built to make blockchain security accessible to African and emerging-market Web3 ecosystems.

	---

	## Model Performance

	| Metric | Score |
	|---|---|
	| F1 (weighted, test set) | 98.6% |
	| Eval Loss | 0.0428 |
	| Test Samples | 1,032 contracts |
	| Classes | 13 vulnerability categories |
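
For readers unfamiliar with the metric, weighted F1 is the mean of the per-class F1 scores, with each class weighted by its number of true samples. The sketch below illustrates that definition in plain Python. It is not the project's evaluation code, and the function name `weighted_f1` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = (2 * tp / denom) if denom else 0.0
        score += f1 * n_true / total
    return score


# Toy data (hypothetical, not drawn from the test set).
y_true = ["safe", "safe", "reentrancy", "reentrancy", "dos_gas"]
y_pred = ["safe", "reentrancy", "reentrancy", "reentrancy", "dos_gas"]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.787
```

Because the weights follow class frequency, a high weighted F1 can coexist with weak performance on rare classes, so per-class scores are worth checking as well.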

	---

	## How to Use

	```python
	from transformers import pipeline

	classifier = pipeline(
	    "text-classification",
	    model="emekaphilians/trustchainai-codebert",
	)

	contract = """
	pragma solidity ^0.8.0;
	contract Vulnerable {
	    mapping(address => uint) public balances;
	    function withdraw() external {
	        uint amt = balances[msg.sender];
	        (bool ok,) = msg.sender.call{value: amt}("");
	        balances[msg.sender] = 0;
	    }
	}
	"""

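	# Truncate by tokens (the model's limit is 512 tokens), not by characters.
	result = classifier(contract, truncation=True, max_length=512)
	print(result)
	# Example output: [{'label': 'reentrancy', 'score': 0.997}]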
	```
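
Real contracts often exceed the 512-token limit. One common workaround is to split the token sequence into overlapping windows and classify each window separately. This card does not describe how TrustChainAI handles long inputs, so the sketch below is only one possible approach. The function name `sliding_windows`, the window size of 510 (leaving room for the two special tokens a RoBERTa-style tokenizer adds), and the overlap of 128 are all assumptions.

```python
def sliding_windows(token_ids, window=510, overlap=128):
    """Split a token-id list into overlapping windows of at most `window` ids."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window reached the end of the sequence
        start += step
    return windows


# Stand-in for real token ids. In practice they would come from the model's
# tokenizer, e.g. tokenizer(contract, add_special_tokens=False)["input_ids"].
fake_ids = list(range(1200))
chunks = sliding_windows(fake_ids)
print([len(c) for c in chunks])  # [510, 510, 436]
```

How per-window predictions are combined (for example, reporting the highest-scoring label other than `safe`) is a separate design choice that this sketch leaves open.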

	---

	## Label Schema

	| ID | Label | Description |
	|---|---|---|
	| 0 | safe | No vulnerability detected |
	| 1 | reentrancy | Reentrancy attack (DAO-style) |
	| 2 | integer_overflow | Arithmetic overflow / underflow |
	| 3 | access_control | Unprotected ownership or selfdestruct |
	| 4 | tx_origin_phishing | tx.origin used for authentication |
	| 5 | dos_gas | Unbounded loop / gas exhaustion |
	| 6 | unchecked_call | External call return value ignored |
	| 7 | front_running_mev | Mempool-visible state / TOD |
	| 8 | timestamp_dependence | block.timestamp manipulation |
	| 9 | proxy_storage_collision | Delegatecall storage slot collision |
	| 10 | flash_loan_oracle | Oracle price manipulation via flash loan |
	| 11 | flash_loan_single_block | Single-block liquidity attack |
	| 12 | misnamed_constructor | Pre-Solidity-0.5 constructor naming bug |
	| 13 | other | Multi-class or miscellaneous vulnerability |
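
When working with raw logits rather than the pipeline, the table can be mirrored as a lookup. The names `ID2LABEL`, `LABEL2ID`, and `decode_prediction` below are illustrative. The mapping stored in the model's own `config.json` is authoritative and should be checked before relying on this copy.

```python
# Mirror of the label table above (illustrative copy, not read from the model).
ID2LABEL = {
    0: "safe",
    1: "reentrancy",
    2: "integer_overflow",
    3: "access_control",
    4: "tx_origin_phishing",
    5: "dos_gas",
    6: "unchecked_call",
    7: "front_running_mev",
    8: "timestamp_dependence",
    9: "proxy_storage_collision",
    10: "flash_loan_oracle",
    11: "flash_loan_single_block",
    12: "misnamed_constructor",
    13: "other",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode_prediction(logits):
    """Return the label name of the highest-scoring class in a list of logits."""
    if len(logits) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} logits, got {len(logits)}")
    best = max(range(len(logits)), key=lambda i: logits[i])
    return ID2LABEL[best]


example_logits = [0.1, 4.2] + [0.0] * 12  # hypothetical scores
print(decode_prediction(example_logits))  # reentrancy
```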

	---

	## Training Data

	Assembled from four open-source datasets plus synthetic augmentation, using the [prepare_datasets.py](https://github.com/emekaphilian/TrustChainAI/blob/main/TrustChainAi/scripts/prepare_datasets.py) pipeline:

	| Source | Contracts |
	|---|---|
	| SmartBugs Curated | 143 |
	| SolidiFI Benchmark | 1,700 |
	| DeFiHackLabs | 729 |
	| Not-So-Smart Contracts | 25 |
	| Synthetic augmentation | 3,600 |
	| Total (after dedup) | 6,879 |

	Split: 70% train / 15% val / 15% test (stratified by label).
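
A stratified split keeps each label's share roughly equal across the three subsets, which matters for rare classes. The actual implementation lives in `prepare_datasets.py`. The version below is a minimal plain-Python sketch of the idea, with a hypothetical function name, an assumed seed of 42, and toy labels.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists with per-label proportions preserved."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_train = round(len(idxs) * fractions[0])
        n_val = round(len(idxs) * fractions[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy label list (hypothetical counts).
labels = ["safe"] * 100 + ["reentrancy"] * 40 + ["dos_gas"] * 20
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 112 24 24
```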

	---

	## Training Details

	| Parameter | Value |
	|---|---|
	| Base model | microsoft/codebert-base |
	| Epochs | 5 (best checkpoint at epoch 2) |
	| Batch size | 16 |
	| Learning rate | 2e-5 |
	| Optimizer | AdamW (weight decay 0.01, warmup 100 steps) |
	| Max token length | 512 |
	| Mixed precision | fp16 |
	| Hardware | Google Colab T4 GPU |
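
To put the 100 warmup steps in context, the arithmetic below estimates the number of optimizer steps implied by these settings. It assumes a training set of 70% of the 6,879 contracts, a single GPU, and no gradient accumulation. The linear decay after warmup is also an assumption, since the card states only the warmup length.

```python
import math

TOTAL_CONTRACTS = 6879   # from the Training Data table
TRAIN_FRACTION = 0.70    # from the stated split
BATCH_SIZE = 16
EPOCHS = 5
PEAK_LR = 2e-5
WARMUP_STEPS = 100

train_size = round(TOTAL_CONTRACTS * TRAIN_FRACTION)
steps_per_epoch = math.ceil(train_size / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step):
    """Linear warmup to PEAK_LR, then linear decay to zero (decay shape assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / (total_steps - WARMUP_STEPS)


print(train_size, steps_per_epoch, total_steps)  # 4815 301 1505
print(lr_at(50), lr_at(WARMUP_STEPS), lr_at(total_steps))
```

Under these assumptions, warmup covers roughly the first third of the first epoch.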

	---

	## Intended Use

	- Pre-deployment security screening of Solidity smart contracts
	- Automated vulnerability triage for DeFi protocols
	- Research baseline for smart contract security ML benchmarks
	- Integration into the TrustChainAI multi-agent audit pipeline

	## Out-of-Scope Use

	- This model is not a substitute for a full professional security audit on high-value contracts
	- Performance on Vyper, Yul, or non-EVM contracts is untested
	- The `tx_origin_phishing` class has limited real training samples (28); treat predictions for this class with extra caution
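
One way to act on these caveats in a triage workflow is to route uncertain or low-support predictions to a human reviewer. The helper below is a hypothetical sketch that operates on the pipeline's output format. The name `needs_manual_review` and the 0.90 threshold are arbitrary assumptions, not values validated for this model.

```python
# Classes the card flags as having few real training samples.
LOW_SUPPORT_LABELS = {"tx_origin_phishing"}


def needs_manual_review(prediction, threshold=0.90):
    """Flag a pipeline-style prediction dict ({'label', 'score'}) for human review."""
    label, score = prediction["label"], prediction["score"]
    if label in LOW_SUPPORT_LABELS:
        return True           # always review low-support classes
    return score < threshold  # review anything below the confidence threshold


print(needs_manual_review({"label": "reentrancy", "score": 0.997}))          # False
print(needs_manual_review({"label": "tx_origin_phishing", "score": 0.99}))  # True
print(needs_manual_review({"label": "dos_gas", "score": 0.62}))             # True
```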

	---

	## Limitations & Bias

	- Synthetic augmentation was used for 9 of 13 classes to compensate for dataset scarcity. Synthetic contracts may not fully capture real-world obfuscation patterns.
	- The `tx_origin_phishing` class had only 28 real-world training samples, so predictions for this class may be less reliable in practice than the aggregate F1 suggests.
	- Training data skews toward older Solidity vulnerability patterns (pre-0.8). Newer attack vectors may be underrepresented.

	---

	## Citation

	```bibtex
	@misc{trustchainai2025,
	author = {Emeka Philian},
	title = {TrustChainAI: AI-Powered Smart Contract Auditor},
	year = {2025},
	url = {https://github.com/emekaphilian/TrustChainAI}
	}
	```

	---

	## Links

	- 🔗 GitHub: [emekaphilian/TrustChainAI](https://github.com/emekaphilian/TrustChainAI)
	- 🤗 Profile: [emekaphilians](https://huggingface.co/emekaphilians)
	- 📄 Architecture: [docs/ARCHITECTURE.md](https://github.com/emekaphilian/TrustChainAI/blob/main/docs/ARCHITECTURE.md)