|
|
--- |
|
|
tags: |
|
|
- sentence-transformers |
|
|
- embeddings |
|
|
- roberta |
|
|
- code |
|
|
- solidity |
|
|
- ethereum |
|
|
- smart-contracts |
|
|
- security |
|
|
library_name: sentence-transformers |
|
|
pipeline_tag: sentence-similarity |
|
|
base_model: web3se/SmartBERT-v2 |
|
|
model-index: |
|
|
- name: RavenBERT |
|
|
results: [] |
|
|
--- |
|
|
|
|
|
# RavenBERT |
|
|
|
|
|
RavenBERT is a **SentenceTransformers** embedding model specialized for **smart-contract invariants** (e.g., `require(...)`, `assert(...)`, `if (...) revert`) from Solidity/Vyper sources on Ethereum.


It starts from **`web3se/SmartBERT-v2`** and is **contrastively fine-tuned** so that cosine similarity reflects the *semantic intent* of guards used in transaction-reverting checks.
|
|
|
|
|
- **Architecture:** BERT-family encoder (SmartBERT-v2) → MeanPooling → L2 Normalize
|
|
- **Embedding dimension:** 768 |
|
|
- **Normalization:** Enabled (unit-norm vectors; cosine ≡ dot product)
|
|
- **Intended use:** clustering / semantic search / dedup / taxonomy building for short guard predicates (and optional messages) |
|
|
|
|
|
## Quick start |
|
|
|
|
|
```python |
|
|
from sentence_transformers import SentenceTransformer |
|
|
|
|
|
model = SentenceTransformer("MojtabaEshghie/RavenBERT") |
|
|
sentences = [ |
|
|
"amountOut >= amountOutMin", |
|
|
"deadline >= block.timestamp", |
|
|
"balances[msg.sender] >= amount" |
|
|
] |
|
|
emb = model.encode(sentences, convert_to_numpy=True, show_progress_bar=False) |
|
|
# emb are L2-normalized; use cosine similarity for comparisons |
|
|
``` |
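Because the output vectors are unit-norm, pairwise cosine similarities reduce to a plain matrix product. A minimal sketch with hypothetical stand-in vectors (no model download needed; in practice `emb` would come from `model.encode(...)` above):

```python
import numpy as np

# Hypothetical unit-norm vectors standing in for model.encode(...) output;
# since RavenBERT output is L2-normalized, cosine similarity is a dot product.
emb = np.array([
    [0.6, 0.8, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
])
sim = emb @ emb.T  # pairwise cosine-similarity matrix
```

`sim[i, j]` is then the cosine similarity between sentences `i` and `j`.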
|
|
|
|
|
## Training summary (contrastive) |
|
|
|
|
|
* **Base model:** `web3se/SmartBERT-v2` |
|
|
* **Objective:** `CosineSimilarityLoss` (positives near 1.0, negatives near 0.0) |
|
|
* **Pair construction:** L2-normalized seed embeddings → positives if **cosine ≥ 0.80**, negatives if **cosine ≤ 0.20** (nearest-neighbor candidates, `top_k=10`, max 5 positives/item)
|
|
* **Stats for this release:** 1,647 unique texts → **16,470 pairs** (8,235 positive / 8,235 negative)
|
|
* **Hyperparams:** epochs=1, batch_size=16, max_seq_len=512 |
|
|
* **Saved as:** canonical SentenceTransformers layout (`0_Transformer/`, `1_Pooling/`, `2_Normalize/`) |
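The pair-construction rule above can be sketched as follows; the function name and control flow are illustrative, not the released training code:

```python
import numpy as np

TAU_POS, TAU_NEG = 0.80, 0.20  # thresholds from the training summary

def build_pairs(emb, top_k=10, max_pos=5):
    """Label candidate pairs from L2-normalized seed embeddings.

    For each item, scan its top_k nearest neighbors by cosine similarity:
    cosine >= TAU_POS -> positive pair (label 1.0, at most max_pos per item),
    cosine <= TAU_NEG -> negative pair (label 0.0).
    Illustrative sketch only, not the released pairing script.
    """
    sim = emb @ emb.T  # cosine matrix (vectors assumed unit-norm)
    pairs = []
    for i in range(len(emb)):
        order = np.argsort(-sim[i])              # neighbors, most similar first
        neighbors = [j for j in order if j != i][:top_k]
        pos = 0
        for j in neighbors:
            if sim[i, j] >= TAU_POS and pos < max_pos:
                pairs.append((i, int(j), 1.0))
                pos += 1
            elif sim[i, j] <= TAU_NEG:
                pairs.append((i, int(j), 0.0))
    return pairs
```

Each resulting `(i, j, label)` triple maps directly onto an `InputExample` for `CosineSimilarityLoss`.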
|
|
|
|
|
A more detailed methodology and evaluation appear in the RAVEN paper (semantic clustering of revert-inducing invariants). |
|
|
|
|
|
## Intended uses & limitations |
|
|
|
|
|
**Good for** |
|
|
|
|
|
* Measuring semantic relatedness of short invariant predicates |
|
|
* Clustering guards by intent (e.g., access control, slippage, timeouts) |
|
|
* Deduplicating near-equivalent checks across contracts |
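For the deduplication use case, one simple greedy pass over unit-norm embeddings looks like this; the threshold value is an assumption for illustration, not from the model card:

```python
import numpy as np

DEDUP_THRESHOLD = 0.95  # assumed cutoff for "near-equivalent", tune per corpus

def deduplicate(emb):
    """Keep the index of the first occurrence of each near-duplicate group.

    Any vector whose cosine similarity to an already-kept vector meets the
    threshold is treated as a duplicate and dropped. Greedy sketch only.
    """
    keep = []
    for i, v in enumerate(emb):
        if all(float(v @ emb[j]) < DEDUP_THRESHOLD for j in keep):
            keep.append(i)
    return keep
```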
|
|
|
|
|
**Not ideal for** |
|
|
|
|
|
* Long code blocks or whole-function embeddings |
|
|
* General code understanding outside invariant-style snippets |
|
|
* Non-EVM ecosystems without adaptation |
|
|
|
|
|
## Evaluation (paper) |
|
|
|
|
|
When paired with DBSCAN on predicate-only text, RavenBERT produced **compact, well-separated clusters** (e.g., Silhouette ≈ 0.93, S_Dbw ≈ 0.043 at ~52% coverage), surfacing meaningful categories of defenses from reverted transactions. See the paper for the full protocol, ablations, and metrics.
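The clustering step can be reproduced in spirit with scikit-learn's DBSCAN on normalized embeddings; the toy vectors and the `eps`/`min_samples` values below are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy embeddings forming two tight directions (standing in for real guard
# embeddings); cosine distance keeps each direction in its own cluster.
emb = np.array([
    [1.0, 0.00], [1.0, 0.05],
    [0.0, 1.00], [0.05, 1.0],
])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-norm, as the model emits
labels = DBSCAN(eps=0.1, min_samples=2, metric="cosine").fit_predict(emb)
```

Points labeled `-1` would be noise; coverage is the fraction of points assigned to a non-noise cluster.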
|
|
|
|
|
## Reproducibility |
|
|
|
|
|
* Pair thresholds: **τ₊ = 0.80**, **τ₋ = 0.20**
|
|
* Normalization: L2 via `sentence_transformers.models.Normalize()` |
|
|
* Training log: `ravenbert_training_stats.json` (included in repo) |
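To confirm the normalization property on any output batch, a quick check (with hypothetical stand-in vectors in place of `model.encode(...)` output):

```python
import numpy as np

# Stand-in vectors; after the model's Normalize stage every row should have
# unit L2 norm, so cosine similarity and dot product coincide.
vecs = np.array([[3.0, 4.0], [0.0, 2.0]])
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
norms = np.linalg.norm(vecs, axis=1)
```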
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use RavenBERT, please cite the RAVEN paper and this model: |
|
|
|
|
|
``` |
|
|
TBD |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
MIT |
|
|
|