---
tags:
- sentence-transformers
- embeddings
- roberta
- code
- solidity
- ethereum
- smart-contracts
- security
library_name: sentence-transformers
pipeline_tag: sentence-similarity
base_model: web3se/SmartBERT-v2
model-index:
- name: RavenBERT
results: []
---
# RavenBERT
RavenBERT is a **SentenceTransformers** embedding model specialized for **smart-contract invariants** (e.g., `require(...)`, `assert(...)`, `if (...) revert`) drawn from Solidity/Vyper sources on Ethereum.
It starts from **`web3se/SmartBERT-v2`** and is **contrastively fine-tuned** so that cosine similarity reflects the *semantic intent* of guards used in transaction-reverting checks.
- **Architecture:** BERT-family encoder (SmartBERT-v2) → mean pooling → L2 normalization
- **Embedding dimension:** 768
- **Normalization:** Enabled (unit-norm vectors; cosine ≡ dot product)
- **Intended use:** clustering / semantic search / dedup / taxonomy building for short guard predicates (and, optionally, their revert messages)
## Quick start
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("MojtabaEshghie/RavenBERT")
sentences = [
    "amountOut >= amountOutMin",
    "deadline >= block.timestamp",
    "balances[msg.sender] >= amount",
]
emb = model.encode(sentences, convert_to_numpy=True, show_progress_bar=False)
# Embeddings are L2-normalized; use cosine similarity for comparisons.
```
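Because the embeddings are unit-norm, cosine similarity reduces to a plain dot product. Continuing the snippet above:

```python
import numpy as np

# Pairwise cosine similarities: for unit-norm vectors, cosine == dot product.
sims = emb @ emb.T
print(np.round(sims, 3))  # 3x3 similarity matrix for the three guards
```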
## Training summary (contrastive)
* **Base model:** `web3se/SmartBERT-v2`
* **Objective:** `CosineSimilarityLoss` (positives near 1.0, negatives near 0.0)
* **Pair construction:** L2-normalized seed embeddings →
  positives if **cosine ≥ 0.80**, negatives if **cosine ≤ 0.20** (nearest-neighbor candidates, `top_k=10`, max 5 positives/item); see the sketch after this list
* **Stats for this release:** 1,647 unique texts → **16,470 pairs** (8,235 positive / 8,235 negative)
* **Hyperparams:** epochs=1, batch_size=16, max_seq_len=512
* **Saved as:** canonical SentenceTransformers layout (`0_Transformer/`, `1_Pooling/`, `2_Normalize/`)
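The exact mining and training script is not included here, but a minimal sketch of the pipeline described above could look like the following. The `texts` list is an illustrative stand-in for the 1,647 unique invariants, and the one-negative-per-positive sampling is one plausible reading of the balanced 8,235/8,235 split:

```python
import numpy as np
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

TAU_POS, TAU_NEG, TOP_K, MAX_POS = 0.80, 0.20, 10, 5  # thresholds from this card

model = SentenceTransformer("web3se/SmartBERT-v2")  # base encoder

# Illustrative stand-in for the 1,647 unique invariant texts.
texts = [
    "amountOut >= amountOutMin",
    "deadline >= block.timestamp",
    "balances[msg.sender] >= amount",
    "msg.sender == owner",
]

# Seed embeddings, L2-normalized so cosine similarity is a dot product.
emb = model.encode(texts, normalize_embeddings=True, convert_to_numpy=True)
sims = emb @ emb.T

rng = np.random.default_rng(0)
pairs = []
for i in range(len(texts)):
    neighbors = np.argsort(-sims[i])[1 : TOP_K + 1]      # skip self at rank 0
    positives = [j for j in neighbors if sims[i, j] >= TAU_POS][:MAX_POS]
    far = np.flatnonzero(sims[i] <= TAU_NEG)             # candidate negatives
    for j in positives:
        pairs.append(InputExample(texts=[texts[i], texts[j]], label=1.0))
        if far.size:  # one far-away negative per positive keeps the set balanced
            k = int(rng.choice(far))
            pairs.append(InputExample(texts=[texts[i], texts[k]], label=0.0))

if pairs:  # the toy list above may yield no pairs at these thresholds
    loader = DataLoader(pairs, shuffle=True, batch_size=16)
    model.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(model))],
              epochs=1)
```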
A more detailed methodology and evaluation appear in the RAVEN paper (semantic clustering of revert-inducing invariants).
## Intended uses & limitations
**Good for**
* Measuring semantic relatedness of short invariant predicates
* Clustering guards by intent (e.g., access control, slippage, timeouts)
* Deduplicating near-equivalent checks across contracts
**Not ideal for**
* Long code blocks or whole-function embeddings
* General code understanding outside invariant-style snippets
* Non-EVM ecosystems without adaptation
## Evaluation (paper)
When paired with DBSCAN on predicate-only text, RavenBERT produced **compact, well-separated clusters** (e.g., Silhouette ≈ 0.93, S_Dbw ≈ 0.043 at ~52% coverage), surfacing meaningful categories of defenses from reverted transactions. See the paper for the full protocol, ablations, and metrics.
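As a rough illustration of that protocol (not the paper's exact configuration; `eps` and `min_samples` here are placeholder values), clustering RavenBERT embeddings with scikit-learn might look like:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("MojtabaEshghie/RavenBERT")
predicates = [
    "amountOut >= amountOutMin",
    "deadline >= block.timestamp",
    "balances[msg.sender] >= amount",
    "msg.sender == owner",
]
X = model.encode(predicates, convert_to_numpy=True)  # already unit-norm

# DBSCAN on cosine distance; label -1 marks noise (unclustered) points.
labels = DBSCAN(eps=0.2, min_samples=2, metric="cosine").fit_predict(X)

clustered = labels != -1
print("coverage:", clustered.mean())  # fraction of predicates assigned to a cluster
if clustered.any() and len(set(labels[clustered])) > 1:
    print("silhouette:", silhouette_score(X[clustered], labels[clustered], metric="cosine"))
```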
## Reproducibility
* Pair thresholds: **τ₊ = 0.80**, **τ₋ = 0.20**
* Normalization: L2 via `sentence_transformers.models.Normalize()`
* Training log: `ravenbert_training_stats.json` (included in repo)
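For reference, the three-module stack can be rebuilt from the base model as follows (a sketch of the saved layout, not the original training script; the output path is arbitrary):

```python
from sentence_transformers import SentenceTransformer, models

word = models.Transformer("web3se/SmartBERT-v2", max_seq_length=512)
pooling = models.Pooling(word.get_word_embedding_dimension(), pooling_mode="mean")
normalize = models.Normalize()

model = SentenceTransformer(modules=[word, pooling, normalize])
model.save("ravenbert-rebuilt")  # writes the modular SentenceTransformers layout
```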
## Citation
If you use RavenBERT, please cite the RAVEN paper and this model:
```
TBD
```
## License
MIT