---
tags:
- sentence-transformers
- embeddings
- roberta
- code
- solidity
- ethereum
- smart-contracts
- security
library_name: sentence-transformers
pipeline_tag: sentence-similarity
base_model: web3se/SmartBERT-v2
model-index:
- name: RavenBERT
results: []
---
# RavenBERT
RavenBERT is a **SentenceTransformers** embedding model specialized for **smart-contract invariants** (e.g., `require(...)`, `assert(...)`, `if (...) revert`) extracted from Ethereum contracts (Solidity/Vyper sources).
It starts from **`web3se/SmartBERT-v2`** and is **contrastively fine-tuned** so that cosine similarity reflects the *semantic intent* of guards used in transaction-reverting checks.
- **Architecture:** BERT-family encoder (SmartBERT-v2) → Mean Pooling → L2 Normalize
- **Embedding dimension:** 768
- **Normalization:** Enabled (unit-norm vectors; cosine ≡ dot product)
- **Intended use:** clustering, semantic search, deduplication, and taxonomy building for short guard predicates (and, optionally, their revert messages)
## Quick start
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("MojtabaEshghie/RavenBERT")
sentences = [
"amountOut >= amountOutMin",
"deadline >= block.timestamp",
"balances[msg.sender] >= amount"
]
emb = model.encode(sentences, convert_to_numpy=True, show_progress_bar=False)
# vectors in emb are L2-normalized; use cosine similarity for comparisons
```
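Because the output vectors are unit-norm (see above), pairwise cosine similarities reduce to a plain matrix product. Continuing from the snippet:

```python
import numpy as np

# Unit-norm vectors: cosine similarity is just the dot product.
sim = emb @ emb.T
print(np.round(sim, 3))  # 3x3 similarity matrix; diagonal entries are 1.0
```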
## Training summary (contrastive)
* **Base model:** `web3se/SmartBERT-v2`
* **Objective:** `CosineSimilarityLoss` (positives near 1.0, negatives near 0.0)
* **Pair construction:** L2-normalized seed embeddings → positives if **cosine ≥ 0.80**, negatives if **cosine ≤ 0.20** (nearest-neighbor candidates, `top_k=10`, max 5 positives per item)
* **Stats for this release:** 1,647 unique texts → **16,470 pairs** (8,235 positive / 8,235 negative)
* **Hyperparams:** epochs=1, batch_size=16, max_seq_len=512
* **Saved as:** canonical SentenceTransformers layout (`0_Transformer/`, `1_Pooling/`, `2_Normalize/`)
A more detailed methodology and evaluation appear in the RAVEN paper (semantic clustering of revert-inducing invariants).
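For orientation, the pipeline above can be sketched roughly as follows. This is a minimal illustration, not the exact training script: variable names are invented, and only the thresholds, `top_k`, positives cap, and hyperparameters listed above come from this card. The Pooling/Normalize stack is assembled explicitly to match the saved layout.

```python
import numpy as np
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

TAU_POS, TAU_NEG, TOP_K, MAX_POS = 0.80, 0.20, 10, 5

texts = [
    "amountOut >= amountOutMin",
    "deadline >= block.timestamp",
    "balances[msg.sender] >= amount",
    # ... 1,647 unique guard texts in the actual run
]

# Assemble the module stack described above: Transformer -> Mean Pooling -> Normalize.
word = models.Transformer("web3se/SmartBERT-v2", max_seq_length=512)
pool = models.Pooling(word.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word, pool, models.Normalize()])

# 1) Seed embeddings (unit-norm, so cosine == dot product).
seed = model.encode(texts, convert_to_numpy=True)
sim = seed @ seed.T

# 2) Label nearest-neighbor candidate pairs by the cosine thresholds.
pairs = []
for i in range(len(texts)):
    pos_used = 0
    for j in np.argsort(-sim[i])[1 : TOP_K + 1]:  # skip self at rank 0
        if sim[i, j] >= TAU_POS and pos_used < MAX_POS:
            pairs.append(InputExample(texts=[texts[i], texts[j]], label=1.0))
            pos_used += 1
        elif sim[i, j] <= TAU_NEG:
            pairs.append(InputExample(texts=[texts[i], texts[j]], label=0.0))

# 3) Contrastive fine-tuning with CosineSimilarityLoss, saved in the canonical layout.
loader = DataLoader(pairs, shuffle=True, batch_size=16)
model.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(model))], epochs=1)
model.save("RavenBERT")  # writes 0_Transformer/, 1_Pooling/, 2_Normalize/
```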
## Intended uses & limitations
**Good for**
* Measuring semantic relatedness of short invariant predicates
* Clustering guards by intent (e.g., access control, slippage, timeouts)
* Deduplicating near-equivalent checks across contracts (see the sketch after these lists)
**Not ideal for**
* Long code blocks or whole-function embeddings
* General code understanding outside invariant-style snippets
* Non-EVM ecosystems without adaptation
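A minimal deduplication sketch; the `0.95` cutoff is an assumption to tune per corpus, not a value from this card or the paper:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("MojtabaEshghie/RavenBERT")
guards = [
    "balances[msg.sender] >= amount",
    "amount <= balances[msg.sender]",  # near-equivalent phrasing
    "deadline >= block.timestamp",
]
emb = model.encode(guards, convert_to_numpy=True)
sim = emb @ emb.T  # unit-norm vectors, so this is the cosine similarity matrix

THRESHOLD = 0.95  # assumed cutoff for "near-equivalent"
kept = []
for i in range(len(guards)):
    # Keep a guard only if it is not a near-duplicate of one already kept.
    if all(sim[i, j] < THRESHOLD for j in kept):
        kept.append(i)
print([guards[i] for i in kept])
```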
## Evaluation (paper)
When paired with DBSCAN on predicate-only text, RavenBERT produced **compact, well-separated clusters** (e.g., Silhouette ≈ 0.93, S_Dbw ≈ 0.043 at ~52% coverage), surfacing meaningful categories of defenses from reverted transactions. See the paper for the full protocol, ablations, and metrics.
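A rough sketch of that clustering setup; the `eps` and `min_samples` values here are placeholders, not the paper's settings:

```python
from sklearn.cluster import DBSCAN
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("MojtabaEshghie/RavenBERT")
predicates = [
    "amountOut >= amountOutMin",
    "deadline >= block.timestamp",
    "block.timestamp <= deadline",
    "balances[msg.sender] >= amount",
    "msg.sender == owner",
]
emb = model.encode(predicates, convert_to_numpy=True)

# Embeddings are unit-norm, so cosine distance is well-behaved here.
labels = DBSCAN(eps=0.2, min_samples=2, metric="cosine").fit_predict(emb)
print(labels)  # -1 marks noise; clustered (non-noise) points count toward coverage
```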
## Reproducibility
* Pair thresholds: **τ₊ = 0.80**, **τ₋ = 0.20**
* Normalization: L2 via `sentence_transformers.models.Normalize()`
* Training log: `ravenbert_training_stats.json` (included in repo)
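A quick sanity check that the `Normalize()` output module is active (output norms should be ~1.0):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("MojtabaEshghie/RavenBERT")
vecs = model.encode(["msg.sender == owner"], convert_to_numpy=True)
print(np.linalg.norm(vecs, axis=1))  # expected: approximately [1.0]
```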
## Citation
If you use RavenBERT, please cite the RAVEN paper and this model:
```
TBD
```
## License
MIT