---
tags:
- sentence-transformers
- embeddings
- roberta
- code
- solidity
- ethereum
- smart-contracts
- security
library_name: sentence-transformers
pipeline_tag: sentence-similarity
base_model: web3se/SmartBERT-v2
model-index:
- name: RavenBERT
  results: []
---

# RavenBERT

RavenBERT is a **SentenceTransformers** embedding model specialized for **smart-contract invariants** (e.g., `require(...)`, `assert(...)`, `if (...) revert`) from Ethereum/Vyper sources. It starts from **`web3se/SmartBERT-v2`** and is **contrastively fine-tuned** so that cosine similarity reflects the *semantic intent* of guards used in transaction-reverting checks.

- **Architecture:** BERT-family encoder (SmartBERT-v2) → MeanPooling → L2 Normalize
- **Embedding dimension:** 768
- **Normalization:** Enabled (unit-norm vectors; cosine ≡ dot product)
- **Intended use:** clustering / semantic search / dedup / taxonomy building for short guard predicates (and optional messages)

## Quick start

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("MojtabaEshghie/RavenBERT")

sentences = [
    "amountOut >= amountOutMin",
    "deadline >= block.timestamp",
    "balances[msg.sender] >= amount",
]

emb = model.encode(sentences, convert_to_numpy=True, show_progress_bar=False)
# Embeddings are L2-normalized; use cosine similarity for comparisons
```

## Training summary (contrastive)

* **Base model:** `web3se/SmartBERT-v2`
* **Objective:** `CosineSimilarityLoss` (positives near 1.0, negatives near 0.0)
* **Pair construction:** L2-normalized seed embeddings → positives if **cosine ≥ 0.80**, negatives if **cosine ≤ 0.20** (nearest-neighbor candidates, `top_k=10`, max 5 positives/item)
* **This release:** 1,647 unique texts → **16,470 pairs** (8,235 pos / 8,235 neg)
* **Hyperparameters:** epochs=1, batch_size=16, max_seq_len=512
* **Saved as:** canonical SentenceTransformers layout (`0_Transformer/`, `1_Pooling/`, `2_Normalize/`)

A more detailed methodology and evaluation appear in the RAVEN paper (semantic clustering of revert-inducing invariants).

## Intended uses & limitations

**Good for**

* Measuring semantic relatedness of short invariant predicates
* Clustering guards by intent (e.g., access control, slippage, timeouts)
* Deduplicating near-equivalent checks across contracts

**Not ideal for**

* Long code blocks or whole-function embeddings
* General code understanding outside invariant-style snippets
* Non-EVM ecosystems without adaptation

## Evaluation (paper)

When paired with DBSCAN on predicate-only text, RavenBERT produced **compact, well-separated clusters** (e.g., Silhouette ≈ 0.93, S_Dbw ≈ 0.043 at ~52% coverage), surfacing meaningful categories of defenses from reverted transactions. See the paper for the full protocol, ablations, and metrics.

## Reproducibility

* Pair thresholds: **τ₊ = 0.80**, **τ₋ = 0.20**
* Normalization: L2 via `sentence_transformers.models.Normalize()`
* Training log: `ravenbert_training_stats.json` (included in the repo)

## Citation

If you use RavenBERT, please cite the RAVEN paper and this model:

```
TBD
```

## License

MIT
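
## Example: clustering with DBSCAN

The evaluation above pairs RavenBERT embeddings with DBSCAN over predicate-only text. The snippet below is a minimal sketch of that workflow using scikit-learn; the example predicates and the `eps` / `min_samples` values are illustrative placeholders, not the data or settings used in the paper.

```python
from sklearn.cluster import DBSCAN
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("MojtabaEshghie/RavenBERT")

# Illustrative guard predicates (not the paper's dataset)
predicates = [
    "amountOut >= amountOutMin",
    "amounts[amounts.length - 1] >= amountOutMin",
    "deadline >= block.timestamp",
    "block.timestamp <= deadline",
    "balances[msg.sender] >= amount",
]

# Embeddings are unit-norm, so cosine distance works directly with DBSCAN
emb = model.encode(predicates, convert_to_numpy=True)

# eps / min_samples are placeholders; tune them on your own corpus
labels = DBSCAN(eps=0.2, min_samples=2, metric="cosine").fit_predict(emb)

for label, text in zip(labels, predicates):
    print(label, text)
```

DBSCAN marks unclustered points with the label `-1`; the remaining labels group predicates whose embeddings lie within the chosen cosine-distance radius of each other.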