---
tags:
- sentence-transformers
- embeddings
- roberta
- code
- solidity
- ethereum
- smart-contracts
- security
library_name: sentence-transformers
pipeline_tag: sentence-similarity
base_model: web3se/SmartBERT-v2
model-index:
- name: RavenBERT
  results: []
---

# RavenBERT

RavenBERT is a **SentenceTransformers** embedding model specialized for **smart-contract invariants** (e.g., `require(...)`, `assert(...)`, `if (...) revert`) from Ethereum/Vyper sources. It starts from **`web3se/SmartBERT-v2`** and is **contrastively fine-tuned** so that cosine similarity reflects the *semantic intent* of guards used in transaction-reverting checks.

- **Architecture:** BERT-family encoder (SmartBERT-v2) → MeanPooling → L2 Normalize
- **Embedding dimension:** 768
- **Normalization:** Enabled (unit-norm vectors; cosine ≡ dot product)
- **Intended use:** clustering / semantic search / dedup / taxonomy building for short guard predicates (and optional messages)

## Quick start

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("MojtabaEshghie/RavenBERT")

sentences = [
    "amountOut >= amountOutMin",
    "deadline >= block.timestamp",
    "balances[msg.sender] >= amount",
]

emb = model.encode(sentences, convert_to_numpy=True, show_progress_bar=False)
# Embeddings are L2-normalized; use cosine similarity for comparisons
```

## Training summary (contrastive)

* **Base model:** `web3se/SmartBERT-v2`
* **Objective:** `CosineSimilarityLoss` (positives near 1.0, negatives near 0.0)
* **Pair construction:** L2-normalized seed embeddings → positives if **cosine ≥ 0.80**, negatives if **cosine ≤ 0.20** (nearest-neighbor candidates, `top_k=10`, max 5 positives/item)
* **This release:** 1,647 unique texts → **16,470 pairs** (8,235 pos / 8,235 neg)
* **Hyperparameters:** epochs=1, batch_size=16, max_seq_len=512
* **Saved as:** canonical SentenceTransformers layout (`0_Transformer/`, `1_Pooling/`, `2_Normalize/`)

A more detailed methodology and evaluation appear in the RAVEN paper (semantic clustering of revert-inducing invariants).

## Intended uses & limitations

**Good for**

* Measuring semantic relatedness of short invariant predicates
* Clustering guards by intent (e.g., access control, slippage, timeouts)
* Deduplicating near-equivalent checks across contracts

**Not ideal for**

* Long code blocks or whole-function embeddings
* General code understanding outside invariant-style snippets
* Non-EVM ecosystems without adaptation

## Evaluation (paper)

When paired with DBSCAN on predicate-only text, RavenBERT produced **compact, well-separated clusters** (e.g., Silhouette ≈ 0.93, S_Dbw ≈ 0.043 at ~52% coverage), surfacing meaningful categories of defenses from reverted transactions. See the paper for the full protocol, ablations, and metrics.

## Reproducibility

* Pair thresholds: **τ₊ = 0.80**, **τ₋ = 0.20**
* Normalization: L2 via `sentence_transformers.models.Normalize()`
* Training log: `ravenbert_training_stats.json` (included in the repo)

## Citation

If you use RavenBERT, please cite the RAVEN paper and this model:

```
TBD
```

## License

MIT
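
## Example: clustering with DBSCAN

The evaluation above pairs RavenBERT embeddings with DBSCAN over predicate-only text. The snippet below is a minimal sketch of that workflow using scikit-learn; the example predicates and the `eps` / `min_samples` values are illustrative placeholders, not the data or settings used in the paper.

```python
from sklearn.cluster import DBSCAN
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("MojtabaEshghie/RavenBERT")

# Illustrative guard predicates (not the paper's dataset)
predicates = [
    "amountOut >= amountOutMin",
    "amounts[amounts.length - 1] >= amountOutMin",
    "deadline >= block.timestamp",
    "block.timestamp <= deadline",
    "balances[msg.sender] >= amount",
]

# Embeddings are unit-norm, so cosine distance works directly with DBSCAN
emb = model.encode(predicates, convert_to_numpy=True)

# eps / min_samples are placeholders; tune them on your own corpus
labels = DBSCAN(eps=0.2, min_samples=2, metric="cosine").fit_predict(emb)

for label, text in zip(labels, predicates):
    print(label, text)
```

DBSCAN marks unclustered points with the label `-1`; the remaining labels group predicates whose embeddings lie within the chosen cosine-distance radius of each other.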