---
tags:
- sentence-transformers
- embeddings
- roberta
- code
- solidity
- ethereum
- smart-contracts
- security
library_name: sentence-transformers
pipeline_tag: sentence-similarity
base_model: web3se/SmartBERT-v2
model-index:
- name: RavenBERT
results: []
---
# RavenBERT
RavenBERT is a **SentenceTransformers** embedding model specialized for **smart-contract invariants** (e.g., `require(...)`, `assert(...)`, `if (...) revert`) drawn from Solidity/Vyper sources on Ethereum.
It starts from **`web3se/SmartBERT-v2`** and is **contrastively fine-tuned** so that cosine similarity reflects the *semantic intent* of guards used in transaction-reverting checks.
- **Architecture:** BERT-family encoder (SmartBERT-v2) → mean pooling → L2 normalization
- **Embedding dimension:** 768
- **Normalization:** Enabled (unit-norm vectors; cosine ≡ dot product)
- **Intended use:** clustering / semantic search / dedup / taxonomy building for short guard predicates (and, optionally, their revert messages)
## Quick start
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("MojtabaEshghie/RavenBERT")
sentences = [
    "amountOut >= amountOutMin",
    "deadline >= block.timestamp",
    "balances[msg.sender] >= amount",
]
emb = model.encode(sentences, convert_to_numpy=True, show_progress_bar=False)
# Embeddings are L2-normalized; use cosine similarity for comparisons.
```
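Because the embeddings are unit-norm, cosine similarity reduces to a plain dot product. Continuing the snippet above:

```python
import numpy as np

# Pairwise cosine similarities: for unit-norm vectors, cosine == dot product.
sims = emb @ emb.T
print(np.round(sims, 3))  # 3x3 similarity matrix for the three guards
```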
## Training summary (contrastive)
* **Base model:** `web3se/SmartBERT-v2`
* **Objective:** `CosineSimilarityLoss` (positives near 1.0, negatives near 0.0)
* **Pair construction:** L2-normalized seed embeddings →
  positives if **cosine ≥ 0.80**, negatives if **cosine ≤ 0.20** (nearest-neighbor candidates, `top_k=10`, max 5 positives/item); see the sketch after this list
* **Stats for this release:** 1,647 unique texts → **16,470 pairs** (8,235 positive / 8,235 negative)
* **Hyperparams:** epochs=1, batch_size=16, max_seq_len=512
* **Saved as:** canonical SentenceTransformers layout (`0_Transformer/`, `1_Pooling/`, `2_Normalize/`)
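The exact mining and training script is not included here, but a minimal sketch of the pipeline described above could look like the following. The `texts` list is an illustrative stand-in for the 1,647 unique invariants, and the one-negative-per-positive sampling is one plausible reading of the balanced 8,235/8,235 split:

```python
import numpy as np
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

TAU_POS, TAU_NEG, TOP_K, MAX_POS = 0.80, 0.20, 10, 5  # thresholds from this card

model = SentenceTransformer("web3se/SmartBERT-v2")  # base encoder

# Illustrative stand-in for the 1,647 unique invariant texts.
texts = [
    "amountOut >= amountOutMin",
    "deadline >= block.timestamp",
    "balances[msg.sender] >= amount",
    "msg.sender == owner",
]

# Seed embeddings, L2-normalized so cosine similarity is a dot product.
emb = model.encode(texts, normalize_embeddings=True, convert_to_numpy=True)
sims = emb @ emb.T

rng = np.random.default_rng(0)
pairs = []
for i in range(len(texts)):
    neighbors = np.argsort(-sims[i])[1 : TOP_K + 1]      # skip self at rank 0
    positives = [j for j in neighbors if sims[i, j] >= TAU_POS][:MAX_POS]
    far = np.flatnonzero(sims[i] <= TAU_NEG)             # candidate negatives
    for j in positives:
        pairs.append(InputExample(texts=[texts[i], texts[j]], label=1.0))
        if far.size:  # one far-away negative per positive keeps the set balanced
            k = int(rng.choice(far))
            pairs.append(InputExample(texts=[texts[i], texts[k]], label=0.0))

if pairs:  # the toy list above may yield no pairs at these thresholds
    loader = DataLoader(pairs, shuffle=True, batch_size=16)
    model.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(model))],
              epochs=1)
```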
A more detailed methodology and evaluation appear in the RAVEN paper (semantic clustering of revert-inducing invariants).
## Intended uses & limitations
**Good for**
* Measuring semantic relatedness of short invariant predicates
* Clustering guards by intent (e.g., access control, slippage, timeouts)
* Deduplicating near-equivalent checks across contracts
**Not ideal for**
* Long code blocks or whole-function embeddings
* General code understanding outside invariant-style snippets
* Non-EVM ecosystems without adaptation
## Evaluation (paper)
When paired with DBSCAN on predicate-only text, RavenBERT produced **compact, well-separated clusters** (e.g., Silhouette ≈ 0.93, S_Dbw ≈ 0.043 at ~52% coverage), surfacing meaningful categories of defenses from reverted transactions. See the paper for the full protocol, ablations, and metrics.
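As a rough illustration of that protocol (not the paper's exact configuration; `eps` and `min_samples` here are placeholder values), clustering RavenBERT embeddings with scikit-learn might look like:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("MojtabaEshghie/RavenBERT")
predicates = [
    "amountOut >= amountOutMin",
    "deadline >= block.timestamp",
    "balances[msg.sender] >= amount",
    "msg.sender == owner",
]
X = model.encode(predicates, convert_to_numpy=True)  # already unit-norm

# DBSCAN on cosine distance; label -1 marks noise (unclustered) points.
labels = DBSCAN(eps=0.2, min_samples=2, metric="cosine").fit_predict(X)

clustered = labels != -1
print("coverage:", clustered.mean())  # fraction of predicates assigned to a cluster
if clustered.any() and len(set(labels[clustered])) > 1:
    print("silhouette:", silhouette_score(X[clustered], labels[clustered], metric="cosine"))
```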
## Reproducibility
* Pair thresholds: **τ₊ = 0.80**, **τ₋ = 0.20**
* Normalization: L2 via `sentence_transformers.models.Normalize()`
* Training log: `ravenbert_training_stats.json` (included in repo)
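For reference, the three-module stack can be rebuilt from the base model as follows (a sketch of the saved layout, not the original training script; the output path is arbitrary):

```python
from sentence_transformers import SentenceTransformer, models

word = models.Transformer("web3se/SmartBERT-v2", max_seq_length=512)
pooling = models.Pooling(word.get_word_embedding_dimension(), pooling_mode="mean")
normalize = models.Normalize()

model = SentenceTransformer(modules=[word, pooling, normalize])
model.save("ravenbert-rebuilt")  # writes the modular SentenceTransformers layout
```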
## Citation
If you use RavenBERT, please cite the RAVEN paper and this model:
```
TBD
```
## License
MIT