Nileshka committed on
Commit
507c14e
·
verified ·
1 Parent(s): ce6e044

Create README.md

  1. README.md +141 -0
README.md ADDED
@@ -0,0 +1,141 @@
+ ---
+ license: mit
+ task_categories:
+ - tabular-classification
+ language:
+ - en
+ tags:
+ - ethereum
+ - fraud-detection
+ - blockchain
+ - multimodal
+ - xgboost
+ - shap
+ - explainable-ai
+ pretty_name: FuseChain Ethereum Fraud Detection Model
+ ---
+
+ # FuseChain: Ethereum Fraud Detection via Multimodal Signal Fusion
+
+ ## Model Summary
+
+ FuseChain is a multimodal supervised classification model for detecting fraudulent Ethereum Externally Owned Accounts (EOAs). It integrates on-chain transaction features with off-chain contextual signals from market data, Reddit, and Twitter to classify Ethereum addresses as scam or normal.
+
+ The model is an **XGBoost classifier** trained on a novel address-level dataset of **35,272 Ethereum EOAs**. It achieves an **F1-score of 82.5%** and an **AUC of 96.1%** on a stratified held-out test set, a **14.7-point F1 improvement** over an on-chain-only baseline.
+
+ ---
+
+ ## Model Details
+
+ | Property | Details |
+ |---|---|
+ | Model Type | XGBoost Classifier |
+ | Task | Binary Classification (Scam / Normal) |
+ | Input | 31 address-level multimodal features |
+ | Output | Fraud probability score (0 to 1) |
+ | Classification Threshold | 0.5 |
+ | Explainability | TreeSHAP (per-prediction feature attribution) |
+ | Training Framework | XGBoost 2.x, Scikit-learn |
+ | Language | Python 3.10+ |
+
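The table above says the model outputs a fraud probability and that a threshold of 0.5 turns it into a label. A minimal sketch of that last step follows. The function name `label_from_probability` is hypothetical, and treating a score exactly equal to the threshold as a scam is an assumption, since the README does not say how ties are handled.

```python
THRESHOLD = 0.5  # classification threshold listed in the Model Details table


def label_from_probability(probability: float, threshold: float = THRESHOLD) -> str:
    """Map a fraud probability in [0, 1] to a 'scam' or 'normal' label.

    Assumption: scores at or above the threshold count as scam.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return "scam" if probability >= threshold else "normal"


# Illustrative scores only, not outputs of the FuseChain model.
labels = [label_from_probability(p) for p in (0.12, 0.50, 0.93)]
```

Raising the threshold would trade scam recall for scam precision, so 0.5 is one operating point rather than the only sensible choice.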
+ ---
+
+ ## Performance
+
+ ### Test Set Results (Stratified 80/20 Split)
+
+ | Metric | Normal | Scam | Overall |
+ |---|---|---|---|
+ | Precision | 0.96 | 0.89 | 0.95 |
+ | Recall | 0.98 | 0.77 | 0.95 |
+ | F1-Score | 0.97 | 0.83 | 0.95 |
+ | AUC-ROC | - | - | 0.961 |
+ | Accuracy | - | - | 95% |
+
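As a quick consistency check on the table above, the per-class F1 values can be recomputed from the reported precision and recall using the standard harmonic-mean definition, F1 = 2PR / (P + R). The helper name `f1_score_from` is hypothetical, and because the inputs are already rounded to two decimals, the recomputed values only agree with the table after rounding.

```python
def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded per-class values copied from the table above.
scam_f1 = f1_score_from(precision=0.89, recall=0.77)
normal_f1 = f1_score_from(precision=0.96, recall=0.98)

print(f"Scam F1:   {scam_f1:.2f}")    # 0.83
print(f"Normal F1: {normal_f1:.2f}")  # 0.97
```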
+ ### Ablation Study Results
+
+ | Configuration | Features | F1 | AUC |
+ |---|---|---|---|
+ | On-Chain Only | 14 | 0.678 | 0.919 |
+ | On-Chain + Market | 19 | 0.721 | 0.936 |
+ | On-Chain + Market + Reddit | 22 | 0.802 | 0.955 |
+ | On-Chain + Market + Twitter | 28 | 0.825 | 0.962 |
+ | On-Chain + Market + Reddit + Twitter | 31 | 0.825 | 0.961 |
+
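The F1 gain of each configuration over the on-chain-only baseline can be read off the table above with a little arithmetic. The sketch below copies the table rows verbatim and expresses each gain in F1 points (hundredths of F1); the variable names are illustrative.

```python
# (configuration, feature count, F1, AUC) copied from the ablation table above.
ablation = [
    ("On-Chain Only", 14, 0.678, 0.919),
    ("On-Chain + Market", 19, 0.721, 0.936),
    ("On-Chain + Market + Reddit", 22, 0.802, 0.955),
    ("On-Chain + Market + Twitter", 28, 0.825, 0.962),
    ("On-Chain + Market + Reddit + Twitter", 31, 0.825, 0.961),
]

baseline_f1 = ablation[0][2]

# F1 gain over the baseline, in points, rounded to one decimal place.
f1_gain_points = {
    name: round((f1 - baseline_f1) * 100, 1) for name, _, f1, _ in ablation
}

for name, gain in f1_gain_points.items():
    print(f"{name}: +{gain} F1 points")
```

The full 31-feature configuration comes out at +14.7 points, the same gain as the 28-feature configuration without Reddit.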
+ ---
+
+ ## Feature Set
+
+ The model was trained on 31 features across four modalities:
+
+ | Modality | Features | Examples |
+ |---|---|---|
+ | On-Chain | 14 | `eth_net_flow_max`, `eth_recv_mean`, `burst_max_tx_5m_mean`, `active_days` |
+ | Twitter | 9 | `twitter_avg_retweets_mean`, `twitter_avg_positive_mean`, `twitter_fraud_mention_ratio_mean` |
+ | Market | 5 | `market_intraday_volatility_mean`, `market_daily_return_mean` |
+ | Reddit | 3 | `reddit_total_fraud_mentions_mean`, `reddit_avg_sentiment_mean` |
+
+ For the full feature schema, refer to `address_features_metadata.json` in this repository.
+
+ ---
+
+ ### Global Modality Contribution (SHAP)
+
+ | Modality | Contribution |
+ |---|---|
+ | On-Chain | 58.6% |
+ | Twitter | 25.8% |
+ | Reddit | 10.8% |
+ | Market | 4.8% |
+
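One common way to arrive at modality-level percentages like those above is to sum the mean absolute SHAP value of every feature in a modality and normalise by the grand total. The sketch below illustrates that aggregation; it is not confirmed to be the procedure used for FuseChain. The prefix-based modality mapping is an assumption inferred from the example feature names in this README, and the input numbers are made-up toy values, not the model's actual SHAP values.

```python
from collections import defaultdict

# Assumption: off-chain features carry a modality prefix; everything else is on-chain.
OFF_CHAIN_PREFIXES = {"twitter_": "Twitter", "reddit_": "Reddit", "market_": "Market"}


def modality_of(feature: str) -> str:
    """Infer a feature's modality from its name prefix (hypothetical convention)."""
    for prefix, modality in OFF_CHAIN_PREFIXES.items():
        if feature.startswith(prefix):
            return modality
    return "On-Chain"


def modality_shares(mean_abs_shap: dict) -> dict:
    """Sum mean |SHAP| per modality and express each sum as a percentage."""
    totals = defaultdict(float)
    for feature, value in mean_abs_shap.items():
        totals[modality_of(feature)] += value
    grand_total = sum(totals.values())
    return {modality: 100.0 * total / grand_total for modality, total in totals.items()}


# Toy mean |SHAP| values in arbitrary units, for illustration only.
toy_values = {
    "eth_net_flow_max": 3.0,
    "active_days": 1.0,
    "twitter_avg_retweets_mean": 2.5,
    "reddit_total_fraud_mentions_mean": 1.0,
    "market_intraday_volatility_mean": 0.5,
}
shares = modality_shares(toy_values)
# shares == {"On-Chain": 50.0, "Twitter": 31.25, "Reddit": 12.5, "Market": 6.25}
```

In practice the per-feature inputs would come from a TreeSHAP run over the evaluation set, averaged over addresses.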
+ ### Most Discriminative Features per Modality
+
+ | Modality | Top Feature |
+ |---|---|
+ | On-Chain | `eth_net_flow_max` |
+ | Twitter | `twitter_avg_retweets_mean` |
+ | Reddit | `reddit_total_fraud_mentions_mean` |
+ | Market | `market_intraday_volatility_mean` |
+
+ ---
+
+ ## Model Hyperparameters
+
+ | Parameter | Value |
+ |---|---|
+ | n_estimators | 200 |
+ | max_depth | 5 |
+ | learning_rate | 0.05 |
+ | min_child_weight | 3 |
+ | subsample | 0.8 |
+ | colsample_bytree | 0.8 |
+ | Classification Threshold | 0.5 |
+
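The hyperparameters above map directly onto keyword arguments of `xgboost.XGBClassifier`. The sketch below shows one way to carry them in code, assuming the `xgboost` package is installed. The names `FUSECHAIN_PARAMS` and `build_classifier` are hypothetical, and any settings not listed in the table (random seed, objective, class weighting) are left at library defaults here, which may differ from the original training run.

```python
# Values copied from the hyperparameter table above.
FUSECHAIN_PARAMS = {
    "n_estimators": 200,
    "max_depth": 5,
    "learning_rate": 0.05,
    "min_child_weight": 3,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
}

# The classification threshold is not an XGBoost parameter; it is applied
# to the predicted probabilities after the model has been fitted.
CLASSIFICATION_THRESHOLD = 0.5


def build_classifier(params=None):
    """Return an unfitted XGBClassifier configured with the listed values."""
    from xgboost import XGBClassifier  # imported lazily; requires xgboost

    return XGBClassifier(**(params or FUSECHAIN_PARAMS))
```

After fitting, `predict_proba(X)[:, 1] >= CLASSIFICATION_THRESHOLD` would yield scam labels under the same tie-handling assumption as any other thresholding step.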
+ ---
+
+ ## Dataset
+
+ The FuseChain dataset used to train this model is publicly available on Hugging Face:
+
+ [FuseChain Multimodal Ethereum Fraud Dataset](https://huggingface.co/datasets/Nileshka/fusechain-data)
+
+ ## Citation
+
+ If you use this model or the FuseChain framework in your research, please cite:
+ ```bibtex
+ @misc{fusechain2026,
+ title={FuseChain: Ethereum Fraud Detection via Multimodal Signal Fusion},
+ author={Fernando, Nileshka},
+ year={2026},
+ publisher={Hugging Face},
+ howpublished={\url{https://huggingface.co/datasets/Nileshka/fusechain-data}}
+ }
+ ```
+
+ ---
+
+ ## Related Resources
+
+ - **Dataset:** [FuseChain Multimodal Ethereum Fraud Dataset](https://huggingface.co/datasets/Nileshka/fusechain-data)
+ - **Code Repository:** [FuseChain GitHub](https://github.com/NileshFdo/FuseChain-FYP)