---
title: Cluster Protocol
emoji: 🔥
colorFrom: indigo
colorTo: red
sdk: docker
pinned: false
license: mit
short_description: Behavioral clustering engine for Web3 wallets
---

# Crypto Wallet Clustering

An unsupervised machine learning project that segments cryptocurrency wallets into behavioral personas (e.g., "Whales", "NFT Flippers", "Dormant") based on on-chain transaction data.

## ❓ The Problem
In the Web3 ecosystem, users are anonymous by default. A wallet address (`0x123...`) gives no indication of whether the user is a high-value institution, a retail trader, a bot, or an NFT collector. 
*   **Marketing is blind:** Projects cannot target specific users effectively.
*   **Risk is opaque:** Protocols cannot easily distinguish between organic users and sybil attackers.
*   **Data is noisy:** Raw transaction logs are massive and unreadable without advanced processing.

## 💡 The Solution: Cluster Protocol
**Cluster Protocol** is an AI-powered engine that "fingerprints" wallets based on their behavior, not their identity.
1.  **Ingest:** Pulls raw on-chain data (gas spent, NFT volume, DEX trades, etc.) via Dune Analytics.
2.  **Process:** Normalizes skewed financial data using **Yeo-Johnson Power Transformations**.
3.  **Cluster:** Uses **K-Means Clustering** to mathematically group similar wallets.
4.  **Label:** Assigns a human-readable persona (e.g., "Active Retail", "High-Frequency Bot") with a confidence score.
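
The Process and Cluster steps above can be sketched with scikit-learn. This is a minimal illustration, not the project's `train.py`: the data is synthetic, and the three feature columns are hypothetical stand-ins for real wallet features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import PowerTransformer

# Synthetic, heavily right-skewed stand-in for wallet features
# (hypothetical columns: gas spent, NFT volume, DEX trade count).
rng = np.random.default_rng(42)
features = rng.lognormal(mean=0.0, sigma=2.0, size=(500, 3))

# Process: Yeo-Johnson reduces skew; standardize=True also rescales
# each column to zero mean and unit variance.
transformer = PowerTransformer(method="yeo-johnson", standardize=True)
scaled = transformer.fit_transform(features)

# Cluster: K-Means with K=4, matching the number of personas in this README.
model = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = model.fit_predict(scaled)

print(labels[:10])                       # cluster ID for the first ten rows
print(np.bincount(labels, minlength=4))  # number of rows per cluster
```

At inference time the same fitted transformer must be applied to new wallets before calling `model.predict`, which is presumably why the transformer is saved alongside the model.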

## Key Features
- **Robust Preprocessing:** Handles extreme data skewness (common in financial data) using **Yeo-Johnson Power Transformation**.
- **Smart Filtering:** Heuristic detection to separate **Smart Contracts** from **EOAs** (Externally Owned Accounts).
- **Model Selection:** Benchmarked K-Means, DBSCAN, and GMM. **K-Means (K=4)** was selected as the production model.
- **Inference with Confidence:** Predicts personas for new wallets and provides **probability scores** (e.g., "85% Whale, 15% Trader").
- **Automated Retraining:** GitHub Actions workflow automatically fetches new data and retrains the model weekly to handle data drift.
- **End-to-End API:** Fetch data from Dune and classify a wallet in a single API call.
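
K-Means itself only returns a hard cluster assignment, and this README does not say how the probability scores are derived. One common approach, shown here purely as an assumption, is a softmax over the negative distances to each centroid. The function name, the `temperature` parameter, and the 2-D centroids are all hypothetical.

```python
import math

def confidence_scores(point, centroids, temperature=1.0):
    """Turn distances to each centroid into scores that sum to 1."""
    distances = [math.dist(point, c) for c in centroids]
    # Closer centroids get larger weights; subtracting the minimum
    # distance keeps the exponentials numerically stable.
    nearest = min(distances)
    weights = [math.exp(-(d - nearest) / temperature) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical 2-D centroids, for illustration only.
centroids = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (5.0, 5.0)]
scores = confidence_scores((0.5, 0.2), centroids)
print([round(s, 3) for s in scores])
```

A smaller `temperature` makes the scores more peaked around the nearest centroid; a larger one spreads them out. These scores are a heuristic and not calibrated probabilities.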

## ⚠️ Supported Networks
**Cluster Protocol currently supports Ethereum Mainnet (L1) only.**
*   **Supported:** Ethereum (`0x...`).
*   **Not Supported:** L2s (Arbitrum, Optimism, Base), Sidechains (Polygon), or Non-EVM chains (Solana, Bitcoin).
*   **Note:** The engine analyzes only the last **2 years** of DeFi/NFT history to keep results relevant and queries fast.
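
Because only Ethereum addresses are supported, it can help to check an address's shape before sending it to the API. The helper below is a hypothetical sketch and not part of this repository.

```python
import re

# An Ethereum address is "0x" followed by 40 hexadecimal characters.
_ADDRESS_RE = re.compile(r"0x[0-9a-fA-F]{40}")

def is_ethereum_address(value: str) -> bool:
    """Return True if value has the shape of an Ethereum address."""
    return _ADDRESS_RE.fullmatch(value) is not None

print(is_ethereum_address("0x" + "ab" * 20))  # well-formed
print(is_ethereum_address("0x123"))           # too short
```

This checks format only. It does not verify the EIP-55 checksum, and it cannot tell which chain an address is used on, since EVM L2s and sidechains share the same address format.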

## Tech Stack
- **Python 3.10+**
- **Pandas & NumPy** (Data manipulation)
- **Scikit-Learn** (Clustering & Preprocessing)
- **Matplotlib & Seaborn** (Visualization)
- **FastAPI** (Inference API)
- **Dune API** (Data ingestion)
- **GitHub Actions** (CI/CD & Automation)

## Project Structure
```
cluster/
├── data/                   # Dataset storage
├── docs/                   # Visualizations & Images
├── notebooks/              # Jupyter notebooks for EDA and modeling
├── src/                    # Core logic (Inference Engine)
├── .github/workflows/      # Automated retraining workflows
├── app.py                  # FastAPI Endpoint
├── predict.py              # CLI Inference Tool
├── train.py                # Production training pipeline
├── request.py              # Script to fetch data from Dune
├── README.md               # Project documentation
└── PROJECT_LOG.md          # Engineering log & decision records
```

## Identified Personas
The model identified 4 distinct behavioral clusters:

1.  **Ultra-Whales / Institutional & Exchange Wallets** (Cluster 3)
    *   *Characteristics:* Massive volume, extremely high transaction counts.
2.  **Active Retail Users / Everyday Traders** (Cluster 2)
    *   *Characteristics:* Consistent daily activity, moderate volume.
3.  **High-Frequency Bots / Automated Traders** (Cluster 1)
    *   *Characteristics:* High transaction count with little of the variety typical of human activity.
4.  **High-Value NFT & Crypto Traders (Degen Whales)** (Cluster 0)
    *   *Characteristics:* High risk, high NFT volume, specialized activity.
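
The cluster-to-persona assignment listed above can be expressed as a simple lookup table. The names `PERSONAS` and `persona_for` are hypothetical; the repository's actual labeling code may differ.

```python
# Cluster IDs and persona names as listed in this README.
PERSONAS = {
    0: "High-Value NFT & Crypto Traders (Degen Whales)",
    1: "High-Frequency Bots / Automated Traders",
    2: "Active Retail Users / Everyday Traders",
    3: "Ultra-Whales / Institutional & Exchange Wallets",
}

def persona_for(cluster_id: int) -> str:
    """Look up the persona for a cluster ID (hypothetical helper)."""
    try:
        return PERSONAS[cluster_id]
    except KeyError:
        raise ValueError(f"unknown cluster id: {cluster_id}") from None

print(persona_for(3))
```

K-Means cluster IDs are arbitrary and can change whenever the model is retrained, so a table like this should be re-checked against the new centroids after each retraining run.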

### Visualizations
**t-SNE Projection of Clusters**
![t-SNE Plot](docs/clusters_tsne.png)

**Behavioral Radar Chart**
![Radar Chart](docs/persona_radar_chart.png)

## Getting Started

### Prerequisites
- Python 3.10+
- `uv` (recommended)
- Dune Analytics API Key (for fetching new data)

### Installation
```bash
git clone <repo-url>
cd cluster
uv sync
```
Create a `.env` file with your API key:
```
DUNE_API_KEY=your_key_here
```

### Usage

#### 1. Train the Model
Run the production pipeline to train K-Means and save artifacts (`kmeans_model.pkl`, `wallet_power_transformer.pkl`).
```bash
uv run train.py
```

#### 2. Make Predictions (CLI)
Classify a specific wallet (or a row from the dataset) and see its confidence scores.
```bash
uv run predict.py --row 0
# Output:
# Cluster: 3
# Persona: Ultra-Whales / Institutional
# Confidence: Ultra-Whales: 0.52, Retail: 0.26...
```

#### 3. Run the API
Start the FastAPI server for real-time inference.
```bash
uv run uvicorn app:app --reload
```
**Analyze a specific wallet (Fetch + Predict):**
```bash
curl "http://localhost:8000/analyze/0x123...abc"
```
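
The same request can be made from Python using only the standard library. This sketch assumes the endpoint returns a JSON body, which is FastAPI's default; the response fields are not documented here, so the result is returned as a plain dict. Both function names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

def analyze_url(address: str, base_url: str = "http://localhost:8000") -> str:
    """Build the /analyze URL for a wallet address."""
    return f"{base_url.rstrip('/')}/analyze/{quote(address, safe='')}"

def analyze_wallet(address: str,
                   base_url: str = "http://localhost:8000",
                   timeout: float = 30.0) -> dict:
    """Call the running API and decode its JSON response."""
    with urlopen(analyze_url(address, base_url), timeout=timeout) as response:
        return json.load(response)

# With the server from the previous step running:
#   result = analyze_wallet("<wallet address>")
#   print(result)
```

A generous timeout is a reasonable default here, since the endpoint fetches data from Dune before predicting.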

#### 4. Visualize Results
Generate fresh t-SNE and Radar charts.
```bash
uv run visualize_clusters.py
```