---
title: Cluster Protocol
emoji: π₯
colorFrom: indigo
colorTo: red
sdk: docker
pinned: false
license: mit
short_description: Behavioral clustering engine for Web3 wallets
---
# Crypto Wallet Clustering
An unsupervised machine learning project that segments cryptocurrency wallets into behavioral personas (e.g., "Whales", "NFT Flippers", "Dormant") based on on-chain transaction data.
## The Problem
In the Web3 ecosystem, users are anonymous by default. A wallet address (`0x123...`) gives no indication of whether the user is a high-value institution, a retail trader, a bot, or an NFT collector.
* **Marketing is blind:** Projects cannot target specific users effectively.
* **Risk is opaque:** Protocols cannot easily distinguish between organic users and sybil attackers.
* **Data is noisy:** Raw transaction logs are massive and unreadable without advanced processing.
## 💡 The Solution: Cluster Protocol
**Cluster Protocol** is an AI-powered engine that "fingerprints" wallets based on their behavior, not their identity.
1. **Ingest:** Pulls raw on-chain data (Gas spent, NFT volume, DEX trades, etc.) via Dune Analytics.
2. **Process:** Normalizes skewed financial data using **Yeo-Johnson Power Transformations**.
3. **Cluster:** Uses **K-Means Clustering** to mathematically group similar wallets.
4. **Label:** Assigns a human-readable persona (e.g., "Active Retail", "High-Frequency Bot") with a confidence score.
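
The Process and Cluster steps above can be sketched with scikit-learn. This is a minimal illustration rather than the project's actual pipeline: the feature matrix is synthetic, the three columns (gas spent, NFT volume, DEX trade count) are hypothetical stand-ins, and `n_init` and `random_state` are assumed values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

# Synthetic stand-in for wallet features (hypothetical columns:
# gas spent, NFT volume, DEX trade count). Log-normal draws mimic
# the heavy right skew typical of on-chain activity.
rng = np.random.default_rng(42)
features = rng.lognormal(mean=0.0, sigma=2.0, size=(500, 3))

# Yeo-Johnson reduces skew (and standardizes by default) before K-Means,
# which is sensitive to feature scale.
pipeline = make_pipeline(
    PowerTransformer(method="yeo-johnson"),
    KMeans(n_clusters=4, n_init=10, random_state=42),
)
labels = pipeline.fit_predict(features)

print(np.bincount(labels, minlength=4))  # wallets per cluster
```

Wrapping both steps in one pipeline keeps the transformation and the clustering in sync, so new wallets are always transformed with the parameters learned during training.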
## Key Features
- **Robust Preprocessing:** Handles extreme data skewness (common in financial data) using **Yeo-Johnson Power Transformation**.
- **Smart Filtering:** Heuristic detection to separate **Smart Contracts** from **EOAs** (Externally Owned Accounts).
- **Model Selection:** Benchmarked K-Means, DBSCAN, and GMM. **K-Means (K=4)** was selected as the production model.
- **Inference with Confidence:** Predicts personas for new wallets and provides **probability scores** (e.g., "85% Whale, 15% Trader").
- **Automated Retraining:** GitHub Actions workflow automatically fetches new data and retrains the model weekly to handle data drift.
- **End-to-End API:** Fetch data from Dune and classify a wallet in a single API call.
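
K-Means on its own produces hard assignments, and this README does not say how the probability scores are derived. One common approach, shown here purely as an assumption, is to apply a softmax to the negated distances between a wallet and each centroid (in scikit-learn, `KMeans.transform` returns those distances). The function name `soft_assignments`, the `temperature` parameter, and the example distances are all hypothetical.

```python
import math


def soft_assignments(distances, temperature=1.0):
    """Turn distances to each centroid into scores that sum to 1.

    Applies a softmax to negated distances, so the nearest
    centroid receives the highest score.
    """
    logits = [-d / temperature for d in distances]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical distances from one wallet to the four centroids (clusters 0..3).
scores = soft_assignments([2.9, 3.4, 1.6, 0.9])
best = max(range(len(scores)), key=scores.__getitem__)
print(best, [round(s, 2) for s in scores])
```

A lower `temperature` makes the scores sharper, and a higher one flattens them. These scores are a heuristic measure of relative closeness, not calibrated probabilities.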
## ⚠️ Supported Networks
**Cluster Protocol currently supports Ethereum Mainnet (L1) only.**
* **Supported:** Ethereum (`0x...`).
* **Not Supported:** L2s (Arbitrum, Optimism, Base), Sidechains (Polygon), or Non-EVM chains (Solana, Bitcoin).
* **Note:** The engine analyzes only the last **2 years** of DeFi/NFT history to keep results relevant and queries fast.
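
Before sending a wallet to the engine, a client may want to reject strings that cannot be Ethereum addresses. The check below is a sketch with a hypothetical helper name, `is_evm_address`. It validates only the shape (`0x` plus 40 hex characters) and does not verify the EIP-55 mixed-case checksum. Because L2s and EVM sidechains use the same address format, a shape check cannot tell which network a wallet is active on.

```python
import re

# An Ethereum address is "0x" followed by 20 bytes, i.e. 40 hex characters.
_ADDRESS_RE = re.compile(r"0x[0-9a-fA-F]{40}")


def is_evm_address(value: str) -> bool:
    """Return True if value has the shape of an Ethereum address."""
    return _ADDRESS_RE.fullmatch(value) is not None


print(is_evm_address("0x" + "ab" * 20))                     # well-formed
print(is_evm_address("1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa"))  # Bitcoin-style string
```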
## Tech Stack
- **Python 3.10+**
- **Pandas & NumPy** (Data manipulation)
- **Scikit-Learn** (Clustering & Preprocessing)
- **Matplotlib & Seaborn** (Visualization)
- **FastAPI** (Inference API)
- **Dune API** (Data ingestion)
- **GitHub Actions** (CI/CD & Automation)
## Project Structure
```
cluster/
├── data/               # Dataset storage
├── docs/               # Visualizations & Images
├── notebooks/          # Jupyter notebooks for EDA and modeling
├── src/                # Core logic (Inference Engine)
├── .github/workflows/  # Automated retraining workflows
├── app.py              # FastAPI Endpoint
├── predict.py          # CLI Inference Tool
├── train.py            # Production training pipeline
├── request.py          # Script to fetch data from Dune
├── README.md           # Project documentation
└── PROJECT_LOG.md      # Engineering log & decision records
```
## Identified Personas
The model identified 4 distinct behavioral clusters:
1. **Ultra-Whales / Institutional & Exchange Wallets** (Cluster 3)
* *Characteristics:* Massive volume, extremely high transaction counts.
2. **Active Retail Users / Everyday Traders** (Cluster 2)
* *Characteristics:* Consistent daily activity, moderate volume.
3. **High-Frequency Bots / Automated Traders** (Cluster 1)
* *Characteristics:* High transaction count but low human-like variety.
4. **High-Value NFT & Crypto Traders (Degen Whales)** (Cluster 0)
* *Characteristics:* High risk, high NFT volume, specialized activity.
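
The cluster-to-persona mapping listed above can be held in a small lookup table. This is an illustrative sketch, and the names `PERSONAS` and `persona_for` are hypothetical rather than taken from the repository. K-Means cluster indices are arbitrary and can change from one training run to the next, so a mapping like this likely needs to be re-checked after each retrain.

```python
# Cluster IDs and labels as listed in this README.
PERSONAS = {
    0: "High-Value NFT & Crypto Traders (Degen Whales)",
    1: "High-Frequency Bots / Automated Traders",
    2: "Active Retail Users / Everyday Traders",
    3: "Ultra-Whales / Institutional & Exchange Wallets",
}


def persona_for(cluster_id: int) -> str:
    """Translate a K-Means cluster index into its persona label."""
    try:
        return PERSONAS[cluster_id]
    except KeyError:
        raise ValueError(f"unknown cluster id: {cluster_id}") from None


print(persona_for(3))
```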
### Visualizations
**t-SNE Projection of Clusters**

**Behavioral Radar Chart**

## Getting Started
### Prerequisites
- Python 3.10+
- `uv` (recommended)
- Dune Analytics API Key (for fetching new data)
### Installation
```bash
git clone <repo-url>
cd cluster
uv sync
```
Create a `.env` file with your API key:
```
DUNE_API_KEY=your_key_here
```
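
How the project scripts read this file is not documented here. As a dependency-free sketch, the key could be resolved from the process environment first and from `.env` second. Both helper names below are hypothetical, and the parser handles only plain `KEY=VALUE` lines.

```python
import os


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip("'\"")
    return values


def get_dune_api_key(env_path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("DUNE_API_KEY")
    if not key and os.path.exists(env_path):
        key = load_env_file(env_path).get("DUNE_API_KEY")
    if not key:
        raise RuntimeError("DUNE_API_KEY is not set")
    return key
```

Failing early with a clear error avoids a confusing authentication failure later, when the first Dune request is made.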
### Usage
#### 1. Train the Model
Run the production pipeline to train K-Means and save artifacts (`kmeans_model.pkl`, `wallet_power_transformer.pkl`).
```bash
uv run train.py
```
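
The contents of `train.py` are not shown in this README. As a rough sketch of what producing and reusing the two artifacts could look like, the example below fits both objects, pickles them under the file names mentioned above, and reloads them for inference. The function names, the `out_dir` parameter, and the K-Means settings other than `n_clusters=4` are assumptions.

```python
import pickle
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import PowerTransformer


def train_and_save(features, out_dir="."):
    """Fit the transformer and K-Means, then pickle both artifacts."""
    out = Path(out_dir)
    transformer = PowerTransformer(method="yeo-johnson")
    scaled = transformer.fit_transform(features)
    model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(scaled)
    artifacts = [
        ("wallet_power_transformer.pkl", transformer),
        ("kmeans_model.pkl", model),
    ]
    for name, obj in artifacts:
        with open(out / name, "wb") as handle:
            pickle.dump(obj, handle)
    return transformer, model


def load_and_predict(features, out_dir="."):
    """Reload both artifacts and assign clusters to new rows."""
    out = Path(out_dir)
    with open(out / "wallet_power_transformer.pkl", "rb") as handle:
        transformer = pickle.load(handle)
    with open(out / "kmeans_model.pkl", "rb") as handle:
        model = pickle.load(handle)
    # New data must pass through the same fitted transformer as training data.
    return model.predict(transformer.transform(features))
```

Saving the transformer alongside the model matters because K-Means centroids live in the transformed feature space, so raw features cannot be compared against them directly.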
#### 2. Make Predictions (CLI)
Classify a specific wallet (or row from the dataset) and see confidence scores.
```bash
uv run predict.py --row 0
# Output:
# Cluster: 3
# Persona: Ultra-Whales / Institutional
# Confidence: Ultra-Whales: 0.52, Retail: 0.26...
```
#### 3. Run the API
Start the FastAPI server for real-time inference.
```bash
uv run uvicorn app:app --reload
```
**Analyze a specific wallet (Fetch + Predict):**
```bash
curl "http://localhost:8000/analyze/0x123...abc"
```
#### 4. Visualize Results
Generate fresh t-SNE and Radar charts.
```bash
uv run visualize_clusters.py
```