metadata
title: Cluster Protocol
emoji: 🔥
colorFrom: indigo
colorTo: red
sdk: docker
pinned: false
license: mit
short_description: Behavioral clustering engine for Web3 wallets

Crypto Wallet Clustering

An unsupervised machine learning project that segments cryptocurrency wallets into behavioral personas (e.g., "Whales", "NFT Flippers", "Dormant") based on on-chain transaction data.

❓ The Problem

In the Web3 ecosystem, users are anonymous by default. A wallet address (0x123...) gives no indication of whether the user is a high-value institution, a retail trader, a bot, or an NFT collector.

  • Marketing is blind: Projects cannot target specific users effectively.
  • Risk is opaque: Protocols cannot easily distinguish between organic users and sybil attackers.
  • Data is noisy: Raw transaction logs are massive and unreadable without advanced processing.

💡 The Solution: Cluster Protocol

Cluster Protocol is an AI-powered engine that "fingerprints" wallets based on their behavior, not their identity.

  1. Ingest: Pulls raw on-chain data (gas spent, NFT volume, DEX trades, etc.) via Dune Analytics.
  2. Process: Normalizes skewed financial data using the Yeo-Johnson power transformation.
  3. Cluster: Groups wallets with similar feature profiles using K-Means clustering.
  4. Label: Assigns a human-readable persona (e.g., "Active Retail", "High-Frequency Bot") with a confidence score.
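The Process and Cluster steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual train.py: the feature columns and the synthetic log-normal data are assumptions, chosen only to mimic heavily skewed on-chain metrics.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

# Synthetic, right-skewed stand-ins for on-chain features.
# Hypothetical columns: gas spent, NFT volume, DEX trade count.
rng = np.random.default_rng(seed=0)
features = rng.lognormal(mean=0.0, sigma=2.0, size=(500, 3))

# Yeo-Johnson reduces skew (PowerTransformer also standardizes by default),
# then K-Means assigns each wallet to one of four clusters.
pipeline = make_pipeline(
    PowerTransformer(method="yeo-johnson"),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)
cluster_ids = pipeline.fit_predict(features)
```

Transforming before clustering matters because K-Means relies on Euclidean distance, so a few extreme wallets would otherwise dominate the centroids.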

Key Features

  • Robust Preprocessing: Handles extreme data skewness (common in financial data) using Yeo-Johnson Power Transformation.
  • Smart Filtering: Heuristic detection to separate Smart Contracts from EOAs (Externally Owned Accounts).
  • Model Selection: Benchmarked K-Means, DBSCAN, and Gaussian Mixture Models (GMM). K-Means (K=4) was selected as the production model.
  • Inference with Confidence: Predicts personas for new wallets and provides probability scores (e.g., "85% Whale, 15% Trader").
  • Automated Retraining: GitHub Actions workflow automatically fetches new data and retrains the model weekly to handle data drift.
  • End-to-End API: Fetch data from Dune and classify a wallet in a single API call.
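As an illustration of the Model Selection feature, the sketch below compares the three algorithm families on synthetic data. The README does not state which metric was used, so the silhouette score here is an assumption, as are the blob dataset and the DBSCAN parameters.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic data with four well-separated groups (illustrative only).
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

candidates = {
    "kmeans": KMeans(n_clusters=4, n_init=10, random_state=0),
    "dbscan": DBSCAN(eps=0.5, min_samples=5),
    "gmm": GaussianMixture(n_components=4, random_state=0),
}

scores = {}
for name, model in candidates.items():
    labels = model.fit_predict(X)
    mask = labels != -1  # DBSCAN marks noise points with -1; exclude them
    if np.unique(labels[mask]).size < 2:
        scores[name] = float("nan")  # silhouette needs at least two clusters
        continue
    scores[name] = float(silhouette_score(X[mask], labels[mask]))
```

A higher silhouette score indicates tighter, better-separated clusters, but it is only one of several reasonable criteria for choosing a model.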

⚠️ Supported Networks

Cluster Protocol currently supports Ethereum Mainnet (L1) only.

  • Supported: Ethereum (0x...).
  • Not Supported: L2s (Arbitrum, Optimism, Base), sidechains (Polygon), or non-EVM chains (Solana, Bitcoin).
  • Note: The engine analyzes only the last 2 years of DeFi/NFT history to keep results relevant and queries fast.
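A simple format check can reject obviously unsupported inputs before any data is fetched. The helper below is a hypothetical sketch, not part of the project: it only verifies the 0x-prefixed, 40-hex-character shape and does not validate the EIP-55 checksum.

```python
import re

# 0x followed by exactly 40 hexadecimal characters (20 bytes).
_ETH_ADDRESS = re.compile(r"0x[0-9a-fA-F]{40}")

def is_ethereum_address(value: str) -> bool:
    """Return True if value has the shape of an Ethereum address."""
    return _ETH_ADDRESS.fullmatch(value) is not None
```

Because L2s and EVM sidechains use the same address format, a check like this cannot tell which network a wallet is active on; it only filters out non-EVM inputs such as Solana or Bitcoin addresses.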

Tech Stack

  • Python 3.10+
  • Pandas & NumPy (Data manipulation)
  • Scikit-Learn (Clustering & Preprocessing)
  • Matplotlib & Seaborn (Visualization)
  • FastAPI (Inference API)
  • Dune API (Data ingestion)
  • GitHub Actions (CI/CD & Automation)

Project Structure

cluster/
├── data/                   # Dataset storage
├── docs/                   # Visualizations & Images
├── notebooks/              # Jupyter notebooks for EDA and modeling
├── src/                    # Core logic (Inference Engine)
├── .github/workflows/      # Automated retraining workflows
├── app.py                  # FastAPI Endpoint
├── predict.py              # CLI Inference Tool
├── train.py                # Production training pipeline
├── request.py              # Script to fetch data from Dune
├── README.md               # Project documentation
└── PROJECT_LOG.md          # Engineering log & decision records

Identified Personas

The model identified 4 distinct behavioral clusters:

  1. Ultra-Whales / Institutional & Exchange Wallets (Cluster 3)
    • Characteristics: Massive volume, extremely high transaction counts.
  2. Active Retail Users / Everyday Traders (Cluster 2)
    • Characteristics: Consistent daily activity, moderate volume.
  3. High-Frequency Bots / Automated Traders (Cluster 1)
    • Characteristics: High transaction count but low human-like variety.
  4. High-Value NFT & Crypto Traders (Degen Whales) (Cluster 0)
    • Characteristics: High risk, high NFT volume, specialized activity.
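The cluster-to-persona mapping listed above can be expressed as a plain lookup table. The names PERSONAS and persona_for are hypothetical; how the project stores this mapping is not shown here.

```python
# Cluster ids and persona names as listed in this README.
PERSONAS = {
    0: "High-Value NFT & Crypto Traders (Degen Whales)",
    1: "High-Frequency Bots / Automated Traders",
    2: "Active Retail Users / Everyday Traders",
    3: "Ultra-Whales / Institutional & Exchange Wallets",
}

def persona_for(cluster_id: int) -> str:
    """Map a K-Means cluster id to its human-readable persona."""
    return PERSONAS.get(cluster_id, "Unknown")
```

K-Means cluster ids carry no inherent meaning and may be assigned differently after a retrain, so a fixed table like this would need to be re-checked whenever the model is refitted.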

Visualizations

t-SNE Projection of Clusters
[Figure: t-SNE plot]

Behavioral Radar Chart
[Figure: radar chart]

Getting Started

Prerequisites

  • Python 3.10+
  • uv (recommended)
  • Dune Analytics API Key (for fetching new data)

Installation

git clone <repo-url>
cd cluster
uv sync

Create a .env file with your API key:

DUNE_API_KEY=your_key_here
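A script that needs the key can read it from the environment and fail early if it is missing. This is a hedged sketch with a hypothetical helper name; it assumes the .env values have already been exported into the process environment (for example by a loader such as python-dotenv), which this README does not confirm.

```python
import os

def get_dune_api_key(env=os.environ) -> str:
    """Return the Dune API key, or raise if it is missing or a placeholder."""
    key = env.get("DUNE_API_KEY", "").strip()
    if not key or key == "your_key_here":
        raise RuntimeError("DUNE_API_KEY is not set; add it to .env or export it")
    return key
```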

Usage

1. Train the Model

Run the production pipeline to train K-Means and save artifacts (kmeans_model.pkl, wallet_power_transformer.pkl).

uv run train.py

2. Make Predictions (CLI)

Classify a specific wallet (or row from the dataset) and see confidence scores.

uv run predict.py --row 0
# Output:
# Cluster: 3
# Persona: Ultra-Whales / Institutional
# Confidence: Ultra-Whales: 0.52, Retail: 0.26...
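K-Means itself produces only hard assignments, so confidence values like those in the output above have to be derived. One common approach, shown here as an assumption rather than the project's confirmed method, is a softmax over negative centroid distances (with scikit-learn, `KMeans.transform` returns those distances). The function name and the temperature parameter are illustrative.

```python
import math

def soft_assignments(distances, temperature=1.0):
    """Turn centroid distances into scores that sum to 1 (closer = higher)."""
    logits = [-d / temperature for d in distances]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical distances from one wallet to the four cluster centroids.
scores = soft_assignments([0.4, 1.1, 2.5, 3.0])
```

Scores obtained this way are relative closeness measures, not calibrated probabilities; a lower temperature makes the top cluster dominate more strongly.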

3. Run the API

Start the FastAPI server for real-time inference.

uv run uvicorn app:app --reload

Analyze a specific wallet (Fetch + Predict):

curl "http://localhost:8000/analyze/0x123...abc"
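The same request can be made from Python using only the standard library. This sketch assumes the server is running locally on port 8000 and that the endpoint returns a JSON object; the response fields are not documented here, so the function simply returns the parsed payload. The helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumed local uvicorn address

def analyze_url(address: str, base_url: str = BASE_URL) -> str:
    """Build the /analyze/<address> URL for a wallet."""
    return f"{base_url}/analyze/{quote(address, safe='')}"

def analyze_wallet(address: str, timeout: float = 60.0) -> dict:
    """Call the API and return the decoded JSON response."""
    with urlopen(analyze_url(address), timeout=timeout) as response:
        return json.load(response)
```

A generous timeout is used because a single call both fetches data from Dune and runs the prediction, which may take longer than a typical API request.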

4. Visualize Results

Generate fresh t-SNE and Radar charts.

uv run visualize_clusters.py