Vibes Chat Clustering Model

⚠️ Note: This is a demo model trained on a specific WhatsApp group ("The vibez" - tech/AI discussions). For production use, train your own UMAP model on your chat data.

About

This model demonstrates the UMAP-only training approach for WhatsApp chat clustering. It's trained on ~400 conversation bursts from a tech-focused group chat.

Best For

  • Demo/example of clustering approach
  • Comparing topics with similar tech/AI discussion groups
  • Learning how the pipeline works

NOT Recommended For

  • Production clustering of your own chats (train on your data instead!)
  • Different chat domains (family, work, etc.)
  • Long-term deployment

Model Architecture

This uses a UMAP-only training approach:

  1. Embeddings: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
  2. UMAP (TRAINED): Reduces to 15 dimensions with cosine metric
  3. HDBSCAN (FRESH): Clustering on each inference (min_cluster_size=2)
  4. c-TF-IDF (FRESH): Topic extraction with current vocabulary

Why UMAP-Only?

Unlike BERTopic's transform(), which freezes the vocabulary and produces high noise rates (94.8% in our tests), this approach:

  • Trains UMAP once for consistent embedding space
  • Re-clusters with fresh HDBSCAN each time
  • Extracts topics with fresh vocabulary each time

This adapts to new vocabulary while maintaining spatial consistency.

Quick Start

```python
from huggingface_hub import hf_hub_download
import pickle
import json
from sentence_transformers import SentenceTransformer
import hdbscan
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Download model (cached after first run)
umap_path = hf_hub_download(
    repo_id="msugimura/vibes-clustering",
    filename="umap_model.pkl"
)
config_path = hf_hub_download(
    repo_id="msugimura/vibes-clustering",
    filename="config.json"
)

# Load model
with open(umap_path, 'rb') as f:
    umap_model = pickle.load(f)
with open(config_path) as f:
    config = json.load(f)

bert_model = SentenceTransformer(config['bert_model'])

# Your texts
texts = ["your chat messages here"]

# 1. Embed
embeddings = bert_model.encode(texts)

# 2. Transform with pre-trained UMAP
reduced = umap_model.transform(embeddings)

# 3. Fresh clustering
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=config['recommended_min_cluster_size'],
    metric='euclidean',
    cluster_selection_method='eom'
)
labels = clusterer.fit_predict(reduced)

# 4. Extract topics (see full code in repo)
# ... c-TF-IDF implementation ...
```

Training Your Own Model

Recommended for production use:

```python
from sentence_transformers import SentenceTransformer
import umap
import pickle

# Your chat messages
historical_texts = [...]  # First 80% of your timeline

# Embed
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(historical_texts)

# Train UMAP on YOUR data
umap_model = umap.UMAP(
    n_components=15,
    metric='cosine',
    random_state=42
)
umap_model.fit(embeddings)

# Save for future use
with open('my_chat_umap.pkl', 'wb') as f:
    pickle.dump(umap_model, f)
```

Then call transform() on new messages and re-run HDBSCAN clustering and c-TF-IDF topic extraction fresh each time.

Training Details

  • Training data: ~400 conversation bursts (excluding last 2 months for privacy)
  • Date range: June 2024 - November 2024
  • Domain: Tech/AI discussions (Claude Code, superpowers, coding workflows)
  • Typical topics: agent workflows, coding tools, LLM discussions, infrastructure

Performance

On held-out test data (last 2 months):

  • 8 clusters identified
  • 27.8% noise rate
  • Clear topic differentiation

Limitations

  • Trained on tech/AI vocabulary - may not generalize to other domains
  • Small training set (400 bursts) - larger chats should train their own
  • English only
  • Optimized for min_cluster_size=2 (adjust for your density)

Citation

If you use this approach:

```bibtex
@misc{vibes-clustering-2024,
  title={UMAP-Only Training for Chat Clustering},
  author={Sugimura, Michael},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/msugimura/vibes-clustering}
}
```

License

MIT License - Free to use, but train your own model for production!
