# Vibes Chat Clustering Model

> ⚠️ **Note:** This is a demo model trained on a specific WhatsApp group ("The vibez" - tech/AI discussions). For production use, train your own UMAP model on your chat data.
## About

This model demonstrates the UMAP-only training approach for WhatsApp chat clustering. It is trained on ~400 conversation bursts from a tech-focused group chat.
## Best For

- Demo/example of the clustering approach
- Comparing topics with similar tech/AI discussion groups
- Learning how the pipeline works

## NOT Recommended For

- Production clustering of your own chats (train on your own data instead!)
- Different chat domains (family, work, etc.)
- Long-term deployment
## Model Architecture

This uses a UMAP-only training approach:

- **Embeddings:** `sentence-transformers/all-MiniLM-L6-v2` (384 dimensions)
- **UMAP (TRAINED):** reduces to 15 dimensions with cosine metric
- **HDBSCAN (FRESH):** clustering on each inference (`min_cluster_size=2`)
- **c-TF-IDF (FRESH):** topic extraction with current vocabulary
## Why UMAP-Only?

Unlike BERTopic's `transform()`, which freezes the vocabulary and can cause high noise rates (94.8% in our tests), this approach:

- Trains UMAP once, for a consistent embedding space
- Re-clusters with a fresh HDBSCAN each time
- Extracts topics with fresh vocabulary each time

This adapts to new vocabulary while maintaining spatial consistency.
## Quick Start

```python
from huggingface_hub import hf_hub_download
import pickle
import json
from sentence_transformers import SentenceTransformer
import hdbscan
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Download model (cached after first run)
umap_path = hf_hub_download(
    repo_id="msugimura/vibes-clustering",
    filename="umap_model.pkl"
)
config_path = hf_hub_download(
    repo_id="msugimura/vibes-clustering",
    filename="config.json"
)

# Load model
with open(umap_path, 'rb') as f:
    umap_model = pickle.load(f)
with open(config_path) as f:
    config = json.load(f)

bert_model = SentenceTransformer(config['bert_model'])

# Your texts
texts = ["your chat messages here"]

# 1. Embed
embeddings = bert_model.encode(texts)

# 2. Transform with pre-trained UMAP
reduced = umap_model.transform(embeddings)

# 3. Fresh clustering
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=config['recommended_min_cluster_size'],
    metric='euclidean',
    cluster_selection_method='eom'
)
labels = clusterer.fit_predict(reduced)

# 4. Extract topics (see full code in repo)
# ... c-TF-IDF implementation ...
```
## Training Your Own Model

Recommended for production use:

```python
import pickle

from sentence_transformers import SentenceTransformer
import umap

# Your chat messages
historical_texts = [...]  # First 80% of your timeline

# Embed
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(historical_texts)

# Train UMAP on YOUR data
umap_model = umap.UMAP(
    n_components=15,
    metric='cosine',
    random_state=42
)
umap_model.fit(embeddings)

# Save for future use
with open('my_chat_umap.pkl', 'wb') as f:
    pickle.dump(umap_model, f)
```

Then call `transform()` on new messages, with fresh clustering and topic extraction each time.
## Training Details

- **Training data:** ~400 conversation bursts (excluding the last 2 months, for privacy)
- **Date range:** June 2024 - November 2024
- **Domain:** tech/AI discussions (Claude Code, superpowers, coding workflows)
- **Typical topics:** agent workflows, coding tools, LLM discussions, infrastructure
## Performance

On held-out test data (the last 2 months):

- 8 clusters identified
- 27.8% noise rate
- Clear topic differentiation
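The noise rate is simply the fraction of points HDBSCAN leaves unassigned (label `-1`). A small helper (the name `noise_rate` is introduced here for illustration) computes it from the labels returned in the Quick Start:

```python
import numpy as np

def noise_rate(labels):
    """Fraction of points HDBSCAN assigned to the noise label (-1)."""
    labels = np.asarray(labels)
    return float((labels == -1).mean())
```

For example, `noise_rate([0, 0, 1, -1])` gives `0.25`.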
## Limitations

- Trained on tech/AI vocabulary, so it may not generalize to other domains
- Small training set (~400 bursts); larger chats should train their own model
- English only
- Optimized for `min_cluster_size=2` (adjust for your chat's density)
## Citation

If you use this approach:

```bibtex
@misc{vibes-clustering-2024,
  title={UMAP-Only Training for Chat Clustering},
  author={Sugimura, Michael},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/msugimura/vibes-clustering}
}
```
## Related Work

- **BERTopic** - full pipeline (this model extracts only its UMAP-training step)
- **UMAP** - dimensionality reduction
- **HDBSCAN** - density-based clustering
- **sentence-transformers** - text embeddings
## License

MIT License - free to use, but train your own model for production!