---
license: mit
tags:
- clustering
- topic-modeling
- umap
- hdbscan
- whatsapp-analysis
datasets:
- private
language:
- en
pipeline_tag: feature-extraction
---

# Vibes Chat Clustering Model

> ⚠️ **Note**: This is a demo model trained on a specific WhatsApp group ("The vibez" - tech/AI discussions). For production use, **train your own UMAP model on your chat data**.

## About

This model demonstrates the UMAP-only training approach for WhatsApp chat clustering. It's trained on ~400 conversation bursts from a tech-focused group chat.

### Best For

- Demo/example of the clustering approach
- Comparing topics with similar tech/AI discussion groups
- Learning how the pipeline works

### NOT Recommended For

- Production clustering of your own chats (train on your data instead!)
- Different chat domains (family, work, etc.)
- Long-term deployment

## Model Architecture

This uses a **UMAP-only training** approach:

1. **Embeddings**: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
2. **UMAP** (TRAINED): Reduces to 15 dimensions with cosine metric
3. **HDBSCAN** (FRESH): Clustering on each inference (min_cluster_size=2)
4. **c-TF-IDF** (FRESH): Topic extraction with the current vocabulary

### Why UMAP-Only?

Unlike BERTopic's `transform()`, which freezes the vocabulary and causes high noise rates (94.8% in our tests), this approach:

- Trains UMAP once for a consistent embedding space
- Re-clusters with a fresh HDBSCAN each time
- Extracts topics with a fresh vocabulary each time

This adapts to new vocabulary while maintaining spatial consistency.
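The noise rates quoted above (94.8%, and 27.8% under Performance below) refer to the fraction of points HDBSCAN marks as noise with the label `-1`. A minimal sketch of how such a rate is computed, using illustrative labels rather than the model's actual output:

```python
import numpy as np

# HDBSCAN labels: -1 marks noise points; 0, 1, 2, ... are cluster IDs.
labels = np.array([0, 0, 1, -1, 1, -1, 2, 0])

noise_rate = float((labels == -1).mean())
print(f"{noise_rate:.1%}")  # 25.0%
```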
## Quick Start

```python
from huggingface_hub import hf_hub_download
import pickle
import json

from sentence_transformers import SentenceTransformer
import hdbscan
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Download model files (cached after the first run)
umap_path = hf_hub_download(
    repo_id="msugimura/vibes-clustering",
    filename="umap_model.pkl"
)
config_path = hf_hub_download(
    repo_id="msugimura/vibes-clustering",
    filename="config.json"
)

# Load model
with open(umap_path, 'rb') as f:
    umap_model = pickle.load(f)
with open(config_path) as f:
    config = json.load(f)

bert_model = SentenceTransformer(config['bert_model'])

# Your texts
texts = ["your chat messages here"]

# 1. Embed
embeddings = bert_model.encode(texts)

# 2. Transform with the pre-trained UMAP
reduced = umap_model.transform(embeddings)

# 3. Fresh clustering
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=config['recommended_min_cluster_size'],
    metric='euclidean',
    cluster_selection_method='eom'
)
labels = clusterer.fit_predict(reduced)

# 4. Extract topics (see full code in repo)
# ... c-TF-IDF implementation ...
```

## Training Your Own Model

**Recommended for production use:**

```python
from sentence_transformers import SentenceTransformer
import umap

# Your chat messages
historical_texts = [...]  # First 80% of your timeline

# Embed
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(historical_texts)

# Train UMAP on YOUR data
umap_model = umap.UMAP(
    n_components=15,
    metric='cosine',
    random_state=42
)
umap_model.fit(embeddings)

# Save for future use
import pickle
with open('my_chat_umap.pkl', 'wb') as f:
    pickle.dump(umap_model, f)
```

Then use `transform()` on new messages with fresh clustering/topics each time.
## Training Details

- **Training data**: ~400 conversation bursts (excluding the last 2 months for privacy)
- **Date range**: June 2024 - November 2024
- **Domain**: Tech/AI discussions (Claude Code, superpowers, coding workflows)
- **Typical topics**: agent workflows, coding tools, LLM discussions, infrastructure

## Performance

On held-out test data (the last 2 months):

- **8 clusters** identified
- **27.8% noise** rate
- Clear topic differentiation

## Limitations

- Trained on tech/AI vocabulary - may not generalize to other domains
- Small training set (400 bursts) - larger chats should train their own model
- English only
- Optimized for min_cluster_size=2 (adjust for your message density)

## Citation

If you use this approach:

```bibtex
@misc{vibes-clustering-2024,
  title={UMAP-Only Training for Chat Clustering},
  author={Sugimura, Michael},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/msugimura/vibes-clustering}
}
```

## Related Work

- [BERTopic](https://github.com/MaartenGr/BERTopic) - Full pipeline (this model extracts the UMAP-only approach)
- [UMAP](https://github.com/lmcinnes/umap) - Dimensionality reduction
- [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan) - Density-based clustering
- [sentence-transformers](https://www.sbert.net/) - Text embeddings

## License

MIT License - Free to use, but train your own model for production!