Upload folder using huggingface_hub

README.md

---
license: mit
tags:
- clustering
- topic-modeling
- umap
- hdbscan
- whatsapp-analysis
datasets:
- private
language:
- en
pipeline_tag: feature-extraction
---

# Vibes Chat Clustering Model

> ⚠️ **Note**: This is a demo model trained on a specific WhatsApp group ("The vibez" - tech/AI discussions). For production use, **train your own UMAP model on your chat data**.

## About

This model demonstrates the UMAP-only training approach for WhatsApp chat clustering. It's trained on ~400 conversation bursts from a tech-focused group chat.

### Best For
- Demo/example of clustering approach
- Comparing topics with similar tech/AI discussion groups
- Learning how the pipeline works

### NOT Recommended For
- Production clustering of your own chats (train on your data instead!)
- Different chat domains (family, work, etc.)
- Long-term deployment

## Model Architecture

This uses a **UMAP-only training** approach:

1. **Embeddings**: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
2. **UMAP** (TRAINED): Reduces to 15 dimensions with cosine metric
3. **HDBSCAN** (FRESH): Clustering on each inference (min_cluster_size=2)
4. **c-TF-IDF** (FRESH): Topic extraction with current vocabulary
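
The repository's `config.json` stores the settings that the Quick Start code below reads. As an illustration only (the real file may contain additional fields), the two keys that code uses could look like this:

```python
# Illustrative sketch, not the verbatim contents of config.json:
# these are just the two keys the Quick Start code below reads,
# with values matching the architecture described above.
config = {
    "bert_model": "sentence-transformers/all-MiniLM-L6-v2",  # embedding model (384 dims)
    "recommended_min_cluster_size": 2,                        # HDBSCAN density setting
}
```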

### Why UMAP-Only?

Unlike BERTopic's `transform()`, which freezes the vocabulary and causes high noise rates (94.8% in our tests), this approach:
- Trains UMAP once for a consistent embedding space
- Re-clusters with fresh HDBSCAN each time
- Extracts topics with a fresh vocabulary each time

This adapts to new vocabulary while maintaining spatial consistency.

## Quick Start

```python
from huggingface_hub import hf_hub_download
import pickle
import json
from sentence_transformers import SentenceTransformer
import hdbscan
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Download model (cached after first run)
umap_path = hf_hub_download(
    repo_id="msugimura/vibes-clustering",
    filename="umap_model.pkl"
)
config_path = hf_hub_download(
    repo_id="msugimura/vibes-clustering",
    filename="config.json"
)

# Load model
with open(umap_path, 'rb') as f:
    umap_model = pickle.load(f)
with open(config_path) as f:
    config = json.load(f)

# Load BERT model
bert_model = SentenceTransformer(config['bert_model'])

# Your texts
texts = ["your chat messages here"]

# 1. Embed
embeddings = bert_model.encode(texts)

# 2. Transform with pre-trained UMAP
reduced = umap_model.transform(embeddings)

# 3. Fresh clustering
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=config['recommended_min_cluster_size'],
    metric='euclidean',
)
labels = clusterer.fit_predict(reduced)

# 4. Extract topics (see full code in repo)
# ... c-TF-IDF implementation ...
```
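
The full c-TF-IDF step lives in the repo; the sketch below reconstructs roughly what it does. The `cluster_docs` grouping and the exact IDF weighting shown here are illustrative assumptions, not the repo's verbatim implementation.

```python
# Rough sketch of step 4: class-based TF-IDF over the fresh clusters.
# Uses `texts` and `labels` from the Quick Start above; HDBSCAN noise (label -1) is skipped.
cluster_docs = {}
for text, label in zip(texts, labels):
    if label == -1:
        continue
    cluster_docs.setdefault(label, []).append(text)

# One concatenated "document" per cluster, counted with a fresh vocabulary
docs = [" ".join(cluster_docs[cid]) for cid in sorted(cluster_docs.keys())]
vectorizer = CountVectorizer(stop_words="english")  # stop-word choice is an assumption
counts = vectorizer.fit_transform(docs)

# c-TF-IDF: per-cluster term frequencies weighted by an IDF-style term (one common variant)
tf = counts.multiply(1.0 / counts.sum(axis=1))
idf = np.log(1 + counts.shape[0] / (1 + (counts > 0).sum(axis=0)))
ctfidf = tf.multiply(idf).toarray()

# Get top words per cluster
words = vectorizer.get_feature_names_out()
for i, cid in enumerate(sorted(cluster_docs.keys())):
    top_indices = ctfidf[i].argsort()[-10:][::-1]
    top_words = [words[j] for j in top_indices]
    print(f"Topic {cid}: {', '.join(top_words)}")
```

BERTopic's own c-TF-IDF formula differs in detail; the point is that the vocabulary is rebuilt from the current messages on every run.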

## Training Your Own Model

**Recommended for production use:**

```python
from sentence_transformers import SentenceTransformer
import umap

# Your chat messages
historical_texts = [...]  # First 80% of your timeline

# Embed
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(historical_texts)

# Train UMAP on YOUR data
umap_model = umap.UMAP(
    n_components=15,
    metric='cosine',
    random_state=42
)
umap_model.fit(embeddings)

# Save for future use
import pickle
with open('my_chat_umap.pkl', 'wb') as f:
    pickle.dump(umap_model, f)
```

Then use `transform()` on new messages with fresh clustering/topics each time.
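
A minimal sketch of that inference step, reusing the `my_chat_umap.pkl` file saved above (the `min_cluster_size=2` value simply mirrors this card's recommendation):

```python
import pickle
import hdbscan
from sentence_transformers import SentenceTransformer

# Load the UMAP model trained on your own history
with open('my_chat_umap.pkl', 'rb') as f:
    my_umap = pickle.load(f)

embedder = SentenceTransformer('all-MiniLM-L6-v2')

# New messages arrive later: embed, project with the frozen UMAP, then re-cluster from scratch
new_texts = ["..."]  # placeholder for your new chat messages
embeddings = embedder.encode(new_texts)
reduced = my_umap.transform(embeddings)
labels = hdbscan.HDBSCAN(min_cluster_size=2, metric='euclidean').fit_predict(reduced)
```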

## Training Details

- **Training data**: ~400 conversation bursts (excluding last 2 months for privacy)
- **Date range**: June 2024 - November 2024
- **Domain**: Tech/AI discussions (Claude Code, superpowers, coding workflows)
- **Typical topics**: agent workflows, coding tools, LLM discussions, infrastructure

## Performance

On held-out test data (last 2 months):
- **8 clusters** identified
- **27.8% noise** rate
- Clear topic differentiation
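
For context, the noise rate above is the share of messages HDBSCAN leaves unassigned (cluster label -1); on your own runs it can be computed directly from the labels:

```python
import numpy as np

# `labels` comes from the clustering step in the snippets above
noise_rate = float(np.mean(labels == -1))
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{n_clusters} clusters, {noise_rate:.1%} noise")
```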

## Limitations

- Trained on tech/AI vocabulary - may not generalize to other domains
- Small training set (400 bursts) - larger chats should train their own
- English only
- Optimized for min_cluster_size=2 (adjust for your density)

## Citation

If you use this approach:

```bibtex
@misc{vibes-clustering-2024,
  title={UMAP-Only Training for Chat Clustering},
  author={Sugimura, Michael},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/msugimura/vibes-clustering}
}
```

## Related Work

- [BERTopic](https://github.com/MaartenGr/BERTopic) - Full pipeline (this model borrows only its UMAP-training step)
- [UMAP](https://github.com/lmcinnes/umap) - Dimensionality reduction
- [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan) - Density-based clustering
- [sentence-transformers](https://www.sbert.net/) - Text embeddings

## License

MIT License - Free to use, but train your own model for production!