---
license: mit
tags:
- clustering
- topic-modeling
- umap
- hdbscan
- whatsapp-analysis
datasets:
- private
language:
- en
pipeline_tag: feature-extraction
---
# Vibes Chat Clustering Model
> ⚠️ **Note**: This is a demo model trained on a specific WhatsApp group ("The vibez" - tech/AI discussions). For production use, **train your own UMAP model on your chat data**.
## About
This model demonstrates the UMAP-only training approach for WhatsApp chat clustering. It's trained on ~400 conversation bursts from a tech-focused group chat.
### Best For
- Demo/example of clustering approach
- Comparing topics with similar tech/AI discussion groups
- Learning how the pipeline works
### NOT Recommended For
- Production clustering of your own chats (train on your data instead!)
- Different chat domains (family, work, etc.)
- Long-term deployment
## Model Architecture
This uses a **UMAP-only training** approach:
1. **Embeddings**: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
2. **UMAP** (TRAINED): Reduces to 15 dimensions with cosine metric
3. **HDBSCAN** (FRESH): Clustering on each inference (min_cluster_size=2)
4. **c-TF-IDF** (FRESH): Topic extraction with current vocabulary
### Why UMAP-Only?
Unlike BERTopic's `transform()`, which freezes the vocabulary and produced high noise rates in our tests (94.8%), this approach:
- Trains UMAP once for consistent embedding space
- Re-clusters with fresh HDBSCAN each time
- Extracts topics with fresh vocabulary each time
This adapts to new vocabulary while maintaining spatial consistency.
## Quick Start
```python
from huggingface_hub import hf_hub_download
import pickle
import json
from sentence_transformers import SentenceTransformer
import hdbscan
from sklearn.feature_extraction.text import CountVectorizer  # used in step 4 (c-TF-IDF)
import numpy as np  # used in step 4 (c-TF-IDF)

# Download model files (cached after first run)
umap_path = hf_hub_download(
    repo_id="msugimura/vibes-clustering",
    filename="umap_model.pkl"
)
config_path = hf_hub_download(
    repo_id="msugimura/vibes-clustering",
    filename="config.json"
)

# Load model and config
with open(umap_path, 'rb') as f:
    umap_model = pickle.load(f)
with open(config_path) as f:
    config = json.load(f)

bert_model = SentenceTransformer(config['bert_model'])

# Your texts
texts = ["your chat messages here"]

# 1. Embed
embeddings = bert_model.encode(texts)

# 2. Transform with the pre-trained UMAP
reduced = umap_model.transform(embeddings)

# 3. Fresh clustering
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=config['recommended_min_cluster_size'],
    metric='euclidean',
    cluster_selection_method='eom'
)
labels = clusterer.fit_predict(reduced)

# 4. Extract topics (see full code in repo)
# ... c-TF-IDF implementation ...
```
## Training Your Own Model
**Recommended for production use:**
```python
from sentence_transformers import SentenceTransformer
import umap
import pickle

# Your chat messages
historical_texts = [...]  # first 80% of your timeline

# Embed
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(historical_texts)

# Train UMAP on YOUR data
umap_model = umap.UMAP(
    n_components=15,
    metric='cosine',
    random_state=42
)
umap_model.fit(embeddings)

# Save for future use
with open('my_chat_umap.pkl', 'wb') as f:
    pickle.dump(umap_model, f)
```
Then, at inference time, call `transform()` on new messages and run fresh HDBSCAN clustering and topic extraction each time.
## Training Details
- **Training data**: ~400 conversation bursts (excluding last 2 months for privacy)
- **Date range**: June 2024 - November 2024
- **Domain**: Tech/AI discussions (Claude Code, superpowers, coding workflows)
- **Typical topics**: agent workflows, coding tools, LLM discussions, infrastructure
## Performance
On held-out test data (last 2 months):
- **8 clusters** identified
- **27.8% noise** rate
- Clear topic differentiation
## Limitations
- Trained on tech/AI vocabulary - may not generalize to other domains
- Small training set (400 bursts) - larger chats should train their own model
- English only
- Optimized for min_cluster_size=2 (adjust for your density)
## Citation
If you use this approach:
```bibtex
@misc{vibes-clustering-2024,
  title={UMAP-Only Training for Chat Clustering},
  author={Sugimura, Michael},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/msugimura/vibes-clustering}
}
```
## Related Work
- [BERTopic](https://github.com/MaartenGr/BERTopic) - Full pipeline from which this UMAP-only approach is extracted
- [UMAP](https://github.com/lmcinnes/umap) - Dimensionality reduction
- [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan) - Density-based clustering
- [sentence-transformers](https://www.sbert.net/) - Text embeddings
## License
MIT License - Free to use, but train your own model for production!