---
license: mit
tags:
- clustering
- topic-modeling
- umap
- hdbscan
- whatsapp-analysis
datasets:
- private
language:
- en
pipeline_tag: feature-extraction
---
|
|
|
|
|
# Vibes Chat Clustering Model |
|
|
|
|
|
> ⚠️ **Note**: This is a demo model trained on a specific WhatsApp group ("The vibez" - tech/AI discussions). For production use, **train your own UMAP model on your chat data**. |
|
|
|
|
|
## About |
|
|
|
|
|
This model demonstrates the UMAP-only training approach for WhatsApp chat clustering. It's trained on ~400 conversation bursts from a tech-focused group chat. |
|
|
|
|
|
### Best For |
|
|
- Demo/example of clustering approach |
|
|
- Comparing topics with similar tech/AI discussion groups |
|
|
- Learning how the pipeline works |
|
|
|
|
|
### NOT Recommended For |
|
|
- Production clustering of your own chats (train on your data instead!) |
|
|
- Different chat domains (family, work, etc.) |
|
|
- Long-term deployment |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
This uses a **UMAP-only training** approach: |
|
|
|
|
|
1. **Embeddings**: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions) |
|
|
2. **UMAP** (TRAINED): Reduces to 15 dimensions with cosine metric |
|
|
3. **HDBSCAN** (FRESH): Clustering on each inference (min_cluster_size=2) |
|
|
4. **c-TF-IDF** (FRESH): Topic extraction with current vocabulary |
|
|
|
|
|
### Why UMAP-Only? |
|
|
|
|
|
Unlike BERTopic's `transform()`, which freezes the vocabulary and produces high noise rates (94.8% in our tests), this approach:
|
|
- Trains UMAP once for consistent embedding space |
|
|
- Re-clusters with fresh HDBSCAN each time |
|
|
- Extracts topics with fresh vocabulary each time |
|
|
|
|
|
This adapts to new vocabulary while maintaining spatial consistency. |
|
|
|
|
|
## Quick Start |
|
|
|
|
|
```python
from huggingface_hub import hf_hub_download
import pickle
import json
from sentence_transformers import SentenceTransformer
import hdbscan
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Download model files (cached after first run)
umap_path = hf_hub_download(
    repo_id="msugimura/vibes-clustering",
    filename="umap_model.pkl"
)
config_path = hf_hub_download(
    repo_id="msugimura/vibes-clustering",
    filename="config.json"
)

# Load the trained UMAP model and its config
with open(umap_path, 'rb') as f:
    umap_model = pickle.load(f)
with open(config_path) as f:
    config = json.load(f)

bert_model = SentenceTransformer(config['bert_model'])

# Your texts
texts = ["your chat messages here"]

# 1. Embed
embeddings = bert_model.encode(texts)

# 2. Transform with the pre-trained UMAP
reduced = umap_model.transform(embeddings)

# 3. Fresh clustering
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=config['recommended_min_cluster_size'],
    metric='euclidean',
    cluster_selection_method='eom'
)
labels = clusterer.fit_predict(reduced)

# 4. Extract topics (see full code in repo)
# ... c-TF-IDF implementation ...
```
|
|
|
|
|
## Training Your Own Model |
|
|
|
|
|
**Recommended for production use:** |
|
|
|
|
|
```python
from sentence_transformers import SentenceTransformer
import umap
import pickle

# Your chat messages
historical_texts = [...]  # First 80% of your timeline

# Embed
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(historical_texts)

# Train UMAP on YOUR data
umap_model = umap.UMAP(
    n_components=15,
    metric='cosine',
    random_state=42
)
umap_model.fit(embeddings)

# Save for future use
with open('my_chat_umap.pkl', 'wb') as f:
    pickle.dump(umap_model, f)
```
|
|
|
|
|
Then use `transform()` on new messages with fresh clustering/topics each time. |
|
|
|
|
|
## Training Details |
|
|
|
|
|
- **Training data**: ~400 conversation bursts (excluding last 2 months for privacy) |
|
|
- **Date range**: June 2024 - November 2024 |
|
|
- **Domain**: Tech/AI discussions (Claude Code, superpowers, coding workflows) |
|
|
- **Typical topics**: agent workflows, coding tools, LLM discussions, infrastructure |
|
|
|
|
|
## Performance |
|
|
|
|
|
On held-out test data (last 2 months): |
|
|
- **8 clusters** identified |
|
|
- **27.8% noise** rate |
|
|
- Clear topic differentiation |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Trained on tech/AI vocabulary - may not generalize to other domains |
|
|
- Small training set (~400 bursts) - owners of larger chats should train their own model
|
|
- English only |
|
|
- Optimized for min_cluster_size=2 (adjust for your density) |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this approach: |
|
|
|
|
|
```bibtex
@misc{vibes-clustering-2024,
  title={UMAP-Only Training for Chat Clustering},
  author={Sugimura, Michael},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/msugimura/vibes-clustering}
}
```
|
|
|
|
|
## Related Work |
|
|
|
|
|
- [BERTopic](https://github.com/MaartenGr/BERTopic) - The full pipeline this UMAP-only approach was extracted from
|
|
- [UMAP](https://github.com/lmcinnes/umap) - Dimensionality reduction |
|
|
- [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan) - Density-based clustering |
|
|
- [sentence-transformers](https://www.sbert.net/) - Text embeddings |
|
|
|
|
|
## License |
|
|
|
|
|
MIT License - Free to use, but train your own model for production! |
|
|
|