---
license: mit
tags:
- clustering
- topic-modeling
- umap
- hdbscan
- whatsapp-analysis
datasets:
- private
language:
- en
pipeline_tag: feature-extraction
---
# Vibes Chat Clustering Model
> ⚠️ **Note**: This is a demo model trained on a specific WhatsApp group ("The vibez" - tech/AI discussions). For production use, **train your own UMAP model on your chat data**.
## About
This model demonstrates the UMAP-only training approach for WhatsApp chat clustering. It's trained on ~400 conversation bursts from a tech-focused group chat.
### Best For
- Demo/example of clustering approach
- Comparing topics with similar tech/AI discussion groups
- Learning how the pipeline works
### NOT Recommended For
- Production clustering of your own chats (train on your data instead!)
- Different chat domains (family, work, etc.)
- Long-term deployment
## Model Architecture
This uses a **UMAP-only training** approach:
1. **Embeddings**: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
2. **UMAP** (TRAINED): Reduces to 15 dimensions with cosine metric
3. **HDBSCAN** (FRESH): Clustering on each inference (min_cluster_size=2)
4. **c-TF-IDF** (FRESH): Topic extraction with current vocabulary
### Why UMAP-Only?
Unlike BERTopic's `transform()`, which freezes the vocabulary and produced high noise rates (94.8% in our tests), this approach:
- Trains UMAP once for consistent embedding space
- Re-clusters with fresh HDBSCAN each time
- Extracts topics with fresh vocabulary each time
This adapts to new vocabulary while maintaining spatial consistency.
## Quick Start
```python
from huggingface_hub import hf_hub_download
import pickle
import json
from sentence_transformers import SentenceTransformer
import hdbscan
from sklearn.feature_extraction.text import CountVectorizer  # used by the c-TF-IDF step
import numpy as np

# Download model (cached after first run)
umap_path = hf_hub_download(
    repo_id="msugimura/vibes-clustering",
    filename="umap_model.pkl",
)
config_path = hf_hub_download(
    repo_id="msugimura/vibes-clustering",
    filename="config.json",
)

# Load model
with open(umap_path, 'rb') as f:
    umap_model = pickle.load(f)
with open(config_path) as f:
    config = json.load(f)

bert_model = SentenceTransformer(config['bert_model'])

# Your texts
texts = ["your chat messages here"]

# 1. Embed
embeddings = bert_model.encode(texts)

# 2. Transform with pre-trained UMAP
reduced = umap_model.transform(embeddings)

# 3. Fresh clustering
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=config['recommended_min_cluster_size'],
    metric='euclidean',
    cluster_selection_method='eom',
)
labels = clusterer.fit_predict(reduced)

# 4. Extract topics (see full code in repo)
# ... c-TF-IDF implementation ...
```
## Training Your Own Model
**Recommended for production use:**
```python
from sentence_transformers import SentenceTransformer
import umap
import pickle

# Your chat messages
historical_texts = [...]  # First 80% of your timeline

# Embed
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(historical_texts)

# Train UMAP on YOUR data
umap_model = umap.UMAP(
    n_components=15,
    metric='cosine',
    random_state=42,
)
umap_model.fit(embeddings)

# Save for future use
with open('my_chat_umap.pkl', 'wb') as f:
    pickle.dump(umap_model, f)
```
Then use `transform()` on new messages with fresh clustering/topics each time.
## Training Details
- **Training data**: ~400 conversation bursts (excluding last 2 months for privacy)
- **Date range**: June 2024 - November 2024
- **Domain**: Tech/AI discussions (Claude Code, superpowers, coding workflows)
- **Typical topics**: agent workflows, coding tools, LLM discussions, infrastructure
## Performance
On held-out test data (last 2 months):
- **8 clusters** identified
- **27.8%** noise rate
- Clear topic differentiation
## Limitations
- Trained on tech/AI vocabulary - may not generalize to other domains
- Small training set (400 bursts) - larger chats should train their own
- English only
- Optimized for min_cluster_size=2 (adjust for your density)
## Citation
If you use this approach:
```bibtex
@misc{vibes-clustering-2024,
title={UMAP-Only Training for Chat Clustering},
author={Sugimura, Michael},
year={2024},
publisher={HuggingFace},
url={https://huggingface.co/msugimura/vibes-clustering}
}
```
## Related Work
- [BERTopic](https://github.com/MaartenGr/BERTopic) - Full pipeline (this model extracts only its UMAP-training stage)
- [UMAP](https://github.com/lmcinnes/umap) - Dimensionality reduction
- [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan) - Density-based clustering
- [sentence-transformers](https://www.sbert.net/) - Text embeddings
## License
MIT License - Free to use, but train your own model for production!