---
license: mit
tags:
- clustering
- topic-modeling
- umap
- hdbscan
- whatsapp-analysis
datasets:
- private
language:
- en
pipeline_tag: feature-extraction
---

# Vibes Chat Clustering Model

> ⚠️ **Note**: This is a demo model trained on a specific WhatsApp group ("The vibez" - tech/AI discussions). For production use, **train your own UMAP model on your chat data**.

## About

This model demonstrates the UMAP-only training approach for WhatsApp chat clustering. It's trained on ~400 conversation bursts from a tech-focused group chat.

### Best For
- Demo/example of clustering approach
- Comparing topics across similar tech/AI discussion groups
- Learning how the pipeline works

### NOT Recommended For
- Production clustering of your own chats (train on your data instead!)
- Different chat domains (family, work, etc.)
- Long-term deployment

## Model Architecture

This uses a **UMAP-only training** approach:

1. **Embeddings**: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
2. **UMAP** (TRAINED): Reduces to 15 dimensions with cosine metric
3. **HDBSCAN** (FRESH): Clustering on each inference (min_cluster_size=2)
4. **c-TF-IDF** (FRESH): Topic extraction with current vocabulary

### Why UMAP-Only?

Unlike BERTopic's `transform()`, which freezes the vocabulary and produced high noise rates in our tests (94.8%), this approach:
- Trains UMAP once for consistent embedding space
- Re-clusters with fresh HDBSCAN each time
- Extracts topics with fresh vocabulary each time

This adapts to new vocabulary while maintaining spatial consistency.
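The vocabulary-freezing point can be illustrated with scikit-learn's `CountVectorizer` (an illustrative analogy, not this repo's code; the example messages are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

old_msgs = ["we tried langchain agents today"]  # training-time messages
new_msgs = ["anyone benchmarked deepseek yet"]  # later messages, new slang

# A vectorizer fitted once and then frozen (roughly what transform() does):
frozen = CountVectorizer().fit(old_msgs)
print(frozen.transform(new_msgs).sum())  # 0 -- every new term is silently dropped

# Refitting on the current batch keeps the new vocabulary:
fresh = CountVectorizer().fit(new_msgs)
print(len(fresh.vocabulary_))  # 4 -- all new terms retained
```

Fresh HDBSCAN and c-TF-IDF passes give the same benefit at the clustering and topic-extraction stages.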

## Quick Start

```python
from huggingface_hub import hf_hub_download
import pickle
import json
from sentence_transformers import SentenceTransformer
import hdbscan
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Download model (cached after first run)
umap_path = hf_hub_download(
    repo_id="msugimura/vibes-clustering",
    filename="umap_model.pkl"
)
config_path = hf_hub_download(
    repo_id="msugimura/vibes-clustering",
    filename="config.json"
)

# Load model
with open(umap_path, 'rb') as f:
    umap_model = pickle.load(f)
with open(config_path) as f:
    config = json.load(f)

bert_model = SentenceTransformer(config['bert_model'])

# Your texts
texts = ["your chat messages here"]

# 1. Embed
embeddings = bert_model.encode(texts)

# 2. Transform with pre-trained UMAP
reduced = umap_model.transform(embeddings)

# 3. Fresh clustering
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=config['recommended_min_cluster_size'],
    metric='euclidean',
    cluster_selection_method='eom'
)
labels = clusterer.fit_predict(reduced)

# 4. Extract topics (see full code in repo)
# ... c-TF-IDF implementation ...
```

## Training Your Own Model

**Recommended for production use:**

```python
from sentence_transformers import SentenceTransformer
import umap

# Your chat messages
historical_texts = [...]  # First 80% of your timeline

# Embed
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(historical_texts)

# Train UMAP on YOUR data
umap_model = umap.UMAP(
    n_components=15,
    metric='cosine',
    random_state=42
)
umap_model.fit(embeddings)

# Save for future use
import pickle
with open('my_chat_umap.pkl', 'wb') as f:
    pickle.dump(umap_model, f)
```

Then use `transform()` on new messages with fresh clustering/topics each time.

## Training Details

- **Training data**: ~400 conversation bursts (excluding last 2 months for privacy)
- **Date range**: June 2024 - November 2024
- **Domain**: Tech/AI discussions (Claude Code, superpowers, coding workflows)
- **Typical topics**: agent workflows, coding tools, LLM discussions, infrastructure

## Performance

On held-out test data (last 2 months):
- **8 clusters** identified
- **27.8% noise** rate
- Clear topic differentiation

## Limitations

- Trained on tech/AI vocabulary - may not generalize to other domains
- Small training set (400 bursts) - larger chats should train their own
- English only
- Optimized for min_cluster_size=2 (adjust for your density)

## Citation

If you use this approach:

```bibtex
@misc{vibes-clustering-2024,
  title={UMAP-Only Training for Chat Clustering},
  author={Sugimura, Michael},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/msugimura/vibes-clustering}
}
```

## Related Work

- [BERTopic](https://github.com/MaartenGr/BERTopic) - Full topic-modeling pipeline (this model isolates its UMAP-training step)
- [UMAP](https://github.com/lmcinnes/umap) - Dimensionality reduction
- [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan) - Density-based clustering
- [sentence-transformers](https://www.sbert.net/) - Text embeddings

## License

MIT License - Free to use, but train your own model for production!