msugimura committed on
Commit 8ed773e · verified · 1 Parent(s): e976c9c

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +138 -54
README.md CHANGED
@@ -1,17 +1,58 @@
  # Vibes Chat Clustering Model

- This model is trained on WhatsApp chat data from "The vibez" group.

  ## Model Architecture

- - **Embeddings**: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
- - **UMAP**: Reduces to 15 dimensions with cosine metric
- - **Clustering**: Fresh HDBSCAN on each inference (min_cluster_size=2 recommended)
- - **Topics**: Fresh c-TF-IDF vocabulary extraction on each inference

- ## Usage

  ```python
  import pickle
  import json
  from sentence_transformers import SentenceTransformer
@@ -19,26 +60,34 @@ import hdbscan
  from sklearn.feature_extraction.text import CountVectorizer
  import numpy as np

- # Load model and config
- with open('umap_model.pkl', 'rb') as f:
-     umap_model = pickle.load(f)

- with open('config.json') as f:
      config = json.load(f)

- # Load BERT model
  bert_model = SentenceTransformer(config['bert_model'])

- # Your texts to cluster
- texts = ["your", "messages", "here"]

- # 1. Embed with BERT
  embeddings = bert_model.encode(texts)

- # 2. Transform with UMAP
  reduced = umap_model.transform(embeddings)

- # 3. Cluster with HDBSCAN (fresh clustering)
  clusterer = hdbscan.HDBSCAN(
      min_cluster_size=config['recommended_min_cluster_size'],
      metric='euclidean',
@@ -46,48 +95,83 @@ clusterer = hdbscan.HDBSCAN(
  )
  labels = clusterer.fit_predict(reduced)

- # 4. Extract topics with c-TF-IDF (fresh vocabulary)
- # Group texts by cluster
- cluster_docs = {}
- for text, label in zip(texts, labels):
-     if label != -1:
-         cluster_docs.setdefault(label, []).append(text)
-
- # Concatenate per cluster
- cluster_texts = [" ".join(cluster_docs[cid]) for cid in sorted(cluster_docs.keys())]
-
- # Vectorize
- vectorizer = CountVectorizer(
-     stop_words="english",
-     min_df=max(1, int(len(cluster_texts) * 0.05)),
-     max_df=config['max_df'],
-     ngram_range=tuple(config['ngram_range'])
  )
- tf = vectorizer.fit_transform(cluster_texts)
-
- # c-TF-IDF
- n_clusters = len(cluster_texts)
- df = np.array((tf > 0).sum(axis=0)).flatten()
- idf = np.log(n_clusters / (1 + df))
- ctfidf = tf.multiply(idf).toarray()
-
- # Get top words
- words = vectorizer.get_feature_names_out()
- for i, cid in enumerate(sorted(cluster_docs.keys())):
-     top_indices = ctfidf[i].argsort()[-10:][::-1]
-     top_words = [words[j] for j in top_indices]
-     print(f"Topic {cid}: {', '.join(top_words)}")
  ```

- ## Training Data

- Trained on historical WhatsApp chat bursts (excluding last 2 months).

- ## Key Insight

- This model uses **UMAP-only training**:
- - UMAP projection is frozen (trained once)
- - HDBSCAN clustering is fresh each inference
- - c-TF-IDF vocabulary is fresh each inference

- This allows the model to adapt to new vocabulary and topics while maintaining a consistent embedding space.
 
+ ---
+ license: mit
+ tags:
+ - clustering
+ - topic-modeling
+ - umap
+ - hdbscan
+ - whatsapp-analysis
+ datasets:
+ - private
+ language:
+ - en
+ pipeline_tag: feature-extraction
+ ---
+
  # Vibes Chat Clustering Model

+ > ⚠️ **Note**: This is a demo model trained on a specific WhatsApp group ("The vibez" - tech/AI discussions). For production use, **train your own UMAP model on your chat data**.
+
+ ## About
+
+ This model demonstrates the UMAP-only training approach for WhatsApp chat clustering. It's trained on ~400 conversation bursts from a tech-focused group chat.
+
+ ### Best For
+ - Demo/example of clustering approach
+ - Comparing topics with similar tech/AI discussion groups
+ - Learning how the pipeline works
+
+ ### NOT Recommended For
+ - Production clustering of your own chats (train on your data instead!)
+ - Different chat domains (family, work, etc.)
+ - Long-term deployment

  ## Model Architecture

+ This uses a **UMAP-only training** approach:
+
+ 1. **Embeddings**: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
+ 2. **UMAP** (TRAINED): Reduces to 15 dimensions with cosine metric
+ 3. **HDBSCAN** (FRESH): Clustering on each inference (min_cluster_size=2)
+ 4. **c-TF-IDF** (FRESH): Topic extraction with current vocabulary
+
+ ### Why UMAP-Only?
+
+ Unlike BERTopic's `transform()` which freezes vocabulary and causes high noise rates (94.8% in our tests), this approach:
+ - Trains UMAP once for consistent embedding space
+ - Re-clusters with fresh HDBSCAN each time
+ - Extracts topics with fresh vocabulary each time
+
+ This adapts to new vocabulary while maintaining spatial consistency.

+ ## Quick Start

  ```python
+ from huggingface_hub import hf_hub_download
  import pickle
  import json
  from sentence_transformers import SentenceTransformer
 
  from sklearn.feature_extraction.text import CountVectorizer
  import numpy as np

+ # Download model (cached after first run)
+ umap_path = hf_hub_download(
+     repo_id="msugimura/vibes-clustering",
+     filename="umap_model.pkl"
+ )
+ config_path = hf_hub_download(
+     repo_id="msugimura/vibes-clustering",
+     filename="config.json"
+ )

+ # Load model
+ with open(umap_path, 'rb') as f:
+     umap_model = pickle.load(f)
+ with open(config_path) as f:
      config = json.load(f)

  bert_model = SentenceTransformer(config['bert_model'])

+ # Your texts
+ texts = ["your chat messages here"]

+ # 1. Embed
  embeddings = bert_model.encode(texts)

+ # 2. Transform with pre-trained UMAP
  reduced = umap_model.transform(embeddings)

+ # 3. Fresh clustering
  clusterer = hdbscan.HDBSCAN(
      min_cluster_size=config['recommended_min_cluster_size'],
      metric='euclidean',
 
  )
  labels = clusterer.fit_predict(reduced)

+ # 4. Extract topics (see full code in repo)
+ # ... c-TF-IDF implementation ...
+ ```
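
Step 4 is elided in the snippet above. As a rough, dependency-free sketch of what a c-TF-IDF step can look like, the function below follows the formula from the earlier version of this README (`idf = log(n_clusters / (1 + df))`), but swaps `CountVectorizer` for plain whitespace tokenization with `collections.Counter`. The name `ctfidf_top_words` and the tokenization are assumptions for illustration, not the code in the repo.

```python
import math
from collections import Counter


def ctfidf_top_words(texts, labels, top_n=10):
    """Return {cluster_id: [top words]} from per-cluster term counts.

    Hypothetical helper: whitespace tokenization stands in for CountVectorizer.
    """
    # Group tokens by cluster, skipping HDBSCAN noise (label -1)
    cluster_tokens = {}
    for text, label in zip(texts, labels):
        if label != -1:
            cluster_tokens.setdefault(int(label), []).extend(text.lower().split())

    n_clusters = len(cluster_tokens)
    # Term frequency within each cluster
    tf = {cid: Counter(tokens) for cid, tokens in cluster_tokens.items()}
    # Number of clusters in which each term appears
    df = Counter(term for counts in tf.values() for term in counts)

    topics = {}
    for cid in sorted(tf):
        scores = {
            term: count * math.log(n_clusters / (1 + df[term]))
            for term, count in tf[cid].items()
        }
        # Highest score first; ties broken alphabetically for stable output
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        topics[cid] = ranked[:top_n]
    return topics


example_texts = ["gpu gpu kernel", "gpu cuda", "bread flour",
                 "flour oven flour", "tax tax forms", "stray message"]
example_labels = [0, 0, 1, 1, 2, -1]
for cid, words in ctfidf_top_words(example_texts, example_labels, top_n=3).items():
    print(f"Topic {cid}: {', '.join(words)}")
```

With this idf, a term that appears in every cluster, or in all but one, scores zero or below, so the sketch is most informative with three or more clusters.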
+
+ ## Training Your Own Model
+
+ **Recommended for production use:**
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ import umap
+
+ # Your chat messages
+ historical_texts = [...] # First 80% of your timeline
+
+ # Embed
+ model = SentenceTransformer('all-MiniLM-L6-v2')
+ embeddings = model.encode(historical_texts)
+
+ # Train UMAP on YOUR data
+ umap_model = umap.UMAP(
+     n_components=15,
+     metric='cosine',
+     random_state=42
  )
+ umap_model.fit(embeddings)
+
+ # Save for future use
+ import pickle
+ with open('my_chat_umap.pkl', 'wb') as f:
+     pickle.dump(umap_model, f)
  ```

+ Then use `transform()` on new messages with fresh clustering/topics each time.
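
The `historical_texts` placeholder above is described as the first 80% of the timeline. One way to build it, assuming messages are available as `(timestamp, text)` pairs, is a simple chronological split. The helper name `chronological_split` and the input shape are hypothetical.

```python
from datetime import datetime


def chronological_split(messages, train_fraction=0.8):
    """Split (timestamp, text) pairs into historical and recent texts.

    Hypothetical helper: the earliest `train_fraction` of messages go to
    the historical set used to fit UMAP, the rest are held out.
    """
    ordered = sorted(messages, key=lambda m: m[0])
    cut = int(len(ordered) * train_fraction)
    historical = [text for _, text in ordered[:cut]]
    recent = [text for _, text in ordered[cut:]]
    return historical, recent


messages = [
    (datetime(2024, 9, 1), "third"),
    (datetime(2024, 6, 1), "first"),
    (datetime(2024, 11, 1), "fifth"),
    (datetime(2024, 7, 1), "second"),
    (datetime(2024, 10, 1), "fourth"),
]
historical_texts, recent_texts = chronological_split(messages)
```

Splitting by time rather than at random keeps the held-out messages strictly later than the training data, which matches how the saved UMAP model would be used on new messages.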
+
+ ## Training Details
+
+ - **Training data**: ~400 conversation bursts (excluding last 2 months for privacy)
+ - **Date range**: June 2024 - November 2024
+ - **Domain**: Tech/AI discussions (Claude Code, superpowers, coding workflows)
+ - **Typical topics**: agent workflows, coding tools, LLM discussions, infrastructure
+
+ ## Performance
+
+ On held-out test data (last 2 months):
+ - **8 clusters** identified
+ - **27.8% noise** rate
+ - Clear topic differentiation
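
The cluster count and noise rate reported above can be read directly off the HDBSCAN labels, where `-1` marks noise points. A minimal sketch, with a hypothetical helper name `summarize_labels`:

```python
def summarize_labels(labels):
    """Return (number of clusters, noise rate) for HDBSCAN-style labels.

    Hypothetical helper: label -1 is treated as noise, every other
    distinct label as one cluster.
    """
    labels = [int(label) for label in labels]
    n_clusters = len({label for label in labels if label != -1})
    noise_rate = (
        sum(1 for label in labels if label == -1) / len(labels) if labels else 0.0
    )
    return n_clusters, noise_rate


n_clusters, noise_rate = summarize_labels([0, 0, 1, -1])
print(f"{n_clusters} clusters, {noise_rate:.1%} noise")
```

The same function accepts the `labels` array returned by `clusterer.fit_predict(reduced)`.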
+
+ ## Limitations
+
+ - Trained on tech/AI vocabulary - may not generalize to other domains
+ - Small training set (400 bursts) - larger chats should train their own
+ - English only
+ - Optimized for min_cluster_size=2 (adjust for your density)
+
+ ## Citation
+
+ If you use this approach:
+
+ ```bibtex
+ @misc{vibes-clustering-2024,
+ title={UMAP-Only Training for Chat Clustering},
+ author={Sugimura, Michael},
+ year={2024},
+ publisher={HuggingFace},
+ url={https://huggingface.co/msugimura/vibes-clustering}
+ }
+ ```

+ ## Related Work

+ - [BERTopic](https://github.com/MaartenGr/BERTopic) - Full pipeline (this extracts the UMAP-only approach)
+ - [UMAP](https://github.com/lmcinnes/umap) - Dimensionality reduction
+ - [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan) - Density-based clustering
+ - [sentence-transformers](https://www.sbert.net/) - Text embeddings

+ ## License

+ MIT License - Free to use, but train your own model for production!