---
license: mit
tags:
- clustering
- topic-modeling
- umap
- hdbscan
- whatsapp-analysis
datasets:
- private
language:
- en
pipeline_tag: feature-extraction
---
|
|
|
|
|
# Vibes Chat Clustering Model |
|
|
|
|
|
> ⚠️ **Note**: This is a demo model trained on a specific WhatsApp group ("The vibez" - tech/AI discussions). For production use, **train your own UMAP model on your chat data**. |
|
|
|
|
|
## About |
|
|
|
|
|
This model demonstrates the UMAP-only training approach for WhatsApp chat clustering. It's trained on ~400 conversation bursts from a tech-focused group chat. |
|
|
|
|
|
### Best For |
|
|
- Demo/example of clustering approach |
|
|
- Comparing topics with similar tech/AI discussion groups |
|
|
- Learning how the pipeline works |
|
|
|
|
|
### NOT Recommended For |
|
|
- Production clustering of your own chats (train on your data instead!) |
|
|
- Different chat domains (family, work, etc.) |
|
|
- Long-term deployment |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
This uses a **UMAP-only training** approach: |
|
|
|
|
|
1. **Embeddings**: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions) |
|
|
2. **UMAP** (TRAINED): Reduces to 15 dimensions with cosine metric |
|
|
3. **HDBSCAN** (FRESH): Clustering on each inference (min_cluster_size=2) |
|
|
4. **c-TF-IDF** (FRESH): Topic extraction with current vocabulary |
|
|
|
|
|
### Why UMAP-Only? |
|
|
|
|
|
Unlike BERTopic's `transform()`, which freezes the vocabulary and produces high noise rates (94.8% in our tests), this approach:
|
|
- Trains UMAP once for consistent embedding space |
|
|
- Re-clusters with fresh HDBSCAN each time |
|
|
- Extracts topics with fresh vocabulary each time |
|
|
|
|
|
This adapts to new vocabulary while maintaining spatial consistency. |
|
|
|
|
|
## Quick Start |
|
|
|
|
|
```python
from huggingface_hub import hf_hub_download
import pickle
import json
from sentence_transformers import SentenceTransformer
import hdbscan
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Download model files (cached after first run)
umap_path = hf_hub_download(
    repo_id="msugimura/vibes-clustering",
    filename="umap_model.pkl"
)
config_path = hf_hub_download(
    repo_id="msugimura/vibes-clustering",
    filename="config.json"
)

# Load the trained UMAP model and its config
with open(umap_path, 'rb') as f:
    umap_model = pickle.load(f)
with open(config_path) as f:
    config = json.load(f)

bert_model = SentenceTransformer(config['bert_model'])

# Your texts
texts = ["your chat messages here"]

# 1. Embed
embeddings = bert_model.encode(texts)

# 2. Transform with the pre-trained UMAP
reduced = umap_model.transform(embeddings)

# 3. Fresh clustering
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=config['recommended_min_cluster_size'],
    metric='euclidean',
    cluster_selection_method='eom'
)
labels = clusterer.fit_predict(reduced)

# 4. Extract topics (see full code in repo)
# ... c-TF-IDF implementation ...
```
|
|
|
|
|
## Training Your Own Model |
|
|
|
|
|
**Recommended for production use:** |
|
|
|
|
|
```python
from sentence_transformers import SentenceTransformer
import umap
import pickle

# Your chat messages
historical_texts = [...]  # First 80% of your timeline

# Embed
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(historical_texts)

# Train UMAP on YOUR data
umap_model = umap.UMAP(
    n_components=15,
    metric='cosine',
    random_state=42
)
umap_model.fit(embeddings)

# Save for future use
with open('my_chat_umap.pkl', 'wb') as f:
    pickle.dump(umap_model, f)
```
|
|
|
|
|
Then use `transform()` on new messages with fresh clustering/topics each time. |
|
|
|
|
|
## Training Details |
|
|
|
|
|
- **Training data**: ~400 conversation bursts (excluding last 2 months for privacy) |
|
|
- **Date range**: June 2024 - November 2024 |
|
|
- **Domain**: Tech/AI discussions (Claude Code, superpowers, coding workflows) |
|
|
- **Typical topics**: agent workflows, coding tools, LLM discussions, infrastructure |
|
|
|
|
|
## Performance |
|
|
|
|
|
On held-out test data (last 2 months): |
|
|
- **8 clusters** identified |
|
|
- **27.8% noise** rate |
|
|
- Clear topic differentiation |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Trained on tech/AI vocabulary - may not generalize to other domains |
|
|
- Small training set (~400 bursts) - owners of larger chats should train their own model
|
|
- English only |
|
|
- Optimized for min_cluster_size=2 (adjust for your density) |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this approach: |
|
|
|
|
|
```bibtex
@misc{vibes-clustering-2024,
  title={UMAP-Only Training for Chat Clustering},
  author={Sugimura, Michael},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/msugimura/vibes-clustering}
}
```
|
|
|
|
|
## Related Work |
|
|
|
|
|
- [BERTopic](https://github.com/MaartenGr/BERTopic) - The full pipeline this UMAP-only approach was extracted from
|
|
- [UMAP](https://github.com/lmcinnes/umap) - Dimensionality reduction |
|
|
- [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan) - Density-based clustering |
|
|
- [sentence-transformers](https://www.sbert.net/) - Text embeddings |
|
|
|
|
|
## License |
|
|
|
|
|
MIT License - Free to use, but train your own model for production! |
|
|
|