---
license: mit
tags:
- clustering
- topic-modeling
- umap
- hdbscan
- whatsapp-analysis
datasets:
- private
language:
- en
pipeline_tag: feature-extraction
---

# Vibes Chat Clustering Model

> ⚠️ **Note**: This is a demo model trained on a specific WhatsApp group ("The vibez" - tech/AI discussions). For production use, **train your own UMAP model on your chat data**.

## About

This model demonstrates the UMAP-only training approach for WhatsApp chat clustering. It's trained on ~400 conversation bursts from a tech-focused group chat.

### Best For
- Demo/example of clustering approach
- Comparing topics across similar tech/AI discussion groups
- Learning how the pipeline works

### NOT Recommended For
- Production clustering of your own chats (train on your data instead!)
- Different chat domains (family, work, etc.)
- Long-term deployment

## Model Architecture

This uses a **UMAP-only training** approach:

1. **Embeddings**: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
2. **UMAP** (TRAINED): Reduces to 15 dimensions with cosine metric
3. **HDBSCAN** (FRESH): Clustering on each inference (min_cluster_size=2)
4. **c-TF-IDF** (FRESH): Topic extraction with current vocabulary

### Why UMAP-Only?

Unlike BERTopic's `transform()`, which freezes the vocabulary and produced high noise rates in our tests (94.8%), this approach:
- Trains UMAP once for consistent embedding space
- Re-clusters with fresh HDBSCAN each time
- Extracts topics with fresh vocabulary each time

This adapts to new vocabulary while maintaining spatial consistency.
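The vocabulary-freezing point can be illustrated with scikit-learn's `CountVectorizer` (an illustrative analogy, not this repo's code; the example messages are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

old_msgs = ["we tried langchain agents today"]  # training-time messages
new_msgs = ["anyone benchmarked deepseek yet"]  # later messages, new slang

# A vectorizer fitted once and then frozen (roughly what transform() does):
frozen = CountVectorizer().fit(old_msgs)
print(frozen.transform(new_msgs).sum())  # 0 -- every new term is silently dropped

# Refitting on the current batch keeps the new vocabulary:
fresh = CountVectorizer().fit(new_msgs)
print(len(fresh.vocabulary_))  # 4 -- all new terms retained
```

Fresh HDBSCAN and c-TF-IDF passes give the same benefit at the clustering and topic-extraction stages.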

## Quick Start

```python
from huggingface_hub import hf_hub_download
import pickle
import json
from sentence_transformers import SentenceTransformer
import hdbscan
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Download model (cached after first run)
umap_path = hf_hub_download(
    repo_id="msugimura/vibes-clustering",
    filename="umap_model.pkl"
)
config_path = hf_hub_download(
    repo_id="msugimura/vibes-clustering",
    filename="config.json"
)

# Load model
with open(umap_path, 'rb') as f:
    umap_model = pickle.load(f)
with open(config_path) as f:
    config = json.load(f)

bert_model = SentenceTransformer(config['bert_model'])

# Your texts
texts = ["your chat messages here"]

# 1. Embed
embeddings = bert_model.encode(texts)

# 2. Transform with pre-trained UMAP
reduced = umap_model.transform(embeddings)

# 3. Fresh clustering
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=config['recommended_min_cluster_size'],
    metric='euclidean',
    cluster_selection_method='eom'
)
labels = clusterer.fit_predict(reduced)

# 4. Extract topics (see full code in repo)
# ... c-TF-IDF implementation ...
```

## Training Your Own Model

**Recommended for production use:**

```python
from sentence_transformers import SentenceTransformer
import umap

# Your chat messages
historical_texts = [...]  # First 80% of your timeline

# Embed
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(historical_texts)

# Train UMAP on YOUR data
umap_model = umap.UMAP(
    n_components=15,
    metric='cosine',
    random_state=42
)
umap_model.fit(embeddings)

# Save for future use
import pickle
with open('my_chat_umap.pkl', 'wb') as f:
    pickle.dump(umap_model, f)
```

Then use `transform()` on new messages with fresh clustering/topics each time.

## Training Details

- **Training data**: ~400 conversation bursts (excluding last 2 months for privacy)
- **Date range**: June 2024 - November 2024
- **Domain**: Tech/AI discussions (Claude Code, superpowers, coding workflows)
- **Typical topics**: agent workflows, coding tools, LLM discussions, infrastructure

## Performance

On held-out test data (last 2 months):
- **8 clusters** identified
- **27.8% noise** rate
- Clear topic differentiation

## Limitations

- Trained on tech/AI vocabulary - may not generalize to other domains
- Small training set (400 bursts) - larger chats should train their own
- English only
- Optimized for min_cluster_size=2 (adjust for your density)

## Citation

If you use this approach:

```bibtex
@misc{vibes-clustering-2024,
  title={UMAP-Only Training for Chat Clustering},
  author={Sugimura, Michael},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/msugimura/vibes-clustering}
}
```

## Related Work

- [BERTopic](https://github.com/MaartenGr/BERTopic) - Full topic-modeling pipeline (this model isolates its UMAP-training step)
- [UMAP](https://github.com/lmcinnes/umap) - Dimensionality reduction
- [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan) - Density-based clustering
- [sentence-transformers](https://www.sbert.net/) - Text embeddings

## License

MIT License - Free to use, but train your own model for production!