msugimura committed
Commit e976c9c · verified · 1 Parent(s): 1195aa2

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +93 -0
  2. config.json +12 -0
  3. umap_model.pkl +3 -0
README.md ADDED
@@ -0,0 +1,93 @@
# Vibes Chat Clustering Model

This model is trained on WhatsApp chat data from "The vibez" group.

## Model Architecture

- **Embeddings**: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
- **UMAP**: Reduces embeddings to 15 dimensions using the cosine metric
- **Clustering**: Fresh HDBSCAN run on each inference (min_cluster_size=2 recommended)
- **Topics**: Fresh c-TF-IDF vocabulary extraction on each inference

## Usage

```python
import pickle
import json
from sentence_transformers import SentenceTransformer
import hdbscan
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Load the frozen UMAP model and the config
with open('umap_model.pkl', 'rb') as f:
    umap_model = pickle.load(f)

with open('config.json') as f:
    config = json.load(f)

# Load the sentence-embedding model
bert_model = SentenceTransformer(config['bert_model'])

# Your texts to cluster
texts = ["your", "messages", "here"]

# 1. Embed with BERT
embeddings = bert_model.encode(texts)

# 2. Project into the frozen 15-dimensional UMAP space
reduced = umap_model.transform(embeddings)

# 3. Cluster with HDBSCAN (fresh clustering on every inference)
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=config['recommended_min_cluster_size'],
    metric='euclidean',
    cluster_selection_method='eom'
)
labels = clusterer.fit_predict(reduced)

# 4. Extract topics with c-TF-IDF (fresh vocabulary)
# Group texts by cluster, skipping noise points (label -1)
cluster_docs = {}
for text, label in zip(texts, labels):
    if label != -1:
        cluster_docs.setdefault(label, []).append(text)

# Concatenate each cluster's documents into one pseudo-document
cluster_texts = [" ".join(cluster_docs[cid]) for cid in sorted(cluster_docs.keys())]

# Vectorize the per-cluster pseudo-documents
vectorizer = CountVectorizer(
    stop_words="english",
    min_df=max(1, int(len(cluster_texts) * config['min_df_percent'])),
    max_df=config['max_df'],
    ngram_range=tuple(config['ngram_range'])
)
tf = vectorizer.fit_transform(cluster_texts)

# c-TF-IDF: weight term frequencies by a smoothed inverse cluster frequency
n_clusters = len(cluster_texts)
df = np.array((tf > 0).sum(axis=0)).flatten()
idf = np.log(n_clusters / (1 + df))
ctfidf = tf.multiply(idf).toarray()

# Print the top 10 words per cluster
words = vectorizer.get_feature_names_out()
for i, cid in enumerate(sorted(cluster_docs.keys())):
    top_indices = ctfidf[i].argsort()[-10:][::-1]
    top_words = [words[j] for j in top_indices]
    print(f"Topic {cid}: {', '.join(top_words)}")
```
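The c-TF-IDF weighting in step 4 can be sanity-checked on a toy term-frequency matrix without any of the model files. The counts below are made up for illustration; only numpy is needed:

```python
import numpy as np

# Toy term-frequency matrix: 3 clusters x 4 vocabulary terms.
# Term 0 appears in every cluster; term 3 only in cluster 2.
tf = np.array([
    [5, 2, 0, 0],
    [4, 0, 3, 0],
    [6, 0, 0, 7],
])

n_clusters = tf.shape[0]
df = (tf > 0).sum(axis=0)            # clusters containing each term: [3, 1, 1, 1]
idf = np.log(n_clusters / (1 + df))  # same formula as the usage code above
ctfidf = tf * idf

# The cluster-ubiquitous term 0 gets a negative weight, so it sinks in every
# ranking, while cluster-specific terms rise to the top.
top_term = int(ctfidf[2].argmax())
print(top_term)  # 3 — the term unique to cluster 2 dominates its topic
```

Note that terms appearing in most clusters end up with idf at or below zero, which is what pushes generic chat words out of the topic labels.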

## Training Data

Trained on historical WhatsApp chat bursts (excluding the last 2 months).

## Key Insight

This model uses **UMAP-only training**:
- The UMAP projection is frozen (trained once)
- HDBSCAN clustering is refit from scratch on each inference
- The c-TF-IDF vocabulary is rebuilt on each inference

This allows the model to adapt to new vocabulary and topics while maintaining a consistent embedding space.
config.json ADDED
@@ -0,0 +1,12 @@
{
  "bert_model": "all-MiniLM-L6-v2",
  "umap_n_components": 15,
  "umap_metric": "cosine",
  "recommended_min_cluster_size": 2,
  "min_df_percent": 0.05,
  "max_df": 0.85,
  "ngram_range": [1, 2]
}
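A quick stdlib-only check that a downloaded config.json carries the fields the usage code reads. The key names match the file above; the `validate_config` helper itself is illustrative, not part of the repo:

```python
import json

# Keys the usage code reads, with the types it expects (assumed from the file above)
REQUIRED_KEYS = {
    "bert_model": str,
    "umap_n_components": int,
    "umap_metric": str,
    "recommended_min_cluster_size": int,
    "min_df_percent": float,
    "max_df": float,
    "ngram_range": list,
}

def validate_config(raw: str) -> dict:
    """Parse a config.json string and verify the keys the usage code depends on."""
    config = json.loads(raw)
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            raise KeyError(f"config.json is missing '{key}'")
        if not isinstance(config[key], expected_type):
            raise TypeError(f"'{key}' should be a {expected_type.__name__}")
    return config

# The config shown above, inlined so the check is self-contained
sample = """{
  "bert_model": "all-MiniLM-L6-v2",
  "umap_n_components": 15,
  "umap_metric": "cosine",
  "recommended_min_cluster_size": 2,
  "min_df_percent": 0.05,
  "max_df": 0.85,
  "ngram_range": [1, 2]
}"""

config = validate_config(sample)
print(tuple(config["ngram_range"]))  # (1, 2)
```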
umap_model.pkl ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:67713963c33d3b1e1c3a8887c5d9e1e8e4c5b8cdc93dbe430147021965cc1672
size 727443