Matan Kriel committed
Commit 2f9170f · 1 Parent(s): 7803d6a

updated clustering metric in model test
README.md CHANGED

```diff
@@ -21,9 +21,9 @@
 the `model-prep.py` script handles the end-to-end model creation process:
 
 1. **Cloud Data Loading**: It fetches the latest synthetic dataset directly from **Hugging Face** (`MatanKriel/social-assitent-synthetic-data`).
-2. **Embedding Benchmark**: It evaluates 3 state-of-the-art models (`MiniLM`, `mpnet-base`, `bge-small`) to find the best text encoder.
-   * *Metrics*: Encoding Speed vs. Clustering Quality (Silhouette Score).
-   * *Winner*: Defaults to `sentence-transformers/all-mpnet-base-v2`.
+2. **Embedding Benchmark**: It evaluates 3 state-of-the-art models (`MiniLM`, `mpnet-base`, `bge-small`) using **Silhouette Score** on **Composite Labels** (`Category_ViralClass`).
+   * *Why?* Instead of just clustering by topic (e.g., "Gaming"), this forces the model to distinguish between "Viral Gaming Videos" and "Average Gaming Videos".
+   * *Selection*: Automatically picks the best model for this high-resolution task.
 3. **Feature Engineering**:
    * Encodes categorical inputs: `category`, `gender`, `day_of_week`, `age`.
    * Combines text embeddings with metadata (`followers`, `duration`, `hour`).
```
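The composite-label idea in the README can be sketched in plain Python. This is an illustrative toy, not the pipeline code: the sample `views` and `categories` values are invented, and `statistics.quantiles` only approximates the pandas `quantile(0.75)` threshold that `model-prep.py` actually uses.

```python
# Minimal sketch (illustrative, not pipeline code): build the
# "Category_ViralClass" composite labels that the benchmark clusters on.
from statistics import quantiles

views = [100, 200, 300, 1000, 5000, 80, 60, 40]        # toy data
categories = ["Gaming"] * 4 + ["Fitness"] * 4          # toy data

# 75th-percentile cut, mirroring the quantile(0.75) threshold in model-prep.py
# (statistics.quantiles interpolates slightly differently than pandas)
q3 = quantiles(views, n=4)[2]
viral_class = ["High" if v > q3 else "Low" for v in views]

# Composite label: topic AND performance tier count as one cluster identity,
# so "Gaming_High" and "Gaming_Low" are treated as different clusters.
composite = [f"{c}_{v}" for c, v in zip(categories, viral_class)]
```

With these toy numbers only the 1000- and 5000-view rows clear the 75th-percentile cut, so the same "Gaming" topic splits into `Gaming_High` and `Gaming_Low` labels.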
app.py CHANGED

```diff
@@ -70,7 +70,7 @@ def initialize_app():
 
     # 4. Load SentenceTransformer
     print("🔌 Loading SentenceTransformer...")
-    embedding_model_name = "sentence-transformers/all-mpnet-base-v2"
+    embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
    print(f" -> Model: {embedding_model_name}")
 
     import torch
```
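Since the model name is hard-coded in `app.py`, one way to keep future swaps cheap is an environment override. This is a hedged sketch only: the `EMBEDDING_MODEL` variable name and `resolve_embedding_model` helper are hypothetical, not part of the repo; the default simply matches this commit's choice.

```python
import os

# Hypothetical override hook (NOT in app.py): lets a deployment pick the
# encoder without editing code. Default matches this commit's selection.
DEFAULT_EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

def resolve_embedding_model() -> str:
    # "EMBEDDING_MODEL" is an assumed env-var name for this sketch
    return os.environ.get("EMBEDDING_MODEL", DEFAULT_EMBEDDING_MODEL)
```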
model-prep.py CHANGED

```diff
@@ -68,16 +68,38 @@ def benchmark_and_select_model(df):
 
     results = []
 
-    if 'category' not in df.columns:
-        print("⚠️ No 'category' column. Skipping quality metric.")
-        labels = np.zeros(len(df))
+    # Create composite labels for the Silhouette Score.
+    # Goal: use "Category_ViralClass" (e.g., "Fitness_High") to measure separation.
+
+    # 1. Ensure viral_class exists for benchmarking
+    if 'viral_class' not in df.columns and 'views' in df.columns:
+        threshold = df['views'].quantile(0.75)
+        df['viral_class'] = np.where(df['views'] > threshold, 'High', 'Low')
+        print(" -> ℹ️ Created temporary 'viral_class' (High/Low) for benchmarking.")
+
+    # 2. Define labels
+    if 'category' in df.columns and 'viral_class' in df.columns:
+        print(" -> 🏷️ Using Composite Labels (Category + Viral Class) for metrics.")
+        pass  # Labels are built on the sample below, after sampling.
+    elif 'category' in df.columns:
+        print(" -> ⚠️ 'viral_class' missing. Falling back to 'category' only.")
     else:
-        labels = df['category'].values
+        print(" -> ⚠️ No categories found. Skipping quality metric.")
 
-    # Sample for speed
+    # Sample for speed (using the updated df, which may now have viral_class)
     sample_df = df.sample(min(len(df), 3000), random_state=42)
     sample_texts = sample_df['description'].fillna("").tolist()
-    sample_labels = sample_df['category'].values
+
+    if 'category' in sample_df.columns and 'viral_class' in sample_df.columns:
+        # Composite label formula
+        sample_labels = sample_df['category'].astype(str) + "_" + sample_df['viral_class'].astype(str)
+        sample_labels = sample_labels.values
+    elif 'category' in sample_df.columns:
+        sample_labels = sample_df['category'].values
+    else:
+        sample_labels = np.zeros(len(sample_df))
 
     print(f"{'Model':<40} | {'Time (s)':<10} | {'Silhouette':<10}")
     print("-" * 65)
```
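To make the benchmark's quality metric concrete, here is a tiny pure-Python silhouette computation on toy 1-D "points" standing in for sentence embeddings. It is only a sketch of the metric's definition, not the pipeline's implementation (which scores real high-dimensional embeddings); the toy values and label names are invented.

```python
# Illustrative sketch: mean silhouette coefficient on 1-D points, showing why
# composite labels reward tight, well-separated clusters. Assumes >= 2 clusters.
from collections import defaultdict

def silhouette(points, labels):
    by_label = defaultdict(list)
    for i, lab in enumerate(labels):
        by_label[lab].append(i)
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        same = [abs(p - points[j]) for j in by_label[lab] if j != i]
        if not same:
            continue  # singleton cluster: silhouette undefined for this point
        a = sum(same) / len(same)                       # mean intra-cluster distance
        b = min(                                        # nearest other cluster
            sum(abs(p - points[j]) for j in idxs) / len(idxs)
            for lab2, idxs in by_label.items() if lab2 != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data: "Gaming_High" and "Gaming_Low" form two tight, distant groups,
# so the score approaches the maximum of 1.0.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
labels = ["Gaming_High"] * 3 + ["Gaming_Low"] * 3
```

An encoder whose embeddings separate high-performing posts from average ones within the same topic gets a score near 1; an encoder that mixes them scores near 0 or below, which is exactly what the composite-label benchmark penalizes.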
project_plots/embedding_benchmark.png CHANGED

Git LFS Details (before)
  • SHA256: da6fc1de241e564e73cc87eb57cd8b71d5a84c829e2975e476f937ac28f78b06
  • Pointer size: 130 Bytes
  • Size of remote file: 26.7 kB

Git LFS Details (after)
  • SHA256: 5da26c856671fcedc44d92d1de23252ea6c37ad159a0db9d38384cb795d0a027
  • Pointer size: 130 Bytes
  • Size of remote file: 29.8 kB
project_plots/regression_comparison.png CHANGED

Git LFS Details (before)
  • SHA256: d5c0e245d4cdcbc3124c2ab43c89c956099acceacaae041d889c7b2b5627d01a
  • Pointer size: 130 Bytes
  • Size of remote file: 27.4 kB

Git LFS Details (after)
  • SHA256: 4fcee525946abe3cec8bec7416efd6d0345fdd980f635b945c27627a362f5d02
  • Pointer size: 130 Bytes
  • Size of remote file: 28.3 kB
viral_model.pkl CHANGED

```diff
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:4dc2e1d1bfe66f8970d7f2770e10f70ad426f78d7493ac4d0250f18ef878e9d7
-size 327349
+oid sha256:fede1adc51df6bc4148b7fdd758625ea3bfd17bbea07f435b0416acfef28c9e9
+size 337752
```