odeyaaa committed on
Commit cf49347 · verified · Parent: 373c82c

Upload 10 files

Files changed (10):
  1. README.md +107 -5
  2. app.py +192 -0
  3. env +1 -0
  4. gitattributes +3 -0
  5. model-prep.py +302 -0
  6. model-search.py +21 -0
  7. requirements.txt +9 -0
  8. tfidf_vectorizer.pkl +3 -0
  9. tiktok_knowledge_base.parquet +3 -0
  10. viral_model.json +0 -0
README.md CHANGED
@@ -1,12 +1,114 @@
  ---
- title: Social Assistent
- emoji: 📉
  colorFrom: green
- colorTo: red
  sdk: gradio
- sdk_version: 6.3.0
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: Social Media Virality
+ emoji: 📺
  colorFrom: green
+ colorTo: yellow
  sdk: gradio
+ sdk_version: 5.9.0
  app_file: app.py
  pinned: false
  ---
+ # Social Media Virality Prediction & Optimization Project
+ 
+ **Course**: Data Science & Machine Learning Applications
+ **Project**: Viral Content Assistant
+ 
+ ## 1. Project Overview
+ This project develops a data-driven system that predicts the viral potential of short-form video content (e.g., TikTok) and optimizes it with Generative AI. Using Natural Language Processing (NLP) and Machine Learning (ML), the system analyzes video descriptions and metadata to forecast view counts and prescribes actionable improvements to maximize engagement.
+ 
+ The core solution consists of a machine learning pipeline for virality prediction and a Gradio web application for real-time user interaction.
+ 
+ ## 2. Data Science Methodology
+ 
+ ### 2.1 Data Acquisition & Generation
+ Due to privacy restrictions and API limitations of social platforms, we simulated a realistic dataset reflecting 2025 social media trends.
+ * **Source**: Synthetic generation using the `Faker` library and `numpy` probabilistic distributions.
+ * **Volume**: 10,000 samples.
+ * **Features**:
+     * **Textual**: Video descriptions rich in slang (e.g., "Skibidi", "Girl Dinner"), hashtags, and emojis.
+     * **Temporal**: Upload hour, day of week.
+     * **Meta**: Video duration, category (Gaming, Beauty, etc.).
+ * **Target Variable**: `views` (log-normally distributed to mimic real-world viral discrepancies).
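As an illustration, the view-count generation above can be sketched with `numpy` alone. This is a minimal stand-in for the full generator in `model-prep.py` (which also uses `Faker` to fill in sentence text); the multiplier values mirror the ones used there.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 1_000  # 10,000 in the full pipeline

# Heavy-tailed base view count, then trend/timing multipliers
base = rng.lognormal(mean=9.5, sigma=1.8, size=n_rows)
is_weekend = rng.integers(0, 2, size=n_rows)
duration = rng.integers(5, 180, size=n_rows)

multiplier = np.ones(n_rows)
multiplier *= np.where(is_weekend == 1, 1.2, 1.0)   # weekend boost
multiplier *= np.where(duration < 15, 1.4, 1.0)     # short clips travel further

views = (base * multiplier).astype(int)
log_views = np.log1p(views)  # regression target
```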
+ 
+ ### 2.2 Exploratory Data Analysis (EDA)
+ We analyzed the distribution of the target variable and feature correlations.
+ * **Observation**: View counts follow a power-law-like distribution: most videos receive few views, while a handful of viral hits capture the majority.
+ * **Preprocessing**: We applied a log transformation (`np.log1p`) to the `views` variable to normalize the distribution for the regression models.
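The same transform pair appears at inference time: the model predicts in log space, and `np.expm1` inverts the prediction back into a raw view count. A tiny round-trip sketch:

```python
import numpy as np

views = np.array([120, 3_500, 89_000, 2_400_000])  # heavy-tailed raw counts
log_views = np.log1p(views)        # training target: log(1 + views)
recovered = np.expm1(log_views)    # inference: invert back to a view count
```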
+ 
+ ![Views Distribution](project_plots/eda_distribution.png)
+ *Figure 1: Distribution of log-transformed view counts.*
+ 
+ ### 2.3 Feature Engineering
+ * **Text Embeddings**: We used **TF-IDF Vectorization** (top 2,000 features) to convert unstructured text descriptions into numerical vectors.
+ * **Meta Features**: Encoded `is_weekend`, `hour_of_day`, and `video_duration_sec`.
+ * **Data Splitting**: A **temporal split** (80/20) was used instead of a random split to prevent data leakage, ensuring the model predicts future videos from past trends.
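The feature-engineering steps above can be sketched on a toy corpus (four example descriptions standing in for the 10,000 generated rows; the meta column order matches `model-prep.py`):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "POV: Girl Dinner edition! #fyp #viral",
    "GRWM for the Grimace Shake trend #foryou",
    "Day in the life: NPC Stream #gaming",
    "Trying the viral Skibidi challenge #trending2025",
]
meta = np.array([  # [duration_sec, hour_of_day, is_weekend, hashtag_count]
    [12, 18, 0, 2],
    [45, 20, 1, 1],
    [90,  9, 0, 1],
    [10, 22, 1, 1],
])

tfidf = TfidfVectorizer(max_features=2000, stop_words="english")
X_text = tfidf.fit_transform(descriptions).toarray()
X = np.hstack((X_text, meta))  # text features + meta features

# Temporal split: rows are already time-ordered, so slice instead of shuffling
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
```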
+ 
+ ## 3. Model Development & Evaluation
+ 
+ We evaluated three distinct algorithms for the regression problem (predicting log-views):
+ 
+ 1. **Linear Regression**: Baseline model for interpretability.
+ 2. **Random Forest Regressor**: Ensemble method to capture non-linear relationships.
+ 3. **XGBoost Regressor**: Gradient boosting machine known for state-of-the-art performance on tabular data.
+ 
+ ### 3.1 Comparative Metrics
+ Models were assessed using:
+ * **RMSE (Root Mean Squared Error)**: The primary metric for regression accuracy.
+ * **R² (Coefficient of Determination)**: The share of variance explained by the model.
+ * **F1-Score**: A proxy for classification performance, scoring whether a video is predicted to clear the "viral threshold" (the top 20% of views).
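A minimal sketch of how these three metrics combine on synthetic log-view predictions (illustrative data only, not the project's actual results):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, f1_score

rng = np.random.default_rng(0)
y_true = rng.lognormal(1.5, 0.5, size=200)        # log-views (ground truth)
y_pred = y_true + rng.normal(0, 0.3, size=200)    # a hypothetical model's output

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

# Proxy classification: does the model place a video above the top-20% threshold?
threshold = np.quantile(y_true, 0.80)
f1 = f1_score(y_true > threshold, y_pred > threshold)
```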
+ 
+ ![Model Leaderboard](project_plots/model_leaderboard.png)
+ *Figure 2: Performance comparison across different architectures.*
+ 
+ ### 3.2 Results
+ The **XGBoost Regressor** outperformed the other models, achieving the lowest RMSE on the test set, and was selected for the final deployment.
+ 
+ ## 4. Advanced Analysis: Embeddings & Semantic Search
+ 
+ Beyond simple regression, we implemented a semantic search engine using **SentenceTransformers** (`all-MiniLM-L6-v2`).
+ * **Purpose**: Retrieve historical viral hits conceptually similar to the user's new idea.
+ * **Clustering**: We visualized the semantic space using PCA (Principal Component Analysis).
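The retrieval step reduces to a cosine-similarity ranking over stored embeddings. A self-contained sketch with toy 4-dimensional vectors (the real pipeline produces 384-dimensional vectors via `SentenceTransformer.encode()`):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy embeddings standing in for all-MiniLM-L6-v2 output
kb_embeddings = np.array([
    [0.9, 0.1, 0.0, 0.0],   # "Grimace Shake taste test"
    [0.0, 0.8, 0.2, 0.0],   # "GRWM soft life morning"
    [0.7, 0.2, 0.1, 0.0],   # "Trying the viral shake at midnight"
])
query = np.array([[0.8, 0.1, 0.1, 0.0]])  # user's draft idea, embedded

sims = cosine_similarity(query, kb_embeddings)[0]
top_2 = sims.argsort()[-2:][::-1]  # indices of the 2 closest hits
```

The app uses the same `argsort()[-k:][::-1]` idiom to pull the top-3 matches from the Parquet knowledge base.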
+ 
+ ![Embedding Clusters](project_plots/embedding_clusters.png)
+ *Figure 3: Semantic clustering of video descriptions.*
+ 
+ ## 5. Application & Deployment
+ 
+ The final deliverable is an interactive web application built with **Gradio**.
+ 
+ ### 5.1 System Architecture
+ The system is decoupled into two main components:
+ 1. **Training Pipeline (`model-prep.py`)**: Runs offline to generate synthetic data, train the XGBoost model, and build the vector database, saving the artifacts (`viral_model.json`, `tfidf_vectorizer.pkl`, `tiktok_knowledge_base.parquet`).
+ 2. **Inference App (`app.py`)**: A lightweight Gradio app that loads the pre-trained artifacts to serve real-time predictions without retraining.
+ 
+ **Data Flow**:
+ 1. **Input**: User-provided video description.
+ 2. **Inference**: The loaded XGBoost model predicts a view count.
+ 3. **Retrieval**: The app searches the pre-computed Parquet knowledge base for similar viral videos.
+ 4. **Generative Optimization**: **Google Gemini 2.5 Flash Lite** rewrites the draft.
+ 5. **Output**: Predictions, similar videos, and AI-optimized content.
+ 
+ ### 5.2 Usage Instructions
+ 
+ To run the project locally for assessment:
+ 
+ 1. **Environment Setup**:
+ ```bash
+ python3 -m venv .venv
+ source .venv/bin/activate
+ pip install -r requirements.txt
+ ```
+ 2. **Configuration**:
+ Ensure the `.env` file contains a valid `GEMINI_API_KEY`.
+ 3. **Execution**:
+ ```bash
+ python app.py
+ ```
+ Access the UI at `http://localhost:7860`.
+ 
+ ## 6. Conclusion
+ This project demonstrates a complete end-to-end data science workflow: from synthetic data creation and rigorous model evaluation to the deployment of a user-facing AI application. The integration of predictive analytics (XGBoost) with generative AI (Gemini) provides a robust tool for content creators.
+ 
+ ## 🏆 Credits
+ * **Project Author:** Matan Kriel
+ * **Project Author:** Odeya Shmuel
app.py ADDED
@@ -0,0 +1,192 @@
import os
import pickle

import gradio as gr
import numpy as np
import pandas as pd
import torch
import google.generativeai as genai
from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from xgboost import XGBRegressor  # regressor, matching the model trained in model-prep.py

# Load environment variables (GEMINI_API_KEY)
load_dotenv()

# --- GLOBAL STATE ---
MODEL = None
VECTORIZER = None
KNOWLEDGE_DF = None
ST_MODEL = None

def initialize_app():
    """Loads the pre-computed artifacts (model, vectorizer, knowledge base) on startup."""
    global MODEL, VECTORIZER, KNOWLEDGE_DF, ST_MODEL

    print("⏳ Initializing app: loading pre-computed artifacts...")

    # 1. Load the Parquet knowledge base (produced by model-prep.py)
    parquet_path = 'tiktok_knowledge_base.parquet'
    if not os.path.exists(parquet_path):
        raise FileNotFoundError(f"Required file '{parquet_path}' not found! Run model-prep.py first.")

    print(f"📂 Loading data from {parquet_path}...")
    knowledge_df = pd.read_parquet(parquet_path)

    # 2. Load the XGBoost model
    print("🧠 Loading XGBoost Model...")
    model = XGBRegressor()
    model.load_model("viral_model.json")

    # 3. Load the TF-IDF vectorizer
    print("🔤 Loading TF-IDF Vectorizer...")
    with open("tfidf_vectorizer.pkl", "rb") as f:
        tfidf = pickle.load(f)

    # 4. Load the sentence transformer (MPS on Apple Silicon, else CPU)
    print("🔌 Loading SentenceTransformer...")
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    st_model = SentenceTransformer('all-MiniLM-L6-v2', device=device)

    MODEL = model
    VECTORIZER = tfidf
    KNOWLEDGE_DF = knowledge_df
    ST_MODEL = st_model
    print("✅ App initialized (Inference Mode)!")
def predict_and_optimize(user_input):
    if not user_input:
        return "Please enter a video description.", "", "", "", ""

    # --- 1. INITIAL PREDICTION ---
    text_vec = VECTORIZER.transform([user_input]).toarray()
    # Assume default meta features: 15 s duration, 18:00 upload, weekday (0),
    # hashtag count taken from the input text
    meta_vec = np.array([[15, 18, 0, user_input.count('#')]])
    feat_vec = np.hstack((text_vec, meta_vec))

    initial_log = MODEL.predict(feat_vec)[0]
    initial_views = int(np.expm1(initial_log))

    # --- 2. VECTOR SEARCH ---
    # Restrict the search to viral hits (top 25% of views) in the knowledge base
    high_perf_df = KNOWLEDGE_DF[KNOWLEDGE_DF['views'] > KNOWLEDGE_DF['views'].quantile(0.75)].copy()

    user_embedding = ST_MODEL.encode([user_input], convert_to_numpy=True)
    target_embeddings = np.stack(high_perf_df['embedding'].values)

    similarities = cosine_similarity(user_embedding, target_embeddings)
    top_3_indices = similarities[0].argsort()[-3:][::-1]
    top_3_videos = high_perf_df.iloc[top_3_indices]['description'].tolist()

    similar_videos_str = "\n\n".join([f"{i+1}. {v}" for i, v in enumerate(top_3_videos)])

    # --- 3. GEMINI OPTIMIZATION ---
    api_key = os.getenv("GEMINI_API_KEY")
    if not api_key:
        return f"{initial_views:,}", similar_videos_str, "Error: GEMINI_API_KEY not found.", "N/A", "N/A"

    genai.configure(api_key=api_key)
    # Prefer the newer model; fall back to 1.5 Flash if it is unavailable
    try:
        llm = genai.GenerativeModel('gemini-2.5-flash-lite')
    except Exception:
        llm = genai.GenerativeModel('gemini-1.5-flash')

    prompt = f"""
    You are a TikTok Virality Expert.

    My Draft Description: "{user_input}"

    Here are 3 successful, viral videos that are similar to my topic:
    1. {top_3_videos[0]}
    2. {top_3_videos[1]}
    3. {top_3_videos[2]}

    Task: Rewrite my draft description to make it go viral.
    Use the slang, hashtag style, and structure of the successful examples provided.
    Keep it under 60 words plus hashtags. Return ONLY the new description.
    """

    try:
        response = llm.generate_content(prompt)
        improved_idea = response.text.strip()

        # --- 4. RE-SCORING ---
        new_text_vec = VECTORIZER.transform([improved_idea]).toarray()
        new_meta_vec = np.array([[15, 18, 0, improved_idea.count('#')]])
        new_feat_vec = np.hstack((new_text_vec, new_meta_vec))

        new_log = MODEL.predict(new_feat_vec)[0]
        new_views = int(np.expm1(new_log))

        uplift_pct = ((new_views - initial_views) / max(initial_views, 1)) * 100
        uplift_str = f"+{uplift_pct:.1f}%" if uplift_pct > 0 else "No significant uplift"

        return f"{initial_views:,}", similar_videos_str, improved_idea, f"{new_views:,}", uplift_str

    except Exception as e:
        return f"{initial_views:,}", similar_videos_str, f"Error calling AI: {str(e)}", "N/A", "N/A"
# --- GRADIO UI ---
with gr.Blocks(theme=gr.themes.Soft()) as demo:
    gr.Markdown("# 🚀 Viral Content Optimizer")
    gr.Markdown("Enter your video idea to predict its views and get AI-powered optimizations based on 2025 trends.")

    with gr.Row():
        with gr.Column(scale=1):
            input_text = gr.Textbox(
                label="Your Video Description",
                placeholder="e.g., POV: trying the new grimace shake #viral",
                lines=3
            )
            with gr.Row():
                submit_btn = gr.Button("Analyze & Optimize ⚡", variant="primary")
                demo_btn = gr.Button("🎲 Try Demo", variant="secondary")

        with gr.Column(scale=1):
            with gr.Group():
                gr.Markdown("### 📊 Predictions")
                initial_views = gr.Textbox(label="Predicted Views (Original)", interactive=False)

            with gr.Group():
                gr.Markdown("### ✨ AI Optimization")
                improved_text = gr.Textbox(label="Improved Description", interactive=False)
                with gr.Row():
                    new_views = gr.Textbox(label="New Predicted Views", interactive=False)
                    uplift = gr.Textbox(label="Potential Uplift", interactive=False)

    with gr.Accordion("🔍 Similar Viral Videos (Reference)", open=False):
        similar_videos = gr.Textbox(label="Top 3 Context Matches", interactive=False, lines=5)

    submit_btn.click(
        fn=predict_and_optimize,
        inputs=[input_text],
        outputs=[initial_views, similar_videos, improved_text, new_views, uplift]
    )

    # Demo button: 1. fill the textbox -> 2. run the prediction
    demo_text = "POV: You realize you forgot to turn off your mic during the all-hands meeting 💀 #fail #fyp #corporate"
    demo_btn.click(
        fn=lambda: demo_text,
        inputs=None,
        outputs=input_text
    ).then(
        fn=predict_and_optimize,
        inputs=gr.State(demo_text),  # pass the text directly to avoid racing the UI update
        outputs=[initial_views, similar_videos, improved_text, new_views, uplift]
    )

# Run initialization, then launch the app
if __name__ == "__main__":
    initialize_app()
    demo.launch(server_name="0.0.0.0", server_port=7860)
env ADDED
@@ -0,0 +1 @@
GEMINI_API_KEY=AIzaSyDv0m8cjeMuN5ue_VtSz9sMiQfsJ_GpvKI
gitattributes ADDED
@@ -0,0 +1,3 @@
*.png filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
model-prep.py ADDED
@@ -0,0 +1,302 @@
import os
import pickle
import warnings
from datetime import datetime, timedelta

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import torch
import google.generativeai as genai
from dotenv import load_dotenv
from faker import Faker
from sklearn.metrics.pairwise import cosine_similarity

# Load environment variables from the .env file
load_dotenv()

# Machine Learning Imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, f1_score
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

# ---------------------------------------------------------
# 0. SETUP & CONFIGURATION
# ---------------------------------------------------------
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

# OPTIMIZATION: use Apple Silicon (MPS) when available
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"🚀 Optimization: Running on {device.upper()} device")

if not os.path.exists('project_plots'):
    os.makedirs('project_plots')

# ---------------------------------------------------------
# 1. DATA GENERATION (with 2025 trends)
# ---------------------------------------------------------
def generate_enhanced_data(n_rows=10000):
    print(f"\n[1/8] Generating {n_rows} rows of simulated 2025 data...")
    fake = Faker()

    trends = [
        'Delulu', 'Girl Dinner', 'Roman Empire', 'Silent Slay', 'Soft Life',
        'Grimace Shake', 'Wes Anderson Style', 'Beige Flag', 'Canon Event',
        'NPC Stream', 'Skibidi', 'Fanum Tax', 'Yapping', 'Glow Up', 'Fit Check'
    ]
    formats = [
        'POV: You realize...', 'GRWM for...', 'Day in the life:',
        'Storytime:', 'Trying the viral...', 'ASMR packing orders',
        'Rating my exes...', 'Turn the lights off challenge'
    ]
    categories = ['Gaming', 'Beauty', 'Comedy', 'Edutainment', 'Lifestyle', 'Food']

    data = []
    start_date = datetime(2024, 1, 1)

    for _ in range(n_rows):
        upload_time = start_date + timedelta(days=np.random.randint(0, 365), hours=np.random.randint(0, 23))
        trend = np.random.choice(trends)
        fmt = np.random.choice(formats)
        cat = np.random.choice(categories)

        description = f"{fmt} {trend} edition! {fake.sentence(nb_words=6)}"
        tags = ['#fyp', '#foryou', '#viral', f'#{trend.replace(" ", "").lower()}', f'#{cat.lower()}']
        if np.random.random() > 0.5:
            tags.append('#trending2025')

        full_text = f"{description} {' '.join(tags)}"

        # Meta features
        duration = np.random.randint(5, 180)
        hour = upload_time.hour
        is_weekend = 1 if upload_time.weekday() >= 5 else 0

        # View-count logic: heavy-tailed base draw plus trend/timing multipliers
        base_virality = np.random.lognormal(mean=9.5, sigma=1.8)
        multiplier = 1.0
        if is_weekend:
            multiplier *= 1.2
        if duration < 15:
            multiplier *= 1.4
        if "Delulu" in full_text or "POV" in full_text:
            multiplier *= 1.6
        if hour >= 18:
            multiplier *= 1.1

        views = int(base_virality * multiplier)

        data.append({
            'upload_date': upload_time,
            'description': full_text,
            'category': cat,
            'video_duration_sec': duration,
            'hour_of_day': hour,
            'is_weekend': is_weekend,
            'hashtag_count': len(tags),
            'views': views
        })

    df = pd.DataFrame(data)
    df = df.sort_values('upload_date').reset_index(drop=True)
    threshold = df['views'].quantile(0.80)
    df['is_viral_binary'] = (df['views'] > threshold).astype(int)
    df['log_views'] = np.log1p(df['views'])

    return df, threshold

# ---------------------------------------------------------
# 2. EDA & PREPROCESSING
# ---------------------------------------------------------
def process_data_pipeline(df):
    print("\n[2/8] Processing Data Pipeline...")

    # Quick EDA plot of the log-views distribution
    clean_df = df[df['video_duration_sec'] > 0].copy()
    plt.figure(figsize=(6, 4))
    sns.histplot(clean_df['log_views'], color='teal')
    plt.title('Log Views Distribution')
    plt.savefig('project_plots/eda_distribution.png')
    plt.close()

    # TF-IDF features + temporal split
    tfidf = TfidfVectorizer(max_features=2000, stop_words='english')
    X_text = tfidf.fit_transform(df['description']).toarray()

    num_cols = ['video_duration_sec', 'hour_of_day', 'is_weekend', 'hashtag_count']
    X_num = df[num_cols].values

    X = np.hstack((X_text, X_num))
    y = df['log_views'].values
    y_bin = df['is_viral_binary'].values

    split_idx = int(len(df) * 0.80)
    return X[:split_idx], X[split_idx:], y[:split_idx], y[split_idx:], y_bin[split_idx:], tfidf

# ---------------------------------------------------------
# 3. TRAINING
# ---------------------------------------------------------
def train_best_model(X_train, y_train, X_test, y_test):
    print("\n[3/8] Training Model (XGBoost)...")
    model = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=6, n_jobs=-1)
    model.fit(X_train, y_train)

    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"   - Model RMSE: {rmse:.3f}")
    return model

# ---------------------------------------------------------
# 4. EMBEDDINGS GENERATION (for search)
# ---------------------------------------------------------
def create_search_index(df):
    print("\n[4/8] Creating Vector Search Index...")
    # Embed ALL rows so the whole history is searchable
    st_model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
    embeddings = st_model.encode(df['description'].tolist(), convert_to_numpy=True, show_progress_bar=True)

    df['embedding'] = list(embeddings)

    # Save to Parquet (the knowledge base)
    save_path = 'tiktok_knowledge_base.parquet'
    df.to_parquet(save_path)
    print(f"   - Knowledge Base saved to {save_path}")
    return df, st_model

# ---------------------------------------------------------
# 5. RETRIEVAL & IMPROVEMENT ENGINE
# ---------------------------------------------------------
def optimize_content_with_gemini(user_input, model, vectorizer, knowledge_df, st_model):
    """
    1. Scores the original idea.
    2. Finds the top 3 similar VIRAL videos.
    3. Asks Gemini to rewrite the idea.
    4. Re-scores the new idea.
    """
    print("\n" + "=" * 50)
    print("🚀 VIRAL OPTIMIZATION ENGINE")
    print("=" * 50)

    # --- STEP 1: INITIAL SCORE ---
    text_vec = vectorizer.transform([user_input]).toarray()
    # Assume default meta features for prediction (15 s, 6 PM, weekday)
    meta_vec = np.array([[15, 18, 0, user_input.count('#')]])
    feat_vec = np.hstack((text_vec, meta_vec))

    initial_log = model.predict(feat_vec)[0]
    initial_views = int(np.expm1(initial_log))

    print(f"\n📝 ORIGINAL IDEA: {user_input}")
    print(f"📊 Predicted Views: {initial_views:,}")

    # --- STEP 2: VECTOR SEARCH (find similar successful videos) ---
    print("\n🔍 Searching for similar viral hits in the Parquet knowledge base...")

    # Keep only successful videos (top 25% of views)
    high_performance_df = knowledge_df[knowledge_df['views'] > knowledge_df['views'].quantile(0.75)].copy()

    # Encode the user input
    user_embedding = st_model.encode([user_input], convert_to_numpy=True)

    # Stack stored embeddings into a matrix and rank by cosine similarity
    target_embeddings = np.stack(high_performance_df['embedding'].values)
    similarities = cosine_similarity(user_embedding, target_embeddings)

    # Top 3 matches
    top_3_indices = similarities[0].argsort()[-3:][::-1]
    top_3_videos = high_performance_df.iloc[top_3_indices]['description'].tolist()

    print(" -> Found 3 similar viral videos to learn from:")
    for i, vid in enumerate(top_3_videos, 1):
        print(f"    {i}. {vid[:80]}...")

    # --- STEP 3: GEMINI OPTIMIZATION ---
    api_key = os.getenv("GEMINI_API_KEY")
    if not api_key:
        print("\n⚠️ SKIPPING AI REWRITE: no 'GEMINI_API_KEY' found in environment variables.")
        print("   (Set it via 'export GEMINI_API_KEY=your_key' in the terminal)")
        return

    print("\n🤖 Sending context to Gemini LLM for optimization...")
    genai.configure(api_key=api_key)
    llm = genai.GenerativeModel('gemini-2.5-flash-lite')

    prompt = f"""
    You are a TikTok Virality Expert.

    My Draft Description: "{user_input}"

    Here are 3 successful, viral videos that are similar to my topic:
    1. {top_3_videos[0]}
    2. {top_3_videos[1]}
    3. {top_3_videos[2]}

    Task: Rewrite my draft description to make it go viral.
    Use the slang, hashtag style, and structure of the successful examples provided.
    Keep it under 20 words plus hashtags. Return ONLY the new description.
    """

    try:
        response = llm.generate_content(prompt)
        improved_idea = response.text.strip()

        print(f"\n✨ IMPROVED IDEA (by Gemini): {improved_idea}")

        # --- STEP 4: RE-EVALUATION ---
        new_text_vec = vectorizer.transform([improved_idea]).toarray()
        # Recompute the hashtag count for the rewritten text
        new_meta_vec = np.array([[15, 18, 0, improved_idea.count('#')]])
        new_feat_vec = np.hstack((new_text_vec, new_meta_vec))

        new_log = model.predict(new_feat_vec)[0]
        new_views = int(np.expm1(new_log))

        print(f"📊 New Predicted Views: {new_views:,}")

        improvement = ((new_views - initial_views) / max(initial_views, 1)) * 100
        if improvement > 0:
            print(f"🚀 POTENTIAL UPLIFT: +{improvement:.1f}%")
        else:
            print("😐 No significant uplift predicted (the model is strict!).")

    except Exception as e:
        print(f"❌ Error calling Gemini API: {e}")

# ---------------------------------------------------------
# MAIN EXECUTION
# ---------------------------------------------------------
if __name__ == "__main__":
    # 1. Pipeline
    df, _ = generate_enhanced_data(10000)
    X_train, X_test, y_train, y_test, _, tfidf = process_data_pipeline(df)

    # 2. Train the prediction model
    best_model = train_best_model(X_train, y_train, X_test, y_test)

    # 3. Create the knowledge base (embeddings)
    knowledge_df, st_model = create_search_index(df)

    # 4. Save artifacts for the app
    print("\n[5/8] Saving Model Artifacts for Production...")
    best_model.save_model("viral_model.json")
    print("   - Model saved to 'viral_model.json'")

    with open("tfidf_vectorizer.pkl", "wb") as f:
        pickle.dump(tfidf, f)
    print("   - Vectorizer saved to 'tfidf_vectorizer.pkl'")

    # 5. User interaction loop
    while True:
        print("\n" + "-" * 30)
        user_input = input("Enter your video idea (or 'q' to quit): ")
        if user_input.lower() == 'q':
            break

        optimize_content_with_gemini(
            user_input=user_input,
            model=best_model,
            vectorizer=tfidf,
            knowledge_df=knowledge_df,
            st_model=st_model
        )
model-search.py ADDED
@@ -0,0 +1,21 @@
import os

import google.generativeai as genai
from dotenv import load_dotenv

# 1. Load the API key
load_dotenv()
api_key = os.getenv("GEMINI_API_KEY")

if not api_key:
    print("Error: API key not found. Make sure it is in your .env file.")
else:
    genai.configure(api_key=api_key)

    print("--- Available Gemini Models ---")
    # 2. List all models that support content generation (text/chat)
    for m in genai.list_models():
        if 'generateContent' in m.supported_generation_methods:
            print(f"Name: {m.name}")
            print(f"  - Display Name: {m.display_name}")
            print(f"  - Input Limit: {m.input_token_limit} tokens")
            print("-" * 30)
requirements.txt ADDED
@@ -0,0 +1,9 @@
gradio>=5.0
pandas
numpy
xgboost
scikit-learn
sentence-transformers
google-generativeai
python-dotenv
faker
tfidf_vectorizer.pkl ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:96a8c9e0af89e51756ab34cd1219571dea1cd2ca9f2558895fcd00250a3a6c8b
size 29989
tiktok_knowledge_base.parquet ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:12a2bf4192ada1bb32d08a87f9e811c28a933c40ae0d4831143f1f1fdccd6579
size 16651296
viral_model.json ADDED
The diff for this file is too large to render. See raw diff