minhajHP committed
Commit d32ca60 · 1 Parent(s): e69bfae

Major codebase cleanup and feature additions


New Features:
- Add dual API implementations (api_2phase.py, api_joint.py)
- Add 2-phase and joint training scripts
- Add enhanced recommendation engine with 128D embeddings
- Add improved joint training with curriculum learning
- Add enhanced two-tower model architecture
- Add optimized dataset creator

Enhancements:
- Enhance main API with new endpoints and features
- Improve frontend UI with advanced styling and components
- Upgrade inference engines with better recommendation logic
- Enhance data preprocessing with categorical demographics
- Improve training modules with better optimization

Cleanup:
- Remove debug, analysis, and documentation files
- Update gitignore with ML-specific patterns
- Remove backup directories and temporary files
- Streamline codebase structure

Files: 25 changed, 5843 insertions, 829 deletions

.gitignore CHANGED
@@ -193,4 +193,32 @@ data/
 .Spotlight-V100
 .Trashes
 ehthumbs.db
-Thumbs.db
+Thumbs.db
+
+# Analysis and debug files
+debug_*.py
+test_*.py
+analyze_enhanced_*.py
+analyze_recommendation_quality.py
+*_analysis_*
+*_report.*
+simple_*.py
+
+# Keep analyze_recommendations.py - it's wanted
+!analyze_recommendations.py
+
+# Backup directories
+*_backup/
+*_bak/
+artifacts_backup/
+
+# Temporary files
+*.tmp
+*.temp
+*.cache
+
+# Configuration files with sensitive data
+config.json
+secrets.json
+.secret
+.key
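The new ignore rules rely on pattern order: `analyze_recommendation_quality.py` is listed explicitly as ignored, while the later `!analyze_recommendations.py` negation keeps that one script tracked. A rough sanity check of this logic, sketched with Python's `fnmatch` (a simplification of git's matching: real `.gitignore` handling also covers directories and anchoring):

```python
from fnmatch import fnmatch

# Subset of the ignore patterns added in this commit, in file order.
ignore = ["debug_*.py", "test_*.py", "analyze_enhanced_*.py",
          "analyze_recommendation_quality.py", "simple_*.py"]
# Negation rule appearing later in the file, so it overrides the ignores.
unignore = ["analyze_recommendations.py"]

def is_ignored(name):
    """Approximate .gitignore semantics: a later `!` rule un-ignores a file."""
    ignored = any(fnmatch(name, p) for p in ignore)
    if any(fnmatch(name, p) for p in unignore):
        ignored = False
    return ignored

print(is_ignored("analyze_recommendation_quality.py"))  # True: explicitly ignored
print(is_ignored("analyze_recommendations.py"))         # False: re-included
```

`git check-ignore -v <path>` would give the authoritative answer against the actual repository.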
CATEGORICAL_DEMOGRAPHICS_SUMMARY.md DELETED
@@ -1,113 +0,0 @@
-# Categorical Demographics Implementation Summary
-
-## ✅ **IMPLEMENTATION COMPLETE**
-
-Successfully converted age and income from continuous normalized features to categorical embeddings, achieving the goal of reducing demographics to 25% of total input dimensions.
-
----
-
-## 🎯 **Key Changes Made**
-
-### **1. Age Categorization (6 Categories)**
-- **Teen (0)**: Under 18
-- **Young Adult (1)**: 18-25
-- **Adult (2)**: 26-35
-- **Middle Age (3)**: 36-50
-- **Mature (4)**: 51-65
-- **Senior (5)**: 65+
-
-### **2. Income Categorization (5 Categories)**
-- **Low Income (0)**: Bottom 20% (≤$56,276)
-- **Lower Middle (1)**: 20-40% ($56,276-$69,236)
-- **Middle (2)**: 40-60% ($69,236-$80,661)
-- **Upper Middle (3)**: 60-80% ($80,661-$94,284)
-- **High Income (4)**: Top 20% (≥$94,284)
-
-### **3. Embedding Dimensions**
-**Original Tower (64D):**
-- Age: 4D, Income: 4D, Gender: 4D
-- **Total Demographics**: 12D (18.8% of input)
-
-**Improved Tower (128D):**
-- Age: 8D, Income: 8D, Gender: 8D
-- **Total Demographics**: 24D (18.8% of input)
-
----
-
-## 📁 **Files Modified**
-
-### **Data Preparation**
-- `src/preprocessing/user_data_preparation.py`
-  - Added `categorize_age()` and `categorize_income()` functions
-  - Updated `prepare_user_features()` to output categorical features
-
-### **Model Architecture**
-- `src/models/user_tower.py`
-  - Replaced normalization layers with embedding layers
-  - Updated forward pass for categorical inputs
-
-- `src/models/improved_two_tower.py`
-  - Same embedding updates as original tower
-  - Maintained sophisticated history aggregation
-
-### **Training Scripts**
-- `src/training/optimized_joint_training.py`
-- `src/training/joint_training.py`
-- `src/training/fast_joint_training.py`
-  - Removed normalization adaptation calls
-
-### **Inference Engine**
-- `src/inference/recommendation_engine.py`
-  - Added categorization functions for real-time inference
-  - Updated `prepare_user_features()` to categorize raw inputs
-  - Added income threshold loading from training data
-
----
-
-## 🔍 **Verification Results**
-
-✅ **All Tests Pass:**
-- Age categorization: 6 categories (0-5) ✅
-- Income categorization: 5 categories (0-4) ✅
-- Training features: Correct int32 dtypes ✅
-- User towers: Proper embedding dimensions ✅
-- Inference engine: Successful categorical conversion ✅
-- Recommendation engines: Working with categorical inputs ✅
-
----
-
-## 📊 **Benefits Achieved**
-
-### **1. Balanced Feature Representation**
-- **Before**: Demographics 75% (96D), History 25% (32D)
-- **After**: Demographics 19% (24D), History 81% (104D)
-
-### **2. Better Learning Patterns**
-- **Interpretable segments**: Clear demographic groups vs continuous values
-- **Non-linear relationships**: Each category learns distinct behaviors
-- **Reduced bias**: Less dependence on exact age/income values
-- **Better generalization**: Discrete categories vs continuous normalization
-
-### **3. Improved Model Architecture**
-- **Smaller demographics footprint**: More capacity for behavioral signals
-- **Category-specific patterns**: Age/income groups with unique preferences
-- **Embedding benefits**: Learned representations vs fixed normalization
-
----
-
-## 🚀 **Ready for Training**
-
-The categorical demographics implementation is complete and verified. The system now:
-
-1. **Prioritizes behavioral signals** (81%) over demographics (19%)
-2. **Uses interpretable demographic segments** instead of continuous values
-3. **Maintains all existing functionality** with enhanced representation
-4. **Is ready for improved model training** with better feature balance
-
-To train the improved model with categorical demographics:
-
-```bash
-python train_improved_model.py --embedding-dim 128 --epochs-per-stage 15
-```
-
-The enhanced recommendation system should now achieve better personalization through balanced feature representation and categorical demographic learning.
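The age and income bucketing described in the deleted summary can be sketched as follows. The function names `categorize_age` and `categorize_income` come from the summary; the income thresholds are the training-set quintiles it quotes, whereas the real pipeline loads them from training data:

```python
def categorize_age(age):
    """Map raw age to one of 6 categories (0-5), per the summary's buckets."""
    if age < 18:
        return 0  # Teen
    if age <= 25:
        return 1  # Young Adult
    if age <= 35:
        return 2  # Adult
    if age <= 50:
        return 3  # Middle Age
    if age <= 65:
        return 4  # Mature
    return 5      # Senior

def categorize_income(income):
    """Map raw income to one of 5 quintile categories (0-4).

    Thresholds are the quintile boundaries quoted in the summary
    ($56,276 / $69,236 / $80,661 / $94,284).
    """
    thresholds = [56276, 69236, 80661, 94284]
    for cat, t in enumerate(thresholds):
        if income <= t:
            return cat
    return 4  # High Income: top 20%

print(categorize_age(30), categorize_income(75000))  # 2 2
```

With 6 age, 5 income, and a handful of gender categories, each feature becomes an integer index feeding an embedding layer instead of a normalized float.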
analyze_recommendation_quality.py DELETED
@@ -1,556 +0,0 @@
-#!/usr/bin/env python3
-"""
-Comprehensive analysis of recommendation quality from the two-tower model.
-"""
-
-import sys
-import os
-import numpy as np
-import pandas as pd
-from collections import Counter, defaultdict
-import time
-
-sys.path.append('/home/user/Desktop/RecSys-HP')
-from src.inference.recommendation_engine import RecommendationEngine
-from src.utils.real_user_selector import RealUserSelector
-
-def analyze_score_distribution():
-    """Analyze the distribution of recommendation scores."""
-
-    print("📊 SCORE DISTRIBUTION ANALYSIS")
-    print("="*50)
-
-    try:
-        engine = RecommendationEngine()
-        real_user_selector = RealUserSelector()
-
-        # Get multiple users for comprehensive analysis
-        test_users = real_user_selector.get_real_users(n=10, min_interactions=10)
-
-        all_scores = {
-            'collaborative': [],
-            'hybrid': [],
-            'content': []
-        }
-
-        print(f"Testing with {len(test_users)} users...")
-
-        for i, user in enumerate(test_users):
-            print(f"\nUser {i+1}/10 - {user['user_id']} ({user['age']}yr {user['gender']}):")
-
-            # Test collaborative filtering
-            try:
-                collab_recs = engine.recommend_items_collaborative(
-                    age=user['age'],
-                    gender=user['gender'],
-                    income=user['income'],
-                    interaction_history=user['interaction_history'][:20],
-                    k=20
-                )
-                collab_scores = [score for _, score, _ in collab_recs]
-                all_scores['collaborative'].extend(collab_scores)
-
-                print(f"   Collaborative: {min(collab_scores):.4f} - {max(collab_scores):.4f} (std: {np.std(collab_scores):.4f})")
-
-            except Exception as e:
-                print(f"   Collaborative failed: {e}")
-
-            # Test hybrid
-            try:
-                hybrid_recs = engine.recommend_items_hybrid(
-                    age=user['age'],
-                    gender=user['gender'],
-                    income=user['income'],
-                    interaction_history=user['interaction_history'][:20],
-                    k=20,
-                    collaborative_weight=0.7
-                )
-                hybrid_scores = [score for _, score, _ in hybrid_recs]
-                all_scores['hybrid'].extend(hybrid_scores)
-
-                print(f"   Hybrid: {min(hybrid_scores):.4f} - {max(hybrid_scores):.4f} (std: {np.std(hybrid_scores):.4f})")
-
-            except Exception as e:
-                print(f"   Hybrid failed: {e}")
-
-            # Test content-based (if user has history)
-            if user['interaction_history']:
-                try:
-                    content_recs = engine.recommend_items_content_based(
-                        seed_item_id=user['interaction_history'][0],
-                        k=20
-                    )
-                    content_scores = [score for _, score, _ in content_recs]
-                    all_scores['content'].extend(content_scores)
-
-                    print(f"   Content: {min(content_scores):.4f} - {max(content_scores):.4f} (std: {np.std(content_scores):.4f})")
-
-                except Exception as e:
-                    print(f"   Content failed: {e}")
-
-        # Overall score analysis
-        print(f"\n📈 OVERALL SCORE STATISTICS:")
-        for method, scores in all_scores.items():
-            if scores:
-                print(f"\n{method.upper()}:")
-                print(f"   Total scores: {len(scores)}")
-                print(f"   Range: {min(scores):.4f} - {max(scores):.4f}")
-                print(f"   Mean: {np.mean(scores):.4f}")
-                print(f"   Std: {np.std(scores):.4f}")
-                print(f"   Variance: {np.var(scores):.6f}")
-
-                # Score distribution percentiles
-                percentiles = [10, 25, 50, 75, 90]
-                perc_values = np.percentile(scores, percentiles)
-                print(f"   Percentiles: {dict(zip(percentiles, perc_values))}")
-
-                # Quality assessment
-                score_range = max(scores) - min(scores)
-                if score_range < 0.1:
-                    print(f"   ⚠️ WARNING: Low score range ({score_range:.4f}) - poor discrimination")
-                elif score_range < 0.3:
-                    print(f"   ⚠️ CAUTION: Moderate score range ({score_range:.4f})")
-                else:
-                    print(f"   ✅ GOOD: Wide score range ({score_range:.4f})")
-
-                if np.var(scores) < 0.001:
-                    print(f"   ⚠️ WARNING: Very low variance - poor ranking ability")
-                elif np.var(scores) < 0.01:
-                    print(f"   ⚠️ CAUTION: Low variance")
-                else:
-                    print(f"   ✅ GOOD: Adequate variance for ranking")
-
-        return all_scores
-
-    except Exception as e:
-        print(f"❌ Score analysis failed: {e}")
-        return None
-
-def analyze_category_alignment():
-    """Analyze how well recommendations align with user category preferences."""
-
-    print(f"\n🎯 CATEGORY ALIGNMENT ANALYSIS")
-    print("="*40)
-
-    try:
-        engine = RecommendationEngine()
-        real_user_selector = RealUserSelector()
-
-        test_users = real_user_selector.get_real_users(n=5, min_interactions=15)
-
-        alignment_results = []
-
-        for user in test_users:
-            print(f"\nUser {user['user_id']} ({user['age']}yr {user['gender']}):")
-
-            # Get user's detailed interactions
-            user_details = real_user_selector.get_user_interaction_details(user['user_id'])
-
-            # Analyze user's category preferences
-            user_categories = []
-            for interaction in user_details['timeline']:
-                category = interaction.get('category_code', 'Unknown')
-                user_categories.append(category)
-
-            user_category_counts = Counter(user_categories)
-            total_user_interactions = len(user_categories)
-
-            print(f"   User's top categories:")
-            for category, count in user_category_counts.most_common(3):
-                percentage = (count / total_user_interactions) * 100
-                print(f"      {category}: {count} ({percentage:.1f}%)")
-
-            # Get recommendations
-            recs = engine.recommend_items_hybrid(
-                age=user['age'],
-                gender=user['gender'],
-                income=user['income'],
-                interaction_history=user['interaction_history'][:20],
-                k=20,
-                collaborative_weight=0.7
-            )
-
-            # Analyze recommendation categories
-            rec_categories = []
-            for _, _, item_info in recs:
-                category = item_info.get('category_code', 'Unknown')
-                rec_categories.append(category)
-
-            rec_category_counts = Counter(rec_categories)
-
-            print(f"   Recommendation categories:")
-            for category, count in rec_category_counts.most_common(3):
-                percentage = (count / len(rec_categories)) * 100
-                match = "✅" if category in user_category_counts else "🆕"
-                print(f"      {category}: {count} ({percentage:.1f}%) {match}")
-
-            # Calculate alignment metrics
-            user_cats = set(user_category_counts.keys())
-            rec_cats = set(rec_category_counts.keys())
-
-            intersection = user_cats & rec_cats
-            alignment_percentage = len(intersection) / len(rec_cats) * 100 if rec_cats else 0
-
-            # Calculate weighted alignment (by user preference strength)
-            weighted_alignment = 0
-            for category in intersection:
-                user_weight = user_category_counts[category] / total_user_interactions
-                rec_weight = rec_category_counts[category] / len(rec_categories)
-                weighted_alignment += min(user_weight, rec_weight)
-
-            alignment_results.append({
-                'user_id': user['user_id'],
-                'alignment_percentage': alignment_percentage,
-                'weighted_alignment': weighted_alignment * 100,
-                'user_categories': len(user_cats),
-                'rec_categories': len(rec_cats),
-                'matched_categories': len(intersection)
-            })
-
-            print(f"   Alignment: {alignment_percentage:.1f}% ({len(intersection)}/{len(rec_cats)} categories)")
-            print(f"   Weighted alignment: {weighted_alignment * 100:.1f}%")
-
-        # Overall alignment analysis
-        print(f"\n📊 OVERALL ALIGNMENT STATISTICS:")
-        avg_alignment = np.mean([r['alignment_percentage'] for r in alignment_results])
-        avg_weighted = np.mean([r['weighted_alignment'] for r in alignment_results])
-        avg_user_cats = np.mean([r['user_categories'] for r in alignment_results])
-        avg_rec_cats = np.mean([r['rec_categories'] for r in alignment_results])
-
-        print(f"   Average alignment: {avg_alignment:.1f}%")
-        print(f"   Average weighted alignment: {avg_weighted:.1f}%")
-        print(f"   Average user categories: {avg_user_cats:.1f}")
-        print(f"   Average rec categories: {avg_rec_cats:.1f}")
-
-        # Quality assessment
-        if avg_alignment < 20:
-            print(f"   ❌ POOR: Very low category alignment")
-        elif avg_alignment < 40:
-            print(f"   ⚠️ FAIR: Low category alignment")
-        elif avg_alignment < 60:
-            print(f"   ✅ GOOD: Moderate category alignment")
-        else:
-            print(f"   🎉 EXCELLENT: High category alignment")
-
-        return alignment_results
-
-    except Exception as e:
-        print(f"❌ Category alignment analysis failed: {e}")
-        return None
-
-def analyze_diversity_metrics():
-    """Analyze diversity metrics in recommendations."""
-
-    print(f"\n🌈 DIVERSITY ANALYSIS")
-    print("="*30)
-
-    try:
-        engine = RecommendationEngine()
-        real_user_selector = RealUserSelector()
-
-        test_users = real_user_selector.get_real_users(n=5, min_interactions=10)
-
-        diversity_results = []
-
-        for user in test_users:
-            print(f"\nUser {user['user_id']}:")
-
-            # Get recommendations
-            recs = engine.recommend_items_hybrid(
-                age=user['age'],
-                gender=user['gender'],
-                income=user['income'],
-                interaction_history=user['interaction_history'][:20],
-                k=20,
-                collaborative_weight=0.7
-            )
-
-            # Extract features for diversity analysis
-            categories = [item_info.get('category_code', 'Unknown') for _, _, item_info in recs]
-            brands = [item_info.get('brand', 'Unknown') for _, _, item_info in recs]
-            prices = [item_info.get('price', 0) for _, _, item_info in recs]
-
-            # Calculate diversity metrics
-            category_diversity = len(set(categories)) / len(categories) if categories else 0
-            brand_diversity = len(set(brands)) / len(brands) if brands else 0
-
-            # Price diversity (coefficient of variation)
-            price_diversity = np.std(prices) / np.mean(prices) if np.mean(prices) > 0 else 0
-
-            # Intra-list diversity (average pairwise dissimilarity)
-            category_counts = Counter(categories)
-            gini_categories = 1 - sum((count / len(categories)) ** 2 for count in category_counts.values())
-
-            diversity_results.append({
-                'user_id': user['user_id'],
-                'category_diversity': category_diversity,
-                'brand_diversity': brand_diversity,
-                'price_diversity': price_diversity,
-                'gini_categories': gini_categories,
-                'unique_categories': len(set(categories)),
-                'unique_brands': len(set(brands))
-            })
-
-            print(f"   Categories: {len(set(categories))} unique ({category_diversity:.2f} ratio)")
-            print(f"   Brands: {len(set(brands))} unique ({brand_diversity:.2f} ratio)")
-            print(f"   Price range: ${min(prices):.2f} - ${max(prices):.2f}")
-            print(f"   Gini (categories): {gini_categories:.2f}")
-
-        # Overall diversity statistics
-        print(f"\n📊 OVERALL DIVERSITY STATISTICS:")
-        avg_cat_diversity = np.mean([r['category_diversity'] for r in diversity_results])
-        avg_brand_diversity = np.mean([r['brand_diversity'] for r in diversity_results])
-        avg_gini = np.mean([r['gini_categories'] for r in diversity_results])
-        avg_unique_cats = np.mean([r['unique_categories'] for r in diversity_results])
-
-        print(f"   Average category diversity: {avg_cat_diversity:.2f}")
-        print(f"   Average brand diversity: {avg_brand_diversity:.2f}")
-        print(f"   Average Gini coefficient: {avg_gini:.2f}")
-        print(f"   Average unique categories: {avg_unique_cats:.1f}")
-
-        # Quality assessment
-        if avg_cat_diversity < 0.3:
-            print(f"   ❌ POOR: Low category diversity - recommendations too similar")
-        elif avg_cat_diversity < 0.5:
-            print(f"   ⚠️ FAIR: Moderate category diversity")
-        else:
-            print(f"   ✅ GOOD: High category diversity")
-
-        return diversity_results
-
-    except Exception as e:
-        print(f"❌ Diversity analysis failed: {e}")
-        return None
-
-def analyze_embedding_quality():
-    """Analyze the quality of user and item embeddings."""
-
-    print(f"\n🧠 EMBEDDING QUALITY ANALYSIS")
-    print("="*35)
-
-    try:
-        engine = RecommendationEngine()
-        real_user_selector = RealUserSelector()
-
-        test_users = real_user_selector.get_real_users(n=3, min_interactions=10)
-
-        user_embeddings = []
-        user_item_similarities = []
-
-        for user in test_users:
-            print(f"\nUser {user['user_id']}:")
-
-            # Get user embedding
-            user_emb = engine.get_user_embedding(
-                age=user['age'],
-                gender=user['gender'],
-                income=user['income'],
-                interaction_history=user['interaction_history'][:10]
-            )
-
-            user_embeddings.append(user_emb)
-
-            print(f"   User embedding shape: {user_emb.shape}")
-            print(f"   User embedding norm: {np.linalg.norm(user_emb):.4f}")
-            print(f"   User embedding mean: {user_emb.mean():.4f}")
-            print(f"   User embedding std: {user_emb.std():.4f}")
-
-            # Get embeddings for user's interaction history
-            item_similarities = []
-            for item_id in user['interaction_history'][:5]:
-                item_emb = engine.get_item_embedding(item_id)
-                if item_emb is not None:
-                    similarity = np.dot(user_emb, item_emb)
-                    item_similarities.append(similarity)
-
-            if item_similarities:
-                user_item_similarities.extend(item_similarities)
-                print(f"   Avg similarity with interacted items: {np.mean(item_similarities):.4f}")
-                print(f"   Similarity range: {min(item_similarities):.4f} - {max(item_similarities):.4f}")
-
-        # Analyze user embedding diversity
-        if len(user_embeddings) > 1:
-            user_embeddings = np.array(user_embeddings)
-
-            # User-user similarities
-            user_similarities = []
-            for i in range(len(user_embeddings)):
-                for j in range(i+1, len(user_embeddings)):
-                    sim = np.dot(user_embeddings[i], user_embeddings[j])
-                    user_similarities.append(sim)
-
-            print(f"\n📊 USER EMBEDDING ANALYSIS:")
-            print(f"   User-user similarities: {np.mean(user_similarities):.4f} ± {np.std(user_similarities):.4f}")
-            print(f"   User-item similarities: {np.mean(user_item_similarities):.4f} ± {np.std(user_item_similarities):.4f}")
-
-            # Quality assessment
-            if np.mean(user_similarities) > 0.9:
-                print(f"   ⚠️ WARNING: Users too similar - possible embedding collapse")
-            elif np.mean(user_similarities) > 0.7:
-                print(f"   ⚠️ CAUTION: High user similarity - limited personalization")
-            else:
-                print(f"   ✅ GOOD: Adequate user embedding diversity")
-
-        return {
-            'user_embeddings': user_embeddings,
-            'user_similarities': user_similarities if len(user_embeddings) > 1 else [],
-            'user_item_similarities': user_item_similarities
-        }
-
-    except Exception as e:
-        print(f"❌ Embedding analysis failed: {e}")
-        return None
-
-def analyze_performance_metrics():
-    """Analyze performance and efficiency metrics."""
-
-    print(f"\n⚡ PERFORMANCE ANALYSIS")
-    print("="*25)
-
-    try:
-        engine = RecommendationEngine()
-        real_user_selector = RealUserSelector()
-
-        test_user = real_user_selector.get_real_users(n=1, min_interactions=10)[0]
-
-        # Test recommendation generation speed
-        print("Testing recommendation generation speed...")
-
-        methods = [
-            ('Collaborative', lambda: engine.recommend_items_collaborative(
-                age=test_user['age'], gender=test_user['gender'],
-                income=test_user['income'], interaction_history=test_user['interaction_history'][:20], k=10
-            )),
-            ('Hybrid', lambda: engine.recommend_items_hybrid(
-                age=test_user['age'], gender=test_user['gender'],
-                income=test_user['income'], interaction_history=test_user['interaction_history'][:20], k=10
-            )),
-        ]
-
-        for method_name, method_func in methods:
-            times = []
-            for _ in range(5):  # Run 5 times for average
-                start_time = time.time()
-                recs = method_func()
-                end_time = time.time()
-                times.append(end_time - start_time)
-
-            avg_time = np.mean(times)
-            print(f"   {method_name}: {avg_time:.3f}s ± {np.std(times):.3f}s")
-
-            if avg_time > 1.0:
-                print(f"      ⚠️ SLOW: Consider optimization")
-            elif avg_time > 0.5:
-                print(f"      ⚠️ MODERATE: Acceptable for real-time")
-            else:
-                print(f"      ✅ FAST: Good for real-time recommendations")
-
-        # Test scalability with different recommendation counts
-        print(f"\nTesting scalability...")
-        for k in [10, 50, 100]:
-            start_time = time.time()
-            recs = engine.recommend_items_hybrid(
-                age=test_user['age'], gender=test_user['gender'],
-                income=test_user['income'], interaction_history=test_user['interaction_history'][:20], k=k
-            )
-            end_time = time.time()
-
-            print(f"   {k} recommendations: {end_time - start_time:.3f}s")
-
-        return True
-
-    except Exception as e:
-        print(f"❌ Performance analysis failed: {e}")
-        return False
-
-def generate_quality_report():
-    """Generate a comprehensive quality report."""
-
-    print(f"\n📋 COMPREHENSIVE QUALITY REPORT")
-    print("="*40)
-
-    # Run all analyses
-    score_results = analyze_score_distribution()
-    alignment_results = analyze_category_alignment()
-    diversity_results = analyze_diversity_metrics()
-    embedding_results = analyze_embedding_quality()
-    performance_results = analyze_performance_metrics()
-
-    # Generate summary
-    print(f"\n🎯 QUALITY SUMMARY:")
-
-    issues = []
-    strengths = []
-
-    # Check score quality
-    if score_results:
-        for method, scores in score_results.items():
-            if scores:
-                score_variance = np.var(scores)
-                score_range = max(scores) - min(scores)
-
-                if score_variance < 0.001:
-                    issues.append(f"Low {method} score variance ({score_variance:.6f})")
-                if score_range < 0.1:
-                    issues.append(f"Narrow {method} score range ({score_range:.4f})")
-
-                if score_variance > 0.01 and score_range > 0.3:
-                    strengths.append(f"Good {method} score discrimination")
-
-    # Check alignment quality
-    if alignment_results:
-        avg_alignment = np.mean([r['alignment_percentage'] for r in alignment_results])
-        if avg_alignment < 30:
-            issues.append(f"Low category alignment ({avg_alignment:.1f}%)")
-        elif avg_alignment > 50:
-            strengths.append(f"Good category alignment ({avg_alignment:.1f}%)")
-
-    # Check diversity
-    if diversity_results:
-        avg_diversity = np.mean([r['category_diversity'] for r in diversity_results])
-        if avg_diversity < 0.3:
-            issues.append(f"Low category diversity ({avg_diversity:.2f})")
-        elif avg_diversity > 0.5:
-            strengths.append(f"Good category diversity ({avg_diversity:.2f})")
-
-    # Print results
-    if issues:
-        print(f"\n❌ ISSUES IDENTIFIED:")
-        for issue in issues:
-            print(f"   • {issue}")
-
-    if strengths:
-        print(f"\n✅ STRENGTHS:")
-        for strength in strengths:
-            print(f"   • {strength}")
-
-    # Recommendations
-    print(f"\n💡 RECOMMENDATIONS:")
-    if any("score variance" in issue for issue in issues):
-        print("   • Increase embedding dimensions or add temperature scaling")
-    if any("alignment" in issue for issue in issues):
-        print("   • Implement category-aware recommendation boosting")
-    if any("diversity" in issue for issue in issues):
-        print("   • Add diversity regularization to recommendation algorithm")
-
-    if not issues:
-        print("   • No major issues detected - model performing well!")
-
-def main():
-    """Main analysis function."""
-
-    print("🔍 TWO-TOWER RECOMMENDATION QUALITY ANALYSIS")
-    print("="*60)
-
-    try:
-        generate_quality_report()
-
-        print(f"\n✅ Analysis completed successfully!")
-
    except Exception as e:
-        print(f"❌ Analysis failed: {e}")
-        import traceback
-        traceback.print_exc()
-
-if __name__ == "__main__":
-    main()
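The diversity metrics computed in the removed script reduce to a unique-category ratio and a Gini-style impurity (1 minus the sum of squared category shares). Isolated from the engine, the same arithmetic looks like this (a sketch; the script computes it inline rather than in a helper):

```python
from collections import Counter

def diversity_metrics(categories):
    """Return (unique-category ratio, Gini impurity) for a recommendation list.

    Gini impurity is 1 - sum of squared category shares: 0 when every item
    shares one category, approaching 1 as categories spread out evenly.
    """
    counts = Counter(categories)
    ratio = len(counts) / len(categories)
    gini = 1 - sum((c / len(categories)) ** 2 for c in counts.values())
    return ratio, gini

ratio, gini = diversity_metrics(["shoes", "shoes", "bags", "watches"])
print(round(ratio, 2), round(gini, 3))  # 0.75 0.625
```

Against the script's thresholds, a ratio below 0.3 would be flagged as poor diversity and above 0.5 as good.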
analyze_recommendations.py ADDED
@@ -0,0 +1,543 @@
+#!/usr/bin/env python3
+"""
+Recommendation Analysis Script
+
+This script compares recommendations from both training approaches:
+1. 2-phase training (pre-trained item tower + joint fine-tuning)
+2. Single joint training (end-to-end optimization)
+
+It analyzes:
+- Category alignment between user interactions and recommendations
+- Diversity of recommended categories
+- Overlap between the two approaches
+- Performance on real users
+
+Usage:
+    python analyze_recommendations.py
+"""
+
+import os
+import sys
+import numpy as np
+import pandas as pd
+from collections import defaultdict, Counter
+from typing import Dict, List, Tuple
+import matplotlib.pyplot as plt
+import seaborn as sns
+
+# Add src to path
+sys.path.append('src')
+
+from src.inference.recommendation_engine import RecommendationEngine
+from src.utils.real_user_selector import RealUserSelector
+
+
+class RecommendationAnalyzer:
+    """Analyzer for comparing different recommendation approaches."""
+
+    def __init__(self):
+        self.recommendation_engine = None
+        self.real_user_selector = None
+        self.items_df = None
+        self.setup_engines()
+
+    def setup_engines(self):
+        """Setup recommendation engines and data."""
+        print("Loading recommendation engines...")
+
+        try:
+            # Load recommendation engine (assumes trained model artifacts exist)
+            self.recommendation_engine = RecommendationEngine()
+            print("✅ Recommendation engine loaded")
+        except Exception as e:
+            print(f"❌ Error loading recommendation engine: {e}")
+            return
+
+        try:
+            # Load real user selector
+            self.real_user_selector = RealUserSelector()
+            print("✅ Real user selector loaded")
+        except Exception as e:
+            print(f"❌ Error loading real user selector: {e}")
+
+        # Load items data for category analysis
+        self.items_df = pd.read_csv("datasets/items.csv")
+        print(f"✅ Loaded {len(self.items_df)} items")
+
+    def get_item_categories(self, item_ids: List[int]) -> List[str]:
+        """Get category codes for given item IDs."""
+        categories = []
+        for item_id in item_ids:
+            item_row = self.items_df[self.items_df['product_id'] == item_id]
+            if len(item_row) > 0:
+                categories.append(item_row.iloc[0]['category_code'])
+            else:
+                categories.append('unknown')
+        return categories
+
+    def analyze_user_recommendations(self,
+                                     user_profile: Dict,
+                                     recommendation_types: List[str] = None) -> Dict:
+        """Analyze recommendations for a single user across different approaches."""
+
+        if recommendation_types is None:
+            recommendation_types = ['collaborative', 'hybrid', 'content']
+
+        results = {
+            'user_profile': user_profile,
+            'interaction_categories': [],
+            'recommendations': {},
+            'category_analysis': {}
+        }
+
+        # Get categories from user's interaction history
+        if user_profile['interaction_history']:
+            results['interaction_categories'] = self.get_item_categories(
+                user_profile['interaction_history']
+            )
+
+        # Get recommendations for each type
+        for rec_type in recommendation_types:
+            try:
+                if rec_type == 'collaborative':
+                    recs = self.recommendation_engine.recommend_items_collaborative(
+                        age=user_profile['age'],
+                        gender=user_profile['gender'],
+                        income=user_profile['income'],
+                        interaction_history=user_profile['interaction_history'],
+                        k=10
+                    )
+                elif rec_type == 'hybrid':
+                    recs = self.recommendation_engine.recommend_items_hybrid(
+                        age=user_profile['age'],
+                        gender=user_profile['gender'],
+                        income=user_profile['income'],
+                        interaction_history=user_profile['interaction_history'],
+                        k=10
+                    )
+                elif rec_type == 'content' and user_profile['interaction_history']:
+                    recs = self.recommendation_engine.recommend_items_content_based(
+                        seed_item_id=user_profile['interaction_history'][-1],
+                        k=10
+                    )
+                else:
+                    continue
+
+                # Extract item IDs and categories
+                item_ids = [item_id for item_id, score, info in recs]
+                rec_categories = self.get_item_categories(item_ids)
+
+                results['recommendations'][rec_type] = {
+                    'items': recs,
+                    'item_ids': item_ids,
+                    'categories': rec_categories,
+                    'scores': [score for item_id, score, info in recs]
+                }
+
+                # Analyze category alignment
+                results['category_analysis'][rec_type] = self.analyze_category_alignment(
+                    results['interaction_categories'],
+                    rec_categories
+                )
+
+            except Exception as e:
+                print(f"Error generating {rec_type} recommendations: {e}")
+
+        return results
+
+    def analyze_category_alignment(self,
+                                   interaction_categories: List[str],
+                                   recommendation_categories: List[str]) -> Dict:
+        """Analyze alignment between interaction and recommendation categories."""
+
+        if not interaction_categories:
+            return {
+                'overlap_ratio': 0.0,
+                'unique_interaction_categories': 0,
+                'unique_recommendation_categories': len(set(recommendation_categories)),
+                'common_categories': [],
+                'category_distribution': Counter(recommendation_categories)
160
+ }
161
+
162
+ interaction_set = set(interaction_categories)
163
+ recommendation_set = set(recommendation_categories)
164
+
165
+ common_categories = interaction_set.intersection(recommendation_set)
166
+ overlap_ratio = len(common_categories) / len(interaction_set) if interaction_set else 0.0
167
+
168
+ return {
169
+ 'overlap_ratio': overlap_ratio,
170
+ 'unique_interaction_categories': len(interaction_set),
171
+ 'unique_recommendation_categories': len(recommendation_set),
172
+ 'common_categories': list(common_categories),
173
+ 'category_distribution': Counter(recommendation_categories),
174
+ 'interaction_category_distribution': Counter(interaction_categories)
175
+ }
176
+
177
+ def compare_recommendation_approaches(self,
178
+ users_sample: List[Dict],
179
+ approaches: List[str] = None) -> Dict:
180
+ """Compare different recommendation approaches across multiple users."""
181
+
182
+ if approaches is None:
183
+ approaches = ['collaborative', 'hybrid', 'content']
184
+
185
+ comparison_results = {
186
+ 'approach_stats': defaultdict(list),
187
+ 'cross_approach_analysis': {},
188
+ 'user_results': []
189
+ }
190
+
191
+ print(f"Analyzing {len(users_sample)} users across {len(approaches)} approaches...")
192
+
193
+ for i, user in enumerate(users_sample):
194
+ print(f"Analyzing user {i+1}/{len(users_sample)}...")
195
+
196
+ user_results = self.analyze_user_recommendations(user, approaches)
197
+ comparison_results['user_results'].append(user_results)
198
+
199
+ # Aggregate stats by approach
200
+ for approach in approaches:
201
+ if approach in user_results['category_analysis']:
202
+ analysis = user_results['category_analysis'][approach]
203
+ comparison_results['approach_stats'][approach].append({
204
+ 'overlap_ratio': analysis['overlap_ratio'],
205
+ 'unique_rec_categories': analysis['unique_recommendation_categories'],
206
+ 'common_categories_count': len(analysis['common_categories'])
207
+ })
208
+
209
+ # Calculate aggregate statistics
210
+ for approach in approaches:
211
+ stats = comparison_results['approach_stats'][approach]
212
+ if stats:
213
+ comparison_results['approach_stats'][approach] = {
214
+ 'avg_overlap_ratio': np.mean([s['overlap_ratio'] for s in stats]),
215
+ 'std_overlap_ratio': np.std([s['overlap_ratio'] for s in stats]),
216
+ 'avg_unique_categories': np.mean([s['unique_rec_categories'] for s in stats]),
217
+ 'avg_common_categories': np.mean([s['common_categories_count'] for s in stats]),
218
+ 'total_users': len(stats)
219
+ }
220
+
221
+ # Cross-approach analysis
222
+ comparison_results['cross_approach_analysis'] = self.cross_approach_analysis(
223
+ comparison_results['user_results'], approaches
224
+ )
225
+
226
+ return comparison_results
227
+
228
+ def cross_approach_analysis(self, user_results: List[Dict], approaches: List[str]) -> Dict:
229
+ """Analyze similarities and differences between approaches."""
230
+
231
+ cross_analysis = {
232
+ 'item_overlap': defaultdict(dict),
233
+ 'category_overlap': defaultdict(dict),
234
+ 'score_correlation': defaultdict(dict)
235
+ }
236
+
237
+ for user_result in user_results:
238
+ recommendations = user_result['recommendations']
239
+
240
+ # Compare each pair of approaches
241
+ for i, approach1 in enumerate(approaches):
242
+ for approach2 in approaches[i+1:]:
243
+ if approach1 in recommendations and approach2 in recommendations:
244
+
245
+ # Item overlap
246
+ items1 = set(recommendations[approach1]['item_ids'])
247
+ items2 = set(recommendations[approach2]['item_ids'])
248
+ item_overlap_ratio = len(items1.intersection(items2)) / len(items1.union(items2))
249
+
250
+ # Category overlap
251
+ cats1 = set(recommendations[approach1]['categories'])
252
+ cats2 = set(recommendations[approach2]['categories'])
253
+ cat_overlap_ratio = len(cats1.intersection(cats2)) / len(cats1.union(cats2)) if cats1.union(cats2) else 0
254
+
255
+ # Store results
256
+ pair_key = f"{approach1}_vs_{approach2}"
257
+ if pair_key not in cross_analysis['item_overlap']:
258
+ cross_analysis['item_overlap'][pair_key] = []
259
+ cross_analysis['category_overlap'][pair_key] = []
260
+
261
+ cross_analysis['item_overlap'][pair_key].append(item_overlap_ratio)
262
+ cross_analysis['category_overlap'][pair_key].append(cat_overlap_ratio)
263
+
264
+ # Calculate averages
265
+ for pair_key in cross_analysis['item_overlap']:
266
+ cross_analysis['item_overlap'][pair_key] = {
267
+ 'avg': np.mean(cross_analysis['item_overlap'][pair_key]),
268
+ 'std': np.std(cross_analysis['item_overlap'][pair_key])
269
+ }
270
+ cross_analysis['category_overlap'][pair_key] = {
271
+ 'avg': np.mean(cross_analysis['category_overlap'][pair_key]),
272
+ 'std': np.std(cross_analysis['category_overlap'][pair_key])
273
+ }
274
+
275
+ return cross_analysis
276
+
277
+ def generate_report(self, comparison_results: Dict, output_file: str = "recommendation_analysis_report.md"):
278
+ """Generate a comprehensive analysis report."""
279
+
280
+ report = []
281
+ report.append("# Recommendation System Analysis Report")
282
+ report.append(f"Generated: {pd.Timestamp.now()}")
283
+ report.append("")
284
+
285
+ # Overall Statistics
286
+ report.append("## Overall Statistics")
287
+ report.append("")
288
+
289
+ for approach, stats in comparison_results['approach_stats'].items():
290
+ if isinstance(stats, dict):
291
+ report.append(f"### {approach.title()} Recommendations")
292
+ report.append(f"- **Average Category Overlap**: {stats['avg_overlap_ratio']:.3f} ± {stats['std_overlap_ratio']:.3f}")
293
+ report.append(f"- **Average Unique Categories per User**: {stats['avg_unique_categories']:.1f}")
294
+ report.append(f"- **Average Common Categories**: {stats['avg_common_categories']:.1f}")
295
+ report.append(f"- **Users Analyzed**: {stats['total_users']}")
296
+ report.append("")
297
+
298
+ # Cross-Approach Analysis
299
+ report.append("## Cross-Approach Comparison")
300
+ report.append("")
301
+
302
+ cross_analysis = comparison_results['cross_approach_analysis']
303
+
304
+ report.append("### Item Overlap Between Approaches")
305
+ for pair, overlap_stats in cross_analysis['item_overlap'].items():
306
+ report.append(f"- **{pair.replace('_', ' ').title()}**: {overlap_stats['avg']:.3f} ± {overlap_stats['std']:.3f}")
307
+ report.append("")
308
+
309
+ report.append("### Category Overlap Between Approaches")
310
+ for pair, overlap_stats in cross_analysis['category_overlap'].items():
311
+ report.append(f"- **{pair.replace('_', ' ').title()}**: {overlap_stats['avg']:.3f} ± {overlap_stats['std']:.3f}")
312
+ report.append("")
313
+
314
+ # Category Alignment Analysis
315
+ report.append("## Category Alignment Analysis")
316
+ report.append("")
317
+ report.append("Category alignment measures how well recommendations match the categories")
318
+ report.append("of items users have previously interacted with.")
319
+ report.append("")
320
+
321
+ # Find best performing approach
322
+ best_approach = max(
323
+ comparison_results['approach_stats'].keys(),
324
+ key=lambda k: comparison_results['approach_stats'][k]['avg_overlap_ratio']
325
+ if isinstance(comparison_results['approach_stats'][k], dict) else 0
326
+ )
327
+
328
+ report.append(f"**Best Category Alignment**: {best_approach.title()} approach")
329
+ report.append("")
330
+
331
+ # Recommendations
332
+ report.append("## Key Findings & Recommendations")
333
+ report.append("")
334
+
335
+ # Analyze overlap ratios to provide insights
336
+ overlap_ratios = {
337
+ k: v['avg_overlap_ratio'] for k, v in comparison_results['approach_stats'].items()
338
+ if isinstance(v, dict)
339
+ }
340
+
341
+ if overlap_ratios:
342
+ avg_overlap = np.mean(list(overlap_ratios.values()))
343
+ if avg_overlap > 0.5:
344
+ report.append("✅ **Strong Category Alignment**: Recommendations show good alignment with user interaction patterns.")
345
+ elif avg_overlap > 0.3:
346
+ report.append("⚠️ **Moderate Category Alignment**: Some alignment present but room for improvement.")
347
+ else:
348
+ report.append("❌ **Weak Category Alignment**: Recommendations may be too diverse or not well-aligned with user preferences.")
349
+
350
+ report.append("")
351
+
352
+ # Compare approaches
353
+ if len(overlap_ratios) > 1:
354
+ sorted_approaches = sorted(overlap_ratios.items(), key=lambda x: x[1], reverse=True)
355
+ report.append("### Approach Rankings (by category alignment):")
356
+ for i, (approach, ratio) in enumerate(sorted_approaches, 1):
357
+ report.append(f"{i}. **{approach.title()}**: {ratio:.3f}")
358
+ report.append("")
359
+
360
+ # Write report
361
+ with open(output_file, 'w') as f:
362
+ f.write('\n'.join(report))
363
+
364
+ print(f"✅ Analysis report saved to: {output_file}")
365
+ return '\n'.join(report)
366
+
367
+ def visualize_results(self, comparison_results: Dict, save_plots: bool = True):
368
+ """Create visualizations for the analysis results."""
369
+
370
+ # Set up plotting style
371
+ plt.style.use('default')
372
+ sns.set_palette("husl")
373
+
374
+ # Create figure with subplots
375
+ fig, axes = plt.subplots(2, 2, figsize=(15, 12))
376
+ fig.suptitle('Recommendation System Analysis', fontsize=16, fontweight='bold')
377
+
378
+ # 1. Category Overlap by Approach
379
+ ax1 = axes[0, 0]
380
+ approaches = []
381
+ overlap_means = []
382
+ overlap_stds = []
383
+
384
+ for approach, stats in comparison_results['approach_stats'].items():
385
+ if isinstance(stats, dict):
386
+ approaches.append(approach.title())
387
+ overlap_means.append(stats['avg_overlap_ratio'])
388
+ overlap_stds.append(stats['std_overlap_ratio'])
389
+
390
+ bars1 = ax1.bar(approaches, overlap_means, yerr=overlap_stds, capsize=5, alpha=0.7)
391
+ ax1.set_title('Average Category Overlap by Approach')
392
+ ax1.set_ylabel('Category Overlap Ratio')
393
+ ax1.set_ylim(0, 1)
394
+
395
+ # Add value labels on bars
396
+ for bar, mean in zip(bars1, overlap_means):
397
+ height = bar.get_height()
398
+ ax1.text(bar.get_x() + bar.get_width()/2., height + 0.01,
399
+ f'{mean:.3f}', ha='center', va='bottom')
400
+
401
+ # 2. Cross-Approach Item Overlap
402
+ ax2 = axes[0, 1]
403
+ cross_analysis = comparison_results['cross_approach_analysis']
404
+
405
+ pair_names = []
406
+ item_overlaps = []
407
+
408
+ for pair, overlap_stats in cross_analysis['item_overlap'].items():
409
+ pair_names.append(pair.replace('_vs_', ' vs ').title())
410
+ item_overlaps.append(overlap_stats['avg'])
411
+
412
+ if pair_names:
413
+ bars2 = ax2.bar(pair_names, item_overlaps, alpha=0.7, color='coral')
414
+ ax2.set_title('Item Overlap Between Approaches')
415
+ ax2.set_ylabel('Item Overlap Ratio')
416
+ ax2.set_ylim(0, 1)
417
+ plt.setp(ax2.get_xticklabels(), rotation=45, ha='right')
418
+
419
+ # Add value labels
420
+ for bar, overlap in zip(bars2, item_overlaps):
421
+ height = bar.get_height()
422
+ ax2.text(bar.get_x() + bar.get_width()/2., height + 0.01,
423
+ f'{overlap:.3f}', ha='center', va='bottom')
424
+
425
+ # 3. Category Diversity
426
+ ax3 = axes[1, 0]
427
+ unique_categories = []
428
+ for approach, stats in comparison_results['approach_stats'].items():
429
+ if isinstance(stats, dict):
430
+ unique_categories.append(stats['avg_unique_categories'])
431
+
432
+ bars3 = ax3.bar(approaches, unique_categories, alpha=0.7, color='lightgreen')
433
+ ax3.set_title('Average Unique Categories per Recommendation')
434
+ ax3.set_ylabel('Number of Unique Categories')
435
+
436
+ for bar, cats in zip(bars3, unique_categories):
437
+ height = bar.get_height()
438
+ ax3.text(bar.get_x() + bar.get_width()/2., height + 0.1,
439
+ f'{cats:.1f}', ha='center', va='bottom')
440
+
441
+ # 4. Category vs Item Overlap Comparison
442
+ ax4 = axes[1, 1]
443
+
444
+ if cross_analysis['item_overlap'] and cross_analysis['category_overlap']:
445
+ pairs = list(cross_analysis['item_overlap'].keys())
446
+ item_overlaps = [cross_analysis['item_overlap'][p]['avg'] for p in pairs]
447
+ cat_overlaps = [cross_analysis['category_overlap'][p]['avg'] for p in pairs]
448
+
449
+ x = np.arange(len(pairs))
450
+ width = 0.35
451
+
452
+ bars4a = ax4.bar(x - width/2, item_overlaps, width, label='Item Overlap', alpha=0.7)
453
+ bars4b = ax4.bar(x + width/2, cat_overlaps, width, label='Category Overlap', alpha=0.7)
454
+
455
+ ax4.set_title('Item vs Category Overlap Between Approaches')
456
+ ax4.set_ylabel('Overlap Ratio')
457
+ ax4.set_xticks(x)
458
+ ax4.set_xticklabels([p.replace('_vs_', ' vs ') for p in pairs], rotation=45, ha='right')
459
+ ax4.legend()
460
+ ax4.set_ylim(0, 1)
461
+
462
+ plt.tight_layout()
463
+
464
+ if save_plots:
465
+ plt.savefig('recommendation_analysis_plots.png', dpi=300, bbox_inches='tight')
466
+ print("✅ Plots saved to: recommendation_analysis_plots.png")
467
+
468
+ plt.show()
469
+
470
+
471
+ def main():
472
+ """Main function to run the recommendation analysis."""
473
+
474
+ print("🔍 Starting Recommendation Analysis...")
475
+ print("=" * 50)
476
+
477
+ # Initialize analyzer
478
+ analyzer = RecommendationAnalyzer()
479
+
480
+ if analyzer.recommendation_engine is None:
481
+ print("❌ Cannot proceed without recommendation engine. Please ensure model is trained.")
482
+ return
483
+
484
+ # Get sample of real users for analysis
485
+ print("Getting real user sample...")
486
+ try:
487
+ real_users = analyzer.real_user_selector.get_real_users(n=20, min_interactions=3)
488
+ print(f"✅ Loaded {len(real_users)} real users for analysis")
489
+ except Exception as e:
490
+ print(f"❌ Error loading real users: {e}")
491
+ # Fallback to synthetic users
492
+ real_users = [
493
+ {
494
+ 'age': 32, 'gender': 'male', 'income': 75000,
495
+ 'interaction_history': [1000978, 1001588, 1001618, 1002039]
496
+ },
497
+ {
498
+ 'age': 28, 'gender': 'female', 'income': 45000,
499
+ 'interaction_history': [1003456, 1004567, 1005678]
500
+ },
501
+ {
502
+ 'age': 45, 'gender': 'male', 'income': 85000,
503
+ 'interaction_history': [1006789, 1007890, 1008901, 1009012, 1010123]
504
+ }
505
+ ]
506
+ print(f"Using {len(real_users)} synthetic users for analysis")
507
+
508
+ # Run comprehensive analysis
509
+ print("Running recommendation analysis...")
510
+ approaches = ['collaborative', 'hybrid', 'content']
511
+
512
+ comparison_results = analyzer.compare_recommendation_approaches(
513
+ users_sample=real_users,
514
+ approaches=approaches
515
+ )
516
+
517
+ # Generate report
518
+ print("Generating analysis report...")
519
+ report = analyzer.generate_report(comparison_results)
520
+
521
+ # Create visualizations
522
+ print("Creating visualizations...")
523
+ try:
524
+ analyzer.visualize_results(comparison_results, save_plots=True)
525
+ except Exception as e:
526
+ print(f"Warning: Could not create visualizations: {e}")
527
+
528
+ # Print summary
529
+ print("\n" + "=" * 50)
530
+ print("📊 ANALYSIS SUMMARY")
531
+ print("=" * 50)
532
+
533
+ for approach, stats in comparison_results['approach_stats'].items():
534
+ if isinstance(stats, dict):
535
+ print(f"{approach.title()}: {stats['avg_overlap_ratio']:.3f} avg category overlap")
536
+
537
+ print(f"\n✅ Analysis complete! Check:")
538
+ print(" 📄 recommendation_analysis_report.md")
539
+ print(" 📊 recommendation_analysis_plots.png")
540
+
541
+
542
+ if __name__ == "__main__":
543
+ main()
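The category-alignment metric computed by `analyze_category_alignment` above is plain set overlap: the share of a user's distinct interaction categories that reappear among the recommended items' categories. A minimal standalone restatement of that logic (the function name and sample values here are illustrative, not part of the repo):

```python
def category_overlap(interaction_categories, recommendation_categories):
    """Fraction of the user's distinct interaction categories that also
    appear among the recommendation categories (same ratio the script's
    analyze_category_alignment reports as 'overlap_ratio')."""
    interaction_set = set(interaction_categories)
    if not interaction_set:
        return 0.0
    common = interaction_set & set(recommendation_categories)
    return len(common) / len(interaction_set)


# Two distinct interaction categories; one of them is recommended back:
print(category_overlap(["audio", "audio", "phones"], ["phones", "tv"]))  # 0.5
```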
api/main.py CHANGED
@@ -13,6 +13,7 @@ sys.path.append(parent_dir)
 os.chdir(parent_dir)  # Change to project root directory
 
 from src.inference.recommendation_engine import RecommendationEngine
+from src.utils.real_user_selector import RealUserSelector
 
 # Initialize FastAPI app
 app = FastAPI(
@@ -30,8 +31,10 @@
     allow_headers=["*"],
 )
 
-# Global recommendation engine instance
+# Global instances
 recommendation_engine = None
+enhanced_recommendation_engine = None
+real_user_selector = None
 
 
 # Pydantic models for request/response
@@ -45,8 +48,11 @@ class UserProfile(BaseModel):
 class RecommendationRequest(BaseModel):
     user_profile: UserProfile
     num_recommendations: int = 10
-    recommendation_type: str = "hybrid"  # "collaborative", "content", "hybrid"
+    recommendation_type: str = "hybrid"  # "collaborative", "content", "hybrid", "enhanced", "enhanced_128d", "category_focused"
     collaborative_weight: Optional[float] = 0.7
+    category_boost: Optional[float] = 1.5  # For enhanced recommendations
+    enable_category_boost: Optional[bool] = True
+    enable_diversity: Optional[bool] = True
 
 
 class ItemSimilarityRequest(BaseModel):
@@ -87,10 +93,27 @@ class RatingPredictionResponse(BaseModel):
     item_info: ItemInfo
 
 
+class RealUserProfile(BaseModel):
+    user_id: int
+    age: int
+    gender: str
+    income: int
+    interaction_history: List[int]
+    interaction_stats: Dict[str, int]
+    interaction_pattern: str
+    summary: str
+
+
+class RealUsersResponse(BaseModel):
+    users: List[RealUserProfile]
+    total_count: int
+    dataset_summary: Dict[str, Any]
+
+
 @app.on_event("startup")
 async def startup_event():
-    """Initialize the recommendation engine on startup."""
-    global recommendation_engine
+    """Initialize the recommendation engines and real user selector on startup."""
+    global recommendation_engine, enhanced_recommendation_engine, real_user_selector
 
     try:
         print("Loading recommendation engine...")
@@ -99,6 +122,30 @@ async def startup_event():
     except Exception as e:
         print(f"Error loading recommendation engine: {e}")
         recommendation_engine = None
+
+    try:
+        print("Loading enhanced recommendation engine...")
+        # Try enhanced 128D engine first, fallback to regular enhanced
+        try:
+            from src.inference.enhanced_recommendation_engine_128d import Enhanced128DRecommendationEngine
+            enhanced_recommendation_engine = Enhanced128DRecommendationEngine()
+            print("✅ Using Enhanced 128D Recommendation Engine")
+        except:
+            from src.inference.enhanced_recommendation_engine import EnhancedRecommendationEngine
+            enhanced_recommendation_engine = EnhancedRecommendationEngine()
+            print("⚠️ Using fallback Enhanced Recommendation Engine")
+        print("Enhanced recommendation engine loaded successfully!")
+    except Exception as e:
+        print(f"Error loading enhanced recommendation engine: {e}")
+        enhanced_recommendation_engine = None
+
+    try:
+        print("Loading real user selector...")
+        real_user_selector = RealUserSelector()
+        print("Real user selector loaded successfully!")
+    except Exception as e:
+        print(f"Error loading real user selector: {e}")
+        real_user_selector = None
 
 
 @app.get("/")
@@ -120,6 +167,68 @@ async def health_check():
     }
 
 
+@app.get("/real-users", response_model=RealUsersResponse)
+async def get_real_users(count: int = 100, min_interactions: int = 5):
+    """Get real user profiles with genuine interaction histories."""
+
+    if real_user_selector is None:
+        raise HTTPException(status_code=503, detail="Real user selector not available")
+
+    try:
+        # Get real user profiles
+        real_users = real_user_selector.get_real_users(n=count, min_interactions=min_interactions)
+
+        # Get dataset summary
+        dataset_summary = real_user_selector.get_dataset_summary()
+
+        # Format users for response
+        formatted_users = []
+        for user in real_users:
+            formatted_users.append(RealUserProfile(**user))
+
+        return RealUsersResponse(
+            users=formatted_users,
+            total_count=len(formatted_users),
+            dataset_summary=dataset_summary
+        )
+
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=f"Error retrieving real users: {str(e)}")
+
+
+@app.get("/real-users/{user_id}")
+async def get_real_user_details(user_id: int):
+    """Get detailed interaction breakdown for a specific real user."""
+
+    if real_user_selector is None:
+        raise HTTPException(status_code=503, detail="Real user selector not available")
+
+    try:
+        user_details = real_user_selector.get_user_interaction_details(user_id)
+
+        if "error" in user_details:
+            raise HTTPException(status_code=404, detail=user_details["error"])
+
+        return user_details
+
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=f"Error retrieving user details: {str(e)}")
+
+
+@app.get("/dataset-summary")
+async def get_dataset_summary():
+    """Get summary statistics of the real dataset."""
+
+    if real_user_selector is None:
+        raise HTTPException(status_code=503, detail="Real user selector not available")
+
+    try:
+        return real_user_selector.get_dataset_summary()
+
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=f"Error retrieving dataset summary: {str(e)}")
+
+
 @app.post("/recommendations", response_model=RecommendationsResponse)
 async def get_recommendations(request: RecommendationRequest):
     """Get item recommendations for a user."""
@@ -164,10 +273,67 @@ async def get_recommendations(request: RecommendationRequest):
             collaborative_weight=request.collaborative_weight
         )
 
+    elif request.recommendation_type == "enhanced":
+        if enhanced_recommendation_engine is None:
+            raise HTTPException(status_code=503, detail="Enhanced recommendation engine not available")
+
+        # Check if it's the 128D engine or fallback
+        if hasattr(enhanced_recommendation_engine, 'recommend_items_enhanced'):
+            # 128D Enhanced engine
+            recommendations = enhanced_recommendation_engine.recommend_items_enhanced(
+                age=user_profile.age,
+                gender=user_profile.gender,
+                income=user_profile.income,
+                interaction_history=user_profile.interaction_history,
+                k=request.num_recommendations,
+                diversity_weight=0.3 if request.enable_diversity else 0.0,
+                category_boost=request.category_boost if request.enable_category_boost else 1.0
+            )
+        else:
+            # Fallback enhanced engine
+            recommendations = enhanced_recommendation_engine.recommend_items_enhanced_hybrid(
+                age=user_profile.age,
+                gender=user_profile.gender,
+                income=user_profile.income,
+                interaction_history=user_profile.interaction_history,
+                k=request.num_recommendations,
+                collaborative_weight=request.collaborative_weight,
+                category_boost=request.category_boost,
+                enable_category_boost=request.enable_category_boost,
+                enable_diversity=request.enable_diversity
+            )
+
+    elif request.recommendation_type == "enhanced_128d":
+        if enhanced_recommendation_engine is None or not hasattr(enhanced_recommendation_engine, 'recommend_items_enhanced'):
+            raise HTTPException(status_code=503, detail="Enhanced 128D recommendation engine not available")
+
+        recommendations = enhanced_recommendation_engine.recommend_items_enhanced(
+            age=user_profile.age,
+            gender=user_profile.gender,
+            income=user_profile.income,
+            interaction_history=user_profile.interaction_history,
+            k=request.num_recommendations,
+            diversity_weight=0.3 if request.enable_diversity else 0.0,
+            category_boost=request.category_boost if request.enable_category_boost else 1.0
+        )
+
+    elif request.recommendation_type == "category_focused":
+        if enhanced_recommendation_engine is None:
+            raise HTTPException(status_code=503, detail="Enhanced recommendation engine not available")
+
+        recommendations = enhanced_recommendation_engine.recommend_items_category_focused(
+            age=user_profile.age,
+            gender=user_profile.gender,
+            income=user_profile.income,
+            interaction_history=user_profile.interaction_history,
+            k=request.num_recommendations,
+            focus_percentage=0.8
+        )
+
     else:
         raise HTTPException(
             status_code=400,
-            detail="Invalid recommendation_type. Must be 'collaborative', 'content', or 'hybrid'"
+            detail="Invalid recommendation_type. Must be 'collaborative', 'content', 'hybrid', 'enhanced', 'enhanced_128d', or 'category_focused'"
         )
 
     # Format response
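For reference, a client request exercising the new recommendation types added to `api/main.py` might look like the following. This is a sketch only: the localhost URL is an assumption about the local dev setup, and the field names simply mirror the extended `RecommendationRequest` model from the diff above.

```python
import json

# Illustrative payload for POST /recommendations using the new "enhanced" type.
# All values are example data, not taken from the repository.
payload = {
    "user_profile": {
        "age": 32,
        "gender": "male",
        "income": 75000.0,
        "interaction_history": [1000978, 1001588],
    },
    "num_recommendations": 5,
    "recommendation_type": "enhanced",  # or "enhanced_128d" / "category_focused"
    "collaborative_weight": 0.7,
    "category_boost": 1.5,              # applied only when enable_category_boost is True
    "enable_category_boost": True,
    "enable_diversity": True,
}

# e.g. requests.post("http://localhost:8000/recommendations", json=payload)
print(json.dumps(payload, indent=2))
```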
api_2phase.py ADDED
@@ -0,0 +1,521 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ API for 2-Phase Trained Recommendation System
4
+
5
+ This API serves recommendations from a model trained using the 2-phase approach:
6
+ 1. Pre-trained item tower
7
+ 2. Joint training with fine-tuned item tower
8
+
9
+ Usage:
10
+ python api_2phase.py
11
+
12
+ Then access: http://localhost:8000
13
+ """
14
+
15
+ from fastapi import FastAPI, HTTPException
16
+ from fastapi.middleware.cors import CORSMiddleware
17
+ from pydantic import BaseModel
18
+ from typing import List, Optional, Dict, Any
19
+ import uvicorn
20
+ import os
21
+ import sys
22
+ import pandas as pd
23
+
24
+ # Add src to path for imports and set working directory
25
+ parent_dir = os.path.dirname(__file__)
26
+ sys.path.append(parent_dir)
27
+ os.chdir(parent_dir) # Change to project root directory
28
+
29
+ from src.inference.recommendation_engine import RecommendationEngine
30
+ from src.utils.real_user_selector import RealUserSelector
31
+
32
+ # Initialize FastAPI app
33
+ app = FastAPI(
34
+ title="Two-Tower Recommendation API (2-Phase Training)",
35
+ description="API for serving recommendations using a two-tower architecture trained with 2-phase approach",
36
+ version="1.0.0-2phase"
37
+ )
38
+
39
+ # Add CORS middleware
40
+ app.add_middleware(
41
+ CORSMiddleware,
42
+ allow_origins=["*"], # Configure appropriately for production
43
+ allow_credentials=True,
44
+ allow_methods=["*"],
45
+ allow_headers=["*"],
46
+ )
47
+
48
+ # Global instances
49
+ recommendation_engine = None
50
+ enhanced_recommendation_engine = None
51
+ real_user_selector = None
52
+
53
+
54
+ # Pydantic models for request/response
55
+ class UserProfile(BaseModel):
56
+ age: int
57
+ gender: str # "male" or "female"
58
+ income: float
59
+ interaction_history: Optional[List[int]] = []
60
+
61
+
62
+ class RecommendationRequest(BaseModel):
63
+ user_profile: UserProfile
64
+ num_recommendations: int = 10
65
+ recommendation_type: str = "hybrid" # "collaborative", "content", "hybrid", "enhanced", "enhanced_128d", "category_focused"
66
+ collaborative_weight: Optional[float] = 0.7
67
+ category_boost: Optional[float] = 1.5 # For enhanced recommendations
68
+ enable_category_boost: Optional[bool] = True
69
+ enable_diversity: Optional[bool] = True
70
+
71
+
72
+ class ItemSimilarityRequest(BaseModel):
73
+ item_id: int
74
+ num_recommendations: int = 10
75
+
76
+
77
+ class RatingPredictionRequest(BaseModel):
78
+ user_profile: UserProfile
79
+ item_id: int
80
+
81
+
82
+ class ItemInfo(BaseModel):
83
+ product_id: int
84
+ category_id: int
85
+ category_code: str
86
+ brand: str
87
+ price: float
88
+
89
+
90
+ class RecommendationResponse(BaseModel):
91
+ item_id: int
92
+ score: float
93
+ item_info: ItemInfo
94
+
95
+
96
+ class RecommendationsResponse(BaseModel):
97
+ recommendations: List[RecommendationResponse]
98
+ user_profile: UserProfile
99
+ recommendation_type: str
100
+ total_count: int
101
+ training_approach: str = "2-phase"
102
+
103
+
104
+ class RatingPredictionResponse(BaseModel):
105
+ user_profile: UserProfile
106
+ item_id: int
107
+ predicted_rating: float
108
+ item_info: ItemInfo
109
+
110
+
111
+ class RealUserProfile(BaseModel):
112
+ user_id: int
113
+ age: int
114
+ gender: str
115
+ income: int
116
+ interaction_history: List[int]
117
+ interaction_stats: Dict[str, int]
118
+ interaction_pattern: str
119
+ summary: str
120
+
121
+
122
+ class RealUsersResponse(BaseModel):
123
+ users: List[RealUserProfile]
124
+ total_count: int
125
+ dataset_summary: Dict[str, Any]
126
+
127
+
128
+ @app.on_event("startup")
129
+ async def startup_event():
130
+ """Initialize the recommendation engines and real user selector on startup."""
131
+ global recommendation_engine, enhanced_recommendation_engine, real_user_selector
132
+
133
+ print("🚀 Starting 2-Phase Training API...")
134
+ print(" Training approach: Pre-trained item tower + Joint fine-tuning")
135
+
136
+ try:
137
+ print("Loading 2-phase trained recommendation engine...")
138
+ recommendation_engine = RecommendationEngine()
139
+ print("✅ 2-phase recommendation engine loaded successfully!")
140
+ except Exception as e:
141
+ print(f"❌ Error loading recommendation engine: {e}")
142
+ recommendation_engine = None
143
+
144
+ try:
145
+ print("Loading enhanced recommendation engine...")
146
+ # Try enhanced 128D engine first, fallback to regular enhanced
147
+ try:
148
+ from src.inference.enhanced_recommendation_engine_128d import Enhanced128DRecommendationEngine
149
+ enhanced_recommendation_engine = Enhanced128DRecommendationEngine()
150
+ print("✅ Using Enhanced 128D Recommendation Engine")
151
+ except Exception:  # 128D engine unavailable; fall back
152
+ from src.inference.enhanced_recommendation_engine import EnhancedRecommendationEngine
153
+ enhanced_recommendation_engine = EnhancedRecommendationEngine()
154
+ print("⚠️ Using fallback Enhanced Recommendation Engine")
155
+ print("Enhanced recommendation engine loaded successfully!")
156
+ except Exception as e:
157
+ print(f"Error loading enhanced recommendation engine: {e}")
158
+ enhanced_recommendation_engine = None
159
+
160
+ try:
161
+ print("Loading real user selector...")
162
+ real_user_selector = RealUserSelector()
163
+ print("Real user selector loaded successfully!")
164
+ except Exception as e:
165
+ print(f"Error loading real user selector: {e}")
166
+ real_user_selector = None
167
+
168
+ print("🎯 2-Phase API ready to serve recommendations!")
169
+
170
+
171
+ @app.get("/")
172
+ async def root():
173
+ """Root endpoint with API information."""
174
+ return {
175
+ "message": "Two-Tower Recommendation API (2-Phase Training)",
176
+ "version": "1.0.0-2phase",
177
+ "training_approach": "2-phase (pre-trained item tower + joint fine-tuning)",
178
+ "status": "active" if recommendation_engine is not None else "initialization_failed"
179
+ }
180
+
181
+
182
+ @app.get("/health")
183
+ async def health_check():
184
+ """Health check endpoint."""
185
+ return {
186
+ "status": "healthy" if recommendation_engine is not None else "unhealthy",
187
+ "engine_loaded": recommendation_engine is not None,
188
+ "training_approach": "2-phase"
189
+ }
190
+
191
+
192
+ @app.get("/model-info")
193
+ async def model_info():
194
+ """Get information about the loaded model."""
195
+ if recommendation_engine is None:
196
+ raise HTTPException(status_code=503, detail="Recommendation engine not available")
197
+
198
+ return {
199
+ "training_approach": "2-phase",
200
+ "description": "Pre-trained item tower followed by joint training with user tower",
201
+ "phases": [
202
+ "Phase 1: Item tower pre-training on item features only",
203
+ "Phase 2: Joint training of user tower + fine-tuning pre-trained item tower"
204
+ ],
205
+ "embedding_dimension": 128,
206
+ "item_vocab_size": len(recommendation_engine.data_processor.item_vocab) if recommendation_engine.data_processor else "unknown",
207
+ "artifacts_loaded": {
208
+ "item_tower_pretrained": "src/artifacts/item_tower_weights",
209
+ "item_tower_finetuned": "src/artifacts/item_tower_weights_finetuned_best",
210
+ "user_tower": "src/artifacts/user_tower_weights_best",
211
+ "rating_model": "src/artifacts/rating_model_weights_best"
212
+ }
213
+ }
214
+
215
+
216
+ @app.get("/real-users", response_model=RealUsersResponse)
217
+ async def get_real_users(count: int = 100, min_interactions: int = 5):
218
+ """Get real user profiles with genuine interaction histories."""
219
+
220
+ if real_user_selector is None:
221
+ raise HTTPException(status_code=503, detail="Real user selector not available")
222
+
223
+ try:
224
+ # Get real user profiles
225
+ real_users = real_user_selector.get_real_users(n=count, min_interactions=min_interactions)
226
+
227
+ # Get dataset summary
228
+ dataset_summary = real_user_selector.get_dataset_summary()
229
+
230
+ # Format users for response
231
+ formatted_users = []
232
+ for user in real_users:
233
+ formatted_users.append(RealUserProfile(**user))
234
+
235
+ return RealUsersResponse(
236
+ users=formatted_users,
237
+ total_count=len(formatted_users),
238
+ dataset_summary=dataset_summary
239
+ )
240
+
241
+ except Exception as e:
242
+ raise HTTPException(status_code=500, detail=f"Error retrieving real users: {str(e)}")
243
+
244
+
245
+ @app.get("/real-users/{user_id}")
246
+ async def get_real_user_details(user_id: int):
247
+ """Get detailed interaction breakdown for a specific real user."""
248
+
249
+ if real_user_selector is None:
250
+ raise HTTPException(status_code=503, detail="Real user selector not available")
251
+
252
+ try:
253
+ user_details = real_user_selector.get_user_interaction_details(user_id)
254
+
255
+ if "error" in user_details:
256
+ raise HTTPException(status_code=404, detail=user_details["error"])
257
+
258
+ return user_details
259
+
260
+ except Exception as e:
261
+ raise HTTPException(status_code=500, detail=f"Error retrieving user details: {str(e)}")
262
+
263
+
264
+ @app.get("/dataset-summary")
265
+ async def get_dataset_summary():
266
+ """Get summary statistics of the real dataset."""
267
+
268
+ if real_user_selector is None:
269
+ raise HTTPException(status_code=503, detail="Real user selector not available")
270
+
271
+ try:
272
+ return real_user_selector.get_dataset_summary()
273
+
274
+ except Exception as e:
275
+ raise HTTPException(status_code=500, detail=f"Error retrieving dataset summary: {str(e)}")
276
+
277
+
278
+ @app.post("/recommendations", response_model=RecommendationsResponse)
279
+ async def get_recommendations(request: RecommendationRequest):
280
+ """Get item recommendations for a user."""
281
+
282
+ if recommendation_engine is None:
283
+ raise HTTPException(status_code=503, detail="Recommendation engine not available")
284
+
285
+ try:
286
+ user_profile = request.user_profile
287
+
288
+ # Generate recommendations based on type
289
+ if request.recommendation_type == "collaborative":
290
+ recommendations = recommendation_engine.recommend_items_collaborative(
291
+ age=user_profile.age,
292
+ gender=user_profile.gender,
293
+ income=user_profile.income,
294
+ interaction_history=user_profile.interaction_history,
295
+ k=request.num_recommendations
296
+ )
297
+
298
+ elif request.recommendation_type == "content":
299
+ if not user_profile.interaction_history:
300
+ raise HTTPException(
301
+ status_code=400,
302
+ detail="Content-based recommendations require interaction history"
303
+ )
304
+
305
+ # Use most recent interaction as seed
306
+ seed_item = user_profile.interaction_history[-1]
307
+ recommendations = recommendation_engine.recommend_items_content_based(
308
+ seed_item_id=seed_item,
309
+ k=request.num_recommendations
310
+ )
311
+
312
+ elif request.recommendation_type == "hybrid":
313
+ recommendations = recommendation_engine.recommend_items_hybrid(
314
+ age=user_profile.age,
315
+ gender=user_profile.gender,
316
+ income=user_profile.income,
317
+ interaction_history=user_profile.interaction_history,
318
+ k=request.num_recommendations,
319
+ collaborative_weight=request.collaborative_weight
320
+ )
321
+
322
+ elif request.recommendation_type == "enhanced":
323
+ if enhanced_recommendation_engine is None:
324
+ raise HTTPException(status_code=503, detail="Enhanced recommendation engine not available")
325
+
326
+ # Check if it's the 128D engine or fallback
327
+ if hasattr(enhanced_recommendation_engine, 'recommend_items_enhanced'):
328
+ # 128D Enhanced engine
329
+ recommendations = enhanced_recommendation_engine.recommend_items_enhanced(
330
+ age=user_profile.age,
331
+ gender=user_profile.gender,
332
+ income=user_profile.income,
333
+ interaction_history=user_profile.interaction_history,
334
+ k=request.num_recommendations,
335
+ diversity_weight=0.3 if request.enable_diversity else 0.0,
336
+ category_boost=request.category_boost if request.enable_category_boost else 1.0
337
+ )
338
+ else:
339
+ # Fallback enhanced engine
340
+ recommendations = enhanced_recommendation_engine.recommend_items_enhanced_hybrid(
341
+ age=user_profile.age,
342
+ gender=user_profile.gender,
343
+ income=user_profile.income,
344
+ interaction_history=user_profile.interaction_history,
345
+ k=request.num_recommendations,
346
+ collaborative_weight=request.collaborative_weight,
347
+ category_boost=request.category_boost,
348
+ enable_category_boost=request.enable_category_boost,
349
+ enable_diversity=request.enable_diversity
350
+ )
351
+
352
+ elif request.recommendation_type == "enhanced_128d":
353
+ if enhanced_recommendation_engine is None or not hasattr(enhanced_recommendation_engine, 'recommend_items_enhanced'):
354
+ raise HTTPException(status_code=503, detail="Enhanced 128D recommendation engine not available")
355
+
356
+ recommendations = enhanced_recommendation_engine.recommend_items_enhanced(
357
+ age=user_profile.age,
358
+ gender=user_profile.gender,
359
+ income=user_profile.income,
360
+ interaction_history=user_profile.interaction_history,
361
+ k=request.num_recommendations,
362
+ diversity_weight=0.3 if request.enable_diversity else 0.0,
363
+ category_boost=request.category_boost if request.enable_category_boost else 1.0
364
+ )
365
+
366
+ elif request.recommendation_type == "category_focused":
367
+ if enhanced_recommendation_engine is None:
368
+ raise HTTPException(status_code=503, detail="Enhanced recommendation engine not available")
369
+
370
+ recommendations = enhanced_recommendation_engine.recommend_items_category_focused(
371
+ age=user_profile.age,
372
+ gender=user_profile.gender,
373
+ income=user_profile.income,
374
+ interaction_history=user_profile.interaction_history,
375
+ k=request.num_recommendations,
376
+ focus_percentage=0.8
377
+ )
378
+
379
+ else:
380
+ raise HTTPException(
381
+ status_code=400,
382
+ detail="Invalid recommendation_type. Must be 'collaborative', 'content', 'hybrid', 'enhanced', 'enhanced_128d', or 'category_focused'"
383
+ )
384
+
385
+ # Format response
386
+ formatted_recommendations = []
387
+ for item_id, score, item_info in recommendations:
388
+ formatted_recommendations.append(
389
+ RecommendationResponse(
390
+ item_id=item_id,
391
+ score=score,
392
+ item_info=ItemInfo(**item_info)
393
+ )
394
+ )
395
+
396
+ return RecommendationsResponse(
397
+ recommendations=formatted_recommendations,
398
+ user_profile=user_profile,
399
+ recommendation_type=request.recommendation_type,
400
+ total_count=len(formatted_recommendations),
401
+ training_approach="2-phase"
402
+ )
403
+
404
+ except Exception as e:
405
+ raise HTTPException(status_code=500, detail=f"Error generating recommendations: {str(e)}")
406
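The `collaborative_weight` parameter passed to `recommend_items_hybrid` above controls how collaborative and content-based scores are combined. A minimal sketch of such blending (illustrative only — not the engine's actual implementation) could look like:

```python
def blend_scores(collab, content, collaborative_weight=0.7):
    """Hypothetical hybrid blending: per-item weighted sum of
    collaborative and content-based scores."""
    w = collaborative_weight
    items = set(collab) | set(content)
    return {
        item: w * collab.get(item, 0.0) + (1 - w) * content.get(item, 0.0)
        for item in items
    }

# Item 101 scores high collaboratively; 102 and 103 via content similarity.
scores = blend_scores({101: 0.9, 102: 0.4}, {102: 0.8, 103: 0.6})
top = max(scores, key=scores.get)
```

With the default weight of 0.7, the collaborative signal dominates, so item 101 ranks first here.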
+
407
+
408
+ @app.post("/item-similarity", response_model=List[RecommendationResponse])
409
+ async def get_similar_items(request: ItemSimilarityRequest):
410
+ """Get items similar to a given item."""
411
+
412
+ if recommendation_engine is None:
413
+ raise HTTPException(status_code=503, detail="Recommendation engine not available")
414
+
415
+ try:
416
+ recommendations = recommendation_engine.recommend_items_content_based(
417
+ seed_item_id=request.item_id,
418
+ k=request.num_recommendations
419
+ )
420
+
421
+ formatted_recommendations = []
422
+ for item_id, score, item_info in recommendations:
423
+ formatted_recommendations.append(
424
+ RecommendationResponse(
425
+ item_id=item_id,
426
+ score=score,
427
+ item_info=ItemInfo(**item_info)
428
+ )
429
+ )
430
+
431
+ return formatted_recommendations
432
+
433
+ except Exception as e:
434
+ raise HTTPException(status_code=500, detail=f"Error finding similar items: {str(e)}")
435
+
436
+
437
+ @app.post("/predict-rating", response_model=RatingPredictionResponse)
438
+ async def predict_user_item_rating(request: RatingPredictionRequest):
439
+ """Predict rating for a user-item pair."""
440
+
441
+ if recommendation_engine is None:
442
+ raise HTTPException(status_code=503, detail="Recommendation engine not available")
443
+
444
+ try:
445
+ user_profile = request.user_profile
446
+
447
+ predicted_rating = recommendation_engine.predict_rating(
448
+ age=user_profile.age,
449
+ gender=user_profile.gender,
450
+ income=user_profile.income,
451
+ interaction_history=user_profile.interaction_history,
452
+ item_id=request.item_id
453
+ )
454
+
455
+ item_info = recommendation_engine._get_item_info(request.item_id)
456
+
457
+ return RatingPredictionResponse(
458
+ user_profile=user_profile,
459
+ item_id=request.item_id,
460
+ predicted_rating=predicted_rating,
461
+ item_info=ItemInfo(**item_info)
462
+ )
463
+
464
+ except Exception as e:
465
+ raise HTTPException(status_code=500, detail=f"Error predicting rating: {str(e)}")
466
+
467
+
468
+ @app.get("/items/{item_id}", response_model=ItemInfo)
469
+ async def get_item_info(item_id: int):
470
+ """Get information about a specific item."""
471
+
472
+ if recommendation_engine is None:
473
+ raise HTTPException(status_code=503, detail="Recommendation engine not available")
474
+
475
+ try:
476
+ item_info = recommendation_engine._get_item_info(item_id)
477
+ return ItemInfo(**item_info)
478
+
479
+ except Exception as e:
480
+ raise HTTPException(status_code=500, detail=f"Error retrieving item info: {str(e)}")
481
+
482
+
483
+ @app.get("/items")
484
+ async def get_sample_items(limit: int = 20):
485
+ """Get a sample of items for testing."""
486
+
487
+ if recommendation_engine is None:
488
+ raise HTTPException(status_code=503, detail="Recommendation engine not available")
489
+
490
+ try:
491
+ # Get sample items from the dataframe
492
+ sample_items = recommendation_engine.items_df.sample(n=min(limit, len(recommendation_engine.items_df)))
493
+
494
+ items = []
495
+ for _, row in sample_items.iterrows():
496
+ items.append({
497
+ "product_id": int(row['product_id']),
498
+ "category_id": int(row['category_id']),
499
+ "category_code": str(row['category_code']),
500
+ "brand": str(row['brand']) if pd.notna(row['brand']) else 'Unknown',
501
+ "price": float(row['price'])
502
+ })
503
+
504
+ return {"items": items, "total": len(items), "training_approach": "2-phase"}
505
+
506
+ except Exception as e:
507
+ raise HTTPException(status_code=500, detail=f"Error retrieving sample items: {str(e)}")
508
+
509
+
510
+ if __name__ == "__main__":
511
+ print("🚀 Starting 2-Phase Training Recommendation API...")
512
+ print("📊 Training approach: Pre-trained item tower + Joint fine-tuning")
513
+ print("🌐 Server will be available at: http://localhost:8000")
514
+ print("📚 API docs at: http://localhost:8000/docs")
515
+
516
+ uvicorn.run(
517
+ "api_2phase:app",
518
+ host="0.0.0.0",
519
+ port=8000,
520
+ reload=True
521
+ )
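With the server running on port 8000 as configured above, a client only needs to build URLs against the routes. A small helper for the `GET /real-users` endpoint (query parameter names come from the endpoint signature; the base URL is the default from the `uvicorn.run` call):

```python
from urllib.parse import urlencode

def real_users_url(base_url="http://localhost:8000", count=100, min_interactions=5):
    """Build the GET /real-users URL with its query parameters."""
    query = urlencode({"count": count, "min_interactions": min_interactions})
    return f"{base_url}/real-users?{query}"

url = real_users_url(count=20)
```

The resulting URL can then be fetched with any HTTP client (e.g. `urllib.request.urlopen(url)`) once the API is up.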
api_joint.py ADDED
@@ -0,0 +1,522 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ API for Single Joint Trained Recommendation System
4
+
5
+ This API serves recommendations from a model trained using the single joint approach:
6
+ - Both user and item towers trained simultaneously from scratch
7
+ - End-to-end optimization without pre-training phases
8
+
9
+ Usage:
10
+ python api_joint.py
11
+
12
+ Then access: http://localhost:8000
13
+ """
14
+
15
+ from fastapi import FastAPI, HTTPException
16
+ from fastapi.middleware.cors import CORSMiddleware
17
+ from pydantic import BaseModel
18
+ from typing import List, Optional, Dict, Any
19
+ import uvicorn
20
+ import os
21
+ import sys
22
+ import pandas as pd
23
+
24
+ # Add src to path for imports and set working directory
25
+ parent_dir = os.path.dirname(__file__)
26
+ sys.path.append(parent_dir)
27
+ os.chdir(parent_dir) # Change to project root directory
28
+
29
+ from src.inference.recommendation_engine import RecommendationEngine
30
+ from src.utils.real_user_selector import RealUserSelector
31
+
32
+ # Initialize FastAPI app
33
+ app = FastAPI(
34
+ title="Two-Tower Recommendation API (Single Joint Training)",
35
+ description="API for serving recommendations using a two-tower architecture trained with a single joint approach",
36
+ version="1.0.0-joint"
37
+ )
38
+
39
+ # Add CORS middleware
40
+ app.add_middleware(
41
+ CORSMiddleware,
42
+ allow_origins=["*"], # Configure appropriately for production
43
+ allow_credentials=True,
44
+ allow_methods=["*"],
45
+ allow_headers=["*"],
46
+ )
47
+
48
+ # Global instances
49
+ recommendation_engine = None
50
+ enhanced_recommendation_engine = None
51
+ real_user_selector = None
52
+
53
+
54
+ # Pydantic models for request/response
55
+ class UserProfile(BaseModel):
56
+ age: int
57
+ gender: str # "male" or "female"
58
+ income: float
59
+ interaction_history: Optional[List[int]] = []
60
+
61
+
62
+ class RecommendationRequest(BaseModel):
63
+ user_profile: UserProfile
64
+ num_recommendations: int = 10
65
+ recommendation_type: str = "hybrid" # "collaborative", "content", "hybrid", "enhanced", "enhanced_128d", "category_focused"
66
+ collaborative_weight: Optional[float] = 0.7
67
+ category_boost: Optional[float] = 1.5 # For enhanced recommendations
68
+ enable_category_boost: Optional[bool] = True
69
+ enable_diversity: Optional[bool] = True
70
+
71
+
72
+ class ItemSimilarityRequest(BaseModel):
73
+ item_id: int
74
+ num_recommendations: int = 10
75
+
76
+
77
+ class RatingPredictionRequest(BaseModel):
78
+ user_profile: UserProfile
79
+ item_id: int
80
+
81
+
82
+ class ItemInfo(BaseModel):
83
+ product_id: int
84
+ category_id: int
85
+ category_code: str
86
+ brand: str
87
+ price: float
88
+
89
+
90
+ class RecommendationResponse(BaseModel):
91
+ item_id: int
92
+ score: float
93
+ item_info: ItemInfo
94
+
95
+
96
+ class RecommendationsResponse(BaseModel):
97
+ recommendations: List[RecommendationResponse]
98
+ user_profile: UserProfile
99
+ recommendation_type: str
100
+ total_count: int
101
+ training_approach: str = "single-joint"
102
+
103
+
104
+ class RatingPredictionResponse(BaseModel):
105
+ user_profile: UserProfile
106
+ item_id: int
107
+ predicted_rating: float
108
+ item_info: ItemInfo
109
+
110
+
111
+ class RealUserProfile(BaseModel):
112
+ user_id: int
113
+ age: int
114
+ gender: str
115
+ income: int
116
+ interaction_history: List[int]
117
+ interaction_stats: Dict[str, int]
118
+ interaction_pattern: str
119
+ summary: str
120
+
121
+
122
+ class RealUsersResponse(BaseModel):
123
+ users: List[RealUserProfile]
124
+ total_count: int
125
+ dataset_summary: Dict[str, Any]
126
+
127
+
128
+ @app.on_event("startup")
129
+ async def startup_event():
130
+ """Initialize the recommendation engines and real user selector on startup."""
131
+ global recommendation_engine, enhanced_recommendation_engine, real_user_selector
132
+
133
+ print("🚀 Starting Single Joint Training API...")
134
+ print(" Training approach: End-to-end joint optimization from scratch")
135
+
136
+ try:
137
+ print("Loading single joint trained recommendation engine...")
138
+ recommendation_engine = RecommendationEngine()
139
+ print("✅ Single joint recommendation engine loaded successfully!")
140
+ except Exception as e:
141
+ print(f"❌ Error loading recommendation engine: {e}")
142
+ recommendation_engine = None
143
+
144
+ try:
145
+ print("Loading enhanced recommendation engine...")
146
+ # Try enhanced 128D engine first, fallback to regular enhanced
147
+ try:
148
+ from src.inference.enhanced_recommendation_engine_128d import Enhanced128DRecommendationEngine
149
+ enhanced_recommendation_engine = Enhanced128DRecommendationEngine()
150
+ print("✅ Using Enhanced 128D Recommendation Engine")
151
+ except Exception:  # 128D engine unavailable; fall back
152
+ from src.inference.enhanced_recommendation_engine import EnhancedRecommendationEngine
153
+ enhanced_recommendation_engine = EnhancedRecommendationEngine()
154
+ print("⚠️ Using fallback Enhanced Recommendation Engine")
155
+ print("Enhanced recommendation engine loaded successfully!")
156
+ except Exception as e:
157
+ print(f"Error loading enhanced recommendation engine: {e}")
158
+ enhanced_recommendation_engine = None
159
+
160
+ try:
161
+ print("Loading real user selector...")
162
+ real_user_selector = RealUserSelector()
163
+ print("Real user selector loaded successfully!")
164
+ except Exception as e:
165
+ print(f"Error loading real user selector: {e}")
166
+ real_user_selector = None
167
+
168
+ print("🎯 Single Joint API ready to serve recommendations!")
169
+
170
+
171
+ @app.get("/")
172
+ async def root():
173
+ """Root endpoint with API information."""
174
+ return {
175
+ "message": "Two-Tower Recommendation API (Single Joint Training)",
176
+ "version": "1.0.0-joint",
177
+ "training_approach": "single-joint (end-to-end optimization from scratch)",
178
+ "status": "active" if recommendation_engine is not None else "initialization_failed"
179
+ }
180
+
181
+
182
+ @app.get("/health")
183
+ async def health_check():
184
+ """Health check endpoint."""
185
+ return {
186
+ "status": "healthy" if recommendation_engine is not None else "unhealthy",
187
+ "engine_loaded": recommendation_engine is not None,
188
+ "training_approach": "single-joint"
189
+ }
190
+
191
+
192
+ @app.get("/model-info")
193
+ async def model_info():
194
+ """Get information about the loaded model."""
195
+ if recommendation_engine is None:
196
+ raise HTTPException(status_code=503, detail="Recommendation engine not available")
197
+
198
+ return {
199
+ "training_approach": "single-joint",
200
+ "description": "User and item towers trained simultaneously from scratch",
201
+ "advantages": [
202
+ "End-to-end optimization for better task alignment",
203
+ "No pre-training phase required",
204
+ "Faster overall training pipeline",
205
+ "Direct optimization for recommendation objectives"
206
+ ],
207
+ "embedding_dimension": 128,
208
+ "item_vocab_size": len(recommendation_engine.data_processor.item_vocab) if recommendation_engine.data_processor else "unknown",
209
+ "artifacts_loaded": {
210
+ "user_tower": "src/artifacts/user_tower_weights_best",
211
+ "item_tower_joint": "src/artifacts/item_tower_weights_finetuned_best",
212
+ "rating_model": "src/artifacts/rating_model_weights_best"
213
+ }
214
+ }
215
+
216
+
217
+ @app.get("/real-users", response_model=RealUsersResponse)
218
+ async def get_real_users(count: int = 100, min_interactions: int = 5):
219
+ """Get real user profiles with genuine interaction histories."""
220
+
221
+ if real_user_selector is None:
222
+ raise HTTPException(status_code=503, detail="Real user selector not available")
223
+
224
+ try:
225
+ # Get real user profiles
226
+ real_users = real_user_selector.get_real_users(n=count, min_interactions=min_interactions)
227
+
228
+ # Get dataset summary
229
+ dataset_summary = real_user_selector.get_dataset_summary()
230
+
231
+ # Format users for response
232
+ formatted_users = []
233
+ for user in real_users:
234
+ formatted_users.append(RealUserProfile(**user))
235
+
236
+ return RealUsersResponse(
237
+ users=formatted_users,
238
+ total_count=len(formatted_users),
239
+ dataset_summary=dataset_summary
240
+ )
241
+
242
+ except Exception as e:
243
+ raise HTTPException(status_code=500, detail=f"Error retrieving real users: {str(e)}")
244
+
245
+
246
+ @app.get("/real-users/{user_id}")
247
+ async def get_real_user_details(user_id: int):
248
+ """Get detailed interaction breakdown for a specific real user."""
249
+
250
+ if real_user_selector is None:
251
+ raise HTTPException(status_code=503, detail="Real user selector not available")
252
+
253
+ try:
254
+ user_details = real_user_selector.get_user_interaction_details(user_id)
255
+
256
+ if "error" in user_details:
257
+ raise HTTPException(status_code=404, detail=user_details["error"])
258
+
259
+ return user_details
260
+
261
+ except Exception as e:
262
+ raise HTTPException(status_code=500, detail=f"Error retrieving user details: {str(e)}")
263
+
264
+
265
+ @app.get("/dataset-summary")
266
+ async def get_dataset_summary():
267
+ """Get summary statistics of the real dataset."""
268
+
269
+ if real_user_selector is None:
270
+ raise HTTPException(status_code=503, detail="Real user selector not available")
271
+
272
+ try:
273
+ return real_user_selector.get_dataset_summary()
274
+
275
+ except Exception as e:
276
+ raise HTTPException(status_code=500, detail=f"Error retrieving dataset summary: {str(e)}")
277
+
278
+
279
+ @app.post("/recommendations", response_model=RecommendationsResponse)
280
+ async def get_recommendations(request: RecommendationRequest):
281
+ """Get item recommendations for a user."""
282
+
283
+ if recommendation_engine is None:
284
+ raise HTTPException(status_code=503, detail="Recommendation engine not available")
285
+
286
+ try:
287
+ user_profile = request.user_profile
288
+
289
+ # Generate recommendations based on type
290
+ if request.recommendation_type == "collaborative":
291
+ recommendations = recommendation_engine.recommend_items_collaborative(
292
+ age=user_profile.age,
293
+ gender=user_profile.gender,
294
+ income=user_profile.income,
295
+ interaction_history=user_profile.interaction_history,
296
+ k=request.num_recommendations
297
+ )
298
+
299
+ elif request.recommendation_type == "content":
300
+ if not user_profile.interaction_history:
301
+ raise HTTPException(
302
+ status_code=400,
303
+ detail="Content-based recommendations require interaction history"
304
+ )
305
+
306
+ # Use most recent interaction as seed
307
+ seed_item = user_profile.interaction_history[-1]
308
+ recommendations = recommendation_engine.recommend_items_content_based(
309
+ seed_item_id=seed_item,
310
+ k=request.num_recommendations
311
+ )
312
+
313
+ elif request.recommendation_type == "hybrid":
314
+ recommendations = recommendation_engine.recommend_items_hybrid(
315
+ age=user_profile.age,
316
+ gender=user_profile.gender,
317
+ income=user_profile.income,
318
+ interaction_history=user_profile.interaction_history,
319
+ k=request.num_recommendations,
320
+ collaborative_weight=request.collaborative_weight
321
+ )
322
+
323
+ elif request.recommendation_type == "enhanced":
324
+ if enhanced_recommendation_engine is None:
325
+ raise HTTPException(status_code=503, detail="Enhanced recommendation engine not available")
326
+
327
+ # Check if it's the 128D engine or fallback
328
+ if hasattr(enhanced_recommendation_engine, 'recommend_items_enhanced'):
329
+ # 128D Enhanced engine
330
+ recommendations = enhanced_recommendation_engine.recommend_items_enhanced(
331
+ age=user_profile.age,
332
+ gender=user_profile.gender,
333
+ income=user_profile.income,
334
+ interaction_history=user_profile.interaction_history,
335
+ k=request.num_recommendations,
336
+ diversity_weight=0.3 if request.enable_diversity else 0.0,
337
+ category_boost=request.category_boost if request.enable_category_boost else 1.0
338
+ )
339
+ else:
340
+ # Fallback enhanced engine
341
+ recommendations = enhanced_recommendation_engine.recommend_items_enhanced_hybrid(
342
+ age=user_profile.age,
343
+ gender=user_profile.gender,
344
+ income=user_profile.income,
345
+ interaction_history=user_profile.interaction_history,
346
+ k=request.num_recommendations,
347
+ collaborative_weight=request.collaborative_weight,
348
+ category_boost=request.category_boost,
349
+ enable_category_boost=request.enable_category_boost,
350
+ enable_diversity=request.enable_diversity
351
+ )
352
+
353
+ elif request.recommendation_type == "enhanced_128d":
354
+ if enhanced_recommendation_engine is None or not hasattr(enhanced_recommendation_engine, 'recommend_items_enhanced'):
355
+ raise HTTPException(status_code=503, detail="Enhanced 128D recommendation engine not available")
356
+
357
+ recommendations = enhanced_recommendation_engine.recommend_items_enhanced(
358
+ age=user_profile.age,
359
+ gender=user_profile.gender,
360
+ income=user_profile.income,
361
+ interaction_history=user_profile.interaction_history,
362
+ k=request.num_recommendations,
363
+ diversity_weight=0.3 if request.enable_diversity else 0.0,
364
+ category_boost=request.category_boost if request.enable_category_boost else 1.0
365
+ )
366
+
367
+ elif request.recommendation_type == "category_focused":
368
+ if enhanced_recommendation_engine is None:
369
+            raise HTTPException(status_code=503, detail="Enhanced recommendation engine not available")
+
+            recommendations = enhanced_recommendation_engine.recommend_items_category_focused(
+                age=user_profile.age,
+                gender=user_profile.gender,
+                income=user_profile.income,
+                interaction_history=user_profile.interaction_history,
+                k=request.num_recommendations,
+                focus_percentage=0.8
+            )
+
+        else:
+            raise HTTPException(
+                status_code=400,
+                detail="Invalid recommendation_type. Must be 'collaborative', 'content', 'hybrid', 'enhanced', 'enhanced_128d', or 'category_focused'"
+            )
+
+        # Format response
+        formatted_recommendations = []
+        for item_id, score, item_info in recommendations:
+            formatted_recommendations.append(
+                RecommendationResponse(
+                    item_id=item_id,
+                    score=score,
+                    item_info=ItemInfo(**item_info)
+                )
+            )
+
+        return RecommendationsResponse(
+            recommendations=formatted_recommendations,
+            user_profile=user_profile,
+            recommendation_type=request.recommendation_type,
+            total_count=len(formatted_recommendations),
+            training_approach="single-joint"
+        )
+
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=f"Error generating recommendations: {str(e)}")
+
+
+@app.post("/item-similarity", response_model=List[RecommendationResponse])
+async def get_similar_items(request: ItemSimilarityRequest):
+    """Get items similar to a given item."""
+
+    if recommendation_engine is None:
+        raise HTTPException(status_code=503, detail="Recommendation engine not available")
+
+    try:
+        recommendations = recommendation_engine.recommend_items_content_based(
+            seed_item_id=request.item_id,
+            k=request.num_recommendations
+        )
+
+        formatted_recommendations = []
+        for item_id, score, item_info in recommendations:
+            formatted_recommendations.append(
+                RecommendationResponse(
+                    item_id=item_id,
+                    score=score,
+                    item_info=ItemInfo(**item_info)
+                )
+            )
+
+        return formatted_recommendations
+
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=f"Error finding similar items: {str(e)}")
+
+
+@app.post("/predict-rating", response_model=RatingPredictionResponse)
+async def predict_user_item_rating(request: RatingPredictionRequest):
+    """Predict rating for a user-item pair."""
+
+    if recommendation_engine is None:
+        raise HTTPException(status_code=503, detail="Recommendation engine not available")
+
+    try:
+        user_profile = request.user_profile
+
+        predicted_rating = recommendation_engine.predict_rating(
+            age=user_profile.age,
+            gender=user_profile.gender,
+            income=user_profile.income,
+            interaction_history=user_profile.interaction_history,
+            item_id=request.item_id
+        )
+
+        item_info = recommendation_engine._get_item_info(request.item_id)
+
+        return RatingPredictionResponse(
+            user_profile=user_profile,
+            item_id=request.item_id,
+            predicted_rating=predicted_rating,
+            item_info=ItemInfo(**item_info)
+        )
+
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=f"Error predicting rating: {str(e)}")
+
+
+@app.get("/items/{item_id}", response_model=ItemInfo)
+async def get_item_info(item_id: int):
+    """Get information about a specific item."""
+
+    if recommendation_engine is None:
+        raise HTTPException(status_code=503, detail="Recommendation engine not available")
+
+    try:
+        item_info = recommendation_engine._get_item_info(item_id)
+        return ItemInfo(**item_info)
+
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=f"Error retrieving item info: {str(e)}")
+
+
+@app.get("/items")
+async def get_sample_items(limit: int = 20):
+    """Get a sample of items for testing."""
+
+    if recommendation_engine is None:
+        raise HTTPException(status_code=503, detail="Recommendation engine not available")
+
+    try:
+        # Get sample items from the dataframe
+        sample_items = recommendation_engine.items_df.sample(n=min(limit, len(recommendation_engine.items_df)))
+
+        items = []
+        for _, row in sample_items.iterrows():
+            items.append({
+                "product_id": int(row['product_id']),
+                "category_id": int(row['category_id']),
+                "category_code": str(row['category_code']),
+                "brand": str(row['brand']) if pd.notna(row['brand']) else 'Unknown',
+                "price": float(row['price'])
+            })
+
+        return {"items": items, "total": len(items), "training_approach": "single-joint"}
+
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=f"Error retrieving sample items: {str(e)}")
+
+
+if __name__ == "__main__":
+    print("🚀 Starting Single Joint Training Recommendation API...")
+    print("⚡ Training approach: End-to-end joint optimization from scratch")
+    print("🌐 Server will be available at: http://localhost:8000")
+    print("📚 API docs at: http://localhost:8000/docs")
+
+    uvicorn.run(
+        "api_joint:app",
+        host="0.0.0.0",
+        port=8000,
+        reload=True
+    )
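For a local smoke test of the new endpoints above, a minimal client sketch can assemble the request body the handlers read (`user_profile.age`, `gender`, `income`, `interaction_history`, plus `num_recommendations` and `recommendation_type`). The exact Pydantic request models live in parts of api_joint.py outside this hunk, so the payload shape below is an assumption inferred from the handler code, not the confirmed schema.

```python
import json

def build_recommendation_request(age, gender, income, history,
                                 k=10, rec_type="category_focused"):
    """Assemble a hypothetical /recommendations request body.

    Field names mirror what the diff's handlers access; the real
    schema is defined elsewhere in api_joint.py.
    """
    return {
        "user_profile": {
            "age": age,
            "gender": gender,
            "income": income,
            "interaction_history": history,
        },
        "num_recommendations": k,
        "recommendation_type": rec_type,
    }

if __name__ == "__main__":
    payload = build_recommendation_request(34, "female", 72000, [], k=5)
    print(json.dumps(payload, indent=2))
    # Send with e.g.:
    # requests.post("http://localhost:8000/recommendations", json=payload)
```

With the server started via `python api_joint.py`, the same payload can also be tried interactively from the auto-generated docs at http://localhost:8000/docs.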
frontend/src/App.css CHANGED
@@ -1,33 +1,1048 @@
 .App {
   text-align: center;
 }
 
-.App-logo {
-  height: 40vmin;
-  pointer-events: none;
 }
 
-@media (prefers-reduced-motion: no-preference) {
-  .App-logo {
-    animation: App-logo-spin infinite 20s linear;
   }
 }
 
-.App-header {
-  background-color: #282c34;
   padding: 20px;
   color: white;
 }
 
-.App-link {
-  color: #61dafb;
 }
 
-@keyframes App-logo-spin {
-  from {
-    transform: rotate(0deg);
 }
-  to {
-    transform: rotate(360deg);
 }
 }
 .App {
   text-align: center;
+  font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', sans-serif;
 }
 
+.container {
+  max-width: 1200px;
+  margin: 0 auto;
+  padding: 20px;
+  text-align: left;
+}
+
+.header {
+  text-align: center;
+  margin-bottom: 30px;
+  padding: 20px;
+  background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
+  color: white;
+  border-radius: 10px;
+}
+
+.header h1 {
+  margin: 0 0 10px 0;
+  font-size: 2.5rem;
+}
+
+.header p {
+  margin: 0;
+  opacity: 0.9;
+  font-size: 1.1rem;
+}
+
+.dataset-info {
+  margin-top: 15px;
+  padding: 10px;
+  background: rgba(255, 255, 255, 0.2);
+  border-radius: 5px;
+  font-size: 0.95rem;
+}
+
+/* Real User Selector */
+.real-user-selector {
+  background: #e8f5e8;
+  padding: 25px;
+  border-radius: 10px;
+  margin-bottom: 30px;
+  border: 1px solid #c3e6c3;
+}
+
+.real-user-selector h2 {
+  margin-top: 0;
+  color: #2d5a2d;
+  border-bottom: 2px solid #28a745;
+  padding-bottom: 10px;
+}
+
+.user-selector-controls {
+  display: flex;
+  align-items: center;
+  gap: 15px;
+  margin-bottom: 20px;
+}
+
+.user-selector-controls label {
+  font-weight: bold;
+  min-width: 200px;
+}
+
+.user-selector-controls select {
+  flex: 1;
+  padding: 8px 12px;
+  border: 1px solid #28a745;
+  border-radius: 5px;
+  font-size: 14px;
+}
+
+.selected-real-user {
+  background: rgba(255, 255, 255, 0.8);
+  padding: 20px;
+  border-radius: 8px;
+  border: 1px solid #28a745;
+}
+
+.real-user-stats {
+  display: grid;
+  gap: 12px;
+}
+
+.user-stat {
+  display: flex;
+  justify-content: space-between;
+  padding: 8px 0;
+  border-bottom: 1px solid #ddd;
+}
+
+.stat-label {
+  font-weight: bold;
+  color: #495057;
+}
+
+.stat-value {
+  color: #28a745;
+  font-weight: 600;
+}
+
+.real-interaction-summary {
+  display: flex;
+  gap: 20px;
+  justify-content: center;
+  margin: 20px 0;
+}
+
+.real-history-info {
+  background: rgba(255, 255, 255, 0.8);
+  padding: 15px;
+  border-radius: 8px;
+  margin-top: 15px;
+}
+
+.real-history-info p {
+  margin: 8px 0;
+}
+
+/* Expand Interactions Button */
+.expand-interactions-btn {
+  background: #007bff;
+  color: white;
+  border: 1px solid #007bff;
+  border-radius: 5px;
+  padding: 10px 20px;
+  margin-top: 15px;
+  cursor: pointer;
+  font-size: 14px;
+  transition: background-color 0.3s;
+}
+
+.expand-interactions-btn:hover:not(:disabled) {
+  background: #0056b3;
+}
+
+.expand-interactions-btn:disabled {
+  opacity: 0.6;
+  cursor: not-allowed;
+}
+
+/* User Interactions Timeline */
+.user-interactions-timeline {
+  margin-top: 20px;
+  padding: 20px;
+  background: rgba(255, 255, 255, 0.95);
+  border-radius: 8px;
+  border: 1px solid #007bff;
+}
+
+.user-interactions-timeline h4 {
+  margin-top: 0;
+  color: #007bff;
+  border-bottom: 2px solid #007bff;
+  padding-bottom: 8px;
+}
+
+.user-interactions-timeline h5 {
+  color: #495057;
+  margin: 15px 0 10px 0;
+}
+
+.timeline-stats {
+  display: flex;
+  flex-wrap: wrap;
+  gap: 20px;
+  margin: 15px 0;
+  padding: 15px;
+  background: #f8f9fa;
+  border-radius: 5px;
+  font-size: 14px;
+}
+
+.timeline-stats span {
+  color: #495057;
+}
+
+/* Interactions List */
+.interactions-list {
+  max-height: 400px;
+  overflow-y: auto;
+  border: 1px solid #e9ecef;
+  border-radius: 5px;
+  background: white;
+}
+
+.interaction-timeline-item {
+  display: flex;
+  align-items: center;
+  padding: 12px 15px;
+  border-bottom: 1px solid #e9ecef;
+  transition: background-color 0.2s;
+}
+
+.interaction-timeline-item:hover {
+  background: #f8f9fa;
+}
+
+.interaction-timeline-item:last-child {
+  border-bottom: none;
+}
+
+.interaction-timeline-time {
+  flex: 0 0 180px;
+  font-size: 12px;
+  color: #6c757d;
+  font-family: monospace;
+}
+
+.interaction-timeline-content {
+  flex: 1;
+  display: flex;
+  flex-direction: column;
+  gap: 5px;
+}
+
+.interaction-main-info {
+  display: flex;
+  align-items: center;
+  gap: 10px;
+}
+
+.interaction-item-details {
+  display: flex;
+  align-items: center;
+  gap: 15px;
+  padding-left: 20px;
+  font-size: 13px;
+}
+
+.interaction-item-id {
+  color: #495057;
+  font-size: 13px;
+  font-weight: 500;
+}
+
+.interaction-icon {
+  font-size: 16px;
+  min-width: 20px;
+}
+
+.item-brand {
+  color: #007bff;
+  font-size: 14px;
+  min-width: 120px;
+}
+
+.item-category {
+  color: #6c757d;
+  font-size: 12px;
+  background: #f8f9fa;
+  padding: 2px 6px;
+  border-radius: 3px;
+  border: 1px solid #e9ecef;
+  min-width: 150px;
+  text-align: center;
+}
+
+.item-price {
+  color: #28a745;
+  font-weight: 600;
+  font-size: 13px;
+  min-width: 70px;
+  text-align: right;
+}
+
+/* Interaction type badges in timeline */
+.interaction-timeline-content .interaction-type {
+  font-size: 11px;
+  font-weight: bold;
+  padding: 3px 8px;
+  border-radius: 12px;
+  text-transform: uppercase;
+  letter-spacing: 0.5px;
+}
+
+.interaction-timeline-content .interaction-type.view {
+  background: #d1ecf1;
+  color: #0c5460;
+}
+
+.interaction-timeline-content .interaction-type.cart {
+  background: #fff3cd;
+  color: #856404;
+}
+
+.interaction-timeline-content .interaction-type.purchase {
+  background: #d4edda;
+  color: #155724;
+}
+
+/* Category Analysis Columns */
+.category-analysis {
+  margin-top: 20px;
+  padding: 20px;
+  background: rgba(255, 255, 255, 0.95);
+  border-radius: 8px;
+  border: 1px solid #6c757d;
+}
+
+.category-analysis h4 {
+  margin-top: 0;
+  color: #495057;
+  border-bottom: 2px solid #6c757d;
+  padding-bottom: 8px;
+}
+
+.category-columns {
+  display: grid;
+  grid-template-columns: 1fr 1fr;
+  gap: 30px;
+  margin: 20px 0;
+}
+
+.category-column {
+  background: #f8f9fa;
+  padding: 15px;
+  border-radius: 8px;
+  border: 1px solid #dee2e6;
+}
+
+.category-column h5 {
+  margin-top: 0;
+  margin-bottom: 15px;
+  color: #495057;
+  font-size: 16px;
+  padding-bottom: 5px;
+  border-bottom: 1px solid #dee2e6;
+}
+
+.category-percentages {
+  display: flex;
+  flex-direction: column;
+  gap: 8px;
+}
+
+.category-item {
+  display: flex;
+  align-items: center;
+  gap: 10px;
+  padding: 6px 0;
+}
+
+.category-bar-container {
+  flex: 0 0 80px;
+  height: 16px;
+  background: #e9ecef;
+  border-radius: 8px;
+  overflow: hidden;
+  position: relative;
+}
+
+.category-bar {
+  height: 100%;
+  border-radius: 8px;
+  transition: width 0.3s ease;
+}
+
+.category-bar.user-category {
+  background: linear-gradient(90deg, #007bff, #0056b3);
+}
+
+.category-bar.rec-category-matched {
+  background: linear-gradient(90deg, #28a745, #1e7e34);
+}
+
+.category-bar.rec-category-new {
+  background: linear-gradient(90deg, #ffc107, #e0a800);
 }
 
+.category-label {
+  flex: 1;
+  font-size: 12px;
+  color: #495057;
+  text-overflow: ellipsis;
+  overflow: hidden;
+  white-space: nowrap;
+}
+
+.category-percent {
+  flex: 0 0 40px;
+  font-size: 11px;
+  font-weight: bold;
+  text-align: right;
+  color: #495057;
+}
+
+.match-indicator {
+  color: #28a745;
+  font-weight: bold;
+  font-size: 14px;
+}
+
+.category-match-summary {
+  margin-top: 15px;
+  padding: 15px;
+  background: #e7f3ff;
+  border-radius: 6px;
+  border-left: 4px solid #007bff;
+}
+
+.category-match-summary p {
+  margin: 0;
+  font-size: 14px;
+  color: #495057;
+}
+
+.match-legend {
+  display: flex;
+  gap: 20px;
+  margin-top: 8px;
+  font-size: 12px;
+}
+
+.legend-item {
+  display: flex;
+  align-items: center;
+  gap: 6px;
+}
+
+.legend-dot {
+  width: 12px;
+  height: 12px;
+  border-radius: 50%;
+}
+
+.legend-dot.matched {
+  background: #28a745;
+}
+
+.legend-dot.new {
+  background: #ffc107;
+}
+
+/* Responsive design for smaller screens */
+@media (max-width: 768px) {
+  .category-columns {
+    grid-template-columns: 1fr;
+    gap: 20px;
   }
+
+  .category-bar-container {
+    flex: 0 0 60px;
+  }
+
+  .match-legend {
+    flex-direction: column;
+    gap: 8px;
+  }
+}
+
+/* Pagination Styles */
+.pagination-info {
+  margin: 20px 0;
+  text-align: center;
 }
 
+.pagination-info p {
+  margin: 0 0 15px 0;
+  color: #6c757d;
+  font-size: 14px;
+}
+
+.pagination-controls {
+  display: flex;
+  align-items: center;
+  justify-content: center;
+  gap: 10px;
+  margin: 15px 0;
+}
+
+.pagination-btn {
+  background: #007bff;
+  color: white;
+  border: 1px solid #007bff;
+  border-radius: 5px;
+  padding: 8px 16px;
+  cursor: pointer;
+  font-size: 14px;
+  transition: background-color 0.3s;
+}
+
+.pagination-btn:hover:not(:disabled) {
+  background: #0056b3;
+  border-color: #0056b3;
+}
+
+.pagination-btn:disabled {
+  background: #6c757d;
+  border-color: #6c757d;
+  cursor: not-allowed;
+  opacity: 0.65;
+}
+
+.page-numbers {
+  display: flex;
+  align-items: center;
+  gap: 5px;
+}
+
+.page-number {
+  background: #f8f9fa;
+  color: #007bff;
+  border: 1px solid #dee2e6;
+  border-radius: 3px;
+  padding: 6px 10px;
+  cursor: pointer;
+  font-size: 14px;
+  transition: all 0.3s;
+  min-width: 35px;
+}
+
+.page-number:hover {
+  background: #e9ecef;
+  border-color: #007bff;
+}
+
+.page-number.active {
+  background: #007bff;
+  color: white;
+  border-color: #007bff;
+}
+
+.pagination-ellipsis {
+  color: #6c757d;
+  padding: 0 5px;
+}
+
+.bottom-pagination {
+  border-top: 1px solid #dee2e6;
+  padding-top: 20px;
+  margin-top: 20px;
+}
+
+.page-indicator {
+  color: #6c757d;
+  font-size: 14px;
+  margin: 0 15px;
+}
+
+/* Responsive pagination */
+@media (max-width: 768px) {
+  .pagination-controls {
+    flex-wrap: wrap;
+    gap: 8px;
+  }
+
+  .page-numbers {
+    order: 3;
+    width: 100%;
+    justify-content: center;
+    margin-top: 10px;
+  }
+
+  .pagination-btn {
+    padding: 6px 12px;
+    font-size: 13px;
+  }
+
+  .page-number {
+    padding: 5px 8px;
+    min-width: 30px;
+    font-size: 13px;
+  }
+}
+
+/* User Profile Form */
+.user-profile-form {
+  background: #f8f9fa;
+  padding: 25px;
+  border-radius: 10px;
+  margin-bottom: 30px;
+  border: 1px solid #e9ecef;
+}
+
+.user-profile-form h2 {
+  margin-top: 0;
+  color: #495057;
+  border-bottom: 2px solid #007bff;
+  padding-bottom: 10px;
+}
+
+.form-row {
+  display: flex;
+  gap: 20px;
+  flex-wrap: wrap;
+}
+
+.form-group {
+  flex: 1;
+  min-width: 200px;
+}
+
+.form-group label {
+  display: block;
+  margin-bottom: 5px;
+  font-weight: 600;
+  color: #495057;
+}
+
+.form-group input,
+.form-group select {
+  width: 100%;
+  padding: 10px;
+  border: 1px solid #ced4da;
+  border-radius: 5px;
+  font-size: 14px;
+}
+
+.form-group input:focus,
+.form-group select:focus {
+  outline: none;
+  border-color: #007bff;
+  box-shadow: 0 0 0 0.2rem rgba(0, 123, 255, 0.25);
+}
+
+/* Interaction Patterns */
+.interaction-patterns {
+  background: #fff;
+  padding: 25px;
+  border-radius: 10px;
+  margin-bottom: 30px;
+  border: 1px solid #e9ecef;
+  box-shadow: 0 2px 4px rgba(0,0,0,0.1);
+}
+
+.interaction-patterns h2 {
+  margin-top: 0;
+  color: #495057;
+  border-bottom: 2px solid #28a745;
+  padding-bottom: 10px;
+}
+
+.pattern-buttons {
+  display: flex;
+  gap: 15px;
+  margin: 20px 0;
+  flex-wrap: wrap;
+}
+
+.pattern-btn {
+  padding: 15px 20px;
+  border: 2px solid #007bff;
+  background: white;
+  color: #007bff;
+  border-radius: 8px;
+  cursor: pointer;
+  transition: all 0.3s ease;
+  font-size: 14px;
+  text-align: center;
+  min-width: 120px;
+}
+
+.pattern-btn:hover {
+  background: #007bff;
+  color: white;
+  transform: translateY(-2px);
+  box-shadow: 0 4px 8px rgba(0,123,255,0.3);
+}
+
+.pattern-btn.active {
+  background: #007bff;
+  color: white;
+  box-shadow: 0 4px 8px rgba(0,123,255,0.3);
+}
+
+.pattern-btn small {
+  opacity: 0.8;
+  font-size: 12px;
+}
+
+.pattern-summary {
+  display: flex;
+  gap: 20px;
+  margin: 20px 0;
+  flex-wrap: wrap;
+}
+
+.summary-card {
+  background: white;
+  border: 1px solid #e9ecef;
+  border-radius: 8px;
   padding: 20px;
+  text-align: center;
+  flex: 1;
+  min-width: 100px;
+  box-shadow: 0 2px 4px rgba(0,0,0,0.05);
+}
+
+.summary-card.views {
+  border-left: 4px solid #17a2b8;
+}
+
+.summary-card.carts {
+  border-left: 4px solid #ffc107;
+}
+
+.summary-card.purchases {
+  border-left: 4px solid #28a745;
+}
+
+.summary-number {
+  font-size: 2rem;
+  font-weight: bold;
+  margin-bottom: 5px;
+}
+
+.summary-label {
+  color: #6c757d;
+  font-size: 14px;
+}
+
+/* Interaction History */
+.interaction-history {
+  margin-top: 25px;
+}
+
+.interaction-history h3 {
+  color: #495057;
+  margin-bottom: 15px;
+}
+
+.interaction-item {
+  background: white;
+  border: 1px solid #e9ecef;
+  border-radius: 8px;
+  margin-bottom: 10px;
+  padding: 15px;
+  display: flex;
+  justify-content: space-between;
+  align-items: center;
+  transition: all 0.2s ease;
+}
+
+.interaction-item:hover {
+  box-shadow: 0 2px 8px rgba(0,0,0,0.1);
+  transform: translateY(-1px);
+}
+
+.interaction-main {
+  display: flex;
+  align-items: center;
+  gap: 15px;
+  flex: 1;
+}
+
+.interaction-type {
+  padding: 4px 12px;
+  border-radius: 20px;
+  font-size: 12px;
+  font-weight: 600;
+  text-transform: uppercase;
+  min-width: 80px;
+  text-align: center;
+}
+
+.interaction-type.view {
+  background: #cce7ff;
+  color: #0066cc;
+}
+
+.interaction-type.cart {
+  background: #fff3cd;
+  color: #856404;
+}
+
+.interaction-type.purchase {
+  background: #d4edda;
+  color: #155724;
+}
+
+.interaction-details {
+  flex: 1;
+  font-size: 14px;
+}
+
+.category-tag {
+  background: #e9ecef;
+  color: #495057;
+  padding: 2px 8px;
+  border-radius: 12px;
+  font-size: 12px;
+  font-weight: 500;
+  display: inline-block;
+  margin: 0 5px;
+}
+
+.interaction-expand {
+  padding: 6px 12px;
+  background: #f8f9fa;
+  border: 1px solid #dee2e6;
+  border-radius: 4px;
+  cursor: pointer;
+  font-size: 12px;
+  color: #495057;
+  transition: all 0.2s ease;
+}
+
+.interaction-expand:hover {
+  background: #e9ecef;
+  border-color: #adb5bd;
+}
+
+.interaction-expanded {
+  background: #f8f9fa;
+  border: 1px solid #dee2e6;
+  border-radius: 8px;
+  padding: 20px;
+  margin-top: 10px;
+}
+
+.interaction-meta {
+  display: grid;
+  grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
+  gap: 15px;
+}
+
+.interaction-meta-item {
+  display: flex;
+  justify-content: space-between;
+  padding: 8px 0;
+  border-bottom: 1px solid #e9ecef;
+}
+
+.interaction-meta-label {
+  font-weight: 600;
+  color: #495057;
+}
+
+.interaction-meta-value {
+  color: #6c757d;
+  font-family: monospace;
+}
+
+/* Recommendation Controls */
+.recommendation-controls {
+  background: #fff;
+  padding: 25px;
+  border-radius: 10px;
+  margin-bottom: 30px;
+  border: 1px solid #e9ecef;
+  box-shadow: 0 2px 4px rgba(0,0,0,0.1);
+}
+
+.recommendation-controls h2 {
+  margin-top: 0;
+  color: #495057;
+  border-bottom: 2px solid #dc3545;
+  padding-bottom: 10px;
+}
+
+.controls-row {
+  display: flex;
+  gap: 20px;
+  align-items: end;
+  flex-wrap: wrap;
+}
+
+.btn {
+  padding: 12px 24px;
+  border: none;
+  border-radius: 6px;
+  cursor: pointer;
+  font-size: 14px;
+  font-weight: 600;
+  transition: all 0.3s ease;
+  text-decoration: none;
+  display: inline-block;
+  text-align: center;
+}
+
+.btn-primary {
+  background: #007bff;
   color: white;
 }
 
+.btn-primary:hover:not(:disabled) {
+  background: #0056b3;
+  transform: translateY(-1px);
+  box-shadow: 0 4px 8px rgba(0,123,255,0.3);
+}
+
+.btn:disabled {
+  background: #6c757d;
+  cursor: not-allowed;
+  opacity: 0.65;
+}
+
+/* Recommendations */
+.recommendations {
+  background: #fff;
+  padding: 25px;
+  border-radius: 10px;
+  border: 1px solid #e9ecef;
+  box-shadow: 0 2px 4px rgba(0,0,0,0.1);
+}
+
+.recommendations h2 {
+  margin-top: 0;
+  color: #495057;
+  border-bottom: 2px solid #6f42c1;
+  padding-bottom: 10px;
 }
 
+.stats {
+  background: #f8f9fa;
+  padding: 15px;
+  border-radius: 6px;
+  margin-bottom: 20px;
+  font-size: 14px;
+  color: #495057;
+}
+
+.recommendations-grid {
+  display: grid;
+  grid-template-columns: repeat(auto-fill, minmax(300px, 1fr));
+  gap: 20px;
+  margin-top: 20px;
+}
+
+.recommendation-card {
+  background: white;
+  border: 1px solid #e9ecef;
+  border-radius: 8px;
+  padding: 20px;
+  transition: all 0.2s ease;
+  box-shadow: 0 2px 4px rgba(0,0,0,0.05);
+}
+
+.recommendation-card:hover {
+  transform: translateY(-2px);
+  box-shadow: 0 4px 12px rgba(0,0,0,0.1);
+  border-color: #007bff;
+}
+
+.card-header {
+  display: flex;
+  justify-content: space-between;
+  align-items: center;
+  margin-bottom: 15px;
+  padding-bottom: 10px;
+  border-bottom: 1px solid #e9ecef;
+}
+
+.item-id {
+  font-weight: 600;
+  color: #495057;
+}
+
+.score {
+  background: #007bff;
+  color: white;
+  padding: 4px 8px;
+  border-radius: 12px;
+  font-size: 12px;
+  font-weight: 600;
+}
+
+.item-details p {
+  margin: 8px 0;
+}
+
+.brand {
+  font-weight: 600;
+  color: #495057;
+  font-size: 16px;
+}
+
+.price {
+  color: #28a745;
+  font-weight: 600;
+  font-size: 18px;
+}
+
+.category {
+  color: #6c757d;
+  font-size: 14px;
+  background: #f8f9fa;
+  padding: 4px 8px;
+  border-radius: 4px;
+  display: inline-block;
+}
+
+/* Error and Loading States */
+.error {
+  background: #f8d7da;
+  color: #721c24;
+  padding: 15px;
+  border-radius: 6px;
+  border: 1px solid #f5c6cb;
+  margin-bottom: 20px;
+}
+
+.loading {
+  text-align: center;
+  padding: 40px;
+  background: #f8f9fa;
+  border-radius: 10px;
+  border: 1px solid #e9ecef;
+}
+
+.loading h3 {
+  color: #495057;
+  margin-bottom: 10px;
+}
+
+.loading p {
+  color: #6c757d;
+  margin: 0;
+}
+
+/* Responsive Design */
+@media (max-width: 768px) {
+  .container {
+    padding: 10px;
+  }
+
+  .form-row,
+  .controls-row {
+    flex-direction: column;
+  }
+
+  .pattern-buttons {
+    flex-direction: column;
+  }
+
+  .pattern-btn {
+    width: 100%;
+  }
+
+  .recommendations-grid {
+    grid-template-columns: 1fr;
+  }
+
+  .interaction-main {
+    flex-direction: column;
+    align-items: flex-start;
+    gap: 10px;
   }
+
+  .interaction-meta {
+    grid-template-columns: 1fr;
   }
 }
frontend/src/App.js CHANGED
@@ -22,21 +22,38 @@ function App() {
22
  });
23
 
24
  const [recommendationType, setRecommendationType] = useState('hybrid');
25
- const [numRecommendations, setNumRecommendations] = useState(10);
26
  const [collaborativeWeight, setCollaborativeWeight] = useState(0.7);
27
 
28
  const [recommendations, setRecommendations] = useState([]);
29
  const [loading, setLoading] = useState(false);
30
  const [error, setError] = useState(null);
31
 
 
 
 
 
32
  const [sampleItems, setSampleItems] = useState([]);
33
  const [interactions, setInteractions] = useState([]);
34
  const [expandedInteraction, setExpandedInteraction] = useState(null);
35
  const [selectedPattern, setSelectedPattern] = useState(null);
 
 
 
 
 
 
 
 
 
 
 
36
 
37
- // Load sample items on component mount
38
  useEffect(() => {
39
  fetchSampleItems();
 
 
40
  }, []);
41
 
42
  const fetchSampleItems = async () => {
@@ -48,6 +65,30 @@ function App() {
48
  }
49
  };
50
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
51
  const handleProfileChange = (field, value) => {
52
  setUserProfile(prev => ({
53
  ...prev,
@@ -55,6 +96,41 @@ function App() {
55
  }));
56
  };
57
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
58
  const generateTimestamp = (baseTime, offsetHours) => {
59
  const timestamp = new Date(baseTime.getTime() - (offsetHours * 60 * 60 * 1000));
60
  return timestamp.toISOString().replace('T', ' ').slice(0, 19);
@@ -170,9 +246,77 @@ function App() {
170
  }));
171
  };
172
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
173
  const getRecommendations = async () => {
174
  setLoading(true);
175
  setError(null);
 
176
 
177
  try {
178
  const requestData = {
@@ -192,27 +336,154 @@ function App() {
192
  }
193
  };
194
 
195
- const getInteractionCounts = () => {
196
- const counts = { views: 0, carts: 0, purchases: 0 };
197
- interactions.forEach(interaction => {
198
- counts[interaction.type + 's'] = (counts[interaction.type + 's'] || 0) + 1;
199
- });
200
- return counts;
201
- };
202
-
203
- const counts = getInteractionCounts();
204
-
205
  return (
206
  <div className="App">
207
  <div className="container">
208
  <header className="header">
209
  <h1>Two-Tower Recommendation System Demo</h1>
210
- <p>Configure user demographics and realistic interaction patterns to get personalized recommendations</p>
 
 
 
 
 
 
 
211
  </header>
212
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
213
  {/* User Profile Form */}
214
  <div className="user-profile-form">
215
- <h2>User Demographics</h2>
 
 
 
 
 
 
 
 
 
 
 
 
 
216
 
217
  <div className="form-row">
218
  <div className="form-group">
@@ -224,6 +495,8 @@ function App() {
224
  onChange={(e) => handleProfileChange('age', e.target.value)}
225
  min="18"
226
  max="100"
 
 
227
  />
228
  </div>
229
 
@@ -233,6 +506,8 @@ function App() {
233
  id="gender"
234
  value={userProfile.gender}
235
  onChange={(e) => handleProfileChange('gender', e.target.value)}
 
 
236
  >
237
  <option value="male">Male</option>
238
  <option value="female">Female</option>
@@ -248,6 +523,8 @@ function App() {
248
  onChange={(e) => handleProfileChange('income', e.target.value)}
249
  min="0"
250
  step="1000"
 
 
251
  />
252
  </div>
253
  </div>
@@ -255,29 +532,140 @@ function App() {

  {/* Interaction Patterns */}
  <div className="interaction-patterns">
- <h2>Interaction Patterns</h2>
- <p>Generate realistic user behavior patterns with proportional view, cart, and purchase events</p>
-
- <div className="pattern-buttons">
- {INTERACTION_PATTERNS.map((pattern, index) => (
- <button
- key={index}
- className={`pattern-btn ${selectedPattern?.name === pattern.name ? 'active' : ''}`}
- onClick={() => handlePatternSelect(pattern)}
- >
- {pattern.name}
- <br />
- <small>{pattern.views}V • {pattern.carts}C • {pattern.purchases}P</small>
- </button>
- ))}
- <button
- className="pattern-btn"
- onClick={clearInteractions}
- style={{backgroundColor: '#dc3545', color: 'white', borderColor: '#dc3545'}}
- >
- Clear All
- </button>
- </div>

  {interactions.length > 0 && (
  <>
@@ -305,7 +693,7 @@ function App() {
  {interaction.type}
  </span>
  <span className="interaction-details">
- <strong>{interaction.brand}</strong> - ${interaction.price}
  {interaction.quantity && ` (x${interaction.quantity})`}
  {interaction.total_amount && ` = $${interaction.total_amount}`}
  </span>
@@ -387,6 +775,8 @@ function App() {
  onChange={(e) => setRecommendationType(e.target.value)}
  >
  <option value="hybrid">Hybrid</option>
  <option value="collaborative">Collaborative Filtering</option>
  <option value="content">Content-Based</option>
  </select>
@@ -399,14 +789,15 @@ function App() {
  value={numRecommendations}
  onChange={(e) => setNumRecommendations(parseInt(e.target.value))}
  >
- <option value="5">5</option>
  <option value="10">10</option>
- <option value="15">15</option>
  <option value="20">20</option>
  </select>
  </div>

- {recommendationType === 'hybrid' && (
  <div className="form-group">
  <label htmlFor="collabWeight">Collaborative Weight:</label>
  <input
@@ -448,19 +839,66 @@ function App() {
  {/* Recommendations Display */}
  {recommendations.length > 0 && (
  <div className="recommendations">
- <h2>Recommendations ({recommendationType})</h2>

  <div className="stats">
  <strong>User Profile:</strong> {userProfile.age}yr {userProfile.gender},
- ${userProfile.income.toLocaleString()} income,
- {interactions.length} total interactions ({counts.views || 0} views, {counts.carts || 0} carts, {counts.purchases || 0} purchases)
  </div>

  <div className="recommendations-grid">
- {recommendations.map((rec, index) => (
  <div key={rec.item_id} className="recommendation-card">
  <div className="card-header">
- <span className="item-id">#{index + 1} Item {rec.item_id}</span>
  <span className="score">{rec.score.toFixed(4)}</span>
  </div>

@@ -472,6 +910,27 @@ function App() {
  </div>
  ))}
  </div>
  </div>
  )}

  });

  const [recommendationType, setRecommendationType] = useState('hybrid');
+ const [numRecommendations, setNumRecommendations] = useState(100);
  const [collaborativeWeight, setCollaborativeWeight] = useState(0.7);

  const [recommendations, setRecommendations] = useState([]);
  const [loading, setLoading] = useState(false);
  const [error, setError] = useState(null);

+ // Pagination for recommendations
+ const [currentPage, setCurrentPage] = useState(1);
+ const [itemsPerPage] = useState(20); // Show 20 recommendations per page
+
  const [sampleItems, setSampleItems] = useState([]);
  const [interactions, setInteractions] = useState([]);
  const [expandedInteraction, setExpandedInteraction] = useState(null);
  const [selectedPattern, setSelectedPattern] = useState(null);
+
+ // Real user data states
+ const [realUsers, setRealUsers] = useState([]);
+ const [selectedRealUser, setSelectedRealUser] = useState(null);
+ const [datasetSummary, setDatasetSummary] = useState(null);
+ const [useRealUsers, setUseRealUsers] = useState(true);
+
+ // Expanded interaction states
+ const [showUserInteractions, setShowUserInteractions] = useState(false);
+ const [userInteractionDetails, setUserInteractionDetails] = useState(null);
+ const [loadingInteractions, setLoadingInteractions] = useState(false);

+ // Load sample items and real users on component mount
  useEffect(() => {
  fetchSampleItems();
+ fetchRealUsers();
+ fetchDatasetSummary();
  }, []);

  const fetchSampleItems = async () => {
 
  }
  };

+ const fetchRealUsers = async () => {
+ try {
+ const response = await axios.get(`${API_BASE_URL}/real-users?count=100&min_interactions=5`);
+ setRealUsers(response.data.users || []);
+ if (response.data.users && response.data.users.length > 0) {
+ // Auto-select the first (most active) user
+ handleRealUserSelect(response.data.users[0]);
+ }
+ } catch (error) {
+ console.error('Error fetching real users:', error);
+ setError('Could not load real users. Using synthetic data mode.');
+ setUseRealUsers(false);
+ }
+ };
+
+ const fetchDatasetSummary = async () => {
+ try {
+ const response = await axios.get(`${API_BASE_URL}/dataset-summary`);
+ setDatasetSummary(response.data);
+ } catch (error) {
+ console.error('Error fetching dataset summary:', error);
+ }
+ };
+
  const handleProfileChange = (field, value) => {
  setUserProfile(prev => ({
  ...prev,
 
  }));
  };

+ const handleRealUserSelect = (user) => {
+ setSelectedRealUser(user);
+ setUserProfile({
+ age: user.age,
+ gender: user.gender,
+ income: user.income,
+ interaction_history: user.interaction_history.slice(0, 50) // Limit to 50 items
+ });
+ // Clear any synthetic interactions and expanded states
+ setInteractions([]);
+ setSelectedPattern(null);
+ setShowUserInteractions(false);
+ setUserInteractionDetails(null);
+ };
+
+ const fetchUserInteractionDetails = async (userId) => {
+ setLoadingInteractions(true);
+ try {
+ const response = await axios.get(`${API_BASE_URL}/real-users/${userId}`);
+ setUserInteractionDetails(response.data);
+ } catch (error) {
+ console.error('Error fetching user interaction details:', error);
+ setError('Could not load user interaction details');
+ } finally {
+ setLoadingInteractions(false);
+ }
+ };
+
+ const toggleUserInteractions = async () => {
+ if (!showUserInteractions && selectedRealUser && !userInteractionDetails) {
+ await fetchUserInteractionDetails(selectedRealUser.user_id);
+ }
+ setShowUserInteractions(!showUserInteractions);
+ };
+
  const generateTimestamp = (baseTime, offsetHours) => {
  const timestamp = new Date(baseTime.getTime() - (offsetHours * 60 * 60 * 1000));
  return timestamp.toISOString().replace('T', ' ').slice(0, 19);
 
  }));
  };

+
+ const getInteractionCounts = () => {
+ const counts = { views: 0, carts: 0, purchases: 0 };
+ interactions.forEach(interaction => {
+ counts[interaction.type + 's'] = (counts[interaction.type + 's'] || 0) + 1;
+ });
+ return counts;
+ };
+
+ const counts = getInteractionCounts();
+
+ // Calculate category percentages from user interactions
+ const getCategoryPercentages = () => {
+ if (!selectedRealUser || !userInteractionDetails) return {};
+
+ const categoryCounts = {};
+ let totalInteractions = 0;
+
+ userInteractionDetails.timeline?.forEach(interaction => {
+ const category = interaction.category_code || 'Unknown';
+ categoryCounts[category] = (categoryCounts[category] || 0) + 1;
+ totalInteractions++;
+ });
+
+ const categoryPercentages = {};
+ Object.keys(categoryCounts).forEach(category => {
+ categoryPercentages[category] = ((categoryCounts[category] / totalInteractions) * 100).toFixed(1);
+ });
+
+ return categoryPercentages;
+ };
+
+ // Calculate recommendation category percentages
+ const getRecommendationCategoryPercentages = () => {
+ if (!recommendations || recommendations.length === 0) return {};
+
+ const recCategoryCounts = {};
+
+ recommendations.forEach(rec => {
+ const category = rec.item_info?.category_code || 'Unknown';
+ recCategoryCounts[category] = (recCategoryCounts[category] || 0) + 1;
+ });
+
+ const recCategoryPercentages = {};
+ Object.keys(recCategoryCounts).forEach(category => {
+ recCategoryPercentages[category] = ((recCategoryCounts[category] / recommendations.length) * 100).toFixed(1);
+ });
+
+ return recCategoryPercentages;
+ };
+
+ const categoryPercentages = getCategoryPercentages();
+ const recommendationCategoryPercentages = getRecommendationCategoryPercentages();
+
+ // Pagination logic
+ const totalPages = Math.ceil(recommendations.length / itemsPerPage);
+ const startIndex = (currentPage - 1) * itemsPerPage;
+ const endIndex = startIndex + itemsPerPage;
+ const currentRecommendations = recommendations.slice(startIndex, endIndex);
+
+ const goToPage = (page) => {
+ setCurrentPage(page);
+ // Scroll to recommendations section
+ document.querySelector('.recommendations')?.scrollIntoView({ behavior: 'smooth' });
+ };
+
+ // Reset pagination when new recommendations are generated
  const getRecommendations = async () => {
  setLoading(true);
  setError(null);
+ setCurrentPage(1); // Reset to first page

  try {
  const requestData = {
 
  }
  };

  return (
  <div className="App">
  <div className="container">
  <header className="header">
  <h1>Two-Tower Recommendation System Demo</h1>
+ <p>Select from {realUsers.length} real users or configure custom demographics to get personalized recommendations</p>
+
+ {datasetSummary && (
+ <div className="dataset-info">
+ 📊 Dataset: {datasetSummary.total_users?.toLocaleString()} users, {datasetSummary.total_interactions?.toLocaleString()} interactions |
+ 👥 Demographics: Avg age {datasetSummary.demographics?.avg_age}, avg income ${datasetSummary.demographics?.avg_income?.toLocaleString()}
+ </div>
+ )}
  </header>

+ {/* Real User Selector */}
+ {useRealUsers && realUsers.length > 0 && (
+ <div className="real-user-selector">
+ <h2>Real User Selection</h2>
+ <div className="user-selector-controls">
+ <label htmlFor="realUserSelect">Choose from {realUsers.length} real users:</label>
+ <select
+ id="realUserSelect"
+ value={selectedRealUser?.user_id || ''}
+ onChange={(e) => {
+ const userId = parseInt(e.target.value);
+ const user = realUsers.find(u => u.user_id === userId);
+ if (user) handleRealUserSelect(user);
+ }}
+ >
+ <option value="">Select a real user...</option>
+ {realUsers.map((user, index) => (
+ <option key={user.user_id} value={user.user_id}>
+ #{index + 1}: {user.summary} - {user.interaction_pattern}
+ </option>
+ ))}
+ </select>
+
+ <button
+ onClick={() => setUseRealUsers(false)}
+ className="btn btn-secondary"
+ style={{marginLeft: '10px'}}
+ >
+ Use Custom User Instead
+ </button>
+ </div>
+
+ {selectedRealUser && (
+ <div className="selected-real-user">
+ <h3>Selected User: {selectedRealUser.user_id}</h3>
+ <div className="real-user-stats">
+ <div className="user-stat">
+ <span className="stat-label">Demographics:</span>
+ <span className="stat-value">{selectedRealUser.age}yr {selectedRealUser.gender}, ${selectedRealUser.income.toLocaleString()}</span>
+ </div>
+ <div className="user-stat">
+ <span className="stat-label">Behavior Pattern:</span>
+ <span className="stat-value">{selectedRealUser.interaction_pattern}</span>
+ </div>
+ <div className="user-stat">
+ <span className="stat-label">Interactions:</span>
+ <span className="stat-value">
+ {selectedRealUser.interaction_stats.total_interactions} total
+ ({selectedRealUser.interaction_stats.views} views, {selectedRealUser.interaction_stats.cart_adds} carts, {selectedRealUser.interaction_stats.purchases} purchases)
+ </span>
+ </div>
+ <div className="user-stat">
+ <span className="stat-label">History:</span>
+ <span className="stat-value">{selectedRealUser.interaction_stats.unique_items} unique items</span>
+ </div>
+ </div>
+
+ <button
+ onClick={toggleUserInteractions}
+ className="btn btn-info expand-interactions-btn"
+ disabled={loadingInteractions}
+ >
+ {loadingInteractions ? 'Loading...' : showUserInteractions ? 'Hide Interaction Timeline' : 'Show All Interactions Timeline'}
+ </button>
+
+ {showUserInteractions && userInteractionDetails && (
+ <div className="user-interactions-timeline">
+ <h4>Complete Interaction Timeline</h4>
+ <div className="timeline-stats">
+ <span><strong>Total Events:</strong> {userInteractionDetails.total_interactions}</span>
+ <span><strong>Pattern:</strong> {userInteractionDetails.interaction_pattern}</span>
+ <span><strong>Breakdown:</strong> {userInteractionDetails.breakdown.views} views, {userInteractionDetails.breakdown.cart_adds} carts, {userInteractionDetails.breakdown.purchases} purchases</span>
+ </div>
+
+ <div className="interactions-list">
+ <h5>Recent Interactions (Last {userInteractionDetails.timeline?.length || 0} events):</h5>
+ {userInteractionDetails.timeline?.map((interaction, index) => (
+ <div key={index} className="interaction-timeline-item">
+ <div className="interaction-timeline-time">
+ {new Date(interaction.timestamp).toLocaleString()}
+ </div>
+ <div className="interaction-timeline-content">
+ <div className="interaction-main-info">
+ <span className={`interaction-type ${interaction.event_type}`}>
+ {interaction.event_type.toUpperCase()}
+ </span>
+ <span className="interaction-icon">
+ {interaction.event_type === 'purchase' && '💰'}
+ {interaction.event_type === 'cart' && '🛒'}
+ {interaction.event_type === 'view' && '👁️'}
+ </span>
+ <span className="interaction-item-id">
+ Item #{interaction.product_id}
+ </span>
+ </div>
+ <div className="interaction-item-details">
+ <span className="item-brand">
+ <strong>{interaction.brand || 'Unknown Brand'}</strong>
+ </span>
+ <span className="item-category">
+ {interaction.category_code || 'Unknown Category'}
+ </span>
+ <span className="item-price">
+ ${interaction.price ? interaction.price.toFixed(2) : '0.00'}
+ </span>
+ </div>
+ </div>
+ </div>
+ ))}
+ </div>
+ </div>
+ )}
+ </div>
+ )}
+ </div>
+ )}
+
  {/* User Profile Form */}
  <div className="user-profile-form">
+ <h2>User Demographics {useRealUsers && selectedRealUser ? '(From Real User)' : '(Custom)'}</h2>
+
+ {!useRealUsers && (
+ <button
+ onClick={() => {
+ setUseRealUsers(true);
+ if (realUsers.length > 0) handleRealUserSelect(realUsers[0]);
+ }}
+ className="btn btn-secondary"
+ style={{marginBottom: '15px'}}
+ >
+ Switch to Real Users
+ </button>
+ )}

  <div className="form-row">
  <div className="form-group">
 
  onChange={(e) => handleProfileChange('age', e.target.value)}
  min="18"
  max="100"
+ disabled={useRealUsers && selectedRealUser}
+ style={{backgroundColor: useRealUsers && selectedRealUser ? '#f5f5f5' : 'white'}}
  />
  </div>

  id="gender"
  value={userProfile.gender}
  onChange={(e) => handleProfileChange('gender', e.target.value)}
+ disabled={useRealUsers && selectedRealUser}
+ style={{backgroundColor: useRealUsers && selectedRealUser ? '#f5f5f5' : 'white'}}
  >
  <option value="male">Male</option>
  <option value="female">Female</option>
 
  onChange={(e) => handleProfileChange('income', e.target.value)}
  min="0"
  step="1000"
+ disabled={useRealUsers && selectedRealUser}
+ style={{backgroundColor: useRealUsers && selectedRealUser ? '#f5f5f5' : 'white'}}
  />
  </div>
  </div>
 

  {/* Interaction Patterns */}
  <div className="interaction-patterns">
+ {useRealUsers && selectedRealUser ? (
+ <>
+ <h2>Real User Interaction History</h2>
+ <p>This user has genuine interaction history from the dataset - no synthetic patterns needed.</p>
+
+ <div className="real-interaction-summary">
+ <div className="summary-card views">
+ <div className="summary-number">{selectedRealUser.interaction_stats.views}</div>
+ <div className="summary-label">Views</div>
+ </div>
+ <div className="summary-card carts">
+ <div className="summary-number">{selectedRealUser.interaction_stats.cart_adds}</div>
+ <div className="summary-label">Cart Adds</div>
+ </div>
+ <div className="summary-card purchases">
+ <div className="summary-number">{selectedRealUser.interaction_stats.purchases}</div>
+ <div className="summary-label">Purchases</div>
+ </div>
+ </div>
+
+ <div className="real-history-info">
+ <p><strong>Pattern:</strong> {selectedRealUser.interaction_pattern}</p>
+ <p><strong>Total Interactions:</strong> {selectedRealUser.interaction_stats.total_interactions}</p>
+ <p><strong>Unique Items:</strong> {selectedRealUser.interaction_stats.unique_items}</p>
+ <p><strong>Items in History:</strong> {userProfile.interaction_history.length} (showing up to 50 most recent)</p>
+ </div>
+
+ {/* Category Analysis Columns */}
+ {(Object.keys(categoryPercentages).length > 0 || Object.keys(recommendationCategoryPercentages).length > 0) && (
+ <div className="category-analysis">
+ <h4>Category Analysis</h4>
+ <div className="category-columns">
+
+ {/* User's Interacted Categories */}
+ {Object.keys(categoryPercentages).length > 0 && (
+ <div className="category-column">
+ <h5>👁️ User's Category Interests</h5>
+ <div className="category-percentages">
+ {Object.entries(categoryPercentages)
+ .sort((a, b) => parseFloat(b[1]) - parseFloat(a[1]))
+ .slice(0, 5)
+ .map(([category, percentage]) => (
+ <div key={category} className="category-item">
+ <div className="category-bar-container">
+ <div
+ className="category-bar user-category"
+ style={{ width: `${Math.max(parseFloat(percentage), 5)}%` }}
+ ></div>
+ </div>
+ <span className="category-label">{category.replace('_', ' ')}</span>
+ <span className="category-percent">{percentage}%</span>
+ </div>
+ ))}
+ </div>
+ </div>
+ )}
+
+ {/* Recommendation Categories */}
+ {Object.keys(recommendationCategoryPercentages).length > 0 && (
+ <div className="category-column">
+ <h5>🎯 Recommendation Categories</h5>
+ <div className="category-percentages">
+ {Object.entries(recommendationCategoryPercentages)
+ .sort((a, b) => parseFloat(b[1]) - parseFloat(a[1]))
+ .map(([category, percentage]) => {
+ const userPercentage = categoryPercentages[category] || 0;
+ const isMatch = parseFloat(userPercentage) > 0;
+
+ return (
+ <div key={category} className={`category-item ${isMatch ? 'matched' : 'new'}`}>
+ <div className="category-bar-container">
+ <div
+ className={`category-bar ${isMatch ? 'rec-category-matched' : 'rec-category-new'}`}
+ style={{ width: `${Math.max(parseFloat(percentage), 5)}%` }}
+ ></div>
+ </div>
+ <span className="category-label">{category.replace('_', ' ')}</span>
+ <span className="category-percent">{percentage}%</span>
+ {isMatch && <span className="match-indicator">✓</span>}
+ </div>
+ );
+ })}
+ </div>
+ </div>
+ )}
+
+ </div>
+
+ {/* Category Match Analysis */}
+ {Object.keys(categoryPercentages).length > 0 && Object.keys(recommendationCategoryPercentages).length > 0 && (
+ <div className="category-match-summary">
+ <p>
+ <strong>Category Alignment:</strong> {
+ Object.keys(recommendationCategoryPercentages).filter(cat =>
+ parseFloat(categoryPercentages[cat] || 0) > 0
+ ).length
+ } of {Object.keys(recommendationCategoryPercentages).length} recommended categories match user interests
+ <span className="match-legend">
+ <span className="legend-item"><span className="legend-dot matched"></span> Matches user interest</span>
+ <span className="legend-item"><span className="legend-dot new"></span> New category exploration</span>
+ </span>
+ </p>
+ </div>
+ )}
+ </div>
+ )}
+ </>
+ ) : (
+ <>
+ <h2>Synthetic Interaction Patterns</h2>
+ <p>Generate realistic user behavior patterns with proportional view, cart, and purchase events</p>
+
+ <div className="pattern-buttons">
+ {INTERACTION_PATTERNS.map((pattern, index) => (
+ <button
+ key={index}
+ className={`pattern-btn ${selectedPattern?.name === pattern.name ? 'active' : ''}`}
+ onClick={() => handlePatternSelect(pattern)}
+ >
+ {pattern.name}
+ <br />
+ <small>{pattern.views}V • {pattern.carts}C • {pattern.purchases}P</small>
+ </button>
+ ))}
+ <button
+ className="pattern-btn"
+ onClick={clearInteractions}
+ style={{backgroundColor: '#dc3545', color: 'white', borderColor: '#dc3545'}}
+ >
+ Clear All
+ </button>
+ </div>
+ </>
+ )}

  {interactions.length > 0 && (
  <>
 
  {interaction.type}
  </span>
  <span className="interaction-details">
+ <strong>{interaction.brand}</strong> - <span className="category-tag">{interaction.category}</span> - ${interaction.price}
  {interaction.quantity && ` (x${interaction.quantity})`}
  {interaction.total_amount && ` = $${interaction.total_amount}`}
  </span>
 
  onChange={(e) => setRecommendationType(e.target.value)}
  >
  <option value="hybrid">Hybrid</option>
+ <option value="enhanced">🎯 Enhanced Hybrid (Category-Aware)</option>
+ <option value="category_focused">🎯 Category Focused (80% Match)</option>
  <option value="collaborative">Collaborative Filtering</option>
  <option value="content">Content-Based</option>
  </select>
 
  value={numRecommendations}
  onChange={(e) => setNumRecommendations(parseInt(e.target.value))}
  >
  <option value="10">10</option>
  <option value="20">20</option>
+ <option value="50">50</option>
+ <option value="100">100 (Top Items)</option>
+ <option value="200">200 (Extended)</option>
  </select>
  </div>

+ {(recommendationType === 'hybrid' || recommendationType === 'enhanced') && (
  <div className="form-group">
  <label htmlFor="collabWeight">Collaborative Weight:</label>
  <input
 
  {/* Recommendations Display */}
  {recommendations.length > 0 && (
  <div className="recommendations">
+ <h2>Top {recommendations.length} Recommendations ({recommendationType})</h2>

  <div className="stats">
  <strong>User Profile:</strong> {userProfile.age}yr {userProfile.gender},
+ ${userProfile.income.toLocaleString()} income
+ {useRealUsers && selectedRealUser ? (
+ <span> | <strong>Real User {selectedRealUser.user_id}:</strong> {selectedRealUser.interaction_pattern} -
+ {selectedRealUser.interaction_stats.total_interactions} total interactions
+ ({selectedRealUser.interaction_stats.views} views, {selectedRealUser.interaction_stats.cart_adds} carts, {selectedRealUser.interaction_stats.purchases} purchases)
+ </span>
+ ) : (
+ <span> | <strong>Synthetic:</strong> {interactions.length} total interactions ({counts.views || 0} views, {counts.carts || 0} carts, {counts.purchases || 0} purchases)</span>
+ )}
  </div>

+ {/* Pagination Controls */}
+ {totalPages > 1 && (
+ <div className="pagination-info">
+ <p>Showing {startIndex + 1}-{Math.min(endIndex, recommendations.length)} of {recommendations.length} recommendations</p>
+ <div className="pagination-controls">
+ <button
+ onClick={() => goToPage(currentPage - 1)}
+ disabled={currentPage === 1}
+ className="pagination-btn"
+ >
+ ← Previous
+ </button>
+
+ <div className="page-numbers">
+ {Array.from({length: Math.min(totalPages, 10)}, (_, i) => {
+ const page = i + 1;
+ return (
+ <button
+ key={page}
+ onClick={() => goToPage(page)}
+ className={`page-number ${currentPage === page ? 'active' : ''}`}
+ >
+ {page}
+ </button>
+ );
+ })}
+ {totalPages > 10 && <span className="pagination-ellipsis">...</span>}
+ </div>
+
+ <button
+ onClick={() => goToPage(currentPage + 1)}
+ disabled={currentPage === totalPages}
+ className="pagination-btn"
+ >
+ Next →
+ </button>
+ </div>
+ </div>
+ )}
+
  <div className="recommendations-grid">
+ {currentRecommendations.map((rec, index) => (
  <div key={rec.item_id} className="recommendation-card">
  <div className="card-header">
+ <span className="item-id">#{startIndex + index + 1} Item {rec.item_id}</span>
  <span className="score">{rec.score.toFixed(4)}</span>
  </div>

 
  </div>
  ))}
  </div>
+
+ {/* Bottom Pagination */}
+ {totalPages > 1 && (
+ <div className="pagination-controls bottom-pagination">
+ <button
+ onClick={() => goToPage(currentPage - 1)}
+ disabled={currentPage === 1}
+ className="pagination-btn"
+ >
+ ← Previous
+ </button>
+ <span className="page-indicator">Page {currentPage} of {totalPages}</span>
+ <button
+ onClick={() => goToPage(currentPage + 1)}
+ disabled={currentPage === totalPages}
+ className="pagination-btn"
+ >
+ Next →
+ </button>
+ </div>
+ )}
  </div>
  )}

 
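Reviewer note on the frontend pagination added above: the slicing math (`itemsPerPage = 20`, `totalPages = Math.ceil(...)`, `recommendations.slice(startIndex, endIndex)`) is language-agnostic and worth sanity-checking. This is a minimal Python sketch mirroring the JS logic; the function name `paginate` is illustrative, not part of the codebase.

```python
import math

ITEMS_PER_PAGE = 20  # mirrors the itemsPerPage state in the frontend


def paginate(recommendations, current_page, items_per_page=ITEMS_PER_PAGE):
    """Return (page_slice, total_pages, start_index) for 1-indexed pages,
    matching the frontend's Math.ceil / slice arithmetic."""
    total_pages = math.ceil(len(recommendations) / items_per_page)
    start_index = (current_page - 1) * items_per_page
    end_index = start_index + items_per_page
    return recommendations[start_index:end_index], total_pages, start_index


# 100 recommendations at 20 per page -> 5 pages; page 5 holds items 80..99
page, total, start = paginate(list(range(100)), 5)
```

A ragged last page falls out naturally from the slice: 101 items yields 6 pages with a single item on page 6.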
run_2phase_training.py ADDED
@@ -0,0 +1,206 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Two-Phase Training Pipeline Runner
4
+
5
+ This script orchestrates the complete 2-phase training approach:
6
+ 1. Phase 1: Item tower pretraining on item features
7
+ 2. Phase 2: Joint training of user tower + fine-tuning pre-trained item tower
8
+
9
+ Usage:
10
+ python run_2phase_training.py
11
+ """
12
+
13
+ import os
14
+ import sys
15
+ import time
16
+ import pickle
17
+ import numpy as np
18
+ from typing import Dict
19
+
20
+ # Add src to path
21
+ sys.path.append(os.path.join(os.path.dirname(__file__), 'src'))
22
+
23
+ from src.training.item_pretraining import ItemTowerPretrainer
24
+ from src.training.joint_training import JointTrainer
25
+ from src.preprocessing.data_loader import DataProcessor
26
+ from src.inference.faiss_index import FAISSItemIndex
27
+
28
+
29
+ def run_phase1_item_pretraining():
30
+ """Phase 1: Pre-train the item tower."""
31
+
32
+ print("\n" + "="*60)
33
+ print("PHASE 1: ITEM TOWER PRETRAINING")
34
+ print("="*60)
35
+
36
+ # Initialize components
37
+ data_processor = DataProcessor()
38
+ pretrainer = ItemTowerPretrainer(
39
+ embedding_dim=128,
40
+ hidden_dims=[256, 128],
41
+ dropout_rate=0.2,
42
+ learning_rate=0.001
43
+ )
44
+
45
+ # Prepare data
46
+ print("Preparing item data...")
47
+ dataset, data_processor, price_normalizer = pretrainer.prepare_data(data_processor)
48
+
49
+ # Build model
50
+ print("Building item tower...")
51
+ model = pretrainer.build_model(
52
+ item_vocab_size=len(data_processor.item_vocab),
53
+ category_vocab_size=len(data_processor.category_vocab),
54
+ brand_vocab_size=len(data_processor.brand_vocab),
55
+ price_normalizer=price_normalizer
56
+ )
57
+
58
+ # Train model
59
+ print("Training item tower (Phase 1)...")
60
+ start_time = time.time()
61
+ history = pretrainer.train(dataset, epochs=50)
62
+ phase1_time = time.time() - start_time
63
+
64
+ # Generate embeddings
65
+ print("Generating item embeddings...")
66
+ item_embeddings = pretrainer.generate_item_embeddings(dataset, data_processor)
67
+
68
+ # Save artifacts
69
+ print("Saving Phase 1 artifacts...")
70
+ os.makedirs("src/artifacts", exist_ok=True)
71
+ data_processor.save_vocabularies()
72
+ pretrainer.save_model()
73
+
74
+ # Save embeddings for FAISS index
75
+ np.save("src/artifacts/item_embeddings.npy", item_embeddings)
76
+
77
+ # Build FAISS index
78
+ print("Building FAISS index...")
79
+ faiss_index = FAISSItemIndex()
80
+ faiss_index.build_index(item_embeddings)
81
+ faiss_index.save_index("src/artifacts/")
82
+
83
+ print(f"✅ Phase 1 completed in {phase1_time:.2f} seconds")
84
+ print(f" - Items processed: {len(item_embeddings)}")
85
+ print(f" - Final loss: {history.history['total_loss'][-1]:.4f}")
86
+
87
+ return data_processor
88
+
89
+
90
+ def run_phase2_joint_training(data_processor: DataProcessor):
91
+ """Phase 2: Joint training with pre-trained item tower."""
92
+
93
+ print("\n" + "="*60)
94
+ print("PHASE 2: JOINT TRAINING")
95
+ print("="*60)
96
+
97
+ # Initialize joint trainer
98
+ trainer = JointTrainer(
99
+ embedding_dim=128,
100
+ user_learning_rate=0.001,
101
+ item_learning_rate=0.0001, # Lower LR for pre-trained item tower
102
+ rating_weight=1.0,
103
+ retrieval_weight=0.5
104
+ )
105
+
106
+ # Load pre-trained item tower
107
+ print("Loading pre-trained item tower...")
108
+ trainer.load_pre_trained_item_tower()
109
+
110
+ # Build user tower
111
+ print("Building user tower...")
112
+ trainer.build_user_tower(max_history_length=50)
113
+
114
+ # Build complete two-tower model
115
+ print("Building complete two-tower model...")
116
+ trainer.build_two_tower_model()
117
+
118
+ # Prepare training data
119
+ print("Preparing user interaction data...")
120
+
121
+ # Check if training features already exist
122
+ if os.path.exists("src/artifacts/training_features.pkl"):
123
+ print("Loading existing training features...")
124
+ with open("src/artifacts/training_features.pkl", 'rb') as f:
125
+ training_features = pickle.load(f)
126
+ with open("src/artifacts/validation_features.pkl", 'rb') as f:
127
+ validation_features = pickle.load(f)
128
+ else:
129
+ # Generate training features
130
+ print("Generating training features...")
131
+ training_features, validation_features = data_processor.prepare_training_data()
132
+
133
+ # Save features
134
+ with open("src/artifacts/training_features.pkl", 'wb') as f:
135
+ pickle.dump(training_features, f)
136
+ with open("src/artifacts/validation_features.pkl", 'wb') as f:
137
+ pickle.dump(validation_features, f)
138
+
139
+ # Train joint model
140
+ print("Starting joint training (Phase 2)...")
141
+ start_time = time.time()
142
+ history = trainer.train(
143
+ training_features=training_features,
144
+ validation_features=validation_features,
145
+ epochs=100,
146
+ batch_size=256
147
+ )
148
+ phase2_time = time.time() - start_time
149
+
150
+ # Save final model
151
+ print("Saving final two-tower model...")
152
+ trainer.save_model()
153
+
154
+ # Save training history
155
+ with open("src/artifacts/joint_training_history.pkl", 'wb') as f:
156
+ pickle.dump(history, f)
157
+
158
+ print(f"✅ Phase 2 completed in {phase2_time:.2f} seconds")
159
+ print(f" - Best validation loss: {min(history['val_total_loss']):.4f}")
160
+ print(f" - Epochs trained: {len(history['total_loss'])}")
161
+
162
+ return history
163
+
164
+
165
+ def main():
166
+ """Main function to run complete 2-phase training pipeline."""
167
+
168
+ print("🚀 STARTING 2-PHASE TRAINING PIPELINE")
169
+ print(f"Working directory: {os.getcwd()}")
170
+ print(f"Python path: {sys.executable}")
171
+
172
+ total_start_time = time.time()
173
+
174
+ try:
175
+ # Phase 1: Item tower pretraining
176
+ data_processor = run_phase1_item_pretraining()
177
+
178
+ # Phase 2: Joint training
179
+ history = run_phase2_joint_training(data_processor)
180
+
181
+ # Final summary
182
+ total_time = time.time() - total_start_time
183
+
184
+ print("\n" + "="*60)
185
+ print("🎉 2-PHASE TRAINING COMPLETED SUCCESSFULLY!")
186
+ print("="*60)
187
+ print(f"Total training time: {total_time:.2f} seconds ({total_time/60:.1f} minutes)")
188
+ print(f"Artifacts saved in: src/artifacts/")
189
+ print("\nKey files generated:")
190
+ print(" - item_tower_weights: Pre-trained item embeddings")
191
+ print(" - item_tower_weights_finetuned_best: Fine-tuned item tower")
192
+ print(" - user_tower_weights_best: Trained user tower")
193
+ print(" - rating_model_weights_best: Rating prediction model")
194
+ print(" - faiss_index.index: Item similarity index")
195
+ print(" - vocabularies.pkl: Feature vocabularies")
196
+
197
+ print(f"\n🔥 Final validation loss: {min(history['val_total_loss']):.4f}")
198
+ print("\n✅ Ready to run inference with api/main.py!")
199
+
200
+ except Exception as e:
201
+ print(f"\n❌ Training failed with error: {str(e)}")
202
+ raise
203
+
204
+
205
+ if __name__ == "__main__":
206
+ main()
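Both training scripts reuse the same cache-or-compute pattern for the pickled feature artifacts: load the file if it exists, otherwise generate and save it. A minimal sketch of that pattern, where the `build` callable and the path are placeholders rather than names from this repo:

```python
import os
import pickle

def load_or_build(path, build):
    """Load a pickled artifact if present; otherwise build, save, and return it."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    artifact = build()
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    with open(path, 'wb') as f:
        pickle.dump(artifact, f)
    return artifact
```

The second call for the same path skips `build` entirely, which is why rerunning the pipeline with existing `training_features.pkl` starts straight at the training phase.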
run_joint_training.py ADDED
@@ -0,0 +1,453 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Single Joint Training Pipeline Runner
4
+
5
+ This script orchestrates the single-phase joint training approach:
6
+ - Trains user tower and item tower simultaneously from scratch
7
+ - No pre-training phase - direct end-to-end optimization
8
+ - Supports both regular and fast training modes
9
+
10
+ Usage:
11
+ python run_joint_training.py [--fast]
12
+ """
13
+
14
+ import os
15
+ import sys
16
+ import time
17
+ import pickle
18
+ import argparse
19
+ from typing import Dict
20
+
21
+ # Add src to path
22
+ sys.path.append(os.path.join(os.path.dirname(__file__), 'src'))
23
+
24
+ from src.training.fast_joint_training import FastJointTrainer
25
+ from src.models.item_tower import ItemTower
26
+ from src.models.user_tower import UserTower, TwoTowerModel
27
+ from src.preprocessing.data_loader import DataProcessor, create_tf_dataset
28
+ from src.inference.faiss_index import FAISSItemIndex
29
+ import tensorflow as tf
30
+ import numpy as np
31
+
32
+
33
+ class SingleJointTrainer:
34
+ """Complete single-phase joint training from scratch."""
35
+
36
+ def __init__(self):
37
+ self.item_tower = None
38
+ self.user_tower = None
39
+ self.model = None
40
+ self.data_processor = None
41
+
42
+ # Training hyperparameters
43
+ self.embedding_dim = 128
44
+ self.learning_rate = 0.001
45
+ self.batch_size = 256
46
+ self.epochs = 80
47
+ self.patience = 20
48
+
49
+ def prepare_data(self):
50
+ """Prepare all training data from scratch."""
51
+
52
+ print("Loading and preparing data...")
53
+
54
+ # Initialize data processor
55
+ self.data_processor = DataProcessor()
56
+
57
+ # Check if preprocessed data exists
58
+ if os.path.exists("src/artifacts/training_features.pkl"):
59
+ print("Loading existing preprocessed data...")
60
+
61
+ # Load vocabularies
62
+ self.data_processor.load_vocabularies("src/artifacts/vocabularies.pkl")
63
+
64
+ # Load training features
65
+ with open("src/artifacts/training_features.pkl", 'rb') as f:
66
+ training_features = pickle.load(f)
67
+ with open("src/artifacts/validation_features.pkl", 'rb') as f:
68
+ validation_features = pickle.load(f)
69
+ else:
70
+ print("Preprocessing data from scratch...")
71
+
72
+ # Load raw data and build vocabularies
73
+ items_df, users_df, interactions_df = self.data_processor.load_data()
74
+ self.data_processor.build_vocabularies(items_df, users_df, interactions_df)
75
+
76
+ # Generate training features
77
+ training_features, validation_features = self.data_processor.prepare_training_data()
78
+
79
+ # Save for future use
80
+ os.makedirs("src/artifacts", exist_ok=True)
81
+ self.data_processor.save_vocabularies()
82
+
83
+ with open("src/artifacts/training_features.pkl", 'wb') as f:
84
+ pickle.dump(training_features, f)
85
+ with open("src/artifacts/validation_features.pkl", 'wb') as f:
86
+ pickle.dump(validation_features, f)
87
+
88
+ print(f"Training samples: {len(training_features['rating']):,}")
89
+ print(f"Validation samples: {len(validation_features['rating']):,}")
90
+
91
+ return training_features, validation_features
92
+
93
+ def build_models(self):
94
+ """Build both towers from scratch."""
95
+
96
+ print("Building item tower...")
97
+ self.item_tower = ItemTower(
98
+ item_vocab_size=len(self.data_processor.item_vocab),
99
+ category_vocab_size=len(self.data_processor.category_vocab),
100
+ brand_vocab_size=len(self.data_processor.brand_vocab),
101
+ embedding_dim=self.embedding_dim,
102
+ hidden_dims=[256, 128],
103
+ dropout_rate=0.2
104
+ )
105
+
106
+ print("Building user tower...")
107
+ self.user_tower = UserTower(
108
+ max_history_length=50,
109
+ embedding_dim=self.embedding_dim,
110
+ hidden_dims=[128, 64],  # Smaller head than the item tower; kept compatible with saved checkpoints
111
+ dropout_rate=0.2
112
+ )
113
+
114
+ print("Building complete two-tower model...")
115
+ self.model = TwoTowerModel(
116
+ item_tower=self.item_tower,
117
+ user_tower=self.user_tower,
118
+ rating_weight=1.0,
119
+ retrieval_weight=0.5
120
+ )
121
+
122
+ print("Models initialized successfully")
123
+
124
+ def train_joint_model(self, training_features: Dict, validation_features: Dict):
125
+ """Train both towers jointly from scratch."""
126
+
127
+ print(f"Starting single-phase joint training...")
128
+ print(f"Configuration: {self.epochs} epochs, batch size {self.batch_size}")
129
+
130
+ # Create datasets
131
+ train_dataset = create_tf_dataset(training_features, self.batch_size)
132
+ val_dataset = create_tf_dataset(validation_features, self.batch_size)
133
+
134
+ # Setup optimizer
135
+ optimizer = tf.keras.optimizers.Adam(learning_rate=self.learning_rate)
136
+
137
+ # Training history
138
+ history = {
139
+ 'total_loss': [],
140
+ 'rating_loss': [],
141
+ 'retrieval_loss': [],
142
+ 'val_total_loss': [],
143
+ 'val_rating_loss': [],
144
+ 'val_retrieval_loss': []
145
+ }
146
+
147
+ best_val_loss = float('inf')
148
+ patience_counter = 0
149
+
150
+ for epoch in range(self.epochs):
151
+ epoch_start = time.time()
152
+ print(f"\nEpoch {epoch + 1}/{self.epochs}")
153
+
154
+ # Training phase
155
+ epoch_losses = {'total_loss': [], 'rating_loss': [], 'retrieval_loss': []}
156
+
157
+ for batch in train_dataset:
158
+ with tf.GradientTape() as tape:
159
+ # Forward pass
160
+ user_embeddings = self.user_tower(batch, training=True)
161
+ item_embeddings = self.item_tower(batch, training=True)
162
+
163
+ # Rating prediction
164
+ concatenated = tf.concat([user_embeddings, item_embeddings], axis=-1)
165
+ rating_predictions = self.model.rating_model(concatenated, training=True)
166
+
167
+ # Rating loss
168
+ rating_loss = self.model.rating_task(
169
+ labels=batch["rating"],
170
+ predictions=rating_predictions
171
+ )
172
+
173
+ # Retrieval loss (dot product similarity)
174
+ similarities = tf.reduce_sum(user_embeddings * item_embeddings, axis=1)
175
+ retrieval_loss = self.model.retrieval_loss(
176
+ batch["rating"],
177
+ tf.nn.sigmoid(similarities)
178
+ )
179
+
180
+ # Combined loss
181
+ total_loss = (
182
+ self.model.rating_weight * rating_loss +
183
+ self.model.retrieval_weight * retrieval_loss
184
+ )
185
+
186
+ # Compute and apply gradients
187
+ all_variables = (
188
+ self.user_tower.trainable_variables +
189
+ self.item_tower.trainable_variables +
190
+ self.model.rating_model.trainable_variables
191
+ )
192
+ gradients = tape.gradient(total_loss, all_variables)
193
+ optimizer.apply_gradients(zip(gradients, all_variables))
194
+
195
+ # Track losses
196
+ epoch_losses['total_loss'].append(total_loss)
197
+ epoch_losses['rating_loss'].append(rating_loss)
198
+ epoch_losses['retrieval_loss'].append(retrieval_loss)
199
+
200
+ # Validation phase
201
+ val_losses = {'total_loss': [], 'rating_loss': [], 'retrieval_loss': []}
202
+
203
+ for batch in val_dataset:
204
+ user_embeddings = self.user_tower(batch, training=False)
205
+ item_embeddings = self.item_tower(batch, training=False)
206
+
207
+ concatenated = tf.concat([user_embeddings, item_embeddings], axis=-1)
208
+ rating_predictions = self.model.rating_model(concatenated, training=False)
209
+
210
+ rating_loss = self.model.rating_task(
211
+ labels=batch["rating"],
212
+ predictions=rating_predictions
213
+ )
214
+
215
+ similarities = tf.reduce_sum(user_embeddings * item_embeddings, axis=1)
216
+ retrieval_loss = self.model.retrieval_loss(
217
+ batch["rating"],
218
+ tf.nn.sigmoid(similarities)
219
+ )
220
+
221
+ total_loss = (
222
+ self.model.rating_weight * rating_loss +
223
+ self.model.retrieval_weight * retrieval_loss
224
+ )
225
+
226
+ val_losses['total_loss'].append(total_loss)
227
+ val_losses['rating_loss'].append(rating_loss)
228
+ val_losses['retrieval_loss'].append(retrieval_loss)
229
+
230
+ # Calculate average losses
231
+ avg_train_losses = {k: tf.reduce_mean(v).numpy() for k, v in epoch_losses.items()}
232
+ avg_val_losses = {k: tf.reduce_mean(v).numpy() for k, v in val_losses.items()}
233
+
234
+ # Update history
235
+ for key in history.keys():
236
+ if key.startswith('val_'):
237
+ history[key].append(avg_val_losses[key.replace('val_', '')])
238
+ else:
239
+ history[key].append(avg_train_losses[key])
240
+
241
+ # Print progress
242
+ epoch_time = time.time() - epoch_start
243
+ print(f"Time: {epoch_time:.1f}s | Train: {avg_train_losses['total_loss']:.4f} | Val: {avg_val_losses['total_loss']:.4f}")
244
+ print(f" Rating: {avg_val_losses['rating_loss']:.4f} | Retrieval: {avg_val_losses['retrieval_loss']:.4f}")
245
+
246
+ # Early stopping and best model saving
247
+ if avg_val_losses['total_loss'] < best_val_loss:
248
+ best_val_loss = avg_val_losses['total_loss']
249
+ patience_counter = 0
250
+ self.save_model("_best")
251
+ print(" ✅ Best model saved!")
252
+ else:
253
+ patience_counter += 1
254
+ if patience_counter >= self.patience:
255
+ print(f"Early stopping at epoch {epoch + 1}")
256
+ break
257
+
258
+ print("Joint training completed!")
259
+ return history
260
+
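The loop above implements early stopping by tracking the best validation loss and a patience counter. The same stopping rule can be sketched in isolation (a simplified stand-in for the in-loop bookkeeping):

```python
def should_stop(val_losses, patience):
    """True once the best validation loss is at least `patience` epochs old."""
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return len(val_losses) - 1 - best_epoch >= patience
```

With `patience = 20` as configured above, training continues as long as any of the last 20 epochs improved on the best validation loss seen so far.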
261
+ def generate_item_embeddings(self, training_features: Dict):
262
+ """Generate item embeddings for FAISS index."""
263
+
264
+ print("Generating item embeddings...")
265
+
266
+ # Get all unique items from training data
267
+ unique_items = np.unique(training_features['product_id'])
268
+ item_embeddings = {}
269
+
270
+ # Process in batches
271
+ batch_size = 1000
272
+ for i in range(0, len(unique_items), batch_size):
273
+ batch_items = unique_items[i:i+batch_size]
274
+
275
+ # Create batch features
276
+ batch_features = {
277
+ 'product_id': batch_items,
278
+ 'category_id': training_features['category_id'][:len(batch_items)],
279
+ 'brand_id': training_features['brand_id'][:len(batch_items)],
280
+ 'price': training_features['price'][:len(batch_items)]
281
+ }
282
+
283
+ # Convert to tensors
284
+ batch_tensors = {k: tf.constant(v) for k, v in batch_features.items()}
285
+
286
+ # Get embeddings
287
+ embeddings = self.item_tower(batch_tensors, training=False)
288
+
289
+ # Store embeddings
290
+ for j, item_id in enumerate(batch_items):
291
+ # Map back from vocab index to actual item ID
292
+ actual_item_id = item_id  # Assumes vocab index == raw item ID; use an inverse vocabulary lookup if they differ
293
+ item_embeddings[actual_item_id] = embeddings[j].numpy()
294
+
295
+ print(f"Generated embeddings for {len(item_embeddings)} items")
296
+ return item_embeddings
297
+
298
+ def save_model(self, suffix=""):
299
+ """Save trained models."""
300
+
301
+ save_path = "src/artifacts/"
302
+ os.makedirs(save_path, exist_ok=True)
303
+
304
+ # Save model weights
305
+ self.user_tower.save_weights(f"{save_path}/user_tower_weights{suffix}")
306
+ self.item_tower.save_weights(f"{save_path}/item_tower_weights_finetuned{suffix}")
307
+ self.model.rating_model.save_weights(f"{save_path}/rating_model_weights{suffix}")
308
+
309
+ # Save item tower config for inference
310
+ with open(f"{save_path}/item_tower_config.txt", 'w') as f:
311
+ f.write(f"embedding_dim: {self.embedding_dim}\n")
312
+ f.write(f"hidden_dims: [256, 128]\n") # Item tower architecture
313
+ f.write(f"dropout_rate: 0.2\n")
314
+
315
+ if not suffix:
316
+ print("Final model saved")
317
+
318
+
319
+ def run_fast_joint_training():
320
+ """Run fast optimized joint training."""
321
+
322
+ print("\n" + "="*60)
323
+ print("FAST JOINT TRAINING MODE")
324
+ print("="*60)
325
+
326
+ # Initialize fast trainer
327
+ trainer = FastJointTrainer()
328
+
329
+ # Check if we need to prepare data first
330
+ if not os.path.exists("src/artifacts/training_features.pkl"):
331
+ print("Preparing data first...")
332
+ single_trainer = SingleJointTrainer()
333
+ training_features, validation_features = single_trainer.prepare_data()
334
+
335
+ # Run fast training
336
+ trainer.load_components()
337
+
338
+ print("Loading training data...")
339
+ with open("src/artifacts/training_features.pkl", 'rb') as f:
340
+ training_features = pickle.load(f)
341
+ with open("src/artifacts/validation_features.pkl", 'rb') as f:
342
+ validation_features = pickle.load(f)
343
+
344
+ start_time = time.time()
345
+ trainer.train_fast(training_features, validation_features)
346
+ training_time = time.time() - start_time
347
+
348
+ # Generate embeddings and build FAISS index
349
+ print("Building FAISS index...")
350
+ # Use single trainer for embedding generation
351
+ single_trainer = SingleJointTrainer()
352
+ single_trainer.data_processor = DataProcessor()
353
+ single_trainer.data_processor.load_vocabularies("src/artifacts/vocabularies.pkl")
354
+ single_trainer.item_tower = trainer.item_tower
355
+
356
+ item_embeddings = single_trainer.generate_item_embeddings(training_features)
357
+
358
+ faiss_index = FAISSItemIndex()
359
+ faiss_index.build_index(item_embeddings)
360
+ faiss_index.save_index("src/artifacts/")
361
+
362
+ return training_time
363
+
364
+
365
+ def run_regular_joint_training():
366
+ """Run regular comprehensive joint training."""
367
+
368
+ print("\n" + "="*60)
369
+ print("REGULAR JOINT TRAINING MODE")
370
+ print("="*60)
371
+
372
+ # Initialize trainer
373
+ trainer = SingleJointTrainer()
374
+
375
+ # Prepare data
376
+ training_features, validation_features = trainer.prepare_data()
377
+
378
+ # Build models from scratch
379
+ trainer.build_models()
380
+
381
+ # Train joint model
382
+ start_time = time.time()
383
+ history = trainer.train_joint_model(training_features, validation_features)
384
+ training_time = time.time() - start_time
385
+
386
+ # Generate item embeddings
387
+ item_embeddings = trainer.generate_item_embeddings(training_features)
388
+
389
+ # Build FAISS index
390
+ print("Building FAISS index...")
391
+ faiss_index = FAISSItemIndex()
392
+ faiss_index.build_index(item_embeddings)
393
+ faiss_index.save_index("src/artifacts/")
394
+
395
+ # Save final model
396
+ trainer.save_model()
397
+
398
+ # Save training history
399
+ with open("src/artifacts/single_joint_training_history.pkl", 'wb') as f:
400
+ pickle.dump(history, f)
401
+
402
+ return training_time, history
403
+
404
+
405
+ def main():
406
+ """Main function to run single joint training pipeline."""
407
+
408
+ parser = argparse.ArgumentParser(description='Single Joint Training Pipeline')
409
+ parser.add_argument('--fast', action='store_true', help='Use fast training mode')
410
+ args = parser.parse_args()
411
+
412
+ print("🚀 STARTING SINGLE JOINT TRAINING PIPELINE")
413
+ print(f"Working directory: {os.getcwd()}")
414
+ print(f"Training mode: {'FAST' if args.fast else 'REGULAR'}")
415
+
416
+ total_start_time = time.time()
417
+
418
+ try:
419
+ if args.fast:
420
+ training_time = run_fast_joint_training()
421
+ history = None
422
+ else:
423
+ training_time, history = run_regular_joint_training()
424
+
425
+ total_time = time.time() - total_start_time
426
+
427
+ print("\n" + "="*60)
428
+ print("🎉 SINGLE JOINT TRAINING COMPLETED SUCCESSFULLY!")
429
+ print("="*60)
430
+ print(f"Training time: {training_time:.2f} seconds ({training_time/60:.1f} minutes)")
431
+ print(f"Total time: {total_time:.2f} seconds ({total_time/60:.1f} minutes)")
432
+ print(f"Artifacts saved in: src/artifacts/")
433
+
434
+ print("\nKey files generated:")
435
+ print(" - user_tower_weights_best: Trained user tower")
436
+ print(" - item_tower_weights_finetuned_best: Trained item tower")
437
+ print(" - rating_model_weights_best: Rating prediction model")
438
+ print(" - faiss_index.index: Item similarity index")
439
+ print(" - vocabularies.pkl: Feature vocabularies")
440
+
441
+ if history:
442
+ print(f"\n🔥 Best validation loss: {min(history['val_total_loss']):.4f}")
443
+
444
+ print(f"\n🎯 Training approach: Single-phase joint optimization")
445
+ print("✅ Ready to run inference with api/main.py!")
446
+
447
+ except Exception as e:
448
+ print(f"\n❌ Training failed with error: {str(e)}")
449
+ raise
450
+
451
+
452
+ if __name__ == "__main__":
453
+ main()
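Once both towers are trained, retrieval reduces to scoring candidate item embeddings against a user embedding and keeping the top k, which the FAISS index accelerates. A brute-force NumPy equivalent is useful as a reference when validating the index (the function name is illustrative, not from this repo):

```python
import numpy as np

def top_k_items(user_emb, item_embs, k):
    """Dot-product score every item row against the user embedding and
    return the indices and scores of the k best, best first."""
    scores = item_embs @ user_emb
    top = np.argsort(-scores)[:k]
    return top.tolist(), scores[top].tolist()
```

For inner-product search this matches what `IndexFlatIP`-style exact FAISS indexes return, at O(n·d) cost per query.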
src/inference/enhanced_recommendation_engine_128d.py ADDED
@@ -0,0 +1,499 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Enhanced recommendation engine using 128D embeddings with diversity regularization.
4
+ """
5
+
6
+ import numpy as np
7
+ import pandas as pd
8
+ import tensorflow as tf
9
+ import pickle
10
+ import os
11
+ from typing import Dict, List, Tuple, Optional
12
+ from collections import Counter, defaultdict
13
+
14
+ import sys
15
+ sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(__file__))))
16
+
17
+ from src.models.enhanced_two_tower import EnhancedItemTower, EnhancedUserTower
18
+ from src.inference.faiss_index import FAISSItemIndex
19
+ from src.preprocessing.data_loader import DataProcessor
20
+ from src.preprocessing.user_data_preparation import prepare_user_features
21
+ from src.utils.real_user_selector import RealUserSelector
22
+
23
+
24
+ class Enhanced128DRecommendationEngine:
25
+ """Enhanced recommendation engine with 128D embeddings and all improvements."""
26
+
27
+ def __init__(self, artifacts_path: str = "src/artifacts/"):
28
+ self.artifacts_path = artifacts_path
29
+ self.embedding_dim = 128 # Fixed to 128D
30
+
31
+ # Model components
32
+ self.item_tower = None
33
+ self.user_tower = None
34
+ self.rating_model = None
35
+ self.faiss_index = None
36
+ self.data_processor = None
37
+
38
+ # Data
39
+ self.items_df = None
40
+ self.users_df = None
41
+ self.income_thresholds = None
42
+
43
+ # Load all components
44
+ self._load_all_components()
45
+
46
+ def _load_all_components(self):
47
+ """Load all enhanced model components."""
48
+
49
+ print("Loading enhanced 128D recommendation engine...")
50
+
51
+ # Load data processor
52
+ self.data_processor = DataProcessor()
53
+ try:
54
+ self.data_processor.load_vocabularies(f"{self.artifacts_path}/vocabularies.pkl")
55
+ except FileNotFoundError:
56
+ print("❌ Vocabularies not found. Please train the model first.")
57
+ return
58
+
59
+ # Load datasets
60
+ self.items_df = pd.read_csv("datasets/items.csv")
61
+ self.users_df = pd.read_csv("datasets/users.csv")
62
+
63
+ # Load enhanced model components
64
+ self._load_enhanced_models()
65
+
66
+ # Load FAISS index with 128D
67
+ try:
68
+ self.faiss_index = FAISSItemIndex(embedding_dim=self.embedding_dim)
69
+ # Try to load enhanced embeddings first
70
+ if os.path.exists(f"{self.artifacts_path}/enhanced_item_embeddings.npy"):
71
+ enhanced_embeddings = np.load(
72
+ f"{self.artifacts_path}/enhanced_item_embeddings.npy",
73
+ allow_pickle=True
74
+ ).item()
75
+ self.faiss_index.build_index(enhanced_embeddings)
76
+ print("✅ Loaded enhanced 128D FAISS index")
77
+ else:
78
+ print("⚠️ Enhanced embeddings not found. Train enhanced model first.")
79
+ self.faiss_index = None
80
+ except Exception as e:
81
+ print(f"⚠️ Could not load FAISS index: {e}")
82
+ self.faiss_index = None
83
+
84
+ # Load income thresholds for categorical demographics
85
+ self._load_income_thresholds()
86
+
87
+ print("✅ Enhanced 128D engine loaded successfully!")
88
+
89
+ def _load_enhanced_models(self):
90
+ """Load enhanced model components."""
91
+
92
+ try:
93
+ # Create model architecture
94
+ self.item_tower = EnhancedItemTower(
95
+ item_vocab_size=len(self.data_processor.item_vocab),
96
+ category_vocab_size=len(self.data_processor.category_vocab),
97
+ brand_vocab_size=len(self.data_processor.brand_vocab),
98
+ embedding_dim=self.embedding_dim,
99
+ use_bias=True,
100
+ use_diversity_reg=False # Disable during inference
101
+ )
102
+
103
+ self.user_tower = EnhancedUserTower(
104
+ max_history_length=50,
105
+ embedding_dim=self.embedding_dim,
106
+ use_bias=True,
107
+ use_diversity_reg=False # Disable during inference
108
+ )
109
+
110
+ # Create rating model
111
+ self.rating_model = tf.keras.Sequential([
112
+ tf.keras.layers.Dense(512, activation="relu"),
113
+ tf.keras.layers.BatchNormalization(),
114
+ tf.keras.layers.Dropout(0.3),
115
+ tf.keras.layers.Dense(256, activation="relu"),
116
+ tf.keras.layers.BatchNormalization(),
117
+ tf.keras.layers.Dropout(0.2),
118
+ tf.keras.layers.Dense(64, activation="relu"),
119
+ tf.keras.layers.Dense(1, activation="sigmoid")
120
+ ])
121
+
122
+ # Load weights - try enhanced first, fall back to regular
123
+ model_files = [
124
+ ('enhanced_item_tower_weights_enhanced_best', 'enhanced_user_tower_weights_enhanced_best', 'enhanced_rating_model_weights_enhanced_best'),
125
+ ('enhanced_item_tower_weights_enhanced_final', 'enhanced_user_tower_weights_enhanced_final', 'enhanced_rating_model_weights_enhanced_final'),
126
+ ]
127
+
128
+ loaded = False
129
+ for item_file, user_file, rating_file in model_files:
130
+ try:
131
+ # Need to build models first with dummy data
132
+ self._build_models()
133
+
134
+ self.item_tower.load_weights(f"{self.artifacts_path}/{item_file}")
135
+ self.user_tower.load_weights(f"{self.artifacts_path}/{user_file}")
136
+ self.rating_model.load_weights(f"{self.artifacts_path}/{rating_file}")
137
+
138
+ print(f"✅ Loaded enhanced model: {item_file}")
139
+ loaded = True
140
+ break
141
+ except Exception as e:
142
+ print(f"⚠️ Could not load {item_file}: {e}")
143
+ continue
144
+
145
+ if not loaded:
146
+ print("❌ No enhanced model weights found. Please train enhanced model first.")
147
+ self.item_tower = None
148
+ self.user_tower = None
149
+ self.rating_model = None
150
+
151
+ except Exception as e:
152
+ print(f"❌ Failed to load enhanced models: {e}")
153
+ self.item_tower = None
154
+ self.user_tower = None
155
+ self.rating_model = None
156
+
157
+ def _build_models(self):
158
+ """Build models with dummy data to initialize weights."""
159
+
160
+ # Dummy item features
161
+ dummy_item_features = {
162
+ 'product_id': tf.constant([0]),
163
+ 'category_id': tf.constant([0]),
164
+ 'brand_id': tf.constant([0]),
165
+ 'price': tf.constant([100.0])
166
+ }
167
+
168
+ # Dummy user features
169
+ dummy_user_features = {
170
+ 'age': tf.constant([2]), # Adult category
171
+ 'gender': tf.constant([0]), # Female
172
+ 'income': tf.constant([2]), # Middle income
173
+ 'item_history_embeddings': tf.constant(np.zeros((1, 50, self.embedding_dim), dtype=np.float32))
174
+ }
175
+
176
+ # Forward pass to build models
177
+ _ = self.item_tower(dummy_item_features, training=False)
178
+ _ = self.user_tower(dummy_user_features, training=False)
179
+
180
+ # Build rating model
181
+ dummy_concat = tf.constant(np.zeros((1, self.embedding_dim * 2), dtype=np.float32))
182
+ _ = self.rating_model(dummy_concat, training=False)
183
+
184
+ def _load_income_thresholds(self):
185
+ """Load income thresholds for categorical processing."""
186
+
187
+ # Calculate income thresholds from training data
188
+ user_incomes = self.users_df['income'].values
189
+ self.income_thresholds = np.percentile(user_incomes, [0, 20, 40, 60, 80, 100])
190
+ print(f"Income thresholds: {self.income_thresholds}")
191
+
192
+ def categorize_age(self, age: float) -> int:
193
+ """Categorize age into 6 groups."""
194
+ if age < 18: return 0 # Teen
195
+ elif age < 26: return 1 # Young Adult
196
+ elif age < 36: return 2 # Adult
197
+ elif age < 51: return 3 # Middle Age
198
+ elif age < 66: return 4 # Mature
199
+ else: return 5 # Senior
200
+
201
+ def categorize_income(self, income: float) -> int:
202
+ """Categorize income into 5 percentile groups."""
203
+ category = np.digitize([income], self.income_thresholds[1:-1])[0]
204
+ return min(max(category, 0), 4)
205
+
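`categorize_income` clamps the output of `numpy.digitize` against the four interior percentile cut points (`income_thresholds[1:-1]`). A self-contained version with hypothetical threshold values shows the boundary behavior:

```python
import numpy as np

# Hypothetical percentile values [0th, 20th, 40th, 60th, 80th, 100th]
thresholds = np.array([0.0, 20.0, 40.0, 60.0, 80.0, 100.0])

def categorize_income(income):
    """Bucket an income into one of 5 percentile bands (0..4)."""
    category = np.digitize([income], thresholds[1:-1])[0]
    return int(min(max(category, 0), 4))
```

Values at a cut point fall into the higher band (`digitize` uses `bins[i-1] <= x < bins[i]` by default), and the final clamp keeps out-of-range incomes in band 0 or 4.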
206
+ def categorize_gender(self, gender: str) -> int:
207
+ """Categorize gender."""
208
+ return 1 if gender.lower() == 'male' else 0
209
+
210
+ def get_user_embedding(self,
211
+ age: int,
212
+ gender: str,
213
+ income: float,
214
+ interaction_history: List[int] = None) -> np.ndarray:
215
+ """Generate user embedding with categorical demographics."""
216
+
217
+ if self.user_tower is None:
218
+ print("❌ User tower not loaded")
219
+ return None
220
+
221
+ # Categorize demographics
222
+ age_cat = self.categorize_age(age)
223
+ gender_cat = self.categorize_gender(gender)
224
+ income_cat = self.categorize_income(income)
225
+
226
+ # Prepare interaction history embeddings
227
+ if interaction_history is None:
228
+ interaction_history = []
229
+
230
+ # Get item embeddings for history
231
+ history_embeddings = np.zeros((50, self.embedding_dim), dtype=np.float32)
232
+
233
+ for i, item_id in enumerate(interaction_history[:50]):
234
+ if self.faiss_index and item_id in self.faiss_index.item_id_to_idx:
235
+ item_emb = self.faiss_index.get_item_embedding(item_id)
236
+ if item_emb is not None:
237
+ history_embeddings[i] = item_emb
238
+
239
+ # Create user features
240
+ user_features = {
241
+ 'age': tf.constant([age_cat]),
242
+ 'gender': tf.constant([gender_cat]),
243
+ 'income': tf.constant([income_cat]),
244
+ 'item_history_embeddings': tf.constant([history_embeddings])
245
+ }
246
+
247
+ # Get embedding
248
+ user_output = self.user_tower(user_features, training=False)
249
+ if isinstance(user_output, tuple):
250
+ user_embedding = user_output[0].numpy()[0]
251
+ else:
252
+ user_embedding = user_output.numpy()[0]
253
+
254
+ return user_embedding
255
+
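`get_user_embedding` packs the interaction history into a fixed-size, zero-padded `(50, 128)` block before the forward pass. The padding step in isolation, with dimensions shrunk for illustration (the function name is illustrative):

```python
import numpy as np

def pad_history(embs, max_len=50, dim=128):
    """Stack up to max_len item embeddings into a fixed (max_len, dim) array,
    zero-padding the tail and truncating anything past max_len."""
    out = np.zeros((max_len, dim), dtype=np.float32)
    for i, e in enumerate(embs[:max_len]):
        out[i] = e
    return out
```

Unknown items (missing from the FAISS index) simply stay as zero rows, which is also how an empty history is represented.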
256
+ def get_item_embedding(self, item_id: int) -> Optional[np.ndarray]:
257
+ """Get item embedding."""
258
+
259
+ if self.faiss_index:
260
+ return self.faiss_index.get_item_embedding(item_id)
261
+
262
+ # Fallback to model computation
263
+ if self.item_tower is None:
264
+ return None
265
+
266
+ item_row = self.items_df[self.items_df['product_id'] == item_id]
267
+ if item_row.empty:
268
+ return None
269
+
270
+ item_data = item_row.iloc[0]
271
+
272
+ # Prepare features
273
+ item_features = {
274
+ 'product_id': tf.constant([self.data_processor.item_vocab.get(item_id, 0)]),
275
+ 'category_id': tf.constant([self.data_processor.category_vocab.get(item_data['category_id'], 0)]),
276
+ 'brand_id': tf.constant([self.data_processor.brand_vocab.get(item_data.get('brand', 'unknown'), 0)]),
277
+ 'price': tf.constant([float(item_data.get('price', 0.0))])
278
+ }
279
+
280
+ # Get embedding
281
+ item_output = self.item_tower(item_features, training=False)
282
+ if isinstance(item_output, tuple):
283
+ item_embedding = item_output[0].numpy()[0]
284
+ else:
285
+ item_embedding = item_output.numpy()[0]
286
+
287
+ return item_embedding
288
+
289
+ def recommend_items_enhanced(self,
290
+ age: int,
291
+ gender: str,
292
+ income: float,
293
+ interaction_history: List[int] = None,
294
+ k: int = 10,
295
+ diversity_weight: float = 0.3,
296
+ category_boost: float = 1.5) -> List[Tuple[int, float, Dict]]:
297
+ """Generate enhanced recommendations with diversity and category boosting."""
298
+
299
+ if not self.faiss_index:
300
+ print("❌ FAISS index not available")
301
+ return []
302
+
303
+ # Get user embedding
304
+ user_embedding = self.get_user_embedding(age, gender, income, interaction_history)
305
+ if user_embedding is None:
306
+ return []
307
+
308
+ # Get candidate recommendations (more than needed for filtering)
309
+ candidates = self.faiss_index.search_by_embedding(user_embedding, k * 3)
310
+
311
+ # Filter out items from interaction history
312
+ if interaction_history:
313
+ history_set = set(interaction_history)
314
+ candidates = [(item_id, score) for item_id, score in candidates
315
+ if item_id not in history_set]
316
+
317
+ # Add item metadata and apply enhancements
318
+ enhanced_candidates = []
319
+
320
+ for item_id, similarity_score in candidates[:k * 2]:
321
+ # Get item info
322
+ item_row = self.items_df[self.items_df['product_id'] == item_id]
323
+ if item_row.empty:
324
+ continue
325
+
326
+ item_info = item_row.iloc[0].to_dict()
327
+
328
+ # Enhanced scoring with multiple factors
329
+ final_score = similarity_score
330
+
331
+ # Category boosting based on user history
332
+ if interaction_history and category_boost > 1.0:
333
+ user_categories = self._get_user_categories(interaction_history)
334
+ item_category = item_info.get('category_code', '')
335
+
336
+ if item_category in user_categories:
337
+ category_preference = user_categories[item_category]
338
+ final_score *= (1 + (category_boost - 1) * category_preference)
339
+
340
+ enhanced_candidates.append((item_id, final_score, item_info))
341
+
342
+ # Sort by enhanced scores
343
+ enhanced_candidates.sort(key=lambda x: x[1], reverse=True)
344
+
345
+ # Apply diversity filtering
346
+ if diversity_weight > 0:
347
+ diversified_candidates = self._apply_diversity_filter(
348
+ enhanced_candidates, diversity_weight
349
+ )
350
+ else:
351
+ diversified_candidates = enhanced_candidates
352
+
353
+ return diversified_candidates[:k]
354
+
355
+ def _get_user_categories(self, interaction_history: List[int]) -> Dict[str, float]:
356
+ """Get user's category preferences from history."""
357
+
358
+ category_counts = Counter()
359
+
360
+ for item_id in interaction_history:
361
+ item_row = self.items_df[self.items_df['product_id'] == item_id]
362
+ if not item_row.empty:
363
+ category = item_row.iloc[0].get('category_code', 'Unknown')
364
+ category_counts[category] += 1
365
+
366
+ # Convert to preferences (percentages)
367
+ total = sum(category_counts.values())
368
+ if total == 0:
369
+ return {}
370
+
371
+ return {cat: count / total for cat, count in category_counts.items()}
372
+
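The counting logic in `_get_user_categories` reduces to normalizing a `Counter`. A self-contained sketch with plain category labels instead of the DataFrame lookup (`category_preferences` is a hypothetical name):

```python
from collections import Counter

def category_preferences(history_categories):
    """Normalize raw category counts from the user's history into
    preference weights that sum to 1 (the DataFrame lookup is elided)."""
    counts = Counter(history_categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: n / total for cat, n in counts.items()}
```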
373
+ def _apply_diversity_filter(self,
374
+ candidates: List[Tuple[int, float, Dict]],
375
+ diversity_weight: float,
376
+ max_per_category: int = 3) -> List[Tuple[int, float, Dict]]:
377
+ """Apply diversity filtering to recommendations."""
378
+
379
+ category_counts = defaultdict(int)
380
+ diversified = []
381
+
382
+ for item_id, score, item_info in candidates:
383
+ category = item_info.get('category_code', 'Unknown')
384
+
385
+ # Apply diversity penalty
386
+ if category_counts[category] >= max_per_category:
387
+ # Penalty for over-representation
388
+ diversity_penalty = diversity_weight * (category_counts[category] - max_per_category + 1)
389
+ adjusted_score = score * (1 - diversity_penalty)
390
+ else:
391
+ adjusted_score = score
392
+
393
+ diversified.append((item_id, adjusted_score, item_info))
394
+ category_counts[category] += 1
395
+
396
+ # Re-sort by adjusted scores
397
+ diversified.sort(key=lambda x: x[1], reverse=True)
398
+ return diversified
399
+
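The penalty schedule in `_apply_diversity_filter` grows linearly with each extra item past `max_per_category`. This isolated sketch mirrors that logic on `(item_id, score, category)` tuples (a simplification of the metadata dicts used above):

```python
from collections import defaultdict

def apply_diversity_penalty(candidates, diversity_weight=0.3, max_per_category=3):
    """Down-weight an item's score once its category has already filled
    max_per_category slots, then re-sort by the adjusted scores."""
    category_counts = defaultdict(int)
    adjusted = []
    for item_id, score, category in candidates:
        if category_counts[category] >= max_per_category:
            penalty = diversity_weight * (category_counts[category] - max_per_category + 1)
            score = score * (1 - penalty)
        adjusted.append((item_id, score, category))
        category_counts[category] += 1
    adjusted.sort(key=lambda x: x[1], reverse=True)
    return adjusted
```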
400
+ def predict_rating(self,
401
+ age: int,
402
+ gender: str,
403
+ income: float,
404
+ item_id: int,
405
+ interaction_history: List[int] = None) -> float:
406
+ """Predict rating for user-item pair."""
407
+
408
+ if self.rating_model is None:
409
+ return 0.5 # Default rating
410
+
411
+ # Get embeddings
412
+ user_embedding = self.get_user_embedding(age, gender, income, interaction_history)
413
+ item_embedding = self.get_item_embedding(item_id)
414
+
415
+ if user_embedding is None or item_embedding is None:
416
+ return 0.5
417
+
418
+ # Concatenate embeddings
419
+ combined = np.concatenate([user_embedding, item_embedding])
420
+ combined = tf.constant([combined])
421
+
422
+ # Predict rating
423
+ rating = self.rating_model(combined, training=False)
424
+ return float(rating.numpy()[0][0])
425
+
426
+
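`predict_rating` feeds the rating head a single 256-d vector: the two 128-d tower outputs concatenated, plus a batch dimension. A numpy sketch of that input shape (hypothetical `rating_input` helper, no TensorFlow dependency):

```python
import numpy as np

def rating_input(user_embedding, item_embedding):
    """Concatenate the 128-d user and item embeddings into the single
    256-d vector the rating head consumes, with a batch dimension added."""
    combined = np.concatenate([user_embedding, item_embedding])
    return combined[np.newaxis, :]
```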
427
+ def demo_enhanced_engine():
428
+ """Demo the enhanced 128D recommendation engine."""
429
+
430
+ print("🚀 ENHANCED 128D RECOMMENDATION ENGINE DEMO")
431
+ print("="*70)
432
+
433
+ try:
434
+ # Initialize engine
435
+ engine = Enhanced128DRecommendationEngine()
436
+
437
+ if engine.item_tower is None:
438
+ print("❌ Enhanced model not available. Please train first using:")
439
+ print(" python train_enhanced_model.py")
440
+ return
441
+
442
+ # Get real user for testing
443
+ real_user_selector = RealUserSelector()
444
+ test_users = real_user_selector.get_real_users(n=2, min_interactions=10)
445
+
446
+ for user in test_users:
447
+ print(f"\n📊 Testing User {user['user_id']} ({user['age']}yr {user['gender']}):")
448
+ print(f" Income: ${user['income']:,}")
449
+ print(f" History: {len(user['interaction_history'])} items")
450
+
451
+ # Test enhanced recommendations
452
+ try:
453
+ recs = engine.recommend_items_enhanced(
454
+ age=user['age'],
455
+ gender=user['gender'],
456
+ income=user['income'],
457
+ interaction_history=user['interaction_history'][:20],
458
+ k=10,
459
+ diversity_weight=0.3,
460
+ category_boost=1.5
461
+ )
462
+
463
+ print(f" 🎯 Enhanced Recommendations:")
464
+ categories = []
465
+ for i, (item_id, score, item_info) in enumerate(recs[:5]):
466
+ category = item_info.get('category_code', 'Unknown')[:30]
467
+ price = item_info.get('price', 0)
468
+ categories.append(category)
469
+ print(f" #{i+1} Item {item_id}: {score:.4f} | ${price:.2f} | {category}")
470
+
471
+ # Analyze diversity
472
+ unique_categories = len(set(categories))
473
+ print(f" 📈 Diversity: {unique_categories}/{len(categories)} unique categories")
474
+
475
+ # Test rating prediction
476
+ if recs:
477
+ test_item = recs[0][0]
478
+ predicted_rating = engine.predict_rating(
479
+ age=user['age'],
480
+ gender=user['gender'],
481
+ income=user['income'],
482
+ item_id=test_item,
483
+ interaction_history=user['interaction_history'][:20]
484
+ )
485
+ print(f" ⭐ Rating prediction for item {test_item}: {predicted_rating:.3f}")
486
+
487
+ except Exception as e:
488
+ print(f" ❌ Error: {e}")
489
+
490
+ print(f"\n✅ Enhanced 128D engine demo completed!")
491
+
492
+ except Exception as e:
493
+ print(f"❌ Demo failed: {e}")
494
+ import traceback
495
+ traceback.print_exc()
496
+
497
+
498
+ if __name__ == "__main__":
499
+ demo_enhanced_engine()
src/inference/faiss_index.py CHANGED
@@ -8,7 +8,7 @@ from typing import Dict, List, Tuple, Optional
8
  class FAISSItemIndex:
9
  """FAISS-based item similarity search index."""
10
 
11
- def __init__(self, embedding_dim: int = 64):
12
  self.embedding_dim = embedding_dim
13
  self.index = None
14
  self.item_id_to_idx = {}
@@ -40,14 +40,10 @@ class FAISSItemIndex:
40
  # Exact search (slower but accurate)
41
  self.index = faiss.IndexFlatIP(self.embedding_dim)
42
  elif index_type == "IVF":
43
- # Approximate search (faster)
44
- nlist = min(100, len(item_ids) // 10) # Number of clusters
45
- quantizer = faiss.IndexFlatIP(self.embedding_dim)
46
- self.index = faiss.IndexIVFFlat(quantizer, self.embedding_dim, nlist)
47
-
48
- # Train the index
49
- self.index.train(embeddings_array)
50
- self.index.nprobe = min(10, nlist) # Search in top 10 clusters
51
  else:
52
  raise ValueError(f"Unsupported index type: {index_type}")
53
 
@@ -134,6 +130,7 @@ class FAISSItemIndex:
134
  sample_queries = list(self.item_id_to_idx.keys())[:5]
135
 
136
  print("Validating FAISS index...")
 
137
 
138
  for query_item in sample_queries:
139
  if query_item not in self.item_id_to_idx:
@@ -141,9 +138,16 @@ class FAISSItemIndex:
141
 
142
  similar_items = self.search_similar_items(query_item, k=5)
143
 
144
- print(f"\nSimilar items to {query_item}:")
145
- for item_id, score in similar_items:
146
- print(f" Item {item_id}: similarity = {score:.4f}")
147
 
148
  def save_index(self, save_path: str = "src/artifacts/") -> None:
149
  """Save FAISS index and mappings."""
@@ -197,7 +201,7 @@ def main():
197
 
198
  # Create and build FAISS index
199
  print("Building FAISS index...")
200
- faiss_index = FAISSItemIndex(embedding_dim=64)
201
  faiss_index.build_index(item_embeddings, index_type="IVF")
202
 
203
  # Validate index
 
8
  class FAISSItemIndex:
9
  """FAISS-based item similarity search index."""
10
 
11
+ def __init__(self, embedding_dim: int = 128):
12
  self.embedding_dim = embedding_dim
13
  self.index = None
14
  self.item_id_to_idx = {}
 
40
  # Exact search (slower but accurate)
41
  self.index = faiss.IndexFlatIP(self.embedding_dim)
42
  elif index_type == "IVF":
43
+ # On CPU, use exact search (IndexFlatIP) for better accuracy;
44
+ # IVF's approximate search mainly pays off on GPU or at much larger scale
45
+ print("Using IndexFlatIP for CPU (exact search)")
46
+ self.index = faiss.IndexFlatIP(self.embedding_dim)
47
  else:
48
  raise ValueError(f"Unsupported index type: {index_type}")
49
 
 
130
  sample_queries = list(self.item_id_to_idx.keys())[:5]
131
 
132
  print("Validating FAISS index...")
133
+ print("Note: Higher similarity scores = more similar items (cosine similarity)")
134
 
135
  for query_item in sample_queries:
136
  if query_item not in self.item_id_to_idx:
 
138
 
139
  similar_items = self.search_similar_items(query_item, k=5)
140
 
141
+ print(f"\nSimilar items to {query_item} (sorted by similarity DESC):")
142
+ for i, (item_id, score) in enumerate(similar_items):
143
+ print(f" #{i+1} Item {item_id}: similarity = {score:.4f}")
144
+
145
+ # Check if scores are properly ordered (descending)
146
+ scores = [score for _, score in similar_items]
147
+ if len(scores) > 1 and not all(scores[i] >= scores[i+1] for i in range(len(scores)-1)):
148
+ print(f" WARNING: Scores not in descending order! {scores}")
149
+ else:
150
+ print(f" ✓ Scores properly ordered (most to least similar)")
151
 
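The ordering check added above relies on `IndexFlatIP` returning inner products in descending order; for L2-normalized embeddings the inner product is exactly cosine similarity. A small numpy sketch of that equivalence (hypothetical helper, no FAISS dependency):

```python
import numpy as np

def top_k_by_inner_product(query, items, k=3):
    """Inner product of L2-normalized vectors equals cosine similarity
    (what IndexFlatIP returns for pre-normalized embeddings); results
    come back most-similar first."""
    q = query / np.linalg.norm(query)
    m = items / np.linalg.norm(items, axis=1, keepdims=True)
    scores = m @ q
    order = [int(i) for i in np.argsort(-scores)[:k]]
    return order, [float(scores[i]) for i in order]
```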
152
  def save_index(self, save_path: str = "src/artifacts/") -> None:
153
  """Save FAISS index and mappings."""
 
201
 
202
  # Create and build FAISS index
203
  print("Building FAISS index...")
204
+ faiss_index = FAISSItemIndex(embedding_dim=128)
205
  faiss_index.build_index(item_embeddings, index_type="IVF")
206
 
207
  # Validate index
src/inference/recommendation_engine.py CHANGED
@@ -129,8 +129,8 @@ class RecommendationEngine:
129
 
130
  self.user_tower = UserTower(
131
  max_history_length=50,
132
- embedding_dim=64,
133
- hidden_dims=[128, 64],
134
  dropout_rate=0.2
135
  )
136
 
@@ -139,7 +139,7 @@ class RecommendationEngine:
139
  'age': tf.constant([2]), # Adult category (26-35)
140
  'gender': tf.constant([1]), # Male
141
  'income': tf.constant([2]), # Middle income category
142
- 'item_history_embeddings': tf.constant([[[0.0] * 64] * 50])
143
  }
144
  _ = self.user_tower(dummy_input)
145
 
@@ -182,7 +182,7 @@ class RecommendationEngine:
182
  ])
183
 
184
  # Build model with dummy input (concatenated user and item embeddings)
185
- dummy_input = tf.constant([[0.0] * 128]) # 64 + 64 = 128
186
  _ = self.rating_model(dummy_input)
187
 
188
  try:
@@ -216,14 +216,16 @@ class RecommendationEngine:
216
  history_embeddings.append(embedding)
217
  else:
218
  # Use zero embedding for unknown items
219
- history_embeddings.append(np.zeros(64))
220
 
221
  # Pad or truncate to max_history_length
222
  max_history_length = 50
223
  if len(history_embeddings) < max_history_length:
224
- padding = [np.zeros(64)] * (max_history_length - len(history_embeddings))
225
- history_embeddings = padding + history_embeddings
 
226
  else:
 
227
  history_embeddings = history_embeddings[-max_history_length:]
228
 
229
  history_embeddings = np.array(history_embeddings, dtype=np.float32)
@@ -301,27 +303,52 @@ class RecommendationEngine:
301
  income: float,
302
  interaction_history: List[int] = None,
303
  k: int = 10,
304
- exclude_history: bool = True) -> List[Tuple[int, float, Dict]]:
305
- """Generate recommendations using collaborative filtering (user-item similarity)."""
 
306
 
307
  # Get user embedding
308
  user_embedding = self.get_user_embedding(age, gender, income, interaction_history)
309
 
310
- # Find similar items using FAISS
311
- similar_items = self.faiss_index.search_by_embedding(user_embedding, k * 2)
312
 
313
- # Filter out interaction history if requested
314
- if exclude_history and interaction_history:
315
- history_set = set(interaction_history)
316
- similar_items = [(item_id, score) for item_id, score in similar_items
317
- if item_id not in history_set]
318
 
319
- # Take top k
320
- similar_items = similar_items[:k]
 
321
 
322
  # Add item metadata
323
  recommendations = []
324
- for item_id, score in similar_items:
325
  item_info = self._get_item_info(item_id)
326
  recommendations.append((item_id, score, item_info))
327
 
 
129
 
130
  self.user_tower = UserTower(
131
  max_history_length=50,
132
+ embedding_dim=128, # Changed from 64 to 128
133
+ hidden_dims=[128, 64], # Match training architecture
134
  dropout_rate=0.2
135
  )
136
 
 
139
  'age': tf.constant([2]), # Adult category (26-35)
140
  'gender': tf.constant([1]), # Male
141
  'income': tf.constant([2]), # Middle income category
142
+ 'item_history_embeddings': tf.constant([[[0.0] * 128] * 50]) # Changed from 64 to 128
143
  }
144
  _ = self.user_tower(dummy_input)
145
 
 
182
  ])
183
 
184
  # Build model with dummy input (concatenated user and item embeddings)
185
+ dummy_input = tf.constant([[0.0] * 256]) # 128 + 128 = 256
186
  _ = self.rating_model(dummy_input)
187
 
188
  try:
 
216
  history_embeddings.append(embedding)
217
  else:
218
  # Use zero embedding for unknown items
219
+ history_embeddings.append(np.zeros(128)) # Changed from 64 to 128
220
 
221
  # Pad or truncate to max_history_length
222
  max_history_length = 50
223
  if len(history_embeddings) < max_history_length:
224
+ # Add padding at the END so real interactions are at the BEGINNING
225
+ padding = [np.zeros(128)] * (max_history_length - len(history_embeddings))
226
+ history_embeddings = history_embeddings + padding
227
  else:
228
+ # Keep most recent interactions
229
  history_embeddings = history_embeddings[-max_history_length:]
230
 
231
  history_embeddings = np.array(history_embeddings, dtype=np.float32)
 
303
  income: float,
304
  interaction_history: List[int] = None,
305
  k: int = 10,
306
+ exclude_history: bool = True,
307
+ category_boost: float = 1.3) -> List[Tuple[int, float, Dict]]:
308
+ """Generate recommendations using collaborative filtering with category awareness."""
309
 
310
  # Get user embedding
311
  user_embedding = self.get_user_embedding(age, gender, income, interaction_history)
312
 
313
+ # Find similar items using FAISS (get more candidates for boosting)
314
+ similar_items = self.faiss_index.search_by_embedding(user_embedding, k * 4)
315
 
316
+ # Get user's preferred categories from interaction history
317
+ user_categories = set()
318
+ if interaction_history:
319
+ for item_id in interaction_history[-10:]: # Focus on recent interactions
320
+ item_row = self.items_df[self.items_df['product_id'] == item_id]
321
+ if len(item_row) > 0:
322
+ user_categories.add(item_row.iloc[0]['category_code'])
323
+
324
+ # Filter out interaction history and apply category boosting
325
+ boosted_items = []
326
+ history_set = set(interaction_history) if (exclude_history and interaction_history) else set()
327
+
328
+ for item_id, score in similar_items:
329
+ if item_id in history_set:
330
+ continue
331
+
332
+ # Get item category
333
+ item_row = self.items_df[self.items_df['product_id'] == item_id]
334
+ if len(item_row) > 0:
335
+ item_category = item_row.iloc[0]['category_code']
336
+
337
+ # Boost score if item is in user's preferred categories
338
+ if item_category in user_categories:
339
+ boosted_score = score * category_boost
340
+ else:
341
+ boosted_score = score
342
+
343
+ boosted_items.append((item_id, boosted_score))
344
 
345
+ # Sort by boosted score and take top k
346
+ boosted_items.sort(key=lambda x: x[1], reverse=True)
347
+ boosted_items = boosted_items[:k]
348
 
349
  # Add item metadata
350
  recommendations = []
351
+ for item_id, score in boosted_items:
352
  item_info = self._get_item_info(item_id)
353
  recommendations.append((item_id, score, item_info))
354
 
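The category-aware re-ranking added to `recommend_items` boils down to a multiply-then-sort over FAISS candidates. A minimal sketch on `(item_id, score, category)` tuples (`boost_by_category` is a hypothetical name; the DataFrame lookup is elided):

```python
def boost_by_category(candidates, user_categories, category_boost=1.3):
    """Re-rank candidates: multiply the similarity score of items whose
    category appears in the user's recent history, then sort descending."""
    boosted = []
    for item_id, score, category in candidates:
        if category in user_categories:
            score *= category_boost
        boosted.append((item_id, score))
    boosted.sort(key=lambda x: x[1], reverse=True)
    return boosted
```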
src/models/enhanced_two_tower.py ADDED
@@ -0,0 +1,574 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Enhanced two-tower model with embedding diversity regularization and improved discrimination.
4
+ """
5
+
6
+ import tensorflow as tf
7
+ import tensorflow_recommenders as tfrs
8
+ import numpy as np
9
+
10
+
11
+ class EmbeddingDiversityRegularizer(tf.keras.layers.Layer):
12
+ """Regularizer to prevent embedding collapse by enforcing diversity."""
13
+
14
+ def __init__(self, diversity_weight=0.01, orthogonality_weight=0.05, **kwargs):
15
+ super().__init__(**kwargs)
16
+ self.diversity_weight = diversity_weight
17
+ self.orthogonality_weight = orthogonality_weight
18
+
19
+ def call(self, embeddings):
20
+ """Apply diversity regularization to embeddings."""
21
+ batch_size = tf.shape(embeddings)[0]
22
+
23
+ # Compute pairwise cosine similarities
24
+ normalized_embeddings = tf.nn.l2_normalize(embeddings, axis=1)
25
+ similarity_matrix = tf.linalg.matmul(
26
+ normalized_embeddings, normalized_embeddings, transpose_b=True
27
+ )
28
+
29
+ # Remove diagonal (self-similarities)
30
+ mask = 1.0 - tf.eye(batch_size)
31
+ masked_similarities = similarity_matrix * mask
32
+
33
+ # Diversity loss: penalize high similarities between different embeddings
34
+ diversity_loss = tf.reduce_mean(tf.square(masked_similarities))
35
+
36
+ # Orthogonality loss: encourage embeddings to be orthogonal
37
+ identity_target = tf.eye(batch_size)
38
+ orthogonality_loss = tf.reduce_mean(
39
+ tf.square(similarity_matrix - identity_target)
40
+ )
41
+
42
+ # Add as regularization losses
43
+ self.add_loss(self.diversity_weight * diversity_loss)
44
+ self.add_loss(self.orthogonality_weight * orthogonality_loss)
45
+
46
+ return embeddings
47
+
48
+
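The diversity term above is the mean squared off-diagonal cosine similarity of the batch: zero when embeddings are mutually orthogonal, large when they collapse. A numpy sketch of that scalar (hypothetical `diversity_loss` helper, no TensorFlow dependency):

```python
import numpy as np

def diversity_loss(embeddings):
    """Mean squared off-diagonal cosine similarity of a batch of embeddings:
    zero for mutually orthogonal rows, large when embeddings collapse."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    off_diag = sim * (1.0 - np.eye(sim.shape[0]))
    return float(np.mean(off_diag ** 2))
```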
49
+ class AdaptiveTemperatureScaling(tf.keras.layers.Layer):
50
+ """Advanced temperature scaling with learned parameters."""
51
+
52
+ def __init__(self, initial_temperature=1.0, min_temp=0.1, max_temp=5.0, **kwargs):
53
+ super().__init__(**kwargs)
54
+ self.initial_temperature = initial_temperature
55
+ self.min_temp = min_temp
56
+ self.max_temp = max_temp
57
+
58
+ def build(self, input_shape):
59
+ # Learnable temperature with constraints
60
+ self.raw_temperature = self.add_weight(
61
+ name='raw_temperature',
62
+ shape=(),
63
+ initializer=tf.keras.initializers.Constant(
64
+ np.log(self.initial_temperature - self.min_temp)
65
+ ),
66
+ trainable=True
67
+ )
68
+
69
+ # Learnable bias term for better discrimination
70
+ self.similarity_bias = self.add_weight(
71
+ name='similarity_bias',
72
+ shape=(),
73
+ initializer=tf.keras.initializers.Zeros(),
74
+ trainable=True
75
+ )
76
+
77
+ super().build(input_shape)
78
+
79
+ def call(self, user_embeddings, item_embeddings):
80
+ """Compute adaptive temperature-scaled similarity with bias."""
81
+ # Constrain temperature to valid range
82
+ temperature = self.min_temp + tf.nn.softplus(self.raw_temperature)
83
+ temperature = tf.minimum(temperature, self.max_temp)
84
+
85
+ # Compute similarities
86
+ similarities = tf.reduce_sum(user_embeddings * item_embeddings, axis=1)
87
+
88
+ # Add learnable bias and apply temperature scaling
89
+ scaled_similarities = (similarities + self.similarity_bias) / temperature
90
+
91
+ return scaled_similarities, temperature
92
+
93
+
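The `min_temp + softplus(raw)` parameterization used by `AdaptiveTemperatureScaling` keeps the learned temperature strictly above `min_temp`, with a hard cap at `max_temp`. A scalar sketch of the same constraint (hypothetical helper names, pure Python):

```python
import math

def constrained_temperature(raw, min_temp=0.1, max_temp=5.0):
    """min_temp + softplus(raw) keeps the temperature strictly above
    min_temp; the cap bounds it at max_temp."""
    return min(min_temp + math.log1p(math.exp(raw)), max_temp)

def scaled_similarity(similarity, bias, raw_temp):
    """Temperature-scaled similarity with a learnable bias, mirroring
    AdaptiveTemperatureScaling.call on scalars."""
    return (similarity + bias) / constrained_temperature(raw_temp)
```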
94
+ class EnhancedItemTower(tf.keras.Model):
95
+ """Enhanced item tower with diversity regularization."""
96
+
97
+ def __init__(self,
98
+ item_vocab_size: int,
99
+ category_vocab_size: int,
100
+ brand_vocab_size: int,
101
+ embedding_dim: int = 128,
102
+ hidden_dims: list = [256, 128],
103
+ dropout_rate: float = 0.3,
104
+ use_bias: bool = True,
105
+ use_diversity_reg: bool = True):
106
+ super().__init__()
107
+
108
+ self.embedding_dim = embedding_dim
109
+ self.use_bias = use_bias
110
+ self.use_diversity_reg = use_diversity_reg
111
+
112
+ # Embedding layers with better initialization
113
+ self.item_embedding = tf.keras.layers.Embedding(
114
+ item_vocab_size, embedding_dim,
115
+ embeddings_initializer='he_normal', # Better initialization
116
+ embeddings_regularizer=tf.keras.regularizers.L2(1e-6),
117
+ name="item_embedding"
118
+ )
119
+ self.category_embedding = tf.keras.layers.Embedding(
120
+ category_vocab_size, embedding_dim,
121
+ embeddings_initializer='he_normal',
122
+ embeddings_regularizer=tf.keras.regularizers.L2(1e-6),
123
+ name="category_embedding"
124
+ )
125
+ self.brand_embedding = tf.keras.layers.Embedding(
126
+ brand_vocab_size, embedding_dim,
127
+ embeddings_initializer='he_normal',
128
+ embeddings_regularizer=tf.keras.regularizers.L2(1e-6),
129
+ name="brand_embedding"
130
+ )
131
+
132
+ # Price processing
133
+ self.price_normalization = tf.keras.layers.Normalization(name="price_norm")
134
+ self.price_projection = tf.keras.layers.Dense(
135
+ embedding_dim // 4, activation='relu', name="price_proj"
136
+ )
137
+
138
+ # Enhanced attention mechanism
139
+ self.feature_attention = tf.keras.layers.MultiHeadAttention(
140
+ num_heads=4,
141
+ key_dim=embedding_dim,
142
+ dropout=0.1,
143
+ name="feature_attention"
144
+ )
145
+
146
+ # Dense layers with residual connections
147
+ self.dense_layers = []
148
+ for i, dim in enumerate(hidden_dims):
149
+ self.dense_layers.extend([
150
+ tf.keras.layers.Dense(dim, activation=None, name=f"dense_{i}"),
151
+ tf.keras.layers.BatchNormalization(name=f"bn_{i}"),
152
+ tf.keras.layers.Activation('relu', name=f"relu_{i}"),
153
+ tf.keras.layers.Dropout(dropout_rate, name=f"dropout_{i}")
154
+ ])
155
+
156
+ # Output layer with controlled normalization
157
+ self.output_layer = tf.keras.layers.Dense(
158
+ embedding_dim, activation=None, use_bias=use_bias, name="item_output"
159
+ )
160
+
161
+ # Diversity regularizer
162
+ if use_diversity_reg:
163
+ self.diversity_regularizer = EmbeddingDiversityRegularizer()
164
+
165
+ # Adaptive normalization instead of hard L2 normalization
166
+ self.adaptive_norm = tf.keras.layers.LayerNormalization(name="adaptive_norm")
167
+
168
+ # Item bias
169
+ if use_bias:
170
+ self.item_bias = tf.keras.layers.Embedding(
171
+ item_vocab_size, 1, name="item_bias"
172
+ )
173
+
174
+ def call(self, inputs, training=None):
175
+ """Enhanced forward pass with diversity regularization."""
176
+ item_id = inputs["product_id"]
177
+ category_id = inputs["category_id"]
178
+ brand_id = inputs["brand_id"]
179
+ price = inputs["price"]
180
+
181
+ # Get embeddings
182
+ item_emb = self.item_embedding(item_id)
183
+ category_emb = self.category_embedding(category_id)
184
+ brand_emb = self.brand_embedding(brand_id)
185
+
186
+ # Process price
187
+ price_norm = self.price_normalization(tf.expand_dims(price, -1))
188
+ price_emb = self.price_projection(price_norm)
189
+
190
+ # Pad price embedding
191
+ price_emb_padded = tf.pad(
192
+ price_emb,
193
+ [[0, 0], [0, self.embedding_dim - tf.shape(price_emb)[-1]]]
194
+ )
195
+
196
+ # Stack features for attention
197
+ features = tf.stack([item_emb, category_emb, brand_emb, price_emb_padded], axis=1)
198
+
199
+ # Apply attention
200
+ attended_features = self.feature_attention(
201
+ query=features,
202
+ value=features,
203
+ key=features,
204
+ training=training
205
+ )
206
+
207
+ # Aggregate with residual connection
208
+ combined = tf.reduce_mean(attended_features + features, axis=1)
209
+
210
+ # Pass through dense layers with residual connections
211
+ x = combined
212
+ residual = x
213
+ for i, layer in enumerate(self.dense_layers):
214
+ x = layer(x, training=training)
215
+ # Add residual connection every 4 layers (complete block)
216
+ if (i + 1) % 4 == 0 and x.shape[-1] == residual.shape[-1]:
217
+ x = x + residual
218
+ residual = x
219
+
220
+ # Final output
221
+ output = self.output_layer(x)
222
+
223
+ # Apply diversity regularization if enabled
224
+ if self.use_diversity_reg and training:
225
+ output = self.diversity_regularizer(output)
226
+
227
+ # Adaptive normalization instead of hard L2
228
+ normalized_output = self.adaptive_norm(output)
229
+
230
+ # Add bias if enabled
231
+ if self.use_bias:
232
+ bias = tf.squeeze(self.item_bias(item_id), axis=-1)
233
+ return normalized_output, bias
234
+ else:
235
+ return normalized_output
236
+
237
+
238
+ class EnhancedUserTower(tf.keras.Model):
239
+ """Enhanced user tower with diversity regularization."""
240
+
241
+ def __init__(self,
242
+ max_history_length: int = 50,
243
+ embedding_dim: int = 128,
244
+ hidden_dims: list = [256, 128],
245
+ dropout_rate: float = 0.3,
246
+ use_bias: bool = True,
247
+ use_diversity_reg: bool = True):
248
+ super().__init__()
249
+
250
+ self.embedding_dim = embedding_dim
251
+ self.max_history_length = max_history_length
252
+ self.use_bias = use_bias
253
+ self.use_diversity_reg = use_diversity_reg
254
+
255
+ # Demographic embeddings with regularization
256
+ self.age_embedding = tf.keras.layers.Embedding(
257
+ 6, embedding_dim // 16,
258
+ embeddings_initializer='he_normal',
259
+ embeddings_regularizer=tf.keras.regularizers.L2(1e-6),
260
+ name="age_embedding"
261
+ )
262
+ self.income_embedding = tf.keras.layers.Embedding(
263
+ 5, embedding_dim // 16,
264
+ embeddings_initializer='he_normal',
265
+ embeddings_regularizer=tf.keras.regularizers.L2(1e-6),
266
+ name="income_embedding"
267
+ )
268
+ self.gender_embedding = tf.keras.layers.Embedding(
269
+ 2, embedding_dim // 16,
270
+ embeddings_initializer='he_normal',
271
+ embeddings_regularizer=tf.keras.regularizers.L2(1e-6),
272
+ name="gender_embedding"
273
+ )
274
+
275
+ # Enhanced history processing
276
+ self.history_transformer = tf.keras.layers.MultiHeadAttention(
277
+ num_heads=8,
278
+ key_dim=embedding_dim,
279
+ dropout=0.1,
280
+ name="history_transformer"
281
+ )
282
+
283
+ # History aggregation with attention pooling
284
+ self.history_attention_pooling = tf.keras.layers.Dense(
285
+ 1, activation=None, name="history_attention"
286
+ )
287
+
288
+ # Dense layers with residual connections
289
+ self.dense_layers = []
290
+ for i, dim in enumerate(hidden_dims):
291
+ self.dense_layers.extend([
292
+ tf.keras.layers.Dense(dim, activation=None, name=f"user_dense_{i}"),
293
+ tf.keras.layers.BatchNormalization(name=f"user_bn_{i}"),
294
+ tf.keras.layers.Activation('relu', name=f"user_relu_{i}"),
295
+ tf.keras.layers.Dropout(dropout_rate, name=f"user_dropout_{i}")
296
+ ])
297
+
298
+ # Output layer
299
+ self.output_layer = tf.keras.layers.Dense(
300
+ embedding_dim, activation=None, use_bias=use_bias, name="user_output"
301
+ )
302
+
303
+ # Diversity regularizer
304
+ if use_diversity_reg:
305
+ self.diversity_regularizer = EmbeddingDiversityRegularizer()
306
+
307
+ # Adaptive normalization
308
+ self.adaptive_norm = tf.keras.layers.LayerNormalization(name="user_adaptive_norm")
309
+
310
+ # Global user bias
311
+ if use_bias:
312
+ self.global_user_bias = tf.Variable(
313
+ initial_value=0.0, trainable=True, name="global_user_bias"
314
+ )
315
+
316
+ def call(self, inputs, training=None):
317
+ """Enhanced forward pass with diversity regularization."""
318
+ age = inputs["age"]
319
+ gender = inputs["gender"]
320
+ income = inputs["income"]
321
+ item_history = inputs["item_history_embeddings"]
322
+
323
+ # Process demographics
324
+ age_emb = self.age_embedding(age)
325
+ income_emb = self.income_embedding(income)
326
+ gender_emb = self.gender_embedding(gender)
327
+
328
+ # Combine demographics
329
+ demo_combined = tf.concat([age_emb, income_emb, gender_emb], axis=-1)
330
+
331
+ # Enhanced history processing
332
+ batch_size = tf.shape(item_history)[0]
333
+ seq_len = tf.shape(item_history)[1]
334
+
335
+ # Simplified positional encoding - ensure shape compatibility
336
+ positions = tf.range(seq_len, dtype=tf.float32)
337
+ # Create simpler positional encoding
338
+ pos_encoding_scale = tf.range(self.embedding_dim, dtype=tf.float32) / self.embedding_dim
339
+ position_encoding = tf.sin(positions[:, tf.newaxis] * pos_encoding_scale[tf.newaxis, :])
340
+
341
+ # Ensure correct shape: [seq_len, embedding_dim] -> [batch_size, seq_len, embedding_dim]
342
+ position_encoding = tf.expand_dims(position_encoding, 0)
343
+ position_encoding = tf.tile(position_encoding, [batch_size, 1, 1])
344
+
345
+ # Add positional encoding with shape check
346
+ history_with_pos = item_history + position_encoding
347
+
348
+ # Create attention mask - fix shape for MultiHeadAttention
349
+ # MultiHeadAttention expects mask shape: [batch_size, seq_len] or [batch_size, seq_len, seq_len]
350
+ history_mask = tf.reduce_sum(tf.abs(item_history), axis=-1) > 0 # [batch_size, seq_len]
351
+
352
+ # Apply transformer attention
353
+ attended_history = self.history_transformer(
354
+ query=history_with_pos,
355
+ value=history_with_pos,
356
+ key=history_with_pos,
357
+ attention_mask=history_mask,
358
+ training=training
359
+ )
360
+
361
+ # Attention-based pooling instead of simple mean
362
+ attention_weights = tf.nn.softmax(
363
+ self.history_attention_pooling(attended_history), axis=1
364
+ )
365
+ history_aggregated = tf.reduce_sum(
366
+ attended_history * attention_weights, axis=1
367
+ )
368
+
369
+ # Combine features
370
+ combined = tf.concat([demo_combined, history_aggregated], axis=-1)
371
+
372
+ # Pass through dense layers with residual connections
373
+ x = combined
374
+ residual = x
375
+ for i, layer in enumerate(self.dense_layers):
376
+ x = layer(x, training=training)
377
+ # Add residual connection every 4 layers
378
+ if (i + 1) % 4 == 0 and x.shape[-1] == residual.shape[-1]:
379
+ x = x + residual
380
+ residual = x
381
+
382
+ # Final output
383
+ output = self.output_layer(x)
384
+
385
+ # Apply diversity regularization if enabled
386
+ if self.use_diversity_reg and training:
387
+ output = self.diversity_regularizer(output)
388
+
389
+ # Adaptive normalization
390
+ normalized_output = self.adaptive_norm(output)
391
+
392
+ # Add bias if enabled
393
+ if self.use_bias:
394
+ return normalized_output, self.global_user_bias
395
+ else:
396
+ return normalized_output
397
+
398
+
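The attention pooling above replaces a plain mean over history with a softmax-weighted one. A numpy sketch of the pooling step (hypothetical `attention_pool` helper; the learned scoring Dense layer is replaced by given scores):

```python
import numpy as np

def attention_pool(history, attention_scores):
    """Softmax-weighted mean over the history axis, replacing a plain
    average: items with higher attention scores dominate the summary."""
    w = np.exp(attention_scores - attention_scores.max())
    w = w / w.sum()
    return (history * w[:, None]).sum(axis=0)
```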
399
+ class EnhancedTwoTowerModel(tfrs.Model):
400
+ """Enhanced two-tower model with all improvements."""
401
+
402
+ def __init__(self,
403
+ item_tower: EnhancedItemTower,
404
+ user_tower: EnhancedUserTower,
405
+ rating_weight: float = 1.0,
406
+ retrieval_weight: float = 1.0,
407
+ contrastive_weight: float = 0.3,
408
+ diversity_weight: float = 0.1):
409
+ super().__init__()
410
+
411
+ self.item_tower = item_tower
412
+ self.user_tower = user_tower
413
+ self.rating_weight = rating_weight
414
+ self.retrieval_weight = retrieval_weight
415
+ self.contrastive_weight = contrastive_weight
416
+ self.diversity_weight = diversity_weight
417
+
418
+ # Adaptive temperature scaling
419
+ self.temperature_similarity = AdaptiveTemperatureScaling()
420
+
421
+ # Enhanced rating model
422
+ self.rating_model = tf.keras.Sequential([
423
+ tf.keras.layers.Dense(512, activation="relu"),
424
+ tf.keras.layers.BatchNormalization(),
425
+ tf.keras.layers.Dropout(0.3),
426
+ tf.keras.layers.Dense(256, activation="relu"),
427
+ tf.keras.layers.BatchNormalization(),
428
+ tf.keras.layers.Dropout(0.2),
429
+ tf.keras.layers.Dense(64, activation="relu"),
430
+ tf.keras.layers.Dense(1, activation="sigmoid")
431
+ ])
432
+
433
+ # Focal loss for imbalanced data
434
+ self.focal_loss = self._focal_loss
435
+
436
+ def _focal_loss(self, y_true, y_pred, alpha=0.25, gamma=2.0):
437
+ """Focal loss implementation."""
438
+ epsilon = tf.keras.backend.epsilon()
439
+ y_pred = tf.clip_by_value(y_pred, epsilon, 1.0 - epsilon)
440
+
441
+ alpha_t = y_true * alpha + (1 - y_true) * (1 - alpha)
442
+ p_t = y_true * y_pred + (1 - y_true) * (1 - y_pred)
443
+ focal_weight = alpha_t * tf.pow((1 - p_t), gamma)
444
+
445
+ bce = -(y_true * tf.math.log(y_pred) + (1 - y_true) * tf.math.log(1 - y_pred))
446
+ focal_loss = focal_weight * bce
447
+
448
+ return tf.reduce_mean(focal_loss)
449
+
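The focal loss above down-weights easy, well-classified examples via the `(1 - p_t)**gamma` factor; with `gamma=0` it reduces to alpha-weighted cross-entropy. A scalar sketch of the same formula (hypothetical `focal_loss_scalar` helper, pure Python):

```python
import math

def focal_loss_scalar(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Per-example focal loss: the (1 - p_t)**gamma factor down-weights
    easy, well-classified examples relative to plain cross-entropy."""
    p = min(max(y_pred, eps), 1.0 - eps)
    alpha_t = alpha if y_true == 1 else 1.0 - alpha
    p_t = p if y_true == 1 else 1.0 - p
    return alpha_t * (1.0 - p_t) ** gamma * -math.log(p_t)
```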
450
+ def call(self, features):
451
+ # Get embeddings
452
+ user_output = self.user_tower(features)
453
+ item_output = self.item_tower(features)
454
+
455
+ # Handle bias terms
456
+ if isinstance(user_output, tuple):
457
+ user_embeddings, user_bias = user_output
458
+ else:
459
+ user_embeddings = user_output
460
+ user_bias = 0.0
461
+
462
+ if isinstance(item_output, tuple):
463
+ item_embeddings, item_bias = item_output
464
+ else:
465
+ item_embeddings = item_output
466
+ item_bias = 0.0
467
+
468
+ return {
469
+ "user_embedding": user_embeddings,
470
+ "item_embedding": item_embeddings,
471
+ "user_bias": user_bias,
472
+ "item_bias": item_bias
473
+ }
474
+
475
+ def compute_loss(self, features, training=False):
476
+ # Get embeddings and biases
477
+ outputs = self(features)
478
+ user_embeddings = outputs["user_embedding"]
479
+ item_embeddings = outputs["item_embedding"]
480
+ user_bias = outputs["user_bias"]
481
+ item_bias = outputs["item_bias"]
482
+
483
+ # Rating prediction
484
+ concatenated = tf.concat([user_embeddings, item_embeddings], axis=-1)
485
+ rating_predictions = self.rating_model(concatenated, training=training)
486
+
487
+ # Add bias terms
488
+ rating_predictions_with_bias = rating_predictions + user_bias + item_bias
489
+ rating_predictions_with_bias = tf.nn.sigmoid(rating_predictions_with_bias)
490
+
491
+ # Losses
492
+ rating_loss = self.focal_loss(features["rating"], rating_predictions_with_bias)
493
+
494
+ # Adaptive temperature-scaled retrieval loss
495
+ scaled_similarities, temperature = self.temperature_similarity(
496
+ user_embeddings, item_embeddings
497
+ )
498
+ retrieval_loss = tf.keras.losses.binary_crossentropy(
499
+ features["rating"],
500
+ tf.nn.sigmoid(scaled_similarities)
501
+ )
502
+ retrieval_loss = tf.reduce_mean(retrieval_loss)
503
+
504
+ # Enhanced contrastive loss with hard negatives
505
+ batch_size = tf.shape(user_embeddings)[0]
506
+ positive_similarities = tf.reduce_sum(user_embeddings * item_embeddings, axis=1)
507
+
508
+ # Random negative sampling
509
+ shuffled_indices = tf.random.shuffle(tf.range(batch_size))
510
+ negative_item_embeddings = tf.gather(item_embeddings, shuffled_indices)
511
+ negative_similarities = tf.reduce_sum(user_embeddings * negative_item_embeddings, axis=1)
512
+
513
+ # Triplet loss with adaptive margin
514
+ margin = 0.5 / temperature # Adaptive margin based on temperature
515
+ contrastive_loss = tf.reduce_mean(
516
+ tf.maximum(0.0, margin + negative_similarities - positive_similarities)
517
+ )
518
+
519
+ # Combine losses
520
+ total_loss = (
521
+ self.rating_weight * rating_loss +
522
+ self.retrieval_weight * retrieval_loss +
523
+ self.contrastive_weight * contrastive_loss
524
+ )
525
+
526
+ # Add regularization losses from diversity regularizers
527
+ if training:
528
+ regularization_losses = tf.add_n(self.losses) if self.losses else 0.0
529
+ total_loss += self.diversity_weight * regularization_losses
530
+
531
+ return {
532
+ 'total_loss': total_loss,
533
+ 'rating_loss': rating_loss,
534
+ 'retrieval_loss': retrieval_loss,
535
+ 'contrastive_loss': contrastive_loss,
536
+ 'temperature': temperature,
537
+ 'diversity_loss': regularization_losses if training else 0.0
538
+ }
539
+
540
+
541
+ def create_enhanced_model(data_processor,
542
+ embedding_dim=128,
543
+ use_bias=True,
544
+ use_diversity_reg=True):
545
+ """Factory function to create enhanced two-tower model."""
546
+
547
+ # Create enhanced towers
548
+ item_tower = EnhancedItemTower(
549
+ item_vocab_size=len(data_processor.item_vocab),
550
+ category_vocab_size=len(data_processor.category_vocab),
551
+ brand_vocab_size=len(data_processor.brand_vocab),
552
+ embedding_dim=embedding_dim,
553
+ use_bias=use_bias,
554
+ use_diversity_reg=use_diversity_reg
555
+ )
556
+
557
+ user_tower = EnhancedUserTower(
558
+ max_history_length=50,
559
+ embedding_dim=embedding_dim,
560
+ use_bias=use_bias,
561
+ use_diversity_reg=use_diversity_reg
562
+ )
563
+
564
+ # Create enhanced model
565
+ model = EnhancedTwoTowerModel(
566
+ item_tower=item_tower,
567
+ user_tower=user_tower,
568
+ rating_weight=1.0,
569
+ retrieval_weight=0.5,
570
+ contrastive_weight=0.3,
571
+ diversity_weight=0.1
572
+ )
573
+
574
+ return model
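As a sanity check on the focal-loss weighting used in `_focal_loss` above, here is a small NumPy sketch (illustrative probabilities only) showing that a confidently-classified positive contributes far less loss than a hard one, which is the point of focal loss on imbalanced data:

```python
import numpy as np

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    # Mirrors the TF implementation above, in plain NumPy.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    alpha_t = y_true * alpha + (1 - y_true) * (1 - alpha)
    p_t = y_true * y_pred + (1 - y_true) * (1 - y_pred)
    bce = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return alpha_t * (1 - p_t) ** gamma * bce

# An easy positive (p=0.9) is heavily down-weighted vs a hard positive (p=0.1).
easy = focal_loss(np.array([1.0]), np.array([0.9]))
hard = focal_loss(np.array([1.0]), np.array([0.1]))
assert hard[0] > 100 * easy[0]
```

With `gamma=2.0` the modulating factor `(1 - p_t)^2` shrinks the easy example's loss by roughly two orders of magnitude relative to the hard one.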
src/models/item_tower.py CHANGED
@@ -10,8 +10,8 @@ class ItemTower(tf.keras.Model):
10
  item_vocab_size: int,
11
  category_vocab_size: int,
12
  brand_vocab_size: int,
13
- embedding_dim: int = 64,
14
- hidden_dims: list = [128, 64],
15
  dropout_rate: float = 0.2):
16
  super().__init__()
17
 
 
10
  item_vocab_size: int,
11
  category_vocab_size: int,
12
  brand_vocab_size: int,
13
+ embedding_dim: int = 128, # Output embedding dimension
14
+ hidden_dims: list = [256, 128], # Internal dims can be larger
15
  dropout_rate: float = 0.2):
16
  super().__init__()
17
 
src/models/user_tower.py CHANGED
@@ -8,8 +8,8 @@ class UserTower(tf.keras.Model):
8
 
9
  def __init__(self,
10
  max_history_length: int = 50,
11
- embedding_dim: int = 64,
12
- hidden_dims: list = [128, 64],
13
  dropout_rate: float = 0.2):
14
  super().__init__()
15
 
 
8
 
9
  def __init__(self,
10
  max_history_length: int = 50,
11
+ embedding_dim: int = 128, # Output embedding dimension
12
+ hidden_dims: list = [256, 128], # Internal dims for processing
13
  dropout_rate: float = 0.2):
14
  super().__init__()
15
 
src/preprocessing/data_loader.py CHANGED
@@ -4,6 +4,8 @@ import tensorflow as tf
4
  from typing import Dict, List, Tuple, Optional
5
  from collections import defaultdict
6
  import pickle
 
 
7
 
8
 
9
  class DataProcessor:
@@ -97,54 +99,73 @@ class DataProcessor:
97
  interactions_df: pd.DataFrame,
98
  items_df: pd.DataFrame,
99
  negative_samples_per_positive: int = 4) -> pd.DataFrame:
100
- """Create positive and negative user-item pairs for training."""
101
 
102
- # Get all unique items for negative sampling
103
  all_items = set(self.item_vocab.keys())
 
104
 
105
- # Create positive pairs
106
- positive_pairs = []
107
- for _, row in interactions_df.iterrows():
108
- if row['user_id'] in self.user_vocab and row['product_id'] in self.item_vocab:
109
- positive_pairs.append({
110
- 'user_id': row['user_id'],
111
- 'product_id': row['product_id'],
112
- 'rating': 1.0 # Implicit positive feedback
113
- })
114
-
115
- # Create negative pairs
116
- negative_pairs = []
117
- user_item_interactions = set(
118
- (row['user_id'], row['product_id'])
119
- for _, row in interactions_df.iterrows()
120
- )
121
-
122
- for pos_pair in positive_pairs:
123
- user_id = pos_pair['user_id']
124
- user_interactions = set(
125
- row['product_id'] for _, row in interactions_df.iterrows()
126
- if row['user_id'] == user_id
127
- )
128
 
129
- # Sample negative items
130
- negative_items = all_items - user_interactions
131
  if len(negative_items) >= negative_samples_per_positive:
 
132
  sampled_negatives = np.random.choice(
133
- list(negative_items),
134
- size=negative_samples_per_positive,
135
- replace=False
136
  )
137
 
138
- for neg_item in sampled_negatives:
139
- negative_pairs.append({
140
- 'user_id': user_id,
141
- 'product_id': neg_item,
142
- 'rating': 0.0 # Negative feedback
143
- })
144
 
145
  # Combine positive and negative pairs
146
- all_pairs = positive_pairs + negative_pairs
147
- return pd.DataFrame(all_pairs)
148
 
149
  def save_vocabularies(self, save_path: str = "src/artifacts/"):
150
  """Save vocabularies for later use."""
@@ -176,9 +197,19 @@ class DataProcessor:
176
  print("Vocabularies loaded successfully")
177
 
178
 
179
- def create_tf_dataset(features: Dict[str, np.ndarray], batch_size: int = 256) -> tf.data.Dataset:
180
- """Create TensorFlow dataset from features."""
181
  dataset = tf.data.Dataset.from_tensor_slices(features)
 
 
 
 
 
 
 
182
  dataset = dataset.batch(batch_size)
183
- dataset = dataset.prefetch(tf.data.AUTOTUNE)
 
 
 
184
  return dataset
 
4
  from typing import Dict, List, Tuple, Optional
5
  from collections import defaultdict
6
  import pickle
7
+ from concurrent.futures import ThreadPoolExecutor
8
+ import multiprocessing as mp
9
 
10
 
11
  class DataProcessor:
 
99
  interactions_df: pd.DataFrame,
100
  items_df: pd.DataFrame,
101
  negative_samples_per_positive: int = 4) -> pd.DataFrame:
102
+ """Create positive and negative user-item pairs for training (optimized)."""
103
+
104
+ # Filter valid interactions once
105
+ valid_interactions = interactions_df[
106
+ (interactions_df['user_id'].isin(self.user_vocab)) &
107
+ (interactions_df['product_id'].isin(self.item_vocab))
108
+ ].copy()
109
+
110
+ # Create positive pairs vectorized
111
+ positive_pairs = valid_interactions[['user_id', 'product_id']].copy()
112
+ positive_pairs['rating'] = 1.0
113
+
114
+ # Pre-compute user interactions for faster lookup
115
+ user_items_dict = (
116
+ valid_interactions.groupby('user_id')['product_id']
117
+ .apply(set).to_dict()
118
+ )
119
 
 
120
  all_items = set(self.item_vocab.keys())
122
 
123
+ # Generate negative samples in parallel
124
+ def generate_negatives_for_user(user_data):
125
+ user_id, user_items = user_data
126
+ negative_items = all_items - user_items
 
127
 
 
 
128
  if len(negative_items) >= negative_samples_per_positive:
129
+ neg_items_array = np.array(list(negative_items))
130
  sampled_negatives = np.random.choice(
131
+ neg_items_array,
132
+ size=negative_samples_per_positive * len(user_items),
133
+ replace=len(negative_items) < negative_samples_per_positive * len(user_items)
134
  )
135
 
136
+ # Repeat user_id for each negative sample
137
+ user_ids = np.repeat(user_id, len(sampled_negatives))
138
+ ratings = np.zeros(len(sampled_negatives))
139
+
140
+ return pd.DataFrame({
141
+ 'user_id': user_ids,
142
+ 'product_id': sampled_negatives,
143
+ 'rating': ratings
144
+ })
145
+ return pd.DataFrame(columns=['user_id', 'product_id', 'rating'])
146
+
147
+ # Process in parallel chunks
148
+ chunk_size = max(1, len(user_items_dict) // mp.cpu_count())
149
+ user_chunks = [
150
+ list(user_items_dict.items())[i:i + chunk_size]
151
+ for i in range(0, len(user_items_dict), chunk_size)
152
+ ]
153
+
154
+ negative_dfs = []
155
+ with ThreadPoolExecutor(max_workers=mp.cpu_count()) as executor:
156
+ for chunk in user_chunks:
157
+ chunk_results = list(executor.map(generate_negatives_for_user, chunk))
158
+ negative_dfs.extend(chunk_results)
159
+
160
+ # Combine all negative samples
161
+ if negative_dfs:
162
+ negative_pairs = pd.concat(negative_dfs, ignore_index=True)
163
+ else:
164
+ negative_pairs = pd.DataFrame(columns=['user_id', 'product_id', 'rating'])
165
 
166
  # Combine positive and negative pairs
167
+ all_pairs = pd.concat([positive_pairs, negative_pairs], ignore_index=True)
168
+ return all_pairs
169
 
170
  def save_vocabularies(self, save_path: str = "src/artifacts/"):
171
  """Save vocabularies for later use."""
 
197
  print("Vocabularies loaded successfully")
198
 
199
 
200
+ def create_tf_dataset(features: Dict[str, np.ndarray], batch_size: int = 256, shuffle: bool = True) -> tf.data.Dataset:
201
+ """Create optimized TensorFlow dataset from features for CPU training."""
202
  dataset = tf.data.Dataset.from_tensor_slices(features)
203
+
204
+ if shuffle:
205
+ # Use reasonable buffer size for memory efficiency - handle different feature types
206
+ sample_key = next(iter(features.keys()))
207
+ buffer_size = min(len(features[sample_key]), 10000)
208
+ dataset = dataset.shuffle(buffer_size)
209
+
210
  dataset = dataset.batch(batch_size)
211
+
212
+ # Optimize for CPU with reasonable prefetch
213
+ dataset = dataset.prefetch(2) # Reduced from AUTOTUNE for CPU efficiency
214
+
215
  return dataset
src/preprocessing/optimized_dataset_creator.py ADDED
@@ -0,0 +1,111 @@
1
+ """
2
+ Optimized dataset creation script with performance improvements.
3
+ """
4
+ import time
5
+ import numpy as np
6
+ from src.preprocessing.user_data_preparation import UserDatasetCreator
7
+ from src.preprocessing.data_loader import DataProcessor, create_tf_dataset
8
+
9
+
10
+ def create_optimized_dataset(max_history_length: int = 50,
11
+ batch_size: int = 512,
12
+ negative_samples_per_positive: int = 2,
13
+ use_sample: bool = False,
14
+ sample_size: int = 10000):
15
+ """
16
+ Create dataset with optimized performance settings.
17
+
18
+ Args:
19
+ max_history_length: Maximum user interaction history length
20
+ batch_size: Batch size for TensorFlow dataset
21
+ negative_samples_per_positive: Negative sampling ratio
22
+ use_sample: Whether to use a sample of the data for faster processing
23
+ sample_size: Size of sample if use_sample=True
24
+ """
25
+ print("Starting optimized dataset creation...")
26
+ start_time = time.time()
27
+
28
+ # Initialize with optimized settings
29
+ dataset_creator = UserDatasetCreator(max_history_length=max_history_length)
30
+ data_processor = DataProcessor()
31
+
32
+ # Load data
33
+ print("Loading data...")
34
+ load_start = time.time()
35
+ items_df, users_df, interactions_df = data_processor.load_data()
36
+ print(f"Data loaded in {time.time() - load_start:.2f} seconds")
37
+
38
+ # Optional: Use sample for faster development/testing
39
+ if use_sample:
40
+ print(f"Using sample of {sample_size} interactions for faster processing...")
41
+ sample_interactions = interactions_df.sample(min(sample_size, len(interactions_df)))
42
+ user_ids = set(sample_interactions['user_id'])
43
+ item_ids = set(sample_interactions['product_id'])
44
+
45
+ users_df = users_df[users_df['user_id'].isin(user_ids)]
46
+ items_df = items_df[items_df['product_id'].isin(item_ids)]
47
+ interactions_df = sample_interactions
48
+
49
+ print(f"Sample: {len(items_df)} items, {len(users_df)} users, {len(interactions_df)} interactions")
50
+
51
+ # Load embeddings with caching
52
+ print("Loading item embeddings...")
53
+ embed_start = time.time()
54
+ item_embeddings = dataset_creator.load_item_embeddings()
55
+ print(f"Embeddings loaded in {time.time() - embed_start:.2f} seconds")
56
+
57
+ # Create temporal split
58
+ print("Creating temporal split...")
59
+ split_start = time.time()
60
+ train_interactions, val_interactions = dataset_creator.create_temporal_split(interactions_df)
61
+ print(f"Temporal split created in {time.time() - split_start:.2f} seconds")
62
+
63
+ # Create training dataset with optimizations
64
+ print("Creating optimized training dataset...")
65
+ train_start = time.time()
66
+ training_features = dataset_creator.create_training_dataset(
67
+ train_interactions, items_df, users_df, item_embeddings,
68
+ negative_samples_per_positive=negative_samples_per_positive
69
+ )
70
+ print(f"Training dataset created in {time.time() - train_start:.2f} seconds")
71
+
72
+ # Create TensorFlow dataset optimized for CPU
73
+ print("Creating TensorFlow dataset...")
74
+ tf_start = time.time()
75
+ tf_dataset = create_tf_dataset(training_features, batch_size=batch_size)
76
+ print(f"TensorFlow dataset created in {time.time() - tf_start:.2f} seconds")
77
+
78
+ # Save optimized dataset
79
+ print("Saving dataset...")
80
+ save_start = time.time()
81
+ dataset_creator.save_dataset(training_features, "src/artifacts/")
82
+
83
+ # Save vocabularies for later use
84
+ data_processor.save_vocabularies("src/artifacts/")
85
+ print(f"Dataset saved in {time.time() - save_start:.2f} seconds")
86
+
87
+ total_time = time.time() - start_time
88
+ print(f"\nOptimized dataset creation completed in {total_time:.2f} seconds!")
89
+ print(f"Training samples: {len(training_features['rating'])}")
90
+ print("Memory usage optimized for CPU training")
91
+
92
+ return tf_dataset, training_features
93
+
94
+
95
+ if __name__ == "__main__":
96
+ # Run with optimized settings
97
+ tf_dataset, features = create_optimized_dataset(
98
+ max_history_length=30, # Reduced for speed
99
+ batch_size=512, # Larger batches for CPU efficiency
100
+ negative_samples_per_positive=2, # Reduced sampling ratio
101
+ use_sample=True, # Use sample for development
102
+ sample_size=50000 # Reasonable sample size
103
+ )
104
+
105
+ print("\nDataset creation optimization complete!")
106
+ print("Key optimizations applied:")
107
+ print("- Vectorized DataFrame operations")
108
+ print("- Parallel negative sampling")
109
+ print("- Memory-efficient embedding lookup")
110
+ print("- Optimized TensorFlow dataset pipeline")
111
+ print("- LRU caching for embeddings")
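The `use_sample` branch above keeps the three frames mutually consistent: users and items are filtered down to exactly those referenced by the sampled interactions. A minimal sketch of that invariant on synthetic data (names are illustrative, not the project's schema):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for the three data frames.
interactions = pd.DataFrame({
    'user_id': rng.integers(0, 100, size=1000),
    'product_id': rng.integers(0, 50, size=1000),
})
users = pd.DataFrame({'user_id': range(100)})
items = pd.DataFrame({'product_id': range(50)})

# Sample interactions first, then restrict users/items to what the sample references.
sample = interactions.sample(200, random_state=0)
users_s = users[users['user_id'].isin(set(sample['user_id']))]
items_s = items[items['product_id'].isin(set(sample['product_id']))]

# Every sampled interaction still resolves to a known user and item.
assert sample['user_id'].isin(users_s['user_id']).all()
assert sample['product_id'].isin(items_s['product_id']).all()
```

Filtering in this order is what prevents dangling IDs when the sampled dataset is later fed through vocabulary building.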
src/preprocessing/user_data_preparation.py CHANGED
@@ -63,7 +63,7 @@ class UserDatasetCreator:
63
  # Use more efficient random generation
64
  num_items = len(items_df['product_id'].unique())
65
  item_ids = items_df['product_id'].unique()
66
- embedding_matrix = np.random.rand(num_items, 64).astype(np.float32)
67
 
68
  dummy_embeddings = dict(zip(item_ids, embedding_matrix))
69
  print(f"Created dummy embeddings for {len(dummy_embeddings)} items")
@@ -72,7 +72,7 @@ class UserDatasetCreator:
72
  def aggregate_user_history_embeddings(self,
73
  user_histories: Dict[int, List[int]],
74
  item_embeddings: Dict[int, np.ndarray],
75
- embedding_dim: int = 64) -> Dict[int, np.ndarray]:
76
  """Aggregate item embeddings for each user's interaction history."""
77
 
78
  user_aggregated_embeddings = {}
@@ -103,9 +103,11 @@ class UserDatasetCreator:
103
 
104
  # Pad or truncate to max_history_length
105
  if len(history_embeddings) < self.max_history_length:
 
106
  padding = np.zeros((self.max_history_length - len(history_embeddings), embedding_dim))
107
- history_embeddings = np.vstack([padding, history_embeddings])
108
  else:
 
109
  history_embeddings = history_embeddings[-self.max_history_length:]
110
 
111
  user_aggregated_embeddings[user_id] = history_embeddings
@@ -368,5 +370,58 @@ def main():
368
  print("User dataset creation completed!")
369
 
370
 
371
  if __name__ == "__main__":
372
  main()
 
63
  # Use more efficient random generation
64
  num_items = len(items_df['product_id'].unique())
65
  item_ids = items_df['product_id'].unique()
66
+ embedding_matrix = np.random.rand(num_items, 128).astype(np.float32) # Updated to 128D
67
 
68
  dummy_embeddings = dict(zip(item_ids, embedding_matrix))
69
  print(f"Created dummy embeddings for {len(dummy_embeddings)} items")
 
72
  def aggregate_user_history_embeddings(self,
73
  user_histories: Dict[int, List[int]],
74
  item_embeddings: Dict[int, np.ndarray],
75
+ embedding_dim: int = 128) -> Dict[int, np.ndarray]: # Updated to 128D
76
  """Aggregate item embeddings for each user's interaction history."""
77
 
78
  user_aggregated_embeddings = {}
 
103
 
104
  # Pad or truncate to max_history_length
105
  if len(history_embeddings) < self.max_history_length:
106
+ # Add padding at the END so real interactions are at the BEGINNING
107
  padding = np.zeros((self.max_history_length - len(history_embeddings), embedding_dim))
108
+ history_embeddings = np.vstack([history_embeddings, padding])
109
  else:
110
+ # Keep most recent interactions
111
  history_embeddings = history_embeddings[-self.max_history_length:]
112
 
113
  user_aggregated_embeddings[user_id] = history_embeddings
 
370
  print("User dataset creation completed!")
371
 
372
 
373
+ def prepare_user_features(users_df: pd.DataFrame,
374
+ user_histories: Dict[int, List[int]],
375
+ item_features: Dict[str, np.ndarray],
376
+ max_history_length: int = 50,
377
+ embedding_dim: int = 128) -> Dict[int, Dict]:
378
+ """Standalone function to prepare user features with categorical demographics."""
379
+
380
+ creator = UserDatasetCreator(max_history_length=max_history_length)
381
+
382
+ # Create dummy item embeddings if not available (for 128D)
383
+ item_embeddings = {}
384
+ unique_items = set()
385
+ for history in user_histories.values():
386
+ unique_items.update(history)
387
+
388
+ # Create random embeddings for items (will be replaced by actual embeddings later)
389
+ for item_vocab_idx in unique_items:
390
+ item_embeddings[item_vocab_idx] = np.random.randn(embedding_dim).astype(np.float32)
391
+
392
+ # Get user aggregated embeddings
393
+ user_aggregated_embeddings = creator.aggregate_user_history_embeddings(
394
+ user_histories, item_embeddings, embedding_dim
395
+ )
396
+
397
+ # Process user features
398
+ user_feature_dict = {}
399
+
400
+ # Categorize income once using percentiles from all users (hoisted out of the loop)
401
+ income_categories = creator.categorize_income(users_df['income'])
402
+
403
+ for user_idx, user_row in users_df.iterrows():
404
+ user_id = user_row['user_id']
405
+
406
+ if user_id not in user_aggregated_embeddings:
407
+ continue
408
+
409
+ # Categorize demographics
410
+ age_cat = creator.categorize_age(user_row['age'])
411
+ gender_cat = 1 if user_row['gender'].lower() == 'male' else 0
412
+ income_cat = income_categories[user_idx]
414
+
415
+ user_feature_dict[user_id] = {
416
+ 'age': age_cat,
417
+ 'gender': gender_cat,
418
+ 'income': income_cat,
419
+ 'item_history_embeddings': user_aggregated_embeddings[user_id]
420
+ }
421
+
422
+ print(f"Prepared features for {len(user_feature_dict)} users with {embedding_dim}D embeddings")
423
+ return user_feature_dict
424
+
425
+
426
  if __name__ == "__main__":
427
  main()
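The padding fix above (pad at the end so real interactions come first, truncate to the most recent entries) can be checked in isolation; this sketch uses made-up lengths and dimensions:

```python
import numpy as np

MAX_LEN, DIM = 5, 4

def pad_history(history_embeddings: np.ndarray) -> np.ndarray:
    if len(history_embeddings) < MAX_LEN:
        # Pad at the END so real interactions stay at the beginning.
        padding = np.zeros((MAX_LEN - len(history_embeddings), DIM))
        return np.vstack([history_embeddings, padding])
    # Keep only the most recent MAX_LEN interactions.
    return history_embeddings[-MAX_LEN:]

short = pad_history(np.ones((2, DIM)))
long = pad_history(np.arange(7)[:, None].repeat(DIM, axis=1).astype(float))

assert short.shape == (MAX_LEN, DIM)
assert short[0, 0] == 1.0 and short[-1, 0] == 0.0   # real rows first, zeros last
assert long.shape == (MAX_LEN, DIM)
assert long[0, 0] == 2.0 and long[-1, 0] == 6.0     # oldest rows dropped
```

Padding at the front (the pre-fix behavior) would instead place zeros where sequence models expect the earliest real interactions.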
src/training/fast_joint_training.py CHANGED
@@ -66,7 +66,7 @@ class FastJointTrainer:
66
  # Build user tower (simplified)
67
  self.user_tower = UserTower(
68
  max_history_length=50,
69
- embedding_dim=64,
70
  hidden_dims=[64], # Simplified architecture
71
  dropout_rate=0.1
72
  )
 
66
  # Build user tower (simplified)
67
  self.user_tower = UserTower(
68
  max_history_length=50,
69
+ embedding_dim=128, # Updated to 128D
70
  hidden_dims=[64], # Simplified architecture
71
  dropout_rate=0.1
72
  )
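The 64→128 bump only works if it is applied consistently to both towers, since the retrieval losses score a user against an item via a dot product over matching dimensions. A minimal NumPy shape check (illustrative batch size):

```python
import numpy as np

rng = np.random.default_rng(0)

EMBEDDING_DIM = 128  # must match between user and item towers
user_emb = rng.standard_normal((8, EMBEDDING_DIM))
item_emb = rng.standard_normal((8, EMBEDDING_DIM))

# Per-pair affinity: row-wise dot product, as in the retrieval/contrastive losses.
scores = (user_emb * item_emb).sum(axis=1)
assert scores.shape == (8,)

# Full user x item similarity matrix, as used for hard-negative mining.
sim_matrix = user_emb @ item_emb.T
assert sim_matrix.shape == (8, 8)
```

A mismatched pair (e.g. a 128-D user tower against a 64-D item tower) fails at the first matmul, which is why the dimension change has to propagate through every file touched here.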
src/training/improved_joint_training.py ADDED
@@ -0,0 +1,462 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Improved joint training with hard negative mining, curriculum learning, and better optimization.
4
+ """
5
+
6
+ import tensorflow as tf
7
+ import numpy as np
8
+ import pickle
9
+ import os
10
+ from typing import Dict, List, Tuple, Optional
11
+ import time
12
+ from collections import defaultdict
13
+
14
+ from src.models.improved_two_tower import create_improved_model
15
+ from src.preprocessing.data_loader import DataProcessor, create_tf_dataset
16
+
17
+
18
+ class HardNegativeSampler:
19
+ """Hard negative sampling strategy for better training."""
20
+
21
+ def __init__(self, model, item_embeddings, sampling_strategy='mixed'):
22
+ self.model = model
23
+ self.item_embeddings = item_embeddings # Pre-computed item embeddings
24
+ self.sampling_strategy = sampling_strategy
25
+
26
+ def sample_hard_negatives(self, user_embeddings, positive_items, k_hard=2, k_random=2):
27
+ """Sample hard negatives based on user-item similarity."""
28
+ batch_size = tf.shape(user_embeddings)[0]
29
+
30
+ # Compute similarities between users and all items
31
+ similarities = tf.linalg.matmul(user_embeddings, self.item_embeddings, transpose_b=True)
32
+
33
+ # Mask out positive items
34
+ positive_mask = tf.one_hot(positive_items, depth=tf.shape(self.item_embeddings)[0])
35
+ similarities = similarities - positive_mask * 1e9 # Large negative value
36
+
37
+ # Get top-k similar items (hard negatives)
38
+ _, hard_negative_indices = tf.nn.top_k(similarities, k=k_hard)
39
+
40
+ # Sample random negatives
41
+ total_items = tf.shape(self.item_embeddings)[0]
42
+ random_negatives = tf.random.uniform(
43
+ shape=[batch_size, k_random],
44
+ minval=0,
45
+ maxval=total_items,
46
+ dtype=tf.int32
47
+ )
48
+
49
+ # Combine hard and random negatives
50
+ if self.sampling_strategy == 'hard':
51
+ return hard_negative_indices
52
+ elif self.sampling_strategy == 'random':
53
+ return random_negatives
54
+ else: # mixed
55
+ return tf.concat([hard_negative_indices, random_negatives], axis=1)
56
+
57
+
58
+ class CurriculumLearningScheduler:
59
+ """Curriculum learning scheduler for progressive difficulty."""
60
+
61
+ def __init__(self, total_epochs, warmup_epochs=10):
62
+ self.total_epochs = total_epochs
63
+ self.warmup_epochs = warmup_epochs
64
+
65
+ def get_difficulty_schedule(self, epoch):
66
+ """Get curriculum parameters for current epoch."""
67
+ if epoch < self.warmup_epochs:
68
+ # Easy phase: more random negatives, higher temperature (softer similarities)
69
+ hard_negative_ratio = 0.2
70
+ temperature = 2.0
71
+ negative_samples = 2
72
+ elif epoch < self.total_epochs * 0.6:
73
+ # Medium phase: balanced negatives
74
+ hard_negative_ratio = 0.5
75
+ temperature = 1.0
76
+ negative_samples = 4
77
+ else:
78
+ # Hard phase: more hard negatives, lower temperature (sharper similarities)
79
+ hard_negative_ratio = 0.8
80
+ temperature = 0.5
81
+ negative_samples = 6
82
+
83
+ return {
84
+ 'hard_negative_ratio': hard_negative_ratio,
85
+ 'temperature': temperature,
86
+ 'negative_samples': negative_samples
87
+ }
88
+
89
+
90
+ class ImprovedJointTrainer:
91
+ """Enhanced joint trainer with advanced techniques."""
92
+
93
+ def __init__(self,
94
+ embedding_dim: int = 128,
95
+ learning_rate: float = 0.001,
96
+ use_mixed_precision: bool = True,
97
+ use_curriculum_learning: bool = True,
98
+ use_hard_negatives: bool = True):
99
+
100
+ self.embedding_dim = embedding_dim
101
+ self.learning_rate = learning_rate
102
+ self.use_mixed_precision = use_mixed_precision
103
+ self.use_curriculum_learning = use_curriculum_learning
104
+ self.use_hard_negatives = use_hard_negatives
105
+
106
+ # Enable mixed precision if requested
107
+ if use_mixed_precision:
108
+ policy = tf.keras.mixed_precision.Policy('mixed_float16')
109
+ tf.keras.mixed_precision.set_global_policy(policy)
110
+
111
+ self.model = None
112
+ self.data_processor = None
113
+ self.curriculum_scheduler = None
114
+ self.hard_negative_sampler = None
115
+
116
+ def setup_model(self, data_processor: DataProcessor):
117
+ """Setup the improved model."""
118
+ self.data_processor = data_processor
119
+
120
+ # Create improved model
121
+ self.model = create_improved_model(
122
+ data_processor=data_processor,
123
+ embedding_dim=self.embedding_dim,
124
+ use_bias=True,
125
+ use_focal_loss=True
126
+ )
127
+
128
+ print(f"Created improved two-tower model with {self.embedding_dim}D embeddings")
129
+
130
+ def setup_curriculum_learning(self, total_epochs: int):
131
+ """Setup curriculum learning scheduler."""
132
+ if self.use_curriculum_learning:
133
+ self.curriculum_scheduler = CurriculumLearningScheduler(
134
+ total_epochs=total_epochs,
135
+ warmup_epochs=max(5, total_epochs // 10)
136
+ )
137
+ print("Curriculum learning enabled")
138
+
139
+ def setup_hard_negative_sampling(self, item_features: Dict[str, np.ndarray]):
140
+ """Setup hard negative sampling."""
141
+ if self.use_hard_negatives:
142
+ # Pre-compute item embeddings for efficient hard negative sampling
143
+ print("Pre-computing item embeddings for hard negative sampling...")
144
+
145
+ # Create a dummy batch to get item embeddings
146
+ batch_size = 1000
147
+ total_items = len(item_features['product_id'])
148
+
149
+ item_embeddings_list = []
150
+ for i in range(0, total_items, batch_size):
151
+ end_idx = min(i + batch_size, total_items)
152
+ batch_features = {
153
+ key: tf.constant(value[i:end_idx])
154
+ for key, value in item_features.items()
155
+ }
156
+
157
+ item_emb_output = self.model.item_tower(batch_features, training=False)
158
+ if isinstance(item_emb_output, tuple):
159
+ item_emb = item_emb_output[0] # Get embeddings, ignore bias
160
+ else:
161
+ item_emb = item_emb_output
162
+
163
+ item_embeddings_list.append(item_emb.numpy())
164
+
165
+ item_embeddings = np.vstack(item_embeddings_list)
166
+
167
+ self.hard_negative_sampler = HardNegativeSampler(
168
+ model=self.model,
169
+ item_embeddings=tf.constant(item_embeddings, dtype=tf.float32),
170
+ sampling_strategy='mixed'
171
+ )
172
+ print(f"Hard negative sampling enabled with {len(item_embeddings)} items")
173
+
174
+ def create_advanced_training_dataset(self,
175
+ features: Dict[str, np.ndarray],
176
+ batch_size: int = 256,
177
+ epoch: int = 0) -> tf.data.Dataset:
178
+ """Create training dataset with curriculum learning and hard negatives."""
179
+
180
+ # Get curriculum parameters
181
+ if self.curriculum_scheduler:
182
+ curriculum_params = self.curriculum_scheduler.get_difficulty_schedule(epoch)
183
+ print(f"Epoch {epoch}: {curriculum_params}")
184
+ else:
185
+ curriculum_params = {
186
+ 'hard_negative_ratio': 0.5,
187
+ 'temperature': 1.0,
188
+ 'negative_samples': 4
189
+ }
190
+
191
+ # Filter data based on curriculum (start with easier examples)
192
+ if epoch < 5: # Warmup epochs - use only high-confidence positive examples
193
+ positive_mask = features['rating'] == 1.0
194
+ if np.sum(positive_mask) > 0:
195
+ # Sample subset of positives and all negatives
196
+ positive_indices = np.where(positive_mask)[0]
197
+                negative_indices = np.where(features['rating'] == 0.0)[0]
+
+                # Sample subset for easier learning
+                n_positive_samples = min(len(positive_indices), len(negative_indices))
+                selected_positive = np.random.choice(
+                    positive_indices, size=n_positive_samples, replace=False
+                )
+                selected_negative = np.random.choice(
+                    negative_indices, size=n_positive_samples, replace=False
+                )
+
+                selected_indices = np.concatenate([selected_positive, selected_negative])
+                np.random.shuffle(selected_indices)
+
+                # Filter features
+                filtered_features = {
+                    key: value[selected_indices] for key, value in features.items()
+                }
+            else:
+                filtered_features = features
+        else:
+            filtered_features = features
+
+        # Create dataset
+        dataset = create_tf_dataset(filtered_features, batch_size, shuffle=True)
+
+        return dataset
+
+    def compile_model(self):
+        """Compile model with advanced optimizer."""
+        # Use AdamW with learning rate scheduling
+        initial_learning_rate = self.learning_rate
+        lr_schedule = tf.keras.optimizers.schedules.CosineDecayRestarts(
+            initial_learning_rate=initial_learning_rate,
+            first_decay_steps=1000,
+            t_mul=2.0,
+            m_mul=0.9,
+            alpha=0.01
+        )
+
+        optimizer = tf.keras.optimizers.AdamW(
+            learning_rate=lr_schedule,
+            weight_decay=1e-5,
+            beta_1=0.9,
+            beta_2=0.999,
+            epsilon=1e-7
+        )
+
+        # Enable mixed precision optimizer if needed
+        if self.use_mixed_precision:
+            optimizer = tf.keras.mixed_precision.LossScaleOptimizer(optimizer)
+
+        self.optimizer = optimizer
+        print(f"Model compiled with AdamW optimizer (lr={self.learning_rate})")
+
+    @tf.function
+    def train_step(self, features):
+        """Optimized training step with gradient scaling."""
+        with tf.GradientTape() as tape:
+            # Forward pass
+            loss_dict = self.model.compute_loss(features, training=True)
+            total_loss = loss_dict['total_loss']
+
+            # Scale loss for mixed precision
+            if self.use_mixed_precision:
+                scaled_loss = self.optimizer.get_scaled_loss(total_loss)
+            else:
+                scaled_loss = total_loss
+
+        # Compute gradients
+        if self.use_mixed_precision:
+            scaled_gradients = tape.gradient(scaled_loss, self.model.trainable_variables)
+            gradients = self.optimizer.get_unscaled_gradients(scaled_gradients)
+        else:
+            gradients = tape.gradient(scaled_loss, self.model.trainable_variables)
+
+        # Clip gradients to prevent exploding gradients
+        gradients, _ = tf.clip_by_global_norm(gradients, 1.0)
+
+        # Apply gradients
+        self.optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))
+
+        return loss_dict
+
+    def evaluate_model(self, validation_dataset):
+        """Evaluate model on validation set."""
+        total_losses = defaultdict(list)
+
+        for batch in validation_dataset:
+            loss_dict = self.model.compute_loss(batch, training=False)
+            for key, value in loss_dict.items():
+                total_losses[key].append(float(value))
+
+        # Average losses
+        avg_losses = {key: np.mean(values) for key, values in total_losses.items()}
+        return avg_losses
+
+    def train(self,
+              training_features: Dict[str, np.ndarray],
+              validation_features: Dict[str, np.ndarray],
+              epochs: int = 50,
+              batch_size: int = 256,
+              save_path: str = "src/artifacts/") -> Dict:
+        """Enhanced training loop with all improvements."""
+
+        print(f"Starting improved training for {epochs} epochs...")
+
+        # Setup components
+        self.setup_curriculum_learning(epochs)
+        self.compile_model()
+
+        # Create validation dataset
+        validation_dataset = create_tf_dataset(validation_features, batch_size, shuffle=False)
+
+        # Training history
+        history = defaultdict(list)
+        best_val_loss = float('inf')
+        patience_counter = 0
+        early_stopping_patience = 10
+
+        # Training loop
+        for epoch in range(epochs):
+            epoch_start_time = time.time()
+
+            # Create training dataset for this epoch (curriculum learning)
+            training_dataset = self.create_advanced_training_dataset(
+                training_features, batch_size, epoch
+            )
+
+            # Training
+            epoch_losses = defaultdict(list)
+            num_batches = 0
+
+            for batch in training_dataset:
+                loss_dict = self.train_step(batch)
+
+                for key, value in loss_dict.items():
+                    epoch_losses[key].append(float(value))
+                num_batches += 1
+
+            # Average training losses
+            avg_train_losses = {
+                key: np.mean(values) for key, values in epoch_losses.items()
+            }
+
+            # Validation
+            avg_val_losses = self.evaluate_model(validation_dataset)
+
+            # Log progress
+            epoch_time = time.time() - epoch_start_time
+            print(f"Epoch {epoch+1}/{epochs} ({epoch_time:.1f}s):")
+            print(f"  Train Loss: {avg_train_losses['total_loss']:.4f}")
+            print(f"  Val Loss: {avg_val_losses['total_loss']:.4f}")
+            print(f"  Val Rating Loss: {avg_val_losses['rating_loss']:.4f}")
+            print(f"  Val Retrieval Loss: {avg_val_losses['retrieval_loss']:.4f}")
+
+            # Save history
+            for key, value in avg_train_losses.items():
+                history[f'train_{key}'].append(value)
+            for key, value in avg_val_losses.items():
+                history[f'val_{key}'].append(value)
+
+            # Early stopping and model saving
+            current_val_loss = avg_val_losses['total_loss']
+            if current_val_loss < best_val_loss:
+                best_val_loss = current_val_loss
+                patience_counter = 0
+
+                # Save best model
+                self.save_model(save_path, suffix='_improved_best')
+                print(f"  💾 Saved best model (val_loss: {best_val_loss:.4f})")
+            else:
+                patience_counter += 1
+
+            if patience_counter >= early_stopping_patience:
+                print(f"Early stopping at epoch {epoch+1}")
+                break
+
+        # Save final model and history
+        self.save_model(save_path, suffix='_improved_final')
+        self.save_training_history(dict(history), save_path)
+
+        print("✅ Improved training completed!")
+        return dict(history)
+
+    def save_model(self, save_path: str, suffix: str = ''):
+        """Save the trained model components."""
+        os.makedirs(save_path, exist_ok=True)
+
+        # Save model weights
+        self.model.item_tower.save_weights(f"{save_path}/improved_item_tower_weights{suffix}")
+        self.model.user_tower.save_weights(f"{save_path}/improved_user_tower_weights{suffix}")
+
+        if hasattr(self.model, 'rating_model'):
+            self.model.rating_model.save_weights(f"{save_path}/improved_rating_model_weights{suffix}")
+
+        # Save configuration
+        config = {
+            'embedding_dim': self.embedding_dim,
+            'learning_rate': self.learning_rate,
+            'use_mixed_precision': self.use_mixed_precision,
+            'use_curriculum_learning': self.use_curriculum_learning,
+            'use_hard_negatives': self.use_hard_negatives
+        }
+
+        with open(f"{save_path}/improved_model_config{suffix}.txt", 'w') as f:
+            for key, value in config.items():
+                f.write(f"{key}: {value}\n")
+
+        print(f"Model saved to {save_path} with suffix '{suffix}'")
+
+    def save_training_history(self, history: Dict, save_path: str):
+        """Save training history."""
+        with open(f"{save_path}/improved_training_history.pkl", 'wb') as f:
+            pickle.dump(history, f)
+        print(f"Training history saved to {save_path}")
+
+
+def main():
+    """Demo of improved training."""
+    print("🚀 IMPROVED TWO-TOWER TRAINING DEMO")
+    print("="*60)
+
+    # Load data
+    print("Loading training data...")
+    try:
+        with open("src/artifacts/training_features.pkl", 'rb') as f:
+            training_features = pickle.load(f)
+        with open("src/artifacts/validation_features.pkl", 'rb') as f:
+            validation_features = pickle.load(f)
+
+        print(f"Loaded {len(training_features['rating'])} training samples")
+        print(f"Loaded {len(validation_features['rating'])} validation samples")
+    except FileNotFoundError:
+        print("❌ Training data not found. Please run data preparation first.")
+        return
+
+    # Load data processor
+    data_processor = DataProcessor()
+    data_processor.load_vocabularies("src/artifacts/vocabularies.pkl")
+
+    # Create trainer
+    trainer = ImprovedJointTrainer(
+        embedding_dim=128,
+        learning_rate=0.001,
+        use_mixed_precision=True,
+        use_curriculum_learning=True,
+        use_hard_negatives=True
+    )
+
+    # Setup and train
+    trainer.setup_model(data_processor)
+
+    # Train model
+    history = trainer.train(
+        training_features=training_features,
+        validation_features=validation_features,
+        epochs=30,
+        batch_size=256
+    )
+
+    print("✅ Improved training completed successfully!")
+
+
+if __name__ == "__main__":
+    main()
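The balanced positive/negative sampling used by `create_advanced_training_dataset` can be sketched in plain NumPy. This is an illustrative standalone version, not code from the repo; the function name `balanced_sample` and the seeded generator are assumptions for the sketch:

```python
import numpy as np

def balanced_sample(ratings: np.ndarray, seed: int = 0) -> np.ndarray:
    """Return shuffled row indices with equal counts of positive and negative examples."""
    rng = np.random.default_rng(seed)
    positive = np.where(ratings > 0.0)[0]
    negative = np.where(ratings == 0.0)[0]
    # Cap both classes at the size of the smaller one, as the trainer does.
    n = min(len(positive), len(negative))
    chosen = np.concatenate([
        rng.choice(positive, size=n, replace=False),
        rng.choice(negative, size=n, replace=False),
    ])
    rng.shuffle(chosen)
    return chosen

ratings = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
idx = balanced_sample(ratings)
print(len(idx))            # 6: three positives matched with three negatives
print(ratings[idx].sum())  # 3.0: exactly half of the sampled rows are positive
```

In the trainer itself the same index array is then used to slice every feature array in the batch dictionary before building the `tf.data.Dataset`.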
src/training/item_pretraining.py CHANGED
@@ -12,8 +12,8 @@ class ItemTowerPretrainer:
     """Handles pre-training of the item tower."""
 
     def __init__(self,
-                 embedding_dim: int = 64,
-                 hidden_dims: List[int] = [128, 64],
+                 embedding_dim: int = 128,  # Updated to 128D output
+                 hidden_dims: List[int] = [256, 128],  # Scaled up
                  dropout_rate: float = 0.2,
                  learning_rate: float = 0.001):
@@ -111,19 +111,26 @@ class ItemTowerPretrainer:
         return history
 
     def generate_item_embeddings(self,
-                                 dataset: tf.data.Dataset) -> Dict[int, np.ndarray]:
+                                 dataset: tf.data.Dataset,
+                                 data_processor: 'DataProcessor') -> Dict[int, np.ndarray]:
         """Generate embeddings for all items in the catalog."""
 
         item_embeddings = {}
 
+        # Create reverse mapping from vocab indices to actual item IDs
+        idx_to_item_id = {idx: item_id for item_id, idx in data_processor.item_vocab.items()}
+
         for batch in dataset:
             embeddings = self.item_tower(batch)
-            product_ids = batch['product_id'].numpy()
+            product_idx_batch = batch['product_id'].numpy()
 
-            for i, product_id in enumerate(product_ids):
-                item_embeddings[product_id] = embeddings[i].numpy()
+            for i, product_idx in enumerate(product_idx_batch):
+                # Convert vocab index back to actual item ID
+                actual_item_id = idx_to_item_id.get(product_idx, product_idx)
+                item_embeddings[actual_item_id] = embeddings[i].numpy()
 
         print(f"Generated embeddings for {len(item_embeddings)} items")
+        print(f"Sample item IDs: {list(item_embeddings.keys())[:5]}")
         return item_embeddings
 
     def save_model(self, save_path: str = "src/artifacts/"):
@@ -185,7 +192,7 @@ def main():
     # Initialize components
     data_processor = DataProcessor()
     pretrainer = ItemTowerPretrainer(
-        embedding_dim=64,
+        embedding_dim=128,  # Updated to 128D
         hidden_dims=[128, 64],
         dropout_rate=0.2,
         learning_rate=0.001
@@ -210,7 +217,7 @@ def main():
 
     # Generate embeddings
     print("Generating item embeddings...")
-    item_embeddings = pretrainer.generate_item_embeddings(dataset)
+    item_embeddings = pretrainer.generate_item_embeddings(dataset, data_processor)
 
     # Save everything
     print("Saving artifacts...")
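The fix in `generate_item_embeddings` hinges on inverting the vocabulary once so that exported embeddings are keyed by real catalog IDs instead of the dense indices the embedding layer uses. A minimal sketch of that mapping (the vocab entries below are made-up placeholders):

```python
# Vocabulary maps real catalog IDs to the dense indices used by the embedding layer.
item_vocab = {"B00123": 0, "B00456": 1, "B00789": 2}

# Invert it once so lookups during embedding export are O(1) per item.
idx_to_item_id = {idx: item_id for item_id, idx in item_vocab.items()}

# A batch of model outputs arrives keyed by vocab index ...
batch_indices = [2, 0]
# ... but downstream consumers (serving, analysis) want catalog IDs.
keys = [idx_to_item_id.get(i, i) for i in batch_indices]
print(keys)  # ['B00789', 'B00123']
```

The `.get(i, i)` fallback mirrors the diff: an index missing from the vocab falls through as-is rather than raising mid-export.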
src/training/joint_training.py CHANGED
@@ -13,7 +13,7 @@ class JointTrainer:
     """Handles joint training of user and item towers."""
 
     def __init__(self,
-                 embedding_dim: int = 64,
+                 embedding_dim: int = 128,  # Updated to 128D output
                  user_learning_rate: float = 0.001,
                  item_learning_rate: float = 0.0001,  # Lower LR for pre-trained item tower
                  rating_weight: float = 1.0,
@@ -330,7 +330,7 @@ def main():
 
     # Initialize trainer
     trainer = JointTrainer(
-        embedding_dim=64,
+        embedding_dim=128,  # Updated to 128D
         user_learning_rate=0.001,
         item_learning_rate=0.0001,
         rating_weight=1.0,
src/training/optimized_joint_training.py CHANGED
@@ -14,7 +14,7 @@ class OptimizedJointTrainer:
     """Optimized joint training with performance enhancements."""
 
     def __init__(self,
-                 embedding_dim: int = 64,
+                 embedding_dim: int = 128,  # Updated to 128D output
                  user_learning_rate: float = 0.001,
                  item_learning_rate: float = 0.0001,
                  rating_weight: float = 1.0,
@@ -381,7 +381,7 @@ def main():
 
     print("Initializing optimized joint trainer...")
    trainer = OptimizedJointTrainer(
-        embedding_dim=64,
+        embedding_dim=128,  # Updated to 128D
         user_learning_rate=0.002,  # Slightly higher for faster convergence
         item_learning_rate=0.0002,
         rating_weight=1.0,
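All three trainers now emit 128-dimensional towers. At serving time, two-tower retrieval reduces to a dot product between one user embedding and the item embedding matrix; a NumPy sketch of that scoring step (only the embedding dimension comes from this diff, the rest is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
embedding_dim = 128                      # matches the updated towers
n_items = 1000

item_matrix = rng.normal(size=(n_items, embedding_dim)).astype(np.float32)
user_vec = rng.normal(size=(embedding_dim,)).astype(np.float32)

# L2-normalize both sides so the dot product is cosine similarity.
item_matrix /= np.linalg.norm(item_matrix, axis=1, keepdims=True)
user_vec /= np.linalg.norm(user_vec)

scores = item_matrix @ user_vec          # shape (n_items,)
top_k = np.argsort(-scores)[:10]         # indices of the 10 best-scoring items
print(scores.shape, len(top_k))
```

Doubling the dimension from 64 to 128 doubles this matrix's memory and per-query FLOPs, which is the serving-side cost of the added model capacity.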