minhajHP committed on
Commit 7b5d392 · 1 Parent(s): 644ceea

Clean codebase and add demographic enhancements

- Enhanced demographic handling for zero interactions
- Improved content-based filtering with aggregated history
- Added comprehensive UI support for new user scenarios
- Cleaned up analysis files and redundant components
- Updated documentation and project structure

DEEP_ARCHITECTURE.md ADDED
@@ -0,0 +1,549 @@
+ # Deep Architecture Documentation - RecSys-HP
+
+ ## 🏗️ Complete System Architecture Overview
+
+ ```mermaid
+ graph TB
+     subgraph "Data Layer"
+         D1[items.csv<br/>15K+ products]
+         D2[users.csv<br/>Enhanced demographics]
+         D3[interactions.csv<br/>User-item interactions]
+         D4[Artifacts<br/>Trained models & indices]
+     end
+
+     subgraph "ML Pipeline"
+         P1[Data Preprocessing]
+         P2[Item Tower Pre-training]
+         P3[FAISS Index Creation]
+         P4[Joint Training]
+         P5[User Tower Training]
+     end
+
+     subgraph "Inference Layer"
+         I1[Recommendation Engine<br/>Category-Boosted Algorithm]
+         I2[FAISS Similarity Search]
+         I3[Real User Selection]
+         I4[Hybrid Scoring]
+     end
+
+     subgraph "API Layer"
+         A1[FastAPI Server<br/>Port 8000]
+         A2[Recommendation Endpoints]
+         A3[User Management]
+         A4[Item Retrieval]
+     end
+
+     subgraph "Frontend Layer"
+         F1[React.js Application]
+         F2[Interactive UI Components]
+         F3[Real-time Analytics]
+         F4[User Profile Management]
+     end
+
+     D1 --> P1
+     D2 --> P1
+     D3 --> P1
+     P1 --> P2
+     P2 --> P3
+     P3 --> P4
+     P4 --> P5
+     P5 --> D4
+     D4 --> I1
+     I1 --> I2
+     I2 --> I3
+     I3 --> I4
+     I4 --> A1
+     A1 --> A2
+     A2 --> A3
+     A3 --> A4
+     A4 --> F1
+     F1 --> F2
+     F2 --> F3
+     F3 --> F4
+ ```
+
+ ---
+
+ ## 📁 Project Structure
+
+ ```
+ RecSys-HP/
+ ├── 🗂️ Data Layer
+ │   ├── datasets/
+ │   │   ├── items.csv                 # 15K+ product catalog
+ │   │   ├── users.csv                 # Enhanced user demographics (7 features)
+ │   │   ├── interactions.csv          # User-item interaction history
+ │   │   └── users_enhanced.csv        # Backup with enhanced features
+ │   └── src/artifacts/                # Trained models and indices
+ │       ├── item_embeddings.npy       # Pre-trained item vectors (128D)
+ │       ├── faiss_item_index.bin      # FAISS similarity index
+ │       ├── faiss_metadata.pkl        # Item metadata mapping
+ │       ├── vocabularies.pkl          # Categorical encoders
+ │       └── *.weights.* files         # TensorFlow model weights
+ │
+ ├── 🧠 ML Pipeline
+ │   ├── src/preprocessing/
+ │   │   ├── data_loader.py            # Data loading and preprocessing
+ │   │   └── user_data_preparation.py  # User feature engineering
+ │   ├── src/training/
+ │   │   ├── item_pretraining.py       # Item tower pre-training
+ │   │   └── joint_training.py         # Two-tower joint training
+ │   ├── src/models/
+ │   │   ├── item_tower.py             # Item embedding model (TensorFlow)
+ │   │   └── user_tower.py             # User embedding model (TensorFlow)
+ │   └── Training Scripts
+ │       ├── run_training_pipeline.py  # Complete pipeline executor
+ │       ├── run_2phase_training.py    # 2-phase training approach
+ │       └── run_joint_training.py     # Joint training approach
+ │
+ ├── 🔍 Inference Layer
+ │   ├── src/inference/
+ │   │   ├── recommendation_engine.py  # Core recommendation algorithms
+ │   │   └── faiss_index.py            # FAISS index management
+ │   ├── src/utils/
+ │   │   └── real_user_selector.py     # Real user data selection
+ │   └── src/data_generation/
+ │       └── generate_demographics.py  # Synthetic user generation
+ │
+ ├── 🌐 API Layer
+ │   └── api/
+ │       └── main.py                   # FastAPI server with all endpoints
+ │
+ ├── 🎨 Frontend Layer
+ │   └── frontend/
+ │       ├── src/
+ │       │   ├── App.js                # Main React application
+ │       │   └── index.js              # Entry point
+ │       ├── public/                   # Static assets
+ │       ├── build/                    # Production build
+ │       └── package.json              # Dependencies
+ │
+ └── 🧪 Testing & Analysis
+     ├── test_category_boosted.py          # Basic algorithm testing
+     ├── test_enhanced_category_boosted.py # Advanced subcategory testing
+     └── deep_analyze_category_boosted.py  # Comprehensive analysis tool
+ ```
+
+ ---
+
+ ## 🔄 Data Flow Architecture
+
+ ### 1. Training Pipeline Flow
+ ```mermaid
+ sequenceDiagram
+     participant D as Data Files
+     participant P as Preprocessing
+     participant IT as Item Tower
+     participant F as FAISS Index
+     participant JT as Joint Training
+     participant UT as User Tower
+     participant A as Artifacts
+
+     D->>P: Load datasets (items, users, interactions)
+     P->>IT: Preprocessed item features
+     IT->>IT: Pre-train item embeddings (128D)
+     IT->>F: Generate item vectors
+     F->>F: Build FAISS similarity index
+     IT->>JT: Pre-trained item tower
+     P->>JT: User features (7 demographics)
+     JT->>UT: Train user tower
+     JT->>A: Save trained models
+     F->>A: Save FAISS index
+ ```
+
+ ### 2. Inference Pipeline Flow
+ ```mermaid
+ sequenceDiagram
+     participant U as User Request
+     participant API as FastAPI
+     participant RE as Recommendation Engine
+     participant F as FAISS Search
+     participant CB as Category Boosted
+     participant R as Response
+
+     U->>API: POST /recommendations
+     API->>RE: User profile + preferences
+     RE->>F: Query item embeddings
+     F->>RE: Similar items (k*10 wide search)
+     RE->>CB: Apply category-boosted algorithm
+     CB->>CB: 50% from user categories + proportional distribution
+     CB->>RE: Balanced recommendations
+     RE->>API: Scored & ranked items
+     API->>R: JSON response with recommendations
+ ```
+
+ ---
+
+ ## 🧠 Machine Learning Architecture
+
+ ### Two-Tower Architecture
+ ```
+ ITEM TOWER
+   Item Features: product_id, category_code, brand, price (log)
+         │
+         ▼
+   Dense Layers: Dense(256, ReLU) → Dropout(0.3) → Dense(128, ReLU),
+                 with L2 regularization
+         │
+         ▼
+   Item Embedding (128D)
+
+ USER TOWER
+   Demographic Features: age (normalized), gender (encoded), income (binned),
+                         profession, location, education_level, marital_status
+         │
+         ▼
+   Dense Layers: Dense(128, ReLU) → Dropout(0.2) → Dense(64, ReLU),
+                 with L2 regularization
+         │
+         ▼
+   User Embedding (64D)
+
+ SCORING
+   similarity = user_emb · item_emb   (dot-product similarity score)
+ ```
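
The scoring block reduces to a single dot product. A minimal NumPy sketch with toy random vectors (the tower outputs above are 128D for items vs. 64D for users, so a shared scoring dimension is assumed here and both sides use 128D):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings; in the real system these come from the two towers.
item_embs = rng.normal(size=(5, 128)).astype(np.float32)  # 5 candidate items
user_emb = rng.normal(size=128).astype(np.float32)        # one user

# Dot-product similarity between the user and every candidate item
scores = item_embs @ user_emb

# Rank candidates by descending similarity
ranking = np.argsort(-scores)
print(ranking)
```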
+
+ ### Category-Boosted Algorithm
+ ```
+ CATEGORY-BOOSTED RECOMMENDATION FLOW
+
+ 1. USER INTERACTION ANALYSIS
+    interaction_history: [1001, 2003, 3045, 1099]
+         │
+         ▼
+    Extract 2-level subcategories:
+      • computers.components: 40%
+      • electronics.audio: 35%
+      • computers.peripherals: 25%
+
+ 2. WIDE SIMILARITY SEARCH
+    FAISS.search(user_embedding, k = requested * 10)
+    Returns: ~1000 similar items
+         │
+         ▼
+ 3. CATEGORY ORGANIZATION
+    Group by subcategories:
+      computers.components:  [1001, 1099, 1203, ...]
+      electronics.audio:     [2003, 2156, 2089, ...]
+      computers.peripherals: [3045, 3201, 3078, ...]
+      other_categories:      [4001, 5002, ...]
+         │
+         ▼
+ 4. PROPORTIONAL ALLOCATION
+    Target: 50% from user categories
+    (50 items for 100 recommendations)
+      computers.components: 40% → 20 items
+      electronics.audio: 35% → 18 items
+      computers.peripherals: 25% → 12 items
+    Remaining 50 items: diverse mix
+         │
+         ▼
+ 5. FINAL RECOMMENDATION SET
+    • 50 items from user's categories (proportionally distributed)
+    • 50 items for exploration
+    • All ranked by similarity score
+    • Ensures category diversity
+ ```
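
The slot arithmetic of the proportional-allocation step can be sketched as follows. `allocate_slots` is a hypothetical helper, not the project's actual function; largest-remainder rounding (ties going to the larger category) is assumed so the worked numbers above (20/18/12) come out:

```python
def allocate_slots(category_shares, n_total, user_share=0.5):
    """Split recommendation slots: `user_share` of them go to the user's own
    subcategories, proportionally to interaction share; the rest is exploration.
    Sketch of the allocation step, assuming largest-remainder rounding."""
    user_slots = int(n_total * user_share)
    raw = {cat: user_slots * share for cat, share in category_shares.items()}
    alloc = {cat: int(r) for cat, r in raw.items()}  # floor of each raw count
    leftover = user_slots - sum(alloc.values())
    # Hand out slots lost to flooring: largest fractional remainder first,
    # breaking ties in favour of the category with the bigger share.
    order = sorted(category_shares,
                   key=lambda c: (raw[c] - alloc[c], category_shares[c]),
                   reverse=True)
    for cat in order[:leftover]:
        alloc[cat] += 1
    return alloc, n_total - user_slots

shares = {"computers.components": 0.40,
          "electronics.audio": 0.35,
          "computers.peripherals": 0.25}
alloc, exploration_slots = allocate_slots(shares, 100)
print(alloc, exploration_slots)  # 20 / 18 / 12 user-category slots, 50 exploration
```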
+
+ ---
+
+ ## 🌐 API Architecture
+
+ ### FastAPI Server Endpoints
+ ```
+ # Core Recommendation Endpoints
+ POST /recommendations                   # Main recommendation engine
+ GET  /real-users                        # Fetch real user profiles
+ GET  /items/{item_id}                   # Get item details
+ GET  /dataset-summary                   # Dataset statistics
+
+ # Algorithm-Specific Endpoints
+ POST /recommendations/hybrid            # Hybrid collaborative + content
+ POST /recommendations/collaborative     # Pure collaborative filtering
+ POST /recommendations/content           # Aggregated history content-based recommendations
+ POST /recommendations/category_boosted  # Category-boosted algorithm
+
+ # Utility Endpoints
+ GET  /                                  # Health check
+ GET  /sample-items                      # Random item samples
+ POST /generate-interactions             # Synthetic interaction generation
+ ```
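
A request body for `POST /recommendations` might be assembled like this. The schema is an assumption inferred from the frontend state fields; names like `user_profile`, `algorithm`, and `num_recommendations` are illustrative, not the server's confirmed contract:

```python
import json

# Hypothetical request body for POST /recommendations. The profile fields
# mirror the frontend's userProfile state; the wrapper keys are assumptions.
payload = {
    "user_profile": {
        "age": 30,
        "gender": "male",
        "income": 50000,
        "profession": "Technology",
        "location": "Urban",
        "education_level": "Bachelor's",
        "marital_status": "Single",
        "interaction_history": [1001, 2003, 3045, 1099],
    },
    "algorithm": "category_boosted",
    "num_recommendations": 100,
}

body = json.dumps(payload)  # serialized JSON ready to send to the API
print(len(body) > 0)
```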
+
+ ### Request/Response Flow
+ ```mermaid
+ graph LR
+     subgraph "Request Processing"
+         A[User Request] --> B[Validation]
+         B --> C[Feature Engineering]
+         C --> D[Model Inference]
+     end
+
+     subgraph "Recommendation Engine"
+         D --> E[FAISS Search]
+         E --> F[Category Analysis]
+         F --> G[Score Calculation]
+         G --> H[Ranking & Filtering]
+     end
+
+     subgraph "Response Generation"
+         H --> I[Item Enrichment]
+         I --> J[Metadata Addition]
+         J --> K[JSON Response]
+     end
+ ```
+
+ ---
+
+ ## 🎨 Frontend Architecture
+
+ ### React.js Component Structure
+ ```
+ App.js (Main Container)
+ ├── User Profile Management
+ │   ├── Demographics Form
+ │   ├── Real User Selection
+ │   └── Interaction History Display
+ ├── Recommendation Controls
+ │   ├── Algorithm Selection
+ │   ├── Count Configuration
+ │   └── Weight Adjustment
+ ├── Results Display
+ │   ├── Recommendation Cards
+ │   ├── Category Analytics
+ │   ├── Pagination Controls
+ │   └── Similar Items View
+ └── Analysis Components
+     ├── Category Interest Graphs
+     ├── Interaction Patterns
+     └── Performance Metrics
+ ```
+
+ ### State Management
+ ```javascript
+ const [userProfile, setUserProfile] = useState({
+   age: 30,
+   gender: 'male',
+   income: 50000,
+   profession: 'Technology',
+   location: 'Urban',
+   education_level: "Bachelor's",
+   marital_status: 'Single',
+   interaction_history: []
+ });
+
+ const [recommendationType, setRecommendationType] = useState('category_boosted');
+ const [recommendations, setRecommendations] = useState([]);
+ const [realUsers, setRealUsers] = useState([]);
+ const [datasetSummary, setDatasetSummary] = useState(null);
+ ```
+
+ ---
+
+ ## 🔍 Algorithm Deep Dive
+
+ ### 1. Hybrid Recommendation
+ - **Collaborative Filtering**: User-item interaction patterns
+ - **Aggregated Content-Based**: User's complete interaction history aggregated into a single embedding
+ - **Weight Balance**: Configurable collaborative weight (default: 0.7)
+
+ ### 1.5. Aggregated History Content-Based Filtering
+ - **Core Idea**: Aggregates the user's entire interaction history instead of relying on single-item similarity
+ - **Aggregation Methods**:
+   - **Weighted Mean**: `weights = exp(linspace(-1, 0, len(history)))` (recent interactions weighted higher)
+   - **Simple Mean**: Equal weighting of all interaction embeddings
+   - **Max Pooling**: Element-wise maximum across all embeddings
+ - **Process Flow**:
+   1. **Embedding Extraction**: Get the 128D vector for each item in the user's history
+   2. **Aggregation**: Apply the selected aggregation method (`weighted_mean` by default)
+   3. **Normalization**: L2-normalize the aggregated embedding
+   4. **ANN Search**: Direct FAISS similarity search using the aggregated user profile
+   5. **Filtering**: Remove already-interacted items from the results
+ - **Benefits**: Captures the complete user preference profile; more robust than a single-item seed
+
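The five-step flow above can be sketched in NumPy. `aggregate_history` is an assumed helper name, not the project's actual function; the weighted-mean formula is taken verbatim from the bullet above:

```python
import numpy as np

def aggregate_history(history_embs, method="weighted_mean"):
    """Aggregate a user's interaction-history embeddings into one profile
    vector. history_embs: (n, d) array, oldest interaction first."""
    n = len(history_embs)
    if method == "weighted_mean":
        weights = np.exp(np.linspace(-1.0, 0.0, n))  # recent items weigh more
        weights /= weights.sum()
        agg = weights @ history_embs
    elif method == "mean":
        agg = history_embs.mean(axis=0)
    elif method == "max":
        agg = history_embs.max(axis=0)  # element-wise max pooling
    else:
        raise ValueError(f"unknown method: {method}")
    return agg / np.linalg.norm(agg)  # L2-normalize before the ANN search

rng = np.random.default_rng(1)
history = rng.normal(size=(4, 128))  # 4 interactions, 128D item embeddings
profile = aggregate_history(history)
print(profile.shape, round(float(np.linalg.norm(profile)), 6))  # (128,) 1.0
```

The L2-normalized `profile` vector is what gets handed to the FAISS search in step 4.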
+ ### 2. Category-Boosted Algorithm
+ - **Step 1**: Analyze the user's subcategory preferences (2-level depth)
+ - **Step 2**: Wide FAISS search (k × 10 multiplier)
+ - **Step 3**: Category organization and candidate grouping
+ - **Step 4**: Proportional allocation (50% from user categories)
+ - **Step 5**: Fill the remaining 50% with exploration items
+
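Step 1 (2-level subcategory preference extraction) might look like the sketch below; the helper name and sample category codes are illustrative, not the project's actual code:

```python
from collections import Counter

def subcategory_shares(category_codes, levels=2):
    """Truncate each category_code to `levels` dot-separated segments and
    compute each subcategory's share of the user's interaction history."""
    truncated = [".".join(code.split(".")[:levels]) for code in category_codes]
    counts = Counter(truncated)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

# Hypothetical category codes for a 4-item interaction history
history_categories = [
    "computers.components.cpu",
    "electronics.audio.headphone",
    "computers.peripherals.mouse",
    "computers.components.videocards",
]
print(subcategory_shares(history_categories))
# {'computers.components': 0.5, 'electronics.audio': 0.25, 'computers.peripherals': 0.25}
```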
+ ### 3. FAISS Integration
+ - **Index Type**: Flat L2 similarity search
+ - **Vector Dimension**: 128D item embeddings
+ - **Search Strategy**: Wide retrieval + post-processing
+ - **Metadata**: Item-to-index mapping via pickle files
+
+ ---
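
For reference, a flat L2 index is just brute-force nearest-neighbour search. The sketch below reproduces the wide-retrieval step (`k = requested × 10`) in plain NumPy so it runs without the `faiss` dependency; the real system would call `index.search` on the loaded `faiss_item_index.bin` instead:

```python
import numpy as np

rng = np.random.default_rng(2)
index_vectors = rng.normal(size=(1000, 128)).astype(np.float32)  # item embeddings
query = rng.normal(size=(1, 128)).astype(np.float32)             # user profile

requested = 5
k = requested * 10  # wide search multiplier, as in step 2 above

# Squared L2 distance to every indexed item (what an IndexFlatL2 computes)
d2 = ((index_vectors - query) ** 2).sum(axis=1)
candidates = np.argsort(d2)[:k]  # 50 nearest candidates for post-processing

print(candidates.shape)  # (50,)
```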
+
+ ## 📊 Performance Characteristics
+
+ ### Scalability Metrics
+ - **Items**: 15K+ products supported
+ - **Users**: Unlimited (stateless design)
+ - **Recommendations**: 1-1000 per request
+ - **Response Time**: <2s for 100 recommendations
+ - **Memory Usage**: ~500MB for full model + index
+
+ ### Algorithm Performance
+ - **Category Matching**: ≥50% from user's categories
+ - **Diversity Score**: Balanced exploration vs exploitation
+ - **Cold Start**: Handles new users via demographic features
+ - **Subcategory Precision**: 2-level category matching
+
+ ---
+
+ ## 🚀 Deployment Architecture
+
+ ### Development Environment
+ ```bash
+ # Backend (FastAPI)
+ cd api && python main.py
+
+ # Frontend (React)
+ cd frontend && npm start
+
+ # Training Pipeline
+ python run_training_pipeline.py
+ ```
+
+ ### Production Considerations
+ - **Containerization**: Docker support for the API + frontend
+ - **Database**: PostgreSQL for production user/item storage
+ - **Caching**: Redis for recommendation caching
+ - **Load Balancing**: Nginx in front of multiple API instances
+ - **Monitoring**: Prometheus + Grafana for metrics
+
+ ---
+
+ ## 🔧 Configuration & Customization
+
+ ### Model Configuration
+ ```python
+ # Item Tower
+ ITEM_EMBEDDING_DIM = 128
+ ITEM_HIDDEN_LAYERS = [256, 128]
+ ITEM_DROPOUT_RATE = 0.3
+
+ # User Tower
+ USER_EMBEDDING_DIM = 64
+ USER_HIDDEN_LAYERS = [128, 64]
+ USER_DROPOUT_RATE = 0.2
+
+ # Training
+ BATCH_SIZE = 512
+ LEARNING_RATE = 0.001
+ EPOCHS = 100
+ VALIDATION_SPLIT = 0.2
+ ```
+
+ ### Algorithm Parameters
+ ```python
+ # Category-Boosted
+ WIDE_SEARCH_MULTIPLIER = 10
+ USER_CATEGORY_PERCENTAGE = 0.5
+ SUBCATEGORY_LEVELS = 2
+ MIN_INTERACTION_THRESHOLD = 5
+
+ # FAISS
+ INDEX_TYPE = "Flat"
+ SIMILARITY_METRIC = "L2"
+ SEARCH_PARAMS = {"nprobe": 10}  # nprobe applies to IVF-style indices; a Flat index ignores it
+
+ # Aggregated Content-Based
+ AGGREGATION_METHOD = "weighted_mean"  # "mean", "weighted_mean", "max"
+ TEMPORAL_DECAY_ALPHA = 1.0            # Controls recency weighting strength
+ HISTORY_LIMIT = 50                    # Max items to consider for aggregation
+ ```
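
One plausible reading of how `TEMPORAL_DECAY_ALPHA` modulates the recency weighting (the exact parameterization is an assumption: scaling the exponent range means `alpha = 0` degenerates to a simple mean):

```python
import numpy as np

def recency_weights(n, alpha=1.0):
    """Assumed form of the recency weighting: alpha scales the exponent
    range of the weighted-mean formula, so larger alpha means stronger
    preference for recent interactions. Oldest interaction comes first."""
    w = np.exp(alpha * np.linspace(-1.0, 0.0, n))
    return w / w.sum()  # normalize so the weights sum to 1

print(np.round(recency_weights(4, alpha=1.0), 3))  # newest item gets the largest weight
print(np.round(recency_weights(4, alpha=0.0), 3))  # uniform: [0.25 0.25 0.25 0.25]
```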
+
+ ---
+
+ ## 🧪 Testing Framework
+
+ ### Test Coverage
+ - **Unit Tests**: Individual algorithm components
+ - **Integration Tests**: End-to-end recommendation flow
+ - **Performance Tests**: Latency and throughput benchmarks
+ - **Accuracy Tests**: Category matching validation
+
+ ### Analysis Tools
+ - `test_category_boosted.py`: Basic algorithm validation
+ - `test_enhanced_category_boosted.py`: Advanced subcategory testing
+ - `deep_analyze_category_boosted.py`: Comprehensive performance analysis
+ - **`analyze_recommendation_alignment.py`**: **NEW** - Multi-algorithm alignment analysis
+   - Tests all 4 algorithms (collaborative, content, hybrid, category_boosted)
+   - Category alignment scoring and coverage analysis
+   - Diversity vs. relevance trade-off analysis
+   - User-specific algorithm performance comparison
+   - Generates comprehensive visualizations and reports
+
+ ### Algorithm Comparison Metrics
+ - **Top-Level Alignment**: % of recommendations matching the user's preferred categories
+ - **Subcategory Precision**: 2-level category matching accuracy
+ - **Coverage Score**: % of the user's categories represented in the recommendations
+ - **Diversity Score**: Shannon entropy of recommendation categories
+ - **Performance by Scale**: Algorithm behavior across 10-100+ recommendations
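
The Diversity Score (Shannon entropy over recommended categories) can be computed as below; log base 2 is an assumption of this sketch, and the two sample lists are synthetic:

```python
from collections import Counter
from math import log2

def diversity_score(categories):
    """Shannon entropy of the category distribution of a recommendation
    list: higher means the list is more evenly spread across categories."""
    counts = Counter(categories)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

uniform = ["a", "b", "c", "d"] * 25    # 4 categories, evenly spread
skewed = ["a"] * 97 + ["b", "c", "d"]  # dominated by one category

print(diversity_score(uniform))  # 2.0 — the maximum for 4 categories
print(diversity_score(skewed))   # much lower
```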
+
+ ---
+
+ ## 📈 Future Enhancements
+
+ ### Planned Features
+ 1. **Real-time Learning**: Online model updates
+ 2. **A/B Testing**: Algorithm comparison framework
+ 3. **Explainability**: Recommendation reasoning
+ 4. **Multi-objective**: Balancing relevance, diversity, novelty
+ 5. **Graph Neural Networks**: Advanced relationship modeling
+
+ ### Technical Debt
+ - [ ] Add comprehensive error handling
+ - [ ] Implement request caching
+ - [ ] Add model versioning
+ - [ ] Create automated testing pipeline
+ - [ ] Add performance monitoring
+
+ ---
+
+ This deep architecture documentation provides a comprehensive view of the RecSys-HP recommendation system, covering all layers from data storage to user interface, with detailed technical specifications and implementation details.
README.md CHANGED
@@ -1,6 +1,6 @@
  # Advanced Two-Tower Recommendation System
 
- A production-ready recommendation system implementation using TensorFlow Recommenders with an enhanced two-tower architecture. This system provides personalized item recommendations through collaborative filtering, content-based filtering, and hybrid approaches, featuring categorical demographics, curriculum learning, and advanced training strategies.
 
  ## 🎯 Project Overview
 
@@ -11,7 +11,7 @@ This recommendation system addresses the challenge of providing personalized ite
  - **🧠 Enhanced Two-Tower Architecture**: 128D embeddings with temperature scaling and contrastive learning
  - **📚 Curriculum Learning**: Progressive training strategy for improved convergence
  - **⚡ Real-time Inference**: Sub-100ms recommendation serving with FAISS indexing
- - **🔄 Multi-strategy Recommendations**: Collaborative, content-based, and hybrid approaches
  - **🎪 Category-Aware Boosting**: Enhanced personalization through user preference alignment
  - **🔍 Interactive Similar Items**: Click-to-explore with 60/40 category-balanced discovery
  - **📊 Comprehensive Analysis**: Quality metrics and performance evaluation tools
@@ -70,11 +70,22 @@ The system implements a sophisticated two-tower neural network architecture opti
  - **Stage 3**: Complex cases (long history) - 67th+ percentile
  - **Adaptive Learning Rates**: Decrease as stages progress for stability
 
- ### 5. Category-Aware Recommendation Engine 🎪
  - **Enhanced Hybrid Recommendations**: Category boosting based on user preferences
  - **Category Alignment Analysis**: Measures personalization effectiveness
  - **Diversity Controls**: Balanced category representation in recommendations
- - **Explanation Generation**: Detailed reasoning for recommendation choices
 
  ## 📁 Project Structure
 
@@ -134,10 +145,8 @@ RecSys-HP/
  ├── 🚀 Training Scripts              # Multiple training approaches
  │   ├── run_training_pipeline.py     # Main training orchestration
  │   ├── run_2phase_training.py       # 2-phase training approach
- │   ├── run_joint_training.py        # Joint training approach
- │   └── train_improved_model.py      # Enhanced model training
  │
- ├── 📊 analyze_recommendations.py    # Recommendation quality analysis
  └── 📋 requirements.txt              # Python dependencies
  ```
 
@@ -478,24 +487,11 @@ python api_2phase.py
  python api_joint.py
  ```
 
- ### 📊 Analysis & Testing Tools
-
  ```bash
- # Comprehensive recommendation analysis
- python analyze_recommendations.py
- # → Generates recommendation_analysis_report.md + plots
-
- # Test individual engines
- python -m src.inference.enhanced_recommendation_engine_128d  # 128D enhanced
- python -m src.inference.enhanced_recommendation_engine       # Standard enhanced
- python -m src.inference.recommendation_engine                # Basic engine
-
- # Real user data utilities
  python -m src.utils.real_user_selector  # Demo real user extraction
-
- # Data processing utilities
- python -m src.preprocessing.data_loader
- python -m src.preprocessing.optimized_dataset_creator
  ```
 
  ### 🧪 Frontend Development
@@ -667,7 +663,8 @@ RecSys-HP/
  ✅ **Enhanced Architecture**: 128D embeddings, temperature scaling, contrastive learning
  ✅ **Curriculum Learning**: Progressive training for better convergence
  ✅ **Category-Aware Recommendations**: Intelligent personalization with diversity
- ✅ **Comprehensive Analysis**: Quality metrics and performance evaluation
  ✅ **Production Ready**: Scalable API with enhanced frontend features
 
  **🎉 Ready to deliver next-generation personalized recommendations!**
@@ -687,11 +684,12 @@ This project provides multiple training strategies:
  - **2-Phase API** (`api_2phase.py`) - Specialized for 2-phase training
  - **Joint API** (`api_joint.py`) - Optimized for joint training approach
 
- ## 📊 Analysis Tools
 
- - **Recommendation Analysis** (`analyze_recommendations.py`) - Quality metrics and evaluation
 
- ## 🔧 Development & Testing
 
  ### Frontend Development
  ```bash
1
  # Advanced Two-Tower Recommendation System
2
 
3
+ A production-ready recommendation system implementation using TensorFlow Recommenders with an enhanced two-tower architecture. This system provides personalized item recommendations through collaborative filtering, **aggregated history content-based filtering**, and hybrid approaches, featuring categorical demographics, curriculum learning, and advanced training strategies.
4
 
5
  ## 🎯 Project Overview
6
 
 
11
  - **🧠 Enhanced Two-Tower Architecture**: 128D embeddings with temperature scaling and contrastive learning
12
  - **πŸ“š Curriculum Learning**: Progressive training strategy for improved convergence
13
  - **⚑ Real-time Inference**: Sub-100ms recommendation serving with FAISS indexing
14
+ - **πŸ”„ Multi-strategy Recommendations**: Collaborative, **aggregated history content-based**, and hybrid approaches
15
  - **πŸŽͺ Category-Aware Boosting**: Enhanced personalization through user preference alignment
16
  - **πŸ” Interactive Similar Items**: Click-to-explore with 60/40 category-balanced discovery
17
  - **πŸ“Š Comprehensive Analysis**: Quality metrics and performance evaluation tools
 
70
  - **Stage 3**: Complex cases (long history) - 67th+ percentile
71
  - **Adaptive Learning Rates**: Decrease as stages progress for stability
72
 
73
+ ### 5. Aggregated History Content-Based Filtering πŸ”„
74
+ - **Revolutionary Approach**: Uses aggregated user interaction history instead of single-item similarity
75
+ - **Multiple Aggregation Methods**:
76
+ - **Weighted Mean**: Recent interactions weighted higher (exponential decay)
77
+ - **Simple Mean**: Equal weighting of all interactions
78
+ - **Max Pooling**: Element-wise maximum of embeddings
79
+ - **ANN Search**: Direct similarity search using FAISS with aggregated user profile
80
+ - **Enhanced Personalization**: Captures complete user preference profile, not just recent item
81
+ - **Category-Aware**: Analyzes user's full category distribution for balanced recommendations
82
+
83
+ ### 6. Category-Aware Recommendation Engine πŸŽͺ
84
  - **Enhanced Hybrid Recommendations**: Category boosting based on user preferences
85
  - **Category Alignment Analysis**: Measures personalization effectiveness
86
  - **Diversity Controls**: Balanced category representation in recommendations
87
+ - **Subcategory Precision**: 2-level category matching (e.g., "computers.components")
88
+ - **Comprehensive Analysis Tools**: Multi-algorithm comparison and alignment scoring
89
 
90
  ## πŸ“ Project Structure
91
 
 
145
  β”œβ”€β”€ πŸš€ Training Scripts # Multiple training approaches
146
  β”‚ β”œβ”€β”€ run_training_pipeline.py # Main training orchestration
147
  β”‚ β”œβ”€β”€ run_2phase_training.py # 2-phase training approach
148
+ β”‚ └── run_joint_training.py # Joint training approach
 
149
  β”‚
 
150
  └── πŸ“‹ requirements.txt # Python dependencies
151
  ```
152
 
 
487
  python api_joint.py
488
  ```
489
 
490
+ ### πŸ”§ System Testing
 
491
  ```bash
492
+ # Test core system components
 
 
 
 
 
 
 
 
 
493
  python -m src.utils.real_user_selector # Demo real user extraction
494
+ python -m src.preprocessing.data_loader # Verify data loading
 
 
 
495
  ```
496
 
497
  ### πŸ§ͺ Frontend Development
 
663
  βœ… **Enhanced Architecture**: 128D embeddings, temperature scaling, contrastive learning
664
  βœ… **Curriculum Learning**: Progressive training for better convergence
665
  βœ… **Category-Aware Recommendations**: Intelligent personalization with diversity
666
+ βœ… **Aggregated Content-Based Filtering**: Revolutionary user history aggregation approach
667
+ βœ… **Enhanced Demographic Support**: Improved cold-start user handling
668
  βœ… **Production Ready**: Scalable API with enhanced frontend features
669
 
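The cold-start path relies on demographics alone when a user has zero interactions. A toy encoder illustrates the idea — a sketch only, with made-up vocabularies and scaling; the real user tower learns embeddings for these seven fields rather than using hand-built one-hot vectors:

```python
import numpy as np

# Hypothetical vocabularies for the categorical demographic fields
GENDERS = ["male", "female"]
PROFESSIONS = ["Engineer", "Teacher", "Other"]
LOCATIONS = ["Urban", "Suburban", "Rural"]

def one_hot(value, vocab):
    vec = np.zeros(len(vocab))
    vec[vocab.index(value)] = 1.0
    return vec

def encode_demographics(age, gender, income, profession="Other", location="Urban"):
    """Fixed-length feature vector usable even with zero interactions (cold start)."""
    return np.concatenate([
        [age / 100.0],               # normalized age
        one_hot(gender, GENDERS),
        [np.log1p(income) / 12.0],   # log-scaled income
        one_hot(profession, PROFESSIONS),
        one_hot(location, LOCATIONS),
    ])

v = encode_demographics(32, "male", 75000)
print(v.shape)
```

Because the vector is defined entirely by profile fields, a brand-new user still gets a usable query for the item index before their first click.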
670
  **πŸŽ‰ Ready to deliver next-generation personalized recommendations!**
 
684
  - **2-Phase API** (`api_2phase.py`) - Specialized for 2-phase training
685
  - **Joint API** (`api_joint.py`) - Optimized for joint training approach
686
 
687
+ ## πŸ”§ Development Tools
688
 
689
+ - **Real User Selection** (`src.utils.real_user_selector`) - Extract real user profiles for testing
690
+ - **Data Loading Utilities** (`src.preprocessing.data_loader`) - Dataset loading and validation
691
 
692
+ ## πŸ§ͺ Development & Testing
693
 
694
  ### Frontend Development
695
  ```bash
analyze_recommendations.py DELETED
@@ -1,543 +0,0 @@
1
- #!/usr/bin/env python3
2
- """
3
- Recommendation Analysis Script
4
-
5
- This script compares recommendations from both training approaches:
6
- 1. 2-phase training (pre-trained item tower + joint fine-tuning)
7
- 2. Single joint training (end-to-end optimization)
8
-
9
- It analyzes:
10
- - Category alignment between user interactions and recommendations
11
- - Diversity of recommended categories
12
- - Overlap between the two approaches
13
- - Performance on real users
14
-
15
- Usage:
16
- python analyze_recommendations.py
17
- """
18
-
19
- import os
20
- import sys
21
- import numpy as np
22
- import pandas as pd
23
- from collections import defaultdict, Counter
24
- from typing import Dict, List, Tuple
25
- import matplotlib.pyplot as plt
26
- import seaborn as sns
27
-
28
- # Add src to path
29
- sys.path.append('src')
30
-
31
- from src.inference.recommendation_engine import RecommendationEngine
32
- from src.utils.real_user_selector import RealUserSelector
33
-
34
-
35
- class RecommendationAnalyzer:
36
- """Analyzer for comparing different recommendation approaches."""
37
-
38
- def __init__(self):
39
- self.recommendation_engine = None
40
- self.real_user_selector = None
41
- self.items_df = None
42
- self.setup_engines()
43
-
44
- def setup_engines(self):
45
- """Setup recommendation engines and data."""
46
- print("Loading recommendation engines...")
47
-
48
- try:
49
- # Load recommendation engine (assumes trained model artifacts exist)
50
- self.recommendation_engine = RecommendationEngine()
51
- print("βœ… Recommendation engine loaded")
52
- except Exception as e:
53
- print(f"❌ Error loading recommendation engine: {e}")
54
- return
55
-
56
- try:
57
- # Load real user selector
58
- self.real_user_selector = RealUserSelector()
59
- print("βœ… Real user selector loaded")
60
- except Exception as e:
61
- print(f"❌ Error loading real user selector: {e}")
62
-
63
- # Load items data for category analysis
64
- self.items_df = pd.read_csv("datasets/items.csv")
65
- print(f"βœ… Loaded {len(self.items_df)} items")
66
-
67
- def get_item_categories(self, item_ids: List[int]) -> List[str]:
68
- """Get category codes for given item IDs."""
69
- categories = []
70
- for item_id in item_ids:
71
- item_row = self.items_df[self.items_df['product_id'] == item_id]
72
- if len(item_row) > 0:
73
- categories.append(item_row.iloc[0]['category_code'])
74
- else:
75
- categories.append('unknown')
76
- return categories
77
-
78
- def analyze_user_recommendations(self,
79
- user_profile: Dict,
80
- recommendation_types: List[str] = None) -> Dict:
81
- """Analyze recommendations for a single user across different approaches."""
82
-
83
- if recommendation_types is None:
84
- recommendation_types = ['collaborative', 'hybrid', 'content']
85
-
86
- results = {
87
- 'user_profile': user_profile,
88
- 'interaction_categories': [],
89
- 'recommendations': {},
90
- 'category_analysis': {}
91
- }
92
-
93
- # Get categories from user's interaction history
94
- if user_profile['interaction_history']:
95
- results['interaction_categories'] = self.get_item_categories(
96
- user_profile['interaction_history']
97
- )
98
-
99
- # Get recommendations for each type
100
- for rec_type in recommendation_types:
101
- try:
102
- if rec_type == 'collaborative':
103
- recs = self.recommendation_engine.recommend_items_collaborative(
104
- age=user_profile['age'],
105
- gender=user_profile['gender'],
106
- income=user_profile['income'],
107
- interaction_history=user_profile['interaction_history'],
108
- k=10
109
- )
110
- elif rec_type == 'hybrid':
111
- recs = self.recommendation_engine.recommend_items_hybrid(
112
- age=user_profile['age'],
113
- gender=user_profile['gender'],
114
- income=user_profile['income'],
115
- interaction_history=user_profile['interaction_history'],
116
- k=10
117
- )
118
- elif rec_type == 'content' and user_profile['interaction_history']:
119
- recs = self.recommendation_engine.recommend_items_content_based(
120
- seed_item_id=user_profile['interaction_history'][-1],
121
- k=10
122
- )
123
- else:
124
- continue
125
-
126
- # Extract item IDs and categories
127
- item_ids = [item_id for item_id, score, info in recs]
128
- rec_categories = self.get_item_categories(item_ids)
129
-
130
- results['recommendations'][rec_type] = {
131
- 'items': recs,
132
- 'item_ids': item_ids,
133
- 'categories': rec_categories,
134
- 'scores': [score for item_id, score, info in recs]
135
- }
136
-
137
- # Analyze category alignment
138
- results['category_analysis'][rec_type] = self.analyze_category_alignment(
139
- results['interaction_categories'],
140
- rec_categories
141
- )
142
-
143
- except Exception as e:
144
- print(f"Error generating {rec_type} recommendations: {e}")
145
-
146
- return results
147
-
148
- def analyze_category_alignment(self,
149
- interaction_categories: List[str],
150
- recommendation_categories: List[str]) -> Dict:
151
- """Analyze alignment between interaction and recommendation categories."""
152
-
153
- if not interaction_categories:
154
- return {
155
- 'overlap_ratio': 0.0,
156
- 'unique_interaction_categories': 0,
157
- 'unique_recommendation_categories': len(set(recommendation_categories)),
158
- 'common_categories': [],
159
- 'category_distribution': Counter(recommendation_categories)
160
- }
161
-
162
- interaction_set = set(interaction_categories)
163
- recommendation_set = set(recommendation_categories)
164
-
165
- common_categories = interaction_set.intersection(recommendation_set)
166
- overlap_ratio = len(common_categories) / len(interaction_set) if interaction_set else 0.0
167
-
168
- return {
169
- 'overlap_ratio': overlap_ratio,
170
- 'unique_interaction_categories': len(interaction_set),
171
- 'unique_recommendation_categories': len(recommendation_set),
172
- 'common_categories': list(common_categories),
173
- 'category_distribution': Counter(recommendation_categories),
174
- 'interaction_category_distribution': Counter(interaction_categories)
175
- }
176
-
177
- def compare_recommendation_approaches(self,
178
- users_sample: List[Dict],
179
- approaches: List[str] = None) -> Dict:
180
- """Compare different recommendation approaches across multiple users."""
181
-
182
- if approaches is None:
183
- approaches = ['collaborative', 'hybrid', 'content']
184
-
185
- comparison_results = {
186
- 'approach_stats': defaultdict(list),
187
- 'cross_approach_analysis': {},
188
- 'user_results': []
189
- }
190
-
191
- print(f"Analyzing {len(users_sample)} users across {len(approaches)} approaches...")
192
-
193
- for i, user in enumerate(users_sample):
194
- print(f"Analyzing user {i+1}/{len(users_sample)}...")
195
-
196
- user_results = self.analyze_user_recommendations(user, approaches)
197
- comparison_results['user_results'].append(user_results)
198
-
199
- # Aggregate stats by approach
200
- for approach in approaches:
201
- if approach in user_results['category_analysis']:
202
- analysis = user_results['category_analysis'][approach]
203
- comparison_results['approach_stats'][approach].append({
204
- 'overlap_ratio': analysis['overlap_ratio'],
205
- 'unique_rec_categories': analysis['unique_recommendation_categories'],
206
- 'common_categories_count': len(analysis['common_categories'])
207
- })
208
-
209
- # Calculate aggregate statistics
210
- for approach in approaches:
211
- stats = comparison_results['approach_stats'][approach]
212
- if stats:
213
- comparison_results['approach_stats'][approach] = {
214
- 'avg_overlap_ratio': np.mean([s['overlap_ratio'] for s in stats]),
215
- 'std_overlap_ratio': np.std([s['overlap_ratio'] for s in stats]),
216
- 'avg_unique_categories': np.mean([s['unique_rec_categories'] for s in stats]),
217
- 'avg_common_categories': np.mean([s['common_categories_count'] for s in stats]),
218
- 'total_users': len(stats)
219
- }
220
-
221
- # Cross-approach analysis
222
- comparison_results['cross_approach_analysis'] = self.cross_approach_analysis(
223
- comparison_results['user_results'], approaches
224
- )
225
-
226
- return comparison_results
227
-
228
- def cross_approach_analysis(self, user_results: List[Dict], approaches: List[str]) -> Dict:
229
- """Analyze similarities and differences between approaches."""
230
-
231
- cross_analysis = {
232
- 'item_overlap': defaultdict(dict),
233
- 'category_overlap': defaultdict(dict),
234
- 'score_correlation': defaultdict(dict)
235
- }
236
-
237
- for user_result in user_results:
238
- recommendations = user_result['recommendations']
239
-
240
- # Compare each pair of approaches
241
- for i, approach1 in enumerate(approaches):
242
- for approach2 in approaches[i+1:]:
243
- if approach1 in recommendations and approach2 in recommendations:
244
-
245
- # Item overlap
246
- items1 = set(recommendations[approach1]['item_ids'])
247
- items2 = set(recommendations[approach2]['item_ids'])
248
- item_overlap_ratio = len(items1.intersection(items2)) / len(items1.union(items2))
249
-
250
- # Category overlap
251
- cats1 = set(recommendations[approach1]['categories'])
252
- cats2 = set(recommendations[approach2]['categories'])
253
- cat_overlap_ratio = len(cats1.intersection(cats2)) / len(cats1.union(cats2)) if cats1.union(cats2) else 0
254
-
255
- # Store results
256
- pair_key = f"{approach1}_vs_{approach2}"
257
- if pair_key not in cross_analysis['item_overlap']:
258
- cross_analysis['item_overlap'][pair_key] = []
259
- cross_analysis['category_overlap'][pair_key] = []
260
-
261
- cross_analysis['item_overlap'][pair_key].append(item_overlap_ratio)
262
- cross_analysis['category_overlap'][pair_key].append(cat_overlap_ratio)
263
-
264
- # Calculate averages
265
- for pair_key in cross_analysis['item_overlap']:
266
- cross_analysis['item_overlap'][pair_key] = {
267
- 'avg': np.mean(cross_analysis['item_overlap'][pair_key]),
268
- 'std': np.std(cross_analysis['item_overlap'][pair_key])
269
- }
270
- cross_analysis['category_overlap'][pair_key] = {
271
- 'avg': np.mean(cross_analysis['category_overlap'][pair_key]),
272
- 'std': np.std(cross_analysis['category_overlap'][pair_key])
273
- }
274
-
275
- return cross_analysis
276
-
277
- def generate_report(self, comparison_results: Dict, output_file: str = "recommendation_analysis_report.md"):
278
- """Generate a comprehensive analysis report."""
279
-
280
- report = []
281
- report.append("# Recommendation System Analysis Report")
282
- report.append(f"Generated: {pd.Timestamp.now()}")
283
- report.append("")
284
-
285
- # Overall Statistics
286
- report.append("## Overall Statistics")
287
- report.append("")
288
-
289
- for approach, stats in comparison_results['approach_stats'].items():
290
- if isinstance(stats, dict):
291
- report.append(f"### {approach.title()} Recommendations")
292
- report.append(f"- **Average Category Overlap**: {stats['avg_overlap_ratio']:.3f} Β± {stats['std_overlap_ratio']:.3f}")
293
- report.append(f"- **Average Unique Categories per User**: {stats['avg_unique_categories']:.1f}")
294
- report.append(f"- **Average Common Categories**: {stats['avg_common_categories']:.1f}")
295
- report.append(f"- **Users Analyzed**: {stats['total_users']}")
296
- report.append("")
297
-
298
- # Cross-Approach Analysis
299
- report.append("## Cross-Approach Comparison")
300
- report.append("")
301
-
302
- cross_analysis = comparison_results['cross_approach_analysis']
303
-
304
- report.append("### Item Overlap Between Approaches")
305
- for pair, overlap_stats in cross_analysis['item_overlap'].items():
306
- report.append(f"- **{pair.replace('_', ' ').title()}**: {overlap_stats['avg']:.3f} Β± {overlap_stats['std']:.3f}")
307
- report.append("")
308
-
309
- report.append("### Category Overlap Between Approaches")
310
- for pair, overlap_stats in cross_analysis['category_overlap'].items():
311
- report.append(f"- **{pair.replace('_', ' ').title()}**: {overlap_stats['avg']:.3f} Β± {overlap_stats['std']:.3f}")
312
- report.append("")
313
-
314
- # Category Alignment Analysis
315
- report.append("## Category Alignment Analysis")
316
- report.append("")
317
- report.append("Category alignment measures how well recommendations match the categories")
318
- report.append("of items users have previously interacted with.")
319
- report.append("")
320
-
321
- # Find best performing approach
322
- best_approach = max(
323
- comparison_results['approach_stats'].keys(),
324
- key=lambda k: comparison_results['approach_stats'][k]['avg_overlap_ratio']
325
- if isinstance(comparison_results['approach_stats'][k], dict) else 0
326
- )
327
-
328
- report.append(f"**Best Category Alignment**: {best_approach.title()} approach")
329
- report.append("")
330
-
331
- # Recommendations
332
- report.append("## Key Findings & Recommendations")
333
- report.append("")
334
-
335
- # Analyze overlap ratios to provide insights
336
- overlap_ratios = {
337
- k: v['avg_overlap_ratio'] for k, v in comparison_results['approach_stats'].items()
338
- if isinstance(v, dict)
339
- }
340
-
341
- if overlap_ratios:
342
- avg_overlap = np.mean(list(overlap_ratios.values()))
343
- if avg_overlap > 0.5:
344
- report.append("βœ… **Strong Category Alignment**: Recommendations show good alignment with user interaction patterns.")
345
- elif avg_overlap > 0.3:
346
- report.append("⚠️ **Moderate Category Alignment**: Some alignment present but room for improvement.")
347
- else:
348
- report.append("❌ **Weak Category Alignment**: Recommendations may be too diverse or not well-aligned with user preferences.")
349
-
350
- report.append("")
351
-
352
- # Compare approaches
353
- if len(overlap_ratios) > 1:
354
- sorted_approaches = sorted(overlap_ratios.items(), key=lambda x: x[1], reverse=True)
355
- report.append("### Approach Rankings (by category alignment):")
356
- for i, (approach, ratio) in enumerate(sorted_approaches, 1):
357
- report.append(f"{i}. **{approach.title()}**: {ratio:.3f}")
358
- report.append("")
359
-
360
- # Write report
361
- with open(output_file, 'w') as f:
362
- f.write('\n'.join(report))
363
-
364
- print(f"βœ… Analysis report saved to: {output_file}")
365
- return '\n'.join(report)
366
-
367
- def visualize_results(self, comparison_results: Dict, save_plots: bool = True):
368
- """Create visualizations for the analysis results."""
369
-
370
- # Set up plotting style
371
- plt.style.use('default')
372
- sns.set_palette("husl")
373
-
374
- # Create figure with subplots
375
- fig, axes = plt.subplots(2, 2, figsize=(15, 12))
376
- fig.suptitle('Recommendation System Analysis', fontsize=16, fontweight='bold')
377
-
378
- # 1. Category Overlap by Approach
379
- ax1 = axes[0, 0]
380
- approaches = []
381
- overlap_means = []
382
- overlap_stds = []
383
-
384
- for approach, stats in comparison_results['approach_stats'].items():
385
- if isinstance(stats, dict):
386
- approaches.append(approach.title())
387
- overlap_means.append(stats['avg_overlap_ratio'])
388
- overlap_stds.append(stats['std_overlap_ratio'])
389
-
390
- bars1 = ax1.bar(approaches, overlap_means, yerr=overlap_stds, capsize=5, alpha=0.7)
391
- ax1.set_title('Average Category Overlap by Approach')
392
- ax1.set_ylabel('Category Overlap Ratio')
393
- ax1.set_ylim(0, 1)
394
-
395
- # Add value labels on bars
396
- for bar, mean in zip(bars1, overlap_means):
397
- height = bar.get_height()
398
- ax1.text(bar.get_x() + bar.get_width()/2., height + 0.01,
399
- f'{mean:.3f}', ha='center', va='bottom')
400
-
401
- # 2. Cross-Approach Item Overlap
402
- ax2 = axes[0, 1]
403
- cross_analysis = comparison_results['cross_approach_analysis']
404
-
405
- pair_names = []
406
- item_overlaps = []
407
-
408
- for pair, overlap_stats in cross_analysis['item_overlap'].items():
409
- pair_names.append(pair.replace('_vs_', ' vs ').title())
410
- item_overlaps.append(overlap_stats['avg'])
411
-
412
- if pair_names:
413
- bars2 = ax2.bar(pair_names, item_overlaps, alpha=0.7, color='coral')
414
- ax2.set_title('Item Overlap Between Approaches')
415
- ax2.set_ylabel('Item Overlap Ratio')
416
- ax2.set_ylim(0, 1)
417
- plt.setp(ax2.get_xticklabels(), rotation=45, ha='right')
418
-
419
- # Add value labels
420
- for bar, overlap in zip(bars2, item_overlaps):
421
- height = bar.get_height()
422
- ax2.text(bar.get_x() + bar.get_width()/2., height + 0.01,
423
- f'{overlap:.3f}', ha='center', va='bottom')
424
-
425
- # 3. Category Diversity
426
- ax3 = axes[1, 0]
427
- unique_categories = []
428
- for approach, stats in comparison_results['approach_stats'].items():
429
- if isinstance(stats, dict):
430
- unique_categories.append(stats['avg_unique_categories'])
431
-
432
- bars3 = ax3.bar(approaches, unique_categories, alpha=0.7, color='lightgreen')
433
- ax3.set_title('Average Unique Categories per Recommendation')
434
- ax3.set_ylabel('Number of Unique Categories')
435
-
436
- for bar, cats in zip(bars3, unique_categories):
437
- height = bar.get_height()
438
- ax3.text(bar.get_x() + bar.get_width()/2., height + 0.1,
439
- f'{cats:.1f}', ha='center', va='bottom')
440
-
441
- # 4. Category vs Item Overlap Comparison
442
- ax4 = axes[1, 1]
443
-
444
- if cross_analysis['item_overlap'] and cross_analysis['category_overlap']:
445
- pairs = list(cross_analysis['item_overlap'].keys())
446
- item_overlaps = [cross_analysis['item_overlap'][p]['avg'] for p in pairs]
447
- cat_overlaps = [cross_analysis['category_overlap'][p]['avg'] for p in pairs]
448
-
449
- x = np.arange(len(pairs))
450
- width = 0.35
451
-
452
- bars4a = ax4.bar(x - width/2, item_overlaps, width, label='Item Overlap', alpha=0.7)
453
- bars4b = ax4.bar(x + width/2, cat_overlaps, width, label='Category Overlap', alpha=0.7)
454
-
455
- ax4.set_title('Item vs Category Overlap Between Approaches')
456
- ax4.set_ylabel('Overlap Ratio')
457
- ax4.set_xticks(x)
458
- ax4.set_xticklabels([p.replace('_vs_', ' vs ') for p in pairs], rotation=45, ha='right')
459
- ax4.legend()
460
- ax4.set_ylim(0, 1)
461
-
462
- plt.tight_layout()
463
-
464
- if save_plots:
465
- plt.savefig('recommendation_analysis_plots.png', dpi=300, bbox_inches='tight')
466
- print("βœ… Plots saved to: recommendation_analysis_plots.png")
467
-
468
- plt.show()
469
-
470
-
471
- def main():
472
- """Main function to run the recommendation analysis."""
473
-
474
- print("πŸ” Starting Recommendation Analysis...")
475
- print("=" * 50)
476
-
477
- # Initialize analyzer
478
- analyzer = RecommendationAnalyzer()
479
-
480
- if analyzer.recommendation_engine is None:
481
- print("❌ Cannot proceed without recommendation engine. Please ensure model is trained.")
482
- return
483
-
484
- # Get sample of real users for analysis
485
- print("Getting real user sample...")
486
- try:
487
- real_users = analyzer.real_user_selector.get_real_users(n=20, min_interactions=3)
488
- print(f"βœ… Loaded {len(real_users)} real users for analysis")
489
- except Exception as e:
490
- print(f"❌ Error loading real users: {e}")
491
- # Fallback to synthetic users
492
- real_users = [
493
- {
494
- 'age': 32, 'gender': 'male', 'income': 75000,
495
- 'interaction_history': [1000978, 1001588, 1001618, 1002039]
496
- },
497
- {
498
- 'age': 28, 'gender': 'female', 'income': 45000,
499
- 'interaction_history': [1003456, 1004567, 1005678]
500
- },
501
- {
502
- 'age': 45, 'gender': 'male', 'income': 85000,
503
- 'interaction_history': [1006789, 1007890, 1008901, 1009012, 1010123]
504
- }
505
- ]
506
- print(f"Using {len(real_users)} synthetic users for analysis")
507
-
508
- # Run comprehensive analysis
509
- print("Running recommendation analysis...")
510
- approaches = ['collaborative', 'hybrid', 'content']
511
-
512
- comparison_results = analyzer.compare_recommendation_approaches(
513
- users_sample=real_users,
514
- approaches=approaches
515
- )
516
-
517
- # Generate report
518
- print("Generating analysis report...")
519
- report = analyzer.generate_report(comparison_results)
520
-
521
- # Create visualizations
522
- print("Creating visualizations...")
523
- try:
524
- analyzer.visualize_results(comparison_results, save_plots=True)
525
- except Exception as e:
526
- print(f"Warning: Could not create visualizations: {e}")
527
-
528
- # Print summary
529
- print("\n" + "=" * 50)
530
- print("πŸ“Š ANALYSIS SUMMARY")
531
- print("=" * 50)
532
-
533
- for approach, stats in comparison_results['approach_stats'].items():
534
- if isinstance(stats, dict):
535
- print(f"{approach.title()}: {stats['avg_overlap_ratio']:.3f} avg category overlap")
536
-
537
- print(f"\nβœ… Analysis complete! Check:")
538
- print(" πŸ“„ recommendation_analysis_report.md")
539
- print(" πŸ“Š recommendation_analysis_plots.png")
540
-
541
-
542
- if __name__ == "__main__":
543
- main()
api/main.py CHANGED
@@ -33,7 +33,6 @@ app.add_middleware(
33
 
34
  # Global instances
35
  recommendation_engine = None
36
- enhanced_recommendation_engine = None
37
  real_user_selector = None
38
 
39
 
@@ -75,13 +74,17 @@ class UserProfile(BaseModel):
75
  age: int
76
  gender: str # "male" or "female"
77
  income: float
 
 
 
 
78
  interaction_history: Optional[List[int]] = []
79
 
80
 
81
  class RecommendationRequest(BaseModel):
82
  user_profile: UserProfile
83
  num_recommendations: int = 10
84
- recommendation_type: str = "hybrid" # "collaborative", "content", "hybrid", "enhanced", "enhanced_128d", "category_focused"
85
  collaborative_weight: Optional[float] = 0.7
86
  category_boost: Optional[float] = 1.5 # For enhanced recommendations
87
  enable_category_boost: Optional[bool] = True
@@ -132,6 +135,10 @@ class RealUserProfile(BaseModel):
132
  age: int
133
  gender: str
134
  income: int
 
 
 
 
135
  interaction_history: List[int]
136
  interaction_stats: Dict[str, int]
137
  interaction_pattern: str
@@ -169,39 +176,24 @@ class EnrichedBehavioralPatternsResponse(BaseModel):
169
 
170
  @app.on_event("startup")
171
  async def startup_event():
172
- """Initialize the recommendation engines and real user selector on startup."""
173
- global recommendation_engine, enhanced_recommendation_engine, real_user_selector
174
 
175
  try:
176
- print("Loading recommendation engine...")
177
  recommendation_engine = RecommendationEngine()
178
- print("Recommendation engine loaded successfully!")
 
179
  except Exception as e:
180
- print(f"Error loading recommendation engine: {e}")
181
  recommendation_engine = None
182
 
183
- try:
184
- print("Loading enhanced recommendation engine...")
185
- # Try enhanced 128D engine first, fallback to regular enhanced
186
- try:
187
- from src.inference.enhanced_recommendation_engine_128d import Enhanced128DRecommendationEngine
188
- enhanced_recommendation_engine = Enhanced128DRecommendationEngine()
189
- print("βœ… Using Enhanced 128D Recommendation Engine")
190
- except:
191
- from src.inference.enhanced_recommendation_engine import EnhancedRecommendationEngine
192
- enhanced_recommendation_engine = EnhancedRecommendationEngine()
193
- print("⚠️ Using fallback Enhanced Recommendation Engine")
194
- print("Enhanced recommendation engine loaded successfully!")
195
- except Exception as e:
196
- print(f"Error loading enhanced recommendation engine: {e}")
197
- enhanced_recommendation_engine = None
198
-
199
  try:
200
  print("Loading real user selector...")
201
  real_user_selector = RealUserSelector()
202
- print("Real user selector loaded successfully!")
203
  except Exception as e:
204
- print(f"Error loading real user selector: {e}")
205
  real_user_selector = None
206
 
207
 
@@ -211,7 +203,12 @@ async def root():
211
  return {
212
  "message": "Two-Tower Recommendation API",
213
  "version": "1.0.0",
214
- "status": "active" if recommendation_engine is not None else "initialization_failed"
 
 
 
 
 
215
  }
216
 
217
 
@@ -220,7 +217,13 @@ async def health_check():
220
  """Health check endpoint."""
221
  return {
222
  "status": "healthy" if recommendation_engine is not None else "unhealthy",
223
- "engine_loaded": recommendation_engine is not None
 
 
 
 
 
 
224
  }
225
 
226
 
@@ -378,6 +381,10 @@ async def get_recommendations(request: RecommendationRequest):
378
  age=user_profile.age,
379
  gender=user_profile.gender,
380
  income=user_profile.income,
 
 
 
 
381
  interaction_history=filtered_interaction_history,
382
  k=request.num_recommendations * 2 # Get more to allow for filtering
383
  )
@@ -390,11 +397,11 @@ async def get_recommendations(request: RecommendationRequest):
390
  (f" in category '{request.selected_category}'" if request.selected_category else "")
391
  )
392
 
393
- # Use most recent interaction as seed
394
- seed_item = filtered_interaction_history[-1]
395
- recommendations = recommendation_engine.recommend_items_content_based(
396
- seed_item_id=seed_item,
397
- k=request.num_recommendations * 2 # Get more to allow for filtering
398
  )
399
 
400
  elif request.recommendation_type == "hybrid":
@@ -402,79 +409,40 @@ async def get_recommendations(request: RecommendationRequest):
402
  age=user_profile.age,
403
  gender=user_profile.gender,
404
  income=user_profile.income,
 
 
 
 
405
  interaction_history=filtered_interaction_history,
406
  k=request.num_recommendations * 2, # Get more to allow for filtering
407
  collaborative_weight=request.collaborative_weight
408
  )
409
 
410
- elif request.recommendation_type == "enhanced":
411
- if enhanced_recommendation_engine is None:
412
- raise HTTPException(status_code=503, detail="Enhanced recommendation engine not available")
413
-
414
- # Check if it's the 128D engine or fallback
415
- if hasattr(enhanced_recommendation_engine, 'recommend_items_enhanced'):
416
- # 128D Enhanced engine
417
- recommendations = enhanced_recommendation_engine.recommend_items_enhanced(
418
- age=user_profile.age,
419
- gender=user_profile.gender,
420
- income=user_profile.income,
421
- interaction_history=filtered_interaction_history,
422
- k=request.num_recommendations * 2, # Get more to allow for filtering
423
- diversity_weight=0.3 if request.enable_diversity else 0.0,
424
- category_boost=request.category_boost if request.enable_category_boost else 1.0
425
- )
426
- else:
427
- # Fallback enhanced engine
428
- recommendations = enhanced_recommendation_engine.recommend_items_enhanced_hybrid(
429
- age=user_profile.age,
430
- gender=user_profile.gender,
431
- income=user_profile.income,
432
- interaction_history=filtered_interaction_history,
433
- k=request.num_recommendations * 2, # Get more to allow for filtering
434
- collaborative_weight=request.collaborative_weight,
435
- category_boost=request.category_boost,
436
- enable_category_boost=request.enable_category_boost,
437
- enable_diversity=request.enable_diversity
438
- )
439
-
440
- elif request.recommendation_type == "enhanced_128d":
441
- if enhanced_recommendation_engine is None or not hasattr(enhanced_recommendation_engine, 'recommend_items_enhanced'):
442
- raise HTTPException(status_code=503, detail="Enhanced 128D recommendation engine not available")
443
-
444
- recommendations = enhanced_recommendation_engine.recommend_items_enhanced(
445
  age=user_profile.age,
446
  gender=user_profile.gender,
447
  income=user_profile.income,
 
 
 
 
448
  interaction_history=filtered_interaction_history,
449
- k=request.num_recommendations * 2, # Get more to allow for filtering
450
- diversity_weight=0.3 if request.enable_diversity else 0.0,
451
- category_boost=request.category_boost if request.enable_category_boost else 1.0
452
- )
453
-
454
- elif request.recommendation_type == "category_focused":
455
- if enhanced_recommendation_engine is None:
456
- raise HTTPException(status_code=503, detail="Enhanced recommendation engine not available")
457
-
458
- recommendations = enhanced_recommendation_engine.recommend_items_category_focused(
459
- age=user_profile.age,
460
- gender=user_profile.gender,
461
- income=user_profile.income,
462
- interaction_history=filtered_interaction_history,
463
- k=request.num_recommendations * 2, # Get more to allow for filtering
464
- focus_percentage=0.8
465
  )
466
 
467
  else:
468
  raise HTTPException(
469
  status_code=400,
470
- detail="Invalid recommendation_type. Must be 'collaborative', 'content', 'hybrid', 'enhanced', 'enhanced_128d', or 'category_focused'"
471
  )
472
 
473
  # Apply category filtering to final recommendations if needed
474
  if request.selected_category:
475
  recommendations = filter_recommendations_by_category(recommendations, request.selected_category)
476
- # Limit to requested number after filtering
477
- recommendations = recommendations[:request.num_recommendations]
 
478
 
479
  # Format response
480
  formatted_recommendations = []
@@ -543,8 +511,12 @@ async def predict_user_item_rating(request: RatingPredictionRequest):
543
  age=user_profile.age,
544
  gender=user_profile.gender,
545
  income=user_profile.income,
546
- interaction_history=user_profile.interaction_history,
547
- item_id=request.item_id
 
 
 
 
548
  )
549
 
550
  item_info = recommendation_engine._get_item_info(request.item_id)
 
33
 
34
  # Global instances
35
  recommendation_engine = None
 
36
  real_user_selector = None
37
 
38
 
 
74
  age: int
75
  gender: str # "male" or "female"
76
  income: float
77
+ profession: Optional[str] = "Other"
78
+ location: Optional[str] = "Urban"
79
+ education_level: Optional[str] = "High School"
80
+ marital_status: Optional[str] = "Single"
81
  interaction_history: Optional[List[int]] = []
82
 
83
 
84
  class RecommendationRequest(BaseModel):
85
  user_profile: UserProfile
86
  num_recommendations: int = 10
87
+ recommendation_type: str = "hybrid" # "collaborative", "content" (aggregated history), "hybrid"
88
  collaborative_weight: Optional[float] = 0.7
89
  category_boost: Optional[float] = 1.5 # For enhanced recommendations
90
  enable_category_boost: Optional[bool] = True
 
135
  age: int
136
  gender: str
137
  income: int
138
+ profession: Optional[str] = None
139
+ location: Optional[str] = None
140
+ education_level: Optional[str] = None
141
+ marital_status: Optional[str] = None
142
  interaction_history: List[int]
143
  interaction_stats: Dict[str, int]
144
  interaction_pattern: str
 

 @app.on_event("startup")
 async def startup_event():
+    """Initialize the recommendation engine and real user selector on startup."""
+    global recommendation_engine, real_user_selector

     try:
+        print("Loading recommendation engine with enhanced demographics...")
         recommendation_engine = RecommendationEngine()
+        print("βœ… Recommendation engine loaded successfully!")
+        print(" Supports 7 demographic features: age, gender, income, profession, location, education, marital_status")
     except Exception as e:
+        print(f"❌ Error loading recommendation engine: {e}")
         recommendation_engine = None

     try:
         print("Loading real user selector...")
         real_user_selector = RealUserSelector()
+        print("βœ… Real user selector loaded successfully!")
     except Exception as e:
+        print(f"❌ Error loading real user selector: {e}")
         real_user_selector = None
 
     return {
         "message": "Two-Tower Recommendation API",
         "version": "1.0.0",
+        "status": "active" if recommendation_engine is not None else "initialization_failed",
+        "enhanced_demographics": True,
+        "supported_demographics": [
+            "age", "gender", "income", "profession",
+            "location", "education_level", "marital_status"
+        ]
     }


     """Health check endpoint."""
     return {
         "status": "healthy" if recommendation_engine is not None else "unhealthy",
+        "engine_loaded": recommendation_engine is not None,
+        "enhanced_demographics": True,
+        "demographic_features": 7,
+        "supported_demographics": [
+            "age", "gender", "income", "profession",
+            "location", "education_level", "marital_status"
+        ]
     }
 
             age=user_profile.age,
             gender=user_profile.gender,
             income=user_profile.income,
+            profession=user_profile.profession or "Other",
+            location=user_profile.location or "Urban",
+            education_level=user_profile.education_level or "High School",
+            marital_status=user_profile.marital_status or "Single",
             interaction_history=filtered_interaction_history,
             k=request.num_recommendations * 2  # Get more to allow for filtering
         )

             (f" in category '{request.selected_category}'" if request.selected_category else "")
         )

+        # Use aggregated interaction history for content-based recommendations
+        recommendations = recommendation_engine.recommend_items_content_based_from_history(
+            interaction_history=filtered_interaction_history,
+            k=request.num_recommendations * 2,  # Get more to allow for filtering
+            aggregation_method="weighted_mean"
         )
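The new `recommend_items_content_based_from_history` call aggregates the whole interaction history into a single query instead of seeding from the most recent item. The engine internals are not shown in this diff; a rough sketch of what a `weighted_mean` aggregation over item embeddings could look like, assuming a linear recency weighting (an assumption, not the confirmed implementation):

```python
def weighted_mean_embedding(history_embeddings):
    """Aggregate per-item embeddings with linearly increasing recency weights.

    history_embeddings: list of equal-length float vectors, oldest first.
    Later items get higher weight, so recent interactions dominate the
    aggregated query vector. Illustrative only; the real
    aggregation_method="weighted_mean" may differ.
    """
    n = len(history_embeddings)
    dim = len(history_embeddings[0])
    weights = [i + 1 for i in range(n)]  # 1, 2, ..., n (most recent heaviest)
    total = sum(weights)
    agg = [0.0] * dim
    for w, emb in zip(weights, history_embeddings):
        for d in range(dim):
            agg[d] += w * emb[d] / total
    return agg

# Two-item history: the second (more recent) item gets twice the weight.
print(weighted_mean_embedding([[1.0, 0.0], [0.0, 1.0]]))
```

The aggregated vector would then be used as the query against the FAISS item index, exactly as a single seed-item embedding would be.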

         elif request.recommendation_type == "hybrid":

             age=user_profile.age,
             gender=user_profile.gender,
             income=user_profile.income,
+            profession=user_profile.profession or "Other",
+            location=user_profile.location or "Urban",
+            education_level=user_profile.education_level or "High School",
+            marital_status=user_profile.marital_status or "Single",
             interaction_history=filtered_interaction_history,
             k=request.num_recommendations * 2,  # Get more to allow for filtering
             collaborative_weight=request.collaborative_weight
         )

+        elif request.recommendation_type == "category_boosted":
+            recommendations = recommendation_engine.recommend_items_category_boosted(
             age=user_profile.age,
             gender=user_profile.gender,
             income=user_profile.income,
+            profession=user_profile.profession or "Other",
+            location=user_profile.location or "Urban",
+            education_level=user_profile.education_level or "High School",
+            marital_status=user_profile.marital_status or "Single",
             interaction_history=filtered_interaction_history,
+            k=request.num_recommendations * 2  # Get more to allow for filtering
         )

         else:
             raise HTTPException(
                 status_code=400,
+                detail="Invalid recommendation_type. Must be 'collaborative', 'content', 'hybrid', or 'category_boosted'"
             )

         # Apply category filtering to final recommendations if needed
         if request.selected_category:
             recommendations = filter_recommendations_by_category(recommendations, request.selected_category)
+
+        # Always limit to requested number of recommendations
+        recommendations = recommendations[:request.num_recommendations]
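The over-fetch / filter / truncate pattern introduced here (request `2 * k` candidates, filter by category if one is selected, then always slice back to `k`) can be sketched in isolation; the tuple layout below is an assumption for illustration, not the engine's actual return type:

```python
def select_recommendations(candidates, k, category=None):
    """Over-fetch, optionally filter by category, then cap at k.

    candidates: ranked list of (item_id, score, category) tuples.
    Mirrors the endpoint logic above: filtering happens first, and the
    final list is always limited to the requested count, so callers
    never receive more than k items even when no filter applies.
    """
    if category is not None:
        candidates = [c for c in candidates if c[2] == category]
    return candidates[:k]

# Four candidates over-fetched for k=2; filtering keeps ranked order.
ranked = [(1, 0.9, "electronics"), (2, 0.8, "apparel"),
          (3, 0.7, "electronics"), (4, 0.6, "electronics")]
print(select_recommendations(ranked, 2, "electronics"))
```

Fetching `2 * k` before filtering gives the category filter headroom, so a selective category still usually yields a full page of `k` results.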

         # Format response
         formatted_recommendations = []

             age=user_profile.age,
             gender=user_profile.gender,
             income=user_profile.income,
+            item_id=request.item_id,
+            profession=user_profile.profession or "Other",
+            location=user_profile.location or "Urban",
+            education_level=user_profile.education_level or "High School",
+            marital_status=user_profile.marital_status or "Single",
+            interaction_history=user_profile.interaction_history
         )

         item_info = recommendation_engine._get_item_info(request.item_id)
api_2phase.py DELETED
@@ -1,521 +0,0 @@
-#!/usr/bin/env python3
-"""
-API for 2-Phase Trained Recommendation System
-
-This API serves recommendations from a model trained using the 2-phase approach:
-1. Pre-trained item tower
-2. Joint training with fine-tuned item tower
-
-Usage:
-    python api_2phase.py
-
-Then access: http://localhost:8000
-"""
-
-from fastapi import FastAPI, HTTPException
-from fastapi.middleware.cors import CORSMiddleware
-from pydantic import BaseModel
-from typing import List, Optional, Dict, Any
-import uvicorn
-import os
-import sys
-import pandas as pd
-
-# Add src to path for imports and set working directory
-parent_dir = os.path.dirname(__file__)
-sys.path.append(parent_dir)
-os.chdir(parent_dir)  # Change to project root directory
-
-from src.inference.recommendation_engine import RecommendationEngine
-from src.utils.real_user_selector import RealUserSelector
-
-# Initialize FastAPI app
-app = FastAPI(
-    title="Two-Tower Recommendation API (2-Phase Training)",
-    description="API for serving recommendations using a two-tower architecture trained with 2-phase approach",
-    version="1.0.0-2phase"
-)
-
-# Add CORS middleware
-app.add_middleware(
-    CORSMiddleware,
-    allow_origins=["*"],  # Configure appropriately for production
-    allow_credentials=True,
-    allow_methods=["*"],
-    allow_headers=["*"],
-)
-
-# Global instances
-recommendation_engine = None
-enhanced_recommendation_engine = None
-real_user_selector = None
-
-
-# Pydantic models for request/response
-class UserProfile(BaseModel):
-    age: int
-    gender: str  # "male" or "female"
-    income: float
-    interaction_history: Optional[List[int]] = []
-
-
-class RecommendationRequest(BaseModel):
-    user_profile: UserProfile
-    num_recommendations: int = 10
-    recommendation_type: str = "hybrid"  # "collaborative", "content", "hybrid", "enhanced", "enhanced_128d", "category_focused"
-    collaborative_weight: Optional[float] = 0.7
-    category_boost: Optional[float] = 1.5  # For enhanced recommendations
-    enable_category_boost: Optional[bool] = True
-    enable_diversity: Optional[bool] = True
-
-
-class ItemSimilarityRequest(BaseModel):
-    item_id: int
-    num_recommendations: int = 10
-
-
-class RatingPredictionRequest(BaseModel):
-    user_profile: UserProfile
-    item_id: int
-
-
-class ItemInfo(BaseModel):
-    product_id: int
-    category_id: int
-    category_code: str
-    brand: str
-    price: float
-
-
-class RecommendationResponse(BaseModel):
-    item_id: int
-    score: float
-    item_info: ItemInfo
-
-
-class RecommendationsResponse(BaseModel):
-    recommendations: List[RecommendationResponse]
-    user_profile: UserProfile
-    recommendation_type: str
-    total_count: int
-    training_approach: str = "2-phase"
-
-
-class RatingPredictionResponse(BaseModel):
-    user_profile: UserProfile
-    item_id: int
-    predicted_rating: float
-    item_info: ItemInfo
-
-
-class RealUserProfile(BaseModel):
-    user_id: int
-    age: int
-    gender: str
-    income: int
-    interaction_history: List[int]
-    interaction_stats: Dict[str, int]
-    interaction_pattern: str
-    summary: str
-
-
-class RealUsersResponse(BaseModel):
-    users: List[RealUserProfile]
-    total_count: int
-    dataset_summary: Dict[str, Any]
-
-
-@app.on_event("startup")
-async def startup_event():
-    """Initialize the recommendation engines and real user selector on startup."""
-    global recommendation_engine, enhanced_recommendation_engine, real_user_selector
-
-    print("πŸš€ Starting 2-Phase Training API...")
-    print(" Training approach: Pre-trained item tower + Joint fine-tuning")
-
-    try:
-        print("Loading 2-phase trained recommendation engine...")
-        recommendation_engine = RecommendationEngine()
-        print("βœ… 2-phase recommendation engine loaded successfully!")
-    except Exception as e:
-        print(f"❌ Error loading recommendation engine: {e}")
-        recommendation_engine = None
-
-    try:
-        print("Loading enhanced recommendation engine...")
-        # Try enhanced 128D engine first, fallback to regular enhanced
-        try:
-            from src.inference.enhanced_recommendation_engine_128d import Enhanced128DRecommendationEngine
-            enhanced_recommendation_engine = Enhanced128DRecommendationEngine()
-            print("βœ… Using Enhanced 128D Recommendation Engine")
-        except:
-            from src.inference.enhanced_recommendation_engine import EnhancedRecommendationEngine
-            enhanced_recommendation_engine = EnhancedRecommendationEngine()
-            print("⚠️ Using fallback Enhanced Recommendation Engine")
-        print("Enhanced recommendation engine loaded successfully!")
-    except Exception as e:
-        print(f"Error loading enhanced recommendation engine: {e}")
-        enhanced_recommendation_engine = None
-
-    try:
-        print("Loading real user selector...")
-        real_user_selector = RealUserSelector()
-        print("Real user selector loaded successfully!")
-    except Exception as e:
-        print(f"Error loading real user selector: {e}")
-        real_user_selector = None
-
-    print("🎯 2-Phase API ready to serve recommendations!")
-
-
-@app.get("/")
-async def root():
-    """Root endpoint with API information."""
-    return {
-        "message": "Two-Tower Recommendation API (2-Phase Training)",
-        "version": "1.0.0-2phase",
-        "training_approach": "2-phase (pre-trained item tower + joint fine-tuning)",
-        "status": "active" if recommendation_engine is not None else "initialization_failed"
-    }
-
-
-@app.get("/health")
-async def health_check():
-    """Health check endpoint."""
-    return {
-        "status": "healthy" if recommendation_engine is not None else "unhealthy",
-        "engine_loaded": recommendation_engine is not None,
-        "training_approach": "2-phase"
-    }
-
-
-@app.get("/model-info")
-async def model_info():
-    """Get information about the loaded model."""
-    if recommendation_engine is None:
-        raise HTTPException(status_code=503, detail="Recommendation engine not available")
-
-    return {
-        "training_approach": "2-phase",
-        "description": "Pre-trained item tower followed by joint training with user tower",
-        "phases": [
-            "Phase 1: Item tower pre-training on item features only",
-            "Phase 2: Joint training of user tower + fine-tuning pre-trained item tower"
-        ],
-        "embedding_dimension": 128,
-        "item_vocab_size": len(recommendation_engine.data_processor.item_vocab) if recommendation_engine.data_processor else "unknown",
-        "artifacts_loaded": {
-            "item_tower_pretrained": "src/artifacts/item_tower_weights",
-            "item_tower_finetuned": "src/artifacts/item_tower_weights_finetuned_best",
-            "user_tower": "src/artifacts/user_tower_weights_best",
-            "rating_model": "src/artifacts/rating_model_weights_best"
-        }
-    }
-
-
-@app.get("/real-users", response_model=RealUsersResponse)
-async def get_real_users(count: int = 100, min_interactions: int = 5):
-    """Get real user profiles with genuine interaction histories."""
-
-    if real_user_selector is None:
-        raise HTTPException(status_code=503, detail="Real user selector not available")
-
-    try:
-        # Get real user profiles
-        real_users = real_user_selector.get_real_users(n=count, min_interactions=min_interactions)
-
-        # Get dataset summary
-        dataset_summary = real_user_selector.get_dataset_summary()
-
-        # Format users for response
-        formatted_users = []
-        for user in real_users:
-            formatted_users.append(RealUserProfile(**user))
-
-        return RealUsersResponse(
-            users=formatted_users,
-            total_count=len(formatted_users),
-            dataset_summary=dataset_summary
-        )
-
-    except Exception as e:
-        raise HTTPException(status_code=500, detail=f"Error retrieving real users: {str(e)}")
-
-
-@app.get("/real-users/{user_id}")
-async def get_real_user_details(user_id: int):
-    """Get detailed interaction breakdown for a specific real user."""
-
-    if real_user_selector is None:
-        raise HTTPException(status_code=503, detail="Real user selector not available")
-
-    try:
-        user_details = real_user_selector.get_user_interaction_details(user_id)
-
-        if "error" in user_details:
-            raise HTTPException(status_code=404, detail=user_details["error"])
-
-        return user_details
-
-    except Exception as e:
-        raise HTTPException(status_code=500, detail=f"Error retrieving user details: {str(e)}")
-
-
-@app.get("/dataset-summary")
-async def get_dataset_summary():
-    """Get summary statistics of the real dataset."""
-
-    if real_user_selector is None:
-        raise HTTPException(status_code=503, detail="Real user selector not available")
-
-    try:
-        return real_user_selector.get_dataset_summary()
-
-    except Exception as e:
-        raise HTTPException(status_code=500, detail=f"Error retrieving dataset summary: {str(e)}")
-
-
-@app.post("/recommendations", response_model=RecommendationsResponse)
-async def get_recommendations(request: RecommendationRequest):
-    """Get item recommendations for a user."""
-
-    if recommendation_engine is None:
-        raise HTTPException(status_code=503, detail="Recommendation engine not available")
-
-    try:
-        user_profile = request.user_profile
-
-        # Generate recommendations based on type
-        if request.recommendation_type == "collaborative":
-            recommendations = recommendation_engine.recommend_items_collaborative(
-                age=user_profile.age,
-                gender=user_profile.gender,
-                income=user_profile.income,
-                interaction_history=user_profile.interaction_history,
-                k=request.num_recommendations
-            )
-
-        elif request.recommendation_type == "content":
-            if not user_profile.interaction_history:
-                raise HTTPException(
-                    status_code=400,
-                    detail="Content-based recommendations require interaction history"
-                )
-
-            # Use most recent interaction as seed
-            seed_item = user_profile.interaction_history[-1]
-            recommendations = recommendation_engine.recommend_items_content_based(
-                seed_item_id=seed_item,
-                k=request.num_recommendations
-            )
-
-        elif request.recommendation_type == "hybrid":
-            recommendations = recommendation_engine.recommend_items_hybrid(
-                age=user_profile.age,
-                gender=user_profile.gender,
-                income=user_profile.income,
-                interaction_history=user_profile.interaction_history,
-                k=request.num_recommendations,
-                collaborative_weight=request.collaborative_weight
-            )
-
-        elif request.recommendation_type == "enhanced":
-            if enhanced_recommendation_engine is None:
-                raise HTTPException(status_code=503, detail="Enhanced recommendation engine not available")
-
-            # Check if it's the 128D engine or fallback
-            if hasattr(enhanced_recommendation_engine, 'recommend_items_enhanced'):
-                # 128D Enhanced engine
-                recommendations = enhanced_recommendation_engine.recommend_items_enhanced(
-                    age=user_profile.age,
-                    gender=user_profile.gender,
-                    income=user_profile.income,
-                    interaction_history=user_profile.interaction_history,
-                    k=request.num_recommendations,
-                    diversity_weight=0.3 if request.enable_diversity else 0.0,
-                    category_boost=request.category_boost if request.enable_category_boost else 1.0
-                )
-            else:
-                # Fallback enhanced engine
-                recommendations = enhanced_recommendation_engine.recommend_items_enhanced_hybrid(
-                    age=user_profile.age,
-                    gender=user_profile.gender,
-                    income=user_profile.income,
-                    interaction_history=user_profile.interaction_history,
-                    k=request.num_recommendations,
-                    collaborative_weight=request.collaborative_weight,
-                    category_boost=request.category_boost,
-                    enable_category_boost=request.enable_category_boost,
-                    enable_diversity=request.enable_diversity
-                )
-
-        elif request.recommendation_type == "enhanced_128d":
-            if enhanced_recommendation_engine is None or not hasattr(enhanced_recommendation_engine, 'recommend_items_enhanced'):
-                raise HTTPException(status_code=503, detail="Enhanced 128D recommendation engine not available")
-
-            recommendations = enhanced_recommendation_engine.recommend_items_enhanced(
-                age=user_profile.age,
-                gender=user_profile.gender,
-                income=user_profile.income,
-                interaction_history=user_profile.interaction_history,
-                k=request.num_recommendations,
-                diversity_weight=0.3 if request.enable_diversity else 0.0,
-                category_boost=request.category_boost if request.enable_category_boost else 1.0
-            )
-
-        elif request.recommendation_type == "category_focused":
-            if enhanced_recommendation_engine is None:
-                raise HTTPException(status_code=503, detail="Enhanced recommendation engine not available")
-
-            recommendations = enhanced_recommendation_engine.recommend_items_category_focused(
-                age=user_profile.age,
-                gender=user_profile.gender,
-                income=user_profile.income,
-                interaction_history=user_profile.interaction_history,
-                k=request.num_recommendations,
-                focus_percentage=0.8
-            )
-
-        else:
-            raise HTTPException(
-                status_code=400,
-                detail="Invalid recommendation_type. Must be 'collaborative', 'content', 'hybrid', 'enhanced', 'enhanced_128d', or 'category_focused'"
-            )
-
-        # Format response
-        formatted_recommendations = []
-        for item_id, score, item_info in recommendations:
-            formatted_recommendations.append(
-                RecommendationResponse(
-                    item_id=item_id,
-                    score=score,
-                    item_info=ItemInfo(**item_info)
-                )
-            )
-
-        return RecommendationsResponse(
-            recommendations=formatted_recommendations,
-            user_profile=user_profile,
-            recommendation_type=request.recommendation_type,
-            total_count=len(formatted_recommendations),
-            training_approach="2-phase"
-        )
-
-    except Exception as e:
-        raise HTTPException(status_code=500, detail=f"Error generating recommendations: {str(e)}")
-
-
-@app.post("/item-similarity", response_model=List[RecommendationResponse])
-async def get_similar_items(request: ItemSimilarityRequest):
-    """Get items similar to a given item."""
-
-    if recommendation_engine is None:
-        raise HTTPException(status_code=503, detail="Recommendation engine not available")
-
-    try:
-        recommendations = recommendation_engine.recommend_items_content_based(
-            seed_item_id=request.item_id,
-            k=request.num_recommendations
-        )
-
-        formatted_recommendations = []
-        for item_id, score, item_info in recommendations:
-            formatted_recommendations.append(
-                RecommendationResponse(
-                    item_id=item_id,
-                    score=score,
-                    item_info=ItemInfo(**item_info)
-                )
-            )
-
-        return formatted_recommendations
-
-    except Exception as e:
-        raise HTTPException(status_code=500, detail=f"Error finding similar items: {str(e)}")
-
-
-@app.post("/predict-rating", response_model=RatingPredictionResponse)
-async def predict_user_item_rating(request: RatingPredictionRequest):
-    """Predict rating for a user-item pair."""
-
-    if recommendation_engine is None:
-        raise HTTPException(status_code=503, detail="Recommendation engine not available")
-
-    try:
-        user_profile = request.user_profile
-
-        predicted_rating = recommendation_engine.predict_rating(
-            age=user_profile.age,
-            gender=user_profile.gender,
-            income=user_profile.income,
-            interaction_history=user_profile.interaction_history,
-            item_id=request.item_id
-        )
-
-        item_info = recommendation_engine._get_item_info(request.item_id)
-
-        return RatingPredictionResponse(
-            user_profile=user_profile,
-            item_id=request.item_id,
-            predicted_rating=predicted_rating,
-            item_info=ItemInfo(**item_info)
-        )
-
-    except Exception as e:
-        raise HTTPException(status_code=500, detail=f"Error predicting rating: {str(e)}")
-
-
-@app.get("/items/{item_id}", response_model=ItemInfo)
-async def get_item_info(item_id: int):
-    """Get information about a specific item."""
-
-    if recommendation_engine is None:
-        raise HTTPException(status_code=503, detail="Recommendation engine not available")
-
-    try:
-        item_info = recommendation_engine._get_item_info(item_id)
-        return ItemInfo(**item_info)
-
-    except Exception as e:
-        raise HTTPException(status_code=500, detail=f"Error retrieving item info: {str(e)}")
-
-
-@app.get("/items")
-async def get_sample_items(limit: int = 20):
-    """Get a sample of items for testing."""
-
-    if recommendation_engine is None:
-        raise HTTPException(status_code=503, detail="Recommendation engine not available")
-
-    try:
-        # Get sample items from the dataframe
-        sample_items = recommendation_engine.items_df.sample(n=min(limit, len(recommendation_engine.items_df)))
-
-        items = []
-        for _, row in sample_items.iterrows():
-            items.append({
-                "product_id": int(row['product_id']),
-                "category_id": int(row['category_id']),
-                "category_code": str(row['category_code']),
-                "brand": str(row['brand']) if pd.notna(row['brand']) else 'Unknown',
-                "price": float(row['price'])
-            })
-
-        return {"items": items, "total": len(items), "training_approach": "2-phase"}
-
-    except Exception as e:
-        raise HTTPException(status_code=500, detail=f"Error retrieving sample items: {str(e)}")
-
-
-if __name__ == "__main__":
-    print("πŸš€ Starting 2-Phase Training Recommendation API...")
-    print("πŸ“Š Training approach: Pre-trained item tower + Joint fine-tuning")
-    print("🌐 Server will be available at: http://localhost:8000")
-    print("πŸ“š API docs at: http://localhost:8000/docs")
-
-    uvicorn.run(
-        "api_2phase:app",
-        host="0.0.0.0",
-        port=8000,
-        reload=True
-    )

api_joint.py DELETED
@@ -1,522 +0,0 @@
-#!/usr/bin/env python3
-"""
-API for Single Joint Trained Recommendation System
-
-This API serves recommendations from a model trained using the single joint approach:
-- Both user and item towers trained simultaneously from scratch
-- End-to-end optimization without pre-training phases
-
-Usage:
-    python api_joint.py
-
-Then access: http://localhost:8000
-"""
-
-from fastapi import FastAPI, HTTPException
-from fastapi.middleware.cors import CORSMiddleware
-from pydantic import BaseModel
-from typing import List, Optional, Dict, Any
-import uvicorn
-import os
-import sys
-import pandas as pd
-
-# Add src to path for imports and set working directory
-parent_dir = os.path.dirname(__file__)
-sys.path.append(parent_dir)
-os.chdir(parent_dir)  # Change to project root directory
-
-from src.inference.recommendation_engine import RecommendationEngine
-from src.utils.real_user_selector import RealUserSelector
-
-# Initialize FastAPI app
-app = FastAPI(
-    title="Two-Tower Recommendation API (Single Joint Training)",
-    description="API for serving recommendations using a two-tower architecture trained with single joint approach",
-    version="1.0.0-joint"
-)
-
-# Add CORS middleware
-app.add_middleware(
-    CORSMiddleware,
-    allow_origins=["*"],  # Configure appropriately for production
-    allow_credentials=True,
-    allow_methods=["*"],
-    allow_headers=["*"],
-)
-
-# Global instances
-recommendation_engine = None
-enhanced_recommendation_engine = None
-real_user_selector = None
-
-
-# Pydantic models for request/response
-class UserProfile(BaseModel):
-    age: int
-    gender: str  # "male" or "female"
-    income: float
-    interaction_history: Optional[List[int]] = []
-
-
-class RecommendationRequest(BaseModel):
-    user_profile: UserProfile
-    num_recommendations: int = 10
-    recommendation_type: str = "hybrid"  # "collaborative", "content", "hybrid", "enhanced", "enhanced_128d", "category_focused"
-    collaborative_weight: Optional[float] = 0.7
-    category_boost: Optional[float] = 1.5  # For enhanced recommendations
-    enable_category_boost: Optional[bool] = True
-    enable_diversity: Optional[bool] = True
-
-
-class ItemSimilarityRequest(BaseModel):
-    item_id: int
-    num_recommendations: int = 10
-
-
-class RatingPredictionRequest(BaseModel):
-    user_profile: UserProfile
-    item_id: int
-
-
-class ItemInfo(BaseModel):
-    product_id: int
-    category_id: int
-    category_code: str
-    brand: str
-    price: float
-
-
-class RecommendationResponse(BaseModel):
-    item_id: int
-    score: float
-    item_info: ItemInfo
-
-
-class RecommendationsResponse(BaseModel):
-    recommendations: List[RecommendationResponse]
-    user_profile: UserProfile
-    recommendation_type: str
-    total_count: int
-    training_approach: str = "single-joint"
-
-
-class RatingPredictionResponse(BaseModel):
-    user_profile: UserProfile
-    item_id: int
-    predicted_rating: float
-    item_info: ItemInfo
-
-
-class RealUserProfile(BaseModel):
-    user_id: int
-    age: int
-    gender: str
-    income: int
-    interaction_history: List[int]
-    interaction_stats: Dict[str, int]
-    interaction_pattern: str
-    summary: str
-
-
-class RealUsersResponse(BaseModel):
-    users: List[RealUserProfile]
-    total_count: int
-    dataset_summary: Dict[str, Any]
-
-
-@app.on_event("startup")
-async def startup_event():
-    """Initialize the recommendation engines and real user selector on startup."""
-    global recommendation_engine, enhanced_recommendation_engine, real_user_selector
-
-    print("πŸš€ Starting Single Joint Training API...")
-    print(" Training approach: End-to-end joint optimization from scratch")
-
-    try:
-        print("Loading single joint trained recommendation engine...")
-        recommendation_engine = RecommendationEngine()
-        print("βœ… Single joint recommendation engine loaded successfully!")
-    except Exception as e:
-        print(f"❌ Error loading recommendation engine: {e}")
-        recommendation_engine = None
-
-    try:
-        print("Loading enhanced recommendation engine...")
-        # Try enhanced 128D engine first, fallback to regular enhanced
-        try:
-            from src.inference.enhanced_recommendation_engine_128d import Enhanced128DRecommendationEngine
-            enhanced_recommendation_engine = Enhanced128DRecommendationEngine()
-            print("βœ… Using Enhanced 128D Recommendation Engine")
-        except:
-            from src.inference.enhanced_recommendation_engine import EnhancedRecommendationEngine
-            enhanced_recommendation_engine = EnhancedRecommendationEngine()
-            print("⚠️ Using fallback Enhanced Recommendation Engine")
-        print("Enhanced recommendation engine loaded successfully!")
-    except Exception as e:
-        print(f"Error loading enhanced recommendation engine: {e}")
-        enhanced_recommendation_engine = None
-
-    try:
-        print("Loading real user selector...")
-        real_user_selector = RealUserSelector()
-        print("Real user selector loaded successfully!")
-    except Exception as e:
-        print(f"Error loading real user selector: {e}")
-        real_user_selector = None
-
-    print("🎯 Single Joint API ready to serve recommendations!")
-
-
-@app.get("/")
-async def root():
-    """Root endpoint with API information."""
-    return {
-        "message": "Two-Tower Recommendation API (Single Joint Training)",
-        "version": "1.0.0-joint",
-        "training_approach": "single-joint (end-to-end optimization from scratch)",
-        "status": "active" if recommendation_engine is not None else "initialization_failed"
-    }
-
-
-@app.get("/health")
-async def health_check():
-    """Health check endpoint."""
-    return {
-        "status": "healthy" if recommendation_engine is not None else "unhealthy",
-        "engine_loaded": recommendation_engine is not None,
-        "training_approach": "single-joint"
-    }
-
-
-@app.get("/model-info")
-async def model_info():
-    """Get information about the loaded model."""
-    if recommendation_engine is None:
-        raise HTTPException(status_code=503, detail="Recommendation engine not available")
-
-    return {
-        "training_approach": "single-joint",
-        "description": "User and item towers trained simultaneously from scratch",
-        "advantages": [
-            "End-to-end optimization for better task alignment",
-            "No pre-training phase required",
-            "Faster overall training pipeline",
-            "Direct optimization for recommendation objectives"
-        ],
-        "embedding_dimension": 128,
-        "item_vocab_size": len(recommendation_engine.data_processor.item_vocab) if recommendation_engine.data_processor else "unknown",
-        "artifacts_loaded": {
-            "user_tower": "src/artifacts/user_tower_weights_best",
-            "item_tower_joint": "src/artifacts/item_tower_weights_finetuned_best",
-            "rating_model": "src/artifacts/rating_model_weights_best"
-        }
-    }
-
-
-@app.get("/real-users", response_model=RealUsersResponse)
-async def get_real_users(count: int = 100, min_interactions: int = 5):
-    """Get real user profiles with genuine interaction histories."""
-
-    if real_user_selector is None:
-        raise HTTPException(status_code=503, detail="Real user selector not available")
-
-    try:
-        # Get real user profiles
-        real_users = real_user_selector.get_real_users(n=count, min_interactions=min_interactions)
-
-        # Get dataset summary
-        dataset_summary = real_user_selector.get_dataset_summary()
-
-        # Format users for response
-        formatted_users = []
-        for user in real_users:
-            formatted_users.append(RealUserProfile(**user))
-
-        return RealUsersResponse(
-            users=formatted_users,
-            total_count=len(formatted_users),
-            dataset_summary=dataset_summary
-        )
-
-    except Exception as e:
-        raise HTTPException(status_code=500, detail=f"Error retrieving real users: {str(e)}")
-
-
-@app.get("/real-users/{user_id}")
-async def get_real_user_details(user_id: int):
-    """Get detailed interaction breakdown for a specific real user."""
-
-    if real_user_selector is None:
-        raise HTTPException(status_code=503, detail="Real user selector not available")
-
-    try:
-        user_details = real_user_selector.get_user_interaction_details(user_id)
-
-        if "error" in user_details:
-            raise HTTPException(status_code=404, detail=user_details["error"])
-
-        return user_details
-
-    except Exception as e:
-        raise HTTPException(status_code=500, detail=f"Error retrieving user details: {str(e)}")
-
-
-@app.get("/dataset-summary")
-async def get_dataset_summary():
-    """Get summary statistics of the real dataset."""
-
-    if real_user_selector is None:
-        raise HTTPException(status_code=503, detail="Real user selector not available")
-
-    try:
-        return real_user_selector.get_dataset_summary()
-
-    except Exception as e:
- except Exception as e:
276
- raise HTTPException(status_code=500, detail=f"Error retrieving dataset summary: {str(e)}")
277
-
278
-
279
- @app.post("/recommendations", response_model=RecommendationsResponse)
280
- async def get_recommendations(request: RecommendationRequest):
281
- """Get item recommendations for a user."""
282
-
283
- if recommendation_engine is None:
284
- raise HTTPException(status_code=503, detail="Recommendation engine not available")
285
-
286
- try:
287
- user_profile = request.user_profile
288
-
289
- # Generate recommendations based on type
290
- if request.recommendation_type == "collaborative":
291
- recommendations = recommendation_engine.recommend_items_collaborative(
292
- age=user_profile.age,
293
- gender=user_profile.gender,
294
- income=user_profile.income,
295
- interaction_history=user_profile.interaction_history,
296
- k=request.num_recommendations
297
- )
298
-
299
- elif request.recommendation_type == "content":
300
- if not user_profile.interaction_history:
301
- raise HTTPException(
302
- status_code=400,
303
- detail="Content-based recommendations require interaction history"
304
- )
305
-
306
- # Use most recent interaction as seed
307
- seed_item = user_profile.interaction_history[-1]
308
- recommendations = recommendation_engine.recommend_items_content_based(
309
- seed_item_id=seed_item,
310
- k=request.num_recommendations
311
- )
312
-
313
- elif request.recommendation_type == "hybrid":
314
- recommendations = recommendation_engine.recommend_items_hybrid(
315
- age=user_profile.age,
316
- gender=user_profile.gender,
317
- income=user_profile.income,
318
- interaction_history=user_profile.interaction_history,
319
- k=request.num_recommendations,
320
- collaborative_weight=request.collaborative_weight
321
- )
322
-
323
- elif request.recommendation_type == "enhanced":
324
- if enhanced_recommendation_engine is None:
325
- raise HTTPException(status_code=503, detail="Enhanced recommendation engine not available")
326
-
327
- # Check if it's the 128D engine or fallback
328
- if hasattr(enhanced_recommendation_engine, 'recommend_items_enhanced'):
329
- # 128D Enhanced engine
330
- recommendations = enhanced_recommendation_engine.recommend_items_enhanced(
331
- age=user_profile.age,
332
- gender=user_profile.gender,
333
- income=user_profile.income,
334
- interaction_history=user_profile.interaction_history,
335
- k=request.num_recommendations,
336
- diversity_weight=0.3 if request.enable_diversity else 0.0,
337
- category_boost=request.category_boost if request.enable_category_boost else 1.0
338
- )
339
- else:
340
- # Fallback enhanced engine
341
- recommendations = enhanced_recommendation_engine.recommend_items_enhanced_hybrid(
342
- age=user_profile.age,
343
- gender=user_profile.gender,
344
- income=user_profile.income,
345
- interaction_history=user_profile.interaction_history,
346
- k=request.num_recommendations,
347
- collaborative_weight=request.collaborative_weight,
348
- category_boost=request.category_boost,
349
- enable_category_boost=request.enable_category_boost,
350
- enable_diversity=request.enable_diversity
351
- )
352
-
353
- elif request.recommendation_type == "enhanced_128d":
354
- if enhanced_recommendation_engine is None or not hasattr(enhanced_recommendation_engine, 'recommend_items_enhanced'):
355
- raise HTTPException(status_code=503, detail="Enhanced 128D recommendation engine not available")
356
-
357
- recommendations = enhanced_recommendation_engine.recommend_items_enhanced(
358
- age=user_profile.age,
359
- gender=user_profile.gender,
360
- income=user_profile.income,
361
- interaction_history=user_profile.interaction_history,
362
- k=request.num_recommendations,
363
- diversity_weight=0.3 if request.enable_diversity else 0.0,
364
- category_boost=request.category_boost if request.enable_category_boost else 1.0
365
- )
366
-
367
- elif request.recommendation_type == "category_focused":
368
- if enhanced_recommendation_engine is None:
369
- raise HTTPException(status_code=503, detail="Enhanced recommendation engine not available")
370
-
371
- recommendations = enhanced_recommendation_engine.recommend_items_category_focused(
372
- age=user_profile.age,
373
- gender=user_profile.gender,
374
- income=user_profile.income,
375
- interaction_history=user_profile.interaction_history,
376
- k=request.num_recommendations,
377
- focus_percentage=0.8
378
- )
379
-
380
- else:
381
- raise HTTPException(
382
- status_code=400,
383
- detail="Invalid recommendation_type. Must be 'collaborative', 'content', 'hybrid', 'enhanced', 'enhanced_128d', or 'category_focused'"
384
- )
385
-
386
- # Format response
387
- formatted_recommendations = []
388
- for item_id, score, item_info in recommendations:
389
- formatted_recommendations.append(
390
- RecommendationResponse(
391
- item_id=item_id,
392
- score=score,
393
- item_info=ItemInfo(**item_info)
394
- )
395
- )
396
-
397
- return RecommendationsResponse(
398
- recommendations=formatted_recommendations,
399
- user_profile=user_profile,
400
- recommendation_type=request.recommendation_type,
401
- total_count=len(formatted_recommendations),
402
- training_approach="single-joint"
403
- )
404
-
405
- except Exception as e:
406
- raise HTTPException(status_code=500, detail=f"Error generating recommendations: {str(e)}")
407
-
408
-
409
- @app.post("/item-similarity", response_model=List[RecommendationResponse])
410
- async def get_similar_items(request: ItemSimilarityRequest):
411
- """Get items similar to a given item."""
412
-
413
- if recommendation_engine is None:
414
- raise HTTPException(status_code=503, detail="Recommendation engine not available")
415
-
416
- try:
417
- recommendations = recommendation_engine.recommend_items_content_based(
418
- seed_item_id=request.item_id,
419
- k=request.num_recommendations
420
- )
421
-
422
- formatted_recommendations = []
423
- for item_id, score, item_info in recommendations:
424
- formatted_recommendations.append(
425
- RecommendationResponse(
426
- item_id=item_id,
427
- score=score,
428
- item_info=ItemInfo(**item_info)
429
- )
430
- )
431
-
432
- return formatted_recommendations
433
-
434
- except Exception as e:
435
- raise HTTPException(status_code=500, detail=f"Error finding similar items: {str(e)}")
436
-
437
-
438
- @app.post("/predict-rating", response_model=RatingPredictionResponse)
439
- async def predict_user_item_rating(request: RatingPredictionRequest):
440
- """Predict rating for a user-item pair."""
441
-
442
- if recommendation_engine is None:
443
- raise HTTPException(status_code=503, detail="Recommendation engine not available")
444
-
445
- try:
446
- user_profile = request.user_profile
447
-
448
- predicted_rating = recommendation_engine.predict_rating(
449
- age=user_profile.age,
450
- gender=user_profile.gender,
451
- income=user_profile.income,
452
- interaction_history=user_profile.interaction_history,
453
- item_id=request.item_id
454
- )
455
-
456
- item_info = recommendation_engine._get_item_info(request.item_id)
457
-
458
- return RatingPredictionResponse(
459
- user_profile=user_profile,
460
- item_id=request.item_id,
461
- predicted_rating=predicted_rating,
462
- item_info=ItemInfo(**item_info)
463
- )
464
-
465
- except Exception as e:
466
- raise HTTPException(status_code=500, detail=f"Error predicting rating: {str(e)}")
467
-
468
-
469
- @app.get("/items/{item_id}", response_model=ItemInfo)
470
- async def get_item_info(item_id: int):
471
- """Get information about a specific item."""
472
-
473
- if recommendation_engine is None:
474
- raise HTTPException(status_code=503, detail="Recommendation engine not available")
475
-
476
- try:
477
- item_info = recommendation_engine._get_item_info(item_id)
478
- return ItemInfo(**item_info)
479
-
480
- except Exception as e:
481
- raise HTTPException(status_code=500, detail=f"Error retrieving item info: {str(e)}")
482
-
483
-
484
- @app.get("/items")
485
- async def get_sample_items(limit: int = 20):
486
- """Get a sample of items for testing."""
487
-
488
- if recommendation_engine is None:
489
- raise HTTPException(status_code=503, detail="Recommendation engine not available")
490
-
491
- try:
492
- # Get sample items from the dataframe
493
- sample_items = recommendation_engine.items_df.sample(n=min(limit, len(recommendation_engine.items_df)))
494
-
495
- items = []
496
- for _, row in sample_items.iterrows():
497
- items.append({
498
- "product_id": int(row['product_id']),
499
- "category_id": int(row['category_id']),
500
- "category_code": str(row['category_code']),
501
- "brand": str(row['brand']) if pd.notna(row['brand']) else 'Unknown',
502
- "price": float(row['price'])
503
- })
504
-
505
- return {"items": items, "total": len(items), "training_approach": "single-joint"}
506
-
507
- except Exception as e:
508
- raise HTTPException(status_code=500, detail=f"Error retrieving sample items: {str(e)}")
509
-
510
-
511
- if __name__ == "__main__":
512
- print("πŸš€ Starting Single Joint Training Recommendation API...")
513
- print("⚑ Training approach: End-to-end joint optimization from scratch")
514
- print("🌐 Server will be available at: http://localhost:8000")
515
- print("πŸ“š API docs at: http://localhost:8000/docs")
516
-
517
- uvicorn.run(
518
- "api_joint:app",
519
- host="0.0.0.0",
520
- port=8000,
521
- reload=True
522
- )
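The removed file above exposed the single-joint API over HTTP. For orientation, a client payload for the (now removed) `/recommendations` route could be sketched as follows. This is an illustrative sketch only: the field names mirror the request-handling code above, but the exact Pydantic request models are not shown in this diff, so treat `build_recommendation_request` and its defaults as assumptions.

```python
import json

# Recommendation strategies accepted by the removed /recommendations
# endpoint (taken from its validation branch above).
VALID_TYPES = {
    "collaborative", "content", "hybrid",
    "enhanced", "enhanced_128d", "category_focused",
}


def build_recommendation_request(age, gender, income, history,
                                 rec_type="hybrid", k=10,
                                 collaborative_weight=0.7):
    """Build a JSON payload shaped like RecommendationRequest.

    Mirrors the server-side checks: rec_type must be a supported
    strategy, and content-based requests need a non-empty history.
    """
    if rec_type not in VALID_TYPES:
        raise ValueError(f"Invalid recommendation_type: {rec_type}")
    if rec_type == "content" and not history:
        raise ValueError("Content-based recommendations require interaction history")
    return {
        "user_profile": {
            "age": age,
            "gender": gender,
            "income": income,
            "interaction_history": list(history),
        },
        "recommendation_type": rec_type,
        "num_recommendations": k,
        "collaborative_weight": collaborative_weight,
    }


payload = build_recommendation_request(30, "male", 50000, [1004856, 1005115])
print(json.dumps(payload, indent=2))
# To exercise a running server:
#   requests.post("http://localhost:8000/recommendations", json=payload)
```

The local validation duplicates the endpoint's 400-level checks so malformed requests can be caught before any network call is made.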
datasets/interactions.csv ADDED
The diff for this file is too large to render. See raw diff
 
datasets/items.csv ADDED
The diff for this file is too large to render. See raw diff
 
datasets/users.csv ADDED
The diff for this file is too large to render. See raw diff
 
frontend/src/App.css CHANGED
@@ -3,6 +3,50 @@
   font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', sans-serif;
 }
 
 /* Performance Monitoring Widget */
 .performance-widget {
   position: fixed;
@@ -1220,6 +1264,24 @@
   font-size: 12px;
 }
 
 .pattern-summary {
   display: flex;
   gap: 20px;
 
   font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', sans-serif;
 }
 
+/* Enhanced Demographics Styling */
+.demographic-features {
+  background: linear-gradient(135deg, #f8f9fa 0%, #e9ecef 100%);
+  padding: 20px;
+  border-radius: 10px;
+  border: 2px solid #dee2e6;
+  margin: 15px 0;
+}
+
+.demographic-features .form-group {
+  position: relative;
+}
+
+.demographic-features .form-group label {
+  font-weight: 600;
+  color: #495057;
+  font-size: 14px;
+  margin-bottom: 8px;
+  display: block;
+}
+
+.demographic-features .form-group select {
+  width: 100%;
+  padding: 10px 12px;
+  border: 2px solid #ced4da;
+  border-radius: 6px;
+  font-size: 14px;
+  background: white;
+  color: #495057;
+  transition: all 0.2s ease;
+}
+
+.demographic-features .form-group select:focus {
+  outline: none;
+  border-color: #007bff;
+  box-shadow: 0 0 0 2px rgba(0, 123, 255, 0.25);
+}
+
+.demographic-features .form-group select:disabled {
+  background-color: #f8f9fa;
+  color: #6c757d;
+  cursor: not-allowed;
+}
+
 /* Performance Monitoring Widget */
 .performance-widget {
   position: fixed;
 
   font-size: 12px;
 }
 
+.pattern-btn.new-user-pattern {
+  border-color: #6c757d;
+  color: #6c757d;
+  background: #f8f9fa;
+}
+
+.pattern-btn.new-user-pattern:hover {
+  background: #6c757d;
+  color: white;
+  box-shadow: 0 4px 8px rgba(108, 117, 125, 0.3);
+}
+
+.pattern-btn.new-user-pattern.active {
+  background: #6c757d;
+  color: white;
+  box-shadow: 0 4px 8px rgba(108, 117, 125, 0.3);
+}
+
 .pattern-summary {
   display: flex;
   gap: 20px;
frontend/src/App.js CHANGED
@@ -6,6 +6,7 @@ const API_BASE_URL = process.env.REACT_APP_API_URL || 'http://localhost:8000';
 
 // Interaction patterns with realistic ratios
 const INTERACTION_PATTERNS = [
   { name: 'Light Browsing', views: 15, carts: 2, purchases: 0 },
   { name: 'Window Shopping', views: 25, carts: 5, purchases: 1 },
   { name: 'Serious Shopper', views: 35, carts: 8, purchases: 3 },
@@ -18,11 +19,15 @@ function App() {
     age: 30,
     gender: 'male',
     income: 50000,
     interaction_history: []
   });
 
   const [recommendationType, setRecommendationType] = useState('hybrid');
-  const [numRecommendations, setNumRecommendations] = useState(100);
   const [collaborativeWeight, setCollaborativeWeight] = useState(0.7);
 
   const [recommendations, setRecommendations] = useState([]);
@@ -301,6 +306,10 @@ function App() {
       age: user.age,
       gender: user.gender,
       income: user.income,
       interaction_history: user.interaction_history.slice(0, 50) // Limit to 50 items
     });
     // Clear any synthetic interactions and expanded states
@@ -443,7 +452,20 @@ function App() {
 
   const handlePatternSelect = (pattern) => {
     setSelectedPattern(pattern);
-    generateRealisticInteractions(pattern);
   };
 
   const toggleInteractionExpand = (interactionId) => {
@@ -472,6 +494,46 @@ function App() {
 
   const counts = getInteractionCounts();
 
   // Calculate category percentages from user interactions
   const getCategoryPercentages = () => {
     console.log('getCategoryPercentages called:', {
@@ -504,10 +566,7 @@ function App() {
     console.log('Enriched behavioral pattern results:', { categoryCounts, totalInteractions });
 
     if (totalInteractions > 0) {
-      const categoryPercentages = {};
-      Object.keys(categoryCounts).forEach(category => {
-        categoryPercentages[category] = ((categoryCounts[category] / totalInteractions) * 100).toFixed(1);
-      });
       console.log('Returning enriched behavioral pattern percentages:', categoryPercentages);
       return categoryPercentages;
     }
@@ -522,20 +581,15 @@ function App() {
 
     interactions.forEach(interaction => {
       console.log('Processing interaction:', interaction);
-      if (interaction.category && interaction.category !== 'Unknown') {
-        const category = interaction.category;
-        categoryCounts[category] = (categoryCounts[category] || 0) + 1;
-        totalInteractions++;
-      }
     });
 
     console.log('Synthetic interaction results:', { categoryCounts, totalInteractions });
 
     if (totalInteractions > 0) {
-      const categoryPercentages = {};
-      Object.keys(categoryCounts).forEach(category => {
-        categoryPercentages[category] = ((categoryCounts[category] / totalInteractions) * 100).toFixed(1);
-      });
       console.log('Returning synthetic percentages:', categoryPercentages);
       return categoryPercentages;
     }
@@ -554,11 +608,7 @@ function App() {
       totalInteractions++;
     });
 
-    const categoryPercentages = {};
-    Object.keys(categoryCounts).forEach(category => {
-      categoryPercentages[category] = ((categoryCounts[category] / totalInteractions) * 100).toFixed(1);
-    });
-
     return categoryPercentages;
   }
@@ -708,7 +758,13 @@ function App() {
 <div className="real-user-stats">
   <div className="user-stat">
     <span className="stat-label">Demographics:</span>
-    <span className="stat-value">{selectedRealUser.age}yr {selectedRealUser.gender}, ${selectedRealUser.income.toLocaleString()}</span>
   </div>
   <div className="user-stat">
     <span className="stat-label">Behavior Pattern:</span>
@@ -847,6 +903,77 @@ function App() {
   />
 </div>
 </div>
 </div>
 
 {/* Random Behavioral Patterns for Custom Users */}
@@ -979,7 +1106,7 @@ function App() {
 <div className="category-percentages">
   {Object.entries(categoryPercentages)
     .sort((a, b) => parseFloat(b[1]) - parseFloat(a[1]))
-    .slice(0, 5)
     .map(([category, percentage]) => (
       <div key={category} className="category-item">
         <div className="category-bar-container">
@@ -1110,7 +1237,7 @@ function App() {
 )}
 
 {/* Category Analysis for Custom Users */}
-{(selectedBehavioralPattern || interactions.length > 0 || userProfile.interaction_history.length > 0) && (
   <div
     key={`category-analysis-${interactions.length}-${selectedBehavioralPattern?.id || 'none'}-${sampleItems.length}`}
     className="category-analysis"
@@ -1128,7 +1255,7 @@ function App() {
 {Object.keys(categoryPercentages).length > 0 ? (
   Object.entries(categoryPercentages)
     .sort((a, b) => parseFloat(b[1]) - parseFloat(a[1]))
-    .slice(0, 5)
     .map(([category, percentage]) => (
       <div key={category} className="category-item">
         <div className="category-bar-container">
@@ -1141,6 +1268,23 @@ function App() {
   <span className="category-percent">{percentage}%</span>
 </div>
 ))
 ) : (
   <div className="category-loading">
     <p>Processing interaction categories...</p>
@@ -1325,18 +1469,24 @@ function App() {
 )}
 
 <h3>Synthetic Interaction Patterns</h3>
-<p>Generate realistic user behavior patterns with proportional view, cart, and purchase events</p>
 
 <div className="pattern-buttons">
   {INTERACTION_PATTERNS.map((pattern, index) => (
     <button
       key={index}
-      className={`pattern-btn ${selectedPattern?.name === pattern.name ? 'active' : ''}`}
       onClick={() => handlePatternSelect(pattern)}
     >
       {pattern.name}
       <br />
-      <small>{pattern.views}V β€’ {pattern.carts}C β€’ {pattern.purchases}P</small>
     </button>
   ))}
   <button
@@ -1347,6 +1497,25 @@ function App() {
     Clear All
   </button>
 </div>
 </>
 )}
 
@@ -1498,11 +1667,10 @@ function App() {
   value={recommendationType}
   onChange={(e) => setRecommendationType(e.target.value)}
 >
-  <option value="hybrid">Hybrid</option>
-  <option value="enhanced">🎯 Enhanced Hybrid (Category-Aware)</option>
-  <option value="category_focused">🎯 Category Focused (80% Match)</option>
   <option value="collaborative">Collaborative Filtering</option>
   <option value="content">Content-Based</option>
 </select>
 </div>
 
@@ -1521,7 +1689,7 @@ function App() {
 </select>
 </div>
 
-{(recommendationType === 'hybrid' || recommendationType === 'enhanced') && (
 <div className="form-group">
   <label htmlFor="collabWeight">Collaborative Weight:</label>
   <input
@@ -1548,7 +1716,13 @@ function App() {
 
 {recommendationType === 'content' && userProfile.interaction_history.length === 0 && (
   <p style={{color: '#dc3545', marginTop: '10px', fontSize: '14px'}}>
-    Content-based recommendations require interaction history. Please select an interaction pattern above.
   </p>
 )}
 </div>
@@ -1567,7 +1741,7 @@ function App() {
 
 <div className="stats">
   <strong>User Profile:</strong> {userProfile.age}yr {userProfile.gender},
-  ${userProfile.income.toLocaleString()} income
   {selectedCategory && (
     <span> | <strong>Category Filter:</strong> <span className="category-filter-display">{selectedCategory.replace(/\./g, ' > ')}</span></span>
   )}
 
6
 
7
  // Interaction patterns with realistic ratios
8
  const INTERACTION_PATTERNS = [
9
+ { name: 'New User (No History)', views: 0, carts: 0, purchases: 0, isNewUser: true },
10
  { name: 'Light Browsing', views: 15, carts: 2, purchases: 0 },
11
  { name: 'Window Shopping', views: 25, carts: 5, purchases: 1 },
12
  { name: 'Serious Shopper', views: 35, carts: 8, purchases: 3 },
 
19
  age: 30,
20
  gender: 'male',
21
  income: 50000,
22
+ profession: 'Technology',
23
+ location: 'Urban',
24
+ education_level: "Bachelor's",
25
+ marital_status: 'Single',
26
  interaction_history: []
27
  });
28
 
29
  const [recommendationType, setRecommendationType] = useState('hybrid');
30
+ const [numRecommendations, setNumRecommendations] = useState(10);
31
  const [collaborativeWeight, setCollaborativeWeight] = useState(0.7);
32
 
33
  const [recommendations, setRecommendations] = useState([]);
 
306
  age: user.age,
307
  gender: user.gender,
308
  income: user.income,
309
+ profession: user.profession || 'Other',
310
+ location: user.location || 'Urban',
311
+ education_level: user.education_level || 'High School',
312
+ marital_status: user.marital_status || 'Single',
313
  interaction_history: user.interaction_history.slice(0, 50) // Limit to 50 items
314
  });
315
  // Clear any synthetic interactions and expanded states
 
452
 
453
  const handlePatternSelect = (pattern) => {
454
  setSelectedPattern(pattern);
455
+
456
+ // Handle New User (zero interactions) pattern specially
457
+ if (pattern.isNewUser) {
458
+ // Clear all interactions and interaction history for new user
459
+ setInteractions([]);
460
+ setUserProfile(prev => ({
461
+ ...prev,
462
+ interaction_history: []
463
+ }));
464
+ console.log('Selected New User pattern - cleared all interactions');
465
+ } else {
466
+ // Generate realistic interactions for other patterns
467
+ generateRealisticInteractions(pattern);
468
+ }
469
  };
470
 
471
  const toggleInteractionExpand = (interactionId) => {
 
494
 
495
  const counts = getInteractionCounts();
496
 
497
+ // Utility function to normalize percentages to sum to exactly 100%
498
+ const normalizePercentages = (categoryCounts, totalInteractions) => {
499
+ if (totalInteractions === 0) return {};
500
+
501
+ const categories = Object.keys(categoryCounts);
502
+ if (categories.length === 0) return {};
503
+
504
+ // Calculate raw percentages
505
+ const rawPercentages = {};
506
+ categories.forEach(category => {
507
+ rawPercentages[category] = (categoryCounts[category] / totalInteractions) * 100;
508
+ });
509
+
510
+ // Round all percentages to 1 decimal place
511
+ const roundedPercentages = {};
512
+ let totalRounded = 0;
513
+ categories.forEach(category => {
514
+ roundedPercentages[category] = Math.round(rawPercentages[category] * 10) / 10;
515
+ totalRounded += roundedPercentages[category];
516
+ });
517
+
518
+ // Adjust the largest category to make total exactly 100%
519
+ const difference = 100.0 - totalRounded;
520
+ if (Math.abs(difference) > 0.01) {
521
+ // Find category with largest raw percentage
522
+ const largestCategory = categories.reduce((max, category) =>
523
+ rawPercentages[category] > rawPercentages[max] ? category : max
524
+ );
525
+ roundedPercentages[largestCategory] = Math.round((roundedPercentages[largestCategory] + difference) * 10) / 10;
526
+ }
527
+
528
+ // Convert to string with 1 decimal place
529
+ const normalizedPercentages = {};
530
+ categories.forEach(category => {
531
+ normalizedPercentages[category] = roundedPercentages[category].toFixed(1);
532
+ });
533
+
534
+ return normalizedPercentages;
535
+ };
536
+
537
  // Calculate category percentages from user interactions
538
  const getCategoryPercentages = () => {
539
  console.log('getCategoryPercentages called:', {
 
566
  console.log('Enriched behavioral pattern results:', { categoryCounts, totalInteractions });
567
 
568
  if (totalInteractions > 0) {
569
+ const categoryPercentages = normalizePercentages(categoryCounts, totalInteractions);
 
 
 
570
  console.log('Returning enriched behavioral pattern percentages:', categoryPercentages);
571
  return categoryPercentages;
572
  }
 
581
 
582
  interactions.forEach(interaction => {
583
  console.log('Processing interaction:', interaction);
584
+ const category = interaction.category_code || interaction.category || 'Unknown';
585
+ categoryCounts[category] = (categoryCounts[category] || 0) + 1;
586
+ totalInteractions++;
 
 
587
  });
588
 
589
  console.log('Synthetic interaction results:', { categoryCounts, totalInteractions });
590
 
591
  if (totalInteractions > 0) {
592
+ const categoryPercentages = normalizePercentages(categoryCounts, totalInteractions);
 
 
 
593
  console.log('Returning synthetic percentages:', categoryPercentages);
594
  return categoryPercentages;
595
  }
 
608
  totalInteractions++;
609
  });
610
 
611
+ const categoryPercentages = normalizePercentages(categoryCounts, totalInteractions);
 
 
 
 
612
  return categoryPercentages;
613
  }
614
 
 
758
  <div className="real-user-stats">
759
  <div className="user-stat">
760
  <span className="stat-label">Demographics:</span>
761
+ <span className="stat-value">
762
+ {selectedRealUser.age}yr {selectedRealUser.gender}, ${selectedRealUser.income.toLocaleString()}
763
+ {selectedRealUser.profession && ` | ${selectedRealUser.profession}`}
764
+ {selectedRealUser.location && ` | ${selectedRealUser.location}`}
765
+ {selectedRealUser.education_level && ` | ${selectedRealUser.education_level}`}
766
+ {selectedRealUser.marital_status && ` | ${selectedRealUser.marital_status}`}
767
+ </span>
768
  </div>
769
  <div className="user-stat">
770
  <span className="stat-label">Behavior Pattern:</span>
 
903
  />
904
  </div>
905
  </div>
906
+
907
+ {/* New Demographic Features */}
908
+ <div className="form-row demographic-features">
909
+ <div className="form-group">
910
+ <label htmlFor="profession">Profession:</label>
911
+ <select
912
+ id="profession"
913
+ value={userProfile.profession}
914
+ onChange={(e) => handleProfileChange('profession', e.target.value)}
915
+ disabled={useRealUsers && selectedRealUser}
916
+ style={{backgroundColor: useRealUsers && selectedRealUser ? '#f5f5f5' : 'white'}}
917
+ >
918
+ <option value="Technology">Technology</option>
919
+ <option value="Healthcare">Healthcare</option>
920
+ <option value="Education">Education</option>
921
+ <option value="Finance">Finance</option>
922
+ <option value="Retail">Retail</option>
923
+ <option value="Manufacturing">Manufacturing</option>
924
+ <option value="Services">Services</option>
925
+ <option value="Other">Other</option>
926
+ </select>
927
+ </div>
928
+
929
+ <div className="form-group">
930
+ <label htmlFor="location">Location:</label>
931
+ <select
932
+ id="location"
933
+ value={userProfile.location}
934
+ onChange={(e) => handleProfileChange('location', e.target.value)}
935
+ disabled={useRealUsers && selectedRealUser}
936
+ style={{backgroundColor: useRealUsers && selectedRealUser ? '#f5f5f5' : 'white'}}
937
+ >
938
+ <option value="Urban">Urban</option>
939
+ <option value="Suburban">Suburban</option>
940
+ <option value="Rural">Rural</option>
941
+ </select>
942
+ </div>
943
+
944
+ <div className="form-group">
945
+ <label htmlFor="education_level">Education Level:</label>
946
+ <select
947
+ id="education_level"
948
+ value={userProfile.education_level}
949
+ onChange={(e) => handleProfileChange('education_level', e.target.value)}
950
+ disabled={useRealUsers && selectedRealUser}
951
+ style={{backgroundColor: useRealUsers && selectedRealUser ? '#f5f5f5' : 'white'}}
952
+ >
953
+ <option value="High School">High School</option>
954
+ <option value="Some College">Some College</option>
955
+ <option value="Bachelor's">Bachelor's</option>
956
+ <option value="Master's">Master's</option>
957
+ <option value="PhD+">PhD+</option>
958
+ </select>
959
+ </div>
960
+
961
+ <div className="form-group">
962
+ <label htmlFor="marital_status">Marital Status:</label>
963
+ <select
964
+ id="marital_status"
965
+ value={userProfile.marital_status}
966
+ onChange={(e) => handleProfileChange('marital_status', e.target.value)}
967
+ disabled={useRealUsers && selectedRealUser}
968
+ style={{backgroundColor: useRealUsers && selectedRealUser ? '#f5f5f5' : 'white'}}
969
+ >
970
+ <option value="Single">Single</option>
971
+ <option value="Married">Married</option>
972
+ <option value="Divorced">Divorced</option>
973
+ <option value="Widowed">Widowed</option>
974
+ </select>
975
+ </div>
976
+ </div>
977
  </div>
978
 
979
  {/* Random Behavioral Patterns for Custom Users */}
 
1106
  <div className="category-percentages">
1107
  {Object.entries(categoryPercentages)
1108
  .sort((a, b) => parseFloat(b[1]) - parseFloat(a[1]))
1109
+ .slice(0, 10)
1110
  .map(([category, percentage]) => (
1111
  <div key={category} className="category-item">
1112
  <div className="category-bar-container">
 
1237
  )}
1238
 
1239
  {/* Category Analysis for Custom Users */}
1240
+ {(selectedBehavioralPattern || interactions.length > 0 || userProfile.interaction_history.length > 0 || (recommendations.length > 0 && selectedPattern?.isNewUser)) && (
1241
  <div
1242
  key={`category-analysis-${interactions.length}-${selectedBehavioralPattern?.id || 'none'}-${sampleItems.length}`}
1243
  className="category-analysis"
 
1255
  {Object.keys(categoryPercentages).length > 0 ? (
1256
  Object.entries(categoryPercentages)
1257
  .sort((a, b) => parseFloat(b[1]) - parseFloat(a[1]))
1258
+ .slice(0, 10)
1259
  .map(([category, percentage]) => (
1260
  <div key={category} className="category-item">
1261
  <div className="category-bar-container">
 
1268
  <span className="category-percent">{percentage}%</span>
1269
  </div>
1270
  ))
1271
+ ) : selectedPattern?.isNewUser ? (
1272
+ <div className="new-user-category-message">
1273
+ <div style={{
1274
+ padding: '20px',
1275
+ backgroundColor: '#f8f9fa',
1276
+ border: '2px dashed #6c757d',
1277
+ borderRadius: '8px',
1278
+ textAlign: 'center',
1279
+ color: '#495057'
1280
+ }}>
1281
+ <h6 style={{margin: '0 0 8px 0', color: '#343a40'}}>πŸ†• New User - No History</h6>
1282
+ <p style={{margin: '0', fontSize: '14px'}}>
1283
+ No category preferences yet.<br />
1284
+ Recommendations are based on demographics only.
1285
+ </p>
1286
+ </div>
1287
+ </div>
1288
  ) : (
1289
  <div className="category-loading">
1290
  <p>Processing interaction categories...</p>
 
1469
  )}
1470
 
1471
  <h3>Synthetic Interaction Patterns</h3>
1472
+ <p>Generate realistic user behavior patterns with proportional view, cart, and purchase events. Choose "New User" to test cold-start scenarios.</p>
1473
 
1474
  <div className="pattern-buttons">
1475
  {INTERACTION_PATTERNS.map((pattern, index) => (
1476
  <button
1477
  key={index}
1478
+ className={`pattern-btn ${selectedPattern?.name === pattern.name ? 'active' : ''} ${pattern.isNewUser ? 'new-user-pattern' : ''}`}
1479
  onClick={() => handlePatternSelect(pattern)}
1480
  >
1481
  {pattern.name}
1482
  <br />
1483
+ <small>
1484
+ {pattern.isNewUser ? (
1485
+ <span style={{fontStyle: 'italic', color: '#6c757d'}}>Cold Start User</span>
1486
+ ) : (
1487
+ `${pattern.views}V β€’ ${pattern.carts}C β€’ ${pattern.purchases}P`
1488
+ )}
1489
+ </small>
1490
  </button>
1491
  ))}
1492
  <button
 
1497
  Clear All
1498
  </button>
1499
  </div>
1500
+
1501
+ {/* Show informational message for New User pattern */}
1502
+ {selectedPattern?.isNewUser && (
1503
+ <div style={{
1504
+ backgroundColor: '#e3f2fd',
1505
+ border: '1px solid #90caf9',
1506
+ borderRadius: '8px',
1507
+ padding: '15px',
1508
+ margin: '15px 0',
1509
+ color: '#1565c0'
1510
+ }}>
1511
+ <h4 style={{margin: '0 0 10px 0', color: '#0d47a1'}}>πŸ†• New User (Cold Start) Selected</h4>
1512
+ <p style={{margin: '0', fontSize: '14px', lineHeight: '1.4'}}>
1513
+ Testing cold-start scenario with no interaction history.
1514
+ <br /><strong>Compatible algorithms:</strong> Collaborative βœ…, Hybrid βœ… (demographics-based)
1515
+ <br /><strong>Incompatible:</strong> Content-based ❌, Category-boosted ❌ (require history)
1516
+ </p>
1517
+ </div>
1518
+ )}
1519
  </>
1520
  )}
1521
 
 
1667
  value={recommendationType}
1668
  onChange={(e) => setRecommendationType(e.target.value)}
1669
  >
1670
+ <option value="hybrid">Hybrid (Recommended)</option>
 
 
1671
  <option value="collaborative">Collaborative Filtering</option>
1672
  <option value="content">Content-Based</option>
1673
+ <option value="category_boosted">πŸ“Š Category Boosted (50% from user categories)</option>
1674
  </select>
1675
  </div>
1676
 
 
1689
  </select>
1690
  </div>
1691
 
1692
+ {recommendationType === 'hybrid' && (
1693
  <div className="form-group">
1694
  <label htmlFor="collabWeight">Collaborative Weight:</label>
1695
  <input
 
1716
 
1717
  {recommendationType === 'content' && userProfile.interaction_history.length === 0 && (
1718
  <p style={{color: '#dc3545', marginTop: '10px', fontSize: '14px'}}>
1719
+ ⚠️ Content-based recommendations require interaction history. Please select a pattern with interactions above, or choose 'Collaborative' or 'Hybrid' for new users.
1720
+ </p>
1721
+ )}
1722
+
1723
+ {recommendationType === 'category_boosted' && userProfile.interaction_history.length === 0 && (
1724
+ <p style={{color: '#dc3545', marginTop: '10px', fontSize: '14px'}}>
1725
+ ⚠️ Category-boosted recommendations require interaction history to analyze preferences. Please select a pattern with interactions above, or choose 'Collaborative' for new users.
1726
  </p>
1727
  )}
1728
  </div>
 
1741
 
1742
  <div className="stats">
1743
  <strong>User Profile:</strong> {userProfile.age}yr {userProfile.gender},
1744
+ ${userProfile.income.toLocaleString()} income, {userProfile.profession}, {userProfile.location}, {userProfile.education_level}, {userProfile.marital_status}
1745
  {selectedCategory && (
1746
  <span> | <strong>Category Filter:</strong> <span className="category-filter-display">{selectedCategory.replace(/\./g, ' > ')}</span></span>
1747
  )}
src/data_generation/generate_demographics.py ADDED
@@ -0,0 +1,292 @@
+ import pandas as pd
+ import numpy as np
+ from typing import Dict, List
+ import os
+
+ class DemographicDataGenerator:
+     """Generate realistic categorical demographic data correlated with existing age/income."""
+
+     def __init__(self, seed: int = 42):
+         np.random.seed(seed)
+
+         # Define categorical mappings
+         self.profession_categories = [
+             "Technology", "Healthcare", "Education", "Finance",
+             "Retail", "Manufacturing", "Services", "Other"
+         ]
+
+         self.location_categories = ["Urban", "Suburban", "Rural"]
+
+         self.education_categories = [
+             "High School", "Some College", "Bachelor's", "Master's", "PhD+"
+         ]
+
+         self.marital_categories = ["Single", "Married", "Divorced", "Widowed"]
+
+     def generate_profession(self, age: int, income: float, gender: str) -> str:
+         """Generate profession based on age, income, and gender correlations."""
+
+         # Age-based profession probabilities
+         if age < 25:
+             # Young adults - more likely in retail, services, some tech
+             probs = [0.15, 0.10, 0.08, 0.05, 0.25, 0.10, 0.20, 0.07]
+         elif age < 35:
+             # Early career - tech, healthcare, finance growth
+             probs = [0.25, 0.15, 0.10, 0.15, 0.12, 0.08, 0.10, 0.05]
+         elif age < 50:
+             # Mid career - established in all fields
+             probs = [0.20, 0.18, 0.15, 0.18, 0.08, 0.12, 0.07, 0.02]
+         else:
+             # Senior career - more in education, healthcare, services
+             probs = [0.15, 0.20, 0.20, 0.15, 0.05, 0.15, 0.08, 0.02]
+
+         # Income adjustments
+         if income > 90000:  # High income
+             # Boost tech, finance, healthcare
+             probs[0] *= 1.5  # Technology
+             probs[3] *= 1.5  # Finance
+             probs[1] *= 1.3  # Healthcare
+             probs[4] *= 0.5  # Retail
+             probs[6] *= 0.7  # Services
+         elif income < 40000:  # Lower income
+             # Boost retail, services, manufacturing
+             probs[4] *= 2.0  # Retail
+             probs[6] *= 1.8  # Services
+             probs[5] *= 1.5  # Manufacturing
+             probs[0] *= 0.3  # Technology
+             probs[3] *= 0.3  # Finance
+
+         # Normalize probabilities
+         probs = np.array(probs)
+         probs = probs / np.sum(probs)
+
+         return np.random.choice(self.profession_categories, p=probs)
+
+     def generate_location(self, income: float, profession: str) -> str:
+         """Generate location based on income and profession."""
+
+         # Base probabilities (roughly US distribution)
+         probs = [0.62, 0.27, 0.11]  # Urban, Suburban, Rural
+
+         # Income adjustments
+         if income > 80000:
+             # Higher income -> more suburban
+             probs = [0.45, 0.45, 0.10]
+         elif income < 35000:
+             # Lower income -> more urban/rural
+             probs = [0.70, 0.15, 0.15]
+
+         # Profession adjustments
+         if profession in ["Technology", "Finance"]:
+             # Tech/Finance -> more urban
+             probs[0] *= 1.4
+             probs[2] *= 0.5
+         elif profession in ["Manufacturing", "Other"]:
+             # Manufacturing -> more rural/suburban
+             probs[1] *= 1.3
+             probs[2] *= 1.5
+             probs[0] *= 0.7
+
+         # Normalize
+         probs = np.array(probs)
+         probs = probs / np.sum(probs)
+
+         return np.random.choice(self.location_categories, p=probs)
+
+     def generate_education_level(self, age: int, income: float, profession: str) -> str:
+         """Generate education level based on age, income, and profession."""
+
+         # Base probabilities (roughly US distribution)
+         probs = [0.27, 0.20, 0.33, 0.13, 0.07]  # HS, Some College, Bachelor's, Master's, PhD+
+
+         # Age adjustments (older generations had less college access)
+         if age > 55:
+             probs = [0.40, 0.25, 0.25, 0.08, 0.02]
+         elif age > 40:
+             probs = [0.32, 0.23, 0.30, 0.12, 0.03]
+         elif age < 30:
+             # Younger generation has more education
+             probs = [0.20, 0.15, 0.40, 0.18, 0.07]
+
+         # Income adjustments
+         if income > 100000:
+             # High income -> more advanced degrees
+             probs = [0.10, 0.10, 0.35, 0.30, 0.15]
+         elif income > 70000:
+             # Good income -> more bachelor's/master's
+             probs = [0.15, 0.15, 0.45, 0.20, 0.05]
+         elif income < 40000:
+             # Lower income -> less higher education
+             probs = [0.45, 0.30, 0.20, 0.04, 0.01]
+
+         # Profession adjustments
+         if profession in ["Technology", "Healthcare", "Finance"]:
+             # Professional fields -> more degrees
+             probs = [0.05, 0.10, 0.40, 0.30, 0.15]
+         elif profession == "Education":
+             # Education -> even more advanced degrees
+             probs = [0.02, 0.05, 0.25, 0.45, 0.23]
+         elif profession in ["Retail", "Services", "Manufacturing"]:
+             # Service industries -> less higher education
+             probs = [0.40, 0.25, 0.25, 0.08, 0.02]
+
+         # Normalize
+         probs = np.array(probs)
+         probs = probs / np.sum(probs)
+
+         return np.random.choice(self.education_categories, p=probs)
+
+     def generate_marital_status(self, age: int, gender: str) -> str:
+         """Generate marital status based on age and gender."""
+
+         # Age-based probabilities
+         if age < 25:
+             probs = [0.85, 0.13, 0.02, 0.00]  # Single, Married, Divorced, Widowed
+         elif age < 35:
+             probs = [0.45, 0.50, 0.05, 0.00]
+         elif age < 50:
+             probs = [0.15, 0.70, 0.14, 0.01]
+         elif age < 65:
+             probs = [0.10, 0.65, 0.20, 0.05]
+         else:
+             probs = [0.08, 0.55, 0.15, 0.22]
+
+         # Gender adjustments (women tend to be widowed more often at older ages)
+         if age > 65 and gender == 'female':
+             probs[3] *= 2.0  # More widowed women
+             probs[1] *= 0.8  # Fewer married
+
+         # Normalize
+         probs = np.array(probs)
+         probs = probs / np.sum(probs)
+
+         return np.random.choice(self.marital_categories, p=probs)
+
+     def generate_user_demographics(self, users_df: pd.DataFrame) -> pd.DataFrame:
+         """Generate all demographic features for all users."""
+
+         print(f"Generating demographic data for {len(users_df)} users...")
+
+         # Create a copy to avoid modifying the original
+         enhanced_users = users_df.copy()
+
+         # Generate each demographic feature
+         professions = []
+         locations = []
+         education_levels = []
+         marital_statuses = []
+
+         for idx, row in users_df.iterrows():
+             age = row['age']
+             income = row['income']
+             gender = row['gender']
+
+             # Generate profession first as it influences other features
+             profession = self.generate_profession(age, income, gender)
+             professions.append(profession)
+
+             # Generate location based on income and profession
+             location = self.generate_location(income, profession)
+             locations.append(location)
+
+             # Generate education based on age, income, and profession
+             education = self.generate_education_level(age, income, profession)
+             education_levels.append(education)
+
+             # Generate marital status based on age and gender
+             marital_status = self.generate_marital_status(age, gender)
+             marital_statuses.append(marital_status)
+
+         # Add new columns
+         enhanced_users['profession'] = professions
+         enhanced_users['location'] = locations
+         enhanced_users['education_level'] = education_levels
+         enhanced_users['marital_status'] = marital_statuses
+
+         return enhanced_users
+
+     def print_demographic_statistics(self, users_df: pd.DataFrame):
+         """Print statistics about the generated demographics."""
+
+         print("\n=== Demographic Statistics ===")
+
+         # Profession distribution
+         print(f"\nProfession Distribution:")
+         prof_counts = users_df['profession'].value_counts()
+         for prof, count in prof_counts.items():
+             pct = (count / len(users_df)) * 100
+             print(f"  {prof}: {count:,} ({pct:.1f}%)")
+
+         # Location distribution
+         print(f"\nLocation Distribution:")
+         loc_counts = users_df['location'].value_counts()
+         for loc, count in loc_counts.items():
+             pct = (count / len(users_df)) * 100
+             print(f"  {loc}: {count:,} ({pct:.1f}%)")
+
+         # Education distribution
+         print(f"\nEducation Level Distribution:")
+         edu_counts = users_df['education_level'].value_counts()
+         for edu, count in edu_counts.items():
+             pct = (count / len(users_df)) * 100
+             print(f"  {edu}: {count:,} ({pct:.1f}%)")
+
+         # Marital status distribution
+         print(f"\nMarital Status Distribution:")
+         marital_counts = users_df['marital_status'].value_counts()
+         for status, count in marital_counts.items():
+             pct = (count / len(users_df)) * 100
+             print(f"  {status}: {count:,} ({pct:.1f}%)")
+
+         print(f"\nTotal users: {len(users_df):,}")
+
+         # Cross-tabulations to show correlations
+         print(f"\n=== Key Correlations ===")
+
+         # High-income professions
+         high_income = users_df[users_df['income'] > 80000]
+         print(f"\nTop professions for high income (>${80000:,}):")
+         high_income_prof = high_income['profession'].value_counts(normalize=True) * 100
+         for prof, pct in high_income_prof.head().items():
+             print(f"  {prof}: {pct:.1f}%")
+
+         # Education by profession
+         print(f"\nEducation levels in Technology:")
+         tech_edu = users_df[users_df['profession'] == 'Technology']['education_level'].value_counts(normalize=True) * 100
+         for edu, pct in tech_edu.items():
+             print(f"  {edu}: {pct:.1f}%")
+
+
+ def main():
+     """Main function to generate and save enhanced demographic data."""
+
+     # Load existing users data
+     users_path = "datasets/users.csv"
+     if not os.path.exists(users_path):
+         print(f"Error: {users_path} not found!")
+         return
+
+     print(f"Loading users data from {users_path}")
+     users_df = pd.read_csv(users_path)
+
+     print(f"Original data shape: {users_df.shape}")
+     print(f"Original columns: {list(users_df.columns)}")
+
+     # Generate demographic data
+     generator = DemographicDataGenerator(seed=42)
+     enhanced_users = generator.generate_user_demographics(users_df)
+
+     # Print statistics
+     generator.print_demographic_statistics(enhanced_users)
+
+     # Save enhanced data
+     output_path = "datasets/users_enhanced.csv"
+     enhanced_users.to_csv(output_path, index=False)
+     print(f"\nEnhanced users data saved to {output_path}")
+
+     print(f"Enhanced data shape: {enhanced_users.shape}")
+     print(f"New columns: {list(enhanced_users.columns)}")
+
+
+ if __name__ == "__main__":
+     main()
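The generator above follows one recurring pattern: start from base category probabilities, apply multiplicative adjustments for correlated attributes, renormalize, then draw one category. A minimal standalone sketch of that pattern (names and weights here are illustrative, not taken from the repo):

```python
import numpy as np

CATEGORIES = ["Urban", "Suburban", "Rural"]

def sample_location(income: float, rng: np.random.Generator) -> str:
    # Base distribution, overridden by income-correlated adjustments.
    probs = np.array([0.62, 0.27, 0.11])
    if income > 80000:
        probs = np.array([0.45, 0.45, 0.10])  # higher income -> more suburban
    elif income < 35000:
        probs = np.array([0.70, 0.15, 0.15])  # lower income -> more urban/rural
    probs = probs / probs.sum()  # renormalize after any adjustment
    return rng.choice(CATEGORIES, p=probs)

rng = np.random.default_rng(42)
draws = [sample_location(90000, rng) for _ in range(1000)]
# With these weights, "Rural" should be the rarest bucket for high earners.
print({c: draws.count(c) for c in CATEGORIES})
```

The renormalization step matters: after multiplicative boosts the weights no longer sum to 1, and `np.random.choice`/`Generator.choice` raise an error for an unnormalized `p`.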
src/inference/enhanced_recommendation_engine.py DELETED
@@ -1,303 +0,0 @@
- #!/usr/bin/env python3
- """
- Enhanced recommendation engine with category-aware filtering and improved user alignment.
- """
-
- import numpy as np
- import pandas as pd
- from typing import Dict, List, Tuple, Optional
- from collections import Counter
- import random
-
- import sys
- import os
- sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(__file__))))
-
- from src.inference.recommendation_engine import RecommendationEngine
- from src.utils.real_user_selector import RealUserSelector
-
-
- class EnhancedRecommendationEngine(RecommendationEngine):
-     """Enhanced recommendation engine with category-aware improvements."""
-
-     def __init__(self, artifacts_path: str = "src/artifacts/"):
-         super().__init__(artifacts_path)
-         self.real_user_selector = RealUserSelector()
-
-     def _analyze_user_category_preferences(self, interaction_history: List[int]) -> Dict[str, float]:
-         """Analyze user's category preferences from interaction history."""
-
-         if not interaction_history:
-             return {}
-
-         category_counts = Counter()
-
-         for item_id in interaction_history:
-             # Get item category from items dataframe
-             item_row = self.items_df[self.items_df['product_id'] == item_id]
-             if not item_row.empty:
-                 category = item_row.iloc[0].get('category_code', 'Unknown')
-                 category_counts[category] += 1
-
-         # Convert to percentages
-         total_interactions = sum(category_counts.values())
-         if total_interactions == 0:
-             return {}
-
-         category_preferences = {}
-         for category, count in category_counts.items():
-             category_preferences[category] = count / total_interactions
-
-         return category_preferences
-
-     def _boost_category_aligned_recommendations(self,
-                                                 recommendations: List[Tuple[int, float, Dict]],
-                                                 user_category_preferences: Dict[str, float],
-                                                 boost_factor: float = 1.5) -> List[Tuple[int, float, Dict]]:
-         """Boost recommendations that align with user's category preferences."""
-
-         if not user_category_preferences:
-             return recommendations
-
-         boosted_recs = []
-
-         for item_id, score, item_info in recommendations:
-             item_category = item_info.get('category_code', 'Unknown')
-
-             # Apply category boost if user has preference for this category
-             category_preference = user_category_preferences.get(item_category, 0)
-
-             if category_preference > 0:
-                 # Boost score based on user's preference strength
-                 boosted_score = score * (1 + boost_factor * category_preference)
-                 boosted_recs.append((item_id, boosted_score, item_info))
-             else:
-                 boosted_recs.append((item_id, score, item_info))
-
-         # Re-sort by boosted scores
-         boosted_recs.sort(key=lambda x: x[1], reverse=True)
-         return boosted_recs
-
-     def _diversify_recommendations(self,
-                                    recommendations: List[Tuple[int, float, Dict]],
-                                    max_per_category: int = 3) -> List[Tuple[int, float, Dict]]:
-         """Ensure category diversity in recommendations."""
-
-         category_counts = Counter()
-         diversified_recs = []
-
-         for item_id, score, item_info in recommendations:
-             item_category = item_info.get('category_code', 'Unknown')
-
-             if category_counts[item_category] < max_per_category:
-                 diversified_recs.append((item_id, score, item_info))
-                 category_counts[item_category] += 1
-
-         return diversified_recs
-
-     def recommend_items_enhanced_hybrid(self,
-                                         age: int,
-                                         gender: str,
-                                         income: float,
-                                         interaction_history: List[int] = None,
-                                         k: int = 10,
-                                         collaborative_weight: float = 0.7,
-                                         category_boost: float = 1.5,
-                                         enable_category_boost: bool = True,
-                                         enable_diversity: bool = True,
-                                         max_per_category: int = 3) -> List[Tuple[int, float, Dict]]:
-         """Generate enhanced hybrid recommendations with category awareness."""
-
-         # Start with base hybrid recommendations (get more than needed)
-         base_k = k * 3  # Get 3x more candidates for filtering
-
-         base_recommendations = self.recommend_items_hybrid(
-             age=age,
-             gender=gender,
-             income=income,
-             interaction_history=interaction_history,
-             k=base_k,
-             collaborative_weight=collaborative_weight
-         )
-
-         if not base_recommendations:
-             return []
-
-         # Analyze user's category preferences
-         if enable_category_boost and interaction_history:
-             user_category_preferences = self._analyze_user_category_preferences(interaction_history)
-
-             # Apply category-based boosting
-             base_recommendations = self._boost_category_aligned_recommendations(
-                 base_recommendations,
-                 user_category_preferences,
-                 boost_factor=category_boost
-             )
-
-         # Apply diversity filtering if enabled
-         if enable_diversity:
-             base_recommendations = self._diversify_recommendations(
-                 base_recommendations,
-                 max_per_category=max_per_category
-             )
-
-         # Return top k
-         return base_recommendations[:k]
-
-     def recommend_items_category_focused(self,
-                                          age: int,
-                                          gender: str,
-                                          income: float,
-                                          interaction_history: List[int] = None,
-                                          k: int = 10,
-                                          focus_percentage: float = 0.7) -> List[Tuple[int, float, Dict]]:
-         """Generate recommendations focused on user's preferred categories."""
-
-         if not interaction_history:
-             # Fall back to regular hybrid for users without history
-             return self.recommend_items_hybrid(age, gender, income, interaction_history, k)
-
-         # Analyze user preferences
-         user_category_preferences = self._analyze_user_category_preferences(interaction_history)
-
-         if not user_category_preferences:
-             return self.recommend_items_hybrid(age, gender, income, interaction_history, k)
-
-         # Get top categories (sorted by preference)
-         top_categories = sorted(user_category_preferences.items(),
-                                 key=lambda x: x[1], reverse=True)
-
-         # Determine how many recs to focus on preferred categories
-         focused_k = int(k * focus_percentage)
-         exploration_k = k - focused_k
-
-         # Get base recommendations
-         all_recommendations = self.recommend_items_hybrid(
-             age, gender, income, interaction_history, k * 2
-         )
-
-         # Split into focused and exploration recommendations
-         focused_recs = []
-         exploration_recs = []
-
-         # Get user's top 3 categories
-         preferred_categories = set([cat for cat, _ in top_categories[:3]])
-
-         for item_id, score, item_info in all_recommendations:
-             item_category = item_info.get('category_code', 'Unknown')
-
-             if (item_category in preferred_categories and
-                     len(focused_recs) < focused_k):
-                 focused_recs.append((item_id, score, item_info))
-             elif len(exploration_recs) < exploration_k:
-                 exploration_recs.append((item_id, score, item_info))
-
-         # Combine focused and exploration recommendations
-         final_recommendations = focused_recs + exploration_recs
-
-         return final_recommendations[:k]
-
-     def get_recommendation_explanation(self,
-                                        recommendations: List[Tuple[int, float, Dict]],
-                                        interaction_history: List[int] = None) -> Dict:
-         """Provide explanation for why these recommendations were generated."""
-
-         if not recommendations:
-             return {"message": "No recommendations generated"}
-
-         # Analyze recommendation categories
-         rec_categories = Counter()
-         for _, _, item_info in recommendations:
-             category = item_info.get('category_code', 'Unknown')
-             rec_categories[category] += 1
-
-         explanation = {
-             "total_recommendations": len(recommendations),
-             "categories_covered": len(rec_categories),
-             "category_breakdown": dict(rec_categories.most_common())
-         }
-
-         # Add user preference analysis if history available
-         if interaction_history:
-             user_preferences = self._analyze_user_category_preferences(interaction_history)
-
-             # Calculate alignment
-             user_cats = set(user_preferences.keys())
-             rec_cats = set(rec_categories.keys())
-             alignment = len(user_cats & rec_cats) / len(rec_cats) * 100 if rec_cats else 0
-
-             explanation.update({
-                 "user_category_preferences": user_preferences,
-                 "alignment_percentage": round(alignment, 1),
-                 "matched_categories": list(user_cats & rec_cats),
-                 "new_categories": list(rec_cats - user_cats)
-             })
-
-         return explanation
-
-
- def demo_enhanced_recommendations():
-     """Demo the enhanced recommendation engine."""
-
-     print("🚀 ENHANCED RECOMMENDATION ENGINE DEMO")
-     print("="*70)
-
-     # Initialize enhanced engine
-     engine = EnhancedRecommendationEngine()
-
-     # Get a real user for testing
-     real_user_selector = RealUserSelector()
-     test_users = real_user_selector.get_real_users(n=3, min_interactions=15)
-
-     for user in test_users:
-         print(f"\n📊 Testing User {user['user_id']} ({user['age']}yr {user['gender']}):")
-         print(f"   Interaction History: {len(user['interaction_history'])} items")
-
-         # Test different recommendation methods
-         methods = [
-             ("Original Hybrid", lambda: engine.recommend_items_hybrid(
-                 age=user['age'],
-                 gender=user['gender'],
-                 income=user['income'],
-                 interaction_history=user['interaction_history'][:20],
-                 k=10,
-                 collaborative_weight=0.7
-             )),
-             ("Enhanced Hybrid", lambda: engine.recommend_items_enhanced_hybrid(
-                 age=user['age'],
-                 gender=user['gender'],
-                 income=user['income'],
-                 interaction_history=user['interaction_history'][:20],
-                 k=10,
-                 collaborative_weight=0.7,
-                 category_boost=1.5
-             )),
-             ("Category Focused", lambda: engine.recommend_items_category_focused(
-                 age=user['age'],
-                 gender=user['gender'],
-                 income=user['income'],
-                 interaction_history=user['interaction_history'][:20],
-                 k=10,
-                 focus_percentage=0.8
-             ))
-         ]
-
-         for method_name, method_func in methods:
-             try:
-                 recs = method_func()
-                 explanation = engine.get_recommendation_explanation(
-                     recs, user['interaction_history'][:20]
-                 )
-
-                 print(f"\n   🎯 {method_name}:")
-                 print(f"      Categories: {explanation.get('category_breakdown', {})}")
-                 print(f"      Alignment: {explanation.get('alignment_percentage', 'N/A')}%")
-
-             except Exception as e:
-                 print(f"   ❌ Error: {str(e)[:40]}...")
-
-     print(f"\n✅ Enhanced recommendation engine demo completed!")
-
-
- if __name__ == "__main__":
-     demo_enhanced_recommendations()
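Although this file is deleted by the commit, the category-boost re-ranking it implemented survives elsewhere (the UI still offers a "Category Boosted" mode). The core idea can be sketched standalone: score each candidate higher in proportion to the user's historical share of that candidate's category, then re-sort. The tuple shape mirrors the removed module; the data below is illustrative only.

```python
from collections import Counter

def category_preferences(history_categories):
    # Fraction of the user's history falling in each category.
    counts = Counter(history_categories)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}

def boost(recs, prefs, boost_factor=1.5):
    # recs: list of (item_id, score, {"category_code": ...}) tuples.
    boosted = [
        (item_id, score * (1 + boost_factor * prefs.get(info["category_code"], 0.0)), info)
        for item_id, score, info in recs
    ]
    return sorted(boosted, key=lambda r: r[1], reverse=True)

prefs = category_preferences(["electronics", "electronics", "apparel"])
recs = [(1, 0.80, {"category_code": "apparel"}),
        (2, 0.75, {"category_code": "electronics"})]
ranked = boost(recs, prefs)
# Item 2 overtakes item 1 because 2/3 of the history is electronics.
print([item_id for item_id, _, _ in ranked])
```

Note that the boost is multiplicative, so a candidate with zero base score stays at zero regardless of category match.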
src/inference/enhanced_recommendation_engine_128d.py DELETED
@@ -1,499 +0,0 @@
- #!/usr/bin/env python3
- """
- Enhanced recommendation engine using 128D embeddings with diversity regularization.
- """
-
- import numpy as np
- import pandas as pd
- import tensorflow as tf
- import pickle
- import os
- from typing import Dict, List, Tuple, Optional
- from collections import Counter, defaultdict
-
- import sys
- sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(__file__))))
-
- from src.models.enhanced_two_tower import EnhancedItemTower, EnhancedUserTower
- from src.inference.faiss_index import FAISSItemIndex
- from src.preprocessing.data_loader import DataProcessor
- from src.preprocessing.user_data_preparation import prepare_user_features
- from src.utils.real_user_selector import RealUserSelector
-
-
- class Enhanced128DRecommendationEngine:
-     """Enhanced recommendation engine with 128D embeddings and all improvements."""
-
-     def __init__(self, artifacts_path: str = "src/artifacts/"):
-         self.artifacts_path = artifacts_path
-         self.embedding_dim = 128  # Fixed to 128D
-
-         # Model components
-         self.item_tower = None
-         self.user_tower = None
-         self.rating_model = None
-         self.faiss_index = None
-         self.data_processor = None
-
-         # Data
-         self.items_df = None
-         self.users_df = None
-         self.income_thresholds = None
-
-         # Load all components
-         self._load_all_components()
-
-     def _load_all_components(self):
-         """Load all enhanced model components."""
-
-         print("Loading enhanced 128D recommendation engine...")
-
-         # Load data processor
-         self.data_processor = DataProcessor()
-         try:
-             self.data_processor.load_vocabularies(f"{self.artifacts_path}/vocabularies.pkl")
-         except FileNotFoundError:
-             print("❌ Vocabularies not found. Please train the model first.")
-             return
-
-         # Load datasets
-         self.items_df = pd.read_csv("datasets/items.csv")
-         self.users_df = pd.read_csv("datasets/users.csv")
-
-         # Load enhanced model components
-         self._load_enhanced_models()
-
-         # Load FAISS index with 128D
-         try:
-             self.faiss_index = FAISSItemIndex(embedding_dim=self.embedding_dim)
-             # Try to load enhanced embeddings first
-             if os.path.exists(f"{self.artifacts_path}/enhanced_item_embeddings.npy"):
-                 enhanced_embeddings = np.load(
-                     f"{self.artifacts_path}/enhanced_item_embeddings.npy",
-                     allow_pickle=True
-                 ).item()
-                 self.faiss_index.build_index(enhanced_embeddings)
-                 print("✅ Loaded enhanced 128D FAISS index")
-             else:
-                 print("⚠️ Enhanced embeddings not found. Train enhanced model first.")
-                 self.faiss_index = None
-         except Exception as e:
-             print(f"⚠️ Could not load FAISS index: {e}")
-             self.faiss_index = None
-
-         # Load income thresholds for categorical demographics
-         self._load_income_thresholds()
-
-         print("✅ Enhanced 128D engine loaded successfully!")
-
-     def _load_enhanced_models(self):
-         """Load enhanced model components."""
-
-         try:
-             # Create model architecture
-             self.item_tower = EnhancedItemTower(
-                 item_vocab_size=len(self.data_processor.item_vocab),
-                 category_vocab_size=len(self.data_processor.category_vocab),
-                 brand_vocab_size=len(self.data_processor.brand_vocab),
-                 embedding_dim=self.embedding_dim,
-                 use_bias=True,
-                 use_diversity_reg=False  # Disable during inference
-             )
-
-             self.user_tower = EnhancedUserTower(
-                 max_history_length=50,
-                 embedding_dim=self.embedding_dim,
-                 use_bias=True,
-                 use_diversity_reg=False  # Disable during inference
-             )
-
-             # Create rating model
-             self.rating_model = tf.keras.Sequential([
-                 tf.keras.layers.Dense(512, activation="relu"),
-                 tf.keras.layers.BatchNormalization(),
-                 tf.keras.layers.Dropout(0.3),
-                 tf.keras.layers.Dense(256, activation="relu"),
-                 tf.keras.layers.BatchNormalization(),
-                 tf.keras.layers.Dropout(0.2),
-                 tf.keras.layers.Dense(64, activation="relu"),
-                 tf.keras.layers.Dense(1, activation="sigmoid")
-             ])
-
-             # Load weights - try enhanced first, fall back to regular
-             model_files = [
-                 ('enhanced_item_tower_weights_enhanced_best', 'enhanced_user_tower_weights_enhanced_best', 'enhanced_rating_model_weights_enhanced_best'),
-                 ('enhanced_item_tower_weights_enhanced_final', 'enhanced_user_tower_weights_enhanced_final', 'enhanced_rating_model_weights_enhanced_final'),
-             ]
-
-             loaded = False
-             for item_file, user_file, rating_file in model_files:
-                 try:
-                     # Need to build models first with dummy data
-                     self._build_models()
-
-                     self.item_tower.load_weights(f"{self.artifacts_path}/{item_file}")
-                     self.user_tower.load_weights(f"{self.artifacts_path}/{user_file}")
-                     self.rating_model.load_weights(f"{self.artifacts_path}/{rating_file}")
-
-                     print(f"✅ Loaded enhanced model: {item_file}")
-                     loaded = True
-                     break
-                 except Exception as e:
-                     print(f"⚠️ Could not load {item_file}: {e}")
-                     continue
-
-             if not loaded:
-                 print("❌ No enhanced model weights found. Please train enhanced model first.")
-                 self.item_tower = None
-                 self.user_tower = None
-                 self.rating_model = None
-
-         except Exception as e:
-             print(f"❌ Failed to load enhanced models: {e}")
-             self.item_tower = None
-             self.user_tower = None
-             self.rating_model = None
-
-     def _build_models(self):
-         """Build models with dummy data to initialize weights."""
-
-         # Dummy item features
-         dummy_item_features = {
-             'product_id': tf.constant([0]),
-             'category_id': tf.constant([0]),
-             'brand_id': tf.constant([0]),
-             'price': tf.constant([100.0])
-         }
-
-         # Dummy user features
-         dummy_user_features = {
-             'age': tf.constant([2]),  # Adult category
-             'gender': tf.constant([0]),  # Female
-             'income': tf.constant([2]),  # Middle income
-             'item_history_embeddings': tf.constant(np.zeros((1, 50, self.embedding_dim), dtype=np.float32))
-         }
-
-         # Forward pass to build models
-         _ = self.item_tower(dummy_item_features, training=False)
-         _ = self.user_tower(dummy_user_features, training=False)
-
-         # Build rating model
-         dummy_concat = tf.constant(np.zeros((1, self.embedding_dim * 2), dtype=np.float32))
-         _ = self.rating_model(dummy_concat, training=False)
-
-     def _load_income_thresholds(self):
185
- """Load income thresholds for categorical processing."""
186
-
187
- # Calculate income thresholds from training data
188
- user_incomes = self.users_df['income'].values
189
- self.income_thresholds = np.percentile(user_incomes, [0, 20, 40, 60, 80, 100])
190
- print(f"Income thresholds: {self.income_thresholds}")
191
-
192
- def categorize_age(self, age: float) -> int:
193
- """Categorize age into 6 groups."""
194
- if age < 18: return 0 # Teen
195
- elif age < 26: return 1 # Young Adult
196
- elif age < 36: return 2 # Adult
197
- elif age < 51: return 3 # Middle Age
198
- elif age < 66: return 4 # Mature
199
- else: return 5 # Senior
200
-
201
- def categorize_income(self, income: float) -> int:
202
- """Categorize income into 5 percentile groups."""
203
- category = np.digitize([income], self.income_thresholds[1:-1])[0]
204
- return min(max(category, 0), 4)
205
-
206
- def categorize_gender(self, gender: str) -> int:
207
- """Categorize gender."""
208
- return 1 if gender.lower() == 'male' else 0
209
-
210
- def get_user_embedding(self,
211
- age: int,
212
- gender: str,
213
- income: float,
214
- interaction_history: List[int] = None) -> np.ndarray:
215
- """Generate user embedding with categorical demographics."""
216
-
217
- if self.user_tower is None:
218
- print("❌ User tower not loaded")
219
- return None
220
-
221
- # Categorize demographics
222
- age_cat = self.categorize_age(age)
223
- gender_cat = self.categorize_gender(gender)
224
- income_cat = self.categorize_income(income)
225
-
226
- # Prepare interaction history embeddings
227
- if interaction_history is None:
228
- interaction_history = []
229
-
230
- # Get item embeddings for history
231
- history_embeddings = np.zeros((50, self.embedding_dim), dtype=np.float32)
232
-
233
- for i, item_id in enumerate(interaction_history[:50]):
234
- if self.faiss_index and item_id in self.faiss_index.item_id_to_idx:
235
- item_emb = self.faiss_index.get_item_embedding(item_id)
236
- if item_emb is not None:
237
- history_embeddings[i] = item_emb
238
-
239
- # Create user features
240
- user_features = {
241
- 'age': tf.constant([age_cat]),
242
- 'gender': tf.constant([gender_cat]),
243
- 'income': tf.constant([income_cat]),
244
- 'item_history_embeddings': tf.constant([history_embeddings])
245
- }
246
-
247
- # Get embedding
248
- user_output = self.user_tower(user_features, training=False)
249
- if isinstance(user_output, tuple):
250
- user_embedding = user_output[0].numpy()[0]
251
- else:
252
- user_embedding = user_output.numpy()[0]
253
-
254
- return user_embedding
255
-
256
- def get_item_embedding(self, item_id: int) -> Optional[np.ndarray]:
257
- """Get item embedding."""
258
-
259
- if self.faiss_index:
260
- return self.faiss_index.get_item_embedding(item_id)
261
-
262
- # Fallback to model computation
263
- if self.item_tower is None:
264
- return None
265
-
266
- item_row = self.items_df[self.items_df['product_id'] == item_id]
267
- if item_row.empty:
268
- return None
269
-
270
- item_data = item_row.iloc[0]
271
-
272
- # Prepare features
273
- item_features = {
274
- 'product_id': tf.constant([self.data_processor.item_vocab.get(item_id, 0)]),
275
- 'category_id': tf.constant([self.data_processor.category_vocab.get(item_data['category_id'], 0)]),
276
- 'brand_id': tf.constant([self.data_processor.brand_vocab.get(item_data.get('brand', 'unknown'), 0)]),
277
- 'price': tf.constant([float(item_data.get('price', 0.0))])
278
- }
279
-
280
- # Get embedding
281
- item_output = self.item_tower(item_features, training=False)
282
- if isinstance(item_output, tuple):
283
- item_embedding = item_output[0].numpy()[0]
284
- else:
285
- item_embedding = item_output.numpy()[0]
286
-
287
- return item_embedding
288
-
289
- def recommend_items_enhanced(self,
290
- age: int,
291
- gender: str,
292
- income: float,
293
- interaction_history: List[int] = None,
294
- k: int = 10,
295
- diversity_weight: float = 0.3,
296
- category_boost: float = 1.5) -> List[Tuple[int, float, Dict]]:
297
- """Generate enhanced recommendations with diversity and category boosting."""
298
-
299
- if not self.faiss_index:
300
- print("❌ FAISS index not available")
301
- return []
302
-
303
- # Get user embedding
304
- user_embedding = self.get_user_embedding(age, gender, income, interaction_history)
305
- if user_embedding is None:
306
- return []
307
-
308
- # Get candidate recommendations (more than needed for filtering)
309
- candidates = self.faiss_index.search_by_embedding(user_embedding, k * 3)
310
-
311
- # Filter out items from interaction history
312
- if interaction_history:
313
- history_set = set(interaction_history)
314
- candidates = [(item_id, score) for item_id, score in candidates
315
- if item_id not in history_set]
316
-
317
- # Add item metadata and apply enhancements
318
- enhanced_candidates = []
319
-
320
- for item_id, similarity_score in candidates[:k * 2]:
321
- # Get item info
322
- item_row = self.items_df[self.items_df['product_id'] == item_id]
323
- if item_row.empty:
324
- continue
325
-
326
- item_info = item_row.iloc[0].to_dict()
327
-
328
- # Enhanced scoring with multiple factors
329
- final_score = similarity_score
330
-
331
- # Category boosting based on user history
332
- if interaction_history and category_boost > 1.0:
333
- user_categories = self._get_user_categories(interaction_history)
334
- item_category = item_info.get('category_code', '')
335
-
336
- if item_category in user_categories:
337
- category_preference = user_categories[item_category]
338
- final_score *= (1 + (category_boost - 1) * category_preference)
339
-
340
- enhanced_candidates.append((item_id, final_score, item_info))
341
-
342
- # Sort by enhanced scores
343
- enhanced_candidates.sort(key=lambda x: x[1], reverse=True)
344
-
345
- # Apply diversity filtering
346
- if diversity_weight > 0:
347
- diversified_candidates = self._apply_diversity_filter(
348
- enhanced_candidates, diversity_weight
349
- )
350
- else:
351
- diversified_candidates = enhanced_candidates
352
-
353
- return diversified_candidates[:k]
354
-
355
- def _get_user_categories(self, interaction_history: List[int]) -> Dict[str, float]:
356
- """Get user's category preferences from history."""
357
-
358
- category_counts = Counter()
359
-
360
- for item_id in interaction_history:
361
- item_row = self.items_df[self.items_df['product_id'] == item_id]
362
- if not item_row.empty:
363
- category = item_row.iloc[0].get('category_code', 'Unknown')
364
- category_counts[category] += 1
365
-
366
- # Convert to preferences (percentages)
367
- total = sum(category_counts.values())
368
- if total == 0:
369
- return {}
370
-
371
- return {cat: count / total for cat, count in category_counts.items()}
372
-
373
- def _apply_diversity_filter(self,
374
- candidates: List[Tuple[int, float, Dict]],
375
- diversity_weight: float,
376
- max_per_category: int = 3) -> List[Tuple[int, float, Dict]]:
377
- """Apply diversity filtering to recommendations."""
378
-
379
- category_counts = defaultdict(int)
380
- diversified = []
381
-
382
- for item_id, score, item_info in candidates:
383
- category = item_info.get('category_code', 'Unknown')
384
-
385
- # Apply diversity penalty
386
- if category_counts[category] >= max_per_category:
387
- # Penalty for over-representation
388
- diversity_penalty = diversity_weight * (category_counts[category] - max_per_category + 1)
389
- adjusted_score = score * (1 - diversity_penalty)
390
- else:
391
- adjusted_score = score
392
-
393
- diversified.append((item_id, adjusted_score, item_info))
394
- category_counts[category] += 1
395
-
396
- # Re-sort by adjusted scores
397
- diversified.sort(key=lambda x: x[1], reverse=True)
398
- return diversified
399
-
400
- def predict_rating(self,
401
- age: int,
402
- gender: str,
403
- income: float,
404
- item_id: int,
405
- interaction_history: List[int] = None) -> float:
406
- """Predict rating for user-item pair."""
407
-
408
- if self.rating_model is None:
409
- return 0.5 # Default rating
410
-
411
- # Get embeddings
412
- user_embedding = self.get_user_embedding(age, gender, income, interaction_history)
413
- item_embedding = self.get_item_embedding(item_id)
414
-
415
- if user_embedding is None or item_embedding is None:
416
- return 0.5
417
-
418
- # Concatenate embeddings
419
- combined = np.concatenate([user_embedding, item_embedding])
420
- combined = tf.constant([combined])
421
-
422
- # Predict rating
423
- rating = self.rating_model(combined, training=False)
424
- return float(rating.numpy()[0][0])
425
-
426
-
427
- def demo_enhanced_engine():
428
- """Demo the enhanced 128D recommendation engine."""
429
-
430
- print("πŸš€ ENHANCED 128D RECOMMENDATION ENGINE DEMO")
431
- print("="*70)
432
-
433
- try:
434
- # Initialize engine
435
- engine = Enhanced128DRecommendationEngine()
436
-
437
- if engine.item_tower is None:
438
- print("❌ Enhanced model not available. Please train first using:")
439
- print(" python train_enhanced_model.py")
440
- return
441
-
442
- # Get real user for testing
443
- real_user_selector = RealUserSelector()
444
- test_users = real_user_selector.get_real_users(n=2, min_interactions=10)
445
-
446
- for user in test_users:
447
- print(f"\nπŸ“Š Testing User {user['user_id']} ({user['age']}yr {user['gender']}):")
448
- print(f" Income: ${user['income']:,}")
449
- print(f" History: {len(user['interaction_history'])} items")
450
-
451
- # Test enhanced recommendations
452
- try:
453
- recs = engine.recommend_items_enhanced(
454
- age=user['age'],
455
- gender=user['gender'],
456
- income=user['income'],
457
- interaction_history=user['interaction_history'][:20],
458
- k=10,
459
- diversity_weight=0.3,
460
- category_boost=1.5
461
- )
462
-
463
- print(f" 🎯 Enhanced Recommendations:")
464
- categories = []
465
- for i, (item_id, score, item_info) in enumerate(recs[:5]):
466
- category = item_info.get('category_code', 'Unknown')[:30]
467
- price = item_info.get('price', 0)
468
- categories.append(category)
469
- print(f" #{i+1} Item {item_id}: {score:.4f} | ${price:.2f} | {category}")
470
-
471
- # Analyze diversity
472
- unique_categories = len(set(categories))
473
- print(f" πŸ“ˆ Diversity: {unique_categories}/{len(categories)} unique categories")
474
-
475
- # Test rating prediction
476
- if recs:
477
- test_item = recs[0][0]
478
- predicted_rating = engine.predict_rating(
479
- age=user['age'],
480
- gender=user['gender'],
481
- income=user['income'],
482
- item_id=test_item,
483
- interaction_history=user['interaction_history'][:20]
484
- )
485
- print(f" ⭐ Rating prediction for item {test_item}: {predicted_rating:.3f}")
486
-
487
- except Exception as e:
488
- print(f" ❌ Error: {e}")
489
-
490
- print(f"\nβœ… Enhanced 128D engine demo completed!")
491
-
492
- except Exception as e:
493
- print(f"❌ Demo failed: {e}")
494
- import traceback
495
- traceback.print_exc()
496
-
497
-
498
- if __name__ == "__main__":
499
- demo_enhanced_engine()
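The percentile bucketing used by `_load_income_thresholds` and `categorize_income` above can be reproduced in isolation. This is a minimal sketch: the `incomes` array is hypothetical stand-in data, not the project's `users.csv`.

```python
import numpy as np

# Hypothetical stand-in for users_df['income'] (the real engine reads users.csv).
incomes = np.array([18000, 25000, 32000, 41000, 55000, 68000, 80000, 95000, 120000, 250000])

# Quintile boundaries, as in _load_income_thresholds.
thresholds = np.percentile(incomes, [0, 20, 40, 60, 80, 100])

def categorize_income(income: float) -> int:
    """Map an income to one of 5 percentile buckets (0-4), mirroring categorize_income."""
    # digitize against the 4 interior boundaries; clamp to the valid bucket range
    category = np.digitize([income], thresholds[1:-1])[0]
    return int(min(max(category, 0), 4))

print(categorize_income(20000), categorize_income(300000))  # β†’ 0 4
```

Dropping the outer `[0, 100]` percentiles before `np.digitize` leaves exactly 4 interior boundaries, so the returned bucket is always in `0..4` even for incomes outside the training range.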
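The over-representation penalty in `_apply_diversity_filter` above can be sketched standalone. The candidate tuples here are toy data (hypothetical item IDs, scores, and categories), not real index output.

```python
from collections import defaultdict

# Toy candidates: (item_id, score, category), already sorted by score.
candidates = [
    (1, 0.95, "electronics"), (2, 0.94, "electronics"), (3, 0.93, "electronics"),
    (4, 0.92, "electronics"), (5, 0.80, "kitchen"),
]
diversity_weight, max_per_category = 0.3, 3

category_counts = defaultdict(int)
rescored = []
for item_id, score, category in candidates:
    if category_counts[category] >= max_per_category:
        # Each extra item from an over-represented category is penalized harder.
        penalty = diversity_weight * (category_counts[category] - max_per_category + 1)
        score = score * (1 - penalty)
    rescored.append((item_id, score, category))
    category_counts[category] += 1

rescored.sort(key=lambda x: x[1], reverse=True)
print([item_id for item_id, _, _ in rescored])  # β†’ [1, 2, 3, 5, 4]
```

The fourth electronics item drops from 0.92 to 0.92 Γ— (1 βˆ’ 0.3) = 0.644, letting the kitchen item overtake it after the re-sort.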
src/inference/recommendation_engine.py CHANGED
@@ -61,6 +61,50 @@ class RecommendationEngine:
         category = np.digitize([income], self.income_thresholds[1:-1])[0]
         return min(max(category, 0), 4)
 
+    def categorize_profession(self, profession: str) -> int:
+        """Categorize profession into numeric categories."""
+        profession_map = {
+            "Technology": 0,
+            "Healthcare": 1,
+            "Education": 2,
+            "Finance": 3,
+            "Retail": 4,
+            "Manufacturing": 5,
+            "Services": 6,
+            "Other": 7
+        }
+        return profession_map.get(profession, 7)  # Default to "Other"
+
+    def categorize_location(self, location: str) -> int:
+        """Categorize location into numeric categories."""
+        location_map = {
+            "Urban": 0,
+            "Suburban": 1,
+            "Rural": 2
+        }
+        return location_map.get(location, 0)  # Default to "Urban"
+
+    def categorize_education_level(self, education: str) -> int:
+        """Categorize education level into numeric categories."""
+        education_map = {
+            "High School": 0,
+            "Some College": 1,
+            "Bachelor's": 2,
+            "Master's": 3,
+            "PhD+": 4
+        }
+        return education_map.get(education, 0)  # Default to "High School"
+
+    def categorize_marital_status(self, marital_status: str) -> int:
+        """Categorize marital status into numeric categories."""
+        marital_map = {
+            "Single": 0,
+            "Married": 1,
+            "Divorced": 2,
+            "Widowed": 3
+        }
+        return marital_map.get(marital_status, 0)  # Default to "Single"
+
     def _load_all_components(self):
         """Load all required components for inference."""
 
@@ -139,6 +183,10 @@
             'age': tf.constant([2]),  # Adult category (26-35)
             'gender': tf.constant([1]),  # Male
             'income': tf.constant([2]),  # Middle income category
+            'profession': tf.constant([0]),  # Technology
+            'location': tf.constant([0]),  # Urban
+            'education_level': tf.constant([2]),  # Bachelor's
+            'marital_status': tf.constant([1]),  # Married
             'item_history_embeddings': tf.constant([[[0.0] * 128] * 50])  # Changed from 64 to 128
         }
         _ = self.user_tower(dummy_input)
@@ -195,6 +243,10 @@
                               age: int,
                               gender: str,
                               income: float,
+                              profession: str = "Other",
+                              location: str = "Urban",
+                              education_level: str = "High School",
+                              marital_status: str = "Single",
                               interaction_history: List[int] = None) -> Dict[str, tf.Tensor]:
         """Prepare user features for inference."""
 
@@ -204,9 +256,13 @@
         # Convert gender
         gender_numeric = 1 if gender.lower() == 'male' else 0
 
-        # Categorize age and income
+        # Categorize all demographics
         age_category = self.categorize_age(age)
         income_category = self.categorize_income(income)
+        profession_category = self.categorize_profession(profession)
+        location_category = self.categorize_location(location)
+        education_category = self.categorize_education_level(education_level)
+        marital_category = self.categorize_marital_status(marital_status)
 
         # Get item embeddings for history
         history_embeddings = []
@@ -235,6 +291,10 @@
             'age': tf.constant([age_category]),  # Categorical age (0-5)
             'gender': tf.constant([gender_numeric]),  # Categorical gender (0-1)
             'income': tf.constant([income_category]),  # Categorical income (0-4)
+            'profession': tf.constant([profession_category]),  # Categorical profession (0-7)
+            'location': tf.constant([location_category]),  # Categorical location (0-2)
+            'education_level': tf.constant([education_category]),  # Categorical education (0-4)
+            'marital_status': tf.constant([marital_category]),  # Categorical marital status (0-3)
             'item_history_embeddings': tf.constant([history_embeddings])
         }
 
@@ -275,14 +335,77 @@
                            age: int,
                            gender: str,
                            income: float,
+                           profession: str = "Other",
+                           location: str = "Urban",
+                           education_level: str = "High School",
+                           marital_status: str = "Single",
                            interaction_history: List[int] = None) -> np.ndarray:
         """Get user embedding from user tower."""
 
-        user_features = self.prepare_user_features(age, gender, income, interaction_history)
+        user_features = self.prepare_user_features(age, gender, income, profession, location, education_level, marital_status, interaction_history)
         user_embedding = self.user_tower(user_features, training=False)
 
         return user_embedding.numpy()[0]
 
+    def get_user_embedding_enhanced(self,
+                                    age: int,
+                                    gender: str,
+                                    income: float,
+                                    profession: str = "Other",
+                                    location: str = "Urban",
+                                    education_level: str = "High School",
+                                    marital_status: str = "Single",
+                                    interaction_history: List[int] = None) -> np.ndarray:
+        """Enhanced user embedding that handles zero interactions better."""
+
+        # Get base embedding
+        base_embedding = self.get_user_embedding(
+            age, gender, income, profession, location, education_level, marital_status, interaction_history
+        )
+
+        # Check if this is a zero-interaction user
+        has_interactions = interaction_history and len(interaction_history) > 0
+
+        if not has_interactions:
+            # For zero interactions, amplify the demographic component
+            # This is a heuristic fix until we retrain the model
+
+            # Create demographic-enhanced embedding
+            demographic_mask = np.ones_like(base_embedding)
+
+            # Amplify first 50% of dimensions (likely demographic-influenced)
+            mid_point = len(base_embedding) // 2
+            demographic_mask[:mid_point] *= 3.0  # Strong amplification
+
+            # Reduce influence of latter dimensions (likely history-influenced)
+            demographic_mask[mid_point:] *= 0.2  # Strong reduction
+
+            enhanced_embedding = base_embedding * demographic_mask
+
+            # Add demographic-specific variation to differentiate profiles
+            demographic_hash = (
+                age * 1000 +
+                (1 if gender.lower() == 'male' else 0) * 100 +
+                int(income / 10000) * 10 +
+                self.categorize_profession(profession) * 7 +
+                self.categorize_location(location) * 3 +
+                self.categorize_education_level(education_level) * 5 +
+                self.categorize_marital_status(marital_status) * 2
+            )
+
+            np.random.seed(demographic_hash % 2**32)  # Reproducible noise
+            demographic_noise = np.random.normal(0, 0.02, base_embedding.shape)  # Increased noise
+            enhanced_embedding += demographic_noise
+
+            # Renormalize
+            enhanced_embedding = enhanced_embedding / np.linalg.norm(enhanced_embedding)
+
+            print(f"Enhanced embedding for zero interactions: age={age}, gender={gender}, profession={profession}")
+
+            return enhanced_embedding.astype(np.float32)
+
+        return base_embedding
+
     def get_item_embedding(self, item_id: int) -> Optional[np.ndarray]:
         """Get item embedding from FAISS index or item tower."""
 
@@ -301,14 +424,18 @@
                                       age: int,
                                       gender: str,
                                       income: float,
+                                      profession: str = "Other",
+                                      location: str = "Urban",
+                                      education_level: str = "High School",
+                                      marital_status: str = "Single",
                                       interaction_history: List[int] = None,
                                       k: int = 10,
                                       exclude_history: bool = True,
                                       category_boost: float = 1.3) -> List[Tuple[int, float, Dict]]:
         """Generate recommendations using collaborative filtering with category awareness."""
 
-        # Get user embedding
-        user_embedding = self.get_user_embedding(age, gender, income, interaction_history)
+        # Get enhanced user embedding (better for zero interactions)
+        user_embedding = self.get_user_embedding_enhanced(age, gender, income, profession, location, education_level, marital_status, interaction_history)
 
         # Find similar items using FAISS (get more candidates for boosting)
         similar_items = self.faiss_index.search_by_embedding(user_embedding, k * 4)
@@ -354,6 +481,153 @@
 
         return recommendations
 
+    def _aggregate_user_history_embedding(self,
+                                          interaction_history: List[int],
+                                          aggregation_method: str = "weighted_mean") -> Optional[np.ndarray]:
+        """Aggregate user's interaction history into a single embedding vector."""
+
+        if not interaction_history:
+            return None
+
+        # Get embeddings for items in history
+        item_embeddings = []
+        valid_items = []
+
+        for item_id in interaction_history:
+            embedding = self.faiss_index.get_item_embedding(item_id)
+            if embedding is not None:
+                item_embeddings.append(embedding)
+                valid_items.append(item_id)
+
+        if not item_embeddings:
+            print(f"No valid embeddings found for interaction history: {interaction_history}")
+            return None
+
+        item_embeddings = np.array(item_embeddings)
+        print(f"Aggregating {len(item_embeddings)} item embeddings using {aggregation_method}")
+
+        # Apply aggregation method
+        if aggregation_method == "mean":
+            # Simple mean pooling
+            aggregated = np.mean(item_embeddings, axis=0)
+
+        elif aggregation_method == "weighted_mean":
+            # Weight recent interactions higher (exponential decay)
+            weights = np.exp(np.linspace(-1, 0, len(item_embeddings)))  # More recent = higher weight
+            weights = weights / np.sum(weights)  # Normalize weights
+            aggregated = np.average(item_embeddings, axis=0, weights=weights)
+            print(f"Applied weighted mean with weights: {weights[-3:]} (showing last 3)")
+
+        elif aggregation_method == "max":
+            # Element-wise maximum pooling
+            aggregated = np.max(item_embeddings, axis=0)
+
+        else:
+            raise ValueError(f"Unknown aggregation method: {aggregation_method}")
+
+        # L2 normalize the aggregated embedding
+        aggregated = aggregated / np.linalg.norm(aggregated)
+
+        return aggregated.astype('float32')
+
+    def recommend_items_content_based_from_history(self,
+                                                   interaction_history: List[int],
+                                                   k: int = 10,
+                                                   aggregation_method: str = "weighted_mean",
+                                                   same_category_ratio: float = None) -> List[Tuple[int, float, Dict]]:
+        """Generate recommendations using content-based filtering from aggregated user history."""
+
+        # Aggregate user's interaction history
+        aggregated_embedding = self._aggregate_user_history_embedding(
+            interaction_history, aggregation_method
+        )
+
+        if aggregated_embedding is None:
+            print("Could not create aggregated embedding from interaction history")
+            return []
+
+        if same_category_ratio is None:
+            # Direct ANN search with aggregated embedding
+            similar_items = self.faiss_index.search_by_embedding(aggregated_embedding, k)
+            recommendations = []
+
+            # Filter out items already in interaction history
+            interaction_set = set(interaction_history)
+
+            for item_id, score in similar_items:
+                if item_id not in interaction_set:  # Exclude already interacted items
+                    item_info = self._get_item_info(item_id)
+                    recommendations.append((item_id, score, item_info))
+
+                if len(recommendations) >= k:
+                    break
+
+            print(f"Found {len(recommendations)} content-based recommendations from aggregated history")
+            return recommendations
+
+        else:
+            # Category-aware approach with aggregated embedding
+            print(f"Finding similar items with {same_category_ratio*100}% category constraint from aggregated history")
+
+            # Analyze user's category preferences from interaction history
+            user_categories = {}
+            total_interactions = len(interaction_history)
+
+            for item_id in interaction_history:
+                item_info = self._get_item_info(item_id)
+                category = item_info.get('category_code', '')
+                if category:
+                    user_categories[category] = user_categories.get(category, 0) + 1
+
+            # Convert to percentages
+            for category in user_categories:
+                user_categories[category] = user_categories[category] / total_interactions
+
+            print(f"User category preferences: {user_categories}")
+
+            # Get more candidates for category filtering
+            candidate_items = self.faiss_index.search_by_embedding(aggregated_embedding, k * 3)
+            interaction_set = set(interaction_history)
+
+            # Separate by category alignment with user preferences
+            preferred_category_items = []
+            other_category_items = []
+
+            for item_id, score in candidate_items:
+                if item_id in interaction_set:
+                    continue  # Skip already interacted items
+
+                item_info = self._get_item_info(item_id)
+                item_category = item_info.get('category_code', '')
+
+                # Check if item category matches user's preferred categories
+                if item_category in user_categories:
+                    preferred_category_items.append((item_id, score, item_info))
+                else:
+                    other_category_items.append((item_id, score, item_info))
+
+            # Calculate target distribution
+            preferred_count = int(k * same_category_ratio)
+            other_count = k - preferred_count
+
+            print(f"Target: {preferred_count} from preferred categories, {other_count} for exploration")
+
+            # Build balanced recommendations
+            recommendations = []
+            recommendations.extend(preferred_category_items[:preferred_count])
+            recommendations.extend(other_category_items[:other_count])
+
+            # Fill remaining slots with best available items
+            if len(recommendations) < k:
     def recommend_items_content_based(self,
                                       seed_item_id: int,
                                       k: int = 10,
@@ -431,6 +705,10 @@
                                age: int,
                                gender: str,
                                income: float,
                                interaction_history: List[int] = None,
                                k: int = 10,
                                collaborative_weight: float = 0.7) -> List[Tuple[int, float, Dict]]:
@@ -438,15 +716,16 @@
 
         # Get collaborative recommendations
         collab_recs = self.recommend_items_collaborative(
-            age, gender, income, interaction_history, k * 2
         )
 
-        # Get content-based recommendations from recent interactions
         content_recs = []
         if interaction_history:
-            # Use most recent item as seed
-            recent_item = interaction_history[-1]
-            content_recs = self.recommend_items_content_based(recent_item, k)
 
         # Combine recommendations with weighted scores
         item_scores = {}
@@ -484,6 +763,254 @@
 
         return hybrid_recommendations[:k]
 
     def _get_item_info(self, item_id: int) -> Dict:
         """Get item metadata."""
 
@@ -512,6 +1039,10 @@
                        gender: str,
                        income: float,
                        item_id: int,
                        interaction_history: List[int] = None) -> float:
         """Predict rating for a specific user-item pair."""
 
@@ -519,7 +1050,7 @@
             return 0.5  # Default prediction
 
         # Prepare user features
-        user_features = self.prepare_user_features(age, gender, income, interaction_history)
 
         # Prepare item features
         if item_id not in self.data_processor.item_vocab:
@@ -552,6 +1083,10 @@ def main():
         'age': 32,
         'gender': 'male',
         'income': 75000,
         'interaction_history': [1000978, 1001588, 1001618]  # Sample item IDs
     }
 
@@ -559,20 +1094,28 @@
     print(f"Age: {demo_user['age']}")
     print(f"Gender: {demo_user['gender']}")
     print(f"Income: ${demo_user['income']:,}")
     print(f"Interaction history: {demo_user['interaction_history']}")
 
     # Generate collaborative recommendations
-    print("\n=== Collaborative Filtering Recommendations ===")
-    collab_recs = engine.recommend_items_collaborative(**demo_user, k=5)
 
     for i, (item_id, score, info) in enumerate(collab_recs, 1):
         print(f"{i}. Item {item_id}: {info['brand']} - ${info['price']:.2f} (Score: {score:.4f})")
 
-    # Generate content-based recommendations
-    print("\n=== Content-Based Recommendations (similar to recent item) ===")
    if demo_user['interaction_history']:
-        content_recs = engine.recommend_items_content_based(
-            seed_item_id=demo_user['interaction_history'][-1], k=5
        )
 
        for i, (item_id, score, info) in enumerate(content_recs, 1):
@@ -580,7 +1123,9 @@
 
     # Generate hybrid recommendations
     print("\n=== Hybrid Recommendations ===")
-    hybrid_recs = engine.recommend_items_hybrid(**demo_user, k=5)
 
     for i, (item_id, score, info) in enumerate(hybrid_recs, 1):
         print(f"{i}. Item {item_id}: {info['brand']} - ${info['price']:.2f} (Score: {score:.4f})")
622
+ remaining_items = (preferred_category_items[preferred_count:] +
623
+ other_category_items[other_count:])
624
+ remaining_items.sort(key=lambda x: x[1], reverse=True) # Sort by score
625
+ needed = k - len(recommendations)
626
+ recommendations.extend(remaining_items[:needed])
627
+
628
+ print(f"Final recommendations: {len(recommendations)} items")
629
+ return recommendations[:k]
630
+
631
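The `same_category_ratio` branch above boils down to a quota split over ranked candidates. A minimal sketch of just that split (the tuple layout and category set are illustrative, not the repo's types):

```python
from typing import List, Set, Tuple

Candidate = Tuple[int, float, str]  # (item_id, score, category)

def split_by_ratio(candidates: List[Candidate], preferred: Set[str],
                   k: int, ratio: float) -> List[Candidate]:
    """Take ~ratio*k items from preferred categories, the rest for exploration."""
    pref = [c for c in candidates if c[2] in preferred]
    other = [c for c in candidates if c[2] not in preferred]
    n_pref = int(k * ratio)
    picks = pref[:n_pref] + other[:k - n_pref]
    if len(picks) < k:
        # Backfill from whatever is left, best score first
        leftovers = pref[n_pref:] + other[k - n_pref:]
        leftovers.sort(key=lambda c: c[1], reverse=True)
        picks += leftovers[:k - len(picks)]
    return picks[:k]
```

Because candidates arrive pre-ranked from the ANN search, slicing each bucket preserves score order within the quota.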
     def recommend_items_content_based(self,
                                       seed_item_id: int,
                                       k: int = 10,
 
                                 age: int,
                                 gender: str,
                                 income: float,
+                                profession: str = "Other",
+                                location: str = "Urban",
+                                education_level: str = "High School",
+                                marital_status: str = "Single",
                                 interaction_history: List[int] = None,
                                 k: int = 10,
                                 collaborative_weight: float = 0.7) -> List[Tuple[int, float, Dict]]:
 
         # Get collaborative recommendations
         collab_recs = self.recommend_items_collaborative(
+            age, gender, income, profession, location, education_level, marital_status, interaction_history, k * 2
         )
 
+        # Get content-based recommendations from aggregated user history
         content_recs = []
         if interaction_history:
+            # Use aggregated history embedding instead of single recent item
+            content_recs = self.recommend_items_content_based_from_history(
+                interaction_history, k, aggregation_method="weighted_mean"
+            )
 
         # Combine recommendations with weighted scores
         item_scores = {}
 
         return hybrid_recommendations[:k]
 
+    def recommend_items_category_boosted(self,
+                                         age: int,
+                                         gender: str,
+                                         income: float,
+                                         profession: str = "Other",
+                                         location: str = "Urban",
+                                         education_level: str = "High School",
+                                         marital_status: str = "Single",
+                                         interaction_history: List[int] = None,
+                                         k: int = 10,
+                                         exclude_history: bool = True) -> List[Tuple[int, float, Dict]]:
+        """Generate category-boosted recommendations ensuring 50% from user's interacted categories."""
+
+        if not interaction_history or len(interaction_history) == 0:
+            # Fallback to collaborative filtering if no interaction history
+            return self.recommend_items_collaborative(
+                age, gender, income, profession, location, education_level, marital_status,
+                interaction_history, k, exclude_history
+            )
+
+        # Step 1: Calculate category percentages from interaction history
+        category_percentages = self._calculate_category_percentages(interaction_history)
+
+        if not category_percentages:
+            # Fallback if no categories found
+            return self.recommend_items_collaborative(
+                age, gender, income, profession, location, education_level, marital_status,
+                interaction_history, k, exclude_history
+            )
+
+        # Step 2: Get enhanced user embedding and do wide search (increased for better subcategory coverage)
+        user_embedding = self.get_user_embedding_enhanced(age, gender, income, profession, location, education_level, marital_status, interaction_history)
+        similar_items = self.faiss_index.search_by_embedding(user_embedding, k * 10)  # Increased from k*6 to k*10
+
+        # Step 3: Organize candidates by subcategory with parent fallback
+        category_candidates = {category: [] for category in category_percentages.keys()}
+        parent_category_mapping = {}  # Track parent categories for fallback
+        other_candidates = []
+        history_set = set(interaction_history) if exclude_history else set()
+
+        # Build parent category mapping for fallback
+        for subcategory in category_percentages.keys():
+            if '.' in subcategory:
+                parent = subcategory.split('.')[0]
+                if parent not in parent_category_mapping:
+                    parent_category_mapping[parent] = []
+                parent_category_mapping[parent].append(subcategory)
+
+        for item_id, score in similar_items:
+            if item_id in history_set:
+                continue
+
+            # Get item category
+            item_row = self.items_df[self.items_df['product_id'] == item_id]
+            if len(item_row) > 0:
+                full_item_category = item_row.iloc[0]['category_code']
+
+                # Extract 2-level subcategory for matching
+                if '.' in full_item_category:
+                    category_parts = full_item_category.split('.')
+                    if len(category_parts) >= 2:
+                        item_subcategory = f"{category_parts[0]}.{category_parts[1]}"
+                    else:
+                        item_subcategory = category_parts[0]
+                else:
+                    item_subcategory = full_item_category
+
+                # Try exact subcategory match first
+                if item_subcategory in category_percentages:
+                    category_candidates[item_subcategory].append((item_id, score))
+                else:
+                    # Fallback: try parent category match
+                    parent_category = item_subcategory.split('.')[0] if '.' in item_subcategory else item_subcategory
+                    matched = False
+
+                    if parent_category in parent_category_mapping:
+                        # Add to the first subcategory of this parent (round-robin could be improved later)
+                        target_subcategory = parent_category_mapping[parent_category][0]
+                        category_candidates[target_subcategory].append((item_id, score))
+                        matched = True
+
+                    if not matched:
+                        other_candidates.append((item_id, score))
+
+        # Step 4: Calculate target counts for each subcategory (50% distributed proportionally)
+        category_target_count = max(1, k // 2)  # At least 50% from user categories
+
+        # Calculate proportional distribution with proper rounding
+        category_counts = self._calculate_proportional_distribution(
+            category_percentages, category_target_count
+        )
+
+        # Step 5: Select items with round-robin filling and rebalancing
+        selected_recommendations = []
+
+        # Fill from user's categories with rebalancing for insufficient candidates
+        actual_selections = {}
+        unused_allocations = {}
+
+        for category, target_count in category_counts.items():
+            candidates = sorted(category_candidates[category], key=lambda x: x[1], reverse=True)
+            available_count = len(candidates)
+            selected_count = min(target_count, available_count)
+
+            print(f"[DEBUG] Category {category}: target={target_count}, available={available_count}, selected={selected_count}")
+
+            actual_selections[category] = selected_count
+            if selected_count < target_count:
+                unused_allocations[category] = target_count - selected_count
+
+            # Select items from this category
+            for i in range(selected_count):
+                item_id, score = candidates[i]
+                item_info = self._get_item_info(item_id)
+                selected_recommendations.append((item_id, score, item_info))
+
+        # Step 6: Redistribute unused allocations proportionally
+        total_unused = sum(unused_allocations.values())
+        if total_unused > 0:
+            print(f"[DEBUG] Redistributing {total_unused} unused slots")
+
+            # Find categories with remaining candidates for redistribution
+            categories_with_extras = {}
+            for category, candidates in category_candidates.items():
+                used_count = actual_selections.get(category, 0)
+                available_extras = len(candidates) - used_count
+                if available_extras > 0:
+                    categories_with_extras[category] = available_extras
+
+            # Redistribute based on original proportions and availability
+            redistributed = 0
+            for category in sorted(categories_with_extras.keys(), key=lambda c: category_percentages.get(c, 0), reverse=True):
+                if redistributed >= total_unused:
+                    break
+
+                extra_slots = min(unused_allocations.get(category, 0) + 1, categories_with_extras[category])
+                candidates = sorted(category_candidates[category], key=lambda x: x[1], reverse=True)
+                used_count = actual_selections.get(category, 0)
+
+                for i in range(used_count, min(used_count + extra_slots, len(candidates))):
+                    if redistributed >= total_unused:
+                        break
+                    item_id, score = candidates[i]
+                    item_info = self._get_item_info(item_id)
+                    selected_recommendations.append((item_id, score, item_info))
+                    redistributed += 1
+
+        # Step 7: Fill remaining slots with diverse recommendations
+        remaining_slots = k - len(selected_recommendations)
+        if remaining_slots > 0:
+            # Collect all unused candidates (both from user categories and other categories)
+            all_remaining = []
+
+            # Add unused items from user categories
+            for category, candidates in category_candidates.items():
+                used_count = len([rec for rec in selected_recommendations if rec[2].get('category_code', '').startswith(category.split('.')[0])])
+                sorted_candidates = sorted(candidates, key=lambda x: x[1], reverse=True)
+                for i in range(used_count, len(sorted_candidates)):
+                    all_remaining.append(sorted_candidates[i])
+
+            # Add items from other categories
+            all_remaining.extend(other_candidates)
+
+            # Sort by score and take best remaining
+            all_remaining.sort(key=lambda x: x[1], reverse=True)
+
+            print(f"[DEBUG] Filling {remaining_slots} remaining slots from {len(all_remaining)} candidates")
+
+            for i in range(min(remaining_slots, len(all_remaining))):
+                item_id, score = all_remaining[i]
+                item_info = self._get_item_info(item_id)
+                selected_recommendations.append((item_id, score, item_info))
+
+        # Step 8: Sort final recommendations by score and return top k
+        selected_recommendations.sort(key=lambda x: x[1], reverse=True)
+        return selected_recommendations[:k]
+
+    def _calculate_category_percentages(self, interaction_history: List[int]) -> Dict[str, float]:
+        """Calculate subcategory percentages from interaction history (2-level depth)."""
+        if not interaction_history:
+            return {}
+
+        category_counts = {}
+        total_interactions = 0
+
+        for item_id in interaction_history:
+            item_row = self.items_df[self.items_df['product_id'] == item_id]
+            if len(item_row) > 0:
+                full_category = item_row.iloc[0]['category_code']
+
+                # Use 2-level subcategory (e.g., "computers.components" from "computers.components.memory")
+                if '.' in full_category:
+                    category_parts = full_category.split('.')
+                    if len(category_parts) >= 2:
+                        subcategory = f"{category_parts[0]}.{category_parts[1]}"
+                    else:
+                        subcategory = category_parts[0]  # Fallback to top-level if only one part
+                else:
+                    subcategory = full_category
+
+                category_counts[subcategory] = category_counts.get(subcategory, 0) + 1
+                total_interactions += 1
+
+        # Convert to percentages
+        category_percentages = {}
+        for category, count in category_counts.items():
+            category_percentages[category] = (count / total_interactions) * 100
+
+        return category_percentages
+
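The truncation-and-count logic in `_calculate_category_percentages` can be shown compactly. This is a simplified sketch: the `catalog` dict stands in for the repo's `items_df` lookup, and the item IDs and category codes are made up for illustration.

```python
from collections import Counter
from typing import Dict, List

def category_percentages(history: List[int], catalog: Dict[int, str]) -> Dict[str, float]:
    """Share (in %) of each 2-level subcategory in a user's history.

    `catalog` maps item_id -> full category_code, e.g. "computers.components.memory".
    """
    counts = Counter()
    for item_id in history:
        parts = catalog[item_id].split(".")
        counts[".".join(parts[:2])] += 1  # truncate to at most 2 levels
    total = sum(counts.values())
    return {cat: 100.0 * n / total for cat, n in counts.items()}
```

Truncating "computers.components.memory" and "computers.components.cpu" to the same "computers.components" bucket is what lets sibling leaf categories pool their counts.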
+    def _calculate_proportional_distribution(self, category_percentages: Dict[str, float],
+                                             total_target: int) -> Dict[str, int]:
+        """Calculate proportional distribution with proper rounding and no minimum distortion."""
+        if not category_percentages or total_target <= 0:
+            return {}
+
+        # Calculate raw allocations (without minimum guarantee)
+        total_percentage = sum(category_percentages.values())
+        raw_allocations = {}
+        remainders = {}
+
+        for category, percentage in category_percentages.items():
+            if total_percentage > 0:
+                raw_allocation = (percentage / total_percentage) * total_target
+                raw_allocations[category] = int(raw_allocation)  # Floor
+                remainders[category] = raw_allocation - int(raw_allocation)  # Remainder
+            else:
+                raw_allocations[category] = 0
+                remainders[category] = 0
+
+        # Distribute remaining slots based on largest remainders
+        allocated_so_far = sum(raw_allocations.values())
+        remaining_slots = total_target - allocated_so_far
+
+        # Sort categories by remainder (largest first) to distribute remaining slots
+        sorted_by_remainder = sorted(remainders.items(), key=lambda x: x[1], reverse=True)
+
+        for i in range(remaining_slots):
+            if i < len(sorted_by_remainder):
+                category_to_increment = sorted_by_remainder[i][0]
+                raw_allocations[category_to_increment] += 1
+
+        # Filter out zero allocations (no artificial minimum guarantee)
+        final_allocations = {cat: count for cat, count in raw_allocations.items() if count > 0}
+
+        print(f"[DEBUG] Proportional distribution: target={total_target}, allocations={final_allocations}")
+        return final_allocations
+
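`_calculate_proportional_distribution` is an instance of the largest-remainder (Hamilton) apportionment method: floor each proportional share, then hand the leftover slots to the largest fractional remainders. A compact standalone version, sketched without the class plumbing:

```python
import math
from typing import Dict

def largest_remainder(shares: Dict[str, float], total: int) -> Dict[str, int]:
    """Allocate `total` integer slots proportionally to `shares` (any positive weights)."""
    weight = sum(shares.values())
    exact = {c: s / weight * total for c, s in shares.items()}
    floors = {c: math.floor(x) for c, x in exact.items()}
    # Rank categories by fractional remainder, largest first
    by_remainder = sorted(shares, key=lambda c: exact[c] - floors[c], reverse=True)
    for c in by_remainder[: total - sum(floors.values())]:
        floors[c] += 1  # give leftover slots to the biggest remainders
    return {c: n for c, n in floors.items() if n > 0}
```

The guarantee this buys is that the allocations always sum exactly to `total` and each category's count differs from its exact proportional share by less than one slot.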
     def _get_item_info(self, item_id: int) -> Dict:
         """Get item metadata."""
 
                        gender: str,
                        income: float,
                        item_id: int,
+                       profession: str = "Other",
+                       location: str = "Urban",
+                       education_level: str = "High School",
+                       marital_status: str = "Single",
                        interaction_history: List[int] = None) -> float:
         """Predict rating for a specific user-item pair."""
 
             return 0.5  # Default prediction
 
         # Prepare user features
+        user_features = self.prepare_user_features(age, gender, income, profession, location, education_level, marital_status, interaction_history)
 
         # Prepare item features
         if item_id not in self.data_processor.item_vocab:
 
         'age': 32,
         'gender': 'male',
         'income': 75000,
+        'profession': 'Technology',
+        'location': 'Urban',
+        'education_level': "Bachelor's",
+        'marital_status': 'Married',
         'interaction_history': [1000978, 1001588, 1001618]  # Sample item IDs
     }
 
     print(f"Age: {demo_user['age']}")
     print(f"Gender: {demo_user['gender']}")
     print(f"Income: ${demo_user['income']:,}")
+    print(f"Profession: {demo_user['profession']}")
+    print(f"Location: {demo_user['location']}")
+    print(f"Education: {demo_user['education_level']}")
+    print(f"Marital Status: {demo_user['marital_status']}")
     print(f"Interaction history: {demo_user['interaction_history']}")
 
     # Generate collaborative recommendations
+    print("\n=== Collaborative Filtering Recommendations ===")
+    # Extract demographics and history separately to avoid conflicts
+    demo_kwargs = {k: v for k, v in demo_user.items() if k != 'interaction_history'}
+    collab_recs = engine.recommend_items_collaborative(
+        **demo_kwargs, interaction_history=demo_user['interaction_history'], k=5
+    )
 
     for i, (item_id, score, info) in enumerate(collab_recs, 1):
         print(f"{i}. Item {item_id}: {info['brand']} - ${info['price']:.2f} (Score: {score:.4f})")
 
+    # Generate content-based recommendations from aggregated history
+    print("\n=== Content-Based Recommendations (from aggregated user history) ===")
     if demo_user['interaction_history']:
+        content_recs = engine.recommend_items_content_based_from_history(
+            interaction_history=demo_user['interaction_history'], k=5
         )
 
         for i, (item_id, score, info) in enumerate(content_recs, 1):
 
     # Generate hybrid recommendations
     print("\n=== Hybrid Recommendations ===")
+    hybrid_recs = engine.recommend_items_hybrid(
+        **demo_kwargs, interaction_history=demo_user['interaction_history'], k=5
+    )
 
     for i, (item_id, score, info) in enumerate(hybrid_recs, 1):
         print(f"{i}. Item {item_id}: {info['brand']} - ${info['price']:.2f} (Score: {score:.4f})")
src/models/enhanced_two_tower.py DELETED
@@ -1,574 +0,0 @@
- #!/usr/bin/env python3
- """
- Enhanced two-tower model with embedding diversity regularization and improved discrimination.
- """
-
- import tensorflow as tf
- import tensorflow_recommenders as tfrs
- import numpy as np
-
-
- class EmbeddingDiversityRegularizer(tf.keras.layers.Layer):
-     """Regularizer to prevent embedding collapse by enforcing diversity."""
-
-     def __init__(self, diversity_weight=0.01, orthogonality_weight=0.05, **kwargs):
-         super().__init__(**kwargs)
-         self.diversity_weight = diversity_weight
-         self.orthogonality_weight = orthogonality_weight
-
-     def call(self, embeddings):
-         """Apply diversity regularization to embeddings."""
-         batch_size = tf.shape(embeddings)[0]
-
-         # Compute pairwise cosine similarities
-         normalized_embeddings = tf.nn.l2_normalize(embeddings, axis=1)
-         similarity_matrix = tf.linalg.matmul(
-             normalized_embeddings, normalized_embeddings, transpose_b=True
-         )
-
-         # Remove diagonal (self-similarities)
-         mask = 1.0 - tf.eye(batch_size)
-         masked_similarities = similarity_matrix * mask
-
-         # Diversity loss: penalize high similarities between different embeddings
-         diversity_loss = tf.reduce_mean(tf.square(masked_similarities))
-
-         # Orthogonality loss: encourage embeddings to be orthogonal
-         identity_target = tf.eye(batch_size)
-         orthogonality_loss = tf.reduce_mean(
-             tf.square(similarity_matrix - identity_target)
-         )
-
-         # Add as regularization losses
-         self.add_loss(self.diversity_weight * diversity_loss)
-         self.add_loss(self.orthogonality_weight * orthogonality_loss)
-
-         return embeddings
-
-
- class AdaptiveTemperatureScaling(tf.keras.layers.Layer):
-     """Advanced temperature scaling with learned parameters."""
-
-     def __init__(self, initial_temperature=1.0, min_temp=0.1, max_temp=5.0, **kwargs):
-         super().__init__(**kwargs)
-         self.initial_temperature = initial_temperature
-         self.min_temp = min_temp
-         self.max_temp = max_temp
-
-     def build(self, input_shape):
-         # Learnable temperature with constraints
-         self.raw_temperature = self.add_weight(
-             name='raw_temperature',
-             shape=(),
-             initializer=tf.keras.initializers.Constant(
-                 np.log(self.initial_temperature - self.min_temp)
-             ),
-             trainable=True
-         )
-
-         # Learnable bias term for better discrimination
-         self.similarity_bias = self.add_weight(
-             name='similarity_bias',
-             shape=(),
-             initializer=tf.keras.initializers.Zeros(),
-             trainable=True
-         )
-
-         super().build(input_shape)
-
-     def call(self, user_embeddings, item_embeddings):
-         """Compute adaptive temperature-scaled similarity with bias."""
-         # Constrain temperature to valid range
-         temperature = self.min_temp + tf.nn.softplus(self.raw_temperature)
-         temperature = tf.minimum(temperature, self.max_temp)
-
-         # Compute similarities
-         similarities = tf.reduce_sum(user_embeddings * item_embeddings, axis=1)
-
-         # Add learnable bias and apply temperature scaling
-         scaled_similarities = (similarities + self.similarity_bias) / temperature
-
-         return scaled_similarities, temperature
-
-
- class EnhancedItemTower(tf.keras.Model):
-     """Enhanced item tower with diversity regularization."""
-
-     def __init__(self,
-                  item_vocab_size: int,
-                  category_vocab_size: int,
-                  brand_vocab_size: int,
-                  embedding_dim: int = 128,
-                  hidden_dims: list = [256, 128],
-                  dropout_rate: float = 0.3,
-                  use_bias: bool = True,
-                  use_diversity_reg: bool = True):
-         super().__init__()
-
-         self.embedding_dim = embedding_dim
-         self.use_bias = use_bias
-         self.use_diversity_reg = use_diversity_reg
-
-         # Embedding layers with better initialization
-         self.item_embedding = tf.keras.layers.Embedding(
-             item_vocab_size, embedding_dim,
-             embeddings_initializer='he_normal',  # Better initialization
-             embeddings_regularizer=tf.keras.regularizers.L2(1e-6),
-             name="item_embedding"
-         )
-         self.category_embedding = tf.keras.layers.Embedding(
-             category_vocab_size, embedding_dim,
-             embeddings_initializer='he_normal',
-             embeddings_regularizer=tf.keras.regularizers.L2(1e-6),
-             name="category_embedding"
-         )
-         self.brand_embedding = tf.keras.layers.Embedding(
-             brand_vocab_size, embedding_dim,
-             embeddings_initializer='he_normal',
-             embeddings_regularizer=tf.keras.regularizers.L2(1e-6),
-             name="brand_embedding"
-         )
-
-         # Price processing
-         self.price_normalization = tf.keras.layers.Normalization(name="price_norm")
-         self.price_projection = tf.keras.layers.Dense(
-             embedding_dim // 4, activation='relu', name="price_proj"
-         )
-
-         # Enhanced attention mechanism
-         self.feature_attention = tf.keras.layers.MultiHeadAttention(
-             num_heads=4,
-             key_dim=embedding_dim,
-             dropout=0.1,
-             name="feature_attention"
-         )
-
-         # Dense layers with residual connections
-         self.dense_layers = []
-         for i, dim in enumerate(hidden_dims):
-             self.dense_layers.extend([
-                 tf.keras.layers.Dense(dim, activation=None, name=f"dense_{i}"),
-                 tf.keras.layers.BatchNormalization(name=f"bn_{i}"),
-                 tf.keras.layers.Activation('relu', name=f"relu_{i}"),
-                 tf.keras.layers.Dropout(dropout_rate, name=f"dropout_{i}")
-             ])
-
-         # Output layer with controlled normalization
-         self.output_layer = tf.keras.layers.Dense(
-             embedding_dim, activation=None, use_bias=use_bias, name="item_output"
-         )
-
-         # Diversity regularizer
-         if use_diversity_reg:
-             self.diversity_regularizer = EmbeddingDiversityRegularizer()
-
-         # Adaptive normalization instead of hard L2 normalization
-         self.adaptive_norm = tf.keras.layers.LayerNormalization(name="adaptive_norm")
-
-         # Item bias
-         if use_bias:
-             self.item_bias = tf.keras.layers.Embedding(
-                 item_vocab_size, 1, name="item_bias"
-             )
-
-     def call(self, inputs, training=None):
-         """Enhanced forward pass with diversity regularization."""
-         item_id = inputs["product_id"]
-         category_id = inputs["category_id"]
-         brand_id = inputs["brand_id"]
-         price = inputs["price"]
-
-         # Get embeddings
-         item_emb = self.item_embedding(item_id)
-         category_emb = self.category_embedding(category_id)
-         brand_emb = self.brand_embedding(brand_id)
-
-         # Process price
-         price_norm = self.price_normalization(tf.expand_dims(price, -1))
-         price_emb = self.price_projection(price_norm)
-
-         # Pad price embedding
-         price_emb_padded = tf.pad(
-             price_emb,
-             [[0, 0], [0, self.embedding_dim - tf.shape(price_emb)[-1]]]
-         )
-
-         # Stack features for attention
-         features = tf.stack([item_emb, category_emb, brand_emb, price_emb_padded], axis=1)
-
-         # Apply attention
-         attended_features = self.feature_attention(
-             query=features,
-             value=features,
-             key=features,
-             training=training
-         )
-
-         # Aggregate with residual connection
-         combined = tf.reduce_mean(attended_features + features, axis=1)
-
-         # Pass through dense layers with residual connections
-         x = combined
-         residual = x
-         for i, layer in enumerate(self.dense_layers):
-             x = layer(x, training=training)
-             # Add residual connection every 4 layers (complete block)
-             if (i + 1) % 4 == 0 and x.shape[-1] == residual.shape[-1]:
-                 x = x + residual
-                 residual = x
-
-         # Final output
-         output = self.output_layer(x)
-
-         # Apply diversity regularization if enabled
-         if self.use_diversity_reg and training:
-             output = self.diversity_regularizer(output)
-
-         # Adaptive normalization instead of hard L2
-         normalized_output = self.adaptive_norm(output)
-
-         # Add bias if enabled
-         if self.use_bias:
-             bias = tf.squeeze(self.item_bias(item_id), axis=-1)
-             return normalized_output, bias
-         else:
-             return normalized_output
-
-
- class EnhancedUserTower(tf.keras.Model):
-     """Enhanced user tower with diversity regularization."""
-
-     def __init__(self,
-                  max_history_length: int = 50,
-                  embedding_dim: int = 128,
-                  hidden_dims: list = [256, 128],
-                  dropout_rate: float = 0.3,
-                  use_bias: bool = True,
-                  use_diversity_reg: bool = True):
-         super().__init__()
-
-         self.embedding_dim = embedding_dim
-         self.max_history_length = max_history_length
-         self.use_bias = use_bias
-         self.use_diversity_reg = use_diversity_reg
-
-         # Demographic embeddings with regularization
-         self.age_embedding = tf.keras.layers.Embedding(
-             6, embedding_dim // 16,
-             embeddings_initializer='he_normal',
-             embeddings_regularizer=tf.keras.regularizers.L2(1e-6),
-             name="age_embedding"
-         )
-         self.income_embedding = tf.keras.layers.Embedding(
-             5, embedding_dim // 16,
-             embeddings_initializer='he_normal',
-             embeddings_regularizer=tf.keras.regularizers.L2(1e-6),
-             name="income_embedding"
-         )
-         self.gender_embedding = tf.keras.layers.Embedding(
-             2, embedding_dim // 16,
-             embeddings_initializer='he_normal',
-             embeddings_regularizer=tf.keras.regularizers.L2(1e-6),
-             name="gender_embedding"
-         )
-
-         # Enhanced history processing
-         self.history_transformer = tf.keras.layers.MultiHeadAttention(
-             num_heads=8,
-             key_dim=embedding_dim,
-             dropout=0.1,
-             name="history_transformer"
-         )
-
-         # History aggregation with attention pooling
-         self.history_attention_pooling = tf.keras.layers.Dense(
-             1, activation=None, name="history_attention"
-         )
-
-         # Dense layers with residual connections
-         self.dense_layers = []
-         for i, dim in enumerate(hidden_dims):
-             self.dense_layers.extend([
-                 tf.keras.layers.Dense(dim, activation=None, name=f"user_dense_{i}"),
-                 tf.keras.layers.BatchNormalization(name=f"user_bn_{i}"),
-                 tf.keras.layers.Activation('relu', name=f"user_relu_{i}"),
-                 tf.keras.layers.Dropout(dropout_rate, name=f"user_dropout_{i}")
-             ])
-
-         # Output layer
-         self.output_layer = tf.keras.layers.Dense(
-             embedding_dim, activation=None, use_bias=use_bias, name="user_output"
-         )
-
-         # Diversity regularizer
-         if use_diversity_reg:
-             self.diversity_regularizer = EmbeddingDiversityRegularizer()
-
-         # Adaptive normalization
-         self.adaptive_norm = tf.keras.layers.LayerNormalization(name="user_adaptive_norm")
-
-         # Global user bias
-         if use_bias:
-             self.global_user_bias = tf.Variable(
-                 initial_value=0.0, trainable=True, name="global_user_bias"
-             )
-
-     def call(self, inputs, training=None):
-         """Enhanced forward pass with diversity regularization."""
-         age = inputs["age"]
-         gender = inputs["gender"]
-         income = inputs["income"]
-         item_history = inputs["item_history_embeddings"]
-
-         # Process demographics
-         age_emb = self.age_embedding(age)
-         income_emb = self.income_embedding(income)
-         gender_emb = self.gender_embedding(gender)
-
-         # Combine demographics
-         demo_combined = tf.concat([age_emb, income_emb, gender_emb], axis=-1)
-
-         # Enhanced history processing
-         batch_size = tf.shape(item_history)[0]
-         seq_len = tf.shape(item_history)[1]
-
-         # Simplified positional encoding - ensure shape compatibility
-         positions = tf.range(seq_len, dtype=tf.float32)
-         # Create simpler positional encoding
-         pos_encoding_scale = tf.cast(tf.range(self.embedding_dim, dtype=tf.float32), tf.float32) / self.embedding_dim
-         position_encoding = tf.sin(positions[:, tf.newaxis] * pos_encoding_scale[tf.newaxis, :])
-
-         # Ensure correct shape: [seq_len, embedding_dim] -> [batch_size, seq_len, embedding_dim]
-         position_encoding = tf.expand_dims(position_encoding, 0)
-         position_encoding = tf.tile(position_encoding, [batch_size, 1, 1])
-
-         # Add positional encoding with shape check
-         history_with_pos = item_history + position_encoding
-
-         # Create attention mask - fix shape for MultiHeadAttention
-         # MultiHeadAttention expects mask shape: [batch_size, seq_len] or [batch_size, seq_len, seq_len]
-         history_mask = tf.reduce_sum(tf.abs(item_history), axis=-1) > 0  # [batch_size, seq_len]
-
-         # Apply transformer attention
-         attended_history = self.history_transformer(
-             query=history_with_pos,
-             value=history_with_pos,
-             key=history_with_pos,
-             attention_mask=history_mask,
-             training=training
-         )
-
-         # Attention-based pooling instead of simple mean
-         attention_weights = tf.nn.softmax(
-             self.history_attention_pooling(attended_history), axis=1
-         )
-         history_aggregated = tf.reduce_sum(
-             attended_history * attention_weights, axis=1
-         )
-
-         # Combine features
-         combined = tf.concat([demo_combined, history_aggregated], axis=-1)
-
-         # Pass through dense layers with residual connections
-         x = combined
-         residual = x
-         for i, layer in enumerate(self.dense_layers):
-             x = layer(x, training=training)
-             # Add residual connection every 4 layers
-             if (i + 1) % 4 == 0 and x.shape[-1] == residual.shape[-1]:
-                 x = x + residual
-                 residual = x
-
-         # Final output
-         output = self.output_layer(x)
-
-         # Apply diversity regularization if enabled
-         if self.use_diversity_reg and training:
-             output = self.diversity_regularizer(output)
-
-         # Adaptive normalization
-         normalized_output = self.adaptive_norm(output)
-
-         # Add bias if enabled
-         if self.use_bias:
-             return normalized_output, self.global_user_bias
-         else:
-             return normalized_output
-
-
- class EnhancedTwoTowerModel(tfrs.Model):
-     """Enhanced two-tower model with all improvements."""
-
-     def __init__(self,
-                  item_tower: EnhancedItemTower,
-                  user_tower: EnhancedUserTower,
-                  rating_weight: float = 1.0,
-                  retrieval_weight: float = 1.0,
-                  contrastive_weight: float = 0.3,
-                  diversity_weight: float = 0.1):
-         super().__init__()
-
-         self.item_tower = item_tower
-         self.user_tower = user_tower
-         self.rating_weight = rating_weight
-         self.retrieval_weight = retrieval_weight
-         self.contrastive_weight = contrastive_weight
-         self.diversity_weight = diversity_weight
-
-         # Adaptive temperature scaling
-         self.temperature_similarity = AdaptiveTemperatureScaling()
-
-         # Enhanced rating model
-         self.rating_model = tf.keras.Sequential([
-             tf.keras.layers.Dense(512, activation="relu"),
-             tf.keras.layers.BatchNormalization(),
-             tf.keras.layers.Dropout(0.3),
-             tf.keras.layers.Dense(256, activation="relu"),
-             tf.keras.layers.BatchNormalization(),
-             tf.keras.layers.Dropout(0.2),
-             tf.keras.layers.Dense(64, activation="relu"),
-             tf.keras.layers.Dense(1, activation="sigmoid")
-         ])
-
-         # Focal loss for imbalanced data
-         self.focal_loss = self._focal_loss
-
-     def _focal_loss(self, y_true, y_pred, alpha=0.25, gamma=2.0):
-         """Focal loss implementation."""
-         epsilon = tf.keras.backend.epsilon()
-         y_pred = tf.clip_by_value(y_pred, epsilon, 1.0 - epsilon)
-
-         alpha_t = y_true * alpha + (1 - y_true) * (1 - alpha)
-         p_t = y_true * y_pred + (1 - y_true) * (1 - y_pred)
-         focal_weight = alpha_t * tf.pow((1 - p_t), gamma)
-
-         bce = -(y_true * tf.math.log(y_pred) + (1 - y_true) * tf.math.log(1 - y_pred))
-         focal_loss = focal_weight * bce
-
-         return tf.reduce_mean(focal_loss)
-
450
- def call(self, features):
451
- # Get embeddings
452
- user_output = self.user_tower(features)
453
- item_output = self.item_tower(features)
454
-
455
- # Handle bias terms
456
- if isinstance(user_output, tuple):
457
- user_embeddings, user_bias = user_output
458
- else:
459
- user_embeddings = user_output
460
- user_bias = 0.0
461
-
462
- if isinstance(item_output, tuple):
463
- item_embeddings, item_bias = item_output
464
- else:
465
- item_embeddings = item_output
466
- item_bias = 0.0
467
-
468
- return {
469
- "user_embedding": user_embeddings,
470
- "item_embedding": item_embeddings,
471
- "user_bias": user_bias,
472
- "item_bias": item_bias
473
- }
474
-
475
- def compute_loss(self, features, training=False):
476
- # Get embeddings and biases
477
- outputs = self(features)
478
- user_embeddings = outputs["user_embedding"]
479
- item_embeddings = outputs["item_embedding"]
480
- user_bias = outputs["user_bias"]
481
- item_bias = outputs["item_bias"]
482
-
483
- # Rating prediction
484
- concatenated = tf.concat([user_embeddings, item_embeddings], axis=-1)
485
- rating_predictions = self.rating_model(concatenated, training=training)
486
-
487
- # Add bias terms
488
- rating_predictions_with_bias = rating_predictions + user_bias + item_bias
489
- rating_predictions_with_bias = tf.nn.sigmoid(rating_predictions_with_bias)
490
-
491
- # Losses
492
- rating_loss = self.focal_loss(features["rating"], rating_predictions_with_bias)
493
-
494
- # Adaptive temperature-scaled retrieval loss
495
- scaled_similarities, temperature = self.temperature_similarity(
496
- user_embeddings, item_embeddings
497
- )
498
- retrieval_loss = tf.keras.losses.binary_crossentropy(
499
- features["rating"],
500
- tf.nn.sigmoid(scaled_similarities)
501
- )
502
- retrieval_loss = tf.reduce_mean(retrieval_loss)
503
-
504
- # Enhanced contrastive loss with hard negatives
505
- batch_size = tf.shape(user_embeddings)[0]
506
- positive_similarities = tf.reduce_sum(user_embeddings * item_embeddings, axis=1)
507
-
508
- # Random negative sampling
509
- shuffled_indices = tf.random.shuffle(tf.range(batch_size))
510
- negative_item_embeddings = tf.gather(item_embeddings, shuffled_indices)
511
- negative_similarities = tf.reduce_sum(user_embeddings * negative_item_embeddings, axis=1)
512
-
513
- # Triplet loss with adaptive margin
514
- margin = 0.5 / temperature # Adaptive margin based on temperature
515
- contrastive_loss = tf.reduce_mean(
516
- tf.maximum(0.0, margin + negative_similarities - positive_similarities)
517
- )
518
-
519
- # Combine losses
520
- total_loss = (
521
- self.rating_weight * rating_loss +
522
- self.retrieval_weight * retrieval_loss +
523
- self.contrastive_weight * contrastive_loss
524
- )
525
-
526
- # Add regularization losses from diversity regularizers
527
- if training:
528
- regularization_losses = tf.add_n(self.losses) if self.losses else 0.0
529
- total_loss += self.diversity_weight * regularization_losses
530
-
531
- return {
532
- 'total_loss': total_loss,
533
- 'rating_loss': rating_loss,
534
- 'retrieval_loss': retrieval_loss,
535
- 'contrastive_loss': contrastive_loss,
536
- 'temperature': temperature,
537
- 'diversity_loss': regularization_losses if training else 0.0
538
- }
539
-
540
-
541
- def create_enhanced_model(data_processor,
542
- embedding_dim=128,
543
- use_bias=True,
544
- use_diversity_reg=True):
545
- """Factory function to create enhanced two-tower model."""
546
-
547
- # Create enhanced towers
548
- item_tower = EnhancedItemTower(
549
- item_vocab_size=len(data_processor.item_vocab),
550
- category_vocab_size=len(data_processor.category_vocab),
551
- brand_vocab_size=len(data_processor.brand_vocab),
552
- embedding_dim=embedding_dim,
553
- use_bias=use_bias,
554
- use_diversity_reg=use_diversity_reg
555
- )
556
-
557
- user_tower = EnhancedUserTower(
558
- max_history_length=50,
559
- embedding_dim=embedding_dim,
560
- use_bias=use_bias,
561
- use_diversity_reg=use_diversity_reg
562
- )
563
-
564
- # Create enhanced model
565
- model = EnhancedTwoTowerModel(
566
- item_tower=item_tower,
567
- user_tower=user_tower,
568
- rating_weight=1.0,
569
- retrieval_weight=0.5,
570
- contrastive_weight=0.3,
571
- diversity_weight=0.1
572
- )
573
-
574
- return model
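The `_focal_loss` removed above is alpha-balanced binary cross-entropy whose well-classified examples are down-weighted by `(1 - p_t)**gamma`. As a sanity sketch (a NumPy stand-in, not the project's TensorFlow code), the behavior is easy to check: a confident correct prediction contributes almost nothing, while a borderline one dominates the loss.

```python
import numpy as np

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """NumPy mirror of the deleted _focal_loss: alpha-balanced BCE
    scaled by (1 - p_t)**gamma, which suppresses easy examples."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    alpha_t = y_true * alpha + (1 - y_true) * (1 - alpha)
    p_t = y_true * y_pred + (1 - y_true) * (1 - y_pred)
    focal_weight = alpha_t * (1 - p_t) ** gamma
    bce = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(focal_weight * bce))

# A confident positive vs. a borderline positive:
easy = focal_loss(np.array([1.0]), np.array([0.95]))
hard = focal_loss(np.array([1.0]), np.array([0.60]))
# The borderline example contributes far more loss than the confident one.
```

With `gamma=0` and `alpha=0.5`, the focal weight collapses to a constant 0.5, recovering plain (halved) binary cross-entropy.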
src/models/improved_two_tower.py DELETED
@@ -1,545 +0,0 @@
-#!/usr/bin/env python3
-"""
-Improved two-tower model with better embedding discrimination and training stability.
-"""
-
-import tensorflow as tf
-import tensorflow_recommenders as tfrs
-import numpy as np
-
-
-class ImprovedItemTower(tf.keras.Model):
-    """Enhanced item tower with better discrimination and representation capacity."""
-
-    def __init__(self,
-                 item_vocab_size: int,
-                 category_vocab_size: int,
-                 brand_vocab_size: int,
-                 embedding_dim: int = 128,  # Increased from 64
-                 hidden_dims: list = [256, 128],  # Deeper network
-                 dropout_rate: float = 0.3,
-                 use_bias: bool = True):
-        super().__init__()
-
-        self.embedding_dim = embedding_dim
-        self.use_bias = use_bias
-
-        # Larger embedding layers with proper initialization
-        self.item_embedding = tf.keras.layers.Embedding(
-            item_vocab_size, embedding_dim,
-            embeddings_initializer='glorot_uniform',
-            name="item_embedding"
-        )
-        self.category_embedding = tf.keras.layers.Embedding(
-            category_vocab_size, embedding_dim,
-            embeddings_initializer='glorot_uniform',
-            name="category_embedding"
-        )
-        self.brand_embedding = tf.keras.layers.Embedding(
-            brand_vocab_size, embedding_dim,
-            embeddings_initializer='glorot_uniform',
-            name="brand_embedding"
-        )
-
-        # Price normalization and projection
-        self.price_normalization = tf.keras.layers.Normalization(name="price_norm")
-        self.price_projection = tf.keras.layers.Dense(
-            embedding_dim // 4, activation='relu', name="price_proj"
-        )
-
-        # Attention mechanism for feature fusion
-        self.feature_attention = tf.keras.layers.MultiHeadAttention(
-            num_heads=4,
-            key_dim=embedding_dim,
-            name="feature_attention"
-        )
-
-        # Enhanced dense layers with batch normalization
-        self.dense_layers = []
-        for i, dim in enumerate(hidden_dims):
-            self.dense_layers.extend([
-                tf.keras.layers.Dense(dim, activation=None, name=f"dense_{i}"),
-                tf.keras.layers.BatchNormalization(name=f"bn_{i}"),
-                tf.keras.layers.Activation('relu', name=f"relu_{i}"),
-                tf.keras.layers.Dropout(dropout_rate, name=f"dropout_{i}")
-            ])
-
-        # Output projection with bias term
-        self.output_layer = tf.keras.layers.Dense(
-            embedding_dim, activation=None, use_bias=use_bias, name="item_output"
-        )
-
-        # Learnable bias term for each item
-        if use_bias:
-            self.item_bias = tf.keras.layers.Embedding(
-                item_vocab_size, 1, name="item_bias"
-            )
-
-    def call(self, inputs, training=None):
-        """Enhanced forward pass with attention and better feature fusion."""
-        item_id = inputs["product_id"]
-        category_id = inputs["category_id"]
-        brand_id = inputs["brand_id"]
-        price = inputs["price"]
-
-        # Get embeddings
-        item_emb = self.item_embedding(item_id)  # [batch, emb_dim]
-        category_emb = self.category_embedding(category_id)
-        brand_emb = self.brand_embedding(brand_id)
-
-        # Process price
-        price_norm = self.price_normalization(tf.expand_dims(price, -1))
-        price_emb = self.price_projection(price_norm)
-
-        # Pad price embedding to match others
-        price_emb_padded = tf.pad(
-            price_emb,
-            [[0, 0], [0, self.embedding_dim - tf.shape(price_emb)[-1]]]
-        )
-
-        # Stack features for attention [batch, 4, emb_dim]
-        features = tf.stack([item_emb, category_emb, brand_emb, price_emb_padded], axis=1)
-
-        # Apply self-attention for feature fusion
-        attended_features = self.feature_attention(
-            query=features,
-            value=features,
-            key=features,
-            training=training
-        )
-
-        # Aggregate features (mean pooling)
-        combined = tf.reduce_mean(attended_features, axis=1)
-
-        # Pass through enhanced dense layers
-        x = combined
-        for layer in self.dense_layers:
-            x = layer(x, training=training)
-
-        # Final output
-        output = self.output_layer(x)
-
-        # L2 normalize for similarity computations
-        normalized_output = tf.nn.l2_normalize(output, axis=-1)
-
-        # Add bias if enabled
-        if self.use_bias:
-            bias = tf.squeeze(self.item_bias(item_id), axis=-1)
-            return normalized_output, bias
-        else:
-            return normalized_output
-
-
-class ImprovedUserTower(tf.keras.Model):
-    """Enhanced user tower with better history modeling and representation."""
-
-    def __init__(self,
-                 max_history_length: int = 50,
-                 embedding_dim: int = 128,  # Increased from 64
-                 hidden_dims: list = [256, 128],  # Deeper network
-                 dropout_rate: float = 0.3,
-                 use_bias: bool = True):
-        super().__init__()
-
-        self.embedding_dim = embedding_dim
-        self.max_history_length = max_history_length
-        self.use_bias = use_bias
-
-        # Demographic embeddings (categorical features)
-        # Age: 6 categories (Teen, Young Adult, Adult, Middle Age, Mature, Senior)
-        self.age_embedding = tf.keras.layers.Embedding(
-            6, embedding_dim // 16,
-            embeddings_initializer='glorot_uniform',
-            name="age_embedding"
-        )
-
-        # Income: 5 categories (percentile-based)
-        self.income_embedding = tf.keras.layers.Embedding(
-            5, embedding_dim // 16,
-            embeddings_initializer='glorot_uniform',
-            name="income_embedding"
-        )
-
-        # Gender: 2 categories (0=female, 1=male)
-        self.gender_embedding = tf.keras.layers.Embedding(
-            2, embedding_dim // 16,
-            embeddings_initializer='glorot_uniform',
-            name="gender_embedding"
-        )
-
-        # Improved history processing with positional encoding
-        self.history_transformer = tf.keras.layers.MultiHeadAttention(
-            num_heads=8,  # More attention heads
-            key_dim=embedding_dim,
-            name="history_transformer"
-        )
-
-        # History aggregation with learned weights
-        self.history_aggregation = tf.keras.layers.Dense(
-            embedding_dim, activation='tanh', name="history_agg"
-        )
-
-        # Enhanced dense layers with batch normalization
-        self.dense_layers = []
-        for i, dim in enumerate(hidden_dims):
-            self.dense_layers.extend([
-                tf.keras.layers.Dense(dim, activation=None, name=f"user_dense_{i}"),
-                tf.keras.layers.BatchNormalization(name=f"user_bn_{i}"),
-                tf.keras.layers.Activation('relu', name=f"user_relu_{i}"),
-                tf.keras.layers.Dropout(dropout_rate, name=f"user_dropout_{i}")
-            ])
-
-        # Output layer
-        self.output_layer = tf.keras.layers.Dense(
-            embedding_dim, activation=None, use_bias=use_bias, name="user_output"
-        )
-
-        # Learnable user bias
-        if use_bias:
-            # We'll need to handle user bias differently since we don't have user vocab in inference
-            self.global_user_bias = tf.Variable(
-                initial_value=0.0, trainable=True, name="global_user_bias"
-            )
-
-    def call(self, inputs, training=None):
-        """Enhanced forward pass with better history modeling."""
-        age = inputs["age"]  # Now categorical (0-5)
-        gender = inputs["gender"]  # Categorical (0-1)
-        income = inputs["income"]  # Now categorical (0-4)
-        item_history = inputs["item_history_embeddings"]  # [batch_size, seq_len, emb_dim]
-
-        # Process demographics through embeddings
-        age_emb = self.age_embedding(age)  # [batch_size, embedding_dim//16]
-        income_emb = self.income_embedding(income)  # [batch_size, embedding_dim//16]
-        gender_emb = self.gender_embedding(gender)  # [batch_size, embedding_dim//16]
-
-        # Combine all demographic embeddings
-        demo_combined = tf.concat([age_emb, income_emb, gender_emb], axis=-1)
-        # Total demographics: 3 * (embedding_dim//16) = ~18.75% of embedding_dim
-
-        # Enhanced history processing with positional encoding
-        batch_size = tf.shape(item_history)[0]
-        seq_len = tf.shape(item_history)[1]
-
-        # Create positional encoding
-        positions = tf.range(seq_len, dtype=tf.float32)
-        position_encoding = tf.sin(
-            positions[:, tf.newaxis] /
-            tf.pow(10000.0, 2 * tf.range(self.embedding_dim, dtype=tf.float32) / self.embedding_dim)
-        )
-        position_encoding = tf.expand_dims(position_encoding, 0)
-        position_encoding = tf.tile(position_encoding, [batch_size, 1, 1])
-
-        # Add positional encoding to history
-        history_with_pos = item_history + position_encoding
-
-        # Create attention mask for padding
-        history_mask = tf.reduce_sum(tf.abs(item_history), axis=-1) > 0
-
-        # Apply transformer attention to history
-        attended_history = self.history_transformer(
-            query=history_with_pos,
-            value=history_with_pos,
-            key=history_with_pos,
-            attention_mask=history_mask,
-            training=training
-        )
-
-        # Aggregate history with learned weights
-        history_weights = tf.nn.softmax(
-            tf.keras.layers.Dense(1)(attended_history), axis=1
-        )
-        history_aggregated = tf.reduce_sum(
-            attended_history * history_weights, axis=1
-        )
-
-        # Apply additional processing
-        history_processed = self.history_aggregation(history_aggregated)
-
-        # Combine all features
-        combined = tf.concat([
-            demo_combined,
-            history_processed
-        ], axis=-1)
-
-        # Pass through enhanced dense layers
-        x = combined
-        for layer in self.dense_layers:
-            x = layer(x, training=training)
-
-        # Final output
-        output = self.output_layer(x)
-
-        # L2 normalize for similarity computations
-        normalized_output = tf.nn.l2_normalize(output, axis=-1)
-
-        # Add global bias if enabled
-        if self.use_bias:
-            return normalized_output, self.global_user_bias
-        else:
-            return normalized_output
-
-
-class TemperatureScaledSimilarity(tf.keras.layers.Layer):
-    """Learnable temperature scaling for similarity computations."""
-
-    def __init__(self, initial_temperature=1.0, **kwargs):
-        super().__init__(**kwargs)
-        self.initial_temperature = initial_temperature
-
-    def build(self, input_shape):
-        self.temperature = self.add_weight(
-            name='temperature',
-            shape=(),
-            initializer=tf.keras.initializers.Constant(self.initial_temperature),
-            trainable=True
-        )
-        super().build(input_shape)
-
-    def call(self, user_embeddings, item_embeddings):
-        """Compute temperature-scaled similarity."""
-        # Dot product similarity
-        similarities = tf.reduce_sum(user_embeddings * item_embeddings, axis=1)
-
-        # Scale by learnable temperature
-        scaled_similarities = similarities / tf.maximum(self.temperature, 0.01)  # Prevent division by 0
-
-        return scaled_similarities
-
-
-class ImprovedTwoTowerModel(tfrs.Model):
-    """Enhanced two-tower model with better discrimination and training stability."""
-
-    def __init__(self,
-                 item_tower: ImprovedItemTower,
-                 user_tower: ImprovedUserTower,
-                 rating_weight: float = 1.0,
-                 retrieval_weight: float = 1.0,
-                 contrastive_weight: float = 0.5,
-                 use_focal_loss: bool = True):
-        super().__init__()
-
-        self.item_tower = item_tower
-        self.user_tower = user_tower
-        self.rating_weight = rating_weight
-        self.retrieval_weight = retrieval_weight
-        self.contrastive_weight = contrastive_weight
-        self.use_focal_loss = use_focal_loss
-
-        # Temperature-scaled similarity
-        self.temperature_similarity = TemperatureScaledSimilarity()
-
-        # Enhanced rating prediction with more capacity
-        self.rating_model = tf.keras.Sequential([
-            tf.keras.layers.Dense(512, activation="relu"),
-            tf.keras.layers.BatchNormalization(),
-            tf.keras.layers.Dropout(0.3),
-            tf.keras.layers.Dense(256, activation="relu"),
-            tf.keras.layers.BatchNormalization(),
-            tf.keras.layers.Dropout(0.2),
-            tf.keras.layers.Dense(64, activation="relu"),
-            tf.keras.layers.Dense(1, activation="sigmoid")
-        ])
-
-        # Rating task with better loss
-        if use_focal_loss:
-            self.rating_loss = self._focal_loss
-        else:
-            self.rating_loss = tf.keras.losses.BinaryCrossentropy()
-
-        # Contrastive loss for embedding separation
-        self.contrastive_loss = tf.keras.losses.CosineSimilarity()
-
-    def _focal_loss(self, y_true, y_pred, alpha=0.25, gamma=2.0):
-        """Focal loss for handling imbalanced data."""
-        epsilon = tf.keras.backend.epsilon()
-        y_pred = tf.clip_by_value(y_pred, epsilon, 1.0 - epsilon)
-
-        # Compute focal weight
-        alpha_t = y_true * alpha + (1 - y_true) * (1 - alpha)
-        p_t = y_true * y_pred + (1 - y_true) * (1 - y_pred)
-        focal_weight = alpha_t * tf.pow((1 - p_t), gamma)
-
-        # Compute loss
-        bce = -(y_true * tf.math.log(y_pred) + (1 - y_true) * tf.math.log(1 - y_pred))
-        focal_loss = focal_weight * bce
-
-        return tf.reduce_mean(focal_loss)
-
-    def call(self, features):
-        # Get embeddings (handle bias if present)
-        user_output = self.user_tower(features)
-        item_output = self.item_tower(features)
-
-        # Handle bias terms
-        if isinstance(user_output, tuple):
-            user_embeddings, user_bias = user_output
-        else:
-            user_embeddings = user_output
-            user_bias = 0.0
-
-        if isinstance(item_output, tuple):
-            item_embeddings, item_bias = item_output
-        else:
-            item_embeddings = item_output
-            item_bias = 0.0
-
-        return {
-            "user_embedding": user_embeddings,
-            "item_embedding": item_embeddings,
-            "user_bias": user_bias,
-            "item_bias": item_bias
-        }
-
-    def _hard_negative_mining(self, user_embeddings, item_embeddings, ratings, num_negatives=5):
-        """Mine hard negatives for better training."""
-        batch_size = tf.shape(user_embeddings)[0]
-
-        # Compute all pairwise similarities
-        user_norm = tf.nn.l2_normalize(user_embeddings, axis=1)
-        item_norm = tf.nn.l2_normalize(item_embeddings, axis=1)
-
-        # Expand dimensions for broadcasting: [batch, 1, dim] x [1, batch, dim]
-        user_expanded = tf.expand_dims(user_norm, 1)
-        item_expanded = tf.expand_dims(item_norm, 0)
-
-        # Compute similarity matrix [batch, batch]
-        similarity_matrix = tf.reduce_sum(user_expanded * item_expanded, axis=2)
-
-        # Create mask to exclude positive pairs
-        positive_mask = tf.eye(batch_size, dtype=tf.bool)
-        negative_mask = tf.logical_not(positive_mask)
-
-        # Get negative similarities and find hardest negatives
-        negative_similarities = tf.where(negative_mask, similarity_matrix, -tf.float32.max)
-
-        # Get top-k hardest negatives (highest similarities among negatives)
-        _, hard_negative_indices = tf.nn.top_k(negative_similarities, k=num_negatives)
-
-        return hard_negative_indices
-
-    def compute_loss(self, features, training=False):
-        # Get embeddings and biases
-        outputs = self(features)
-        user_embeddings = outputs["user_embedding"]
-        item_embeddings = outputs["item_embedding"]
-        user_bias = outputs["user_bias"]
-        item_bias = outputs["item_bias"]
-
-        # Rating prediction with bias terms
-        concatenated = tf.concat([user_embeddings, item_embeddings], axis=-1)
-        rating_predictions = self.rating_model(concatenated, training=training)
-
-        # Add bias terms to rating predictions
-        rating_predictions_with_bias = rating_predictions + user_bias + item_bias
-        rating_predictions_with_bias = tf.nn.sigmoid(rating_predictions_with_bias)
-
-        # Rating loss
-        rating_loss = self.rating_loss(features["rating"], rating_predictions_with_bias)
-
-        # Temperature-scaled retrieval loss
-        scaled_similarities = self.temperature_similarity(user_embeddings, item_embeddings)
-        retrieval_loss = tf.keras.losses.binary_crossentropy(
-            features["rating"],
-            tf.nn.sigmoid(scaled_similarities)
-        )
-        retrieval_loss = tf.reduce_mean(retrieval_loss)
-
-        # Enhanced contrastive loss with hard negative mining
-        batch_size = tf.shape(user_embeddings)[0]
-
-        if training and batch_size > 5:  # Only use hard negatives during training with sufficient batch size
-            # Hard negative mining
-            hard_negative_indices = self._hard_negative_mining(
-                user_embeddings, item_embeddings, features["rating"], num_negatives=3
-            )
-
-            # Positive similarities
-            positive_similarities = tf.reduce_sum(user_embeddings * item_embeddings, axis=1)
-
-            # Hard negative similarities
-            hard_negative_losses = []
-            for i in range(3):  # Use top 3 hard negatives
-                neg_indices = hard_negative_indices[:, i]
-                negative_item_embeddings = tf.gather(item_embeddings, neg_indices)
-                negative_similarities = tf.reduce_sum(user_embeddings * negative_item_embeddings, axis=1)
-
-                # Triplet-like loss with margin
-                margin_loss = tf.maximum(0.0, 0.2 + negative_similarities - positive_similarities)
-                hard_negative_losses.append(margin_loss)
-
-            # Average hard negative losses
-            contrastive_loss = tf.reduce_mean(tf.stack(hard_negative_losses))
-
-        else:
-            # Fallback to random negative sampling
-            shuffled_indices = tf.random.shuffle(tf.range(batch_size))
-            negative_item_embeddings = tf.gather(item_embeddings, shuffled_indices)
-
-            # Positive similarities
-            positive_similarities = tf.reduce_sum(user_embeddings * item_embeddings, axis=1)
-
-            # Negative similarities
-            negative_similarities = tf.reduce_sum(user_embeddings * negative_item_embeddings, axis=1)
-
-            # Contrastive loss (maximize positive, minimize negative)
-            contrastive_loss = tf.reduce_mean(
-                tf.maximum(0.0, 0.5 + negative_similarities - positive_similarities)
-            )
-
-        # Combine losses
-        total_loss = (
-            self.rating_weight * rating_loss +
-            self.retrieval_weight * retrieval_loss +
-            self.contrastive_weight * contrastive_loss
-        )
-
-        # Add L2 regularization to prevent overfitting
-        l2_loss = tf.add_n([
-            tf.nn.l2_loss(var) for var in self.trainable_variables
-            if 'bias' not in var.name and 'normalization' not in var.name
-        ]) * 1e-5
-
-        total_loss += l2_loss
-
-        return {
-            'total_loss': total_loss,
-            'rating_loss': rating_loss,
-            'retrieval_loss': retrieval_loss,
-            'contrastive_loss': contrastive_loss,
-            'l2_loss': l2_loss
-        }
-
-
-def create_improved_model(data_processor,
-                          embedding_dim=128,
-                          use_bias=True,
-                          use_focal_loss=True):
-    """Factory function to create improved two-tower model."""
-
-    # Create enhanced towers
-    item_tower = ImprovedItemTower(
-        item_vocab_size=len(data_processor.item_vocab),
-        category_vocab_size=len(data_processor.category_vocab),
-        brand_vocab_size=len(data_processor.brand_vocab),
-        embedding_dim=embedding_dim,
-        use_bias=use_bias
-    )
-
-    user_tower = ImprovedUserTower(
-        max_history_length=50,
-        embedding_dim=embedding_dim,
-        use_bias=use_bias
-    )
-
-    # Create improved model
-    model = ImprovedTwoTowerModel(
-        item_tower=item_tower,
-        user_tower=user_tower,
-        rating_weight=1.0,
-        retrieval_weight=0.5,
-        contrastive_weight=0.3,
-        use_focal_loss=use_focal_loss
-    )
-
-    return model
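The `_hard_negative_mining` helper deleted with this file picks, for each user in the batch, the in-batch items most similar to that user while masking out the user's own positive pair. A minimal NumPy sketch of that idea (a stand-in for illustration, not the TensorFlow code above):

```python
import numpy as np

def hard_negative_indices(user_emb, item_emb, num_negatives=3):
    """Sketch of _hard_negative_mining: L2-normalize both towers,
    build the in-batch similarity matrix, mask the diagonal
    (each row's positive pair), and return the top-k most similar
    remaining items per user as hard negatives."""
    u = user_emb / np.linalg.norm(user_emb, axis=1, keepdims=True)
    v = item_emb / np.linalg.norm(item_emb, axis=1, keepdims=True)
    sim = u @ v.T                   # [batch, batch] cosine similarities
    np.fill_diagonal(sim, -np.inf)  # exclude positives from mining
    # argsort descending and keep the k highest-similarity negatives
    return np.argsort(-sim, axis=1)[:, :num_negatives]

rng = np.random.default_rng(0)
users = rng.normal(size=(8, 16))
items = rng.normal(size=(8, 16))
idx = hard_negative_indices(users, items, num_negatives=3)
# idx has shape (8, 3), and no row ever selects its own diagonal item
```

The triplet margin in `compute_loss` then pushes these mined negatives below the positive similarity by at least the margin.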
src/models/user_tower.py CHANGED
@@ -32,6 +32,27 @@ class UserTower(tf.keras.Model):
             2, embedding_dim // 16, name="gender_embedding"
         )
 
+        # New demographic embeddings
+        # Profession: 8 categories (Technology, Healthcare, Education, Finance, Retail, Manufacturing, Services, Other)
+        self.profession_embedding = tf.keras.layers.Embedding(
+            8, embedding_dim // 16, name="profession_embedding"
+        )
+
+        # Location: 3 categories (Urban, Suburban, Rural)
+        self.location_embedding = tf.keras.layers.Embedding(
+            3, embedding_dim // 16, name="location_embedding"
+        )
+
+        # Education Level: 5 categories (High School, Some College, Bachelor's, Master's, PhD+)
+        self.education_embedding = tf.keras.layers.Embedding(
+            5, embedding_dim // 16, name="education_embedding"
+        )
+
+        # Marital Status: 4 categories (Single, Married, Divorced, Widowed)
+        self.marital_embedding = tf.keras.layers.Embedding(
+            4, embedding_dim // 16, name="marital_embedding"
+        )
+
         # History aggregation layers
         self.history_attention = tf.keras.layers.MultiHeadAttention(
             num_heads=4,
@@ -57,12 +78,20 @@ class UserTower(tf.keras.Model):
         age = inputs["age"]  # Now categorical (0-5)
         gender = inputs["gender"]  # Categorical (0-1)
         income = inputs["income"]  # Now categorical (0-4)
+        profession = inputs["profession"]  # Categorical (0-7)
+        location = inputs["location"]  # Categorical (0-2)
+        education = inputs["education_level"]  # Categorical (0-4)
+        marital_status = inputs["marital_status"]  # Categorical (0-3)
         item_history = inputs["item_history_embeddings"]  # [batch_size, seq_len, emb_dim]
 
         # Process demographics through embeddings
         age_emb = self.age_embedding(age)  # [batch_size, embedding_dim//16]
         income_emb = self.income_embedding(income)  # [batch_size, embedding_dim//16]
         gender_emb = self.gender_embedding(gender)  # [batch_size, embedding_dim//16]
+        profession_emb = self.profession_embedding(profession)  # [batch_size, embedding_dim//16]
+        location_emb = self.location_embedding(location)  # [batch_size, embedding_dim//16]
+        education_emb = self.education_embedding(education)  # [batch_size, embedding_dim//16]
+        marital_emb = self.marital_embedding(marital_status)  # [batch_size, embedding_dim//16]
 
         # Aggregate item history using attention
         # Create attention mask for padding
@@ -84,6 +113,10 @@ class UserTower(tf.keras.Model):
             age_emb,
             income_emb,
             gender_emb,
+            profession_emb,
+            location_emb,
+            education_emb,
+            marital_emb,
             history_aggregated
         ], axis=-1)
 
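A quick way to see the effect of this change on the user tower's input width (an illustrative calculation using the `embedding_dim // 16` sizing from the diff; `embedding_dim = 128` is assumed here, not stated in this hunk): each demographic embedding is 8 units wide, so the demographic slice of the concatenated feature vector grows from 3 fields to 7.

```python
# Each demographic embedding is embedding_dim // 16 wide (per the diff).
embedding_dim = 128          # assumed value for illustration
demo_width = embedding_dim // 16

# Before this commit: age + income + gender
old_demo_features = 3 * demo_width
# After: + profession + location + education + marital status
new_demo_features = 7 * demo_width

# With embedding_dim = 128: demographics grow from 24 to 56 features,
# so the first Dense layer after tf.concat sees 32 more inputs.
```

Because the wider concat changes the first dense layer's input shape, checkpoints trained before this commit are not weight-compatible with the new tower.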
src/preprocessing/optimized_dataset_creator.py DELETED
@@ -1,111 +0,0 @@
- """
- Optimized dataset creation script with performance improvements.
- """
- import time
- import numpy as np
- from src.preprocessing.user_data_preparation import UserDatasetCreator
- from src.preprocessing.data_loader import DataProcessor, create_tf_dataset
-
-
- def create_optimized_dataset(max_history_length: int = 50,
-                              batch_size: int = 512,
-                              negative_samples_per_positive: int = 2,
-                              use_sample: bool = False,
-                              sample_size: int = 10000):
-     """
-     Create dataset with optimized performance settings.
-
-     Args:
-         max_history_length: Maximum user interaction history length
-         batch_size: Batch size for TensorFlow dataset
-         negative_samples_per_positive: Negative sampling ratio
-         use_sample: Whether to use a sample of the data for faster processing
-         sample_size: Size of sample if use_sample=True
-     """
-     print("Starting optimized dataset creation...")
-     start_time = time.time()
-
-     # Initialize with optimized settings
-     dataset_creator = UserDatasetCreator(max_history_length=max_history_length)
-     data_processor = DataProcessor()
-
-     # Load data
-     print("Loading data...")
-     load_start = time.time()
-     items_df, users_df, interactions_df = data_processor.load_data()
-     print(f"Data loaded in {time.time() - load_start:.2f} seconds")
-
-     # Optional: Use sample for faster development/testing
-     if use_sample:
-         print(f"Using sample of {sample_size} interactions for faster processing...")
-         sample_interactions = interactions_df.sample(min(sample_size, len(interactions_df)))
-         user_ids = set(sample_interactions['user_id'])
-         item_ids = set(sample_interactions['product_id'])
-
-         users_df = users_df[users_df['user_id'].isin(user_ids)]
-         items_df = items_df[items_df['product_id'].isin(item_ids)]
-         interactions_df = sample_interactions
-
-         print(f"Sample: {len(items_df)} items, {len(users_df)} users, {len(interactions_df)} interactions")
-
-     # Load embeddings with caching
-     print("Loading item embeddings...")
-     embed_start = time.time()
-     item_embeddings = dataset_creator.load_item_embeddings()
-     print(f"Embeddings loaded in {time.time() - embed_start:.2f} seconds")
-
-     # Create temporal split
-     print("Creating temporal split...")
-     split_start = time.time()
-     train_interactions, val_interactions = dataset_creator.create_temporal_split(interactions_df)
-     print(f"Temporal split created in {time.time() - split_start:.2f} seconds")
-
-     # Create training dataset with optimizations
-     print("Creating optimized training dataset...")
-     train_start = time.time()
-     training_features = dataset_creator.create_training_dataset(
-         train_interactions, items_df, users_df, item_embeddings,
-         negative_samples_per_positive=negative_samples_per_positive
-     )
-     print(f"Training dataset created in {time.time() - train_start:.2f} seconds")
-
-     # Create TensorFlow dataset optimized for CPU
-     print("Creating TensorFlow dataset...")
-     tf_start = time.time()
-     tf_dataset = create_tf_dataset(training_features, batch_size=batch_size)
-     print(f"TensorFlow dataset created in {time.time() - tf_start:.2f} seconds")
-
-     # Save optimized dataset
-     print("Saving dataset...")
-     save_start = time.time()
-     dataset_creator.save_dataset(training_features, "src/artifacts/")
-
-     # Save vocabularies for later use
-     data_processor.save_vocabularies("src/artifacts/")
-     print(f"Dataset saved in {time.time() - save_start:.2f} seconds")
-
-     total_time = time.time() - start_time
-     print(f"\nOptimized dataset creation completed in {total_time:.2f} seconds!")
-     print(f"Training samples: {len(training_features['rating'])}")
-     print(f"Memory usage optimized for CPU training")
-
-     return tf_dataset, training_features
-
-
- if __name__ == "__main__":
-     # Run with optimized settings
-     tf_dataset, features = create_optimized_dataset(
-         max_history_length=30,  # Reduced for speed
-         batch_size=512,  # Larger batches for CPU efficiency
-         negative_samples_per_positive=2,  # Reduced sampling ratio
-         use_sample=True,  # Use sample for development
-         sample_size=50000  # Reasonable sample size
-     )
-
-     print("\nDataset creation optimization complete!")
-     print("Key optimizations applied:")
-     print("- Vectorized DataFrame operations")
-     print("- Parallel negative sampling")
-     print("- Memory-efficient embedding lookup")
-     print("- Optimized TensorFlow dataset pipeline")
-     print("- LRU caching for embeddings")
 
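The deleted `create_optimized_dataset` above relied on a simple invariant when subsampling for development: after sampling interactions and then filtering `users_df`/`items_df` with `isin()`, every sampled (user, item) pair remains resolvable. A tiny self-contained illustration of that invariant on toy ids (all values hypothetical):

```python
import random

random.seed(0)
# Toy (user_id, product_id) interaction pairs
interactions = [(u, i) for u in range(5) for i in range(4)]
sample = random.sample(interactions, 6)

# Restrict users/items to those present in the sampled interactions,
# mirroring the isin() filtering in create_optimized_dataset
user_ids = {u for u, _ in sample}
item_ids = {i for _, i in sample}

# Every sampled pair still references a surviving user and item
assert all(u in user_ids and i in item_ids for u, i in sample)
print(len(sample), len(user_ids), len(item_ids))
```

The converse does not hold: a surviving user/item combination may have no sampled interaction, which is why the sampled interactions themselves become the new `interactions_df`.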
src/preprocessing/user_data_preparation.py CHANGED
@@ -45,6 +45,50 @@ class UserDatasetCreator:
         categories = np.clip(categories, 0, 4)

         return categories.astype(np.int32)
+
+    def categorize_profession(self, profession: str) -> int:
+        """Categorize profession into numeric categories."""
+        profession_map = {
+            "Technology": 0,
+            "Healthcare": 1,
+            "Education": 2,
+            "Finance": 3,
+            "Retail": 4,
+            "Manufacturing": 5,
+            "Services": 6,
+            "Other": 7
+        }
+        return profession_map.get(profession, 7)  # Default to "Other"
+
+    def categorize_location(self, location: str) -> int:
+        """Categorize location into numeric categories."""
+        location_map = {
+            "Urban": 0,
+            "Suburban": 1,
+            "Rural": 2
+        }
+        return location_map.get(location, 0)  # Default to "Urban"
+
+    def categorize_education_level(self, education: str) -> int:
+        """Categorize education level into numeric categories."""
+        education_map = {
+            "High School": 0,
+            "Some College": 1,
+            "Bachelor's": 2,
+            "Master's": 3,
+            "PhD+": 4
+        }
+        return education_map.get(education, 0)  # Default to "High School"
+
+    def categorize_marital_status(self, marital_status: str) -> int:
+        """Categorize marital status into numeric categories."""
+        marital_map = {
+            "Single": 0,
+            "Married": 1,
+            "Divorced": 2,
+            "Widowed": 3
+        }
+        return marital_map.get(marital_status, 0)  # Default to "Single"

    @lru_cache(maxsize=1)
    def load_item_embeddings(self, embeddings_path: str = "src/artifacts/item_embeddings.npy") -> Dict[int, np.ndarray]:
@@ -155,6 +199,12 @@
        # Categorize income (5 percentile-based categories)
        user_demographics['income_category'] = self.categorize_income(user_demographics['income'])

+       # Categorize new demographic features
+       user_demographics['profession_category'] = user_demographics['profession'].apply(self.categorize_profession)
+       user_demographics['location_category'] = user_demographics['location'].apply(self.categorize_location)
+       user_demographics['education_category'] = user_demographics['education_level'].apply(self.categorize_education_level)
+       user_demographics['marital_category'] = user_demographics['marital_status'].apply(self.categorize_marital_status)
+
        # Create mapping from user_id to array index
        user_id_to_index = {uid: idx for idx, uid in enumerate(user_demographics['user_id'])}
@@ -165,6 +215,10 @@
            'age': user_demographics['age_category'].values.astype(np.int32),  # Categorical age
            'gender': user_demographics['gender_numeric'].values.astype(np.int32),
            'income': user_demographics['income_category'].values.astype(np.int32),  # Categorical income
+           'profession': user_demographics['profession_category'].values.astype(np.int32),  # Categorical profession
+           'location': user_demographics['location_category'].values.astype(np.int32),  # Categorical location
+           'education_level': user_demographics['education_category'].values.astype(np.int32),  # Categorical education
+           'marital_status': user_demographics['marital_category'].values.astype(np.int32),  # Categorical marital status
            'item_history_embeddings': np.array([
                user_aggregated_embeddings[uid] for uid in user_demographics['user_id']
            ]).astype(np.float32)
@@ -173,6 +227,10 @@
        print(f"Prepared user features for {len(valid_users)} users")
        print(f"Age categories: {np.unique(user_features['age'], return_counts=True)}")
        print(f"Income categories: {np.unique(user_features['income'], return_counts=True)}")
+       print(f"Profession categories: {np.unique(user_features['profession'], return_counts=True)}")
+       print(f"Location categories: {np.unique(user_features['location'], return_counts=True)}")
+       print(f"Education categories: {np.unique(user_features['education_level'], return_counts=True)}")
+       print(f"Marital status categories: {np.unique(user_features['marital_status'], return_counts=True)}")
        print(f"History embeddings shape: {user_features['item_history_embeddings'].shape}")

        return user_features
@@ -256,6 +314,10 @@
        training_features['age'] = user_features['age'][user_indices]
        training_features['gender'] = user_features['gender'][user_indices]
        training_features['income'] = user_features['income'][user_indices]
+       training_features['profession'] = user_features['profession'][user_indices]
+       training_features['location'] = user_features['location'][user_indices]
+       training_features['education_level'] = user_features['education_level'][user_indices]
+       training_features['marital_status'] = user_features['marital_status'][user_indices]
        training_features['item_history_embeddings'] = user_features['item_history_embeddings'][user_indices]

        # Item features for each pair
@@ -412,10 +474,20 @@ def prepare_user_features(users_df: pd.DataFrame,
        user_idx = users_df[users_df['user_id'] == user_id].index[0]
        income_cat = income_categories[user_idx]

+       # Get new demographic features from the row
+       profession_cat = creator.categorize_profession(user_row.get('profession', 'Other'))
+       location_cat = creator.categorize_location(user_row.get('location', 'Urban'))
+       education_cat = creator.categorize_education_level(user_row.get('education_level', 'High School'))
+       marital_cat = creator.categorize_marital_status(user_row.get('marital_status', 'Single'))
+
        user_feature_dict[user_id] = {
            'age': age_cat,
            'gender': gender_cat,
            'income': income_cat,
+           'profession': profession_cat,
+           'location': location_cat,
+           'education_level': education_cat,
+           'marital_status': marital_cat,
            'item_history_embeddings': user_aggregated_embeddings[user_id]
        }
 
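The new categorizers added to `user_data_preparation.py` are plain dictionary lookups with an explicit fallback id, so unseen strings never raise and always land in a well-defined bucket. The same method, extracted from the diff as a standalone function (minus `self`) to show the fallback behavior:

```python
def categorize_profession(profession: str) -> int:
    """Map a profession string to a compact integer id.

    Unknown or missing values fall back to the "Other" bucket (7),
    matching the .get(..., 7) default in UserDatasetCreator.
    """
    profession_map = {
        "Technology": 0,
        "Healthcare": 1,
        "Education": 2,
        "Finance": 3,
        "Retail": 4,
        "Manufacturing": 5,
        "Services": 6,
        "Other": 7
    }
    return profession_map.get(profession, 7)

print(categorize_profession("Finance"))     # 3
print(categorize_profession("Astronaut"))   # 7 -> unseen value maps to "Other"
```

The location, education, and marital-status categorizers follow the identical pattern with their own default buckets ("Urban", "High School", "Single"), which is what makes the zero-interaction (new user) path safe: demographics alone always produce valid categorical ids.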
src/training/curriculum_trainer.py DELETED
@@ -1,341 +0,0 @@
- #!/usr/bin/env python3
- """
- Curriculum learning trainer for the improved two-tower model.
- Implements progressive difficulty training for better convergence.
- """
-
- import tensorflow as tf
- import numpy as np
- import pickle
- import os
- import time
- from typing import Dict, List, Tuple
-
- from src.models.improved_two_tower import create_improved_model
- from src.preprocessing.data_loader import DataProcessor
-
-
- class CurriculumTrainer:
-     """Trainer with curriculum learning for improved two-tower model."""
-
-     def __init__(self,
-                  embedding_dim: int = 128,
-                  learning_rate: float = 0.001,
-                  use_focal_loss: bool = True,
-                  curriculum_stages: int = 3):
-
-         self.embedding_dim = embedding_dim
-         self.learning_rate = learning_rate
-         self.use_focal_loss = use_focal_loss
-         self.curriculum_stages = curriculum_stages
-
-         self.data_processor = None
-         self.model = None
-
-     def load_data_processor(self, artifacts_path: str = "src/artifacts/"):
-         """Load data processor with vocabularies."""
-         self.data_processor = DataProcessor()
-         self.data_processor.load_vocabularies(f"{artifacts_path}/vocabularies.pkl")
-         print("Data processor loaded successfully")
-
-     def create_model(self):
-         """Create improved two-tower model."""
-         if self.data_processor is None:
-             raise ValueError("Data processor must be loaded first")
-
-         self.model = create_improved_model(
-             data_processor=self.data_processor,
-             embedding_dim=self.embedding_dim,
-             use_bias=True,
-             use_focal_loss=self.use_focal_loss
-         )
-
-         # Compile model
-         self.model.compile(
-             optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate)
-         )
-
-         print("Improved two-tower model created successfully")
-
-     def _create_curriculum_stages(self, features: Dict[str, np.ndarray]) -> List[Dict[str, np.ndarray]]:
-         """Create curriculum stages based on interaction complexity."""
-
-         # Calculate interaction history lengths for curriculum
-         history_lengths = []
-         for i in range(len(features['age'])):
-             hist = features['item_history_embeddings'][i]
-             # Count non-zero embeddings
-             length = np.sum(np.any(hist != 0, axis=1))
-             history_lengths.append(length)
-
-         history_lengths = np.array(history_lengths)
-
-         # Create stages based on history length percentiles
-         stages = []
-
-         if self.curriculum_stages == 3:
-             # Stage 1: Simple cases (short or no history)
-             stage1_mask = history_lengths <= np.percentile(history_lengths, 33)
-
-             # Stage 2: Medium complexity (medium history)
-             stage2_mask = (history_lengths > np.percentile(history_lengths, 33)) & \
-                           (history_lengths <= np.percentile(history_lengths, 67))
-
-             # Stage 3: Complex cases (long history)
-             stage3_mask = history_lengths > np.percentile(history_lengths, 67)
-
-             masks = [stage1_mask, stage2_mask, stage3_mask]
-             stage_names = ["Simple (short history)", "Medium (moderate history)", "Complex (long history)"]
-
-         else:
-             # Flexible number of stages
-             percentiles = np.linspace(0, 100, self.curriculum_stages + 1)
-             masks = []
-             stage_names = []
-
-             for i in range(self.curriculum_stages):
-                 if i == 0:
-                     mask = history_lengths <= np.percentile(history_lengths, percentiles[i+1])
-                     stage_names.append(f"Stage {i+1} (≀{percentiles[i+1]:.0f}%ile)")
-                 elif i == self.curriculum_stages - 1:
-                     mask = history_lengths > np.percentile(history_lengths, percentiles[i])
-                     stage_names.append(f"Stage {i+1} (>{percentiles[i]:.0f}%ile)")
-                 else:
-                     mask = (history_lengths > np.percentile(history_lengths, percentiles[i])) & \
-                            (history_lengths <= np.percentile(history_lengths, percentiles[i+1]))
-                     stage_names.append(f"Stage {i+1} ({percentiles[i]:.0f}-{percentiles[i+1]:.0f}%ile)")
-
-                 masks.append(mask)
-
-         # Create stage datasets
-         for i, (mask, name) in enumerate(zip(masks, stage_names)):
-             stage_features = {}
-             for key, values in features.items():
-                 stage_features[key] = values[mask]
-
-             print(f"  Stage {i+1} ({name}): {np.sum(mask)} samples")
-             stages.append(stage_features)
-
-         return stages
-
-     def _create_tf_dataset(self, features: Dict[str, np.ndarray],
-                            batch_size: int = 256,
-                            shuffle: bool = True) -> tf.data.Dataset:
-         """Create TensorFlow dataset from features."""
-
-         dataset = tf.data.Dataset.from_tensor_slices(features)
-
-         if shuffle:
-             dataset = dataset.shuffle(buffer_size=10000)
-
-         dataset = dataset.batch(batch_size, drop_remainder=False)
-         dataset = dataset.prefetch(tf.data.AUTOTUNE)
-
-         return dataset
-
-     def train_with_curriculum(self,
-                               training_features: Dict[str, np.ndarray],
-                               validation_features: Dict[str, np.ndarray],
-                               epochs_per_stage: int = 10,
-                               batch_size: int = 256) -> Dict:
-         """Train model using curriculum learning."""
-
-         print(f"πŸŽ“ CURRICULUM LEARNING TRAINING")
-         print(f"Stages: {self.curriculum_stages} | Epochs per stage: {epochs_per_stage}")
-         print("="*70)
-
-         # Create curriculum stages
-         print("\nπŸ“š Creating curriculum stages...")
-         training_stages = self._create_curriculum_stages(training_features)
-
-         # Training history
-         history = {
-             'stage_losses': [],
-             'stage_val_losses': [],
-             'stage_times': [],
-             'total_loss': [],
-             'rating_loss': [],
-             'retrieval_loss': [],
-             'contrastive_loss': [],
-             'val_total_loss': [],
-             'val_rating_loss': [],
-             'val_retrieval_loss': []
-         }
-
-         # Validation dataset (constant across stages)
-         val_dataset = self._create_tf_dataset(validation_features, batch_size, shuffle=False)
-
-         total_start_time = time.time()
-
-         # Train through curriculum stages
-         for stage_idx, stage_features in enumerate(training_stages):
-             stage_start_time = time.time()
-
-             print(f"\n🎯 STAGE {stage_idx + 1}/{self.curriculum_stages}")
-             print(f"Training samples: {len(stage_features['rating'])}")
-
-             # Create training dataset for this stage
-             train_dataset = self._create_tf_dataset(stage_features, batch_size, shuffle=True)
-
-             # Adaptive learning rate (decrease as stages progress)
-             stage_lr = self.learning_rate * (0.8 ** stage_idx)
-             self.model.optimizer.learning_rate.assign(stage_lr)
-             print(f"Learning rate: {stage_lr:.6f}")
-
-             # Train on this stage
-             stage_history = {'loss': [], 'val_loss': []}
-
-             for epoch in range(epochs_per_stage):
-                 epoch_start = time.time()
-
-                 # Training step
-                 train_losses = []
-                 for batch in train_dataset:
-                     with tf.GradientTape() as tape:
-                         loss_dict = self.model.compute_loss(batch, training=True)
-                         total_loss = loss_dict['total_loss']
-
-                     gradients = tape.gradient(total_loss, self.model.trainable_variables)
-                     self.model.optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))
-
-                     train_losses.append({k: v.numpy() for k, v in loss_dict.items()})
-
-                 # Average training losses
-                 avg_train_loss = {}
-                 for key in train_losses[0].keys():
-                     avg_train_loss[key] = np.mean([loss[key] for loss in train_losses])
-
-                 # Validation step
-                 val_losses = []
-                 for batch in val_dataset:
-                     loss_dict = self.model.compute_loss(batch, training=False)
-                     val_losses.append({k: v.numpy() for k, v in loss_dict.items()})
-
-                 # Average validation losses
-                 avg_val_loss = {}
-                 for key in val_losses[0].keys():
-                     avg_val_loss[key] = np.mean([loss[key] for loss in val_losses])
-
-                 # Record epoch results
-                 stage_history['loss'].append(avg_train_loss['total_loss'])
-                 stage_history['val_loss'].append(avg_val_loss['total_loss'])
-
-                 # Add to overall history
-                 for key in ['total_loss', 'rating_loss', 'retrieval_loss', 'contrastive_loss']:
-                     history[key].append(avg_train_loss[key])
-                     history[f'val_{key}'].append(avg_val_loss[key])
-
-                 epoch_time = time.time() - epoch_start
-                 print(f"  Epoch {epoch+1:2d}/{epochs_per_stage} | "
-                       f"Loss: {avg_train_loss['total_loss']:.4f} | "
-                       f"Val: {avg_val_loss['total_loss']:.4f} | "
-                       f"Time: {epoch_time:.1f}s")
-
-             stage_time = time.time() - stage_start_time
-
-             # Record stage results
-             history['stage_losses'].append(stage_history['loss'])
-             history['stage_val_losses'].append(stage_history['val_loss'])
-             history['stage_times'].append(stage_time)
-
-             print(f"βœ… Stage {stage_idx + 1} completed in {stage_time:.1f}s")
-
-             # Save intermediate model after each stage
-             self.save_model(f"src/artifacts/", suffix=f"_stage_{stage_idx + 1}")
-
-         total_time = time.time() - total_start_time
-
-         print(f"\nπŸŽ“ CURRICULUM TRAINING COMPLETED!")
-         print(f"Total time: {total_time:.1f}s")
-         print(f"Average time per stage: {np.mean(history['stage_times']):.1f}s")
-
-         return history
-
-     def save_model(self, save_path: str = "src/artifacts/", suffix: str = ""):
-         """Save the trained model."""
-         os.makedirs(save_path, exist_ok=True)
-
-         # Save model weights
-         self.model.user_tower.save_weights(f"{save_path}/improved_user_tower_weights{suffix}")
-         self.model.item_tower.save_weights(f"{save_path}/improved_item_tower_weights{suffix}")
-         self.model.rating_model.save_weights(f"{save_path}/improved_rating_model_weights{suffix}")
-
-         # Save temperature parameter
-         temp_value = self.model.temperature_similarity.temperature.numpy()
-         with open(f"{save_path}/temperature_value{suffix}.txt", 'w') as f:
-             f.write(str(temp_value))
-
-         # Save configuration
-         config = {
-             'embedding_dim': self.embedding_dim,
-             'learning_rate': self.learning_rate,
-             'use_focal_loss': self.use_focal_loss,
-             'curriculum_stages': self.curriculum_stages
-         }
-
-         with open(f"{save_path}/curriculum_model_config{suffix}.txt", 'w') as f:
-             for key, value in config.items():
-                 f.write(f"{key}: {value}\n")
-
-         if not suffix:
-             print(f"Model saved to {save_path}")
-
-
- def main():
-     """Main function for curriculum training."""
-
-     print("πŸš€ INITIALIZING CURRICULUM TRAINER")
-
-     # Initialize trainer
-     trainer = CurriculumTrainer(
-         embedding_dim=128,
-         learning_rate=0.001,
-         use_focal_loss=True,
-         curriculum_stages=3
-     )
-
-     # Load data processor
-     print("Loading data processor...")
-     trainer.load_data_processor()
-
-     # Create improved model
-     print("Creating improved two-tower model...")
-     trainer.create_model()
-
-     # Load training data
-     print("Loading training data...")
-     with open("src/artifacts/training_features.pkl", 'rb') as f:
-         training_features = pickle.load(f)
-
-     with open("src/artifacts/validation_features.pkl", 'rb') as f:
-         validation_features = pickle.load(f)
-
-     print(f"Training samples: {len(training_features['rating'])}")
-     print(f"Validation samples: {len(validation_features['rating'])}")
-
-     # Train with curriculum learning
-     start_time = time.time()
-
-     history = trainer.train_with_curriculum(
-         training_features=training_features,
-         validation_features=validation_features,
-         epochs_per_stage=15,
-         batch_size=512
-     )
-
-     total_time = time.time() - start_time
-
-     # Save final model and history
-     print("Saving final model...")
-     trainer.save_model()
-
-     with open("src/artifacts/curriculum_training_history.pkl", 'wb') as f:
-         pickle.dump(history, f)
-
-     print(f"\nβœ… CURRICULUM TRAINING COMPLETED!")
-     print(f"Total training time: {total_time:.1f}s")
-     print(f"All improvements implemented successfully!")
-
-
- if __name__ == "__main__":
-     main()
 
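The deleted `_create_curriculum_stages` split users into three stages by the 33rd/67th percentiles of interaction-history length, so every sample lands in exactly one stage. A minimal NumPy sketch of that partitioning on a toy set of hypothetical history lengths:

```python
import numpy as np

# Hypothetical per-user history lengths for a toy batch
history_lengths = np.array([0, 1, 2, 3, 5, 8, 13, 21, 34])

# Three stages split at the 33rd/67th percentiles, as in _create_curriculum_stages
p33 = np.percentile(history_lengths, 33)
p67 = np.percentile(history_lengths, 67)
stage1 = history_lengths <= p33                              # simple: short or no history
stage2 = (history_lengths > p33) & (history_lengths <= p67)  # medium complexity
stage3 = history_lengths > p67                               # complex: long history

# The three boolean masks partition the samples: disjoint and exhaustive
assert (stage1.astype(int) + stage2.astype(int) + stage3.astype(int)).sum() == len(history_lengths)
print(stage1.sum(), stage2.sum(), stage3.sum())  # 3 3 3
```

The trainer then iterated the stages in order, shrinking the learning rate per stage (`stage_lr = learning_rate * 0.8 ** stage_idx`), so the model saw easy short-history users before long-history ones.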
src/training/fast_joint_training.py DELETED
@@ -1,268 +0,0 @@
- """
- Fast joint training with key optimizations for CPU performance.
- """
- import tensorflow as tf
- import numpy as np
- import pickle
- import os
- import time
- from typing import Dict
-
- from src.models.item_tower import ItemTower
- from src.models.user_tower import UserTower, TwoTowerModel
- from src.preprocessing.data_loader import DataProcessor
-
-
- class FastJointTrainer:
-     """Simplified fast joint training optimized for CPU."""
-
-     def __init__(self):
-         self.item_tower = None
-         self.user_tower = None
-         self.model = None
-
-         # Optimized hyperparameters for fast training
-         self.user_lr = 0.003
-         self.item_lr = 0.0003
-         self.batch_size = 2048  # Large batch for efficiency
-         self.epochs = 20  # Reduced epochs
-
-     def load_components(self):
-         """Load all required components."""
-         print("Loading components...")
-
-         # Load data processor
-         data_processor = DataProcessor()
-         data_processor.load_vocabularies("src/artifacts/vocabularies.pkl")
-
-         # Load item tower config
-         with open("src/artifacts/item_tower_config.txt", 'r') as f:
-             config = {}
-             for line in f:
-                 key, value = line.strip().split(': ')
-                 if key in ['embedding_dim', 'dropout_rate']:
-                     config[key] = float(value) if '.' in value else int(value)
-                 elif key == 'hidden_dims':
-                     config[key] = eval(value)
-
-         # Build item tower
-         self.item_tower = ItemTower(
-             item_vocab_size=len(data_processor.item_vocab),
-             category_vocab_size=len(data_processor.category_vocab),
-             brand_vocab_size=len(data_processor.brand_vocab),
-             **config
-         )
-
-         # Initialize and load weights
-         dummy_input = {
-             'product_id': tf.constant([0]),
-             'category_id': tf.constant([0]),
-             'brand_id': tf.constant([0]),
-             'price': tf.constant([0.0])
-         }
-         _ = self.item_tower(dummy_input)
-         self.item_tower.load_weights("src/artifacts/item_tower_weights")
-
-         # Build user tower (simplified)
-         self.user_tower = UserTower(
-             max_history_length=50,
-             embedding_dim=128,  # Updated to 128D
-             hidden_dims=[64],  # Simplified architecture
-             dropout_rate=0.1
-         )
-
-         # Build complete model
-         self.model = TwoTowerModel(
-             item_tower=self.item_tower,
-             user_tower=self.user_tower,
-             rating_weight=1.0,
-             retrieval_weight=0.2  # Reduced for faster training
-         )
-
-         print("Components loaded successfully")
-
-     def create_fast_dataset(self, features: Dict, is_training: bool = True):
-         """Create optimized dataset pipeline."""
-         dataset = tf.data.Dataset.from_tensor_slices(features)
-
-         if is_training:
-             dataset = dataset.shuffle(buffer_size=5000)
-             dataset = dataset.repeat()
-
-         dataset = dataset.batch(self.batch_size, drop_remainder=True)
-         dataset = dataset.prefetch(2)  # Conservative prefetch for CPU
-
-         return dataset
-
-     def train_fast(self, training_features: Dict, validation_features: Dict):
-         """Fast training loop with key optimizations."""
-
-         print(f"Starting fast training: {self.epochs} epochs, batch size {self.batch_size}")
-
-         # Setup datasets
-         steps_per_epoch = len(training_features['rating']) // self.batch_size
-         val_steps = len(validation_features['rating']) // self.batch_size
-
-         train_ds = self.create_fast_dataset(training_features, is_training=True)
-         val_ds = self.create_fast_dataset(validation_features, is_training=False)
-
-         # Note: Age and income are now categorical - no normalization needed
-
-         # Setup optimizers
-         user_optimizer = tf.keras.optimizers.Adam(learning_rate=self.user_lr)
-         item_optimizer = tf.keras.optimizers.Adam(learning_rate=self.item_lr)
-
-         # Training loop
-         train_iter = iter(train_ds)
-         val_iter = iter(val_ds)
-
-         best_val_loss = float('inf')
-
-         for epoch in range(self.epochs):
-             epoch_start = time.time()
-
-             # Progressive unfreezing - simple strategy
-             train_item = epoch >= (self.epochs // 4)  # Unfreeze after 25%
-
-             print(f"Epoch {epoch+1}/{self.epochs} - Item training: {'ON' if train_item else 'OFF'}")
-
-             # Training
-             train_losses = []
-             for step in range(steps_per_epoch):
-                 try:
-                     batch = next(train_iter)
-                 except StopIteration:
-                     train_iter = iter(train_ds)
-                     batch = next(train_iter)
-
-                 with tf.GradientTape() as tape:
-                     # Forward pass
-                     user_emb = self.user_tower(batch, training=True)
-                     item_emb = self.item_tower(batch, training=True)
-
-                     # Rating prediction
-                     concat_emb = tf.concat([user_emb, item_emb], axis=-1)
-                     rating_pred = self.model.rating_model(concat_emb, training=True)
-
-                     # Simple loss calculation
-                     rating_loss = tf.keras.losses.binary_crossentropy(
-                         batch["rating"], tf.squeeze(rating_pred)
-                     )
-                     rating_loss = tf.reduce_mean(rating_loss)
-
-                     # Simplified retrieval loss
-                     similarity = tf.reduce_sum(user_emb * item_emb, axis=1)
-                     retrieval_loss = tf.keras.losses.binary_crossentropy(
-                         batch["rating"], tf.nn.sigmoid(similarity)
-                     )
-                     retrieval_loss = tf.reduce_mean(retrieval_loss)
-
-                     total_loss = rating_loss + 0.2 * retrieval_loss
-
-                 # Gradient computation and application
-                 if train_item:
-                     # Train both towers
-                     user_vars = self.user_tower.trainable_variables + self.model.rating_model.trainable_variables
-                     item_vars = self.item_tower.trainable_variables
-                     all_vars = user_vars + item_vars
-
-                     grads = tape.gradient(total_loss, all_vars)
-                     user_grads = grads[:len(user_vars)]
-                     item_grads = grads[len(user_vars):]
-
-                     user_optimizer.apply_gradients(zip(user_grads, user_vars))
-                     item_optimizer.apply_gradients(zip(item_grads, item_vars))
-                 else:
-                     # Train only user tower
-                     user_vars = self.user_tower.trainable_variables + self.model.rating_model.trainable_variables
-                     grads = tape.gradient(total_loss, user_vars)
-                     user_optimizer.apply_gradients(zip(grads, user_vars))
-
-                 train_losses.append(total_loss.numpy())
-
-             # Validation
-             val_losses = []
-             for step in range(val_steps):
-                 try:
-                     batch = next(val_iter)
-                 except StopIteration:
-                     val_iter = iter(val_ds)
-                     batch = next(val_iter)
-
-                 user_emb = self.user_tower(batch, training=False)
-                 item_emb = self.item_tower(batch, training=False)
-
-                 concat_emb = tf.concat([user_emb, item_emb], axis=-1)
-                 rating_pred = self.model.rating_model(concat_emb, training=False)
-
-                 rating_loss = tf.reduce_mean(
-                     tf.keras.losses.binary_crossentropy(batch["rating"], tf.squeeze(rating_pred))
-                 )
-
-                 similarity = tf.reduce_sum(user_emb * item_emb, axis=1)
-                 retrieval_loss = tf.reduce_mean(
-                     tf.keras.losses.binary_crossentropy(batch["rating"], tf.nn.sigmoid(similarity))
-                 )
-
-                 total_loss = rating_loss + 0.2 * retrieval_loss
-                 val_losses.append(total_loss.numpy())
-
-             # Calculate averages
-             avg_train_loss = np.mean(train_losses)
-             avg_val_loss = np.mean(val_losses)
-             epoch_time = time.time() - epoch_start
-
-             print(f"Time: {epoch_time:.1f}s | Train: {avg_train_loss:.4f} | Val: {avg_val_loss:.4f}")
-
-             # Save best model
-             if avg_val_loss < best_val_loss:
-                 best_val_loss = avg_val_loss
-                 self.save_model("_best")
-
-         print("Fast training completed!")
-
-     def save_model(self, suffix=""):
-         """Save trained model."""
-         save_path = "src/artifacts/"
-
-         self.user_tower.save_weights(f"{save_path}/user_tower_weights{suffix}")
-         self.item_tower.save_weights(f"{save_path}/item_tower_weights_finetuned{suffix}")
-         self.model.rating_model.save_weights(f"{save_path}/rating_model_weights{suffix}")
-
-         if not suffix:
-             print("Model saved successfully")
-
-
- def main():
-     """Main function for fast joint training."""
-
-     print("=== Fast Joint Training ===")
-
-     # Initialize trainer
-     trainer = FastJointTrainer()
-     trainer.load_components()
-
-     # Load training data
-     print("Loading training data...")
-     with open("src/artifacts/training_features.pkl", 'rb') as f:
-         training_features = pickle.load(f)
-
-     with open("src/artifacts/validation_features.pkl", 'rb') as f:
-         validation_features = pickle.load(f)
-
-     print(f"Training samples: {len(training_features['rating']):,}")
-     print(f"Validation samples: {len(validation_features['rating']):,}")
-
-     # Start training
-     start_time = time.time()
-     trainer.train_fast(training_features, validation_features)
-
-     total_time = time.time() - start_time
-     trainer.save_model()
-
-     print(f"\\nTraining completed in {total_time:.1f} seconds!")
-     print(f"Average time per epoch: {total_time/trainer.epochs:.1f}s")
-
-
- if __name__ == "__main__":
268
- main()
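The deleted trainer above combines a pointwise rating loss with a dot-product retrieval loss at a fixed 1 : 0.2 weighting. A framework-free NumPy sketch of that objective (names like `joint_loss` and `rating_pred` are illustrative, not from the repo):

```python
import numpy as np

def bce(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a batch."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def joint_loss(rating_true, rating_pred, user_emb, item_emb, retrieval_weight=0.2):
    """Rating BCE plus sigmoid(dot-product) retrieval BCE, as in the deleted loop."""
    rating_loss = bce(rating_true, rating_pred)
    similarity = np.sum(user_emb * item_emb, axis=1)  # per-example dot products
    retrieval_loss = bce(rating_true, sigmoid(similarity))
    return rating_loss + retrieval_weight * retrieval_loss
```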
src/training/improved_joint_training.py DELETED
@@ -1,462 +0,0 @@
- #!/usr/bin/env python3
- """
- Improved joint training with hard negative mining, curriculum learning, and better optimization.
- """
-
- import tensorflow as tf
- import numpy as np
- import pickle
- import os
- from typing import Dict, List, Tuple, Optional
- import time
- from collections import defaultdict
-
- from src.models.improved_two_tower import create_improved_model
- from src.preprocessing.data_loader import DataProcessor, create_tf_dataset
-
-
- class HardNegativeSampler:
-     """Hard negative sampling strategy for better training."""
-
-     def __init__(self, model, item_embeddings, sampling_strategy='mixed'):
-         self.model = model
-         self.item_embeddings = item_embeddings  # Pre-computed item embeddings
-         self.sampling_strategy = sampling_strategy
-
-     def sample_hard_negatives(self, user_embeddings, positive_items, k_hard=2, k_random=2):
-         """Sample hard negatives based on user-item similarity."""
-         batch_size = tf.shape(user_embeddings)[0]
-
-         # Compute similarities between users and all items
-         similarities = tf.linalg.matmul(user_embeddings, self.item_embeddings, transpose_b=True)
-
-         # Mask out positive items
-         positive_mask = tf.one_hot(positive_items, depth=tf.shape(self.item_embeddings)[0])
-         similarities = similarities - positive_mask * 1e9  # Large negative value
-
-         # Get top-k similar items (hard negatives)
-         _, hard_negative_indices = tf.nn.top_k(similarities, k=k_hard)
-
-         # Sample random negatives
-         total_items = tf.shape(self.item_embeddings)[0]
-         random_negatives = tf.random.uniform(
-             shape=[batch_size, k_random],
-             minval=0,
-             maxval=total_items,
-             dtype=tf.int32
-         )
-
-         # Combine hard and random negatives
-         if self.sampling_strategy == 'hard':
-             return hard_negative_indices
-         elif self.sampling_strategy == 'random':
-             return random_negatives
-         else:  # mixed
-             return tf.concat([hard_negative_indices, random_negatives], axis=1)
-
-
- class CurriculumLearningScheduler:
-     """Curriculum learning scheduler for progressive difficulty."""
-
-     def __init__(self, total_epochs, warmup_epochs=10):
-         self.total_epochs = total_epochs
-         self.warmup_epochs = warmup_epochs
-
-     def get_difficulty_schedule(self, epoch):
-         """Get curriculum parameters for current epoch."""
-         if epoch < self.warmup_epochs:
-             # Easy phase: more random negatives, lower temperature
-             hard_negative_ratio = 0.2
-             temperature = 2.0
-             negative_samples = 2
-         elif epoch < self.total_epochs * 0.6:
-             # Medium phase: balanced negatives
-             hard_negative_ratio = 0.5
-             temperature = 1.0
-             negative_samples = 4
-         else:
-             # Hard phase: more hard negatives, higher temperature
-             hard_negative_ratio = 0.8
-             temperature = 0.5
-             negative_samples = 6
-
-         return {
-             'hard_negative_ratio': hard_negative_ratio,
-             'temperature': temperature,
-             'negative_samples': negative_samples
-         }
-
-
- class ImprovedJointTrainer:
-     """Enhanced joint trainer with advanced techniques."""
-
-     def __init__(self,
-                  embedding_dim: int = 128,
-                  learning_rate: float = 0.001,
-                  use_mixed_precision: bool = True,
-                  use_curriculum_learning: bool = True,
-                  use_hard_negatives: bool = True):
-
-         self.embedding_dim = embedding_dim
-         self.learning_rate = learning_rate
-         self.use_mixed_precision = use_mixed_precision
-         self.use_curriculum_learning = use_curriculum_learning
-         self.use_hard_negatives = use_hard_negatives
-
-         # Enable mixed precision if requested
-         if use_mixed_precision:
-             policy = tf.keras.mixed_precision.Policy('mixed_float16')
-             tf.keras.mixed_precision.set_global_policy(policy)
-
-         self.model = None
-         self.data_processor = None
-         self.curriculum_scheduler = None
-         self.hard_negative_sampler = None
-
-     def setup_model(self, data_processor: DataProcessor):
-         """Setup the improved model."""
-         self.data_processor = data_processor
-
-         # Create improved model
-         self.model = create_improved_model(
-             data_processor=data_processor,
-             embedding_dim=self.embedding_dim,
-             use_bias=True,
-             use_focal_loss=True
-         )
-
-         print(f"Created improved two-tower model with {self.embedding_dim}D embeddings")
-
-     def setup_curriculum_learning(self, total_epochs: int):
-         """Setup curriculum learning scheduler."""
-         if self.use_curriculum_learning:
-             self.curriculum_scheduler = CurriculumLearningScheduler(
-                 total_epochs=total_epochs,
-                 warmup_epochs=max(5, total_epochs // 10)
-             )
-             print("Curriculum learning enabled")
-
-     def setup_hard_negative_sampling(self, item_features: Dict[str, np.ndarray]):
-         """Setup hard negative sampling."""
-         if self.use_hard_negatives:
-             # Pre-compute item embeddings for efficient hard negative sampling
-             print("Pre-computing item embeddings for hard negative sampling...")
-
-             # Create a dummy batch to get item embeddings
-             batch_size = 1000
-             total_items = len(item_features['product_id'])
-
-             item_embeddings_list = []
-             for i in range(0, total_items, batch_size):
-                 end_idx = min(i + batch_size, total_items)
-                 batch_features = {
-                     key: tf.constant(value[i:end_idx])
-                     for key, value in item_features.items()
-                 }
-
-                 item_emb_output = self.model.item_tower(batch_features, training=False)
-                 if isinstance(item_emb_output, tuple):
-                     item_emb = item_emb_output[0]  # Get embeddings, ignore bias
-                 else:
-                     item_emb = item_emb_output
-
-                 item_embeddings_list.append(item_emb.numpy())
-
-             item_embeddings = np.vstack(item_embeddings_list)
-
-             self.hard_negative_sampler = HardNegativeSampler(
-                 model=self.model,
-                 item_embeddings=tf.constant(item_embeddings, dtype=tf.float32),
-                 sampling_strategy='mixed'
-             )
-             print(f"Hard negative sampling enabled with {len(item_embeddings)} items")
-
-     def create_advanced_training_dataset(self,
-                                          features: Dict[str, np.ndarray],
-                                          batch_size: int = 256,
-                                          epoch: int = 0) -> tf.data.Dataset:
-         """Create training dataset with curriculum learning and hard negatives."""
-
-         # Get curriculum parameters
-         if self.curriculum_scheduler:
-             curriculum_params = self.curriculum_scheduler.get_difficulty_schedule(epoch)
-             print(f"Epoch {epoch}: {curriculum_params}")
-         else:
-             curriculum_params = {
-                 'hard_negative_ratio': 0.5,
-                 'temperature': 1.0,
-                 'negative_samples': 4
-             }
-
-         # Filter data based on curriculum (start with easier examples)
-         if epoch < 5:  # Warmup epochs - use only high-confidence positive examples
-             positive_mask = features['rating'] == 1.0
-             if np.sum(positive_mask) > 0:
-                 # Sample subset of positives and all negatives
-                 positive_indices = np.where(positive_mask)[0]
-                 negative_indices = np.where(features['rating'] == 0.0)[0]
-
-                 # Sample subset for easier learning
-                 n_positive_samples = min(len(positive_indices), len(negative_indices))
-                 selected_positive = np.random.choice(
-                     positive_indices, size=n_positive_samples, replace=False
-                 )
-                 selected_negative = np.random.choice(
-                     negative_indices, size=n_positive_samples, replace=False
-                 )
-
-                 selected_indices = np.concatenate([selected_positive, selected_negative])
-                 np.random.shuffle(selected_indices)
-
-                 # Filter features
-                 filtered_features = {
-                     key: value[selected_indices] for key, value in features.items()
-                 }
-             else:
-                 filtered_features = features
-         else:
-             filtered_features = features
-
-         # Create dataset
-         dataset = create_tf_dataset(filtered_features, batch_size, shuffle=True)
-
-         return dataset
-
-     def compile_model(self):
-         """Compile model with advanced optimizer."""
-         # Use AdamW with learning rate scheduling
-         initial_learning_rate = self.learning_rate
-         lr_schedule = tf.keras.optimizers.schedules.CosineDecayRestarts(
-             initial_learning_rate=initial_learning_rate,
-             first_decay_steps=1000,
-             t_mul=2.0,
-             m_mul=0.9,
-             alpha=0.01
-         )
-
-         optimizer = tf.keras.optimizers.AdamW(
-             learning_rate=lr_schedule,
-             weight_decay=1e-5,
-             beta_1=0.9,
-             beta_2=0.999,
-             epsilon=1e-7
-         )
-
-         # Enable mixed precision optimizer if needed
-         if self.use_mixed_precision:
-             optimizer = tf.keras.mixed_precision.LossScaleOptimizer(optimizer)
-
-         self.optimizer = optimizer
-         print(f"Model compiled with AdamW optimizer (lr={self.learning_rate})")
-
-     @tf.function
-     def train_step(self, features):
-         """Optimized training step with gradient scaling."""
-         with tf.GradientTape() as tape:
-             # Forward pass
-             loss_dict = self.model.compute_loss(features, training=True)
-             total_loss = loss_dict['total_loss']
-
-             # Scale loss for mixed precision
-             if self.use_mixed_precision:
-                 scaled_loss = self.optimizer.get_scaled_loss(total_loss)
-             else:
-                 scaled_loss = total_loss
-
-         # Compute gradients
-         if self.use_mixed_precision:
-             scaled_gradients = tape.gradient(scaled_loss, self.model.trainable_variables)
-             gradients = self.optimizer.get_unscaled_gradients(scaled_gradients)
-         else:
-             gradients = tape.gradient(scaled_loss, self.model.trainable_variables)
-
-         # Clip gradients to prevent exploding gradients
-         gradients, _ = tf.clip_by_global_norm(gradients, 1.0)
-
-         # Apply gradients
-         self.optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))
-
-         return loss_dict
-
-     def evaluate_model(self, validation_dataset):
-         """Evaluate model on validation set."""
-         total_losses = defaultdict(list)
-
-         for batch in validation_dataset:
-             loss_dict = self.model.compute_loss(batch, training=False)
-             for key, value in loss_dict.items():
-                 total_losses[key].append(float(value))
-
-         # Average losses
-         avg_losses = {key: np.mean(values) for key, values in total_losses.items()}
-         return avg_losses
-
-     def train(self,
-               training_features: Dict[str, np.ndarray],
-               validation_features: Dict[str, np.ndarray],
-               epochs: int = 50,
-               batch_size: int = 256,
-               save_path: str = "src/artifacts/") -> Dict:
-         """Enhanced training loop with all improvements."""
-
-         print(f"Starting improved training for {epochs} epochs...")
-
-         # Setup components
-         self.setup_curriculum_learning(epochs)
-         self.compile_model()
-
-         # Create validation dataset
-         validation_dataset = create_tf_dataset(validation_features, batch_size, shuffle=False)
-
-         # Training history
-         history = defaultdict(list)
-         best_val_loss = float('inf')
-         patience_counter = 0
-         early_stopping_patience = 10
-
-         # Training loop
-         for epoch in range(epochs):
-             epoch_start_time = time.time()
-
-             # Create training dataset for this epoch (curriculum learning)
-             training_dataset = self.create_advanced_training_dataset(
-                 training_features, batch_size, epoch
-             )
-
-             # Training
-             epoch_losses = defaultdict(list)
-             num_batches = 0
-
-             for batch in training_dataset:
-                 loss_dict = self.train_step(batch)
-
-                 for key, value in loss_dict.items():
-                     epoch_losses[key].append(float(value))
-                 num_batches += 1
-
-             # Average training losses
-             avg_train_losses = {
-                 key: np.mean(values) for key, values in epoch_losses.items()
-             }
-
-             # Validation
-             avg_val_losses = self.evaluate_model(validation_dataset)
-
-             # Log progress
-             epoch_time = time.time() - epoch_start_time
-             print(f"Epoch {epoch+1}/{epochs} ({epoch_time:.1f}s):")
-             print(f"  Train Loss: {avg_train_losses['total_loss']:.4f}")
-             print(f"  Val Loss: {avg_val_losses['total_loss']:.4f}")
-             print(f"  Val Rating Loss: {avg_val_losses['rating_loss']:.4f}")
-             print(f"  Val Retrieval Loss: {avg_val_losses['retrieval_loss']:.4f}")
-
-             # Save history
-             for key, value in avg_train_losses.items():
-                 history[f'train_{key}'].append(value)
-             for key, value in avg_val_losses.items():
-                 history[f'val_{key}'].append(value)
-
-             # Early stopping and model saving
-             current_val_loss = avg_val_losses['total_loss']
-             if current_val_loss < best_val_loss:
-                 best_val_loss = current_val_loss
-                 patience_counter = 0
-
-                 # Save best model
-                 self.save_model(save_path, suffix='_improved_best')
-                 print(f"  πŸ’Ύ Saved best model (val_loss: {best_val_loss:.4f})")
-             else:
-                 patience_counter += 1
-
-             if patience_counter >= early_stopping_patience:
-                 print(f"Early stopping at epoch {epoch+1}")
-                 break
-
-         # Save final model and history
-         self.save_model(save_path, suffix='_improved_final')
-         self.save_training_history(dict(history), save_path)
-
-         print("βœ… Improved training completed!")
-         return dict(history)
-
-     def save_model(self, save_path: str, suffix: str = ''):
-         """Save the trained model components."""
-         os.makedirs(save_path, exist_ok=True)
-
-         # Save model weights
-         self.model.item_tower.save_weights(f"{save_path}/improved_item_tower_weights{suffix}")
-         self.model.user_tower.save_weights(f"{save_path}/improved_user_tower_weights{suffix}")
-
-         if hasattr(self.model, 'rating_model'):
-             self.model.rating_model.save_weights(f"{save_path}/improved_rating_model_weights{suffix}")
-
-         # Save configuration
-         config = {
-             'embedding_dim': self.embedding_dim,
-             'learning_rate': self.learning_rate,
-             'use_mixed_precision': self.use_mixed_precision,
-             'use_curriculum_learning': self.use_curriculum_learning,
-             'use_hard_negatives': self.use_hard_negatives
-         }
-
-         with open(f"{save_path}/improved_model_config{suffix}.txt", 'w') as f:
-             for key, value in config.items():
-                 f.write(f"{key}: {value}\n")
-
-         print(f"Model saved to {save_path} with suffix '{suffix}'")
-
-     def save_training_history(self, history: Dict, save_path: str):
-         """Save training history."""
-         with open(f"{save_path}/improved_training_history.pkl", 'wb') as f:
-             pickle.dump(history, f)
-         print(f"Training history saved to {save_path}")
-
-
- def main():
-     """Demo of improved training."""
-     print("πŸš€ IMPROVED TWO-TOWER TRAINING DEMO")
-     print("="*60)
-
-     # Load data
-     print("Loading training data...")
-     try:
-         with open("src/artifacts/training_features.pkl", 'rb') as f:
-             training_features = pickle.load(f)
-         with open("src/artifacts/validation_features.pkl", 'rb') as f:
-             validation_features = pickle.load(f)
-
-         print(f"Loaded {len(training_features['rating'])} training samples")
-         print(f"Loaded {len(validation_features['rating'])} validation samples")
-     except FileNotFoundError:
-         print("❌ Training data not found. Please run data preparation first.")
-         return
-
-     # Load data processor
-     data_processor = DataProcessor()
-     data_processor.load_vocabularies("src/artifacts/vocabularies.pkl")
-
-     # Create trainer
-     trainer = ImprovedJointTrainer(
-         embedding_dim=128,
-         learning_rate=0.001,
-         use_mixed_precision=True,
-         use_curriculum_learning=True,
-         use_hard_negatives=True
-     )
-
-     # Setup and train
-     trainer.setup_model(data_processor)
-
-     # Train model
-     history = trainer.train(
-         training_features=training_features,
-         validation_features=validation_features,
-         epochs=30,
-         batch_size=256
-     )
-
-     print("βœ… Improved training completed successfully!")
-
-
- if __name__ == "__main__":
-     main()
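The three-phase schedule from the deleted `CurriculumLearningScheduler` can be exercised without TensorFlow. This pure-Python mirror of `get_difficulty_schedule` (the free-function form is our own, not from the repo) reproduces its warmup/medium/hard phases:

```python
def get_difficulty_schedule(epoch, total_epochs, warmup_epochs=10):
    """Mirror of CurriculumLearningScheduler.get_difficulty_schedule."""
    if epoch < warmup_epochs:
        # Easy phase: mostly random negatives, soft temperature
        return {'hard_negative_ratio': 0.2, 'temperature': 2.0, 'negative_samples': 2}
    elif epoch < total_epochs * 0.6:
        # Medium phase: balanced negatives
        return {'hard_negative_ratio': 0.5, 'temperature': 1.0, 'negative_samples': 4}
    else:
        # Hard phase: mostly hard negatives, sharp temperature
        return {'hard_negative_ratio': 0.8, 'temperature': 0.5, 'negative_samples': 6}
```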
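The mask-then-top-k trick in the deleted `HardNegativeSampler` is easy to verify in NumPy: subtract a large constant from each user's positive item score so it can never rank among the hardest negatives. A small sketch (single positive per user for simplicity; the function name mirrors the deleted method but this standalone form is ours):

```python
import numpy as np

def sample_hard_negatives(user_emb, item_emb, positive_items, k_hard=2):
    """Mask each user's positive item, then return the k most similar items."""
    similarities = user_emb @ item_emb.T                 # (batch, n_items)
    rows = np.arange(len(positive_items))
    similarities[rows, positive_items] -= 1e9            # mask positives out
    # argsort descending, keep the top-k columns per row
    return np.argsort(-similarities, axis=1)[:, :k_hard]
```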
src/training/optimized_joint_training.py DELETED
@@ -1,439 +0,0 @@
- import tensorflow as tf
- import numpy as np
- import pickle
- import os
- import time
- from typing import Dict, List, Tuple
-
- from src.models.item_tower import ItemTower
- from src.models.user_tower import UserTower, TwoTowerModel
- from src.preprocessing.data_loader import DataProcessor, create_tf_dataset
-
-
- class OptimizedJointTrainer:
-     """Optimized joint training with performance enhancements."""
-
-     def __init__(self,
-                  embedding_dim: int = 128,  # Updated to 128D output
-                  user_learning_rate: float = 0.001,
-                  item_learning_rate: float = 0.0001,
-                  rating_weight: float = 1.0,
-                  retrieval_weight: float = 1.0,
-                  gradient_accumulation_steps: int = 1,
-                  use_mixed_precision: bool = False):  # Disabled for CPU training
-
-         self.embedding_dim = embedding_dim
-         self.user_learning_rate = user_learning_rate
-         self.item_learning_rate = item_learning_rate
-         self.rating_weight = rating_weight
-         self.retrieval_weight = retrieval_weight
-         self.gradient_accumulation_steps = gradient_accumulation_steps
-         self.use_mixed_precision = use_mixed_precision
-
-         # Enable mixed precision for faster training
-         if self.use_mixed_precision:
-             tf.keras.mixed_precision.set_global_policy('mixed_float16')
-             print("Mixed precision training enabled")
-
-         self.item_tower = None
-         self.user_tower = None
-         self.model = None
-
-         # Precompile TensorFlow functions for speed
-         self._compiled_train_step = None
-         self._compiled_val_step = None
-
-     def load_pre_trained_item_tower(self, artifacts_path: str = "src/artifacts/") -> ItemTower:
-         """Load pre-trained item tower with optimizations."""
-         data_processor = DataProcessor()
-         data_processor.load_vocabularies(f"{artifacts_path}/vocabularies.pkl")
-
-         with open(f"{artifacts_path}/item_tower_config.txt", 'r') as f:
-             config = {}
-             for line in f:
-                 key, value = line.strip().split(': ')
-                 if key in ['embedding_dim', 'dropout_rate']:
-                     config[key] = float(value) if '.' in value else int(value)
-                 elif key == 'hidden_dims':
-                     config[key] = eval(value)
-
-         self.item_tower = ItemTower(
-             item_vocab_size=len(data_processor.item_vocab),
-             category_vocab_size=len(data_processor.category_vocab),
-             brand_vocab_size=len(data_processor.brand_vocab),
-             **config
-         )
-
-         dummy_input = {
-             'product_id': tf.constant([0]),
-             'category_id': tf.constant([0]),
-             'brand_id': tf.constant([0]),
-             'price': tf.constant([0.0])
-         }
-         _ = self.item_tower(dummy_input)
-         self.item_tower.load_weights(f"{artifacts_path}/item_tower_weights")
-
-         print("Pre-trained item tower loaded successfully")
-         return self.item_tower
-
-     def build_user_tower(self, max_history_length: int = 50) -> UserTower:
-         """Build user tower with optimizations."""
-         self.user_tower = UserTower(
-             max_history_length=max_history_length,
-             embedding_dim=self.embedding_dim,
-             hidden_dims=[128, 64],
-             dropout_rate=0.1  # Reduced dropout for faster training
-         )
-
-         print("User tower initialized")
-         return self.user_tower
-
-     def build_two_tower_model(self) -> TwoTowerModel:
-         """Build complete two-tower model."""
-         if self.item_tower is None or self.user_tower is None:
-             raise ValueError("Both towers must be initialized first")
-
-         self.model = TwoTowerModel(
-             item_tower=self.item_tower,
-             user_tower=self.user_tower,
-             rating_weight=self.rating_weight,
-             retrieval_weight=self.retrieval_weight
-         )
-
-         print("Two-tower model built successfully")
-         return self.model
-
-     def create_optimized_dataset(self, features: Dict[str, np.ndarray],
-                                  batch_size: int,
-                                  is_training: bool = True) -> tf.data.Dataset:
-         """Create optimized dataset pipeline for faster training."""
-
-         dataset = tf.data.Dataset.from_tensor_slices(features)
-
-         if is_training:
-             # Optimized shuffling and prefetching
-             dataset = dataset.shuffle(buffer_size=min(10000, len(features['rating'])))
-             dataset = dataset.repeat()  # Repeat for multiple epochs
-
-         dataset = dataset.batch(batch_size, drop_remainder=True)
-
-         # Optimize for CPU training
-         dataset = dataset.prefetch(tf.data.AUTOTUNE)
-
-         return dataset
-
-     @tf.function(experimental_relax_shapes=True)
-     def optimized_train_step(self, batch: Dict[str, tf.Tensor],
-                              user_optimizer: tf.keras.optimizers.Optimizer,
-                              item_optimizer: tf.keras.optimizers.Optimizer,
-                              train_item: bool) -> Dict[str, tf.Tensor]:
-         """Optimized training step with tf.function compilation."""
-
-         with tf.GradientTape() as tape:
-             # Forward pass
-             user_embeddings = self.user_tower(batch, training=True)
-             item_embeddings = self.item_tower(batch, training=True)
-
-             # Concatenate and predict rating
-             concatenated = tf.concat([user_embeddings, item_embeddings], axis=-1)
-             rating_predictions = self.model.rating_model(concatenated, training=True)
-
-             # Compute losses - fix shape mismatch
-             rating_loss = tf.keras.losses.binary_crossentropy(
-                 tf.expand_dims(batch["rating"], -1), rating_predictions
-             )
-             rating_loss = tf.reduce_mean(rating_loss)
-
-             # Retrieval loss - cosine similarity
-             user_norm = tf.nn.l2_normalize(user_embeddings, axis=1)
-             item_norm = tf.nn.l2_normalize(item_embeddings, axis=1)
-             similarities = tf.reduce_sum(user_norm * item_norm, axis=1)
-
-             retrieval_loss = tf.keras.losses.binary_crossentropy(
-                 batch["rating"], tf.nn.sigmoid(similarities)
-             )
-             retrieval_loss = tf.reduce_mean(retrieval_loss)
-
-             total_loss = (
-                 self.rating_weight * rating_loss +
-                 self.retrieval_weight * retrieval_loss
-             )
-
-             # Handle mixed precision
-             if self.use_mixed_precision:
-                 total_loss = user_optimizer.get_scaled_loss(total_loss)
-
-         # Compute gradients
-         user_vars = self.user_tower.trainable_variables + self.model.rating_model.trainable_variables
-
-         if train_item:
-             all_vars = user_vars + self.item_tower.trainable_variables
-             gradients = tape.gradient(total_loss, all_vars)
-
-             if self.use_mixed_precision:
-                 gradients = user_optimizer.get_unscaled_gradients(gradients)
-
-             # Split gradients
-             user_grads = gradients[:len(user_vars)]
-             item_grads = gradients[len(user_vars):]
-
-             # Apply gradients
-             user_optimizer.apply_gradients(zip(user_grads, user_vars))
-             item_optimizer.apply_gradients(zip(item_grads, self.item_tower.trainable_variables))
-         else:
-             gradients = tape.gradient(total_loss, user_vars)
-
-             if self.use_mixed_precision:
-                 gradients = user_optimizer.get_unscaled_gradients(gradients)
-
-             user_optimizer.apply_gradients(zip(gradients, user_vars))
-
-         # Convert back from scaled loss for logging
-         if self.use_mixed_precision:
-             total_loss = total_loss / user_optimizer.loss_scale
-             rating_loss = rating_loss / user_optimizer.loss_scale
-             retrieval_loss = retrieval_loss / user_optimizer.loss_scale
-
-         return {
-             'total_loss': total_loss,
-             'rating_loss': rating_loss,
-             'retrieval_loss': retrieval_loss
-         }
-
-     @tf.function(experimental_relax_shapes=True)
-     def optimized_val_step(self, batch: Dict[str, tf.Tensor]) -> Dict[str, tf.Tensor]:
-         """Optimized validation step."""
-
-         user_embeddings = self.user_tower(batch, training=False)
-         item_embeddings = self.item_tower(batch, training=False)
-
-         concatenated = tf.concat([user_embeddings, item_embeddings], axis=-1)
-         rating_predictions = self.model.rating_model(concatenated, training=False)
-
-         rating_loss = tf.reduce_mean(
-             tf.keras.losses.binary_crossentropy(tf.expand_dims(batch["rating"], -1), rating_predictions)
-         )
-
-         # Retrieval loss
-         user_norm = tf.nn.l2_normalize(user_embeddings, axis=1)
-         item_norm = tf.nn.l2_normalize(item_embeddings, axis=1)
-         similarities = tf.reduce_sum(user_norm * item_norm, axis=1)
-
-         retrieval_loss = tf.reduce_mean(
-             tf.keras.losses.binary_crossentropy(batch["rating"], tf.nn.sigmoid(similarities))
-         )
-
-         total_loss = self.rating_weight * rating_loss + self.retrieval_weight * retrieval_loss
-
-         return {
-             'total_loss': total_loss,
-             'rating_loss': rating_loss,
-             'retrieval_loss': retrieval_loss
-         }
-
-     def train(self,
-               training_features: Dict[str, np.ndarray],
-               validation_features: Dict[str, np.ndarray],
-               epochs: int = 50,  # Reduced default epochs
-               batch_size: int = 512) -> Dict:  # Larger batch size for efficiency
-         """Optimized training loop."""
-
-         print(f"Starting optimized joint training for {epochs} epochs...")
-         print(f"Batch size: {batch_size}")
-         print(f"Mixed precision: {self.use_mixed_precision}")
-
-         # Create optimized datasets
-         steps_per_epoch = len(training_features['rating']) // batch_size
-         val_steps = len(validation_features['rating']) // batch_size
-
-         train_dataset = self.create_optimized_dataset(training_features, batch_size, is_training=True)
-         val_dataset = self.create_optimized_dataset(validation_features, batch_size, is_training=False)
-
-         # Note: Age and income are now categorical - no normalization needed
-
-         # Setup optimizers with mixed precision
-         if self.use_mixed_precision:
-             user_optimizer = tf.keras.optimizers.Adam(learning_rate=self.user_learning_rate)
-             user_optimizer = tf.keras.mixed_precision.LossScaleOptimizer(user_optimizer)
-             item_optimizer = tf.keras.optimizers.Adam(learning_rate=self.item_learning_rate)
-             item_optimizer = tf.keras.mixed_precision.LossScaleOptimizer(item_optimizer)
-         else:
-             user_optimizer = tf.keras.optimizers.Adam(learning_rate=self.user_learning_rate)
-             item_optimizer = tf.keras.optimizers.Adam(learning_rate=self.item_learning_rate)
-
-         # Training history
-         history = {
-             'total_loss': [], 'rating_loss': [], 'retrieval_loss': [],
-             'val_total_loss': [], 'val_rating_loss': [], 'val_retrieval_loss': [],
-             'epoch_times': []
-         }
-
-         best_val_loss = float('inf')
-         patience_counter = 0
-         patience = 10  # Reduced patience for faster training
-
-         train_iter = iter(train_dataset)
-         val_iter = iter(val_dataset)
-
-         for epoch in range(epochs):
-             epoch_start_time = time.time()
-             print(f"\nEpoch {epoch + 1}/{epochs}")
-
-             # Determine training strategy
-             freeze_threshold = int(0.2 * epochs)  # Reduced freeze period
-             train_item = epoch >= freeze_threshold
-
-             print(f"Training: User=βœ“, Item={'βœ“' if train_item else 'βœ—'}")
-
-             # Training loop
-             epoch_losses = {'total_loss': [], 'rating_loss': [], 'retrieval_loss': []}
-
-             for step in range(steps_per_epoch):
-                 try:
-                     batch = next(train_iter)
-                 except StopIteration:
-                     train_iter = iter(train_dataset)
-                     batch = next(train_iter)
-
-                 losses = self.optimized_train_step(batch, user_optimizer, item_optimizer, train_item)
-
-                 for key in epoch_losses:
-                     epoch_losses[key].append(losses[key])
-
-             # Calculate training averages
-             avg_train_losses = {k: tf.reduce_mean(v).numpy() for k, v in epoch_losses.items()}
-
-             # Validation loop
-             val_losses = {'total_loss': [], 'rating_loss': [], 'retrieval_loss': []}
-
-             for step in range(val_steps):
-                 try:
-                     batch = next(val_iter)
-                 except StopIteration:
-                     val_iter = iter(val_dataset)
-                     batch = next(val_iter)
-
-                 losses = self.optimized_val_step(batch)
-
-                 for key in val_losses:
-                     val_losses[key].append(losses[key])
-
-             avg_val_losses = {k: tf.reduce_mean(v).numpy() for k, v in val_losses.items()}
-
-             # Record history
-             epoch_time = time.time() - epoch_start_time
-             history['epoch_times'].append(epoch_time)
-
-             for key in ['total_loss', 'rating_loss', 'retrieval_loss']:
-                 history[key].append(avg_train_losses[key])
-                 history[f'val_{key}'].append(avg_val_losses[key])
-
-             # Print progress
-             print(f"Time: {epoch_time:.1f}s | "
-                   f"Train Loss: {avg_train_losses['total_loss']:.4f} | "
-                   f"Val Loss: {avg_val_losses['total_loss']:.4f}")
-
-             # Early stopping with model saving
-             if avg_val_losses['total_loss'] < best_val_loss:
-                 best_val_loss = avg_val_losses['total_loss']
-                 patience_counter = 0
-                 self.save_model("src/artifacts/", suffix="_best")
-             else:
-                 patience_counter += 1
-                 if patience_counter >= patience:
-                     print(f"Early stopping at epoch {epoch + 1}")
-                     break
-
-         avg_epoch_time = np.mean(history['epoch_times'])
-         print(f"\nTraining completed!")
-         print(f"Average epoch time: {avg_epoch_time:.1f}s")
-         print(f"Total training time: {sum(history['epoch_times']):.1f}s")
-
-         return history
-
-     def save_model(self, save_path: str = "src/artifacts/", suffix: str = ""):
-         """Save the trained model."""
-         os.makedirs(save_path, exist_ok=True)
-
-         self.user_tower.save_weights(f"{save_path}/user_tower_weights{suffix}")
-         self.item_tower.save_weights(f"{save_path}/item_tower_weights_finetuned{suffix}")
-         self.model.rating_model.save_weights(f"{save_path}/rating_model_weights{suffix}")
-
-         config = {
-             'embedding_dim': self.embedding_dim,
-             'user_learning_rate': self.user_learning_rate,
-             'item_learning_rate': self.item_learning_rate,
-             'rating_weight': self.rating_weight,
-             'retrieval_weight': self.retrieval_weight,
-             'use_mixed_precision': self.use_mixed_precision
-         }
-
-         with open(f"{save_path}/optimized_joint_model_config{suffix}.txt", 'w') as f:
-             for key, value in config.items():
-                 f.write(f"{key}: {value}\n")
-
-         if not suffix:
-             print(f"Optimized model saved to {save_path}")
-
-
- def main():
-     """Main function for optimized joint training."""
-
-     print("Initializing optimized joint trainer...")
-     trainer = OptimizedJointTrainer(
384
- embedding_dim=128, # Updated to 128D
385
- user_learning_rate=0.002, # Slightly higher for faster convergence
386
- item_learning_rate=0.0002,
387
- rating_weight=1.0,
388
- retrieval_weight=0.3, # Reduced for faster training
389
- use_mixed_precision=False # Disabled for CPU
390
- )
391
-
392
- # Load components
393
- print("Loading pre-trained item tower...")
394
- trainer.load_pre_trained_item_tower()
395
-
396
- print("Building user tower...")
397
- trainer.build_user_tower(max_history_length=50)
398
-
399
- print("Building two-tower model...")
400
- trainer.build_two_tower_model()
401
-
402
- # Load training data
403
- print("Loading training data...")
404
- with open("src/artifacts/training_features.pkl", 'rb') as f:
405
- training_features = pickle.load(f)
406
-
407
- with open("src/artifacts/validation_features.pkl", 'rb') as f:
408
- validation_features = pickle.load(f)
409
-
410
- print(f"Training samples: {len(training_features['rating'])}")
411
- print(f"Validation samples: {len(validation_features['rating'])}")
412
-
413
- # Train with optimizations
414
- print("Starting optimized training...")
415
- start_time = time.time()
416
-
417
- history = trainer.train(
418
- training_features=training_features,
419
- validation_features=validation_features,
420
- epochs=30, # Reduced epochs for faster training
421
- batch_size=1024 # Larger batch size for better GPU utilization
422
- )
423
-
424
- total_time = time.time() - start_time
425
-
426
- # Save final model and history
427
- print("Saving final model...")
428
- trainer.save_model()
429
-
430
- with open("src/artifacts/optimized_training_history.pkl", 'wb') as f:
431
- pickle.dump(history, f)
432
-
433
- print(f"\\nOptimized joint training completed!")
434
- print(f"Total training time: {total_time:.1f}s")
435
- print(f"Average time per epoch: {total_time/len(history['epoch_times']):.1f}s")
436
-
437
-
438
- if __name__ == "__main__":
439
- main()
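The deleted training loop above checkpoints on every validation improvement (saving "_best" weights) and stops after 10 epochs without improvement. A minimal, framework-free sketch of that early-stopping pattern, with hypothetical names:

```python
class EarlyStopper:
    """Stop after `patience` epochs without a new best validation loss."""

    def __init__(self, patience: int = 10):
        self.patience = patience
        self.best = float("inf")
        self.counter = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best:
            self.best = val_loss
            self.counter = 0  # improvement: reset patience (the real loop also checkpoints here)
            return False
        self.counter += 1
        return self.counter >= self.patience
```

In the loop above this corresponds to `patience_counter`/`best_val_loss` plus the `save_model(..., suffix="_best")` call on each improvement.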
src/utils/real_user_selector.py CHANGED
@@ -93,22 +93,58 @@ class RealUserSelector:
         """
         print(f"Selecting {n} real users with at least {min_interactions} interactions...")
 
-        # Filter users with sufficient interactions
-        active_users = []
+        # Filter users with sufficient interactions, separating by interaction count
+        high_interaction_users = []  # >14 interactions
+        low_interaction_users = []   # min_interactions to 14 interactions
+
         for _, user in self.users_df.iterrows():
             user_id = user['user_id']
-            if (user_id in self.user_stats and
-                self.user_stats[user_id]['total_interactions'] >= min_interactions):
-                active_users.append(user)
+            if user_id in self.user_stats:
+                interaction_count = self.user_stats[user_id]['total_interactions']
+                if interaction_count >= min_interactions:
+                    if interaction_count > 14:
+                        high_interaction_users.append(user)
+                    else:
+                        low_interaction_users.append(user)
+
+        print(f"Found {len(high_interaction_users)} high-interaction users (>14) and {len(low_interaction_users)} low-interaction users ({min_interactions}-14)")
 
-        print(f"Found {len(active_users)} active users with >={min_interactions} interactions")
+        # Ensure more than half have >14 interactions
+        min_high_interaction = (n // 2) + 1  # More than 50%
 
-        # Randomly sample n users
-        if len(active_users) < n:
-            print(f"Warning: Only {len(active_users)} users available, returning all")
-            selected_users = active_users
+        selected_users = []
+
+        # First, select from high-interaction users (prioritize these)
+        if len(high_interaction_users) >= min_high_interaction:
+            selected_high = random.sample(high_interaction_users, min_high_interaction)
         else:
-            selected_users = random.sample(active_users, n)
+            print(f"Warning: Only {len(high_interaction_users)} high-interaction users available, using all")
+            selected_high = high_interaction_users
+
+        selected_users.extend(selected_high)
+        remaining_slots = n - len(selected_high)
+
+        # Fill remaining slots with low-interaction users if available
+        if remaining_slots > 0 and len(low_interaction_users) > 0:
+            if len(low_interaction_users) >= remaining_slots:
+                selected_low = random.sample(low_interaction_users, remaining_slots)
+            else:
+                selected_low = low_interaction_users
+            selected_users.extend(selected_low)
+
+        # If we still need more users, add remaining high-interaction users
+        remaining_slots = n - len(selected_users)
+        if remaining_slots > 0:
+            remaining_high = [user for user in high_interaction_users if user not in selected_users]
+            if len(remaining_high) >= remaining_slots:
+                selected_users.extend(random.sample(remaining_high, remaining_slots))
+            else:
+                selected_users.extend(remaining_high)
+
+        print(f"Selected {len(selected_users)} total users: {len([u for u in selected_users if self.user_stats[u['user_id']]['total_interactions'] > 14])} high-interaction (>14), {len([u for u in selected_users if self.user_stats[u['user_id']]['total_interactions'] <= 14])} low-interaction (≀14)")
+
+        if len(selected_users) < n:
+            print(f"Warning: Only {len(selected_users)} users available, returning all")
 
         # Build user profiles with real data
         real_user_profiles = []
@@ -124,6 +160,10 @@ class RealUserSelector:
                 'age': int(user['age']),
                 'gender': user['gender'],
                 'income': int(user['income']),
+                'profession': user.get('profession', 'Other'),
+                'location': user.get('location', 'Urban'),
+                'education_level': user.get('education_level', 'High School'),
+                'marital_status': user.get('marital_status', 'Single'),
                 'interaction_history': stats['unique_items'][:50],  # Limit to 50 most recent
                 'interaction_stats': {
                     'total_interactions': stats['total_interactions'],
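The selection logic in the diff above (stratify users by interaction count, guarantee a greater-than-half share of high-interaction users, then fill from the low pool and top up from leftover high users) can be sketched framework-free. The function name, plain-dict input, and `min_interactions` default below are illustrative; the real method operates on pandas rows:

```python
import random

def select_users(counts, n, min_interactions=5, high_cutoff=14):
    """Pick n user ids so that, when possible, more than half have
    > high_cutoff interactions. `counts` maps user_id -> interaction count."""
    high = [u for u, c in counts.items() if c > high_cutoff]
    low = [u for u, c in counts.items() if min_interactions <= c <= high_cutoff]

    # More than 50% of the selection must come from the high pool
    selected = random.sample(high, min((n // 2) + 1, len(high)))

    # Fill remaining slots from the low pool
    remaining = n - len(selected)
    if remaining > 0 and low:
        selected += random.sample(low, min(remaining, len(low)))

    # Top up from leftover high-interaction users if still short
    remaining = n - len(selected)
    if remaining > 0:
        leftover = [u for u in high if u not in selected]
        selected += random.sample(leftover, min(remaining, len(leftover)))
    return selected
```

As in the patched method, the function degrades gracefully: with too few high-interaction users it takes all of them, and with too few users overall it returns fewer than n.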
train_improved_model.py DELETED
@@ -1,111 +0,0 @@
-#!/usr/bin/env python3
-"""
-Train the improved two-tower model with all enhancements to address the identified issues.
-
-This script implements:
-βœ… 128D embeddings (vs 64D) - Better representation capacity
-βœ… Temperature scaling - Improved score discrimination
-βœ… Category-aware boosting - Enhanced personalization
-βœ… Contrastive loss - Prevents embedding collapse
-βœ… Hard negative mining - Better training signal
-βœ… User/item bias terms - Improved modeling capacity
-βœ… Curriculum learning - Progressive training strategy
-"""
-
-import argparse
-import sys
-import os
-
-# Add project root to path
-sys.path.append(os.path.dirname(os.path.abspath(__file__)))
-
-from src.training.curriculum_trainer import CurriculumTrainer
-
-
-def main():
-    parser = argparse.ArgumentParser(description='Train improved two-tower model')
-    parser.add_argument('--embedding-dim', type=int, default=128,
-                        help='Embedding dimension (default: 128)')
-    parser.add_argument('--learning-rate', type=float, default=0.001,
-                        help='Learning rate (default: 0.001)')
-    parser.add_argument('--epochs-per-stage', type=int, default=15,
-                        help='Epochs per curriculum stage (default: 15)')
-    parser.add_argument('--batch-size', type=int, default=512,
-                        help='Batch size (default: 512)')
-    parser.add_argument('--curriculum-stages', type=int, default=3,
-                        help='Number of curriculum stages (default: 3)')
-    parser.add_argument('--use-focal-loss', action='store_true', default=True,
-                        help='Use focal loss for imbalanced data')
-
-    args = parser.parse_args()
-
-    print("πŸš€ TRAINING IMPROVED TWO-TOWER MODEL")
-    print("="*70)
-    print("IMPROVEMENTS IMPLEMENTED:")
-    print("βœ… 128D embeddings (increased from 64D)")
-    print("βœ… Temperature scaling for better score discrimination")
-    print("βœ… Category-aware boosting for personalization")
-    print("βœ… Contrastive loss to prevent embedding collapse")
-    print("βœ… Hard negative mining for better training")
-    print("βœ… User/item bias terms for improved modeling")
-    print("βœ… Curriculum learning for progressive training")
-    print("="*70)
-
-    # Initialize trainer with improved settings
-    trainer = CurriculumTrainer(
-        embedding_dim=args.embedding_dim,
-        learning_rate=args.learning_rate,
-        use_focal_loss=args.use_focal_loss,
-        curriculum_stages=args.curriculum_stages
-    )
-
-    try:
-        # Load data and train
-        trainer.load_data_processor()
-        trainer.create_model()
-
-        # Load training data
-        import pickle
-        with open("src/artifacts/training_features.pkl", 'rb') as f:
-            training_features = pickle.load(f)
-
-        with open("src/artifacts/validation_features.pkl", 'rb') as f:
-            validation_features = pickle.load(f)
-
-        # Train with curriculum learning
-        history = trainer.train_with_curriculum(
-            training_features=training_features,
-            validation_features=validation_features,
-            epochs_per_stage=args.epochs_per_stage,
-            batch_size=args.batch_size
-        )
-
-        # Save results
-        trainer.save_model()
-
-        with open("src/artifacts/improved_training_history.pkl", 'wb') as f:
-            pickle.dump(history, f)
-
-        print("\n🎯 EXPECTED IMPROVEMENTS:")
-        print("β€’ Score variance: 0.0007 β†’ 0.01+ (15x better discrimination)")
-        print("β€’ Category alignment: 12% β†’ 60%+ (5x better personalization)")
-        print("β€’ Reduced embedding collapse (more diverse user representations)")
-        print("β€’ Better negative sampling and contrastive learning")
-        print("β€’ Improved bias modeling for users and items")
-
-        print("\nβœ… TRAINING COMPLETED SUCCESSFULLY!")
-        print("The improved model should address all critical issues identified in your analysis.")
-
-    except FileNotFoundError as e:
-        print(f"❌ ERROR: {e}")
-        print("Please ensure training data exists in src/artifacts/")
-        print("Run data preprocessing first if needed.")
-
-    except Exception as e:
-        print(f"❌ TRAINING ERROR: {e}")
-        import traceback
-        traceback.print_exc()
-
-
-if __name__ == "__main__":
-    main()
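The deleted script exposes a `--use-focal-loss` flag for imbalanced data. As a reference point, the standard binary focal loss (Lin et al., 2017) for one example can be sketched as below; the actual `CurriculumTrainer` implementation is not part of this diff, so treat this as an assumption about what the flag enables:

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Per-example binary focal loss.

    p is the predicted probability of the positive class, y is the 0/1 label.
    With gamma=0 and alpha=1 this reduces to cross-entropy; larger gamma
    down-weights well-classified (easy) examples.
    """
    pt = p if y == 1 else 1.0 - p          # probability of the true class
    a = alpha if y == 1 else 1.0 - alpha   # class-balancing weight
    return -a * (1.0 - pt) ** gamma * math.log(max(pt, 1e-12))
```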