---
language: en
license: mit
tags:
- recommendation
- ranking
- personalization
- xgboost
- xgbranker
- recipe
- cold-start
datasets:
- your-username/recipe-cleaned-dataset
model-index:
- name: Personalized Recipe Ranking Models
  results:
  - task:
      type: recommendation
      name: Personalized Recipe Ranking
    dataset:
      name: Food.com (Cleaned)
      type: your-username/recipe-cleaned-dataset
    metrics:
      - type: ndcg@5
        value: 0.44
      - type: ndcg@10
        value: 0.44
---

# Model Card: Personalized Recipe Ranking Models

## Overview

This project implements a personalized recipe recommendation system using two model categories:

1. **Scratch-trained baseline**: A simple ranker that combines rule-based signals with embedding-based ingredient matching, trained on a synthetic preference dataset (no user-specific rules).  
2. **Rule-enhanced cold-start models**: Five separate XGBRanker models trained with more complex rule-based preference signals and user-specific interaction patterns (user1–user5).

The goal is to evaluate how different user profiles affect ranking behavior and recommendation diversity, even when overall NDCG scores are lower than the baseline.

---

## Model Category 1: Scratch-trained Baseline

### Purpose
Provide a simple cold-start recommendation baseline that matches ingredients and ranks recipes without personalization. It uses parent–child ingredient overlap and a few numeric features (e.g., protein, cost, cooking time).

### Data Sources
- Cleaned Food.com dataset (~180k recipes)
- 10,000 synthetic preference samples generated via uniform random selection

### Training Details
- Model type: **XGBRanker** (`objective='rank:pairwise'`)  
- Features: ~1000 numeric ingredient-parent ratio features + basic nutrition/time features  
- Train/test split: 80/20 (by recipe ID)  
- Evaluation metric: NDCG@5, NDCG@10

### Evaluation
The baseline achieves **very high NDCG scores (above 0.95)** because training and evaluation rely on synthetic signals that align perfectly with the ranking structure.

### Intended Use
Serve as a **sanity check** and upper bound for ranking performance, not for deployment.

### Limitations
- Unrealistically clean preference structure  
- No user differentiation  
- Inflated metrics due to synthetic evaluation

---

## Model Category 2: Rule-enhanced Cold Start Models (User1–User5)

### Purpose
Capture user-specific dietary preferences and ranking heuristics using richer rule sets, leading to more diverse recommendation patterns across different users.

### Data Sources
- Cleaned Food.com dataset (~180k recipes)
- 5,000 cold-start synthetic interactions per user profile
- Additional unselected (negative) samples included to simulate realistic cold-start scenarios

### Model
- Model type: **XGBRanker** (scratch-trained)
- Training objective: `rank:pairwise`
- Feature space:
  - Ingredient-parent coverage ratios (~1000 parent nodes)
  - Nutrition features: protein, calories, cost, cooking time
  - User preference weights: protein/time/cost
  - Dietary tag filters and exclusion rules
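
The ingredient-parent coverage ratios can be sketched as below. The taxonomy and function name here are hypothetical, chosen only to illustrate the feature's shape; the real feature space spans roughly 1000 parent nodes:

```python
# Hypothetical parent -> child ingredient taxonomy (illustrative only;
# the actual project uses ~1000 parent nodes).
PARENTS = {
    "allium": {"onion", "garlic", "leek"},
    "dairy": {"milk", "butter", "cheese"},
}

def parent_coverage(recipe_ingredients):
    """Per-parent coverage ratio: fraction of a parent's children the recipe contains."""
    present = set(recipe_ingredients)
    return {p: len(present & kids) / len(kids) for p, kids in PARENTS.items()}

feats = parent_coverage(["onion", "garlic", "milk"])
# feats["allium"] == 2/3, feats["dairy"] == 1/3
```

Each recipe thus becomes a dense numeric vector of per-parent ratios, which concatenates naturally with the nutrition, time, and preference-weight features.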

### Training Setup
- Train/valid/test split: 70/15/15 by recipe ID per profile
- No fine-tuning between profiles; each profile trained independently
- Evaluation metric: NDCG@5 and NDCG@10

### Evaluation Results

| User Profile | NDCG@5 | NDCG@10 |
|-------------|--------|---------|
| user1       | 0.4400 | 0.4400  |
| user2       | 0.4342 | 0.4342  |
| user3       | 0.4179 | 0.4179  |
| user4       | 0.1651 | 0.1651  |
| user5       | 0.4607 | 0.4607  |

**Note:** User4 has very restrictive dietary preferences, resulting in very few matching recipes and inherently lower achievable NDCG.


Although these NDCG values are lower than the baseline, this is expected for several reasons:

- The cold-start datasets contain a large proportion of unselected recipes, leading to sparse positive signals.  
- More complex preference rules increase variability and reduce alignment with NDCG’s single-label relevance assumptions.  
- The models now produce more differentiated ranking behaviors across user profiles, which aligns with the intended personalization goals.

---

## Model Selection Justification

- **XGBRanker** was chosen for all models due to its effectiveness on structured tabular data, fast training time, and compatibility with large feature spaces (1000+ ingredients).  
- The **baseline model** acts as a clean control, providing an upper bound on achievable NDCG under idealized preferences.  
- The **rule-enhanced models** trade some raw NDCG performance for greater personalization fidelity, which is critical in multi-user recommendation contexts.

---

## Evaluation Methodology

- Metric: NDCG@5 and NDCG@10 on held-out cold-start samples  
- Each user model evaluated independently  
- Negative samples retained to approximate real-world recommendation class imbalance
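
A minimal NDCG@k computation consistent with this methodology (graded relevance values for one query, ordered by model score) can be written in pure NumPy; this is a sketch, not the project's evaluation code:

```python
import numpy as np

def ndcg_at_k(rels_in_predicted_order, k):
    """NDCG@k for one query; rels are graded relevance values sorted by model score."""
    rel = np.asarray(rels_in_predicted_order, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float(np.sum(rel * discounts))
    # Ideal DCG: the same relevance values in best possible order.
    ideal = np.sort(np.asarray(rels_in_predicted_order, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal * discounts))
    return dcg / idcg if idcg > 0 else 0.0

ndcg_at_k([3, 2, 1, 0], k=5)  # perfect ordering -> 1.0
ndcg_at_k([0, 1, 2, 3], k=5)  # worst ordering -> < 1.0
```

Returning 0.0 when the ideal DCG is zero (no relevant items in the group) matters here: the heavily negative-sampled cold-start groups can contain queries with no positives at all.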

---

## Intended Uses and Limitations

**Intended Uses**
- Multi-profile recipe recommendation  
- Studying personalization behaviors under sparse feedback  
- Cold-start scenarios for new users

**Limitations**
- Synthetic user interactions do not perfectly reflect real-world feedback  
- NDCG is not well aligned with multi-rule personalization behavior  
- User4 performance is limited by scarcity of relevant recipes

---
## Risks and Bias

The models are trained on the Food.com dataset, which has known biases:
- **Regional bias**: Western and American cuisines dominate the dataset, leading to potential under-representation of other regions.
- **Popularity bias**: Highly rated or frequently interacted recipes are over-represented.
- **Cold-start leakage risk**: Although user interactions are synthetic, overlapping ingredient-parent structures between train/test may create mild information leakage, potentially inflating baseline metrics.

These biases may affect recommendation diversity and fairness across different cuisines or dietary groups.

---

## Cost and Latency

All models are based on **XGBRanker**, which runs efficiently on CPU:
- **Inference latency**: Approximately 1–5 ms per recipe for ranking (measured on a laptop CPU, single thread).
- **Training cost**: Training each user profile model on 5,000 interactions takes less than 2 minutes on CPU.
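
Per-call latency figures like those above can be obtained with a simple warm-up-then-average timing loop; this helper is illustrative, not the project's benchmark code:

```python
import time

def mean_latency_ms(fn, *args, n_runs=100):
    """Average wall-clock latency of fn(*args) in milliseconds."""
    fn(*args)  # warm-up call, excluded from timing
    t0 = time.perf_counter()
    for _ in range(n_runs):
        fn(*args)
    return (time.perf_counter() - t0) / n_runs * 1000.0

# Example: time a trivial scoring step (stand-in for model.predict on one batch).
latency = mean_latency_ms(lambda xs: sorted(xs, reverse=True), list(range(1000)))
```

The warm-up call keeps one-time costs (JIT, caches, lazy initialization) out of the average, which is important when quoting single-digit-millisecond numbers.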

The approach is designed for real-time personalization in lightweight interfaces (e.g., Hugging Face Spaces).

---

## Usage Disclosure

**Intended Uses**
- Academic and educational research on personalized recommendation
- Cold-start personalization experiments
- Recipe recommendation for diverse dietary profiles

**Not Intended For**
- Medical or dietary decision-making
- Real-world deployment without additional bias mitigation
- High-stakes personalization where fairness across demographic groups is critical

---

## Citation

Tang, Xinxuan. Personalized Recipe Ranking Models. 2025.