---
license: mit
language:
- en
tags:
- recommendation-system
- two-tower
- re-ranking
- torchrec
- faiss
---

# Model Card for ReVue

ReVue is a two-stage recommendation system for property listings. It recommends new properties to returning users based on their past reviews. Stage 1 uses a two-tower model trained with in-batch-negative softmax to generate candidates via a FAISS HNSW index. Stage 2 applies a pointwise MLP re-ranker to score and re-order the retrieved candidates.

## Model Details

### Model Description

ReVue combines collaborative filtering signals (user-item interactions) with content-based features (text embeddings, dense listing attributes, and sparse categorical features) in a two-stage retrieval and ranking pipeline. The candidate generation stage learns 128-dimensional L2-normalised user and item embeddings using a two-tower architecture built on TorchRec. Retrieved candidates are then scored by a pointwise MLP re-ranker that fuses sparse embeddings, dense features, text embeddings, and the two-tower cosine similarity into a single relevance logit.

- **Developed by:** Vladimir Ilievski
- **Model type:** Two-stage recommender system (candidate generation + re-ranking)
- **Language(s) (NLP):** English (86.1%), French (4.7%), Spanish (2.5%), German (1.5%), Italian (1.1%), other (3.1%)
- **License:** MIT

### Model Sources

- **Repository:** [https://github.com/IlievskiV/ReVue](https://github.com/IlievskiV/ReVue)

## Uses

### Direct Use

ReVue is designed to recommend property listings to returning users who have previously reviewed at least one property. Given a user's review history, the system retrieves and ranks candidate listings from the full catalogue.

### Downstream Use

- Fine-tuning on property listing datasets from other cities or platforms.
- Using the learned item embeddings for related-listing or similar-property retrieval.
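The related-listing use can be served directly from the item tower's L2-normalised embeddings: because the vectors are unit-length, an inner product equals cosine similarity, the same metric the FAISS index uses. A minimal brute-force sketch (assuming the embeddings have been exported to a NumPy matrix; the random data below is a hypothetical stand-in):

```python
import numpy as np

def similar_items(item_embeddings: np.ndarray, item_id: int, k: int = 5) -> np.ndarray:
    """Return the row indices of the k listings most similar to `item_id`.

    `item_embeddings` is an (n_items, 128) matrix of L2-normalised item
    vectors, so the inner product equals cosine similarity.
    """
    scores = item_embeddings @ item_embeddings[item_id]
    # Exclude the query listing itself, then take the top-k by score.
    scores[item_id] = -np.inf
    return np.argsort(-scores)[:k]

# Toy example with random normalised embeddings (hypothetical data).
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 128))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
neighbours = similar_items(emb, item_id=3, k=5)
```

At catalogue scale the FAISS HNSW index replaces this exhaustive scan, trading a small amount of recall for much lower latency.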
### Out-of-Scope Use

- **Cold-start users:** The system requires at least one past review to produce recommendations. It is not suitable for brand-new users with no interaction history.
- **Non-property domains:** The feature engineering and data pipeline are tailored to property listing data (Airbnb); applying the model to unrelated domains without adaptation is not recommended.
- **Real-time safety-critical decisions:** The model is not designed for applications where incorrect recommendations could cause harm.

## Bias, Risks, and Limitations

- **Geographic bias:** The training data comes exclusively from Airbnb London listings. The model may not generalise to other cities, countries, or cultural contexts.
- **Interaction sparsity:** Despite ~1.5 million reviews, the user-item interaction matrix has a density of only ~0.002% due to the large number of unique users and listings, which limits the signal available for collaborative filtering.
- **Cold-start problem:** Users with no review history cannot be served. Single-review users are included in training but excluded from evaluation.
- **Popularity bias:** In-batch negative sampling can introduce a bias toward popular items, which appear more frequently as negatives.
- **Approximate retrieval:** FAISS HNSW provides approximate nearest-neighbour search, trading exact recall for latency. The `efSearch` parameter controls this trade-off.

### Recommendations

Users should be aware that recommendations are biased toward the London Airbnb market represented in the training data. Deploying on a different market requires retraining on representative data. The cold-start limitation should be addressed at the application level (e.g., a popularity-based fallback for new users).

## How to Get Started with the Model

Use the code below to get started with the model.
### Installation

```bash
poetry install --all-extras

# Download artefacts from Hugging Face Hub
poetry run hf download vlad0saurus/ReVue --repo-type model --local-dir .
```

### Full pipeline via CLI

```bash
# 1. Clean data
revue data clean-raw-reviews
revue data clean-raw-listings

# 2. Train two-tower model
revue model train-two-tower

# 3. Build FAISS index
revue index build-items-index

# 4. Generate re-ranking triplets
revue data create-ranking-triplets

# 5. Build ranker dataset
revue model build-ranker-dataset

# 6. Train re-ranker
revue model train-ranker
```

### Inference

```python
from revue.index.ann_items import load_ann_index, search_ann_index
from revue.models.two_tower.model import TwoTowerModel

# Load model and index (checkpoint_path, index_path, and device are
# set by the caller)
model, checkpoint = TwoTowerModel.load_from_checkpoint(checkpoint_path, device=device)
index = load_ann_index(index_path)
user_id_map = checkpoint["user_id_map"]

# Retrieve top-K candidates for a user
user_embeddings = model.encode_user(user_kjt).cpu().numpy()
scores, listing_ids = search_ann_index(index, user_embeddings, k=100)
```

## Training Details

### Training Data

The training data is derived from publicly available Airbnb London data consisting of two tables:

- **`reviews.csv`:** User reviews of property listings, with columns including `listing_id`, `reviewer_id`, `reviewer_name`, `date`, `comments`, and an augmented `sentiment` score.
- **`listings.csv`:** Property listing metadata with features such as `name`, `description`, `host_id`, and various categorical and numerical attributes.
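The interaction-sparsity figure cited under limitations follows directly from these two tables: density is the number of unique (reviewer, listing) interactions divided by the number of possible user-item cells. A minimal sketch, with a toy list of pairs standing in for `reviews.csv`:

```python
def interaction_density(pairs):
    """Density of the implicit user-item matrix: unique (user, item)
    interactions divided by the number of possible user-item cells."""
    users = {u for u, _ in pairs}
    items = {i for _, i in pairs}
    return len(set(pairs)) / (len(users) * len(items))

# Toy stand-in for (reviewer_id, listing_id) pairs from reviews.csv.
pairs = [(1, 10), (1, 11), (2, 10), (3, 12)]
density = interaction_density(pairs)  # 4 interactions / (3 users * 3 listings)
```

With the real data (~1.5 million reviews spread over far more user-listing cells), the same computation yields the ~0.002% density noted above.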
The reviews are **multilingual** (detected via `langdetect`):

| Language | Count | Share |
|----------|------:|------:|
| English | 43,040 | 86.1% |
| French | 2,348 | 4.7% |
| Spanish | 1,227 | 2.5% |
| German | 763 | 1.5% |
| Italian | 563 | 1.1% |
| Other / unknown | 1,582 | 3.1% |

For the re-ranker, training triplets are constructed with:

- **Positives:** All review events (label = 1)
- **Hard negatives:** Top-K (default 10) FAISS nearest neighbours not reviewed by the user
- **Easy negatives:** N (default 10) random catalogue listings not reviewed by the user

### Training Procedure

#### Preprocessing

Text data undergoes a multi-stage cleaning pipeline:

1. **Quality filtering (datatrove):** `GopherQualityFilter` (min 3 words, max 1,000 words, max avg word length 15, max non-alpha ratio 0.5) and `UnigramLogProbFilter` (threshold -20).
2. **Text normalisation:** Lowercasing, whitespace stripping, link removal, symbol removal, non-alphanumeric removal, and whitespace collapsing via regex and spaCy.
3. **Sentiment augmentation (reviews only):** Expected star rating in [1, 5] from `nlptown/bert-base-multilingual-uncased-sentiment`.
4. **Train/test split:** Temporal leave-last-out for returning users (users with >= 2 reviews). Single-review users are kept in training only.
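The temporal leave-last-out split in step 4 can be sketched as follows. This is a pure-Python illustration, not the project's implementation; the record layout `(reviewer_id, listing_id, date)` mirrors the `reviews.csv` columns:

```python
from collections import defaultdict

def leave_last_out(reviews):
    """Split (reviewer_id, listing_id, date) records: for each user with
    >= 2 reviews, hold out the most recent review as the test example;
    everything else (including single-review users) goes to training."""
    by_user = defaultdict(list)
    for user, item, date in reviews:
        by_user[user].append((date, item))
    train, test = [], []
    for user, events in by_user.items():
        events.sort()  # chronological order (ISO dates sort lexicographically)
        if len(events) >= 2:
            for date, item in events[:-1]:
                train.append((user, item, date))
            held_date, held_item = events[-1]
            test.append((user, held_item, held_date))
        else:
            # Single-review users are kept in training only.
            date, item = events[0]
            train.append((user, item, date))
    return train, test

reviews = [
    (1, 10, "2023-01-01"), (1, 11, "2023-06-01"),  # returning user
    (2, 12, "2023-03-15"),                          # single-review user
]
train, test = leave_last_out(reviews)
```

Only user 1's latest review lands in the test set; user 2, having a single review, contributes to training but is never evaluated, matching the cold-start note above.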
#### Training Hyperparameters

**Two-Tower Model:**

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Peak learning rate | 1e-3 |
| Weight decay | 1e-5 |
| LR schedule | Linear warmup (100 steps) + cosine decay |
| Batch size | 1,024 per GPU |
| Gradient clipping | 1.0 (max norm) |
| Epochs | 10 |
| Loss | In-batch-negative softmax cross-entropy (temperature = 0.05) |
| User MLP | [256, 128], ReLU + Dropout(0.1) |
| Item MLP | [512, 256, 128], ReLU + Dropout(0.1) |
| Output dim | 128 (L2-normalised) |

**Re-Ranker:**

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Peak learning rate | 1e-3 |
| Weight decay | 1e-5 |
| LR schedule | Linear warmup (100 steps) + cosine decay |
| Batch size | 1,024 per GPU |
| Gradient clipping | 1.0 (max norm) |
| Epochs | 10 |
| Loss | Binary cross-entropy with logits |
| MLP | [256, 128], ReLU + Dropout(0.1) → 1 logit |

- **Training regime:** fp32. Multi-GPU training supported via DDP (`torchrun`).

#### Speeds, Sizes, Times

- **Checkpoints:** ~5.9 GB (hosted on Hugging Face Hub)
- **FAISS index:** ~62 MB
- **Training data:** ~2.2 GB (CSV files)

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The test set is constructed using a temporal leave-last-out protocol: for each returning user (>= 2 reviews), the most recent review is held out for evaluation. The remaining reviews form the training set.

#### Factors

Evaluation is disaggregated by user, with per-user ranking of the full item catalogue (two-tower) or per-user candidate lists (re-ranker).

#### Metrics

**Two-Tower (Candidate Generation):**

- **Recall@K** (K = 1, 5, 10, 50, 100): Measures whether the held-out item appears in the top-K retrieved candidates. Recall is the primary metric for candidate generation, since the goal is to ensure the relevant item is included in the shortlist.
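Because each user contributes exactly one held-out item, Recall@K reduces to a hit rate: the fraction of users whose held-out listing appears in their top-K candidates. A minimal sketch with illustrative data:

```python
def recall_at_k(held_out, ranked_lists, k):
    """held_out: {user: true_item}; ranked_lists: {user: [items, best
    first]}. Returns the fraction of users whose held-out item appears
    among their top-k retrieved candidates."""
    hits = sum(1 for u, item in held_out.items() if item in ranked_lists[u][:k])
    return hits / len(held_out)

held_out = {1: 10, 2: 20, 3: 30}
ranked = {1: [10, 11, 12], 2: [21, 20, 22], 3: [31, 32, 33]}
r1 = recall_at_k(held_out, ranked, k=1)  # only user 1's item is ranked first -> 1/3
r2 = recall_at_k(held_out, ranked, k=2)  # users 1 and 2 are hits -> 2/3
```

Recall@K is monotonically non-decreasing in K, which is why large K values (50, 100) are reported for the retrieval stage: the shortlist only needs to contain the relevant item for the re-ranker to find it.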
**Re-Ranker:**

- **NDCG@K** (K = 1, 5, 10): Measures the ranking quality of the re-ordered candidates, giving higher weight to relevant items ranked near the top.
- **MRR:** Mean reciprocal rank of the first relevant item.
- **Recall@K** (K = 1, 5, 10): Measures whether the relevant item appears in the top-K after re-ranking.

## Technical Specifications

### Model Architecture and Objective

**Two-Tower Model (Candidate Generation):**

- **User tower:** User-ID embedding (256-dim EBC) → MLP [256, 128] → 128-dim L2-normalised output.
- **Item tower:** Sparse features (EBC) + dense features (LayerNorm) + text embeddings (3 × 384 from `all-MiniLM-L6-v2`) → MLP [512, 256, 128] → 128-dim L2-normalised output.
- **Objective:** In-batch-negative softmax cross-entropy with temperature 0.05.

**Re-Ranker (Pointwise MLP):**

- **Input:** User embedding (64-dim) + item sparse embeddings (64-dim each) + dense features (LayerNorm) + text embeddings (3 × 384) + two-tower cosine similarity.
- **Architecture:** MLP [256, 128] → 1 logit.
- **Objective:** Binary cross-entropy with logits.

**FAISS Index:**

- **Type:** `IndexHNSWFlat` + `IndexIDMap`
- **Metric:** `METRIC_INNER_PRODUCT` (equivalent to cosine similarity for L2-normalised embeddings)
- **Embedding dimension:** 128

### Compute Infrastructure

Multi-GPU training via PyTorch DDP (`torchrun`). Single-GPU and CPU inference are supported.

#### Hardware

A CUDA-capable GPU is recommended for training; CPU is sufficient for inference.

#### Software

- Python >= 3.11, < 3.12
- PyTorch >= 2.6
- TorchRec >= 1.0
- FAISS (faiss-cpu >= 1.13.2)
- Transformers >= 5.0
- spaCy >= 3.3
- datatrove >= 0.8
- Poetry >= 2.1.3 (dependency management)

## Glossary

- **Two-tower model:** A dual-encoder architecture where user and item features are independently encoded into a shared embedding space, enabling efficient retrieval via approximate nearest-neighbour search.
- **In-batch negatives:** A training strategy where items paired with other users in the same mini-batch serve as negative examples, avoiding the need for an explicit negative sampling step.
- **HNSW:** Hierarchical Navigable Small World, a graph-based algorithm for approximate nearest-neighbour search used in FAISS.
- **Leave-last-out:** An evaluation protocol where the most recent interaction of each user is held out for testing, simulating a temporal prediction scenario.
- **EBC:** Embedding Bag Collection, a TorchRec primitive for efficiently computing sparse feature embeddings.

## More Information

See the [repository README](https://github.com/IlievskiV/ReVue) for detailed instructions on data preparation, training, evaluation, and inference.

## Model Card Authors

Vladimir Ilievski