---
license: mit
language:
- en
- fr
- es
- de
- it
tags:
- recommendation-system
- two-tower
- re-ranking
- torchrec
- faiss
---
# Model Card for ReVue
ReVue is a two-stage recommendation system for property listings. It recommends new properties to returning users based on their past reviews. Stage 1 uses a two-tower model trained with in-batch-negative softmax to generate candidates via a FAISS HNSW index. Stage 2 applies a pointwise MLP re-ranker to score and re-order the retrieved candidates.
## Model Details
### Model Description
ReVue combines collaborative filtering signals (user-item interactions) with content-based features (text embeddings, dense listing attributes, and sparse categorical features) in a two-stage retrieval and ranking pipeline. The candidate generation stage learns 128-dimensional L2-normalised user and item embeddings using a two-tower architecture built on TorchRec. Retrieved candidates are then scored by a pointwise MLP re-ranker that fuses sparse embeddings, dense features, text embeddings, and the two-tower cosine similarity into a single relevance logit.
- **Developed by:** Vladimir Ilievski
- **Model type:** Two-stage recommender system (candidate generation + re-ranking)
- **Language(s) (NLP):** English (86.1%), French (4.7%), Spanish (2.5%), German (1.5%), Italian (1.1%), other (3.1%)
- **License:** MIT
### Model Sources
- **Repository:** [https://github.com/IlievskiV/ReVue](https://github.com/IlievskiV/ReVue)
## Uses
### Direct Use
ReVue is designed to recommend property listings to returning users who have previously reviewed at least one property. Given a user's review history, the system retrieves and ranks candidate listings from the full catalogue.
### Downstream Use
- Fine-tuning on property listing datasets from other cities or platforms.
- Using the learned item embeddings for related-listing or similar-property retrieval.
### Out-of-Scope Use
- **Cold-start users:** The system requires at least one past review to produce recommendations. It is not suitable for brand-new users with no interaction history.
- **Non-property domains:** The feature engineering and data pipeline are tailored to property listing data (Airbnb); applying the model to unrelated domains without adaptation is not recommended.
- **Real-time safety-critical decisions:** The model is not designed for applications where incorrect recommendations could cause harm.
## Bias, Risks, and Limitations
- **Geographic bias:** The training data comes exclusively from Airbnb London listings. The model may not generalise to other cities, countries, or cultural contexts.
- **Interaction sparsity:** Despite having ~1.5 million reviews, the user-item interaction matrix has a density of only ~0.002% due to the large number of unique users and listings, which limits the signal available for collaborative filtering.
- **Cold-start problem:** Users with no review history cannot be served. Single-review users are included in training but excluded from evaluation.
- **Popularity bias:** In-batch negative sampling can introduce a bias toward popular items that appear more frequently as negatives.
- **Approximate retrieval:** FAISS HNSW provides approximate nearest-neighbour search, trading exact recall for latency. The `efSearch` parameter controls this trade-off.
### Recommendations
Users should be aware that recommendations are biased toward the London Airbnb market represented in the training data. Deploying on a different market requires retraining on representative data. The cold-start limitation should be addressed at the application level (e.g., popularity-based fallback for new users).
## How to Get Started with the Model
Use the code below to get started with the model.
### Installation
```bash
poetry install --all-extras
# Download artefacts from Hugging Face Hub
poetry run hf download vlad0saurus/ReVue --repo-type model --local-dir .
```
### Full pipeline via CLI
```bash
# 1. Clean data
revue data clean-raw-reviews
revue data clean-raw-listings
# 2. Train two-tower model
revue model train-two-tower
# 3. Build FAISS index
revue index build-items-index
# 4. Generate re-ranking triplets
revue data create-ranking-triplets
# 5. Build ranker dataset
revue model build-ranker-dataset
# 6. Train re-ranker
revue model train-ranker
```
### Inference
```python
from revue.index.ann_items import load_ann_index, search_ann_index
from revue.models.two_tower.model import TwoTowerModel

# checkpoint_path, index_path, and device are assumed to be defined,
# pointing at the downloaded checkpoint and FAISS index artefacts.
model, checkpoint = TwoTowerModel.load_from_checkpoint(checkpoint_path, device=device)
index = load_ann_index(index_path)
user_id_map = checkpoint["user_id_map"]

# user_kjt holds the user's sparse features as a TorchRec KeyedJaggedTensor.
# Retrieve top-K candidates for the user.
user_embeddings = model.encode_user(user_kjt).cpu().numpy()
scores, listing_ids = search_ann_index(index, user_embeddings, k=100)
```
## Training Details
### Training Data
The training data is derived from publicly available Airbnb London data consisting of two tables:
- **`reviews.csv`:** User reviews of property listings, with columns including `listing_id`, `reviewer_id`, `reviewer_name`, `date`, `comments`, and an augmented `sentiment` score.
- **`listings.csv`:** Property listing metadata with features such as `name`, `description`, `host_id`, and various categorical and numerical attributes.
The reviews are **multilingual** (detected via `langdetect`):
| Language | Count | Share |
|----------|------:|------:|
| English | 43,040 | 86.1% |
| French | 2,348 | 4.7% |
| Spanish | 1,227 | 2.5% |
| German | 763 | 1.5% |
| Italian | 563 | 1.1% |
| Other / unknown | 1,582 | 3.1% |
For the re-ranker, training triplets are constructed with:
- **Positives:** All review events (label = 1)
- **Hard negatives:** Top-K (default 10) FAISS nearest neighbours not reviewed by the user
- **Easy negatives:** N (default 10) random catalogue listings not reviewed by the user
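The triplet construction above can be sketched as follows. The function name and signature are hypothetical; only the positive / hard-negative / easy-negative logic follows the description:

```python
import numpy as np

def build_triplets(user_id, reviewed_ids, ann_neighbor_ids, catalogue_ids,
                   k_hard=10, n_easy=10, seed=0):
    """Sketch of (user, listing, label) triplet construction for the re-ranker."""
    reviewed = set(reviewed_ids)
    # Positives: every review event gets label 1.
    positives = [(user_id, i, 1) for i in reviewed_ids]
    # Hard negatives: top-K FAISS neighbours the user has NOT reviewed.
    hard = [(user_id, i, 0) for i in ann_neighbor_ids if i not in reviewed][:k_hard]
    # Easy negatives: random unreviewed catalogue listings.
    rng = np.random.default_rng(seed)
    pool = [i for i in catalogue_ids if i not in reviewed]
    easy_ids = rng.choice(pool, size=min(n_easy, len(pool)), replace=False)
    easy = [(user_id, int(i), 0) for i in easy_ids]
    return positives + hard + easy
```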
### Training Procedure
#### Preprocessing
Text data undergoes a multi-stage cleaning pipeline:
1. **Quality filtering (datatrove):** `GopherQualityFilter` (min 3 words, max 1,000 words, max avg word length 15, max non-alpha ratio 0.5) and `UnigramLogProbFilter` (threshold -20).
2. **Text normalisation:** Lowercasing, whitespace stripping, link removal, symbol removal, non-alphanumeric removal, and whitespace collapsing via regex and spaCy.
3. **Sentiment augmentation (reviews only):** Expected star rating in [1, 5] from `nlptown/bert-base-multilingual-uncased-sentiment`.
4. **Train/test split:** Temporal leave-last-out for returning users (users with >= 2 reviews). Single-review users are kept in training only.
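The temporal leave-last-out split in step 4 can be sketched with pandas. The helper is hypothetical; the column names (`reviewer_id`, `listing_id`, `date`) come from the `reviews.csv` schema described above, and dates are assumed to sort chronologically:

```python
import pandas as pd

def leave_last_out(reviews: pd.DataFrame):
    """Hold out each returning user's most recent review for the test set."""
    df = reviews.sort_values(["reviewer_id", "date"])
    counts = df.groupby("reviewer_id")["listing_id"].transform("size")
    is_last = df.groupby("reviewer_id").cumcount() == counts - 1
    held_out = is_last & (counts >= 2)  # single-review users stay in training
    return df[~held_out], df[held_out]
```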
#### Training Hyperparameters
**Two-Tower Model:**
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Peak learning rate | 1e-3 |
| Weight decay | 1e-5 |
| LR schedule | Linear warmup (100 steps) + cosine decay |
| Batch size | 1,024 per GPU |
| Gradient clipping | 1.0 (max norm) |
| Epochs | 10 |
| Loss | In-batch-negative softmax cross-entropy (temperature = 0.05) |
| User MLP | [256, 128], ReLU + Dropout(0.1) |
| Item MLP | [512, 256, 128], ReLU + Dropout(0.1) |
| Output dim | 128 (L2-normalised) |
**Re-Ranker:**
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Peak learning rate | 1e-3 |
| Weight decay | 1e-5 |
| LR schedule | Linear warmup (100 steps) + cosine decay |
| Batch size | 1,024 per GPU |
| Gradient clipping | 1.0 (max norm) |
| Epochs | 10 |
| Loss | Binary cross-entropy with logits |
| MLP | [256, 128], ReLU + Dropout(0.1) → 1 logit |
- **Training regime:** fp32. Multi-GPU training supported via DDP (`torchrun`).
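The shared LR schedule (linear warmup followed by cosine decay) can be expressed as a `LambdaLR` multiplier. This is one common way to implement it; ReVue's exact scheduler code may differ:

```python
import math
import torch

def warmup_cosine(optimizer, warmup_steps, total_steps):
    """Linear warmup to the peak LR, then cosine decay to zero."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)  # linear ramp 0 -> 1
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine 1 -> 0
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```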
#### Speeds, Sizes, Times
- **Checkpoints:** ~5.9 GB (hosted on Hugging Face Hub)
- **FAISS index:** ~62 MB
- **Training data:** ~2.2 GB (CSV files)
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
The test set is constructed using a temporal leave-last-out protocol: for each returning user (>= 2 reviews), the most recent review is held out for evaluation. The remaining reviews form the training set.
#### Factors
Evaluation is disaggregated by user, with per-user ranking of the full item catalogue (two-tower) or per-user candidate lists (re-ranker).
#### Metrics
**Two-Tower (Candidate Generation):**
- **Recall@K** (K = 1, 5, 10, 50, 100): Measures whether the held-out item appears in the top-K retrieved candidates. Recall is the primary metric for candidate generation since the goal is to ensure the relevant item is included in the shortlist.
**Re-Ranker:**
- **NDCG@K** (K = 1, 5, 10): Measures the ranking quality of the re-ordered candidates, giving higher weight to relevant items ranked near the top.
- **MRR:** Mean reciprocal rank of the first relevant item.
- **Recall@K** (K = 1, 5, 10): Measures whether the relevant item appears in the top-K after re-ranking.
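With leave-last-out evaluation there is exactly one relevant item per user, which makes all three metrics simple to compute. A minimal sketch (function names are illustrative):

```python
import numpy as np

def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the held-out item appears in the top-K, else 0.0."""
    return float(relevant_id in ranked_ids[:k])

def mrr(ranked_ids, relevant_id):
    """Reciprocal rank of the held-out item (0.0 if absent)."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item == relevant_id:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_id, k):
    """With a single relevant item, IDCG = 1, so NDCG = 1/log2(rank + 1)."""
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item == relevant_id:
            return 1.0 / np.log2(rank + 1)
    return 0.0
```

Per-user scores are then averaged over all returning users in the test set.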
## Technical Specifications
### Model Architecture and Objective
**Two-Tower Model (Candidate Generation):**
- **User tower:** User-ID embedding (256-dim EBC) → MLP [256, 128] → 128-dim L2-normalised output.
- **Item tower:** Sparse features (EBC) + dense features (LayerNorm) + text embeddings (3 × 384 from `all-MiniLM-L6-v2`) → MLP [512, 256, 128] → 128-dim L2-normalised output.
- **Objective:** In-batch-negative softmax cross-entropy with temperature 0.05.
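The training objective can be sketched as follows: each user's co-indexed item in the batch is the positive, and every other item in the batch serves as a negative. This is a generic sketch of the technique, not ReVue's exact training code:

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_emb, item_emb, temperature=0.05):
    """In-batch-negative softmax cross-entropy over L2-normalised embeddings."""
    # [B, B] matrix of cosine similarities, sharpened by the temperature.
    logits = user_emb @ item_emb.T / temperature
    # The positive for user i is item i; all other items are negatives.
    labels = torch.arange(user_emb.size(0), device=user_emb.device)
    return F.cross_entropy(logits, labels)
```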
**Re-Ranker (Pointwise MLP):**
- **Input:** User embedding (64-dim) + item sparse embeddings (64-dim each) + dense features (LayerNorm) + text embeddings (3 × 384) + two-tower cosine similarity.
- **Architecture:** MLP [256, 128] → 1 logit.
- **Objective:** Binary cross-entropy with logits.
**FAISS Index:**
- **Type:** `IndexHNSWFlat` + `IndexIDMap`
- **Metric:** `METRIC_INNER_PRODUCT` (equivalent to cosine similarity for L2-normalised embeddings)
- **Embedding dimension:** 128
### Compute Infrastructure
Multi-GPU training via PyTorch DDP (`torchrun`). Single-GPU and CPU inference supported.
#### Hardware
CUDA-capable GPU recommended for training. CPU supported for inference.
#### Software
- Python >= 3.11, < 3.12
- PyTorch >= 2.6
- TorchRec >= 1.0
- FAISS (faiss-cpu >= 1.13.2)
- Transformers >= 5.0
- spaCy >= 3.3
- datatrove >= 0.8
- Poetry >= 2.1.3 (dependency management)
## Glossary
- **Two-tower model:** A dual-encoder architecture where user and item features are independently encoded into a shared embedding space, enabling efficient retrieval via approximate nearest-neighbour search.
- **In-batch negatives:** A training strategy where items paired with other users in the same mini-batch serve as negative examples, avoiding the need for an explicit negative sampling step.
- **HNSW:** Hierarchical Navigable Small World, a graph-based algorithm for approximate nearest-neighbour search used in FAISS.
- **Leave-last-out:** An evaluation protocol where the most recent interaction of each user is held out for testing, simulating a temporal prediction scenario.
- **EBC:** Embedding Bag Collection, a TorchRec primitive for efficiently computing sparse feature embeddings.
## More Information
See the [repository README](https://github.com/IlievskiV/ReVue) for detailed instructions on data preparation, training, evaluation, and inference.
## Model Card Authors
Vladimir Ilievski