metadata
license: mit
library_name: pytorch
tags:
- model-recommendation
- model-selection
- ranking
- model-routing
- benchmarks
- leaderboard
pipeline_tag: tabular-regression
ModelLens — Trained Recommender Checkpoint
📄 Paper: ModelLens: Finding the Best Model for Your Task from Myriads of Models · 🤗 Collection: luisrui/modellens · 💻 Code: github.com/luisrui/ModelLens
This is the released ModelLens checkpoint — a metric-aware ranker that, given a dataset description + task + metric, returns a ranked list of HuggingFace models likely to perform well on it. No fine-tuning, no forward pass on the target dataset.
This repo only ships the weights. For:
- Live demo (Gradio): 🤗 luisrui/ModelLens
- Training data: 🤗 luisrui/ModelLens-corpus-v2 (1.81M rows, recommended)
- Source code: github.com/luisrui/ModelLens
- Paper: see citation below
What's in here
| File | Size | Description |
|---|---|---|
| ModelLens.pt | ~709 MB | Trained recommender weights (slim, inference-ready; ~3 unused parent-class buffers dropped) |
| args.json | ~2 KB | Training-time hyperparameters (model dims, num_models / num_tasks / num_metrics / etc.) |
Provenance
- Trained on: luisrui/ModelLens-corpus-v2, 1,807,133 (model × dataset × metric × value) records
- Coverage: 47,242 HuggingFace models · 2,581 tasks · 3,714 metrics · ~86k datasets
- Architecture: MLPMetricFull (the paper model; see github repo)
- Loss: ensemble (listwise + pairwise + pointwise, λ_list=0.5, λ_pair=1.0, w_point=0.1)
- Training: 30 epochs, DDP × 4 GPUs, bs=8, lr=1e-3, wd=1e-4, learnable τ
- Slimmed checkpoint: inference-unused parent-class buffers + train-set dataset_desc_matrix stripped (load with strict=False).
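The card gives the three loss weights but not the exact form of each term. As a rough illustration only, the sketch below assumes a ListNet-style softmax cross-entropy for the listwise term (with τ as a temperature), a RankNet-style logistic loss for the pairwise term, and mean squared error for the pointwise term. All function names are hypothetical, and the actual training code in the GitHub repo may differ.

```python
import math

def _softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def listwise_loss(scores, targets, tau=1.0):
    # ASSUMPTION: cross-entropy between softmax(targets / tau) and softmax(scores / tau).
    p = _softmax([t / tau for t in targets])
    q = _softmax([s / tau for s in scores])
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def pairwise_loss(scores, targets):
    # ASSUMPTION: logistic loss averaged over pairs where item i should outrank item j.
    losses = [
        math.log1p(math.exp(-(scores[i] - scores[j])))
        for i in range(len(scores))
        for j in range(len(scores))
        if targets[i] > targets[j]
    ]
    return sum(losses) / len(losses) if losses else 0.0

def pointwise_loss(scores, targets):
    # ASSUMPTION: mean squared error against the recorded metric value.
    return sum((s - t) ** 2 for s, t in zip(scores, targets)) / len(scores)

def ensemble_loss(scores, targets, lam_list=0.5, lam_pair=1.0, w_point=0.1, tau=1.0):
    # Weighted sum using the weights listed in this card.
    return (
        lam_list * listwise_loss(scores, targets, tau)
        + lam_pair * pairwise_loss(scores, targets)
        + w_point * pointwise_loss(scores, targets)
    )

# Toy metric values for three models on one (dataset, metric) query.
targets = [0.9, 0.5, 0.1]
loss_correct_order = ensemble_loss([0.9, 0.5, 0.1], targets)
loss_reversed_order = ensemble_loss([0.1, 0.5, 0.9], targets)
```

Under these assumptions, scores that order the models correctly yield a lower combined loss than scores that reverse the order, which is the behaviour a ranking objective should have.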
Loading
import json

import torch
from huggingface_hub import hf_hub_download

# Download the weights and the training-time hyperparameters from the Hub.
ckpt_path = hf_hub_download("luisrui/ModelLens", "ModelLens.pt")
args_path = hf_hub_download("luisrui/ModelLens", "args.json")

with open(args_path) as f:
    args = json.load(f)
state = torch.load(ckpt_path, map_location="cpu")

# Build the model from source (see github.com/luisrui/ModelLens) and load:
# model = MLPMetricFull(**args_to_kwargs(args))
# model.load_state_dict(state, strict=False)  # strict=False is intentional
For a complete, ready-to-run setup including the candidate model pool + metadata, see inference_lib.py and recommend.py in the Space.
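Because the checkpoint is slimmed, loading with strict=False will report some keys as missing, and it can be worth checking that only the expected ones are. The sketch below shows the mechanism on a tiny stand-in module: TinyRanker is hypothetical and is not the ModelLens architecture, and the buffer name is borrowed from this card purely for illustration.

```python
import torch
from torch import nn

class TinyRanker(nn.Module):
    """Hypothetical stand-in module; NOT the real MLPMetricFull."""

    def __init__(self):
        super().__init__()
        self.scorer = nn.Linear(4, 1)
        # A buffer that the slimmed checkpoint below will not contain.
        self.register_buffer("dataset_desc_matrix", torch.zeros(2, 4))

model = TinyRanker()

# Simulate a slimmed checkpoint: keep the weights, drop the buffer.
slim_state = {
    k: v for k, v in model.state_dict().items() if k != "dataset_desc_matrix"
}

# With strict=False, load_state_dict reports mismatches instead of raising.
result = model.load_state_dict(slim_state, strict=False)
missing = list(result.missing_keys)
unexpected = list(result.unexpected_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

When loading the real checkpoint, the same two lists show which entries were stripped; an unexpectedly long missing_keys list would suggest that the model was built with the wrong arguments.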
How it works
- The dataset description is embedded with OpenAI text-embedding-3-small (1536-dim, the same encoder used at training time).
- The ranker scores every candidate model conditioned on (dataset_embedding, task_id, metric_id, model_size_bucket, model_family_id, model_id).
- Returns the top-K candidates, optionally filtered by param count / "HF-hosted only" / "official pretrained only".
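The final step (filter, then take the top-K by score) can be sketched in a few lines. This is a minimal illustration that assumes the scores have already been computed. The Candidate fields and the top_k helper are hypothetical names and are not the API of inference_lib.py or recommend.py.

```python
import heapq
from dataclasses import dataclass

@dataclass
class Candidate:
    model_id: str
    score: float      # relative ranker score, higher is better
    num_params: int   # parameter count
    hf_hosted: bool   # whether the model is hosted on the Hub

def top_k(candidates, k, max_params=None, hf_hosted_only=False):
    """Apply the optional filters, then return the k highest-scoring candidates."""
    pool = [
        c for c in candidates
        if (max_params is None or c.num_params <= max_params)
        and (not hf_hosted_only or c.hf_hosted)
    ]
    return heapq.nlargest(k, pool, key=lambda c: c.score)

# Made-up candidates for illustration only.
candidates = [
    Candidate("a", 0.9, 7_000_000_000, True),
    Candidate("b", 0.8, 300_000_000, True),
    Candidate("c", 0.7, 100_000_000, False),
    Candidate("d", 0.2, 50_000_000, True),
]
unfiltered = [c.model_id for c in top_k(candidates, 2)]
small_hosted = [
    c.model_id
    for c in top_k(candidates, 2, max_params=1_000_000_000, hf_hosted_only=True)
]
```

Filtering before ranking means that a request for K models returns K results whenever enough candidates pass the filters, rather than a top-K list that is then thinned out.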
Intended use
- Picking a starting model for a new task / dataset, without running every candidate.
- Cheap pre-filter ahead of a more expensive transferability estimator or partial fine-tune.
Limitations
- Knowledge is bounded by what's in corpus-v2 (up to early 2026).
- Models / datasets that don't appear in the corpus fall back to text similarity over their descriptions, which is useful but weaker than the full signal available for in-corpus entities.
- Scores are relative — the ranking is what matters; the absolute numbers are not calibrated to any specific metric scale.
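The card does not say how the text-similarity fallback is implemented. One plausible reading is a nearest-neighbour lookup over description embeddings, sketched below with cosine similarity. The function names and the toy 3-dimensional vectors are made up for illustration; real description embeddings would be 1536-dimensional.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_in_corpus(query_vec, corpus_vecs):
    """Return (name, similarity) for the in-corpus entry closest to query_vec."""
    return max(
        ((name, cosine(query_vec, vec)) for name, vec in corpus_vecs.items()),
        key=lambda pair: pair[1],
    )

# Toy vectors standing in for description embeddings (ASSUMPTION: illustrative only).
corpus = {
    "sentiment-dataset": [0.9, 0.1, 0.0],
    "ocr-dataset": [0.0, 0.2, 0.9],
}
name, sim = nearest_in_corpus([0.8, 0.2, 0.1], corpus)
```

A low best-match similarity is a hint that the query sits far from anything in the corpus, in which case the resulting ranking deserves less trust.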
Citation
@article{cai2026modellens,
title={ModelLens: Finding the Best Model for Your Task from Myriads of Models},
author={Cai, Rui and Mo, Weijie Jacky and Wen, Xiaofei and Ma, Qiyao and Zhu, Wenhui and Chen, Xiwen and Chen, Muhao and Zhao, Zhe},
journal={arXiv preprint arXiv:2605.07075},
year={2026}
}
License
MIT.