---
license: mit
library_name: pytorch
tags:
- model-recommendation
- model-selection
- ranking
- model-routing
- benchmarks
- leaderboard
pipeline_tag: tabular-regression
---
# ModelLens — Trained Recommender Checkpoint
📄 **Paper**: [ModelLens: Finding the Best Model for Your Task from Myriads of Models](https://huggingface.co/papers/2605.07075)
· 🤗 **Collection**: [luisrui/modellens](https://huggingface.co/collections/luisrui/modellens)
· 💻 **Code**: [github.com/luisrui/ModelLens](https://github.com/luisrui/ModelLens)
This is the released **ModelLens** checkpoint: a metric-aware ranker that,
given a dataset description, a task, and a metric, returns a ranked list of
HuggingFace models likely to perform well on it. It requires no fine-tuning
and no forward pass on the target dataset.
This repo ships only the weights. Everything else lives elsewhere:
- **Live demo (Gradio)**: 🤗 [`luisrui/ModelLens`](https://huggingface.co/spaces/luisrui/ModelLens)
- **Training data**: 🤗 [`luisrui/ModelLens-corpus-v2`](https://huggingface.co/datasets/luisrui/ModelLens-corpus-v2) (1.81M rows, recommended)
- **Source code**: [github.com/luisrui/ModelLens](https://github.com/luisrui/ModelLens)
- **Paper**: see citation below
## What's in here
| File | Size | Description |
|---|---:|---|
| `ModelLens.pt` | ~709 MB | Trained recommender weights (slim — inference-ready, ~3 unused parent-class buffers dropped) |
| `args.json` | ~2 KB | Training-time hyperparameters (model dims, num_models / num_tasks / num_metrics / etc.) |
## Provenance
- **Trained on**: [`luisrui/ModelLens-corpus-v2`](https://huggingface.co/datasets/luisrui/ModelLens-corpus-v2) — 1,807,133 (model × dataset × metric × value) records
- **Coverage**: 47,242 HuggingFace models · 2,581 tasks · 3,714 metrics · ~86k datasets
- **Architecture**: `MLPMetricFull` (the paper model — see [github repo](https://github.com/luisrui/ModelLens))
- **Loss**: ensemble (listwise + pairwise + pointwise, `λ_list=0.5, λ_pair=1.0, w_point=0.1`)
- **Training**: 30 epochs, DDP × 4 GPUs, `bs=8`, `lr=1e-3`, `wd=1e-4`, learnable `τ`
- **Slimmed checkpoint**: inference-unused parent-class buffers + train-set `dataset_desc_matrix` stripped (load with `strict=False`).
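
To make the loss weights listed above concrete, here is a minimal pure-Python sketch of how a three-part ranking loss might be combined. Only the weights (`λ_list=0.5`, `λ_pair=1.0`, `w_point=0.1`) and the existence of a learnable `τ` come from this card. The specific forms below (ListNet-style softmax cross-entropy, pairwise logistic loss, squared error, and `τ` used as a temperature on the listwise term) are assumptions for illustration, not the repository's implementation.

```python
import math


def _softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def listwise_loss(scores, targets, tau=1.0):
    """Cross-entropy between softmax(targets) and softmax(scores / tau) (assumed form)."""
    p = _softmax(targets)
    q = _softmax([s / tau for s in scores])
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


def pairwise_loss(scores, targets):
    """Mean logistic loss over pairs (i, j) with targets[i] > targets[j] (assumed form)."""
    losses = [
        math.log1p(math.exp(-(scores[i] - scores[j])))
        for i in range(len(scores))
        for j in range(len(scores))
        if targets[i] > targets[j]
    ]
    return sum(losses) / len(losses) if losses else 0.0


def pointwise_loss(scores, targets):
    """Mean squared error between predicted and observed values (assumed form)."""
    return sum((s - t) ** 2 for s, t in zip(scores, targets)) / len(scores)


def ensemble_loss(scores, targets, lam_list=0.5, lam_pair=1.0, w_point=0.1, tau=1.0):
    """Weighted sum of the three terms, using the weights listed in Provenance."""
    return (
        lam_list * listwise_loss(scores, targets, tau)
        + lam_pair * pairwise_loss(scores, targets)
        + w_point * pointwise_loss(scores, targets)
    )
```

Under these assumptions, scores that order the candidates the same way as the observed metric values give a lower loss than scores that reverse the order.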
## Loading
```python
import json

import torch
from huggingface_hub import hf_hub_download

# Download the weights and the training-time hyperparameters from the Hub.
ckpt_path = hf_hub_download("luisrui/ModelLens", "ModelLens.pt")
args_path = hf_hub_download("luisrui/ModelLens", "args.json")

with open(args_path) as f:
    args = json.load(f)
state = torch.load(ckpt_path, map_location="cpu")

# Build the model from source (see github.com/luisrui/ModelLens) and load:
# model = MLPMetricFull(**args_to_kwargs(args))
# model.load_state_dict(state, strict=False)  # strict=False is intentional
```
For a complete, ready-to-run setup including the candidate model pool +
metadata, see [`inference_lib.py`](https://huggingface.co/spaces/luisrui/ModelLens/blob/main/inference_lib.py)
and [`recommend.py`](https://huggingface.co/spaces/luisrui/ModelLens/blob/main/recommend.py)
in the Space.
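
Because `strict=False` silences every key mismatch, not only the intentionally stripped ones, it can be worth checking what was actually skipped. PyTorch's `load_state_dict` returns an object with `missing_keys` and `unexpected_keys`; the helper below is a hypothetical sketch (`check_load_result` and `allowed_missing` are not part of the ModelLens code) that fails loudly on anything outside an allow-list. Only `dataset_desc_matrix` is named on this card, so the names of the dropped parent-class buffers would need to be added by hand.

```python
def check_load_result(missing_keys, unexpected_keys,
                      allowed_missing=("dataset_desc_matrix",)):
    """Raise if strict=False hid a mismatch other than the allow-listed entries."""
    bad_missing = [
        key for key in missing_keys
        if not any(name in key for name in allowed_missing)
    ]
    if bad_missing or unexpected_keys:
        raise RuntimeError(
            f"unexpected load mismatch: missing={bad_missing}, "
            f"unexpected={list(unexpected_keys)}"
        )
    return True


# Intended usage once the model has been built from source:
# result = model.load_state_dict(state, strict=False)
# check_load_result(result.missing_keys, result.unexpected_keys)
```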
## How it works
1. The dataset description is embedded with OpenAI `text-embedding-3-small`
(1536-dim — same encoder used at training time).
2. The ranker scores every candidate model conditioned on
`(dataset_embedding, task_id, metric_id, model_size_bucket, model_family_id, model_id)`.
3. Returns the top-K candidates, optionally filtered by param count /
"HF-hosted only" / "official pretrained only".
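
Step 3 can be sketched as a filter followed by a sort. The `Candidate` fields and the `top_k` function below are hypothetical names chosen for illustration; the actual filtering logic lives in the Space's `inference_lib.py` and `recommend.py` and may differ, for example in how it treats models with an unknown parameter count.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    model_id: str
    score: float                      # ranker output, higher is better
    num_params: Optional[int] = None  # None when the size is unknown
    hf_hosted: bool = True
    official: bool = False


def top_k(candidates, k=5, max_params=None, hf_hosted_only=False, official_only=False):
    """Apply the optional filters, then return the k highest-scoring candidates."""
    def keep(c):
        # Assumption: a size cap excludes models whose size is unknown.
        if max_params is not None and (c.num_params is None or c.num_params > max_params):
            return False
        if hf_hosted_only and not c.hf_hosted:
            return False
        if official_only and not c.official:
            return False
        return True

    kept = [c for c in candidates if keep(c)]
    return sorted(kept, key=lambda c: c.score, reverse=True)[:k]
```

Filtering before truncating to `k` matters: truncating first could return fewer than `k` results even when enough eligible models exist.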
## Intended use
- Picking a starting model for a new task / dataset, without running
every candidate.
- Cheap pre-filter ahead of a more expensive transferability estimator
or partial fine-tune.
## Limitations
- Knowledge is bounded by what's in `corpus-v2` (up to early 2026).
- Models / datasets that don't appear in the corpus fall back to text
similarity over their descriptions — useful but weaker than the full
signal available for in-corpus entities.
- Scores are *relative* — the ranking is what matters; the absolute
numbers are not calibrated to any specific metric scale.
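
Since only the ordering is meaningful, one way to consume the output is to convert scores to ranks before reporting or comparing them. The helper below is an illustrative sketch; `to_ranks` is a hypothetical name, not part of the ModelLens code.

```python
def to_ranks(scores):
    """Map {model_id: score} to {model_id: rank}, where rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model_id: rank for rank, model_id in enumerate(ordered, start=1)}
```

Any increasing rescaling of the scores leaves these ranks unchanged, which is why the raw values should not be read as predicted metric values.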
## Citation
```bibtex
@article{cai2026modellens,
title={ModelLens: Finding the Best Model for Your Task from Myriads of Models},
author={Cai, Rui and Mo, Weijie Jacky and Wen, Xiaofei and Ma, Qiyao and Zhu, Wenhui and Chen, Xiwen and Chen, Muhao and Zhao, Zhe},
journal={arXiv preprint arXiv:2605.07075},
year={2026}
}
```
## License
MIT.