FM-Relevancy-v1.0
Overview
Binary classifier that fuses image and text feature embeddings to score listing relevancy. The fusion module is encoder-agnostic by design: it consumes pre-computed embeddings of fixed dimensions and produces a single relevancy logit.
Architecture
The model is composed of a multimodal fusion encoder followed by a small classification head.
- Input projections: each modality is projected through a `Linear → LayerNorm` block into the model's hidden dimension.
- Modality streams: image and text each have an independent stack of 3 Q-Former blocks operating on a set of 12 learnable query tokens per stream.
- Fusion stream: a separate stack of 2 Q-Former blocks with its own 12 learnable queries, which cross-attends to the outputs of the two modality streams.
- Pooling & projection: the fusion queries are reduced to a single vector via attention pooling, then projected through a small MLP head (`LayerNorm → Linear(1024→512) → GELU → Linear(512→512)`).
- Classifier: `Linear(512 → 1)` producing the relevancy logit (trained with `BCEWithLogitsLoss`).
Each Q-Former block contains pre-norm self-attention, cross-attention, and an
MLP feed-forward (`Linear(1024→4096) → GELU → Linear(4096→1024)`),
with LayerNorm on each sub-block.
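The block and modality-stream structure described above can be sketched in PyTorch as follows. This is an illustrative reconstruction, not the code shipped in `modeling_fm_relevancy.py`: the class names, the number of attention heads, the query initialisation, and the use of `nn.MultiheadAttention` are all assumptions. The demo uses a reduced hidden width so it runs quickly; the constructor defaults follow the dimensions stated in this card.

```python
import torch
from torch import nn


class QFormerBlockSketch(nn.Module):
    """Pre-norm block: self-attention, cross-attention, then an MLP feed-forward."""

    def __init__(self, dim=1024, num_heads=8, mlp_dim=4096):  # num_heads is assumed
        super().__init__()
        self.norm_self = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_cross = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim)
        )

    def forward(self, queries, context, context_mask=None):
        # queries: [B, Q, dim]; context: [B, N, dim]; context_mask: [B, N], True = valid
        h = self.norm_self(queries)
        queries = queries + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm_cross(queries)
        # nn.MultiheadAttention treats True as "ignore", so the mask is inverted.
        pad = None if context_mask is None else ~context_mask
        attended = self.cross_attn(
            h, context, context, key_padding_mask=pad, need_weights=False
        )[0]
        queries = queries + attended
        return queries + self.mlp(self.norm_mlp(queries))


class ModalityStreamSketch(nn.Module):
    """Linear -> LayerNorm projection, learnable queries, and a stack of blocks."""

    def __init__(self, in_dim=1024, dim=1024, num_queries=12, depth=3,
                 num_heads=8, mlp_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, dim), nn.LayerNorm(dim))
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.blocks = nn.ModuleList(
            QFormerBlockSketch(dim, num_heads, mlp_dim) for _ in range(depth)
        )

    def forward(self, features, mask=None):
        context = self.proj(features)                        # [B, N, dim]
        q = self.queries.expand(features.shape[0], -1, -1)   # [B, Q, dim]
        for block in self.blocks:
            q = block(q, context, mask)
        return q


torch.manual_seed(0)
# Reduced width for a fast demo; the card's values are the constructor defaults.
stream = ModalityStreamSketch(dim=64, num_heads=4, mlp_dim=256)
image_features = torch.randn(2, 2, 1024)
image_mask = torch.tensor([[True, True], [True, False]])
out = stream(image_features, image_mask)
print(out.shape)  # torch.Size([2, 12, 64])
```

Each row of the mask needs at least one valid entry in this sketch, since a fully masked row would leave the cross-attention softmax with nothing to attend to.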
Total trainable parameters: 139,422,210.
Inputs
| field | shape | dtype | notes |
|---|---|---|---|
| `image_features` | `[B, num_images, 1024]` | float | one row per image |
| `text_features` | `[B, num_texts, 1024]` | float | one row per text field |
| `image_mask` | `[B, num_images]` | bool | optional, True = valid |
| `text_mask` | `[B, num_texts]` | bool | optional, True = valid |
num_images should be at most 2 for best results. Any image/text
encoder producing embeddings of matching dimensions can be paired with this
model. Encoders with different output dimensions require fine-tuning the input
projection layers.
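When listings carry different numbers of images or text fields, the inputs need to be padded to a common length and paired with a mask. A minimal sketch of one way to do this is shown below; `pad_embeddings` is a hypothetical helper, not part of the released code, and it assumes that zero-filled rows are acceptable at masked positions.

```python
import torch


def pad_embeddings(per_item, max_items, dim=1024):
    """Stack variable-length embeddings into [B, max_items, dim] plus a bool mask."""
    batch = len(per_item)
    features = torch.zeros(batch, max_items, dim)
    mask = torch.zeros(batch, max_items, dtype=torch.bool)
    for i, emb in enumerate(per_item):
        n = min(emb.shape[0], max_items)  # rows beyond max_items are dropped
        features[i, :n] = emb[:n]
        mask[i, :n] = True                # True = valid, matching the table above
    return features, mask


# Two listings: the first has two image embeddings, the second has one.
listing_images = [torch.randn(2, 1024), torch.randn(1, 1024)]
image_features, image_mask = pad_embeddings(listing_images, max_items=2)
print(image_features.shape)  # torch.Size([2, 2, 1024])
print(image_mask.tolist())   # [[True, True], [True, False]]
```

The same helper can be applied to the text embeddings with a different `max_items`.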
Example encoder pairing
The released checkpoint was trained with DINOv3 ViT-L/16 and BGE-M3 as one representative encoder pair; other encoders are compatible with appropriate fine-tuning.
Files
| file | description |
|---|---|
| `model.safetensors` | weights (~530MB, 139.4M parameters, fp32) |
| `config.json` | model configuration |
| `configuration_fm_relevancy.py` | `FMRelevancyConfig` |
| `modeling_fm_relevancy.py` | `FMRelevancyModel` |
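As a quick sanity check, the weight file size can be estimated from the parameter count stated in this card, assuming 4 bytes per fp32 parameter and ignoring the small safetensors header:

```python
num_params = 139_422_210   # total trainable parameters stated in this card
bytes_per_param = 4        # fp32
size_bytes = num_params * bytes_per_param

print(size_bytes)                     # 557688840
print(round(size_bytes / 1e6, 1))     # 557.7 (decimal megabytes)
print(round(size_bytes / 2**20, 1))   # 531.9 (binary megabytes, MiB)
```

The result is consistent with the ~530MB figure above if that figure is read in binary megabytes.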
Usage
```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "marqvision-ai/FM-Relevancy-v1.0",
    trust_remote_code=True,
)
model.eval()

# Prepare your pre-computed embeddings.
# image_features: [B, num_images, 1024]
# text_features:  [B, num_texts, 1024]
# image_mask:     [B, num_images] (bool, True = valid)
# text_mask:      [B, num_texts]  (bool, True = valid)

with torch.no_grad():
    out = model(
        image_features=image_features,
        text_features=text_features,
        image_mask=image_mask,
        text_mask=text_mask,
    )

probs = torch.sigmoid(out.logits)  # [B] in [0, 1]
```
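To turn the probabilities into binary relevancy decisions, a threshold can be applied. The sketch below uses stand-in logits in place of `out.logits`, and the 0.5 cut-off is an assumption rather than a recommended operating point; in practice the threshold would typically be tuned on validation data.

```python
import torch

logits = torch.tensor([-2.0, 0.5, 3.0])  # stand-in for out.logits, shape [B]
probs = torch.sigmoid(logits)

threshold = 0.5                          # assumed cut-off, not from the model card
is_relevant = probs >= threshold
print(is_relevant.tolist())  # [False, True, True]
```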
License
CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International).