FM-Relevancy-v1.0

Overview

Binary classifier that fuses image and text feature embeddings to score listing relevancy. The fusion module is encoder-agnostic by design: it consumes pre-computed embeddings of fixed dimensions and produces a single relevancy logit.

Architecture

The model is composed of a multimodal fusion encoder followed by a small classification head.

  • Input projections: each modality is projected through a Linear → LayerNorm block into the model's hidden dimension.
  • Modality streams: image and text each have an independent stack of 3 Q-Former blocks operating on a set of 12 learnable query tokens per stream.
  • Fusion stream: a separate stack of 2 Q-Former blocks with its own 12 learnable queries, which cross-attends to the outputs of the two modality streams.
  • Pooling & projection: the fusion queries are reduced to a single vector via attention pooling, then projected through a small MLP head (LayerNorm → Linear(1024→512) → GELU → Linear(512→512)).
  • Classifier: Linear(512 → 1) producing the relevancy logit (trained with BCEWithLogitsLoss).
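The pooling, projection, and classifier stages listed above can be sketched in PyTorch as follows. This is only an illustration under stated assumptions: the class name PoolAndClassifySketch is hypothetical, and the attention pooling is implemented here as a softmax over one learned score per query, which may differ from the pooling used in the released modeling code. The MLP head and classifier shapes follow the list above.

```python
import torch
from torch import nn


class PoolAndClassifySketch(nn.Module):
    """Hypothetical sketch: attention pooling -> MLP head -> 1-logit classifier."""

    def __init__(self, dim=1024, head_dim=512):
        super().__init__()
        # Assumption: pooling weights come from a single learned score per query.
        self.score = nn.Linear(dim, 1)
        # LayerNorm -> Linear(1024->512) -> GELU -> Linear(512->512), as listed above.
        self.head = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, head_dim),
            nn.GELU(),
            nn.Linear(head_dim, head_dim),
        )
        self.classifier = nn.Linear(head_dim, 1)

    def forward(self, fusion_queries):
        # fusion_queries: [B, 12, dim]
        weights = torch.softmax(self.score(fusion_queries), dim=1)  # [B, 12, 1]
        pooled = (weights * fusion_queries).sum(dim=1)              # [B, dim]
        return self.classifier(self.head(pooled)).squeeze(-1)       # [B]


torch.manual_seed(0)
head = PoolAndClassifySketch()
fusion_queries = torch.randn(4, 12, 1024)      # dummy fusion-stream output
logits = head(fusion_queries)                  # [4]
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])    # dummy relevancy labels
loss = nn.BCEWithLogitsLoss()(logits, labels)  # scalar training loss
```

BCEWithLogitsLoss applies the sigmoid internally, which is why the classifier emits a raw logit rather than a probability.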

Each Q-Former block contains pre-norm self-attention, cross-attention, and an MLP feed-forward (Linear(1024→4096) → GELU → Linear(4096→1024)), with LayerNorm on each sub-block.
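A minimal PyTorch sketch of one such block is shown below. It is not the released implementation: the class name QFormerBlockSketch, the use of nn.MultiheadAttention, and the head count of 16 are assumptions made for illustration. Only the pre-norm layout and the feed-forward shapes come from the description above.

```python
import torch
from torch import nn


class QFormerBlockSketch(nn.Module):
    """Hypothetical pre-norm block: self-attention, cross-attention, feed-forward."""

    def __init__(self, dim=1024, num_heads=16, ffn_dim=4096):  # num_heads is assumed
        super().__init__()
        self.norm_self = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_cross = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim)
        )

    def forward(self, queries, context, context_mask=None):
        # The model card uses True = valid; MultiheadAttention expects True = ignore.
        key_padding_mask = None if context_mask is None else ~context_mask
        h = self.norm_self(queries)
        queries = queries + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm_cross(queries)
        queries = queries + self.cross_attn(
            h, context, context,
            key_padding_mask=key_padding_mask, need_weights=False,
        )[0]
        return queries + self.ffn(self.norm_ffn(queries))


torch.manual_seed(0)
block = QFormerBlockSketch()
queries = torch.randn(2, 12, 1024)   # 12 learnable queries per stream
context = torch.randn(2, 2, 1024)    # e.g. projected image features
context_mask = torch.tensor([[True, True], [True, False]])
with torch.no_grad():
    out = block(queries, context, context_mask)
print(out.shape)  # torch.Size([2, 12, 1024])
```

The output keeps the shape of the queries regardless of how many context rows are attended to, which is what lets the fusion stream consume the modality streams without knowing num_images or num_texts.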

Total trainable parameters: 139,422,210.
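As a rough cross-check, the layers whose shapes are fully stated above can be counted by hand. The sketch below assumes every Linear layer has a bias and every LayerNorm has a weight and bias; it does not attempt to reproduce the full total, because the attention layers, input projections, and query tokens are not specified in enough detail here.

```python
def linear_params(n_in, n_out):
    # Weight matrix plus bias vector (bias assumed present).
    return n_in * n_out + n_out


def layernorm_params(dim):
    # Elementwise scale and shift (affine LayerNorm assumed).
    return 2 * dim


TOTAL_PARAMS = 139_422_210  # figure reported in this model card
NUM_BLOCKS = 3 + 3 + 2      # image stream + text stream + fusion stream

ffn_per_block = linear_params(1024, 4096) + linear_params(4096, 1024)
ffn_total = NUM_BLOCKS * ffn_per_block
mlp_head = (
    layernorm_params(1024) + linear_params(1024, 512) + linear_params(512, 512)
)
classifier = linear_params(512, 1)

print(f"feed-forward per block: {ffn_per_block:,}")  # 8,393,728
print(f"feed-forward, 8 blocks: {ffn_total:,}")      # 67,149,824
print(f"MLP head:               {mlp_head:,}")       # 789,504
print(f"classifier:             {classifier:,}")     # 513
print(f"feed-forward share:     {ffn_total / TOTAL_PARAMS:.1%}")
```

Under these assumptions the feed-forward layers alone account for a little under half of the reported total, with the remainder sitting in the attention layers, input projections, norms, and learnable queries.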

Inputs

field            shape                    dtype   notes
image_features   [B, num_images, 1024]    float   one row per image
text_features    [B, num_texts, 1024]     float   one row per text field
image_mask       [B, num_images]          bool    optional, True = valid
text_mask        [B, num_texts]           bool    optional, True = valid

For best results, keep num_images at 2 or fewer. Any image or text encoder that produces 1024-dimensional embeddings can be paired with this model. Encoders with a different output dimension require fine-tuning the input projection layers.
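Listings usually carry different numbers of images and text fields, so the inputs need padding before batching. The helper below is a sketch of one way to build the padded tensors and masks; the name pad_and_mask and the choice to zero-pad and truncate are assumptions, not part of the released code.

```python
import torch


def pad_and_mask(per_listing, max_items, dim=1024):
    """Stack per-listing embeddings of shape [n_i, dim] into [B, max_items, dim].

    Returns the zero-padded features and a bool mask where True marks a valid row.
    Hypothetical helper; rows beyond max_items are dropped.
    """
    batch = len(per_listing)
    features = torch.zeros(batch, max_items, dim)
    mask = torch.zeros(batch, max_items, dtype=torch.bool)
    for i, emb in enumerate(per_listing):
        n = min(emb.shape[0], max_items)
        features[i, :n] = emb[:n]
        mask[i, :n] = True
    return features, mask


# Dummy embeddings standing in for real encoder outputs.
images = [torch.randn(2, 1024), torch.randn(1, 1024)]
texts = [torch.randn(3, 1024), torch.randn(1, 1024)]

image_features, image_mask = pad_and_mask(images, max_items=2)
text_features, text_mask = pad_and_mask(texts, max_items=3)

print(image_features.shape, image_mask.tolist())
# torch.Size([2, 2, 1024]) [[True, True], [True, False]]
```

How the model behaves when a listing has no valid rows at all in one modality is not documented here, so it is safest to check that case separately before relying on it.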

Example encoder pairing

The released checkpoint was trained with DINOv3 ViT-L/16 and BGE-M3 as one representative encoder pair; other encoders are compatible with appropriate fine-tuning.

Files

file                             description
model.safetensors                weights (~530 MiB, 139.4M parameters, fp32)
config.json                      model configuration
configuration_fm_relevancy.py    FMRelevancyConfig
modeling_fm_relevancy.py         FMRelevancyModel
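The weight file size follows directly from the parameter count and the fp32 storage type. The short calculation below shows the arithmetic; it ignores the small safetensors header, so the file on disk will be slightly larger.

```python
NUM_PARAMS = 139_422_210  # total trainable parameters reported above
BYTES_PER_FP32 = 4

weight_bytes = NUM_PARAMS * BYTES_PER_FP32
size_mb = weight_bytes / 1e6     # decimal megabytes
size_mib = weight_bytes / 2**20  # binary mebibytes

print(f"{weight_bytes:,} bytes = {size_mb:.1f} MB = {size_mib:.1f} MiB")
# 557,688,840 bytes = 557.7 MB = 531.9 MiB
```

The roughly 530 figure quoted for the weights therefore matches the size in binary units, while tools that report decimal megabytes will show about 558 MB.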

Usage

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "marqvision-ai/FM-Relevancy-v1.0",
    trust_remote_code=True,
)
model.eval()

# Prepare your pre-computed embeddings.
# image_features: [B, num_images, 1024]
# text_features:  [B, num_texts,  1024]
# image_mask:     [B, num_images]  (bool, True = valid)
# text_mask:      [B, num_texts]   (bool, True = valid)

with torch.no_grad():
    out = model(
        image_features=image_features,
        text_features=text_features,
        image_mask=image_mask,
        text_mask=text_mask,
    )
    probs = torch.sigmoid(out.logits)   # [B] in [0, 1]
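To turn the score into a yes/no decision, a probability threshold can be applied either after the sigmoid or directly on the logit, since sigmoid(z) >= t exactly when z >= log(t / (1 - t)). The sketch below uses plain Python to show the conversion; the function names and the default threshold of 0.5 are assumptions, and this model card does not state a recommended operating point.

```python
import math


def logit_threshold(prob_threshold):
    """Logit value at which sigmoid(logit) equals prob_threshold."""
    return math.log(prob_threshold / (1.0 - prob_threshold))


def is_relevant(logit, prob_threshold=0.5):
    # Comparing in logit space avoids computing the sigmoid at all.
    return logit >= logit_threshold(prob_threshold)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


print(logit_threshold(0.5))                  # 0.0
print(is_relevant(0.3))                      # True
print(is_relevant(0.3, prob_threshold=0.9))  # False
```

A stricter threshold trades recall for precision, so the value is best chosen on a labeled validation set for the target marketplace.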

License

CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International).
