FM-Relevancy-v1.0

Overview

Binary classifier that fuses image and text feature embeddings to score listing relevancy. The fusion module is encoder-agnostic by design: it consumes pre-computed embeddings of fixed dimensions and produces a single relevancy logit.

Architecture

The model is composed of a multimodal fusion encoder followed by a small classification head.

Input projections — each modality is projected through a Linear → LayerNorm block into the model's hidden dimension.
Modality streams — image and text each have an independent stack of 3 Q-Former blocks operating on a set of 12 learnable query tokens per stream.
Fusion stream — a separate stack of 2 Q-Former blocks with its own 12 learnable queries, which cross-attends to the outputs of the two modality streams.
Pooling & projection — the fusion queries are reduced to a single vector via attention pooling, then projected through a small MLP head (LayerNorm → Linear(1024→512) → GELU → Linear(512→512)).
Classifier — Linear(512 → 1) producing the relevancy logit (trained with BCEWithLogitsLoss).
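
The pooling, projection, and classifier stages listed above can be sketched in PyTorch as follows. This is an illustration under assumptions, not the released implementation: the class names `AttentionPool` and `RelevancyHead` are hypothetical, the attention pooling is assumed to be a softmax over one learned score per query token, and the hidden size of 1024 is inferred from the `Linear(1024→512)` layer.

```python
import torch
import torch.nn as nn

HIDDEN = 1024  # inferred from Linear(1024→512) in the head


class AttentionPool(nn.Module):
    """Assumed pooling: softmax over one learned score per query token."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.score(x).softmax(dim=1)  # [B, Q, 1], sums to 1 over Q
        return (weights * x).sum(dim=1)         # [B, dim]


class RelevancyHead(nn.Module):
    """Hypothetical head: pooling, MLP projection, then a 1-logit classifier."""

    def __init__(self, dim: int = HIDDEN):
        super().__init__()
        self.pool = AttentionPool(dim)
        self.proj = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 512),
            nn.GELU(),
            nn.Linear(512, 512),
        )
        self.classifier = nn.Linear(512, 1)

    def forward(self, fusion_queries: torch.Tensor) -> torch.Tensor:
        pooled = self.pool(fusion_queries)                      # [B, dim]
        return self.classifier(self.proj(pooled)).squeeze(-1)   # [B] (squeeze is an assumption)


torch.manual_seed(0)
head = RelevancyHead()
fusion_queries = torch.randn(4, 12, HIDDEN)  # 12 fusion queries per example
logits = head(fusion_queries)

# Binary targets: 1.0 = relevant, 0.0 = not relevant.
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = nn.BCEWithLogitsLoss()(logits, labels)
```

`BCEWithLogitsLoss` applies the sigmoid internally, so the head returns raw logits during training and the sigmoid is only applied explicitly at inference time.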

Each Q-Former block contains pre-norm self-attention, cross-attention, and an MLP feed-forward (Linear(1024→4096) → GELU → Linear(4096→1024)), with LayerNorm on each sub-block.
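
A single block of this kind might look like the sketch below. It assumes `torch.nn.MultiheadAttention` with 16 heads and residual connections around each sub-block; neither the head count nor the class name `QFormerBlock` comes from the model card, and the released code may differ.

```python
import torch
import torch.nn as nn


class QFormerBlock(nn.Module):
    """Pre-norm block: self-attention, cross-attention, feed-forward."""

    def __init__(self, dim: int = 1024, num_heads: int = 16, ffn_dim: int = 4096):
        super().__init__()
        self.norm_self = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_cross = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim)
        )

    def forward(self, queries, context, context_mask=None):
        # context_mask uses True = valid; MultiheadAttention expects True = ignore.
        pad = None if context_mask is None else ~context_mask

        h = self.norm_self(queries)
        queries = queries + self.self_attn(h, h, h, need_weights=False)[0]

        h = self.norm_cross(queries)
        queries = queries + self.cross_attn(
            h, context, context, key_padding_mask=pad, need_weights=False
        )[0]

        return queries + self.ffn(self.norm_ffn(queries))


torch.manual_seed(0)
block = QFormerBlock().eval()

# 12 learnable query tokens, shared across the batch.
learned_queries = nn.Parameter(torch.randn(1, 12, 1024) * 0.02)

context = torch.randn(2, 2, 1024)                       # e.g. projected image features
context_mask = torch.tensor([[True, True], [True, False]])

with torch.no_grad():
    out = block(learned_queries.expand(2, -1, -1), context, context_mask)
```

The queries keep their shape `[B, 12, 1024]` through the block, so blocks can be stacked, and the fusion stream can use the concatenated outputs of the two modality streams as its `context`.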

Total trainable parameters: 139,422,210.

Inputs

field	shape	dtype	notes
`image_features`	`[B, num_images, 1024]`	float	one row per image
`text_features`	`[B, num_texts, 1024]`	float	one row per text field
`image_mask`	`[B, num_images]`	bool	optional, `True` = valid
`text_mask`	`[B, num_texts]`	bool	optional, `True` = valid

`num_images` should be at most 2 for best results. Any image or text encoder that produces 1024-dimensional embeddings is shape-compatible with this model. Encoders with a different output dimension additionally require replacing the input projection layers to match that dimension and fine-tuning them.
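
When listings have different numbers of images or text fields, the features need to be padded to a common length and paired with a mask. The helper below is a hypothetical sketch (`pad_and_mask` is not part of the released code) that assumes zero padding and the `True` = valid convention from the table above.

```python
import torch


def pad_and_mask(per_item, dim=1024):
    """Stack variable-length [n_i, dim] tensors into [B, max_n, dim] plus a bool mask."""
    max_n = max(x.shape[0] for x in per_item)
    feats = torch.zeros(len(per_item), max_n, dim)
    mask = torch.zeros(len(per_item), max_n, dtype=torch.bool)
    for i, x in enumerate(per_item):
        n = x.shape[0]
        feats[i, :n] = x      # real embeddings first
        mask[i, :n] = True    # True = valid, padded rows stay False
    return feats, mask


# Two listings: the first has 2 images, the second has 1.
image_features, image_mask = pad_and_mask(
    [torch.randn(2, 1024), torch.randn(1, 1024)]
)
```

Each example should keep at least one valid row per modality, since a fully masked row leaves attention with nothing to attend to.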

Example encoder pairing

The released checkpoint was trained with DINOv3 ViT-L/16 and BGE-M3 as one representative encoder pair; other encoders are compatible with appropriate fine-tuning.

Files

file	description
`model.safetensors`	weights (~558 MB, or ~532 MiB; 139.4M parameters, fp32)
`config.json`	model configuration
`configuration_fm_relevancy.py`	`FMRelevancyConfig`
`modeling_fm_relevancy.py`	`FMRelevancyModel`
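
The size of the weight file can be sanity-checked from the parameter count. The arithmetic below assumes 4 bytes per fp32 parameter and ignores the small safetensors header, so the real file is slightly larger.

```python
NUM_PARAMS = 139_422_210   # total trainable parameters
BYTES_PER_PARAM = 4        # fp32

size_bytes = NUM_PARAMS * BYTES_PER_PARAM
size_mb = size_bytes / 1_000_000   # decimal megabytes
size_mib = size_bytes / 2**20      # binary mebibytes

print(f"{size_bytes:,} bytes = {size_mb:.1f} MB = {size_mib:.1f} MiB")
```

The two units differ by about 5%, which is why the same file may be reported as roughly 530 or roughly 560 depending on the tool.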

Usage

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "marqvision-ai/FM-Relevancy-v1.0",
    trust_remote_code=True,
)
model.eval()

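# Pre-computed embeddings go here. The random tensors below are placeholders
# so the snippet runs end to end; replace them with real encoder outputs.
B, num_images, num_texts = 2, 2, 3
image_features = torch.randn(B, num_images, 1024)            # [B, num_images, 1024]
text_features = torch.randn(B, num_texts, 1024)              # [B, num_texts,  1024]
image_mask = torch.ones(B, num_images, dtype=torch.bool)     # [B, num_images]  (True = valid)
text_mask = torch.ones(B, num_texts, dtype=torch.bool)       # [B, num_texts]   (True = valid)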

with torch.no_grad():
    out = model(
        image_features=image_features,
        text_features=text_features,
        image_mask=image_mask,
        text_mask=text_mask,
    )
    probs = torch.sigmoid(out.logits)   # [B] in [0, 1]
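
To turn probabilities into binary relevant / not-relevant decisions, a threshold is needed. The sketch below assumes a threshold of 0.5, which the model card does not specify; in practice it would be tuned on validation data for the desired precision and recall.

```python
import torch

# Stand-in logits; in practice use out.logits from the model call above.
logits = torch.tensor([2.0, -1.0, 0.5])

THRESHOLD = 0.5  # assumed value, not taken from the model card

probs = torch.sigmoid(logits)
is_relevant = probs >= THRESHOLD   # bool tensor, one decision per listing
```

A probability threshold of 0.5 is equivalent to checking whether the raw logit is non-negative, so the sigmoid can be skipped when only the decision is needed.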

License

CC BY-NC 4.0 — Creative Commons Attribution-NonCommercial 4.0 International.
