Product Matching - all-MiniLM-L6-v2

This is a specialized Sentence Transformer model fine-tuned for Product Matching and E-commerce Similarity tasks. It is based on all-MiniLM-L6-v2 and has been optimized to handle complex product titles, specifications, and search queries.

Core Capabilities

  • Product Deduplication: Identify identical products across different listings with varying title formats.
  • Semantic Search: Match search queries to product titles even when keywords don't match exactly.
  • Spec Matching: Associate product titles with their corresponding technical specifications or descriptions.
  • Cross-Category Retrieval: Trained across 25 distinct consumer and industrial categories.
  • False Positive Removal: Specifically optimized to distinguish between similar but non-matching items (e.g., different product versions or models).

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/all-MiniLM-L6-v2
  • Maximum Sequence Length: 256 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: 10 Million Synthetic Product Pairs (including Title-Title, Query-Title, and Title-Specs)

Training Focus: Reducing False Positives

Unlike generic embedding models, this model is trained to be stricter. It is specifically designed to:

  • Distinguish between product versions (e.g., Sony XM4 vs XM5).
  • Reject same-category cross-brand matches (e.g., Nike vs Adidas).
  • Provide low similarity scores for unrelated products, ensuring higher precision in automated deduplication pipelines.

Training Domain

The model was fine-tuned on a diverse corpus covering:

  • Consumer Goods: Electronics, Office Supplies, Furniture, etc.
  • Industrial & B2B: Mechanical parts, Industrial equipment, Professional tools.

Model Sources

Full Model Architecture

SentenceTransformer(
 (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
 (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
 (2): Normalize()
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference:

from sentence_transformers import SentenceTransformer
import util # Optional for cosine_sim if not using model.similarity

# 1. Load the model
model = SentenceTransformer("surazbhandari/all-MiniLM-L6-v2-ProductMatching")

# 2. Define product pairs
product_a = "Apple iPhone 15 Pro Max 256GB Titanium"
product_b = "iPhone 15 Pro Max - Blue Titanium - 256 GB"
unrelated_product = "Logitech MX Master 3S Wireless Mouse"

# 3. Encode product titles
embeddings = model.encode([product_a, product_b, unrelated_product])

# 4. Calculate Cosine Similarity
similarity_score = model.similarity(embeddings[0], embeddings[1])
print(f"Similarity (Product A & B): {similarity_score.item():.4f}")
# Expected: ~0.60 (Strong match for varied titles)

diff_similarity = model.similarity(embeddings[0], embeddings[2])
print(f"Similarity (Unrelated): {diff_similarity.item():.4f}")
# Expected: < 0.10 (Clear distinction)

Benchmarking

To evaluate the model's effectiveness, we compared it against the original all-MiniLM-L6-v2 model across various product matching scenarios.

Comparison Results

Scenario Type Fine-Tuned (This Model) Base Model Decision
iPhone 15 Pro Max 256GB vs Var. Title Match 0.6088 0.9077 Base Higher
Logitech MX Master 3S vs Var. Title Match 0.8250 0.9374 Base Higher
Galaxy S23 Ultra vs Var. Title Match 0.8725 0.7825 ✅ FT Higher
Sony XM5 vs Sony XM4 Hard Negative 0.6404 0.7573 ✅ FT Lower (Better)
MacBook Pro 14 vs MacBook Pro 16 Hard Negative 0.9293 0.8965 Base Lower
Nike vs Adidas (Running Shoes) Similar Category 0.5643 0.7720 ✅ FT Lower (Better)
Stand Mixer vs Printer Random Negative -0.0814 0.0533 ✅ FT Lower (Better)

Summary Statistics

  • Avg Negative Similarity: Fine-Tuned: 0.5132 | Base: 0.6198
  • Avg Match Similarity: Fine-Tuned: 0.7687 | Base: 0.8759

Key Findings

While the base model often provides higher scores for direct matches, it is also much more "generous" with similar but different products. This fine-tuned model is significantly more effective at rejecting false positives, demonstrating lower similarity scores for hard negatives (different versions) and cross-brand comparisons. This makes it ideal for high-precision deduplication tasks.

Training Overview

The model was fine-tuned using a high-performance contrastive learning approach (MultipleNegativesRankingLoss).

Key Training Highlights:

  • Dataset: 10 Million product-centric pairs.
  • Optimization: Focused on maximizing the embedding distance between similar but non-identical products (Hard Negatives).
  • Efficiency: Leveraged the lightweight all-MiniLM-L6-v2 architecture, maintaining fast inference speeds while significantly improving domain-specific accuracy.
  • Convergence: Achievement of high precision within a single epoch of large-scale synthetic data training.

For full technical details on the base architecture, refer to the original all-MiniLM-L6-v2 model card.

Downloads last month
38
Safetensors
Model size
22.7M params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for surazbhandari/all-MiniLM-L6-v2-ProductMatching

Finetuned
(753)
this model