MATCHA — Matching Text via Contrastive Semantic Alignment

MATCHA is a learned text similarity metric that captures both semantic alignment and contradiction through contrastive training. It learns a dual-view semantic space in which semantically aligned texts are pulled closer while contradictory or irrelevant texts are pushed apart.

Paper: MATCHA: Matching Text via Contrastive Semantic Alignment
Code: GitHub

Model Details

  • Backbone: GPT-2 (word embeddings only, no transformer layers)
  • Architecture: Token-independent MLP processing with a learned transformation and mean pooling
  • Training objective: Triplet margin loss with cosine similarity
  • Training data: 15 diverse sources across NLI, factuality, captioning, summarization, and paraphrase tasks
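The bullets above can be illustrated with a small, dependency-free sketch. It is not the released model code: the two-layer ReLU MLP, the parameter layout, the margin of 0.5, and the function names (`token_mlp`, `encode`, `triplet_cosine_loss`) are all assumptions made for illustration. It shows only the general pattern: transform every token embedding independently, mean-pool the results, and train with a triplet margin loss on cosine distance.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def token_mlp(x, w1, b1, w2, b2):
    """Apply one shared two-layer MLP to a single token embedding.

    The layer count and the ReLU activation are illustrative assumptions.
    """
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


def encode(token_embeddings, params):
    """Transform each token independently, then mean-pool into one vector."""
    transformed = [token_mlp(x, *params) for x in token_embeddings]
    count = len(transformed)
    return [sum(column) / count for column in zip(*transformed)]


def triplet_cosine_loss(anchor, positive, negative, margin=0.5):
    """max(0, d(a, p) - d(a, n) + margin) with d = 1 - cosine similarity.

    The margin value is an assumption, not a documented hyperparameter.
    """
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

Because no token looks at its neighbours and the pooling is a mean, an encoder of this shape is invariant to token order, which is one consequence of dropping the transformer layers.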

Files

File               Description
max_diff.pth       Best checkpoint (selected by the maximum positive–negative similarity difference)
config.yaml        Training hyperparameters
model_config.json  Model architecture configuration
model.py           Model architecture code
matcha.py          Simple inference interface
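The selection rule for `max_diff.pth` could look roughly like the sketch below. The card states only that the checkpoint is chosen by the maximum positive–negative similarity difference; averaging that difference over validation triplets, the function names, and the checkpoint names in the usage note are assumptions.

```python
def mean_similarity_gap(pos_sims, neg_sims):
    """Average of (positive similarity - negative similarity) over triplets."""
    if not pos_sims or len(pos_sims) != len(neg_sims):
        raise ValueError("need equally sized, non-empty similarity lists")
    return sum(p - n for p, n in zip(pos_sims, neg_sims)) / len(pos_sims)


def select_checkpoint(validation_sims):
    """Pick the checkpoint whose mean similarity gap is largest.

    validation_sims maps a (hypothetical) checkpoint name to a pair
    (pos_sims, neg_sims) of anchor-positive and anchor-negative similarities.
    """
    gaps = {name: mean_similarity_gap(pos, neg)
            for name, (pos, neg) in validation_sims.items()}
    best = max(gaps, key=gaps.get)
    return best, gaps
```

For example, calling `select_checkpoint({"epoch_1.pth": ([0.6, 0.7], [0.5, 0.4]), "epoch_2.pth": ([0.8, 0.9], [0.2, 0.1])})` prefers `epoch_2.pth`, because it separates positives from negatives by a wider margin.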

Installation

pip install matcha-metric

Usage

from matcha_metric import MATCHA

model = MATCHA.from_pretrained("Siran-Li/MATCHA")

# Score a pair of texts
similarity = model.score("The vaccine was proven effective.", "Clinical trials confirmed the vaccine works.")
print(f"Similarity: {similarity:.4f}")

# Batch scoring
scores = model.score(
    ["The cat sat on the mat.", "It is raining outside."],
    ["A feline rested on the rug.", "The weather is sunny and clear."],
)

# Get embeddings directly
embeddings = model.encode(["Hello world", "Hi there"])
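Embeddings obtained this way can also be compared outside the model, for instance to rank several candidate texts against one query. The sketch below assumes the embeddings have been converted to plain lists of floats and that cosine similarity is the relevant comparison (the training objective suggests so, but the card does not document the return type of `encode` or the exact formula behind `score`). The helper names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_by_cosine(query, candidates):
    """Return (index, similarity) pairs, most similar candidate first."""
    q = l2_normalize(query)
    scored = []
    for idx, cand in enumerate(candidates):
        c = l2_normalize(cand)
        # The dot product of unit vectors equals their cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, c))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

Normalizing once and then taking dot products is a common choice when one query is compared against many candidates.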

Interpretability

Token-level attribution via Integrated Gradients:

# Get raw token attributions
result = model.interpret("The cat sat on the mat.", "A feline rested on the rug.")
for token, attr in zip(result["tokens"], result["attributions"]):
    print(f"{token:>15s}  {attr:+.4f}")

# Save interactive HTML heatmap
model.visualize(
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    label="Correct",
    output_path="attribution.html",
)

Aligned pair ("The cat sat on the mat." vs. "A feline rested on the rug."):

[Figure: attribution heatmap for the aligned pair]

Contradictory pair ("It is raining outside." vs. "The weather is sunny and clear."):

[Figure: attribution heatmap for the contradictory pair]
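For readers unfamiliar with Integrated Gradients, the following toy sketch shows the idea on a small scalar function rather than on MATCHA itself. It approximates the path integral of the gradient from a baseline to the input with a midpoint Riemann sum and uses finite differences in place of automatic differentiation. The all-zero baseline, the step count, and the function names are assumptions; how `model.interpret` chooses its baseline and step count is not documented here.

```python
def numerical_gradient(f, x, eps=1e-5):
    """Central-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=50):
    """Attribution_i = (x_i - baseline_i) * mean gradient_i along the path."""
    diff = [xi - bi for xi, bi in zip(x, baseline)]
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [bi + alpha * di for bi, di in zip(baseline, diff)]
        grad = numerical_gradient(f, point)
        total = [t + g for t, g in zip(total, grad)]
    return [di * t / steps for di, t in zip(diff, total)]


def toy_score(v):
    """Stand-in for a model output: v0 * v1 + v2 squared."""
    return v[0] * v[1] + v[2] ** 2


attributions = integrated_gradients(toy_score, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
```

A useful sanity check is the completeness property: the attributions should sum approximately to the difference between the output at the input and the output at the baseline.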

Evaluation Benchmarks

MATCHA is evaluated on seven benchmarks: SNLI, MultiNLI, MedNLI, TruthfulQA, COCO-Caption, NEWTS, and Climate-FEVER.

Citation

@article{li2026matcha,
  title={MATCHA: Matching Text via Contrastive Semantic Alignment},
  author={Li, Siran and Etoglu, Ece Sena and Eickhoff, Carsten and Bahrainian, Seyed Ali},
  journal={arXiv preprint arXiv:2605.27345},
  year={2026}
}