---
license: cc-by-nc-sa-4.0
language:
- en
tags:
- text-similarity
- contrastive-learning
- semantic-alignment
- text-evaluation
library_name: pytorch
pipeline_tag: sentence-similarity
---
# MATCHA — Matching Text via Contrastive Semantic Alignment
MATCHA is a learned text similarity metric that captures both semantic alignment and contradiction through contrastive training. It learns a dual-view semantic space in which semantically aligned texts are pulled closer while contradictory or irrelevant texts are pushed apart.
**Paper:** [MATCHA: Matching Text via Contrastive Semantic Alignment](https://arxiv.org/abs/2605.27345)
**Code:** [GitHub](https://github.com/Siran-Li/MATCHA)
## Model Details
- **Backbone:** GPT-2 (word embeddings only, no transformer layers)
- **Architecture:** Token-independent MLP processing with a learned transformation and mean pooling
- **Training objective:** Triplet margin loss with cosine similarity
- **Training data:** 15 diverse sources across NLI, factuality, captioning, summarization, and paraphrase tasks
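To make the training objective concrete, here is a minimal sketch of mean pooling followed by a triplet margin loss over cosine similarity. It is written in plain Python so it runs without dependencies, whereas the released model is implemented in PyTorch. The helper names, the `margin=0.5` value, and the toy 2-d vectors are illustrative assumptions; the actual formulation and hyperparameters live in `model.py` and `config.yaml`.

```python
import math


def mean_pool(token_vectors):
    """Average a list of per-token vectors into a single text embedding."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss that is zero once the positive beats the negative by `margin`.

    margin=0.5 is an illustrative value, not the released hyperparameter.
    """
    pos_sim = cosine_similarity(anchor, positive)
    neg_sim = cosine_similarity(anchor, negative)
    return max(0.0, margin - pos_sim + neg_sim)


# Toy 2-d token vectors (hypothetical values, not real GPT-2 embeddings)
anchor = mean_pool([[1.0, 0.0], [1.0, 0.2]])
positive = mean_pool([[0.9, 0.1], [1.0, 0.0]])
negative = mean_pool([[-1.0, 0.1], [-0.8, 0.0]])

# Well-separated triplet: the loss is already zero
loss = triplet_margin_loss(anchor, positive, negative)

# Swapping positive and negative violates the margin, so the loss is positive
violated_loss = triplet_margin_loss(anchor, negative, positive)
```

Under this formulation the loss only pushes on triplets where the negative is not yet at least `margin` less similar to the anchor than the positive.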
## Files
| File | Description |
|------|-------------|
| `max_diff.pth` | Best checkpoint (selected by max pos–neg similarity difference) |
| `config.yaml` | Training hyperparameters |
| `model_config.json` | Model architecture configuration |
| `model.py` | Model architecture code |
| `matcha.py` | Simple inference interface |
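The table describes `max_diff.pth` as the checkpoint with the largest gap between positive and negative similarity. One way such a selection rule could look is sketched below, assuming the gap is the mean positive-pair similarity minus the mean negative-pair similarity on a validation set. The record layout, the checkpoint names, and the numbers are hypothetical and are not taken from the training code.

```python
def similarity_gap(record):
    """Mean positive-pair similarity minus mean negative-pair similarity."""
    pos = record["pos_sims"]
    neg = record["neg_sims"]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


def select_best_checkpoint(eval_records):
    """Return the name of the checkpoint with the largest similarity gap."""
    return max(eval_records, key=similarity_gap)["name"]


# Hypothetical validation results for three checkpoints
eval_records = [
    {"name": "epoch_1.pth", "pos_sims": [0.70, 0.80], "neg_sims": [0.50, 0.60]},
    {"name": "epoch_2.pth", "pos_sims": [0.85, 0.90], "neg_sims": [0.10, 0.20]},
    {"name": "epoch_3.pth", "pos_sims": [0.95, 0.90], "neg_sims": [0.60, 0.70]},
]

best = select_best_checkpoint(eval_records)
print(best)
```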
## Installation
```bash
pip install matcha-metric
```
## Usage
```python
from matcha_metric import MATCHA

model = MATCHA.from_pretrained("Siran-Li/MATCHA")

# Score a pair of texts
similarity = model.score(
    "The vaccine was proven effective.",
    "Clinical trials confirmed the vaccine works.",
)
print(f"Similarity: {similarity:.4f}")

# Batch scoring
scores = model.score(
    ["The cat sat on the mat.", "It is raining outside."],
    ["A feline rested on the rug.", "The weather is sunny and clear."],
)

# Get embeddings directly
embeddings = model.encode(["Hello world", "Hi there"])
```
### Interpretability
Token-level attribution via Integrated Gradients:
```python
# Get raw token attributions
result = model.interpret("The cat sat on the mat.", "A feline rested on the rug.")
for token, attr in zip(result["tokens"], result["attributions"]):
    print(f"{token:>15s} {attr:+.4f}")

# Save interactive HTML heatmap
model.visualize(
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    label="Correct",
    output_path="attribution.html",
)
```
**Aligned pair** ("The cat sat on the mat." vs. "A feline rested on the rug."):
![Attribution example - aligned](attrbution_ex1.jpeg)
**Contradictory pair** ("It is raining outside." vs. "The weather is sunny and clear."):
![Attribution example - contradictory](attrbution_ex2.jpeg)
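For readers unfamiliar with Integrated Gradients, the sketch below shows the general technique on a toy function: gradients are averaged along a straight path from a baseline to the input and then scaled by the input-minus-baseline difference. This illustrates the method only; how `model.interpret` computes its attributions is not shown here. The toy score function, its weights, the all-zero baseline, and the midpoint Riemann sum are assumptions chosen for clarity.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    avg_grads = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th path segment
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grads[i] += g / steps
    # Scale the averaged gradient by the displacement from the baseline
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grads)]


# Toy differentiable score: F(x) = sum_i w_i * x_i^2 (hypothetical weights)
WEIGHTS = [0.5, -1.0, 2.0]


def toy_score(x):
    return sum(w * xi * xi for w, xi in zip(WEIGHTS, x))


def toy_grad(x):
    """Analytic gradient of toy_score."""
    return [2.0 * w * xi for w, xi in zip(WEIGHTS, x)]


x = [1.0, 2.0, -1.0]
baseline = [0.0, 0.0, 0.0]
attrs = integrated_gradients(toy_grad, x, baseline)

# Completeness check: attributions sum to F(x) - F(baseline)
total = sum(attrs)
expected = toy_score(x) - toy_score(baseline)
```

The completeness property checked at the end is what makes the per-token values interpretable as shares of the score difference relative to the baseline.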
## Evaluation Benchmarks
MATCHA is evaluated on seven benchmarks: SNLI, MultiNLI, MedNLI, TruthfulQA, COCO-Caption, NEWTS, and Climate-FEVER.
## Citation
```bibtex
@article{li2026matcha,
title={MATCHA: Matching Text via Contrastive Semantic Alignment},
author={Li, Siran and Etoglu, Ece Sena and Eickhoff, Carsten and Bahrainian, Seyed Ali},
journal={arXiv preprint arXiv:2605.27345},
year={2026}
}
```