	---
	license: cc-by-nc-sa-4.0
	language:
	- en
	tags:
	- text-similarity
	- contrastive-learning
	- semantic-alignment
	- text-evaluation
	library_name: pytorch
	pipeline_tag: sentence-similarity
	---

	# MATCHA — Matching Text via Contrastive Semantic Alignment

	MATCHA is a learned text similarity metric that captures both semantic alignment and contradiction through contrastive training. It learns a dual-view semantic space in which semantically aligned texts are pulled closer while contradictory or irrelevant texts are pushed apart.

	Paper: [MATCHA: Matching Text via Contrastive Semantic Alignment](https://arxiv.org/abs/2605.27345)
	Code: [GitHub](https://github.com/Siran-Li/MATCHA)

	## Model Details

	- Backbone: GPT-2 (word embeddings only, no transformer layers)
	- Architecture: Token-independent MLP processing with a learned transformation and mean pooling
	- Training objective: Triplet margin loss with cosine similarity
	- Training data: 15 diverse sources across NLI, factuality, captioning, summarization, and paraphrase tasks
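
The points above can be illustrated with a small, framework-free sketch: each token vector is transformed on its own, the results are mean-pooled into one text vector, and a triplet margin loss compares cosine distances. This is not the released `model.py`. The single `tanh` layer, the identity weights, the margin of 0.5, the use of `1 - cosine similarity` as the distance, and all function names are assumptions made for illustration.

```python
import math

def token_mlp(vec, weight, bias):
    """Apply one linear layer + tanh to a single token vector (hypothetical layer)."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, vec)) + b)
        for row, b in zip(weight, bias)
    ]

def encode(token_vectors, weight, bias):
    """Transform each token independently, then mean-pool into one text vector."""
    transformed = [token_mlp(vec, weight, bias) for vec in token_vectors]
    dim = len(transformed[0])
    return [sum(vec[d] for vec in transformed) / len(transformed) for d in range(dim)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss on cosine distance; the margin value is an assumption."""
    d_pos = 1.0 - cosine_similarity(anchor, positive)
    d_neg = 1.0 - cosine_similarity(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Identity weights and made-up 2-d "word embeddings", for illustration only.
WEIGHT = [[1.0, 0.0], [0.0, 1.0]]
BIAS = [0.0, 0.0]

anchor = encode([[1.0, 0.0], [1.0, 0.2]], WEIGHT, BIAS)
positive = encode([[0.9, 0.1], [1.0, 0.0]], WEIGHT, BIAS)
negative = encode([[-1.0, 0.0], [-0.8, 0.1]], WEIGHT, BIAS)

# Well-separated triplet: the loss is already zero.
loss = triplet_margin_loss(anchor, positive, negative)
# Swapping the roles of positive and negative yields a large penalty.
swapped = triplet_margin_loss(anchor, negative, positive)
```

Because `token_mlp` never looks at neighbouring tokens, word order only affects the result through which tokens are present, which is the sense in which the processing is token-independent in this sketch.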

	## Files

	| File | Description |
	|------|-------------|
	| `max_diff.pth` | Best checkpoint (selected by max pos–neg similarity difference) |
	| `config.yaml` | Training hyperparameters |
	| `model_config.json` | Model architecture configuration |
	| `model.py` | Model architecture code |
	| `matcha.py` | Simple inference interface |
	## Installation

	```bash
	pip install matcha-metric
	```

	## Usage

	```python
	from matcha_metric import MATCHA

	model = MATCHA.from_pretrained("Siran-Li/MATCHA")

	# Score a pair of texts
	similarity = model.score("The vaccine was proven effective.", "Clinical trials confirmed the vaccine works.")
	print(f"Similarity: {similarity:.4f}")

	# Batch scoring
	scores = model.score(
	    ["The cat sat on the mat.", "It is raining outside."],
	    ["A feline rested on the rug.", "The weather is sunny and clear."],
	)

	# Get embeddings directly
	embeddings = model.encode(["Hello world", "Hi there"])
	```

	### Interpretability

	Token-level attribution via Integrated Gradients:

	```python
	# Get raw token attributions
	result = model.interpret("The cat sat on the mat.", "A feline rested on the rug.")
	for token, attr in zip(result["tokens"], result["attributions"]):
	    print(f"{token:>15s} {attr:+.4f}")

	# Save interactive HTML heatmap
	model.visualize(
	    "The cat sat on the mat.",
	    "A feline rested on the rug.",
	    label="Correct",
	    output_path="attribution.html",
	)
	```

	Aligned pair ("The cat sat on the mat." vs. "A feline rested on the rug."):

	![Attribution example - aligned](attrbution_ex1.jpeg)

	Contradictory pair ("It is raining outside." vs. "The weather is sunny and clear."):

	![Attribution example - contradictory](attrbution_ex2.jpeg)
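
For readers unfamiliar with Integrated Gradients, the following toy sketch shows the general technique on a hand-written scoring function. It is not the code behind `model.interpret`. The function names, the all-zero baseline, the quadratic toy score, and the 50-step midpoint Riemann sum are assumptions chosen so the example runs without any dependencies.

```python
def integrated_gradients(grad_fn, inputs, baseline, steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    Gradients are averaged along the straight path from `baseline` to
    `inputs`, then scaled by (input - baseline) per feature.
    """
    total = [0.0] * len(inputs)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (x - b) for x, b in zip(inputs, baseline)]
        grads = grad_fn(point)
        total = [t + g for t, g in zip(total, grads)]
    return [(x - b) * t / steps for x, b, t in zip(inputs, baseline, total)]

# Toy differentiable score with a known gradient (hypothetical, for illustration).
WEIGHTS = [1.0, -2.0, 0.5]

def toy_score(x):
    return sum(w * v * v for w, v in zip(WEIGHTS, x))

def toy_grad(x):
    return [2.0 * w * v for w, v in zip(WEIGHTS, x)]

inputs = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_grad, inputs, baseline)

# Completeness: attributions sum to score(inputs) - score(baseline).
gap = abs(sum(attributions) - (toy_score(inputs) - toy_score(baseline)))
```

The sign of each attribution indicates whether a feature raised or lowered the score relative to the baseline, which is how the positive and negative values printed by the loop above can be read.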

	## Evaluation Benchmarks

	Evaluated on 7 benchmarks: SNLI, MultiNLI, MedNLI, TruthfulQA, COCO-Caption, NEWTS, and Climate-FEVER.

	## Citation

	```bibtex
	@article{li2026matcha,
	  title={MATCHA: Matching Text via Contrastive Semantic Alignment},
	  author={Li, Siran and Etoglu, Ece Sena and Eickhoff, Carsten and Bahrainian, Seyed Ali},
	  journal={arXiv preprint arXiv:2605.27345},
	  year={2026}
	}
	```