---
license: cc-by-nc-sa-4.0
language:
- en
tags:
- text-similarity
- contrastive-learning
- semantic-alignment
- text-evaluation
library_name: pytorch
pipeline_tag: sentence-similarity
---

# MATCHA: Matching Text via Contrastive Semantic Alignment

MATCHA is a learned text similarity metric that captures both semantic alignment and contradiction through contrastive training. It learns a dual-view semantic space in which semantically aligned texts are pulled closer while contradictory or irrelevant texts are pushed apart.

**Paper:** [MATCHA: Matching Text via Contrastive Semantic Alignment](https://arxiv.org/abs/2605.27345)

**Code:** [GitHub](https://github.com/Siran-Li/MATCHA)

## Model Details

- **Backbone:** GPT-2 (word embeddings only, no transformer layers)
- **Architecture:** Token-independent MLP processing with a learned transformation and mean pooling
- **Training objective:** Triplet margin loss with cosine similarity
- **Training data:** 15 diverse sources across NLI, factuality, captioning, summarization, and paraphrase tasks

## Files

| File | Description |
|------|-------------|
| `max_diff.pth` | Best checkpoint (selected by max pos–neg similarity difference) |
| `config.yaml` | Training hyperparameters |
| `model_config.json` | Model architecture configuration |
| `model.py` | Model architecture code |
| `matcha.py` | Simple inference interface |

## Installation

```bash
pip install matcha-metric
```

## Usage

```python
from matcha_metric import MATCHA

model = MATCHA.from_pretrained("Siran-Li/MATCHA")

# Score a pair of texts
similarity = model.score("The vaccine was proven effective.", "Clinical trials confirmed the vaccine works.")
print(f"Similarity: {similarity:.4f}")

# Batch scoring
scores = model.score(
    ["The cat sat on the mat.", "It is raining outside."],
    ["A feline rested on the rug.", "The weather is sunny and clear."],
)

# Get embeddings directly
embeddings = model.encode(["Hello world", "Hi there"])
```

### Interpretability

Token-level attribution via Integrated Gradients:

```python
# Get raw token attributions
result = model.interpret("The cat sat on the mat.", "A feline rested on the rug.")
for token, attr in zip(result["tokens"], result["attributions"]):
    print(f"{token:>15s} {attr:+.4f}")

# Save interactive HTML heatmap
model.visualize(
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    label="Correct",
    output_path="attribution.html",
)
```

**Aligned pair** ("The cat sat on the mat." vs. "A feline rested on the rug."):

![Attribution example - aligned](attrbution_ex1.jpeg)

**Contradictory pair** ("It is raining outside." vs. "The weather is sunny and clear."):

![Attribution example - contradictory](attrbution_ex2.jpeg)

## Evaluation Benchmarks

Evaluated on 7 benchmarks: SNLI, MultiNLI, MedNLI, TruthfulQA, COCO-Caption, NEWTS, and Climate-FEVER.

## Citation

```bibtex
@article{li2026matcha,
  title={MATCHA: Matching Text via Contrastive Semantic Alignment},
  author={Li, Siran and Etoglu, Ece Sena and Eickhoff, Carsten and Bahrainian, Seyed Ali},
  journal={arXiv preprint arXiv:2605.27345},
  year={2026}
}
```
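
## Appendix: Objective Sketch

Model Details lists mean pooling and a triplet margin loss with cosine similarity. The snippet below is a minimal, dependency-free sketch of how those pieces could fit together; it is not the code in `model.py`. The function names, the toy 2-D vectors, the margin of `0.5`, and the use of `1 - cosine similarity` as the distance are all assumptions made for illustration.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into a single text vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss on cosine distance (assumed here to be 1 - cosine similarity).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is. The margin value is hypothetical.
    """
    d_pos = 1.0 - cosine_similarity(anchor, positive)
    d_neg = 1.0 - cosine_similarity(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy per-token vectors standing in for transformed token embeddings.
anchor = mean_pool([[1.0, 0.0], [1.0, 0.2]])
positive = mean_pool([[0.9, 0.1], [1.0, 0.0]])
negative = mean_pool([[0.0, 1.0], [0.1, 0.9]])

loss = triplet_margin_loss(anchor, positive, negative)
print(f"triplet loss: {loss:.4f}")
```

In this toy case the positive is nearly parallel to the anchor and the negative nearly orthogonal, so the margin is already satisfied and the loss is zero. Swapping `positive` and `negative` gives a non-zero loss, which is the signal that would pull aligned texts together and push contradictory ones apart during training.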