| --- |
| license: cc-by-nc-sa-4.0 |
| language: |
| - en |
| tags: |
| - text-similarity |
| - contrastive-learning |
| - semantic-alignment |
| - text-evaluation |
| library_name: pytorch |
| pipeline_tag: sentence-similarity |
| --- |
| |
| # MATCHA — Matching Text via Contrastive Semantic Alignment |
|
|
| MATCHA is a learned text similarity metric that captures both semantic alignment and contradiction through contrastive training. It learns a dual-view semantic space in which semantically aligned texts are pulled closer while contradictory or irrelevant texts are pushed apart. |
|
|
| **Paper:** [MATCHA: Matching Text via Contrastive Semantic Alignment](https://arxiv.org/abs/2605.27345) |
| **Code:** [GitHub](https://github.com/Siran-Li/MATCHA) |
|
|
| ## Model Details |
|
|
| - **Backbone:** GPT-2 (word embeddings only, no transformer layers) |
| - **Architecture:** Token-independent MLP processing with a learned transformation and mean pooling |
| - **Training objective:** Triplet margin loss with cosine similarity |
| - **Training data:** 15 diverse sources across NLI, factuality, captioning, summarization, and paraphrase tasks |
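
The training objective listed above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the repository's training code: it assumes the common cosine form of the triplet margin loss, `max(0, margin - cos(a, p) + cos(a, n))`, and the names `mean_pool`, `cosine`, `triplet_cosine_loss`, and the margin value `0.5` are hypothetical.

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size text embedding."""
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def triplet_cosine_loss(anchor, positive, negative, margin=0.5):
    """Zero once the positive is at least `margin` more similar than the negative."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

# Toy 2-d "token" vectors standing in for transformed token embeddings (assumed values).
anchor = mean_pool([[1.0, 0.0], [1.0, 0.0]])
positive = mean_pool([[2.0, 0.0], [4.0, 0.0]])  # same direction as the anchor
negative = mean_pool([[0.0, 1.0], [0.0, 3.0]])  # orthogonal to the anchor

loss_good = triplet_cosine_loss(anchor, positive, negative)  # already well separated
loss_bad = triplet_cosine_loss(anchor, negative, positive)   # roles swapped
```

With the roles swapped the loss is positive, which is the signal that would push the contradictory text away from the anchor during training.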
|
|
| ## Files |
|
|
| | File | Description | |
| |------|-------------| |
| | `max_diff.pth` | Best checkpoint (selected by the largest gap between positive-pair and negative-pair similarity) | |
| | `config.yaml` | Training hyperparameters | |
| | `model_config.json` | Model architecture configuration | |
| | `model.py` | Model architecture code | |
| | `matcha.py` | Simple inference interface | |
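
The selection rule for `max_diff.pth` can be illustrated as follows. This is a sketch under assumptions, not the repository's training loop: it assumes the criterion is the mean positive-pair similarity minus the mean negative-pair similarity on some held-out set, and the function names, checkpoint file names, and similarity values are all hypothetical.

```python
def pos_neg_gap(pos_sims, neg_sims):
    """Mean similarity of positive pairs minus mean similarity of negative pairs."""
    return sum(pos_sims) / len(pos_sims) - sum(neg_sims) / len(neg_sims)

def select_best(checkpoints):
    """Return the name of the checkpoint with the largest positive-negative gap."""
    return max(checkpoints, key=lambda name: pos_neg_gap(*checkpoints[name]))

# Made-up held-out similarities: (positive-pair scores, negative-pair scores).
checkpoints = {
    "epoch_1.pth": ([0.60, 0.70], [0.50, 0.40]),
    "epoch_2.pth": ([0.80, 0.90], [0.20, 0.10]),
    "epoch_3.pth": ([0.85, 0.95], [0.60, 0.50]),
}
best = select_best(checkpoints)
```

Note that the third checkpoint has the highest positive-pair similarity but not the largest gap, so a gap-based rule would not pick it.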
|
|
| ## Installation |
|
|
| ```bash |
| pip install matcha-metric |
| ``` |
|
|
| ## Usage |
|
|
| ```python |
| from matcha_metric import MATCHA |
| |
| model = MATCHA.from_pretrained("Siran-Li/MATCHA") |
| |
| # Score a pair of texts |
| similarity = model.score("The vaccine was proven effective.", "Clinical trials confirmed the vaccine works.") |
| print(f"Similarity: {similarity:.4f}") |
| |
| # Batch scoring |
| scores = model.score( |
|     ["The cat sat on the mat.", "It is raining outside."], |
|     ["A feline rested on the rug.", "The weather is sunny and clear."], |
| ) |
| |
| # Get embeddings directly |
| embeddings = model.encode(["Hello world", "Hi there"]) |
| ``` |
|
|
| ### Interpretability |
|
|
| Token-level attribution via Integrated Gradients: |
|
|
| ```python |
| # Get raw token attributions |
| result = model.interpret("The cat sat on the mat.", "A feline rested on the rug.") |
| for token, attr in zip(result["tokens"], result["attributions"]): |
|     print(f"{token:>15s} {attr:+.4f}") |
| |
| # Save interactive HTML heatmap |
| model.visualize( |
|     "The cat sat on the mat.", |
|     "A feline rested on the rug.", |
|     label="Correct", |
|     output_path="attribution.html", |
| ) |
| ``` |
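
To summarize an attribution result, the tokens can be ranked by absolute attribution. The helper below is a hypothetical convenience, not part of `matcha_metric`; it assumes only the `tokens` and `attributions` keys used above, and the sample values are invented so that the snippet runs without the model.

```python
def top_attributions(result, k=3):
    """Return the k (token, attribution) pairs with the largest absolute attribution."""
    pairs = list(zip(result["tokens"], result["attributions"]))
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)[:k]

# Invented stand-in for the dict returned by model.interpret(...).
example = {
    "tokens": ["The", "cat", "sat", "on", "the", "mat"],
    "attributions": [0.01, 0.42, -0.30, 0.02, -0.01, 0.25],
}
for token, attr in top_attributions(example):
    print(f"{token:>15s} {attr:+.4f}")
```

Sorting by absolute value keeps strongly negative tokens in view, which matters when inspecting contradictory pairs.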
|
|
| **Aligned pair** ("The cat sat on the mat." vs. "A feline rested on the rug."): |
|
|
|  |
|
|
| **Contradictory pair** ("It is raining outside." vs. "The weather is sunny and clear."): |
|
|
|  |
|
|
| ## Evaluation Benchmarks |
|
|
| MATCHA is evaluated on seven benchmarks: SNLI, MultiNLI, MedNLI, TruthfulQA, COCO-Caption, NEWTS, and Climate-FEVER. |
|
|
| ## Citation |
|
|
| ```bibtex |
| @article{li2026matcha, |
| title={MATCHA: Matching Text via Contrastive Semantic Alignment}, |
| author={Li, Siran and Etoglu, Ece Sena and Eickhoff, Carsten and Bahrainian, Seyed Ali}, |
| journal={arXiv preprint arXiv:2605.27345}, |
| year={2026} |
| } |
| ``` |
|
|