---
title: ParaPLUIE
emoji: ☂️
tags:
  - evaluate
  - metric
description: >-
  ParaPLUIE is a metric for evaluating the semantic proximity of two sentences.
  ParaPLUIE uses the perplexity of an LLM to compute a confidence score. It has
  shown the highest correlation with human judgment on paraphrase
  classification while keeping the computational cost low, roughly equal to the
  cost of generating a single token.
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
---

# Metric Card for ParaPLUIE (Paraphrase Generation Evaluation Powered by an LLM)

W.I.P

## Metric Description

ParaPLUIE is a metric for evaluating the semantic proximity of two sentences. ParaPLUIE uses the perplexity of an LLM to compute a confidence score. It has shown the highest correlation with human judgment on paraphrase classification while keeping the computational cost low, roughly equal to the cost of generating a single token.
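
The card does not spell out the scoring procedure, but the idea of deriving a confidence score from a single next-token distribution can be sketched as follows. This is a minimal conceptual illustration, not the official ParaPLUIE implementation: the prompt wording, the "Yes"/"No" log-probability comparison, the `paraphrase_confidence` helper, and the placeholder `gpt2` model are all assumptions made for the example.

```python
# Conceptual sketch only (NOT the official ParaPLUIE code): score a sentence pair
# from the LLM's next-token distribution over "Yes"/"No" answers, which costs
# roughly as much as generating a single token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; the metric's actual LLM is not specified here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def paraphrase_confidence(source: str, hypothesis: str) -> float:
    # Illustrative prompt; the real metric's prompt may differ.
    prompt = (
        f'Sentence A: "{source}"\n'
        f'Sentence B: "{hypothesis}"\n'
        "Is Sentence B a paraphrase of Sentence A? Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # distribution over the next token
    log_probs = torch.log_softmax(next_token_logits, dim=-1)
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]
    # Positive score: the model finds "Yes" more plausible than "No".
    return (log_probs[yes_id] - log_probs[no_id]).item()

print(paraphrase_confidence("The cat sat on the mat.", "A cat was sitting on the mat."))
```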

## How to Use

This metric requires a source sentence and its hypothetical paraphrase.

```python
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1], predictions=[0, 1])
>>> print(results)
{'accuracy': 1.0}
```
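
The snippet above keeps the generic `evaluate` interface with the built-in accuracy metric from the card template. As a hedged sketch, the metric itself would be loaded from its Hub path and given the source sentences and candidate paraphrases; the repository id `qlemesle/parapluie` and the string-valued `references`/`predictions` arguments below are assumptions, not confirmed by this W.I.P. card.

```python
>>> import evaluate
>>> parapluie = evaluate.load("qlemesle/parapluie")  # assumed Space path
>>> results = parapluie.compute(
...     references=["The cat sat on the mat."],         # source sentences (assumed argument name)
...     predictions=["A cat was sitting on the mat."],  # candidate paraphrases (assumed argument name)
... )
>>> print(results)
```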

## Inputs

- **predictions** (list of int): Predicted labels.
- **references** (list of int): Ground truth labels.
- **normalize** (boolean): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
- **sample_weight** (list of float): Sample weights. Defaults to None.

## Output Values

- **accuracy** (float or int): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0 if normalize is set to True, or the number of examples in the input if normalize is set to False. A higher score means higher accuracy.

Output example:

```python
{'accuracy': 1.0}
```

This metric outputs a dictionary containing the accuracy score.

## Values from Papers

ParaPLUIE has been compared to other state-of-the-art metrics in Lemesle et al. (2025) (see Citation below), where it showed a high correlation with human judgment while being less compute-intensive than LLM-as-a-judge methods.

## Examples

Example 1: A simple example

```python
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
>>> print(results)
{'accuracy': 0.5}
```

Example 2: The same as Example 1, except with normalize set to False.

```python
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], normalize=False)
>>> print(results)
{'accuracy': 3.0}
```

Example 3: The same as Example 1, except with sample_weight set.

```python
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], sample_weight=[0.5, 2, 0.7, 0.5, 9, 0.4])
>>> print(results)
{'accuracy': 0.8778625954198473}
```
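
In this weighted case, the correct predictions are at positions 0, 1 and 4 and carry weights 0.5 + 2 + 9 = 11.5; dividing by the total weight 0.5 + 2 + 0.7 + 0.5 + 9 + 0.4 = 13.1 gives 11.5 / 13.1 ≈ 0.8779.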

## Limitations and Bias

This metric relies on an underlying LLM and is therefore limited by the capabilities and biases of the LLM used.

## Citation

```bibtex
@inproceedings{lemesle-etal-2025-paraphrase,
    title = "Paraphrase Generation Evaluation Powered by an {LLM}: A Semantic Metric, Not a Lexical One",
    author = "Lemesle, Quentin  and
      Chevelu, Jonathan  and
      Martin, Philippe  and
      Lolive, Damien  and
      Delhay, Arnaud  and
      Barbot, Nelly",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    year = "2025",
    url = "https://aclanthology.org/2025.coling-main.538/"
}
```