---
title: ParaPLUIE
emoji: ☂️
tags:
  - evaluate
  - metric
description: >-
  ParaPLUIE is a metric for evaluating the semantic proximity of two sentences.
  ParaPLUIE uses the perplexity of an LLM to compute a confidence score. It has
  shown the highest correlation with human judgement on paraphrase
  classification while keeping the computational cost low, roughly equal to the
  cost of generating one token.
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
short_description: ParaPLUIE is a metric for evaluating the semantic proximity
---

# Metric Card for ParaPLUIE (Paraphrase Generation Evaluation Powered by an LLM)

## Metric Description

ParaPLUIE is a metric for evaluating the semantic proximity of two sentences. ParaPLUIE uses the perplexity of an LLM to compute a confidence score. It has shown the highest correlation with human judgement on paraphrase classification while keeping the computational cost low, roughly equal to the cost of generating one token.
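For intuition, here is a minimal sketch of this idea, assuming the score is the log-probability gap between a Yes and a No answer to a paraphrase question asked of the LLM (the prompt wording below is illustrative; the metric's actual templates are listed under Examples). A single forward pass suffices, which is why the cost is roughly that of generating one token:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch only: the real metric uses its own prompt templates
# and handles chat templates and trailing special tokens itself.
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def paraphrase_score(source: str, hypothesis: str) -> float:
    # Hypothetical prompt wording; the metric's templates differ.
    prompt = (f'Sentence 1: "{source}"\nSentence 2: "{hypothesis}"\n'
              "Are these two sentences paraphrases? Answer Yes or No: ")
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # next-token distribution
    log_probs = torch.log_softmax(logits, dim=-1)
    yes_id = tok.encode("Yes", add_special_tokens=False)[0]
    no_id = tok.encode("No", add_special_tokens=False)[0]
    # Positive when the model finds "Yes" more likely than "No".
    return (log_probs[yes_id] - log_probs[no_id]).item()
```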

## How to Use

This metric requires a source sentence and its hypothetical paraphrase.

```python
import evaluate
ppluie = evaluate.load("qlemesle/parapluie")
ppluie.init(model="mistralai/Mistral-7B-Instruct-v0.2")
S = "Have you ever seen a tsunami ?"
H = "Have you ever seen a tiramisu ?"
results = ppluie.compute(sources=[S], hypotheses=[H])
print(results)
>>> {'scores': [-16.97607421875]}
```
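Since the score is well below 0, the two sentences are judged not to be paraphrases (see Output Values below).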

### Inputs

- **sources** (list of string): Source sentences.
- **hypotheses** (list of string): Hypothetical paraphrases.

### Output Values

- **scores** (list of float): ParaPLUIE scores, one per sentence pair. Minimum possible value is -inf. Maximum possible value is +inf. A score greater than 0 means the sentences are paraphrases; a score lower than 0 means the opposite.

This metric outputs a dictionary containing the list of scores.
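Since `compute` accepts lists, several pairs can be scored in one call, and the sign of each score can serve as a binary paraphrase decision. A short sketch, reusing the `ppluie` object from above (the second pair is invented for illustration):

```python
sources = ["Have you ever seen a tsunami ?",
           "The meeting was postponed until Friday."]
hypotheses = ["Have you ever seen a tiramisu ?",
              "The meeting has been pushed back to Friday."]
results = ppluie.compute(sources=sources, hypotheses=hypotheses)

# Scores above 0 are judged paraphrases, scores below 0 are not.
for src, hyp, score in zip(sources, hypotheses, results["scores"]):
    print(f"{score:+.2f}  paraphrase={score > 0}  {src!r} / {hyp!r}")
```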

## Examples

### Simple example

```python
import evaluate
ppluie = evaluate.load("qlemesle/parapluie")
ppluie.init(model="mistralai/Mistral-7B-Instruct-v0.2")
S = "Have you ever seen a tsunami ?"
H = "Have you ever seen a tiramisu ?"
results = ppluie.compute(sources=[S], hypotheses=[H])
print(results)
>>> {'scores': [-16.97607421875]}
```

### Configure the metric

```python
ppluie.init(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device="cuda:0",
    template="FS-DIRECT",
    use_chat_template=True,
    half_mode=True,
    n_right_specials_tokens=1,
)
```
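The exact semantics of these options are best checked in the source code; plausibly, `device` selects the torch device, `use_chat_template` wraps the prompt with the model's chat template, `half_mode` loads the model in half precision, and `n_right_specials_tokens` gives the number of special tokens appended to the right of the encoded prompt (see the `check_end_tokens_tmpl` helper below).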

### Show available prompting templates

```python
ppluie.show_templates()
>>> DIRECT
>>> MEANING
>>> INDIRECT
>>> FS-DIRECT
>>> FS-DIRECT_MAJ
>>> FS-DIRECT_FR
>>> FS-DIRECT_MAJ_FR
>>> FS-DIRECT_FR_MIN
>>> NETWORK
```

### Show the LLMs already tested with ParaPLUIE

```python
ppluie.show_available_models()
>>> HuggingFaceTB/SmolLM2-135M-Instruct
>>> HuggingFaceTB/SmolLM2-360M-Instruct
>>> HuggingFaceTB/SmolLM2-1.7B-Instruct
>>> google/gemma-2-2b-it
>>> state-spaces/mamba-2.8b-hf
>>> internlm/internlm2-chat-1_8b
>>> microsoft/Phi-4-mini-instruct
>>> mistralai/Mistral-7B-Instruct-v0.2
>>> tiiuae/falcon-mamba-7b-instruct
>>> Qwen/Qwen2.5-7B-Instruct
>>> CohereForAI/aya-expanse-8b
>>> google/gemma-2-9b-it
>>> meta-llama/Meta-Llama-3-8B-Instruct
>>> microsoft/phi-4
>>> CohereForAI/aya-expanse-32b
>>> Qwen/QwQ-32B
>>> CohereForAI/c4ai-command-r-08-2024
```

### Change the prompting template

```python
ppluie.setTemplate("DIRECT")
```
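Switching templates presumably does not require reloading the model; for instance, one of the `_FR` templates (assumed here to be French few-shot variants) could be applied to a French pair directly:

```python
# Assumption: FS-DIRECT_FR is a French few-shot template.
ppluie.setTemplate("FS-DIRECT_FR")
results = ppluie.compute(
    sources=["Avez-vous déjà vu un tsunami ?"],
    hypotheses=["Avez-vous déjà vu un tiramisu ?"],
)
```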

### Check the prompt encoding

Show how the prompt is encoded, to ensure that the correct number of special tokens is removed and that the Yes / No words each fit in a single token.

```python
ppluie.check_end_tokens_tmpl()
```

## Limitations and Bias

This metric is based on an LLM and is therefore limited by the LLM used.

## Source code

GitLab

## Citation

```bibtex
@inproceedings{lemesle-etal-2025-paraphrase,
    title = "Paraphrase Generation Evaluation Powered by an {LLM}: A Semantic Metric, Not a Lexical One",
    author = "Lemesle, Quentin  and
      Chevelu, Jonathan  and
      Martin, Philippe  and
      Lolive, Damien  and
      Delhay, Arnaud  and
      Barbot, Nelly",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    year = "2025",
    url = "https://aclanthology.org/2025.coling-main.538/"
}
```