---
title: ParaPLUIE
emoji: ☂️
tags:
- evaluate
- metric
description: >-
ParaPLUIE is a metric for evaluating the semantic proximity between two sentences.
ParaPLUIE uses the perplexity of an LLM to compute a confidence score. It has
shown the highest correlation with human judgment on paraphrase
classification while maintaining a low computational cost, as it is roughly
equivalent to the cost of generating a single token.
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
short_description: ParaPLUIE is a metric for evaluating the semantic proximity
---
# Metric Card for ParaPLUIE (Paraphrase Generation Evaluation Powered by an LLM)
## Metric Description
ParaPLUIE is a metric for evaluating the semantic proximity between two sentences.
ParaPLUIE uses the perplexity of an LLM to compute a confidence score.
It has shown the highest correlation with human judgment on paraphrase classification while maintaining a low computational cost, as it is roughly equivalent to the cost of generating a single token.
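As an illustration of the underlying idea (a minimal sketch, not the actual implementation): the score can be viewed as the log-odds the LLM assigns to answering "Yes" versus "No" when asked whether the two sentences are paraphrases. The logit values below are made up for illustration.

```python
def log_odds_score(yes_logit: float, no_logit: float) -> float:
    """Log-probability difference between the "Yes" and "No" answer tokens.

    Positive -> the model leans towards "Yes" (paraphrase),
    negative -> the model leans towards "No".
    """
    # log softmax(yes) - log softmax(no): the normalizer cancels,
    # leaving the raw logit difference.
    return yes_logit - no_logit

# Made-up logits for the answer position of a non-paraphrase pair:
print(log_odds_score(yes_logit=2.5, no_logit=7.1))  # negative score
```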
## How to Use
This metric requires a source sentence and its hypothetical paraphrase.
```python
import evaluate
ppluie = evaluate.load("qlemesle/parapluie")
ppluie.init(model="mistralai/Mistral-7B-Instruct-v0.2")
S = "Have you ever seen a tsunami ?"
H = "Have you ever seen a tiramisu ?"
results = ppluie.compute(sources=[S], hypotheses=[H])
print(results)
>>> {'scores': [-16.97607421875]}
```
### Inputs
- **sources** (`list` of `string`): Source sentences.
- **hypotheses** (`list` of `string`): Hypothetical paraphrases.
### Output Values
- **scores** (`list` of `float`): ParaPLUIE scores, one per sentence pair. Minimum possible value is -inf. Maximum possible value is +inf. A score greater than 0 means that the sentences are paraphrases. A score lower than 0 indicates the opposite.
This metric outputs a dictionary containing the list of scores.
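Since the sign of the score carries the classification, a batch of scores can be turned into binary paraphrase decisions with a threshold at 0 (a usage sketch; `results` is assumed to have the shape shown above):

```python
results = {"scores": [-16.97607421875]}  # as returned by ppluie.compute(...)

# A score > 0 means "paraphrase"; a score <= 0 means "not a paraphrase".
decisions = [score > 0 for score in results["scores"]]
print(decisions)  # [False]
```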
### Examples
Simple example
```python
import evaluate
ppluie = evaluate.load("qlemesle/parapluie")
ppluie.init(model="mistralai/Mistral-7B-Instruct-v0.2")
S = "Have you ever seen a tsunami ?"
H = "Have you ever seen a tiramisu ?"
results = ppluie.compute(sources=[S], hypotheses=[H])
print(results)
>>> {'scores': [-16.97607421875]}
```
Configure the metric
```python
ppluie.init(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device="cuda:0",
    template="FS-DIRECT",
    use_chat_template=True,
    half_mode=True,
    n_right_specials_tokens=1,
)
```
Show the available prompting templates
```python
ppluie.show_templates()
>>> DIRECT
>>> MEANING
>>> INDIRECT
>>> FS-DIRECT
>>> FS-DIRECT_MAJ
>>> FS-DIRECT_FR
>>> FS-DIRECT_MAJ_FR
>>> FS-DIRECT_FR_MIN
>>> NETWORK
```
Show the LLMs that have already been tested with ParaPLUIE
```python
ppluie.show_available_models()
>>> HuggingFaceTB/SmolLM2-135M-Instruct
>>> HuggingFaceTB/SmolLM2-360M-Instruct
>>> HuggingFaceTB/SmolLM2-1.7B-Instruct
>>> google/gemma-2-2b-it
>>> state-spaces/mamba-2.8b-hf
>>> internlm/internlm2-chat-1_8b
>>> microsoft/Phi-4-mini-instruct
>>> mistralai/Mistral-7B-Instruct-v0.2
>>> tiiuae/falcon-mamba-7b-instruct
>>> Qwen/Qwen2.5-7B-Instruct
>>> CohereForAI/aya-expanse-8b
>>> google/gemma-2-9b-it
>>> meta-llama/Meta-Llama-3-8B-Instruct
>>> microsoft/phi-4
>>> CohereForAI/aya-expanse-32b
>>> Qwen/QwQ-32B
>>> CohereForAI/c4ai-command-r-08-2024
```
Change the prompting template
```python
ppluie.setTemplate("DIRECT")
```
Show how the prompt is encoded, to ensure that the correct number of special tokens is removed and that the words "Yes" and "No" each fit into a single token
```python
ppluie.check_end_tokens_tmpl()
```
## Limitations and Bias
This metric relies on an underlying LLM, so its judgments inherit the limitations and biases of the chosen model.
## Source code
[GitLab](https://gitlab.inria.fr/expression/paraphrase-generation-evaluation-powered-by-an-llm-a-semantic-metric-not-a-lexical-one-coling-2025)
## Citation
```bibtex
@inproceedings{lemesle-etal-2025-paraphrase,
title = "Paraphrase Generation Evaluation Powered by an {LLM}: A Semantic Metric, Not a Lexical One",
author = "Lemesle, Quentin and
Chevelu, Jonathan and
Martin, Philippe and
Lolive, Damien and
Delhay, Arnaud and
Barbot, Nelly",
booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
year = "2025",
url = "https://aclanthology.org/2025.coling-main.538/"
}
```