Rodrigo Ferreira Rodrigues
---
title: Place_gen_evaluate
datasets:
  - GeoBenchmark
tags:
  - evaluate
  - metric
description: 'TODO: add a description here'
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: false
---

# Metric Card for Place_gen_evaluate

## Metric Description

This metric evaluates geographic place prediction tasks performed by language models (LMs). For each question, the model is expected to generate a list of places, and the gold answer must likewise be a list of place names.

## How to Use

This metric takes two mandatory arguments: `generations` (a list of strings) and `golds` (a list of lists of strings containing place names).

```python
import evaluate

place_pred_eval = evaluate.load("rfr2003/place_gen_evaluate")
results = place_pred_eval.compute(
    generations=['[Hotel New Home, Hopeland]'],
    golds=[['Bar Guisness', 'Hotel New Home', 'New Hopeland']],
)
print(results)
```

Output:

```python
{'bert_score_precision': 0.8470218181610107, 'bert_score_recall': 0.9131535291671753, 'bert_score_f1': 0.8788453936576843, 'bleu-1': 0.5714285714285714, 'precision': [6.0], 'rappel': [15.0], 'macro-mean': [10.5], 'median macro-mean': 10.5}
```
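Note that each generation in the example above is a single bracketed string rather than a Python list. As an illustration only, a minimal helper for splitting such a string into place names could look like the sketch below (`parse_generation` is a hypothetical name; the metric's actual internal parsing may differ):

```python
def parse_generation(generation: str) -> list[str]:
    """Split a bracketed generation string like '[A, B]' into ['A', 'B']."""
    return [name.strip() for name in generation.strip("[]").split(",")]

print(parse_generation("[Hotel New Home, Hopeland]"))
```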

This metric accepts one optional argument:

`d`: the function used to compute the distance between a generated value and a gold one. By default this is the `distance` function from the Levenshtein library.
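To illustrate the kind of callable `d` expects, here is a self-contained sketch: a plain Levenshtein edit distance (the documented default behavior, reimplemented here for illustration) and a custom case-insensitive variant one might pass instead. The `case_insensitive_distance` name is hypothetical:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def case_insensitive_distance(pred: str, gold: str) -> int:
    """A custom distance that ignores capitalization differences."""
    return levenshtein(pred.lower(), gold.lower())
```

A custom function like this would then be supplied via the optional argument, e.g. `compute(..., d=case_insensitive_distance)`.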

## Output Values

This metric outputs a dictionary with the following values:

- `bert_score_precision`: average of the precision values computed by the `bertscore` module.
- `bert_score_recall`: average of the recall values computed by the `bertscore` module.
- `bert_score_f1`: average of the F1 values computed by the `bertscore` module.
- `bleu-1`: BLEU-1 score computed by the `bleu` module.
- `precision`: for each question, the sum of the minimum distances between each predicted value and the set of gold values.
- `rappel`: for each question, the sum of the minimum distances between each gold value and the set of generated values ("rappel" is French for recall).
- `macro-mean`: for each question, the average of `precision` and `rappel`.
- `median macro-mean`: median across the `macro-mean` values.
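The distance-based values above can be sketched as follows. This is a minimal reimplementation for illustration only, assuming plain Levenshtein distance as `d`; it is not necessarily identical to the metric's internal code, and the toy data below is invented:

```python
from statistics import median

def levenshtein(a: str, b: str) -> int:
    """Plain edit distance, standing in for the distance function d."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def question_scores(preds: list[str], golds: list[str]) -> dict:
    """Per-question precision, rappel, and macro-mean as described above."""
    precision = sum(min(levenshtein(p, g) for g in golds) for p in preds)
    rappel = sum(min(levenshtein(g, p) for p in preds) for g in golds)
    return {"precision": precision, "rappel": rappel,
            "macro-mean": (precision + rappel) / 2}

# One question with toy data; the median is taken across all questions.
scores = [question_scores(["cat", "dog"], ["cats", "dig"])]
print(scores, median(s["macro-mean"] for s in scores))
```

Because these values are sums of distances, lower is better, unlike the BERTScore and BLEU values.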

## Values from Popular Papers

## Examples

```python
import evaluate

place_pred_eval = evaluate.load("rfr2003/place_gen_evaluate")
results = place_pred_eval.compute(
    generations=['[Hotel New Home, Hopeland]'],
    golds=[['Bar Guisness', 'Hotel New Home', 'New Hopeland']],
)
print(results)
```

Output:

```python
{'bert_score_precision': 0.8470218181610107, 'bert_score_recall': 0.9131535291671753, 'bert_score_f1': 0.8788453936576843, 'bleu-1': 0.5714285714285714, 'precision': [6.0], 'rappel': [15.0], 'macro-mean': [10.5], 'median macro-mean': 10.5}
```

## Limitations and Bias

## Citation

## Further References