---
title: NIST_MT
emoji: 🤗
colorFrom: purple
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
- machine-translation
description:
  DARPA commissioned NIST to develop an MT evaluation facility based on the BLEU score.
---

# Metric Card for NIST's MT metric

## Metric Description
DARPA commissioned NIST to develop an MT evaluation facility based on the BLEU
score. The official script used by NIST to compute both the BLEU and NIST scores is
mteval-14.pl. The main differences between the two scores are:

- BLEU uses the geometric mean of the n-gram overlaps; NIST uses the arithmetic mean.
- NIST has a different brevity penalty.
- The NIST score from mteval-14.pl has a self-contained tokenizer (the Hugging Face implementation relies on NLTK's
  implementation of the NIST-specific tokenizer).

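
The first two differences can be illustrated with a small, self-contained sketch. It is a toy comparison rather than the full NIST formula (which additionally weights n-grams by their information content). The function names are hypothetical, the per-order precisions are made-up numbers, and the NIST penalty follows the form described in Doddington (2002), where the constant is chosen so that the penalty is 0.5 when the hypothesis is two thirds as long as the reference.

```python
import math


def geometric_mean(precisions):
    """BLEU-style combination: collapses to 0 if any n-gram order has no overlap."""
    if any(p == 0 for p in precisions):
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))


def arithmetic_mean(precisions):
    """NIST-style combination: a missing n-gram order only lowers the average."""
    return sum(precisions) / len(precisions)


def bleu_brevity_penalty(ref_len, hyp_len):
    """BLEU penalty: exp(1 - r/c) when the hypothesis is shorter than the reference."""
    if hyp_len >= ref_len:
        return 1.0
    if hyp_len == 0:
        return 0.0
    return math.exp(1 - ref_len / hyp_len)


def nist_brevity_penalty(ref_len, hyp_len):
    """NIST penalty: exp(beta * log(ratio)^2), with beta set so ratio 2/3 gives 0.5."""
    ratio = hyp_len / ref_len
    if ratio >= 1:
        return 1.0
    if ratio <= 0:
        return 0.0
    beta = math.log(0.5) / math.log(1.5) ** 2
    return math.exp(beta * math.log(ratio) ** 2)


# Made-up precisions for n-gram orders 1 to 4; the 4-gram order has no overlap.
toy_precisions = [0.8, 0.5, 0.25, 0.0]
print(geometric_mean(toy_precisions), arithmetic_mean(toy_precisions))

# A hypothesis of 12 tokens scored against a reference of 18 tokens.
print(bleu_brevity_penalty(18, 12), nist_brevity_penalty(18, 12))
```

With these toy numbers the geometric mean drops to zero because of the single empty order, while the arithmetic mean stays positive, and the two penalties assign different values to the same length ratio.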

## Intended Uses
NIST was developed for machine translation evaluation.

## How to Use

```python
import evaluate

# Load the NIST MT metric
nist_mt = evaluate.load("nist_mt")

hypothesis1 = "It is a guide to action which ensures that the military always obeys the commands of the party"
reference1 = "It is a guide to action that ensures that the military will forever heed Party commands"
reference2 = "It is the guiding principle which guarantees the military forces always being under the command of the Party"

# compute() takes keyword arguments: one entry per prediction,
# and one list of references per prediction
nist_mt.compute(predictions=[hypothesis1], references=[[reference1, reference2]])
# {'nist_mt': 3.3709935957649324}
```

### Inputs
- **predictions**: tokenized predictions to score. For sentence-level NIST, a list of tokens (str);
  for corpus-level NIST, a list (sentences) of lists of tokens (str).
- **references**: potentially multiple tokenized references for each prediction. For sentence-level NIST, a
  list (multiple potential references) of lists of tokens (str); for corpus-level NIST, a list (corpus) of lists
  (multiple potential references) of lists of tokens (str).
- **n**: highest n-gram order to score.
- **tokenize_kwargs**: arguments passed to the tokenizer (see: https://github.com/nltk/nltk/blob/90fa546ea600194f2799ee51eaf1b729c128711e/nltk/tokenize/nist.py#L139)

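
The nesting described above is easier to see written out. The following is a minimal sketch of the two input shapes only: whitespace splitting stands in for the NIST tokenizer, the example sentences are invented, and `nesting_depth` is a hypothetical helper, not part of any library.

```python
# Sentence-level: a single prediction is a list of tokens ...
sentence_prediction = "the cat sat on the mat".split()
# ... and its references are a list of token lists (one per reference).
sentence_references = [
    "the cat is on the mat".split(),
    "a cat sat on the mat".split(),
]

# Corpus-level: a list of predictions, each a list of tokens ...
corpus_predictions = [sentence_prediction, "hello world".split()]
# ... and one list of references per prediction.
corpus_references = [
    sentence_references,
    ["hello world".split(), "hi world".split()],
]


def nesting_depth(obj):
    """Count how many list levels wrap the innermost str tokens."""
    depth = 0
    while isinstance(obj, list):
        depth += 1
        obj = obj[0]
    return depth


# References are always nested one level deeper than predictions.
for name, preds, refs in [
    ("sentence", sentence_prediction, sentence_references),
    ("corpus", corpus_predictions, corpus_references),
]:
    print(name, nesting_depth(preds), nesting_depth(refs))
```

In both cases the references carry exactly one more level of nesting than the predictions, which is a quick check to run when a shape error occurs.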

### Output Values
- **nist_mt** (`float`): NIST score

Output Example:
```python
{'nist_mt': 3.3709935957649324}
```

## Citation
```bibtex
@inproceedings{10.5555/1289189.1289273,
    author = {Doddington, George},
    title = {Automatic Evaluation of Machine Translation Quality Using N-Gram Co-Occurrence Statistics},
    year = {2002},
    publisher = {Morgan Kaufmann Publishers Inc.},
    address = {San Francisco, CA, USA},
    booktitle = {Proceedings of the Second International Conference on Human Language Technology Research},
    pages = {138–145},
    numpages = {8},
    location = {San Diego, California},
    series = {HLT '02}
}
```

## Further References

This Hugging Face implementation uses [the NLTK implementation](https://github.com/nltk/nltk/blob/develop/nltk/translate/nist_score.py).
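
For readers who want to call the underlying NLTK functions directly, here is a minimal sketch. It assumes `nltk` is installed and uses plain whitespace splitting instead of the NIST tokenizer, so the resulting score is not guaranteed to match the value reported by the metric.

```python
from nltk.translate.nist_score import corpus_nist, sentence_nist

# Whitespace tokenization is an assumption made for brevity.
hypothesis = (
    "It is a guide to action which ensures that the military "
    "always obeys the commands of the party"
).split()
references = [
    "It is a guide to action that ensures that the military "
    "will forever heed Party commands".split(),
    "It is the guiding principle which guarantees the military "
    "forces always being under the command of the Party".split(),
]

# Sentence-level: a list of tokenized references and one tokenized hypothesis.
sentence_score = sentence_nist(references, hypothesis, n=5)

# Corpus-level: one list of references per hypothesis.
corpus_score = corpus_nist([references], [hypothesis], n=5)

print(sentence_score, corpus_score)
```

For a single hypothesis the sentence-level and corpus-level calls are expected to agree, since the corpus here contains only that one sentence.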