---
title: codebleu
tags:
- evaluate
- metric
description: "CodeBLEU"
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
---

# Metric Card for CodeBLEU

## Metric Description

CodeBLEU from [CodeXGLUE](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans/evaluator), as described in the article [CodeBLEU: a Method for Automatic Evaluation of Code Synthesis](https://arxiv.org/abs/2009.10297). In addition to plain n-gram overlap, it scores keyword-weighted n-gram overlap, syntactic (AST) similarity, and dataflow similarity.
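
Per the CodeBLEU paper, the overall score is a weighted combination of the four component scores (named here as in this module's output; the weights default to 0.25 each and can be changed via the `params` argument described below):

`CodeBLEU = α·ngram_match + β·weighted_ngram_match + γ·syntax_match + δ·dataflow_match`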

NOTE: currently works on Linux machines only, due to a dependency on prebuilt parser shared libraries (`.so` files) for the supported languages.

## How to Use

```python
import evaluate

module = evaluate.load("dvitel/codebleu")

# '§' stands in for newlines in these compact one-line code snippets
src = 'class AcidicSwampOoze(MinionCard):§ def __init__(self):§ super().__init__("Acidic Swamp Ooze", 2, CHARACTER_CLASS.ALL, CARD_RARITY.COMMON, battlecry=Battlecry(Destroy(), WeaponSelector(EnemyPlayer())))§§ def create_minion(self, player):§ return Minion(3, 2)§'
tgt = 'class AcidSwampOoze(MinionCard):§ def __init__(self):§ super().__init__("Acidic Swamp Ooze", 2, CHARACTER_CLASS.ALL, CARD_RARITY.COMMON, battlecry=Battlecry(Destroy(), WeaponSelector(EnemyPlayer())))§§ def create_minion(self, player):§ return Minion(3, 2)§'
src = src.replace("§", "\n")
tgt = tgt.replace("§", "\n")

res = module.compute(predictions=[tgt], references=[[src]])
print(res)
# {'CodeBLEU': 0.9473264567644872, 'ngram_match_score': 0.8915993127600096, 'weighted_ngram_match_score': 0.8977065142979394, 'syntax_match_score': 1.0, 'dataflow_match_score': 1.0}
```
### Inputs
- **predictions** (`list` of `str`): translations to score.
- **references** (`list` of `list` of `str`): one or more reference translations for each prediction.
- **lang** (`str`): programming language of the code, one of `['java', 'js', 'c_sharp', 'php', 'go', 'python', 'ruby']`.
- **tokenizer**: approach used for standardizing `predictions` and `references`.
  The default tokenizer is `tokenizer_13a`, a relatively minimal tokenization approach that is equivalent to `mteval-v13a`, used by WMT.
  It can be replaced by another tokenizer from a source such as [SacreBLEU](https://github.com/mjpost/sacrebleu/tree/master/sacrebleu/tokenizers).
- **params** (`str`): comma-separated weights for averaging the four component scores (see the CodeBLEU paper); defaults to equal weights, `"0.25,0.25,0.25,0.25"`. See the sketch after this list for a call that sets these arguments explicitly.
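
A minimal sketch of such a call, assuming `lang` and `params` are passed to `compute` as keyword arguments exactly as listed above:

```python
import evaluate

module = evaluate.load("dvitel/codebleu")

pred = "def add(a, b):\n    return a + b"
ref = "def add(x, y):\n    return x + y"

# Weights are alpha, beta, gamma, delta for the ngram, weighted ngram,
# syntax, and dataflow components; these are the documented defaults.
res = module.compute(
    predictions=[pred],
    references=[[ref]],
    lang="python",
    params="0.25,0.25,0.25,0.25",
)
print(res["CodeBLEU"])
```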

### Output Values

- **CodeBLEU**: the overall score, a weighted average of the four component scores below.
- **ngram_match_score**: plain BLEU n-gram overlap (see the CodeBLEU paper).
- **weighted_ngram_match_score**: n-gram overlap with extra weight on language keywords (see the CodeBLEU paper).
- **syntax_match_score**: fraction of matching AST subtrees (see the CodeBLEU paper).
- **dataflow_match_score**: fraction of matching dataflow edges (see the CodeBLEU paper).
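
With the default equal weights, the overall score is just the mean of the four components. A quick check against the numbers printed in the example above:

```python
# Component scores copied from the example output above
scores = {
    "ngram_match_score": 0.8915993127600096,
    "weighted_ngram_match_score": 0.8977065142979394,
    "syntax_match_score": 1.0,
    "dataflow_match_score": 1.0,
}

# Default weights of 0.25 each reduce the weighted sum to a simple mean.
codebleu = sum(scores.values()) / 4
print(codebleu)  # ~0.9473264567644872, matching the reported CodeBLEU
```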

#### Values from Popular Papers

CodeBLEU is the evaluation metric for several code generation and code translation tasks in the [CodeXGLUE](https://github.com/microsoft/CodeXGLUE) benchmark; values for baseline models are reported in the [CodeBLEU paper](https://arxiv.org/abs/2009.10297).

### Examples
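
An exact match should score 1.0 on every component, while a prediction that preserves behavior but changes identifiers and structure scores lower. A minimal sketch (exact numbers depend on the module version, so none are shown):

```python
import evaluate

module = evaluate.load("dvitel/codebleu")

ref = "def inc(x):\n    return x + 1"

# Identical prediction and reference: every component should be 1.0.
print(module.compute(predictions=[ref], references=[[ref]], lang="python"))

# Renamed variable plus an extra temporary: same behavior, but n-gram,
# syntax, and dataflow matches all degrade, so the score drops.
pred = "def inc(y):\n    z = y + 1\n    return z"
print(module.compute(predictions=[pred], references=[[ref]], lang="python"))
```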

## Limitations and Bias
Works on Linux only (see the note above about the prebuilt `.so` parser libraries) and only for the programming languages listed under Inputs.

## Citation
```bibtex
@article{ren2020codebleu,
  title={CodeBLEU: a Method for Automatic Evaluation of Code Synthesis},
  author={Ren, Shuo and Guo, Daya and Lu, Shuai and Zhou, Long and Liu, Shujie and Tang, Duyu and Sundaresan, Neel and Zhou, Ming and Blanco, Ambrosio and Ma, Shuai},
  journal={arXiv preprint arXiv:2009.10297},
  year={2020}
}
```

## Further References
- [CodeBLEU: a Method for Automatic Evaluation of Code Synthesis](https://arxiv.org/abs/2009.10297) (original paper)
- [CodeXGLUE evaluator implementation](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans/evaluator)