allow references to be simple list
- README.md +35 -12
- dataflow_match.py +2 -2
- my_codebleu.py +1 -1
README.md
CHANGED
@@ -12,25 +12,42 @@ pinned: false
 
 # Metric Card for CodeBLEU
 
-***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-
 ## Metric Description
-
+
+CodeBLEU from [CodeXGLUE](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans/evaluator)
+and from the article [CodeBLEU: a Method for Automatic Evaluation of Code Synthesis](https://arxiv.org/abs/2009.10297).
+
+NOTE: currently works on Linux machines only, due to a dependency on languages.so.
 
 ## How to Use
-*Give general statement of how to use the metric*
-
+
+```python
+src = 'class AcidicSwampOoze(MinionCard):§    def __init__(self):§        super().__init__("Acidic Swamp Ooze", 2, CHARACTER_CLASS.ALL, CARD_RARITY.COMMON, battlecry=Battlecry(Destroy(), WeaponSelector(EnemyPlayer())))§§    def create_minion(self, player):§        return Minion(3, 2)§'
+tgt = 'class AcidSwampOoze(MinionCard):§    def __init__(self):§        super().__init__("Acidic Swamp Ooze", 2, CHARACTER_CLASS.ALL, CARD_RARITY.COMMON, battlecry=Battlecry(Destroy(), WeaponSelector(EnemyPlayer())))§§    def create_minion(self, player):§        return Minion(3, 2)§'
+src = src.replace("§", "\n")
+tgt = tgt.replace("§", "\n")
+res = module.compute(predictions=[tgt], references=[[src]])
+print(res)
+# {'CodeBLEU': 0.9473264567644872, 'ngram_match_score': 0.8915993127600096, 'weighted_ngram_match_score': 0.8977065142979394, 'syntax_match_score': 1.0, 'dataflow_match_score': 1.0}
+```
 
 ### Inputs
-
-- **
+- **predictions** (`list` of `str`s): translations to score.
+- **references** (`list` of `list`s of `str`s; a plain `list` of `str`s, one reference per prediction, is also accepted): references for each translation.
+- **lang** (`str`): programming language, one of `['java', 'js', 'c_sharp', 'php', 'go', 'python', 'ruby']`.
+- **tokenizer**: approach used for standardizing `predictions` and `references`.
+  The default tokenizer is `tokenizer_13a`, a relatively minimal tokenization approach that is nonetheless equivalent to `mteval-v13a`, used by WMT.
+  It can be replaced by another tokenizer from a source such as [SacreBLEU](https://github.com/mjpost/sacrebleu/tree/master/sacrebleu/tokenizers).
+- **params** (`str`): weights for averaging (see the CodeBLEU paper).
+  Defaults to equal weights, `"0.25,0.25,0.25,0.25"`.
 
 ### Output Values
-
-
-
+- **CodeBLEU**: the resulting overall score.
+- **ngram_match_score**: see the CodeBLEU paper.
+- **weighted_ngram_match_score**: see the CodeBLEU paper.
+- **syntax_match_score**: see the CodeBLEU paper.
+- **dataflow_match_score**: see the CodeBLEU paper.
 
 #### Values from Popular Papers
 *Give examples, preferably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
 
@@ -39,10 +56,16 @@ pinned: false
 
 *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
 
 ## Limitations and Bias
-
+Linux OS only. See the list of supported programming languages above.
 
 ## Citation
-
+```bibtex
+@InProceedings{huggingface:module,
+  title = {CodeBLEU: A Metric for Evaluating Code Generation},
+  author = {Sedykh, Ivan},
+  year = {2022}
+}
+```
 
 ## Further References
 *Add any useful further references.*
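The `params` string described above is parsed into four weights that combine the component scores linearly, as in the CodeBLEU paper. A minimal sketch of that combination (the function name `combine_codebleu` is ours, for illustration only), using the component values from the example output:

```python
# Sketch of how the four CodeBLEU components are combined with the
# `params` weights (illustrative; not the Space's actual internals).
def combine_codebleu(scores, params="0.25,0.25,0.25,0.25"):
    alpha, beta, gamma, theta = [float(x) for x in params.split(",")]
    return (alpha * scores["ngram_match_score"]
            + beta * scores["weighted_ngram_match_score"]
            + gamma * scores["syntax_match_score"]
            + theta * scores["dataflow_match_score"])

components = {
    "ngram_match_score": 0.8915993127600096,
    "weighted_ngram_match_score": 0.8977065142979394,
    "syntax_match_score": 1.0,
    "dataflow_match_score": 1.0,
}
print(combine_codebleu(components))  # ~0.94733, matching the CodeBLEU value above
```

With equal weights this reproduces the `CodeBLEU` value from the usage example; passing, say, `"1,0,0,0"` would return the plain n-gram component alone.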
dataflow_match.py
CHANGED

@@ -36,11 +36,11 @@ def corpus_dataflow_match(references, candidates, lang, langso_dir):
     candidate = candidates[i]
     for reference in references_sample:
         try:
-            candidate=remove_comments_and_docstrings(candidate,
+            candidate=remove_comments_and_docstrings(candidate,lang)
         except:
             pass
         try:
-            reference=remove_comments_and_docstrings(reference,
+            reference=remove_comments_and_docstrings(reference,lang)
         except:
             pass
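The change threads the caller's `lang` through to the comment-stripping helper. Note that the surrounding bare `except: pass` hides any failure of that call, so if the helper ever raises, the dataflow match is silently computed on unstripped code. A toy stand-in (our naive `strip_line_comments`, which only drops full-line `#` comments; the real CodeXGLUE helper is language-aware) illustrates both the call and the silent-failure hazard:

```python
def strip_line_comments(source, lang):
    # Naive stand-in for remove_comments_and_docstrings: drop full-line
    # '#' comments; `lang` is accepted for signature parity only.
    return "\n".join(
        ln for ln in source.split("\n") if not ln.lstrip().startswith("#")
    )

candidate = "x = 1\n# a comment\ny = 2"
try:
    candidate = strip_line_comments(candidate)  # wrong arity: raises TypeError
except:
    pass  # the bare except swallows it; candidate is left unchanged
assert "# a comment" in candidate  # stripping silently did not happen

candidate = strip_line_comments(candidate, "python")  # correct call
assert candidate == "x = 1\ny = 2"
```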
my_codebleu.py
CHANGED

@@ -24,7 +24,7 @@ def calc_codebleu(predictions, references, lang, tokenizer=None, params='0.25,0.
     alpha, beta, gamma, theta = [float(x) for x in params.split(',')]
 
     # preprocess inputs
-    references = [[x.strip() for x in ref] for ref in references]
+    references = [[x.strip() for x in ref] if type(ref) == list else [ref.strip()] for ref in references]
     hypothesis = [x.strip() for x in predictions]
 
     if not len(references) == len(hypothesis):
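This one-line change is the commit's core: each reference that is already a list is stripped element-wise, while a bare string is wrapped into a one-element list, so `references` may be either a list of lists or a simple list. Restated as a standalone helper (the name `normalize_references` is ours; `isinstance` replaces the `type(ref) == list` comparison as the more idiomatic check):

```python
def normalize_references(references):
    # Wrap bare-string references into single-element lists and strip
    # whitespace, so both [["ref"]] and ["ref"] input shapes are accepted.
    return [
        [x.strip() for x in ref] if isinstance(ref, list) else [ref.strip()]
        for ref in references
    ]

print(normalize_references(["a + b ", ["c * d ", "c*d "]]))
# [['a + b'], ['c * d', 'c*d']]
```

Either shape then aligns one reference list per prediction, which is what the subsequent `len(references) == len(hypothesis)` check expects.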