# Evaluation Metrics for Whole Image Parsing

Whole Image Parsing [1], also known as Panoptic Segmentation [2], generalizes
the tasks of semantic segmentation for "stuff" classes and instance
segmentation for "thing" classes, assigning both semantic and instance labels
to every pixel in an image.

Previous works evaluate the parsing result with separate metrics (e.g., one for
the semantic segmentation result and one for the object detection result).
Recently, Kirillov et al. proposed the unified instance-based Panoptic Quality
(PQ) metric [2], which has been adopted by several benchmarks [3, 4].

However, we notice that the instance-based PQ metric often places
disproportionate emphasis on small instance parsing, as well as on "thing" over
"stuff" classes. To remedy these effects, we propose an alternative
region-based Parsing Covering (PC) metric [5], which adapts the Covering
metric [6], previously used for class-agnostic segmentation quality
evaluation, to the task of image parsing.

Here, we provide implementations of both PQ and PC for evaluating parsing
results. We briefly explain both metrics below for reference.

## Panoptic Quality (PQ)

Given a groundtruth segmentation S and a predicted segmentation S', PQ is
defined as follows:

<p align="center">
<img src="g3doc/img/equation_pq.png" width=400>
</p>

where R and R' are groundtruth regions and predicted regions respectively,
and |TP|, |FP|, and |FN| are the numbers of true positives, false positives,
and false negatives. The matching is determined by a threshold of 0.5
Intersection-Over-Union (IOU).

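For reference, the equation in the image above is the standard PQ definition
from [2] and can be transcribed as:

```latex
\mathrm{PQ} =
  \frac{\sum_{(R, R') \in \mathrm{TP}} \mathrm{IOU}(R, R')}
       {|\mathrm{TP}| + \frac{1}{2}|\mathrm{FP}| + \frac{1}{2}|\mathrm{FN}|}
```
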
PQ treats all regions of the same "stuff" class as one instance, and the
size of instances is not considered. For example, instances with 10 × 10
pixels contribute equally to the metric as instances with 1000 × 1000 pixels.
Therefore, PQ is sensitive to false positives with small regions, and some
heuristics, such as removing those small regions, could improve the reported
performance (as also pointed out in the open-sourced evaluation code from
[2]). Thus, we argue that PQ is suitable in applications where one cares
equally about the parsing quality of instances irrespective of their sizes.

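As a concrete, hypothetical illustration of this size-agnostic behavior:
suppose an image contains two groundtruth instances, a 1000 × 1000 one that is
predicted perfectly (IOU = 1.0) and a 10 × 10 one that is missed entirely, so
that |TP| = 1, |FN| = 1, and |FP| = 0. Then

```latex
% Missing the tiny instance is penalized as much as missing a large one would be.
\mathrm{PQ} = \frac{1.0}{1 + \frac{1}{2}\cdot 0 + \frac{1}{2}\cdot 1} \approx 0.67
```
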
## Parsing Covering (PC)

We notice that there are applications where one pays more attention to large
objects, e.g., autonomous driving (where nearby objects are more important
than far away ones). Motivated by this, we propose to also evaluate the
quality of image parsing results by extending the existing Covering metric
[6], which accounts for instance sizes. Specifically, our proposed metric,
Parsing Covering (PC), is defined as follows:

<p align="center">
<img src="g3doc/img/equation_pc.png" width=400>
</p>

where S<sub>i</sub> and S<sub>i</sub>' are the groundtruth segmentation and
predicted segmentation for the i-th semantic class respectively, and
N<sub>i</sub> is the total number of pixels of groundtruth regions from
S<sub>i</sub>. The Covering for class i, Cov<sub>i</sub>, is computed in
the same way as the original Covering metric, except that only groundtruth
regions from S<sub>i</sub> and predicted regions from S<sub>i</sub>' are
considered. PC is then obtained by computing the average of Cov<sub>i</sub>
over the C semantic classes.

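For reference, the definition in the image above can be transcribed as follows,
using the quantities just introduced:

```latex
\mathrm{Cov}_i =
  \frac{1}{N_i} \sum_{R \in S_i} |R| \cdot \max_{R' \in S_i'} \mathrm{IOU}(R, R'),
\qquad
\mathrm{PC} = \frac{1}{C} \sum_{i=1}^{C} \mathrm{Cov}_i
```
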
A notable difference between PQ and the proposed PC is that no matching is
involved in PC, and hence there is no matching threshold. As an attempt to
treat "thing" and "stuff" equally, the segmentation of "stuff" classes still
receives a partial PC score if the segmentation is only partially correct. For
example, if one out of three equally-sized trees is perfectly segmented, the
model receives the same partial PC score regardless of whether "tree" is
considered "stuff" or "thing".

## Tutorial

To evaluate the parsing results with PQ and PC, we provide two options:

1. Python off-line evaluation with results saved in the [COCO format](http://cocodataset.org/#format-results).
2. TensorFlow on-line evaluation.

Below, we explain each option in detail.

#### 1. Python off-line evaluation with results saved in COCO format

The [COCO result format](http://cocodataset.org/#format-results) has been
adopted by several benchmarks [3, 4]. Therefore, we provide a convenient
function, `eval_coco_format`, to evaluate results saved in COCO format
in terms of PC and re-implemented PQ.

Before using the provided function, the users need to download the official
COCO panoptic segmentation task API. Please see
[installation](../g3doc/installation.md#add-libraries-to-pythonpath)
for reference.

Once the official COCO panoptic segmentation task API is downloaded, the
users should be able to run `eval_coco_format.py` to evaluate the parsing
results in terms of both PC and re-implemented PQ.

To be concrete, let's take a look at the function `eval_coco_format` in
`eval_coco_format.py`:

```python
eval_coco_format(gt_json_file,
                 pred_json_file,
                 gt_folder=None,
                 pred_folder=None,
                 metric='pq',
                 num_categories=201,
                 ignored_label=0,
                 max_instances_per_category=256,
                 intersection_offset=None,
                 normalize_by_image_size=True,
                 num_workers=0,
                 print_digits=3):
```

where

1. `gt_json_file`: Path to a JSON file giving ground-truth annotations in COCO
format.
2. `pred_json_file`: Path to a JSON file for the predictions to evaluate.
3. `gt_folder`: Folder containing panoptic-format ID images to match
ground-truth annotations to image regions.
4. `pred_folder`: Path to a folder containing ID images for predictions.
5. `metric`: Name of the metric to compute. Set to `pc` or `pq` for evaluation
in terms of PC or PQ, respectively.
6. `num_categories`: The number of segmentation categories (or "classes") in
the dataset.
7. `ignored_label`: A category id that is ignored in evaluation, e.g., the
"void" label in the COCO panoptic segmentation dataset.
8. `max_instances_per_category`: The maximum number of instances for each
category, used to ensure unique instance labels.
9. `intersection_offset`: The maximum number of unique labels.
10. `normalize_by_image_size`: Whether to normalize groundtruth instance region
areas by image size when using PC.
11. `num_workers`: If set to a positive number, will spawn child processes to
compute parts of the metric in parallel by splitting the images between the
workers. If set to -1, will use the value of `multiprocessing.cpu_count()`.
12. `print_digits`: Number of significant digits to print in the summary of
computed metrics.

The input arguments have default values set for the COCO panoptic segmentation
dataset. Thus, users only need to provide the `gt_json_file` and the
`pred_json_file` (following the COCO format) to run the evaluation on COCO with
PQ, as in the sketch below. If users want to evaluate the results on other
datasets, they may need to change the default values.

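As a minimal sketch (the import path and all file paths below are assumptions;
adjust them to your checkout and dataset layout), the call could look like:

```python
# Hypothetical example of off-line evaluation with the COCO defaults.
from deeplab.evaluation import eval_coco_format  # assumed import path

eval_coco_format.eval_coco_format(
    gt_json_file='/path/to/panoptic_val2017.json',   # hypothetical path
    pred_json_file='/path/to/predictions.json',      # hypothetical path
    gt_folder='/path/to/panoptic_val2017',           # groundtruth ID PNGs
    pred_folder='/path/to/predicted_pngs',           # predicted ID PNGs
    metric='pq')                                     # or 'pc'
```
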
As an example, interested users could take a look at the provided unit
test, `test_compare_pq_with_reference_eval`, in `eval_coco_format_test.py`.

#### 2. TensorFlow on-line evaluation

Users may also want to run the TensorFlow on-line evaluation, similar to
[tf.contrib.metrics.streaming_mean_iou](https://www.tensorflow.org/api_docs/python/tf/contrib/metrics/streaming_mean_iou).

Below, we provide a code snippet that shows how to use the provided
`streaming_panoptic_quality` and `streaming_parsing_covering`.

```python
metric_map = {}
metric_map['panoptic_quality'] = streaming_metrics.streaming_panoptic_quality(
    category_label,
    instance_label,
    category_prediction,
    instance_prediction,
    num_classes=201,
    max_instances_per_category=256,
    ignored_label=0,
    offset=256*256)
metric_map['parsing_covering'] = streaming_metrics.streaming_parsing_covering(
    category_label,
    instance_label,
    category_prediction,
    instance_prediction,
    num_classes=201,
    max_instances_per_category=256,
    ignored_label=0,
    offset=256*256,
    normalize_by_image_size=True)
metrics_to_values, metrics_to_updates = slim.metrics.aggregate_metric_map(
    metric_map)
```

where `metric_map` is a dictionary storing the streamed results of PQ and PC.
The `category_label` and the `instance_label` are the semantic segmentation and
instance segmentation groundtruth, respectively. That is, in the panoptic
segmentation format:
`panoptic_label = category_label * max_instances_per_category + instance_label`.
Similarly, the `category_prediction` and the `instance_prediction` are the
predicted semantic segmentation and instance segmentation, respectively.

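Since the encoding above is a simple product-plus-remainder, a panoptic label
map can be split back into the two label maps with integer division and modulo.
A minimal sketch (`panoptic_label` is assumed to be an integer tensor or
array):

```python
# Decode a panoptic label map into category and instance label maps, inverting:
# panoptic_label = category_label * max_instances_per_category + instance_label.
max_instances_per_category = 256
category_label = panoptic_label // max_instances_per_category
instance_label = panoptic_label % max_instances_per_category
```
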
Below, we provide a code snippet showing how to summarize the results in the
context of `tf.summary`.

```python
summary_ops = []
for metric_name, metric_value in metrics_to_values.items():
  if metric_name == 'panoptic_quality':
    [pq, sq, rq, total_tp, total_fn, total_fp] = tf.unstack(
        metric_value, 6, axis=0)
    panoptic_metrics = {
        # Panoptic quality.
        'pq': pq,
        # Segmentation quality.
        'sq': sq,
        # Recognition quality.
        'rq': rq,
        # Total true positives.
        'total_tp': total_tp,
        # Total false negatives.
        'total_fn': total_fn,
        # Total false positives.
        'total_fp': total_fp,
    }
    # Find the valid classes that will be used for evaluation. We will
    # ignore the `void_label` class and other classes which have (tp + fn
    # + fp) equal to 0.
    valid_classes = tf.logical_and(
        tf.not_equal(tf.range(0, num_classes), void_label),
        tf.not_equal(total_tp + total_fn + total_fp, 0))
    for target_metric, target_value in panoptic_metrics.items():
      output_metric_name = '{}_{}'.format(metric_name, target_metric)
      op = tf.summary.scalar(
          output_metric_name,
          tf.reduce_mean(tf.boolean_mask(target_value, valid_classes)))
      op = tf.Print(op, [target_value], output_metric_name + '_classwise: ',
                    summarize=num_classes)
      op = tf.Print(
          op,
          [tf.reduce_mean(tf.boolean_mask(target_value, valid_classes))],
          output_metric_name + '_mean: ',
          summarize=1)
      summary_ops.append(op)
  elif metric_name == 'parsing_covering':
    [per_class_covering,
     total_per_class_weighted_ious,
     total_per_class_gt_areas] = tf.unstack(metric_value, 3, axis=0)
    # Find the valid classes that will be used for evaluation. We will
    # ignore the `void_label` class and other classes which have
    # total_per_class_weighted_ious + total_per_class_gt_areas equal to 0.
    valid_classes = tf.logical_and(
        tf.not_equal(tf.range(0, num_classes), void_label),
        tf.not_equal(
            total_per_class_weighted_ious + total_per_class_gt_areas, 0))
    op = tf.summary.scalar(
        metric_name,
        tf.reduce_mean(tf.boolean_mask(per_class_covering, valid_classes)))
    op = tf.Print(op, [per_class_covering], metric_name + '_classwise: ',
                  summarize=num_classes)
    op = tf.Print(
        op,
        [tf.reduce_mean(
            tf.boolean_mask(per_class_covering, valid_classes))],
        metric_name + '_mean: ',
        summarize=1)
    summary_ops.append(op)
  else:
    raise ValueError('The metric_name "%s" is not supported.' % metric_name)
```

Afterwards, users could use the following code to run the evaluation in
TensorFlow. For reference, `eval.py` provides a simple example of running the
streaming evaluation of mIOU for semantic segmentation.

```python
metric_values = slim.evaluation.evaluation_loop(
    master=FLAGS.master,
    checkpoint_dir=FLAGS.checkpoint_dir,
    logdir=FLAGS.eval_logdir,
    num_evals=num_batches,
    eval_op=metrics_to_updates.values(),
    final_op=metrics_to_values.values(),
    summary_op=tf.summary.merge(summary_ops),
    max_number_of_evaluations=FLAGS.max_number_of_evaluations,
    eval_interval_secs=FLAGS.eval_interval_secs)
```

### References

1. **Image Parsing: Unifying Segmentation, Detection, and Recognition**<br />
   Zhuowen Tu, Xiangrong Chen, Alan L. Yuille, and Song-Chun Zhu<br />
   IJCV, 2005.

2. **Panoptic Segmentation**<br />
   Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr
   Dollár<br />
   arXiv:1801.00868, 2018.

3. **Microsoft COCO: Common Objects in Context**<br />
   Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross
   Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and
   Piotr Dollár<br />
   In Proc. of ECCV, 2014.

4. **The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes**<br />
   Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulò, and Peter Kontschieder<br />
   In Proc. of ICCV, 2017.

5. **DeeperLab: Single-Shot Image Parser**<br />
   Tien-Ju Yang, Maxwell D. Collins, Yukun Zhu, Jyh-Jing Hwang, Ting Liu,
   Xiao Zhang, Vivienne Sze, George Papandreou, and Liang-Chieh Chen<br />
   arXiv:1902.05093, 2019.

6. **Contour Detection and Hierarchical Image Segmentation**<br />
   Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik<br />
   PAMI, 2011.