Title: CancerUniT: Towards a Single Unified Model for Effective Detection, Segmentation, and Diagnosis of Eight Major Cancers Using a Large Collection of CT Scans

URL Source: https://arxiv.org/html/2301.12291

Jieneng Chen^{1,2}*, Yingda Xia^{1}*, Jiawen Yao^{1,3}, Ke Yan^{1,3}, Jianpeng Zhang^{1,3}, Le Lu^{1}, Fakai Wang^{1},
Bo Zhou^{1}, Mingyan Qiu^{1,3}, Qihang Yu^{2}, Mingze Yuan^{1,3}, Wei Fang^{1,3}, Yuxing Tang^{1}, Minfeng Xu^{1,3},
Jian Zhou^{4}, Yuqian Zhao^{5}, Qifeng Wang^{5}, Xianghua Ye^{6}, Xiaoli Yin^{7}, Yu Shi^{7}, Xin Chen^{8,9},
Jingren Zhou^{1,3}, Alan Yuille^{2}, Zaiyi Liu^{8,9}*, Ling Zhang^{1}

*Correspondence to Jieneng Chen (jienengchen01@gmail.com), Yingda Xia (yingda.xia@alibaba-inc.com), and Zaiyi Liu (zyliu@163.com)

^{1} DAMO Academy, Alibaba Group  ^{2} Johns Hopkins University  ^{3} Hupan Lab, 310023, Hangzhou, China  ^{4} Sun Yat-sen University Cancer Center  ^{5} Sichuan Cancer Hospital  ^{6} The First Affiliated Hospital of Zhejiang University  ^{7} Shengjing Hospital of China Medical University  ^{8} Guangdong Provincial People's Hospital  ^{9} Guangdong Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application

Abstract

Human readers or radiologists routinely perform full-body multi-organ multi-disease detection and diagnosis in clinical practice, while most medical AI systems are built to focus on single organs with a narrow list of a few diseases. This might severely limit AI's clinical adoption. A number of AI models would need to be assembled non-trivially to match the diagnostic process of a human reading a CT scan. In this paper, we construct a Unified Tumor Transformer (CancerUniT) model to jointly detect tumor existence and location and diagnose tumor characteristics for eight major cancers in CT scans. CancerUniT is a query-based Mask Transformer model with multi-tumor prediction as output. We decouple the object queries into organ queries, tumor detection queries, and tumor diagnosis queries, and further establish hierarchical relationships among the three groups. This clinically inspired architecture effectively assists inter- and intra-organ representation learning of tumors and facilitates these complex, anatomically related multi-organ cancer image reading tasks. CancerUniT is trained end-to-end on a curated large-scale collection of CT images from 10,042 patients, covering eight major cancer types and co-occurring non-cancer tumors (all pathology-confirmed, with 3D tumor masks annotated by radiologists). On the test set of 631 patients, CancerUniT demonstrates strong performance under a set of clinically relevant evaluation metrics, substantially outperforming both multi-disease methods and an assembly of eight single-organ expert models in tumor detection, segmentation, and diagnosis. This moves one step closer to a universal, high-performance cancer screening tool.

1 Introduction


Figure 1: We aim at cancer and non-cancer detection, segmentation, and diagnosis in eight major organs via CT scans. Seven of our eight targeted cancers rank among the top seven in terms of mortality.

Cancer, a leading cause of death worldwide, continues to shorten human life expectancy and impose huge societal burdens despite significant progress in medical research [56, 44]. Medical imaging is a powerful tool for the detection and diagnostic examination of cancer and is widely used in clinical practice. The daily work of radiologists in reading and interpreting cancer imaging findings includes three main clinical tasks: detection, quantification, and diagnosis [4]. Since Computed Tomography (CT) body scans are very common (nearly 80% of all CT exams) [41] and each scan can have hundreds of image slices, missed detections and misdiagnoses of cancer are pain points in the radiology workflow. Human readers statistically tend to have high specificity but low sensitivity when tagging and reporting various anomalies or diseases.

Computer-aided detection (CADe) and diagnosis (CADx) can assist radiologists and oncologists in improving tumor detection rates and diagnostic accuracy [16, 4]. With the development of deep learning and Convolutional Neural Networks (CNNs), CAD algorithms have met or exceeded expert-level performance in some specific applications [36, 31, 14, 3]. However, most CAD expert systems focus on single-organ diseases [49], e.g., pancreas [80, 75], liver [25, 12], lung [3, 27], or kidney tumors [21], while radiologists, in turn, must be responsible for all possible diseases and radiologically significant findings [38]. For example, in an abdominal CT examination initially targeted at the gallbladder, all visible organs in the entire abdomen need to be carefully inspected by a radiologist, even when there are no clear prior indications (e.g., abdominal pain). Therefore, the role of current CAD tools is still very limited in clinical practice, far from functioning as universally as a human reader. A versatile CAD tool that can perform many more critical medical tasks would be more clinically desirable and is thus in high demand [43].

Despite notable progress in multi-organ segmentation, detecting and diagnosing multiple cancers is considerably more difficult than segmenting organs alone, for several reasons: (1) tumors come in a variety of types, appearances, and sizes, making them hard to detect; (2) tumor detection requires differentiating tumors from normal tissue within an organ, which is more challenging than differentiating organs from the background in organ segmentation; and (3) cancer diagnosis involves fine-grained categorization of tumors, which necessitates a high level of expertise and specialized training. Aiming at the universal lesion detection problem in CT scans, DeepLesion [66, 65, 63] is a pioneering publicly available dataset, but despite much follow-up work, most cancer detection, quantification, and diagnosis solutions derived from it remain insufficient in the following aspects. First, the data size and patient number for a single disease can be small, and some major cancer types (e.g., esophagus, stomach, and colorectum) are scarce, resulting in relatively high false positive rates and sub-optimal detection rates. Second, voxel-level tumor annotations such as 3D masks (which require a high level of clinical expertise and are very tedious to label) are not available, making precise 3D quantification difficult, if not impossible. Third, the pathological gold standard confirming tumor types is unavailable, making it impossible to distinguish between malignant and benign lesions. Recent clinical validations of two multi-disease detection AI systems [40, 61] found that ruling out irrelevant CAD findings (i.e., false positives and lesions without adequate malignancy assessment) was very time-consuming and confusing for radiologists. These observations clearly indicate the essential limitations of applying the DeepLesion dataset [66] toward positive clinical impact.

In this paper, we curate a large (abdominal and chest) CT image dataset covering eight major cancers (the top seven cancers with the highest mortality in the world [44]: lung, colorectum, liver, stomach, breast, esophagus, and pancreas, plus a public kidney dataset [21]) totaling 10,673 patients (of these, breast cancer has the fewest patients, 478, and lung cancer the most, 2,402; in addition, there are 1,055 normal controls). All tumor types (and subtypes) of the seven organs are confirmed by either surgical or biopsy pathology and recorded as gold-standard labels, with the full spectrum of tumor subtypes covered for four organs. All confirmed tumors in the CT scans are manually segmented or delineated in 3D by board-certified radiologists who specialize in the particular organ or disease type. To our knowledge, previous datasets with similar tumor characteristics cover only a single disease at the scale of hundreds of patients, such as the pancreatic tumor [75] and kidney tumor [21] datasets. The curation of our new 8-cancer dataset is a major step towards building a universal multi-cancer image reading AI model, with the hope of reaching a performance level comparable to radiologists specializing in different cancer types, for assisting radiologists and general clinicians in precision detection, quantification, and diagnosis. Fig. 1 shows our goal of cancer and non-cancer detection, segmentation, and diagnosis in eight major organs via CT scans.

In addition, we propose a new clinically interpretable computing architecture, named the Unified Tumor Transformer (CancerUniT). In general, CancerUniT is a single unified model that simultaneously solves multi-tumor detection, segmentation, and diagnosis in a semantic segmentation manner. Our motivations are: (1) organs, cancers, and non-cancer tumors are interrelated through both appearance similarity and human anatomical constraints, e.g., HCC (a major malignant liver cancer) and cysts (benign lesions) both occur inside the liver, with textural and other visual differences, while HCC and PDAC (a type of pancreatic cancer) appear in two different organs but are both malignant carcinomas; (2) unified learning of multiple organs and tumors can reduce the performance uncertainty and architectural complexity of assembling multiple single models, e.g., conflicting predictions of the same intended object or finding by different models. To collaboratively model such differences and connections or dependencies, we propose a novel representation learning method that represents each organ and tumor as an object query of the Transformer in a semantic hierarchy. The object queries are divided into organ queries, tumor detection queries, and diagnosis queries, and we establish a query hierarchy based on the clinical meaning of the queries. This design explicitly encourages the queries to learn the inter-organ and intra-organ relationships needed to solve the clinically sophisticated multi-cancer tumor recognition tasks.

CancerUniT is trained and tested on our curated dataset. CancerUniT outperforms the DeepLesion model, the ensemble of single-organ expert models, and unified baseline models (trained on our data). Compared to the DeepLesion model, CancerUniT has 29.3% higher sensitivity and, by a large margin, 77.5% higher specificity in tumor detection. Compared to an ensemble of individually trained single-organ nnUNet models, CancerUniT improves on average by 6.7% in tumor detection sensitivity, 2.8% in diagnostic accuracy, and 3.9% in Dice segmentation score across all organs; on normal patients, CancerUniT improves specificity by 22.5% (ours 81.7% vs. nnUNet 59.2%); and CancerUniT is 4.5 times faster at test time. In comparison to a unified nnUNet model handling all eight organs, CancerUniT leads by 5.3% in lesion detection sensitivity, 6.7% in diagnostic accuracy, 2.8% in specificity, and 2.7% in Dice segmentation score. These improvements indicate that different types of tumors are mutually correlated and that the design of CancerUniT successfully captures this clinical relationship for enhanced tumor representation learning. The high performance of CancerUniT also sheds light on its clinical potential for real-world multi-cancer detection, segmentation, and diagnosis.

2 Related Work

CADe and CADx. CADe normally refers to the computer-aided localization of lesions in 2D/3D medical images, and CADx subsequently diagnoses lesions or findings as either malignant or benign [4] and assigns further tumor characteristics. Along with advances in deep learning, quantitative CADe performance matching or exceeding medical domain experts has been reported in several specialized single-organ clinical applications: breast cancer screening [36], lung cancer detection [3], retinal disease referral [14], skin disease diagnosis [31], and so on.

Tumor detection, segmentation, and diagnosis in CT via CNNs. CNNs have been widely applied to detect, segment, and diagnose cancers/tumors in CT scans. Lung nodule detection in low-dose CT [47] is the recommended lung cancer screening protocol, for which promising results have been reported [3, 23, 59, 76]. Image segmentation networks [33, 9, 39, 26, 78] are well adopted under the per-pixel classification setting, where a segmentation model predicts the probability distribution over all possible categories or labels per pixel (a structured dense prediction problem). Segmenting abdominal organs and detecting tumors by segmentation principles [55, 2, 5, 21, 74, 71, 22] serve a key role towards fully automated tumor detection [80, 57, 70, 58], differential diagnosis, and reporting [75]. Despite their promising performance, these approaches are often specialized to a single organ. Multi-organ segmentation methods [34, 29, 54, 73, 15, 77] are emerging, but the difficulty of multi-cancer detection and diagnosis is considerably greater than that of the organ level. DeepLesion [66] attempts to tackle the universal lesion detection task in CT scans, but its derived lesion detection methods [65, 63, 62] and several follow-up works [67, 35, 64, 46, 45] have so far reported mostly moderate multi-class lesion detection performance. Distinguishing between malignant and benign lesions in the multi-class tumor setting is still far from a clinical reality.

Transformers [50] have advanced the state-of-the-art performance in various computer vision tasks [18, 6, 32, 8, 19, 79, 48] by capturing global interactions between image patches and having no built-in inductive prior. The success of Transformers has also been witnessed in medical image detection and segmentation [7, 60, 20]. With recent progress in Transformers [6, 52], a new variant called the mask Transformer has been proposed, in which segmentation predictions are represented by a set of query embeddings with their own semantic class labels, generated by converting query embeddings to mask embedding vectors and multiplying them with the image features. The essential component of mask Transformers is the decoder, which takes object queries as input and gradually transforms them into mask embedding vectors. Recent works [42, 52, 11, 10] inspire us to represent tumors in the medical domain as class queries [42] within the Transformer formulation. In this paper, we propose a novel semantic hierarchical representation to exploit the relationships in detection, diagnosis, and differentiation among eight major tumors and their sub-types, learned from a large dataset of CT scans collected from both healthy subjects and cancer patients.

3 Method

In this section, we first define the problem of tumor detection, segmentation, and diagnosis from an image semantic segmentation perspective in Sec. 3.1. We then give an overview of the query-based mask Transformer and how we integrate it as our segmentation decoder in Sec. 3.2. After that, we introduce the proposed Unified Tumor Transformer (CancerUniT) in Sec. 3.3, which represents tumors by a semantic query hierarchy, solving tumor detection, segmentation, and diagnosis in a unified manner.

3.1 Problem Definition

We focus on three tasks in CT images: tumor detection, segmentation, and diagnosis. Tumor detection aims to locate the presence of target types of tumors. Tumor segmentation aims to provide per-pixel annotation of the tumor region. Tumor diagnosis aims to classify the specific subtype of a detected tumor. We denote $\mathbf{s}_o$ as the set of organs and $\mathbf{s}_t$ as the set of tumors. Specifically, in our dataset,

$\mathbf{s}_o$ = {breast, lung, kidney, pancreas, esophagus, liver, stomach, colorectum}

$\mathbf{s}_t$ = {breast cancer, lung cancer, colorectal cancer, pancreas PDAC, pancreas nonPDAC, liver HCC, liver ICC, liver metastasis, liver hemangioma, stomach GC, stomach nonGC, esophagus EC, esophagus nonEC, kidney tumor/cyst}

We propose to solve these three tasks with a semantic segmentation framework, in which we assign each voxel in the CT scan a semantic label $k \in \mathbf{s}_o \cup \mathbf{s}_t$, so the total number of classes is $K = |\mathbf{s}_o| + |\mathbf{s}_t|$. Tumor detection, segmentation, and diagnosis are then evaluated based on the semantic segmentation results.
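As a concrete sketch of this label space, the two sets listed above give $|\mathbf{s}_o| = 8$ organ classes and $|\mathbf{s}_t| = 14$ tumor classes, so $K = 22$:

```python
# Label space of Sec. 3.1: every voxel gets one class from s_o U s_t,
# so the number of segmentation classes is K = |s_o| + |s_t|.
s_o = ["breast", "lung", "kidney", "pancreas",
       "esophagus", "liver", "stomach", "colorectum"]
s_t = ["breast cancer", "lung cancer", "colorectal cancer",
       "pancreas PDAC", "pancreas nonPDAC",
       "liver HCC", "liver ICC", "liver metastasis", "liver hemangioma",
       "stomach GC", "stomach nonGC",
       "esophagus EC", "esophagus nonEC", "kidney tumor/cyst"]
K = len(s_o) + len(s_t)  # 8 organs + 14 tumor classes = 22
```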

3.2 Basis: Query-based Mask Transformers

Although Transformers have been used for medical image segmentation as feature extractors, the query-based mask Transformer decoder [42, 52, 11] is rarely explored in medical images. A query-based mask Transformer decodes pixel-level features (usually from a CNN backbone) with object queries. Our method is based on this design, and here we provide an overview of its basic components.

Query initialization. A set of $K$ learnable class queries (i.e., embeddings) $\mathbf{q} = [q_1, \ldots, q_K] \in \mathbb{R}^{K \times d}$ is defined, where $K$ is the number of classes and $d$ is the query dimension. Each class query is initialized randomly and assigned to a single semantic class.

Query interaction via Transformer. The queries are updated through multi-head cross-attention, multi-head self-attention, and feedforward network[50]. The multi-head cross-attention between queries and image features is computed to update queries conditioned on image features. The multi-head self-attention allows queries to interact with each other.

Decode queries to segmentation. The class queries $\mathbf{q}$ are processed jointly with 3D image features $\mathbf{F} \in \mathbb{R}^{d \times D \times H \times W}$ by the decoder. $K$ masks are generated by computing the scalar product between the L2-normalized image features $\mathbf{F}$ and the class queries $\mathbf{q} \in \mathbb{R}^{K \times d}$. The set of class masks is computed as:

$$\mathbf{M} = \mathbf{q} \times \mathbf{F} \qquad (1)$$

where $\mathbf{M} \in \mathbb{R}^{K \times D \times H \times W}$ contains the $K$ mask predictions, followed by a softmax to obtain the final voxel-wise class probability map for semantic segmentation.
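Eq. 1 is a contraction over the shared channel dimension $d$. A minimal NumPy sketch, assuming hypothetical sizes ($K=22$ classes, $d=64$ channels, an $8 \times 16 \times 16$ feature volume):

```python
import numpy as np

K, d, D, H, W = 22, 64, 8, 16, 16
rng = np.random.default_rng(0)

q = rng.standard_normal((K, d))           # class queries, one per semantic class
F = rng.standard_normal((d, D, H, W))     # 3D image features from the backbone
F = F / np.linalg.norm(F, axis=0, keepdims=True)  # L2-normalize along channels

# M = q x F : contract the shared channel dimension d (Eq. 1)
M = np.einsum('kd,dxyz->kxyz', q, F)      # (K, D, H, W) mask logits

# softmax over the K classes yields the per-voxel class probability map
P = np.exp(M - M.max(axis=0, keepdims=True))
P = P / P.sum(axis=0, keepdims=True)
```

Each voxel's probabilities over the $K$ classes sum to one, matching the per-voxel classification view of segmentation.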

3.3 CancerUniT: Unified Tumor Transformer

We introduce the novel Unified Tumor Transformer (see Fig. 2), which includes a semantic query hierarchy for tumor representation, a UNet backbone for feature extraction, a Transformer for query interaction, and dual-task query decoding for the tumor detection and cancer diagnosis tasks.


Figure 2: Overview of Unified Tumor Transformer (CancerUniT). We first represent tumors as queries $\mathbf{A}$ and $\mathbf{B}$ (i.e., feature embeddings), and then build the query hierarchy from $\mathbf{A}$ to $\mathbf{B}$ via a linear projection $\mathbf{FC}$ according to the tumor sub-type relationship. The tumor queries interact and are updated in a Transformer decoder that takes UNet features $\mathbf{F}$ as input. Dual-task query decoding then generates semantic segmentation maps for the two tasks: the detection task focuses on major tumor classes, while the diagnosis task is supervised by fine-grained cancer sub-types. In the inference stage, the dual-task tumor segmentation maps are post-processed separately to produce multi-class tumor instances for tumor detection and cancer diagnosis.

3.3.1 Query Hierarchy

We propose a novel tumor representation via a semantic hierarchy of queries, including shared, detection, and diagnosis queries. By this design, tumors are represented as queries and a “detection-to-diagnosis hierarchy” is established based on the semantic relationship of tumors.

We hereby divide the segmentation targets $\mathbf{s}_o \cup \mathbf{s}_t$ into three non-overlapping groups, i.e., $\mathbf{m}$, $\mathbf{n}$, and $\mathbf{s}$. $\mathbf{m}$ consists of $m$ general tumor categories that require further diagnosis. The $i^{th}$ element $\mathbf{m}_i$ can be further categorized into the sub-classes $\mathbf{n}_i$, comprising $n_i$ sub-classes. $\mathbf{s}$ consists of the rest of the targets, including the eight organs and the four cancers that do not require diagnosis in our data.

Shared query. We define a set of shared queries $\mathbf{s} \in \mathbb{R}^{s \times d}$ to represent the classes shared by the detection task and the diagnosis task. The shared classes include the 8 organ classes and the 4 tumor classes without sub-types.

$\mathbf{s}$ = $\mathbf{s}_o \cup$ {lung cancer, breast cancer, colorectal cancer, kidney tumor/cyst}

$\mathbf{m}$ = {pancreas tumor, liver tumor, stomach tumor, esophagus tumor}

$\mathbf{n}$ = {{PDAC, nonPDAC}, {HCC, ICC, metastasis, hemangioma}, {GC, nonGC}, {EC, nonEC}}

Detection query. We denote $\mathbf{A} \in \mathbb{R}^{m \times d}$ as detection queries, with $m$ specifying the number of queries. Each detection query corresponds to the general tumor class of an organ that requires further diagnosis.

Diagnosis query. Cancer diagnosis relies on fine-grained tumor categorization. Similarly, we denote feature embeddings $\mathbf{B}_i \in \mathbb{R}^{n_i \times d}$ as diagnosis queries, with $n_i$ specifying the number of diagnosis classes for tumor $\mathbf{m}_i$. A group of $n_i$ diagnosis queries corresponds to the $|\mathbf{n}_i|$ tumor sub-types $\mathbf{n}_i$ occurring in organ $i$, and in total we have $n = \sum_{i=1}^{|\mathbf{n}|} n_i$ diagnosis queries in this work.

Detection-to-Diagnosis hierarchy via linear projection. Inspired by the clinical practice of detection-then-diagnosis, and given that diagnosis queries correspond to subtypes of detection queries, we build a graph treating the detection queries as parent nodes and the diagnosis queries as child nodes. In this way, the model is able to learn the hierarchical representation of tumors explicitly.

To build the semantic hierarchical relationship, we project a detection query $\mathbf{A}_i \in \mathbb{R}^{1 \times d}$ into diagnosis queries $\mathbf{B}_i \in \mathbb{R}^{1 \times n_i d}$ via a linear projection layer with matrix $\mathbf{W}_i \in \mathbb{R}^{n_i d \times d}$. The detection-to-diagnosis procedure is formulated as:

$$\mathbf{B}_i = \mathbf{A}_i \times \mathbf{W}_i^{T} \qquad (2)$$

where $\mathbf{B}_i = [\mathbf{B}_{i,1}, \mathbf{B}_{i,2}, \ldots, \mathbf{B}_{i,n_i}]$ and the subgroup capacity $n_i = |\mathbf{n}_i|$ is the number of subtypes of tumor $\mathbf{m}_i$. The detection queries $\mathbf{A}$ are learnable parameters that are randomly initialized, while the diagnosis queries $\mathbf{B}$ are feature embeddings conditioned on the detection queries.
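The projection of Eq. 2 expands one parent query into its $n_i$ child queries. A minimal sketch, assuming a hypothetical query dimension $d=64$ and the liver group with $n_i=4$ sub-types (HCC, ICC, metastasis, hemangioma):

```python
import numpy as np

d, n_i = 64, 4
rng = np.random.default_rng(0)

A_i = rng.standard_normal((1, d))         # detection query (parent node)
W_i = rng.standard_normal((n_i * d, d))   # per-organ linear projection matrix

# B_i = A_i W_i^T gives a (1, n_i*d) vector; reshape into n_i child
# diagnosis queries of dimension d, conditioned on the parent query
B_i = (A_i @ W_i.T).reshape(n_i, d)
```

Because `B_i` is a deterministic function of `A_i`, gradients from the diagnosis task also flow back into the parent detection query, which is what ties the two levels of the hierarchy together.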

3.3.2 Meta Architecture

The proposed architecture includes a UNet backbone for feature extraction, a Transformer for query interaction, and a dual-task query decoding stage to generate segmentation masks. The detailed model instantiation is given in the Appendix.

nnUNet backbone for feature extraction. We adopt nnUNet [26] as the backbone to extract multi-scale features $\mathbf{F} = [\mathbf{F}^1, \mathbf{F}^2, \mathbf{F}^3, \mathbf{F}^4]$, where $\mathbf{F}^j \in \mathbb{R}^{d \times D^j H^j W^j}$ is the $j$-th layer feature map after projecting to $d$ channels and flattening the spatial dimensions $D^j$, $H^j$, and $W^j$.

Transformer for query interaction. We use the standard Transformer decoder [50], taking the UNet features $\mathbf{F}^j$ and the queries $[\mathbf{A}^j, \mathbf{B}^j, \mathbf{S}^j]$ as input at the $j$-th layer. The Transformer stacks three layers, each containing a multi-head cross-attention, a multi-head self-attention, and a feed-forward network. The concatenated query $[\mathbf{A}, \mathbf{B}, \mathbf{S}]$ is updated via cross-attention (denoted $CA$) between the queries and the image feature $\mathbf{F}^j$, as well as query self-attention (denoted $SA$). The query interaction is written as:

$$\mathbf{A}^{j},\mathbf{B}^{j},\mathbf{S}^{j} = SA\big(CA([\mathbf{A}^{j-1},\mathbf{B}^{j-1},\mathbf{S}^{j-1}],\ \mathbf{F}^{j})\big) \tag{3}$$
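To make the update in Eq. 3 concrete, here is a minimal, pure-Python sketch of one query-interaction layer. It assumes single-head, unbatched scaled dot-product attention; the learned projections, multi-head splitting, and feed-forward network of the actual Transformer layer are omitted, and all sizes are toy values.

```python
# Toy sketch of Eq. 3: queries Q = [A; B; S] attend to the flattened
# image features F^j (cross-attention CA), then to themselves (SA).
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(queries, keys, values):
    """Scaled dot-product attention: each query becomes a
    softmax-weighted average of the value vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[c] for wi, v in zip(w, values))
                    for c in range(len(values[0]))])
    return out

def query_interaction(Q, F):
    """One decoder layer: cross-attention to image features F^j,
    then self-attention among the updated queries (Eq. 3)."""
    Q = attention(Q, F, F)   # CA([A, B, S], F^j)
    Q = attention(Q, Q, Q)   # SA(...)
    return Q

# 3 queries of dimension 4 attending over 5 flattened voxel features.
Q = [[0.1, 0.2, 0.0, 0.5], [0.4, 0.0, 0.3, 0.1], [0.0, 0.1, 0.2, 0.2]]
F = [[float(i == j) for j in range(4)] for i in range(4)] + [[0.5] * 4]
Q_new = query_interaction(Q, F)
```

Because each attention step outputs convex combinations of its value vectors, the updated queries stay within the range of the inputs; the real layers add residual connections and feed-forward transforms on top of this skeleton.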

Dual-task query decoding. Because the classes in $\mathbf{A}$ and $\mathbf{B}$ are mutually inclusive (each major tumor class in $\mathbf{A}$ covers subtype classes in $\mathbf{B}$), they cannot be decoded jointly if we want to enforce a multi-class exclusivity constraint (e.g., softmax). To better capture class exclusivity, we propose a dual-task query decoding procedure that decodes the queries $[\mathbf{A},\mathbf{S}]$ and the queries $[\mathbf{B},\mathbf{S}]$ separately to perform dual-task semantic segmentation. The query decoding follows Eq. 1 with a softmax activation, written as:

$$\mathbf{M}_{A+S} = \mathrm{softmax}([\mathbf{A},\mathbf{S}]\times\mathbf{F}^{4}),\qquad \mathbf{M}_{B+S} = \mathrm{softmax}([\mathbf{B},\mathbf{S}]\times\mathbf{F}^{4}) \tag{4}$$

where $\mathbf{M}_{A+S}$ and $\mathbf{M}_{B+S}$ are the decoded voxel-wise semantic maps for the detection task and the diagnosis task, respectively.
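The decoding in Eq. 4 can be sketched as follows. This is a minimal example that assumes each query is a plain embedding dotted against every voxel feature of $\mathbf{F}^4$ (the real model may apply learned projections first); the softmax over the query axis enforces the per-voxel class exclusivity discussed above, once for the detection queries $[\mathbf{A},\mathbf{S}]$ and once for the diagnosis queries $[\mathbf{B},\mathbf{S}]$.

```python
# Minimal sketch of Eq. 4 with toy shapes: per-voxel dot products
# between query embeddings and voxel features, softmax over queries.
import math

def decode_masks(queries, features):
    """queries: list of embeddings (dim d); features: list of flattened
    voxel features (dim d). Returns a per-voxel softmax over queries."""
    masks = []
    for f in features:
        logits = [sum(qi * fi for qi, fi in zip(q, f)) for q in queries]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        s = sum(exps)
        masks.append([e / s for e in exps])
    return masks

A = [[1.0, 0.0]]                # one major-tumor (detection) query
B = [[1.0, 0.2], [0.2, 1.0]]    # two subtype (diagnosis) queries
S = [[0.0, 1.0]]                # one organ query, shared by both decodings
F4 = [[2.0, 0.1], [0.1, 2.0]]   # two flattened voxel features

M_AS = decode_masks(A + S, F4)  # detection map  M_{A+S}
M_BS = decode_masks(B + S, F4)  # diagnosis map  M_{B+S}
```

Each row of `M_AS` / `M_BS` is a proper probability distribution over the respective query set, so a voxel cannot simultaneously claim two exclusive classes within one decoding.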

End-to-end training. Our method predicts both major tumor segmentation and tumor-subtype segmentation directly from CT scans, whereas vanilla methods output only subtype segmentation maps that are then merged into major tumor segmentation maps. The loss function is a combination of cross-entropy loss and Dice loss [37], applied to both the detection output $\mathbf{M}_{A+S}$ and the diagnosis output $\mathbf{M}_{B+S}$ to enforce similarity with their corresponding targets.

Inference. The dual-task segmentation maps are inferred end-to-end and simultaneously. For detection, the tumor segmentation map from $\mathbf{M}_{A+S}$ is extracted to generate tumor instances (i.e., connected components) with tumor class label $i$. If a predicted tumor instance overlaps the ground-truth tumor, the patient is detected with tumor class $i$. For diagnosis, we perform the same tumor instance extraction from $\mathbf{M}_{B+S}$, but each tumor instance is identified as one specific tumor subtype. The patient-level cancer diagnosis category is decided by the tumor subtype with the largest connected component.
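The inference rules above can be sketched on a toy 2D slice (the actual model operates on 3D volumes); the subtype labels are illustrative.

```python
# Sketch of inference: connected components of the predicted subtype map
# become tumor instances; the patient-level diagnosis is the subtype of
# the largest connected component.
from collections import deque

def connected_components(mask):
    """4-connected components over nonzero labels; returns a list of
    (label, set_of_coordinates)."""
    h, w = len(mask), len(mask[0])
    seen, comps = set(), []
    for y in range(h):
        for x in range(w):
            if mask[y][x] == 0 or (y, x) in seen:
                continue
            label, comp, q = mask[y][x], set(), deque([(y, x)])
            seen.add((y, x))
            while q:
                cy, cx = q.popleft()
                comp.add((cy, cx))
                for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and (ny, nx) not in seen and mask[ny][nx] == label):
                        seen.add((ny, nx))
                        q.append((ny, nx))
            comps.append((label, comp))
    return comps

def patient_diagnosis(subtype_mask):
    """Patient-level category = subtype of the largest instance."""
    comps = connected_components(subtype_mask)
    label, _ = max(comps, key=lambda lc: len(lc[1]))
    return label

# Illustrative labels: 1 = one subtype, 2 = another subtype.
pred = [[1, 1, 0, 2],
        [1, 0, 0, 2],
        [0, 0, 0, 2],
        [0, 2, 0, 2]]
```

Here the right-hand column of label 2 forms the largest instance (4 voxels), so the patient-level call would be subtype 2; the same instance extraction on the detection map, plus an overlap test against the annotation, yields the patient-level detection decision.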

Table 1: Dataset description. A-F denote six different hospitals.

| Dataset | Hospital(s) | Subtype | Train | Test | Total |
| --- | --- | --- | --- | --- | --- |
| Breast | A | BC | 428 | 50 | 478 |
| CRC | A | CRC | 746 | 50 | 796 |
| Kidney | public | KT | 249 | 50 | 299 |
| Lung | A, B | LC | 2352 | 50 | 2402 |
| Pancreas | C | PDAC | 1315 | 50 | 1365 |
| Pancreas | C | nonPDAC | 727 | 50 | 777 |
| Esophagus | B, D, E | EC | 1185 | 50 | 1235 |
| Esophagus | B, D, E | nonEC | 105 | 50 | 155 |
| Stomach | A | GC | 1117 | 50 | 1167 |
| Stomach | A | nonGC | 273 | 50 | 323 |
| Liver | C | HCC | 284 | 15 | 299 |
| Liver | C | ICC | 31 | 15 | 46 |
| Liver | C | Meta | 99 | 15 | 114 |
| Liver | C | Heman | 147 | 15 | 162 |
| Abdomen (normal control) | A | – | 884 | 50 | 934 |
| CTA (normal control) | F | – | 100 | 21 | 121 |
| Total | | | 10042 | 631 | 10673 |

4 Experiments

4.1 Experiment Setup

Dataset description. Our 8-cancer CT dataset, which includes seven in-house tumor datasets (collected from five hospitals), one publicly available kidney tumor dataset [21], and a normal control dataset, is composed of 10,673 contrast-enhanced CT volumes (all in venous phase, except for lung and CT angiography being arterial phase), each from one unique patient. These CT volumes are acquired before treatment. All cancers (and tumor subtypes) in the seven in-house datasets are confirmed by pathology, with four datasets having a full spectrum of tumor subtypes, i.e., liver (4 subtypes), stomach (6), esophagus (4), and pancreas (9). The normal controls consist of 934 abdominal CT and 121 CT angiography (CTA) scans. Some of the datasets for single organs have been involved in our previous publications for other precision oncology research purposes [75, 69, 70, 71, 24, 53, 17, 68].

Tumors in each organ dataset are manually segmented slice-by-slice on CT images, using either ITK-SNAP [72] or our in-house CT annotation tool CTLabler [51], by radiologists who provide the data and specialize in the specific disease. During annotation, the radiologists also refer to other CT phases (e.g., arterial, delay), contrast-enhanced MRI, and radiology/surgery/pathology reports if necessary. All organs are segmented automatically: the breast by a nnUNet model trained on 213 additional breast cancer CT volumes with CTV (clinical target volume) masks, and the other seven organs by another nnUNet model trained on the TotalSegmentator dataset [55]. We randomly select 50 CT volumes from each tumor subtype (except for the liver tumor subtypes, 15 each, considering the relatively small liver data size), 50 abdominal, and 21 CTA volumes to form the test set. The remaining 10,042 CT volumes are used as the training set (Table 1).

Implementation details. All images were resampled to a spacing of $3\times 0.8\times 0.8$ mm ($Z\times X\times Y$). In the training stage, we randomly cropped sub-volumes of $48\times 192\times 192$ voxels from the CT scans as input. We employed the online data augmentation of nnUNet [26], including random rotation, scaling, flipping, Gaussian blurring, additive white Gaussian noise, brightness and contrast adjustment, simulated low resolution, and Gamma transformation, to diversify the training set. A balanced sampling strategy was adopted to encourage the model to sample different datasets, and different organ regions, evenly. The batch size was set to 8, with 1 sample per GPU on an 8-GPU machine. We adopted the AdamW optimizer with an initial learning rate of 3e-4. The baseline models were trained from scratch for 700 epochs, with the number of iterations per epoch equal to the training dataset size divided by the batch size. Training a nnUNet from scratch on our dataset took 40 GPU-days on Nvidia V100 GPUs. Due to this huge cost, CancerUniT was trained from the pre-trained nnUNet with a learning-rate multiplier of 0.1 for 50 epochs. For a fair comparison, we also kept tuning nnUNet for another 50 epochs beyond the 700 epochs, but observed no performance improvement.
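The balanced sampling idea can be sketched as a hypothetical two-stage sampler that first picks a dataset uniformly and then a volume within it, so small datasets are not drowned out by large ones; the dataset names and sizes below are illustrative, not the paper's exact scheme.

```python
# Hypothetical sketch of balanced sampling across datasets of very
# different sizes: uniform over datasets first, then over volumes.
import random

def balanced_sampler(datasets, n_iters, seed=0):
    """datasets: {name: list_of_volume_ids}. Returns (name, volume)
    picks where each dataset is sampled ~uniformly per iteration."""
    rng = random.Random(seed)
    names = sorted(datasets)
    picks = []
    for _ in range(n_iters):
        name = rng.choice(names)                          # uniform over datasets
        picks.append((name, rng.choice(datasets[name])))  # then over volumes
    return picks

datasets = {"lung": list(range(2352)), "kidney": list(range(249)),
            "breast": list(range(428))}
picks = balanced_sampler(datasets, n_iters=9000)
counts = {n: sum(1 for p, _ in picks if p == n) for n in datasets}
# Each dataset is drawn ~3000 times despite a ~10x size imbalance.
```

Proportional sampling would instead draw the large lung set roughly 78% of the time, starving the kidney set; the two-stage scheme equalizes exposure at the cost of repeating small-dataset volumes more often.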

In the inference stage, we employed the sliding-window strategy, where the window size equals the training patch size. In addition, Gaussian importance weighting and test-time augmentation by flipping along all axes were utilized to improve the robustness of segmentation.
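A 1D sketch of sliding-window inference with Gaussian importance weighting (the real pipeline is 3D and also averages the flipped test-time-augmentation passes, omitted here): predictions near window borders, where context is truncated, receive lower blending weight.

```python
# 1D sliding-window inference with Gaussian importance weighting:
# overlapping window predictions are blended with weights peaked at the
# window center, down-weighting border voxels.
import math

def gaussian_weights(n, sigma_scale=0.125):
    c, s = (n - 1) / 2.0, n * sigma_scale
    return [math.exp(-((i - c) ** 2) / (2 * s * s)) for i in range(n)]

def sliding_window_1d(signal, window, stride, predict):
    n = len(signal)
    acc, wsum = [0.0] * n, [0.0] * n
    w = gaussian_weights(window)
    starts = list(range(0, n - window + 1, stride))
    if starts[-1] != n - window:
        starts.append(n - window)  # make sure the tail is covered
    for s0 in starts:
        pred = predict(signal[s0:s0 + window])
        for i, p in enumerate(pred):
            acc[s0 + i] += w[i] * p
            wsum[s0 + i] += w[i]
    return [a / z for a, z in zip(acc, wsum)]  # weighted average

# Identity "model": the blended output must reproduce the input.
signal = [float(i % 7) for i in range(20)]
out = sliding_window_1d(signal, window=8, stride=4, predict=lambda x: x)
```

With a real network in place of the identity `predict`, the Gaussian weights suppress the characteristic tiling artifacts that uniform averaging leaves at window seams.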

Table 2: Patient-level tumor detection results. Sensitivity (%) and specificity (%) are reported. “DeepLesion model” is trained on the DeepLesion dataset [66] using the detection-based algorithm LENS [62]; “LENS (trained on our data)” is trained on our new 8-cancer dataset.

| Model | Br. | Crc. | Kid. | Lung | Pan. | Eso. | St. | Liv. | Sens. avg | Abd. | CTA | Spec. avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8-nnUNet ensemble | 96.0 | 74.0 | 94.0 | 74.0 | 93.0 | 83.0 | 92.0 | 86.7 | 86.6 | 80.0 | 9.5 | 59.2 |
| DeepLesion model [66] | 78.0 | 38.0 | 86.0 | 76.0 | 82.0 | 34.0 | 30.0 | 88.3 | 64.0 | 6.0 | 0.0 | 4.2 |
| LENS (trained on our data) [62] | 82.0 | 62.0 | 76.0 | 50.0 | 89.0 | 72.0 | 72.0 | 75.0 | 72.3 | 70.0 | 52.4 | 64.8 |
| nnUNet [26] | 90.0 | 82.0 | 92.0 | 94.0 | 94.0 | 76.0 | 91.0 | 85.0 | 88.0 | 92.0 | 47.6 | 78.9 |
| TransUNet [7, 26] | 94.0 | 86.0 | 94.0 | 94.0 | 94.0 | 81.0 | 94.0 | 88.3 | 90.7 | 94.0 | 47.6 | 80.3 |
| Ours | 94.0 | 92.0 | 94.0 | 94.0 | 95.0 | 89.0 | 93.0 | 95.0 | 93.3 | 96.0 | 47.6 | 81.7 |

Evaluation metrics. We consider evaluation metrics at three levels: patient-level, lesion-level, and voxel-level. For patient-level evaluation, sensitivity and specificity are computed. Lesion-level precision and recall (agnostic of tumor type) are computed based on connected-component analysis of the tumor predictions. Tumor segmentation accuracy is assessed by the Dice coefficient.
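As a minimal illustration of the patient-level metrics (the lesion-level and Dice computations follow the same counting spirit; the flag lists below are illustrative):

```python
# Patient-level metrics: sensitivity over tumor patients (was the
# annotated tumor hit?) and specificity over normal controls (was the
# scan left free of false-positive tumor predictions?).
def sensitivity(detected_flags):
    """detected_flags: for each tumor patient, True if some predicted
    instance overlapped the ground-truth tumor."""
    return sum(detected_flags) / len(detected_flags)

def specificity(clean_flags):
    """clean_flags: for each normal control, True if the model
    predicted no tumor at all."""
    return sum(clean_flags) / len(clean_flags)

tumor_patients = [True, True, False, True]   # 3 of 4 tumors detected
normals = [True, False, True, True, True]    # 1 false positive in 5
```

Note the asymmetry: sensitivity is computed per organ over tumor patients, while specificity is computed over the normal-control scans (abdominal CT and CTA).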

Baselines. We compare our method to five baselines. (i) 8-nnUNet ensemble: an ensemble of 8 separately trained nnUNet [26] models; to resolve overlapping tumor predictions, we extract the tumor connected components from the 8 model predictions and merge them with the priority of tumor size. (ii) nnUNet: a unified nnUNet trained on our dataset as a multi-organ multi-tumor segmentation task. (iii) TransUNet: a leading Transformer model for medical image segmentation [7], implemented in the nnUNet framework [26] with the same settings as (ii). (iv) DeepLesion model: a universal lesion detection model [62] trained on the DeepLesion dataset [66]. (v) LENS (trained on our data): a leading medical lesion detection algorithm, LENS [62], trained on our dataset.
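A hypothetical sketch of the ensemble-merging rule in (i): overlapping tumor instances from the 8 single-organ models are resolved with the priority of tumor size. Dropping a smaller instance whenever it overlaps an already-kept larger one is one plausible reading of this rule, and the labels and coordinates below are illustrative.

```python
# Merge tumor instances from multiple single-organ models, giving
# priority to larger instances on overlap.
def merge_by_size(instances):
    """instances: list of (organ_label, voxel_set). Returns the kept
    instances after resolving overlaps in favor of larger tumors."""
    kept, claimed = [], set()
    for label, voxels in sorted(instances, key=lambda lv: -len(lv[1])):
        if voxels & claimed:
            continue  # overlaps a larger, already-kept instance
        kept.append((label, voxels))
        claimed |= voxels
    return kept

liver = ("liver", {(0, 0), (0, 1), (1, 0)})
stomach = ("stomach", {(1, 0), (1, 1)})  # overlaps the liver instance
lung = ("lung", {(5, 5)})
merged = merge_by_size([stomach, liver, lung])
```

Here the stomach instance is discarded because it shares voxel (1, 0) with the larger liver instance, while the non-overlapping lung instance survives.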

For a fair comparison, all segmentation-based models adopt the same data augmentations following nnUNet [26] and the same training techniques, while LENS and the DeepLesion model, as detection-based methods, adopt the augmentations and training techniques of LENS [62]. All models are trained to convergence.

4.2 Main Results

Table 3: Class-agnostic lesion instance-level detection results. We treat the eight types of tumors as one class. The numbers of FN, TP, and FP lesions, plus precision and recall, are reported. The total number of ground-truth lesions in the test set is 767. Note that one patient might have several lesions among the 560 patients with tumors, and a ground-truth lesion might be matched with multiple TP components.

| Model | FN | TP | FP | Precision | Recall |
| --- | --- | --- | --- | --- | --- |
| 8-nnUNet ensemble | 209 | 568 | 1060 | 34.9% | 72.8% |
| DeepLesion model [66] | 376 | 649 | 5345 | 10.8% | 51.0% |
| LENS (trained on our data) [62] | 267 | 602 | 875 | 40.8% | 65.2% |
| nnUNet [26] | 223 | 557 | 585 | 48.8% | 70.9% |
| TransUNet [7, 26] | 169 | 648 | 726 | 47.2% | 77.9% |
| Ours | 192 | 592 | 508 | 53.8% | 75.0% |

Table 4: Voxel-level tumor semantic segmentation results (Dice coefficient).

| Model | Breast | Colorectum | Kidney | Lung | Pancreas | Esophagus | Stomach | Liver | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8-nnUNet ensemble | 0.623 | 0.474 | 0.728 | 0.415 | 0.690 | 0.661 | 0.420 | 0.703 | 0.589 |
| nnUNet [26] | 0.661 | 0.515 | 0.736 | 0.548 | 0.695 | 0.597 | 0.418 | 0.676 | 0.601 |
| TransUNet [7, 26] | 0.700 | 0.530 | 0.738 | 0.540 | 0.700 | 0.621 | 0.444 | 0.691 | 0.620 |
| Ours | 0.702 | 0.533 | 0.739 | 0.515 | 0.702 | 0.652 | 0.435 | 0.743 | 0.628 |

Table 5: Patient-level cancer diagnosis. The sensitivity (%) for each tumor subtype is reported. We categorize tumor subtypes as two classes of cancer and non-cancer tumors for pancreas, esophagus, and stomach datasets; and consider four major subtypes for liver dataset.

| Model | PDAC | nonPDAC | Pan. avg | EC | nonEC | Eso. avg | GC | nonGC | St. avg | HCC | ICC | Meta | Heman | Liv. avg | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8-nnUNet ensemble | 88.0 | 74.0 | 81.0 | 92.0 | 32.0 | 62.0 | 94.0 | 28.0 | 61.0 | 80.0 | 60.0 | 46.7 | 86.7 | 68.3 | 68.1 |
| nnUNet [26] | 92.0 | 76.0 | 84.0 | 94.0 | 12.0 | 53.0 | 96.0 | 18.0 | 57.0 | 69.0 | 69.0 | 33.3 | 80.0 | 62.8 | 64.2 |
| TransUNet [7, 26] | 94.0 | 78.0 | 86.0 | 94.0 | 22.0 | 58.0 | 96.0 | 18.0 | 57.0 | 60.0 | 80.0 | 40.0 | 80.0 | 65.0 | 66.5 |
| Ours | 90.0 | 84.0 | 87.0 | 94.0 | 36.0 | 65.0 | 82.0 | 48.0 | 65.0 | 60.0 | 80.0 | 40.0 | 86.7 | 66.7 | 70.9 |

Table 6: Ablation study on the representation of tumor queries. Average detection sensitivity (%) and specificity (%), and voxel-level tumor Dice scores are reported.

| Query representation | Sensitivity | Specificity | Dice |
| --- | --- | --- | --- |
| Plain | 89.5 | 76.1 | 0.605 |
| Parallel | 90.1 | 78.9 | 0.608 |
| Hierarchy (Ours) | 93.3 | 81.7 | 0.628 |

Table 7: Efficiency comparison. CancerUniT is 4.5x faster and 8x lighter than the assembly of single-tumor expert models (8-nnUNet).

| Model | Speed | Params |
| --- | --- | --- |
| 8-nnUNet ensemble | 187 s | 246.24 M |
| DeepLesion model [66] | 17 s | 70.94 M |
| LENS (trained on our data) [62] | 17 s | 70.94 M |
| nnUNet [26] | 22 s | 30.78 M |
| TransUNet [7, 26] | 25 s | 38.53 M |
| Ours | 42 s | 30.87 M |

Patient-level tumor detection per organ. This task evaluates whether the model can correctly localize and identify an existing tumor (agnostic of subtypes) or whether it generates false-positive tumor predictions in the normal controls. For example, if a patient has a tumor annotated in the liver, a true-positive prediction means that the model predicts a liver tumor that overlaps (Dice > 0) the ground-truth tumor annotation. We report the sensitivity for each organ and the specificity for normal controls in the test set.

As shown in Table 2, our model outperforms all baselines in terms of average sensitivity and specificity. Compared to the 8-nnUNet ensemble, our model substantially improves the sensitivity of detecting colorectum (+18%), lung (+20%), and liver (+8%) tumors, as well as the overall specificity (+21%). We also observe improvements in these organs for the other unified models, i.e., nnUNet and TransUNet, which demonstrates that unified training of multi-organ multi-tumor segmentation benefits almost every separate task, except for breast tumors (-2%). Without seeing other organs and tumors, the separately trained models produce many more false positives than the unified models, with a much lower specificity of 59.2%.

Without seeing our data, the DeepLesion [66] model has a moderate average sensitivity (64.0%) and low specificity (4.2%), hardly applicable to real clinical scenarios given such a high false-positive rate. After training the leading lesion detection algorithm LENS [62] on our data, the sensitivity for colorectum, esophagus, and stomach, as well as the specificity, are substantially improved; nevertheless, these are still lower than those of the segmentation-based models. These comparisons demonstrate that solving tumor detection as semantic segmentation is superior to using object detection methods.

Class-agnostic lesion-level tumor detection. In lesion-level evaluation, we combine all lesions into one class and extract lesion instances from the ground-truth and predicted segmentation masks to compute overall precision and recall. If a predicted lesion instance mask overlaps a ground-truth lesion, we count this prediction as a true positive. As shown in Table 3, our approach has the highest precision and the second-highest recall among all methods. Both the DeepLesion model and the 8-nnUNet ensemble produce a large number of false positives, resulting in low precision. As in the patient-level results, semantic segmentation algorithms generally do better than object detection methods. Our model outperforms the unified nnUNet model by approximately 5% in precision and 4% in recall.
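The lesion-level counting can be sketched as follows, under the stated matching rule: any overlap counts as a match, and recall is computed from the ground-truth side, since one ground-truth lesion may be matched by several TP components. Instances are represented as voxel-coordinate sets, and the example masks are illustrative.

```python
# Class-agnostic lesion-level TP/FP/FN counting with overlap matching.
def lesion_metrics(pred_instances, gt_lesions):
    """pred_instances, gt_lesions: lists of voxel-coordinate sets."""
    tp = sum(1 for p in pred_instances
             if any(p & g for g in gt_lesions))       # matched predictions
    fp = len(pred_instances) - tp                     # unmatched predictions
    fn = sum(1 for g in gt_lesions
             if not any(p & g for p in pred_instances))  # missed lesions
    precision = tp / len(pred_instances) if pred_instances else 0.0
    recall = (len(gt_lesions) - fn) / len(gt_lesions)
    return tp, fp, fn, precision, recall

preds = [{(0, 0), (0, 1)}, {(5, 5)}, {(2, 2)}]
gts = [{(0, 1), (0, 2)}, {(9, 9)}]
tp, fp, fn, precision, recall = lesion_metrics(preds, gts)
```

In this toy case, one prediction overlaps a ground-truth lesion (TP), two do not (FP), and one lesion is missed (FN), giving precision 1/3 and recall 1/2.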

Tumor segmentation. This task focuses on tumor segmentation quality, where our model ranks top in average segmentation Dice score, as shown in Table 4. Here, we still ignore the tumor subtype and treat tumors in the same organ as one label. We compare our model only with the segmentation baselines, not the detection models (DeepLesion and LENS). As in tumor detection, the second best is TransUNet, and the unified nnUNet is better than its separately trained counterpart. The improved performance of our model and TransUNet illustrates that enhancing CNN feature extraction with attention benefits multi-tumor segmentation. This observation is in line with our assumption that our query-based Transformer better exploits the similarity between inter-organ tumors, thus mutually improving the pixel-level texture differentiation of all tumors.

Tumor diagnosis. Our third evaluation focuses on the diagnostic ability to differentiate tumor types in the four organs, i.e., pancreas, esophagus, stomach, and liver, for which we have tumor subtypes including cancer and non-cancer. As shown in Table 5, our method achieves the highest overall diagnosis performance of 70.9%. Unlike the previous tumor detection and segmentation results, the single expert model is the second best (68.1%), demonstrating its strong baseline performance for diagnosis on single organs. The unified nnUNet model has a substantial performance drop (-4%) compared to its separately trained counterpart. We hypothesize that this is due to the difficulty of multi-task training: with only voxel-wise supervision, a vanilla unified nnUNet can hardly recognize numerous tumor subtypes well enough for accurate diagnosis. In contrast, our model exploits the relationship between different tumor diagnosis tasks through our query hierarchy, thus maintaining high performance and even improving over the single expert models.


Figure 3: (A) Our model can handle multiple CT protocols representing the real-world clinical practice. (B) An example of the 3D masks of 8 organs and a breast tumor; and examples of 8 types of tumors being detected, segmented, and diagnosed by our CancerUniT.

Ablation study. We perform an ablation study on the representation of tumor queries (Table 6), comparing two alternative representations: (1) parallel representation, where the detection queries and diagnosis queries are organized as two parallel groups without structural connection; and (2) plain representation, where only diagnosis queries are used, and the prediction for a major tumor $\boldsymbol{m}_{i}$ in the detection branch is obtained directly by merging the subtype tumors $\boldsymbol{n}_{i}$ in the diagnosis branch.

Efficiency. We compare the efficiency among various models in both inference speed and model size (number of parameters) as illustrated in Table 7. CancerUniT is 4.5x faster and 8x lighter than the assembly of single-tumor expert models.

Visual results in Fig. 3 show that our model can handle multiple CT protocols and is capable of detecting, segmenting, and diagnosing 8 types of major cancers. Generalizability to a public dataset is shown in the Supplementary.

5 Conclusion

In this paper, we propose a single unified tumor Transformer (CancerUniT) model to detect, segment, and diagnose eight common cancers using 3D CT scans, for the first time. CancerUniT is a query-based Transformer offering a novel, clinically inspired hierarchical tumor representation, with a dual-task query decoding stage for segmentation mask generation. We curate a large collection of CT scans of high clinical quality from 10,673 patients, including eight major types of cancers and commonly occurring non-cancer tumors (pathology-confirmed and manually annotated). Extensive quantitative evaluations demonstrate the promising performance of our new model. This moves one step closer to a universal high-performance cancer screening AI tool.

Acknowledgments. Jieneng Chen and Alan Yuille in this project were partially funded by a 2023 Patrick J. McGovern Foundation award.

References

  • [1] David A Ahlquist. Universal cancer screening: revolutionary, rational, and realizable. NPJ Precision Oncology, 2(1):23, 2018.
  • [2] Michela Antonelli, Annika Reinke, Spyridon Bakas, Keyvan Farahani, Annette Kopp-Schneider, Bennett A Landman, Geert Litjens, Bjoern Menze, Olaf Ronneberger, Ronald M Summers, et al. The medical segmentation decathlon. Nature communications, 13(1):4128, 2022.
  • [3] Diego Ardila, Atilla P Kiraly, Sujeeth Bharadwaj, Bokyung Choi, Joshua J Reicher, Lily Peng, Daniel Tse, Mozziyar Etemadi, Wenxing Ye, Greg Corrado, et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nature Medicine, 25(6):954–961, 2019.
  • [4] Wenya Linda Bi, Ahmed Hosny, Matthew B Schabath, Maryellen L Giger, Nicolai J Birkbak, Alireza Mehrtash, Tavis Allison, Omar Arnaout, Christopher Abbosh, Ian F Dunn, et al. Artificial intelligence in cancer imaging: clinical challenges and applications. CA: A Cancer Journal for Clinicians, 69(2):127–157, 2019.
  • [5] Patrick Bilic, Patrick Ferdinand Christ, Eugene Vorontsov, Grzegorz Chlebus, Hao Chen, Qi Dou, Chi-Wing Fu, Xiao Han, Pheng-Ann Heng, Jürgen Hesser, et al. The liver tumor segmentation benchmark (lits). arXiv preprint arXiv:1901.04056, 2019.
  • [6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 213–229. Springer, 2020.
  • [7] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
  • [8] Jie-Neng Chen, Shuyang Sun, Ju He, Philip HS Torr, Alan Yuille, and Song Bai. Transmix: Attend to mix for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12135–12144, 2022.
  • [9] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
  • [10] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022.
  • [11] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021.
  • [12] Chi-Tung Cheng, Jinzheng Cai, Wei Teng, Youjing Zheng, Yu-Ting Huang, Yu-Chao Wang, Chien-Wei Peng, Youbao Tang, Wei-Chen Lee, Ta-Sen Yeh, et al. A flexible three-dimensional heterophase computed tomography hepatocellular carcinoma detection algorithm for generalizable and practical screening. Hepatology Communications, 2022.
  • [13] Joshua D Cohen, Lu Li, Yuxuan Wang, Christopher Thoburn, Bahman Afsari, Ludmila Danilova, Christopher Douville, Ammar A Javed, Fay Wong, Austin Mattox, et al. Detection and localization of surgically resectable cancers with a multi-analyte blood test. Science, 359(6378):926–930, 2018.
  • [14] Jeffrey De Fauw, Joseph R Ledsam, Bernardino Romera-Paredes, Stanislav Nikolov, Nenad Tomasev, Sam Blackwell, Harry Askham, Xavier Glorot, Brendan O’Donoghue, Daniel Visentin, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine, 24(9):1342–1350, 2018.
  • [15] Konstantin Dmitriev and Arie E Kaufman. Learning multi-class segmentations from single-class datasets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9501–9511, 2019.
  • [16] Kunio Doi. Computer-aided diagnosis in medical imaging: historical review, current status and future potential. Computerized Medical Imaging and Graphics, 31(4-5):198–211, 2007.
  • [17] D Dong, M-J Fang, L Tang, X-H Shan, J-B Gao, F Giganti, R-P Wang, X Chen, X-X Wang, D Palumbo, et al. Deep learning radiomic nomogram can predict the number of lymph node metastasis in locally advanced gastric cancer: an international multicenter study. Annals of Oncology, 31(7):912–920, 2020.
  • [18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [19] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6824–6835, 2021.
  • [20] Ali Hatamizadeh, Vishwesh Nath, Yucheng Tang, Dong Yang, Holger R Roth, and Daguang Xu. Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In International MICCAI Brainlesion Workshop, pages 272–284. Springer, 2022.
  • [21] Nicholas Heller, Fabian Isensee, Klaus H Maier-Hein, Xiaoshuai Hou, Chunmei Xie, Fengyi Li, Yang Nan, Guangrui Mu, Zhiyong Lin, Miofei Han, et al. The state of the art in kidney and kidney tumor segmentation in contrast-enhanced ct imaging: Results of the kits19 challenge. Medical Image Analysis, 67:101821, 2021.
  • [22] Ahmed Hosny, Danielle S Bitterman, Christian V Guthier, Jack M Qian, Hannah Roberts, Subha Perni, Anurag Saraf, Luke C Peng, Itai Pashtan, Zezhong Ye, et al. Clinical validation of deep learning algorithms for radiotherapy targeting of non-small-cell lung cancer: an observational study. The Lancet Digital Health, 4(9):e657–e666, 2022.
  • [23] Xiaojie Huang, Junjie Shan, and Vivek Vaidya. Lung nodule detection in ct using 3d convolutional neural networks. In 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), pages 379–383. IEEE, 2017.
  • [24] Yan-qi Huang, Chang-hong Liang, Lan He, Jie Tian, Cui-shan Liang, Xin Chen, Ze-lan Ma, and Zai-yi Liu. Development and validation of a radiomics nomogram for preoperative prediction of lymph node metastasis in colorectal cancer. Journal of Clinical Oncology, 34(18):2157–2164, 2016.
  • [25] Yuankai Huo, Jinzheng Cai, Chi-Tung Cheng, Ashwin Raju, Ke Yan, Bennett A Landman, Jing Xiao, Le Lu, Chien-Hung Liao, and Adam P Harrison. Harvesting, detecting, and characterizing liver lesions from large-scale multi-phase CT data via deep dynamic texture learning. arXiv preprint arXiv:2006.15691, 2020.
  • [26] Fabian Isensee, Paul F Jaeger, Simon AA Kohl, Jens Petersen, and Klaus H Maier-Hein. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2):203–211, 2021.
  • [27] Roger Y Kim, Jason L Oke, Lyndsey C Pickup, Reginald F Munden, Travis L Dotson, Christina R Bellinger, Avi Cohen, Michael J Simoff, Pierre P Massion, Claire Filippini, et al. Artificial intelligence tool for assessment of indeterminate pulmonary nodules detected with CT. Radiology, page 212182, 2022.
  • [28] EA Klein, D Richards, A Cohn, M Tummala, R Lapham, D Cosgrove, G Chung, J Clement, J Gao, N Hunkapiller, et al. Clinical validation of a targeted methylation-based multi-cancer early detection test using an independent validation set. Annals of Oncology, 32(9):1167–1177, 2021.
  • [29] Ho Hin Lee, Yucheng Tang, Olivia Tang, Yuchen Xu, Yunqiang Chen, Dashan Gao, Shizhong Han, Riqiang Gao, Michael R Savona, Richard G Abramson, et al. Semi-supervised multi-organ segmentation through quality assurance supervision. In Medical Imaging 2020: Image Processing, volume 11313, pages 363–369. SPIE, 2020.
  • [30] Anne Marie Lennon, Adam H Buchanan, Isaac Kinde, Andrew Warren, Ashley Honushefsky, Ariella T Cohain, David H Ledbetter, Fred Sanfilippo, Kathleen Sheridan, Dillenia Rosica, et al. Feasibility of blood testing combined with PET-CT to screen for cancer and guide intervention. Science, 369(6499):eabb9601, 2020.
  • [31] Yuan Liu, Ayush Jain, Clara Eng, David H Way, Kang Lee, Peggy Bui, Kimberly Kanada, Guilherme de Oliveira Marinho, Jessica Gallegos, Sara Gabriele, et al. A deep learning system for differential diagnosis of skin diseases. Nature Medicine, 26(6):900–908, 2020.
  • [32] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
  • [33] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
  • [34] Xiangde Luo, Wenjun Liao, Jianghong Xiao, Jieneng Chen, Tao Song, Xiaofan Zhang, Kang Li, Dimitris N Metaxas, Guotai Wang, and Shaoting Zhang. Word: A large scale dataset, benchmark and clinical applicable study for abdominal organ segmentation from ct image. Medical Image Analysis, 82:102642, 2022.
  • [35] Fei Lyu, Baoyao Yang, Andy J. Ma, and Pong C. Yueni. A segmentation-assisted model for universal lesion detection with partial labels. In MICCAI, 2021.
  • [36] Scott Mayer McKinney, Marcin Sieniek, Varun Godbole, Jonathan Godwin, Natasha Antropova, Hutan Ashrafian, Trevor Back, Mary Chesus, Greg S Corrado, Ara Darzi, et al. International evaluation of an ai system for breast cancer screening. Nature, 577(7788):89–94, 2020.
  • [37] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 fourth international conference on 3D vision (3DV), pages 565–571. Ieee, 2016.
  • [38] Perry J Pickhardt. Value-added opportunistic CT screening: State of the art. Radiology, 2022.
  • [39] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [40] Johannes Rueckel, Jonathan I Sperl, Sophia Kaestle, Boj F Hoppe, Nicola Fink, Jan Rudolph, Vincent Schwarze, Thomas Geyer, Frederik F Strobl, Jens Ricke, et al. Reduction of missed thoracic findings in emergency whole-body computed tomography using artificial intelligence assistance. Quant Imaging Med Surg, 11:2486–98, 2021.
  • [41] Aaron Sodickson, Pieter F Baeyens, Katherine P Andriole, Luciano M Prevedello, Richard D Nawfel, Richard Hanson, and Ramin Khorasani. Recurrent CT, cumulative radiation exposure, and associated radiation-induced cancer risks from CT of adults. Radiology, 251(1):175, 2009.
  • [42] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7262–7272, 2021.
  • [43] Ronald M Summers. Road maps for advancement of radiologic computer-aided detection in the 21st century. Radiology, 229(1):11–13, 2003.
  • [44] Hyuna Sung, Jacques Ferlay, Rebecca L Siegel, Mathieu Laversanne, Isabelle Soerjomataram, Ahmedin Jemal, and Freddie Bray. Global cancer statistics 2020: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: A Cancer Journal for Clinicians, 71(3):209–249, 2021.
  • [45] Youbao Tang, Jinzheng Cai, Ke Yan, Lingyun Huang, Guotong Xie, Jing Xiao, Jingjing Lu, Gigin Lin, and Le Lu. Weakly-supervised universal lesion segmentation with regional level set loss. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part II 24, pages 515–525. Springer, 2021.
  • [46] You-Bao Tang, Ke Yan, Yu-Xing Tang, Jiamin Liu, Jin Xiao, and Ronald M Summers. Uldor: a universal lesion detector for CT scans with pseudo masks and hard negative example mining. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pages 833–836. IEEE, 2019.
  • [47] National Lung Screening Trial Research Team et al. The national lung screening trial: overview and study design. Radiology, 258(1):243, 2011.
  • [48] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.
  • [49] Kicky G van Leeuwen, Steven Schalekamp, Matthieu JCM Rutten, Bram van Ginneken, and Maarten de Rooij. Artificial intelligence in radiology: 100 commercially available products and their scientific evidence. European Radiology, 31(6):3797–3804, 2021.
  • [50] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [51] Fakai Wang, Chi-Tung Cheng, Chien-Wei Peng, Ke Yan, Min Wu, Le Lu, Chien-Hung Liao, and Ling Zhang. Multi-sensitivity segmentation with context-aware augmentation for liver tumor detection in CT. in submission.
  • [52] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Max-deeplab: End-to-end panoptic segmentation with mask transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5463–5474, 2021.
  • [53] Shuo Wang, He Yu, Yuncui Gan, Zhangjie Wu, Encheng Li, Xiaohu Li, Jingxue Cao, Yongbei Zhu, Liusu Wang, Hui Deng, et al. Mining whole-lung information by artificial intelligence for predicting EGFR genotype and targeted therapy response in lung cancer: a multicohort study. The Lancet Digital Health, 4(5):e309–e319, 2022.
  • [54] Yan Wang, Yuyin Zhou, Wei Shen, Seyoun Park, Elliot K Fishman, and Alan L Yuille. Abdominal multi-organ segmentation with organ-attention networks and statistical fusion. Medical image analysis, 55:88–102, 2019.
  • [55] Jakob Wasserthal, Manfred Meyer, Hanns-Christian Breit, Joshy Cyriac, Shan Yang, and Martin Segeroth. Totalsegmentator: robust segmentation of 104 anatomical structures in CT images. arXiv preprint arXiv:2208.05868, 2022.
  • [56] WHO. Global health estimates 2020: Deaths by cause, age, sex, by country and by region, 2000–2019, 2020.
  • [57] Yingda Xia, Jiawen Yao, Le Lu, Lingyun Huang, Guotong Xie, Jing Xiao, Alan Yuille, Kai Cao, and Ling Zhang. Effective pancreatic cancer screening on non-contrast CT scans via anatomy-aware transformers. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 259–269. Springer, 2021.
  • [58] Yingda Xia, Qihang Yu, Linda Chu, Satomi Kawamoto, Seyoun Park, Fengze Liu, Jieneng Chen, Zhuotun Zhu, Bowen Li, Zongwei Zhou, et al. The felix project: Deep networks to detect pancreatic neoplasms. medRxiv, pages 2022–09, 2022.
  • [59] Yutong Xie, Yong Xia, Jianpeng Zhang, Yang Song, Dagan Feng, Michael Fulham, and Weidong Cai. Knowledge-based collaborative deep learning for benign-malignant lung nodule classification on chest CT. IEEE Transactions on Medical Imaging, 38(4):991–1004, 2018.
  • [60] Yutong Xie, Jianpeng Zhang, Chunhua Shen, and Yong Xia. Cotr: Efficiently bridging cnn and transformer for 3d medical image segmentation. In International conference on medical image computing and computer-assisted intervention, pages 171–180. Springer, 2021.
  • [61] Lianyan Xu, Ke Yan, Le Lu, Weihong Zhang, Xu Chen, Xiaofei Huo, and Jingjing Lu. External and internal validation of a computer assisted diagnostic model for detecting multi-organ mass lesions in CT images. Chinese Medical Sciences Journal, 36(3):210–217, 2021.
  • [62] Ke Yan, Jinzheng Cai, Youjing Zheng, Adam P. Harrison, Dakai Jin, Youbao Tang, Yuxing Tang, Lingyun Huang, Jing Xiao, and Le Lu. Learning from multiple datasets with heterogeneous and partial labels for universal lesion detection in CT. IEEE Trans. Medical Imaging, 40(10):2759–2770, 2021.
  • [63] Ke Yan, Yifan Peng, Veit Sandfort, Mohammadhadi Bagheri, Zhiyong Lu, and Ronald M Summers. Holistic and comprehensive annotation of clinically significant findings on diverse CT images: learning from radiology reports and label ontology. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8523–8532, 2019.
  • [64] Ke Yan, Youbao Tang, Yifan Peng, Veit Sandfort, Mohammadhadi Bagheri, Zhiyong Lu, and Ronald M. Summers. Mulan: Multitask universal lesion analysis network for joint lesion detection, tagging, and segmentation. In MICCAI, 2019.
  • [65] Ke Yan, Xiaosong Wang, Le Lu, and Ronald M Summers. Deeplesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. Journal of Medical Imaging, 5(3):036501, 2018.
  • [66] Ke Yan, Xiaosong Wang, Le Lu, Ling Zhang, Adam P Harrison, Mohammadhadi Bagheri, and Ronald M Summers. Deep lesion graphs in the wild: relationship learning and organization of significant radiology image findings in a diverse large-scale lesion database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9261–9270, 2018.
  • [67] Jiancheng Yang, Yi He, Kaiming Kuang, Zudi Lin, Hanspeter Pfister, and Bingbing Ni. Asymmetric 3d context fusion for universal lesion detection. In MICCAI, 2021.
  • [68] Xiaojun Yang, Lei Wu, Weitao Ye, Ke Zhao, Yingyi Wang, Weixiao Liu, Jiao Li, Hanxiao Li, Zaiyi Liu, and Changhong Liang. Deep learning signature based on staging CT for preoperative prediction of sentinel lymph node metastasis in breast cancer. Academic Radiology, 27(9):1226–1233, 2020.
  • [69] Jiawen Yao, Kai Cao, Yang Hou, Jian Zhou, Yingda Xia, Isabella Nogues, Qike Song, Hui Jiang, Xianghua Ye, Jianping Lu, et al. Deep learning for fully automated prediction of overall survival in patients undergoing resection for pancreatic cancer: A retrospective multicenter study. Annals of Surgery, pages 10–1097, 2022.
  • [70] Jiawen Yao, Xianghua Ye, Yingda Xia, Jian Zhou, Yu Shi, Ke Yan, Fang Wang, Lili Lin, Haogang Yu, Xian-Sheng Hua, et al. Effective opportunistic esophageal cancer screening using noncontrast CT imaging. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 344–354. Springer, 2022.
  • [71] Lisha Yao, Yingda Xia, Haochen Zhang, Jiawen Yao, Dakai Jin, Bingjiang Qiu, Yuan Zhang, Suyun Li, Yanting Liang, Xian-Sheng Hua, et al. Deepcrc: Colorectum and colorectal cancer segmentation in CT scans via deep colorectal coordinate transform. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 564–573. Springer, 2022.
  • [72] Paul A. Yushkevich, Joseph Piven, Heather Cody Hazlett, Rachel Gimpel Smith, Sean Ho, James C. Gee, and Guido Gerig. User-guided 3D active contour segmentation of anatomical structures: Significantly improved efficiency and reliability. Neuroimage, 31(3):1116–1128, 2006.
  • [73] Jianpeng Zhang, Yutong Xie, Yong Xia, and Chunhua Shen. Dodnet: Learning to segment multi-organ and tumors from multiple partially labeled datasets. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1195–1204, 2021.
  • [74] Yao Zhang, Jiawei Yang, Jiang Tian, Zhongchao Shi, Cheng Zhong, Yang Zhang, and Zhiqiang He. Modality-aware mutual learning for multi-modal medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 589–599. Springer, 2021.
  • [75] Tianyi Zhao, Kai Cao, Jiawen Yao, Isabella Nogues, Le Lu, Lingyun Huang, Jing Xiao, Zhaozheng Yin, and Ling Zhang. 3d graph anatomy geometry-integrated network for pancreatic mass segmentation, diagnosis, and quantitative patient management. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13743–13752, 2021.
  • [76] Sunyi Zheng, Jiapan Guo, Xiaonan Cui, Raymond NJ Veldhuis, Matthijs Oudkerk, and Peter MA Van Ooijen. Automatic pulmonary nodule detection in CT scans using convolutional neural networks based on maximum intensity projection. IEEE Transactions on Medical Imaging, 39(3):797–805, 2019.
  • [77] Yuyin Zhou, Zhe Li, Song Bai, Chong Wang, Xinlei Chen, Mei Han, Elliot Fishman, and Alan L Yuille. Prior-aware neural network for partially-supervised multi-organ segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10672–10681, 2019.
  • [78] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. Unet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE transactions on medical imaging, 39(6):1856–1867, 2019.
  • [79] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.
  • [80] Zhuotun Zhu, Yingda Xia, Lingxi Xie, Elliot K Fishman, and Alan L Yuille. Multi-scale coarse-to-fine segmentation for screening pancreatic ductal adenocarcinoma. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 3–12. Springer, 2019.

Appendix: CancerUniT

Abstract. This document contains the Supplementary Materials for the ICCV 2023 paper "CancerUniT: Towards a Single Unified Model for Effective Detection, Segmentation, and Diagnosis of Eight Major Cancers Using a Large Collection of CT Scans". It covers the model's generalizability to public datasets (§A), model instantiation details (§B), semantic segmentation results of full spectrum tumors (§C), a comparison of universal cancer screening by CT versus blood tests (§D), and qualitative results (§E).

Appendix A Generalizability to Public Datasets

Our method aims to model multi-cancer screening holistically, i.e., detecting multiple cancers versus non-cancer. To the best of our knowledge, however, no public dataset is suitable for this problem. Nevertheless, our trained model generalizes well to three public single-tumor datasets, the MSD pancreas, liver, and lung datasets, as shown in Table 8. Note that our model is applied directly at inference time without any extra training, whereas the three single-tumor nnUNets are trained on the MSD datasets with in-domain knowledge. Specifically, the single-nnUNet experiments use 5-fold cross-validation, and our CancerUniT is tested on the same validation sets.

Despite having no prior knowledge of the data distribution, our proposed model surpasses the single-tumor expert models, achieving an average tumor detection sensitivity improvement of 3.1%. These results demonstrate the efficacy of our method for tumor detection without dataset-specific training. The ability to generalize well to public datasets while outperforming single-tumor expert models underscores the potential of our approach as a practical solution for universal cancer screening and diagnosis.

Table 8: Generalizability to three public MSD datasets [2]. Average detection sensitivity is reported. Our model is applied directly at inference time, whereas the three single-nnUNets are trained on the MSD datasets.

| Model | Pancreatic tumor | Liver tumor | Lung tumor | Avg | Speed | Param |
| --- | --- | --- | --- | --- | --- | --- |
| single-nnUNet (trained) | 88% | 97% | 90.5% | 91.8% | 66s | 92.34M |
| Ours (test) | 94.7% | 93.1% | 97% | 94.9% | 42s | 30.87M |
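The reported averages and the 3.1% gap follow directly from the per-dataset sensitivities in Table 8; a minimal arithmetic check (values copied from the table, averaged over the three datasets):

```python
# Per-dataset detection sensitivities (%) from Table 8.
single_nnunet = [88.0, 97.0, 90.5]   # pancreatic, liver, lung
cancer_unit = [94.7, 93.1, 97.0]

avg_baseline = round(sum(single_nnunet) / len(single_nnunet), 1)  # 91.8
avg_ours = round(sum(cancer_unit) / len(cancer_unit), 1)          # 94.9

# Average sensitivity improvement reported in the text.
print(avg_baseline, avg_ours, round(avg_ours - avg_baseline, 1))  # 91.8 94.9 3.1
```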

Appendix B Model Instantiation Details

In our CancerUniT, the hidden dimension of the queries is set to 32, such that the detection query $\mathbf{A}^{j}\in\mathbb{R}^{4\times 32}$, the diagnosis query $\mathbf{B}^{j}\in\mathbb{R}^{10\times 32}$, and the shared query $\mathbf{S}^{j}\in\mathbb{R}^{12\times 32}$. We adopt nnUNet [26] as the backbone to extract multi-scale features $\mathbf{F}=[\mathbf{F}^{1},\mathbf{F}^{2},\mathbf{F}^{3},\mathbf{F}^{4}]$. Note that $\mathbf{F}^{j}\in\mathbb{R}^{d\times(D\times H\times W)}$ is flattened and projected from the intermediate spatial feature $\hat{\mathbf{F}}^{j}\in\mathbb{R}^{C\times D\times H\times W}$. Specifically, $\mathbf{F}^{1}\in\mathbb{R}^{32\times(48\times 192\times 192)}$, $\mathbf{F}^{2}\in\mathbb{R}^{32\times(48\times 96\times 96)}$, $\mathbf{F}^{3}\in\mathbb{R}^{32\times(24\times 48\times 48)}$, and $\mathbf{F}^{4}\in\mathbb{R}^{32\times(12\times 24\times 24)}$. The total number of Transformer layers is set to 3, each of which contains a multi-head cross-attention, a multi-head self-attention, and a feed-forward network. Note that, at inference time, the tumor segmentation maps are extracted to generate tumor instances with class labels, and tumor instances with fewer than 200 voxels are discarded.
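The inference-time post-processing (grouping segmented voxels into tumor instances and discarding instances below the 200-voxel minimum) can be sketched as below. This is an illustrative implementation, not the authors' code: the helper name `extract_instances` is hypothetical, and it uses a simple 6-connectivity flood fill; the demo lowers the threshold so a toy volume keeps an instance.

```python
import numpy as np
from collections import deque

def extract_instances(seg, min_voxels=200):
    """Turn an integer segmentation volume into tumor instances.

    `seg` holds 0 for background and positive tumor-subtype labels.
    Connected components (6-connectivity, same label) become instances;
    components with fewer than `min_voxels` voxels are discarded,
    mirroring the post-processing described above.
    Returns a list of (class_label, voxel_count) pairs.
    """
    visited = np.zeros(seg.shape, dtype=bool)
    instances = []
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    for start in zip(*np.nonzero(seg)):
        if visited[start]:
            continue
        label = seg[start]
        queue, count = deque([start]), 0
        visited[start] = True
        # BFS flood fill over face-adjacent voxels sharing the same label.
        while queue:
            z, y, x = queue.popleft()
            count += 1
            for dz, dy, dx in offsets:
                n = (z + dz, y + dy, x + dx)
                if all(0 <= n[i] < seg.shape[i] for i in range(3)) \
                        and not visited[n] and seg[n] == label:
                    visited[n] = True
                    queue.append(n)
        if count >= min_voxels:
            instances.append((int(label), count))
    return instances

# Toy volume: one 4-voxel instance of class 2 and one isolated voxel of class 1.
vol = np.zeros((4, 4, 4), dtype=int)
vol[1, 1, 0:4] = 2   # 4 connected voxels, class 2
vol[3, 3, 3] = 1     # single voxel, class 1
print(extract_instances(vol, min_voxels=2))  # [(2, 4)] — the 1-voxel instance is dropped
```

In practice a library routine such as `scipy.ndimage.label` would replace the hand-written flood fill, but the size-filtering step is the same.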

Table 9: Voxel-level semantic segmentation results of full spectrum tumors. The Dice coefficient is reported. Note: the Dice values are calculated in a semantic manner, e.g., an HCC voxel is counted as correctly segmented only if it is predicted as the HCC subtype (not another liver tumor subtype or another tumor type).

| Model | PDAC | nonPDAC | Panc. avg | EC | nonEC | Eso avg | GC | nonGC | Stom. avg | HCC | ICC | Meta | Heman | Liver avg | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8-nnUNet ensemble | 0.750 | 0.525 | 0.638 | 0.770 | 0.433 | 0.602 | 0.441 | 0.099 | 0.270 | 0.489 | 0.552 | 0.296 | 0.784 | 0.530 | 0.510 |
| nnUNet [26] | 0.758 | 0.534 | 0.646 | 0.739 | 0.207 | 0.473 | 0.453 | 0.068 | 0.261 | 0.410 | 0.481 | 0.306 | 0.739 | 0.484 | 0.466 |
| TransUNet [7] | 0.749 | 0.553 | 0.651 | 0.744 | 0.321 | 0.533 | 0.473 | 0.128 | 0.301 | 0.411 | 0.503 | 0.353 | 0.717 | 0.496 | 0.495 |
| Ours | 0.728 | 0.560 | 0.644 | 0.738 | 0.457 | 0.597 | 0.389 | 0.187 | 0.288 | 0.368 | 0.666 | 0.305 | 0.773 | 0.528 | 0.514 |

Appendix C Semantic Segmentation Results of Full Spectrum Tumors

We evaluated our model's performance on the semantic segmentation of full spectrum tumors, a challenging task that involves segmenting multiple tumor subtypes within each organ. Segmentation quality was assessed with the multi-class Dice score, where each tumor subtype is treated as an independent semantic class.

Our model outperformed the segmentation baselines, achieving the highest average segmentation Dice score, as shown in Table 9. Note that we do not compare against detection models such as DeepLesion and LENS, since they are not designed for semantic segmentation tasks.

Our findings suggest that enhancing the query hierarchy in our model can improve the semantic segmentation of full spectrum tumors. This observation is in line with our assumption that our query-based Transformer model can more effectively explore the similarity between intra-organ tumor subtypes, leading to improved segmentation performance. Overall, our evaluation provides evidence that our proposed model can effectively address the challenges of multi-class tumor segmentation in the context of full spectrum tumors.

Appendix D Universal Cancer Screening: CT vs. Blood Test

Blood tests are currently among the most attractive tools for non-invasive multi-organ cancer screening [13, 30, 28]. CT scanning was historically considered for the same task but was limited by insufficient sensitivity and specificity [1]. With AI reading of CT as an alternative opportunistic screening tool, our approach also has strong clinical potential for cancer detection. One advantage of CT is that it is already an indispensable diagnostic imaging modality for cancer, whereas a positive blood test result requires further examinations for confirmation. With our model, clinicians obtain direct visual evidence of the detected cancer sites, and missed cancer detections can be largely reduced. Moreover, opportunistic CT screening incurs no additional cost, whereas a single blood test can cost around $\sim$1,000 US dollars.

For a relative performance comparison with CancerSeek [13], i.e., cancer vs. normal, our method achieves higher sensitivity for six out of seven cancer types: approximately +18% for stomach, +24% for pancreas, +26% for esophagus, +35% for colorectum, +34% for lung, and +57% for breast. Our average patient-level cancer detection sensitivity is 94%, versus 70% in [13]. The specificity for normal cases in venous-phase CT is 100% (blood test: >99%). We acknowledge that the results of the representative blood test [13] and ours may not be directly comparable since different test data are used. Nevertheless, this rough comparison indicates the high accuracy of the CT+AI solution and may thus re-open the door for multi-cancer screening by CT [1].

Appendix E Qualitative Results

We provide additional qualitative results on full spectrum tumors from the test set, segmented and diagnosed by our method, as shown in Fig. 4. The results demonstrate that our method not only segments the tumor regions well but also predicts the tumor subtype class correctly.

Figure 4: Qualitative results of full spectrum tumors in the test set being segmented and diagnosed by our method (best viewed in color).
