Technical Report|Articles in Press

Deep-Learning-Based Whole-Lung and Lung-Lesion Quantification Despite Inconsistent Ground Truth: Application to Computerized Tomography in SARS-CoV-2 Nonhuman Primate Models

Published:February 26, 2023

Rationale and Objectives

Animal modeling of infectious diseases such as coronavirus disease 2019 (COVID-19) is important for exploration of natural history, understanding of pathogenesis, and evaluation of countermeasures. Preclinical studies enable rigorous control of experimental conditions as well as pre-exposure baseline and longitudinal measurements, including medical imaging, that are often unavailable in the clinical research setting. Computed tomography (CT) imaging provides important diagnostic, prognostic, and disease-characterization information to clinicians and clinical researchers. In that context, automated deep-learning systems for the analysis of CT imaging have been broadly proposed, but their practical utility has been limited. Manual outlining of the ground truth (i.e., lung lesions) requires accurate distinctions between abnormal and normal tissues that often have vague boundaries and is subject to reader heterogeneity in interpretation. Indeed, this subjectivity manifests as wide inconsistency in manual outlines both among experts and within repeated outlines by the same expert. The application of deep-learning data-science tools has been less well-evaluated in the preclinical setting, including in nonhuman primate (NHP) models of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection/COVID-19, in which the translation of human-derived deep-learning tools is challenging. Automated segmentation of the whole lung and lung lesions provides a potentially standardized and automated method to detect and quantify disease.

Materials and Methods

We used deep-learning-based quantification of the whole lung and lung lesions on CT scans of NHPs exposed to SARS-CoV-2. We proposed a novel multi-model ensemble technique to address the inconsistency in the ground truths for deep-learning-based automated segmentation of the whole lung and lung lesions. Multiple models were obtained by training the convolutional neural network (CNN) on different subsets of the training data instead of having a single model using the entire training dataset. Moreover, we employed a feature pyramid network (FPN), a CNN that provides predictions at different resolution levels, enabling the network to predict objects with wide size variations.

Results

We achieved average Dice coefficients of 99.4% and 60.2% for whole-lung and lung-lesion segmentation, respectively. The proposed multi-model FPN outperformed the well-accepted methods U-Net (50.5%), V-Net (54.5%), and Inception (53.4%) on the challenging lesion-segmentation task. We show the application of the segmentation outputs for longitudinal quantification of lung disease in SARS-CoV-2-exposed and mock-exposed NHPs.

Conclusion

Deep-learning methods should be optimally characterized for and targeted specifically to preclinical research needs in terms of impact, automation, and dynamic quantification independently from purely clinical applications.

INTRODUCTION

Infectious diseases are the second leading cause of death worldwide, and the coronavirus disease 2019 (COVID-19) pandemic has been a sobering reminder of their potential to cause population-level death and disability. Animal modeling is particularly important for exploration of natural history, understanding of pathogenesis, and evaluation of countermeasures for high-consequence viral diseases that are either novel emerging threats (e.g., COVID-19) or are rare and/or historically understudied in humans (e.g., Ebola virus disease, Lassa fever, and Nipah virus disease) (
• Muñoz-Fontela C
• Dowling WE
• Funnell SGP
• et al.
Animal models for COVID-19.
,
• Jacob ST.
Ebola virus disease.
,
• Sattler RA
• Paessler S
• Ly H
• et al.
Animal models of lassa fever.
,
• de Wit E
• Munster VJ.
Animal models of disease shed light on Nipah virus pathogenesis and transmission.
). Preclinical studies conducted in animal biosafety level 3 or 4 (ABSL-3/4) laboratories (
• Byrum R
• Keith L
• Bartos C
• et al.
Safety precautions and operating procedures in an (A) BSL-4 laboratory: 4. medical imaging procedures.
) enable rigorous control of experimental conditions (e.g., pathogen isolate, dose, and exposure route) as well as pre-exposure baseline and longitudinal measurements that are often unavailable in the clinical setting (
• Jelicks LA
• Tanowitz HB
• Albanese C
Small animal imaging of human disease: from bench to bedside and back.
). In human COVID-19 research, pre-infection baseline images are almost never available for medical imaging evaluation, hindering the ability to use longitudinal assessment as part of natural history studies or the evaluation of responses to therapeutics.
Computed tomography (CT) imaging provides important clinical information for the diagnosis, characterization, prognostication, and evolution of disease in patients infected with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) (
• Jin C
• Chen W
• Cao Y
• et al.
Development and evaluation of an artificial intelligence system for COVID-19 diagnosis.
,
• Shi F
• Wang J
• Shi J
• et al.
Review of artificial intelligence techniques in imaging data acquisition, segmentation, and diagnosis for COVID-19.
). Evaluation of lung abnormality in a reproducible manner requires accurate segmentation and quantification of both the whole lung and of lung lesions (lung abnormality) in chest CT images of COVID-19 patients. Unfortunately, the manual process required for accurate segmentation and quantification for a large dataset is time-consuming, with low inter- and intra-observer agreement (
• Wang G
• Liu X
• Li C
• et al.
A noise-robust framework for automatic segmentation of COVID-19 pneumonia lesions from CT images.
). Even for experienced radiologists, it can take several hours to accurately segment all lesions within a large area of interest in a single high-resolution chest CT scan. Therefore, automated segmentation methods are critical to unlock the full potential of CT imaging in COVID-19 preclinical and clinical research.
In the COVID-19 clinical setting, many deep-learning-based segmentation methods have been proposed for automated segmentation of the whole lung and lung lesions (
• Jin C
• Chen W
• Cao Y
• et al.
Development and evaluation of an artificial intelligence system for COVID-19 diagnosis.
• Shi F
• Wang J
• Shi J
• et al.
Review of artificial intelligence techniques in imaging data acquisition, segmentation, and diagnosis for COVID-19.
,
• Gao K
• Su J
• Jiang Z
• et al.
Dual-branch combination network (DCN): Towards accurate diagnosis and lesion segmentation of COVID-19 using CT images.
,
• Yang D
• Xu Z
• Li W
• et al.
Federated semi-supervised learning for COVID region segmentation in chest CT using multi-national data from China.
,
• Fan DP
• Zhou T
• Ji GP
• et al.
Inf-net: Automatic covid-19 lung infection segmentation from ct images.
,
• Ronneberger O
• Fischer P
• Brox T
U-net: Convolutional networks for biomedical image segmentation.
,
• Hofmanninger J
• Prayer F
• Pan J
• et al.
Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem.
, ,
• Chaganti S
• Grenier P
• Balachandran A
• et al.
Automated quantification of CT patterns associated with COVID-19 from chest CT.
,
• Wu YH
• Gao SH
• Mei J
• et al.
Jcs: An explainable covid-19 diagnosis system by joint classification and segmentation.
,
• Vakalopoulou M
• Chassagnon G
• Bus N
• et al.
Atlasnet: Multi-atlas non-linear deep networks for medical image segmentation.
,

Chassagnon G, Vakalopoulou M, Battistella E, et al., AI-Driven CT-based quantification, staging and short-term outcome prediction of COVID-19 pneumonia, arXiv preprint 2020;arXiv:2004.12852. doi:10.48550/arXiv.2004.12852.

,
• Ouyang Xi
• Huo J
• Xia L
• et al.
Dual-sampling attention network for diagnosis of COVID-19 from community acquired pneumonia.
). Regarding lung lesions, a dual-branch combination network (DCN) for COVID-19 diagnosis that can simultaneously achieve diagnostic classification and lesion segmentation has been proposed (
• Gao K
• Su J
• Jiang Z
• et al.
Dual-branch combination network (DCN): Towards accurate diagnosis and lesion segmentation of COVID-19 using CT images.
). A federated, semi-supervised learning framework for COVID-19 lung-lesion segmentation has also been developed (
• Yang D
• Xu Z
• Li W
• et al.
Federated semi-supervised learning for COVID region segmentation in chest CT using multi-national data from China.
). That framework was effective via shared model weights, compared with fully supervised scenarios with conventional data sharing (
• Yang D
• Xu Z
• Li W
• et al.
Federated semi-supervised learning for COVID region segmentation in chest CT using multi-national data from China.
). A novel COVID-19 lung-lesion segmentation deep network (Inf-Net) was reported to automatically identify abnormal regions from chest CT slices (
• Fan DP
• Zhou T
• Ji GP
• et al.
Inf-net: Automatic covid-19 lung infection segmentation from ct images.
). Experimental results demonstrated that the semi-supervised Inf-Net framework improved learning ability and performance.
A slice-by-slice-based two-dimensional (2D) U-Net (
• Ronneberger O
• Fischer P
• Brox T
U-net: Convolutional networks for biomedical image segmentation.
) was trained for lung-lesion segmentation using a combination of several datasets designed for and derived from other diseases (
• Hofmanninger J
• Prayer F
• Pan J
• et al.
Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem.
); it was later surmised that it could be applied to COVID-19 chest CT imaging data (). However, quantitative evaluation of lung-lesion segmentation has not been reported. A number of other modeling solutions have been proposed: Dense-U-Net for automated segmentation of ground-glass opacities (GGOs) and areas of consolidation (
• Chaganti S
• Grenier P
• Balachandran A
• et al.
Automated quantification of CT patterns associated with COVID-19 from chest CT.
), an encode-decoder convolutional neural network (CNN) with an attentive feature-fusion strategy and deep-learning supervision (
• Wu YH
• Gao SH
• Mei J
• et al.
Jcs: An explainable covid-19 diagnosis system by joint classification and segmentation.
), and CovidENet (an ensemble of two-dimensional and three-dimensional [3D] CNNs based on AtlasNet (
• Vakalopoulou M
• Chassagnon G
• Bus N
• et al.
Atlasnet: Multi-atlas non-linear deep networks for medical image segmentation.
) for total lesions segmentation (

Chassagnon G, Vakalopoulou M, Battistella E, et al., AI-Driven CT-based quantification, staging and short-term outcome prediction of COVID-19 pneumonia, arXiv preprint 2020;arXiv:2004.12852. doi:10.48550/arXiv.2004.12852.

). A dual-sampling attention network with a 3D CNN has also been developed to focus on lung lesions when making diagnostic decisions (
• Ouyang Xi
• Huo J
• Xia L
• et al.
Dual-sampling attention network for diagnosis of COVID-19 from community acquired pneumonia.
).
Notably, these automated methods were trained on large human datasets. Unfortunately, segmentation models trained on human CT images do not perform well on NHP CT images; this is particularly true in the lungs, in which humans and NHPs have major anatomical differences. Automated segmentation of any tissue/organ or pathology in these models is also challenging due to wide ranges in animal age and size. In addition, defining ground truth consistently is made difficult by the vague boundaries of typical lesions (e.g., GGOs) and the similar radiodensity distributions of lung consolidation and nearby cardiac and mediastinal tissues and chest walls. These indistinct borders pose a further challenge for automated training methods, one that can be exacerbated by a small training dataset. In addition to lung-lesion segmentation, accurate segmentation of the whole-lung field in NHP models (thus defining the lung geography for inclusion and serving as the denominator) is crucial to quantification of disease between and across experimental groups. Thus far, large datasets derived from NHP CT imaging that might address these challenges have not been available. In summary, the difficulty of accurately pivoting and translating high-performing human-derived models to the NHP requires the derivation of NHP-specific models. In this study, we built deep-learning segmentation models derived from SARS-CoV-2-exposed NHP experiments in which baseline and longitudinal CT imaging was a standard experimental readout. Here we:
• 1)
propose a novel multi-model ensemble method to address inconsistency in defining ground truth towards NHP segmentation;
• 2)
describe a fully automated deep-learning-based whole-lung and lung-lesion segmentation and quantification method that is NHP-specific;
• 3)
demonstrate the application of the whole-lung and lung-lesion quantification in a preclinical experimental setting using CT images from SARS-CoV-2-exposed NHPs.

MATERIALS AND METHODS

All experiments were performed in a maximum ABSL-4 containment laboratory at the Integrated Research Facility at Fort Detrick (Frederick, MD), a facility accredited by the Association for Assessment and Accreditation of Laboratory Animal Care International. Experimental procedures were approved by the National Institute of Allergy and Infectious Diseases Division of Clinical Research Animal Care and Use Committee and conducted in compliance with the Animal Welfare Act regulations, Public Health Service policy, and the Guide for the Care and Use of Laboratory Animals (Eighth Edition).

Dataset

The study used a unique dataset composed of longitudinal CT scans of SARS-CoV-2-exposed NHPs imaged in an ABSL-4 laboratory. The dataset consisted of 92 CT images from 18 NHPs (crab-eating [cynomolgus] macaques) with an average weight of 4.3 kg at pre-exposure and an average age of 5.0 years. Of the 18 animals (92 scans), whole-lung ground-truth annotation was available for 15 animals (74 scans) from one radiologist (subset-1). Lung-lesion ground-truth annotation came from three different radiologists: 12 animals (72 scans) annotated by one radiologist (subset-2) and six animals (20 scans) annotated by two radiologists (subset-3). The final ground truth for subset-3 was determined from the intersection of both radiologists' outlines. Comparisons were made for subset-3 images between the intersection of the radiologists' outlines and the deep-learning-based automated lung-lesion segmentation. We performed subject-wise 10-fold cross-validation, which ensured that all longitudinal scans of an individual animal fell entirely within either the testing fold or the training/validation folds. Iteratively, one fold was used for testing, and the remaining nine folds were used for training/validation. For training and validation, patches were extracted from all scans in those nine folds and then split into two sets (80% for training and 20% for validation). CT images were obtained pre-exposure and for up to 30 days after exposure to SARS-CoV-2 (GenBank #MW161259 and MR952134). Complete genome sequences of the SARS-CoV-2 isolates were deposited at the National Center for Biotechnology Information (). A total of 74 of 92 and 92 of 92 scans had ground-truth segmentations of the whole lung and lung lesions, respectively. All images were acquired with a 16-slice CT scanner in helical scan mode, as part of either a Philips Gemini PET/CT or a Philips Precedence SPECT/CT system (Philips Healthcare, Cleveland, OH, USA).
The scanner parameters were set at a voltage of 140 kVp, a tube current-time product of 250 or 300 mAs per slice, an in-plane (transverse) resolution of 0.35 mm, and slice increments of 0.5 mm. The CT dose index (CTDI) was 36.8 mGy. For a scan length of 200 mm, the dose-length product (DLP) was 904.8 mGy cm, the rotation time was 0.75 seconds, and the reconstruction was iterative (Philips IMR) with a standard B filter. The maximal interior thoracic breadth of a crab-eating (cynomolgus) macaque is approximately 75 mm (
• Xie L
• Zhou Q
• Liu S
• et al.
Normal thoracic radiographic appearance of the cynomolgus monkey (Macaca fascicularis).
). In humans, this is approximately 288 mm (
• Yang S
• Kim J
• Choi SJ
• et al.
Determining average linear dimension and volume of Korean lungs for lung phantom construction.
), or nearly four times larger. Two previously trained medical doctors performed manual image segmentation under the supervision of two senior radiologists. While these radiologists are well-versed in assessing lung CT scans of humans, they are less experienced in evaluating CT scans of NHPs with SARS-CoV-2 infection, as large numbers of these images are not currently available. This study is reported in accordance with the ARRIVE guidelines (
• Kilkenny C
• Browne WJ
• Cuthill IC
• et al.
Improving bioscience research reporting: the ARRIVE guidelines for reporting animal research.
).

Data Pre-Processing

Data pre-processing is an essential step in achieving optimal results with machine learning. This step also brings all of the images to the same spatial resolution and radiodensity range. In the pre-processing step, the CT radiodensity values were thresholded to a range of -1024 to 500 Hounsfield units (HU) to include only relevant organs/tissues and exclude irrelevant organs and objects. The thresholded image intensities were then rescaled to 0-1. Of note, we did not perform any resampling, since all included scans had the same resolution (0.35 × 0.35 × 0.5 mm3).
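The windowing and rescaling described above can be sketched in a few lines of NumPy (the function name is our own; the study's pipeline code is not reproduced here):

```python
import numpy as np

HU_MIN, HU_MAX = -1024.0, 500.0  # radiodensity window used in the study


def preprocess_ct(volume_hu):
    """Clip a CT volume to the study's HU window and rescale to [0, 1]."""
    clipped = np.clip(np.asarray(volume_hu, dtype=np.float32), HU_MIN, HU_MAX)
    return (clipped - HU_MIN) / (HU_MAX - HU_MIN)
```

No resampling step appears because, as noted above, all scans already share the same voxel spacing.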

CNN Architecture

To evaluate the efficacy of the CNN-based method, we selected several well-accepted architectures for comparison. First, we applied a 3D version of U-Net (
• Çiçek Ö
• Lienkamp SS
• et al.
3D U-Net: learning dense volumetric segmentation from sparse annotation.
), an algorithm used successfully in many segmentation tasks in medical-image analyses. In addition, V-Net (
• Milletari F
• Navab N
V-net: Fully convolutional neural networks for volumetric medical image segmentation.
) was chosen because it has been effective in CT image analysis (
• Ouyang Xi
• Huo J
• Xia L
• et al.
Dual-sampling attention network for diagnosis of COVID-19 from community acquired pneumonia.
). Inception (
• Szegedy C
• Liu W
• Jia Y
• et al.
Going deeper with convolutions.
) was also applied to the dataset for comparison. We compared these architectures with a volumetric (3D)FPN (
• Lin TY
• Dollár P
• Girshick R
• et al.
Feature pyramid networks for object detection.
• Kirillov A
• Girshick R
• He K
• et al.
Panoptic feature pyramid networks.
,

Zhao Q, Sheng T, Wang Y, et al. M2det: A single-shot object detector based on multi-level feature pyramid network, In: Proceedings of the AAAI conference on artificial intelligence 2019;33.

). Recognizing objects at vastly different scales is a fundamental challenge in computer vision. A standard solution to this scale-variance issue is to use feature pyramids, which enable a model to detect objects across a wide range of scales by scanning over both positions and pyramid levels. An FPN was recently used for liver segmentation (
• Reza SMS
• Aiosa N
• et al.
Deep learning for automated liver segmentation to aid in the study of infectious diseases in nonhuman primates.
) in whole-body CT images of NHPs; it was effective for liver segmentation before and after exposure to different viruses (e.g., Ebola virus, Marburg virus, and Lassa virus) and achieved an average Dice coefficient of 94.77%. CNN training parameters and loss-function details were previously characterized (
• Reza SMS
• Aiosa N
• et al.
Deep learning for automated liver segmentation to aid in the study of infectious diseases in nonhuman primates.
). This study extended the liver segmentation method to apply to whole-lung and lung-lesion segmentation.

Training

All of the CNNs were trained using input patches of 128 × 128 × 16 voxels, with the mini-batch size set to 90. We heuristically set the input patch size to 128 × 128 × 16 to include a larger in-plane context, which helped reduce false positives, primarily seen near the heart due to partial volume effects or motion artifacts. The input patches were extracted randomly, in equal numbers from inside and outside the targeted organs/tissues. A 3D Gaussian filter with a standard deviation of 1 × 1 × 1 mm3 was applied to blur the manual masks. Parameter updates were performed using the Adam optimizer with a learning rate of 5 × 10-5 and a decay rate of 5 × 10-5. The number of epochs was set to 30; however, all training sessions stopped earlier under the early-stopping criterion, with a minimum delta of 1 × 10-5 and a patience of 4 epochs on the validation accuracy. The input patches were reshuffled before each epoch.
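The balanced patch sampling described above (equal numbers of patch centers drawn from inside and outside the target) can be illustrated as follows; the function name and NumPy implementation are our own sketch, not the study's training code:

```python
import numpy as np

PATCH_SHAPE = (128, 128, 16)  # large in-plane context, per the study


def sample_patch_centers(mask, n_patches, seed=None):
    """Draw patch-center voxels, half inside and half outside a binary mask.

    A minimal sketch of the balanced sampling described in the text; the
    real pipeline extracts full 128 x 128 x 16 patches around such centers.
    Returns an (n_patches, 3) integer array of voxel indices.
    """
    rng = np.random.default_rng(seed)
    inside = np.argwhere(mask > 0)    # candidate centers inside the target
    outside = np.argwhere(mask == 0)  # candidate centers outside the target
    n_in = n_patches // 2
    idx_in = rng.integers(0, len(inside), size=n_in)
    idx_out = rng.integers(0, len(outside), size=n_patches - n_in)
    return np.vstack([inside[idx_in], outside[idx_out]])
```

Reshuffling before each epoch, as described above, would amount to redrawing these centers with a new random state.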

Loss Function

Segmentation of any organ is often cast as a voxel-wise classification task, in which a high class imbalance (the sample ratio between classes) significantly deteriorates performance. Considering the variation of foreground/background class ratios among subjects and the hard-to-detect boundaries with nearby organs, we chose “focal” (
• Lin T-Y
• Goyal P
• Girshick R
• et al.
Focal loss for dense object detection.
) as the loss function, which is defined as
$FLoss=\begin{cases}-\alpha\,(1-g_i)^{\sigma}\log g_i & \text{if } p_i=1\\-(1-\alpha)\,g_i^{\sigma}\log(1-g_i) & \text{if } p_i=0\end{cases}$

where $g_i$ is the predicted probability that voxel $i$ belongs to the foreground, and $p_i$ is the ground-truth label. The parameter $\alpha$ is a weighting factor that adjusts for the class imbalance, and $(1-g_i)^{\sigma}$ is the modulating factor, which gives more weight to hard-to-detect samples and less weight to easy-to-detect samples during training. Empirically, $\alpha=0.3$ and $\sigma=2$ were found to be the best fit for this lung and lung-lesion segmentation task.
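A direct NumPy transcription of this loss may clarify the two branches (the study trained in TensorFlow/Keras, so this standalone function is only an illustration; the function name is ours):

```python
import numpy as np


def focal_loss(g, p, alpha=0.3, sigma=2.0, eps=1e-7):
    """Mean voxel-wise focal loss, matching the equation in the text.

    g: predicted foreground probabilities; p: binary ground-truth labels.
    alpha re-weights the foreground/background imbalance, and the
    modulating factors (1-g)^sigma and g^sigma down-weight easy voxels.
    """
    g = np.clip(np.asarray(g, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    fg = -alpha * (1.0 - g) ** sigma * np.log(g)             # p_i == 1 branch
    bg = -(1.0 - alpha) * g ** sigma * np.log(1.0 - g)       # p_i == 0 branch
    return float(np.where(np.asarray(p) == 1, fg, bg).mean())
```

Note how a confident, correct prediction (g near 1 for a foreground voxel) drives its modulating factor toward zero, so training effort concentrates on the hard boundary voxels.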

Post-Processing

The output of each CNN is a probability map at the original CT image resolution. This probability map was thresholded at an empirically chosen value of 0.5, suppressing low-probability voxels and labeling the remainder as lung/lesion. Finally, 3D connected-component processing was used to remove objects smaller than 10 mm3, a cutoff recommended by clinical experts.
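A minimal sketch of this post-processing, assuming SciPy's connected-component labeling (the study reports the thresholds but not its exact implementation, and the function name is ours):

```python
import numpy as np
from scipy import ndimage

VOXEL_MM3 = 0.35 * 0.35 * 0.5  # voxel volume at the study's resolution


def postprocess(prob_map, threshold=0.5, min_size_mm3=10.0):
    """Threshold the CNN probability map and drop 3D components < 10 mm3."""
    binary = prob_map >= threshold
    labels, n = ndimage.label(binary)  # 3D connected components
    if n == 0:
        return binary
    sizes = ndimage.sum(binary, labels, index=np.arange(1, n + 1))
    keep = sizes * VOXEL_MM3 >= min_size_mm3  # boolean flag per component
    return keep[labels - 1] & binary          # background stays False
```

At this resolution, the 10 mm3 cutoff corresponds to roughly 160 voxels, so isolated specks of noise are removed while genuine lesions survive.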

Quantitative Evaluation

To validate the effectiveness and robustness of our approach, we evaluated the segmented whole lungs and lung lesions volumetrically (3D) using five metrics: the Dice similarity coefficient, sensitivity, positive predictive value (PPV), relative absolute volume difference (RAVD), and average surface distance (ASD). These metrics are commonly used to measure segmentation performance ().
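The three overlap metrics can be computed directly from binary masks; RAVD and ASD additionally require volume differences and surface-distance extraction and are omitted from this sketch (the function name is ours):

```python
import numpy as np


def overlap_metrics(pred, gt):
    """Volumetric Dice, sensitivity, and PPV for binary 3D masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.logical_and(pred, gt).sum()        # true-positive voxels
    dice = 2.0 * tp / (pred.sum() + gt.sum())  # overlap of the two masks
    sensitivity = tp / gt.sum()                # TP / (TP + FN)
    ppv = tp / pred.sum()                      # TP / (TP + FP)
    return dice, sensitivity, ppv
```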

Statistical Analysis

The mean, standard deviation, median, maximum, and minimum were reported for the Dice coefficient, sensitivity, PPV, RAVD, and ASD. A Wilcoxon signed-rank test, a non-parametric analogue of the paired t-test, was used to assess statistical significance (p < 0.05) between the performances of any two models. All statistical analyses were performed in Python using SciPy and NumPy tools.
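In SciPy, the paired comparison described above reduces to a single call (a hedged sketch with our own variable and function names):

```python
import numpy as np
from scipy.stats import wilcoxon


def compare_models(scores_a, scores_b, alpha=0.05):
    """Paired, non-parametric comparison of two models' per-scan scores.

    Returns the two-sided p-value and whether the difference is
    significant at the study's alpha of 0.05.
    """
    _, p_value = wilcoxon(scores_a, scores_b)  # tests paired differences
    return p_value, p_value < alpha
```

The test is applied to per-scan metric values (e.g., Dice), paired by scan, which is why it suits comparisons between any two models evaluated on the same data.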

Computational Timing and Resources

Training was performed on a 96-core Intel Xeon Platinum 8168 (2.70 GHz) CPU with one Tesla V100-SXM3-32GB GPU. The average training time for the FPN for a single model of the 10-fold cross-validation procedure was approximately 14 hours. The inference time for a test scan is ≈1.5 minutes for a 3D chest CT image. All CNNs were implemented in TensorFlow and Keras. Image pre-processing and post-processing were done in Python with NumPy and scikit-image tools.

RESULTS

Whole-lung and lung-lesion segmentation were designed as two separate binary segmentation tasks.
We performed subject-wise 10-fold cross-validation for whole-lung segmentation using 74 longitudinal scans of 15 animals (subset-1). Scans were stratified such that scans from the same animal did not appear in both the training/validation folds and the test fold. Since the FPN was previously shown to be superior to U-Net and V-Net for liver segmentation in CT images of NHPs (
• Reza SMS
• Aiosa N
• et al.
Deep learning for automated liver segmentation to aid in the study of infectious diseases in nonhuman primates.
), we applied only the FPN for whole-lung segmentation. The example images shown in Figure 1 illustrate the whole lung segmented by the FPN compared to the manual masks in multiple animals. The quantitative scores for whole-lung segmentation are in Table 1.
Table 1Quantitative Scores for Whole-Lung Segmentation by Feature Pyramid Network (FPN)
           Dice (%)   Sensitivity (%)   PPV (%)   RAVD (mm3)   ASD (mm)
Mean       99.412     99.510            99.321    0.007        0.052
Std. dev.  0.608      0.968             0.720     0.010        0.047
Maximum    99.871     99.991            99.975    0.056        0.277
Minimum    95.767     93.082            96.162    0.000        0.012
For lung-lesion segmentation, we performed 10-fold cross-validation on subset-2. Since lung-lesion segmentation is more challenging than whole-lung segmentation, we built three separate models for each fold by training the FPN on three different orientations (axial, sagittal, and coronal) with the same training parameters but a new set of randomly selected input patches. The final probability map for a test scan was obtained by averaging the probabilities provided by those three models. Because we worked at the original image resolution (0.35 × 0.35 × 0.5 mm3), each orientation provided a different context for the same input patch size; training in three orientations was therefore helpful. Quantitative scores from the 10-fold cross-validation on the training dataset (subset-2) are shown in Table 2, in which the single model indicates that the FPN was trained on only one orientation, while the multi-model is the ensemble of three separate models trained on three orientations. Significant improvements (p < 0.05) were obtained by using the multi-model.
Table 2Quantitative Scores of Lesions Segmentation Using 10-Fold Cross-Validation on the Training Dataset (subset-2)
           Dice (%)      Sensitivity (%)   PPV (%)       RAVD (mm3)     ASD (mm)
Model      SM     MM     SM     MM         SM     MM     SM     MM      SM     MM
Mean       50.2   55.2   47.4   58.1       65.8   64.9   1.1    1.4     4.5    4.0
Std. dev.  19.2   19.9   21.0   22.2       26.7   24.1   2.5    4.0     9.3    9.0
Maximum    85.0   87.1   91.5   97.3       99.3   99.5   14.0   24.4    60.7   60.5
Minimum    0.0    0.0    0.0    0.0        0.0    0.0    0.0    0.0     0.3    0.2
The single model (SM) indicates that the feature pyramid network (FPN) was trained on only one orientation, while the multi-model (MM) is the ensemble of three separate models trained on three orientations. Significant improvement (p < 0.05) was obtained by the MM model.
Further, we applied all 30 models (obtained from the 10 folds with three orientations) to the subset-3 scans for prediction. The final segmentation results were obtained by averaging the 30 probability maps and thresholding at 0.5. We refer to this as our proposed multi-model FPN. The quantitative scores for subset-3 using the single-model FPN and the multi-model FPN are shown in Table 3. The single-model FPN is the model trained using all the available data in subset-2.
Table 3Quantitative Scores of Lesion Segmentation for the Test Dataset (subset-3)
           Dice (%)      Sensitivity (%)   PPV (%)       RAVD (mm3)     ASD (mm)
Model      SM     MM     SM     MM         SM     MM     SM     MM      SM     MM
Mean       54.9   60.2   76.4   76.9       48.4   56.4   1.8    1.4     5.8    3.1
Std. dev.  18.0   17.9   15.6   15.5       20.8   23.2   4.0    3.1     4.4    2.8
Maximum    76.6   80.5   99.4   99.6       80.2   85.1   17.9   13.8    15.0   11.7
Minimum    10.0   12.6   47.1   48.1       5.3    6.7    0.0    0.1     0.5    0.3
The single model (SM) is compared with the ensemble of 30 models, the multi-model (MM). Significant improvement (p < 0.05) was obtained by the MM model.
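The multi-model ensembling used above (averaging per-model probability maps, then thresholding at 0.5) can be sketched as (the function name is ours):

```python
import numpy as np


def ensemble_predict(prob_maps, threshold=0.5):
    """Average per-model probability maps and threshold into a binary mask.

    `prob_maps` is a sequence of same-shaped arrays, e.g., the 30 maps
    from 10 folds x 3 orientations described in the text.
    """
    return np.mean(prob_maps, axis=0) >= threshold
```

Averaging before thresholding lets models that disagree on a voxel effectively vote, which is how the ensemble smooths over the ground-truth inconsistency of any single training subset.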
In addition to our proposed multi-model FPN, multiple well-accepted CNNs (U-Net, V-Net, and Inception) were also applied by training, in the traditional manner, a single model with all available training subjects (subset-2) for comparison. The single models were trained with the same 3D patches (128 × 128 × 16 voxels) and in one orientation (axial). Figure 2 shows examples of the segmented lung lesions provided by those CNNs alongside the ground truths. The CT images show that consolidated lung lesions (orange arrow) appear iso-dense with the surrounding chest walls, while the boundary between GGO and normal lung tissue is vague (red arrow). The comparison among the CNNs by Dice coefficient, with p-values for statistical significance, is shown in Figure 3.
To understand the effect of lung-lesion ground-truth inconsistency, we calculated the Dice coefficient between the two radiologists' outlines to be 63%, whereas the proposed method achieved a Dice coefficient of 60.2%. Figure 4 shows generally similar agreement in the Dice coefficients between the two radiologists and the proposed method.
We also applied the proposed multi-model FPN to segment lung lesions in SARS-CoV-2-exposed and mock-exposed (exposed to viral media vehicle without virus) NHPs. The baseline-corrected lesion volumes were plotted from Day 0 (pre-exposure) to 8 days post-exposure. Figure 5 shows a representative longitudinal change in lesion volume in a SARS-CoV-2-exposed animal compared to a mock-exposed animal. It is important to note that lesion volume is shown here to demonstrate how these segmentation models may be used in research settings to assess disease progression. It is inappropriate to assess the models' performance using lesion volume, as it does not take into account the position of the lesions, which is critically captured using the metrics described above (i.e., Dice, sensitivity, PPV, RAVD, and ASD).

DISCUSSION

Perhaps the most challenging aspect of automating whole-lung/lung-lesion segmentation is the accurate delineation of anatomic boundaries between consolidated lung lesions and neighboring cardiac and pleuromediastinal tissues and between GGOs and normal lung tissue. Partial volume effects of nearby organs and cardiac motion artifacts pose additional challenges. This automated NHP lung-segmentation study obtained a 99.4% Dice coefficient using 10-fold cross-validation on 74 scans. The proposed multi-model FPN outperformed the conventional single-model U-Net, V-Net, Inception, and FPN by specific metrics (Table 2), although a few outliers lowered the overall scores (Fig 3). A limitation of this initial work is intrinsic to the current SARS-CoV-2 NHP model; that is, our training dataset included mild cases only, and whether these results will generalize to more severe disease is uncertain. Additionally, since some animals were scanned more often than others, the model may be biased toward the more frequently scanned animals. However, there were 4.2 ± 2.0 scans per animal; therefore, we do not believe this bias was significant. The methods used to reduce overfitting (e.g., cross-validation and testing on a hold-out dataset) further mitigated these biases. False positives in the segmentation outputs also represent a generic challenge for any automated segmentation method in medical-imaging analysis. Future work to address this will apply a cascaded CNN (
• Reza SMS
• Roy S
• Park DM
• et al.
Cascaded convolutional neural networks for spine chordoma tumor segmentation from MRI.
,
• Aswathy AL
• Vinod Chandra SS
Cascaded 3D UNet architecture for segmenting the COVID19 infection from lung CT volume.
) that uses the auto-context method and has been shown effective in reducing false positives. Overall, the segmentation performance via Dice coefficient of the proposed multi-model FPN for the training (55.2%) and the test datasets (60.2%) (Table 2 and Table 3) are consistent, supporting that the model was not over fit in this context, irrespective of other overfitting paradigms (species, disease state, inoculation methodologies, local CT scan tools, and other environmental, demographic, or human factors).
We believe this work has broad application in this and other high-consequence viral pathogen animal models, as automated segmentation of the lung and lesions is essential to identifying and defining potential imaging biomarkers of disease. This is particularly true in SARS-CoV-2 NHP models, in which disease is often inapparent by clinical signs at the cage side; indeed, higher-resolution imaging may be required to characterize and track disease, rather than relying on virologic readouts alone. The multi-model FPN employed in this study was recently applied to extract radiomic features from lung CT scans (
• Castro MA
• Reza S
• Chu WT
• et al.
Determination of reliable whole-lung CT features for robust standard radiomics and delta-radiomics analysis in a crab-eating macaque model of COVID-19: stability and sensitivity analysis. Medical Imaging 2022: Biomedical Applications in Molecular, Structural, and Functional Imaging.
). Similarly, the segmented whole lung can also be used as a region of interest for lung-lobe segmentation and a post-processing mask in lung-lesions segmentation (
• Doel T
• Gavaghan DJ
• Grau V
Review of automatic pulmonary lobe segmentation methods from CT.
).
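Using the segmented whole lung as a post-processing mask, as suggested above, amounts to suppressing any lesion prediction that falls outside the lung. A minimal sketch (illustrative function name and arrays, not the authors' code):

```python
import numpy as np

def mask_lesions_to_lung(lesion_prob, lung_mask, threshold=0.5):
    """Keep lesion predictions only inside the segmented whole lung."""
    lesion = np.asarray(lesion_prob) >= threshold
    return np.logical_and(lesion, np.asarray(lung_mask).astype(bool))

# Toy 2D slice: lesion probabilities and a lung mask
prob = np.array([[0.9, 0.8, 0.2],
                 [0.7, 0.1, 0.6]])
lung = np.array([[0, 1, 1],
                 [0, 1, 1]])
print(mask_lesions_to_lung(prob, lung).astype(int))
```

Confident lesion calls in the first column are discarded because they lie outside the lung mask, which is one simple way such a mask reduces false positives from mediastinal or chest-wall tissue.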
The human thorax is roughly four times larger than the macaque thorax (
• Xie L
• Zhou Q
• Liu S
• et al.
Normal thoracic radiographic appearance of the cynomolgus monkey (Macaca fascicularis).
,
• Yang S
• Kim J
• Choi SJ
• et al.
Determining average linear dimension and volume of Korean lungs for lung phantom construction.
), so we acquired scans at a higher resolution than is typical for a human chest scan. The deep-learning algorithm is expected to perform better when the resolution of the scan, relative to the size of the subject, is high. In this study, the subjects were smaller than humans but the image resolution was higher; thus (not accounting for differences in disease presentation), we expect these methods to perform comparably on human scans with a similar relative resolution.
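The notion of matching relative resolution across species can be operationalized by resampling volumes to a common voxel spacing before inference. A sketch using `scipy.ndimage.zoom` with illustrative spacing values (the target of 0.5 mm is an assumption for the example, not a protocol value from this study):

```python
import numpy as np
from scipy.ndimage import zoom

def resample_isotropic(volume, spacing_mm, target_mm=0.5):
    """Resample a CT volume to isotropic voxels of target_mm (linear interpolation)."""
    factors = [s / target_mm for s in spacing_mm]
    return zoom(volume, factors, order=1)

# Toy volume with anisotropic spacing: 1.0 mm slices, 0.5 mm in-plane
vol = np.zeros((10, 10, 10), dtype=np.float32)
out = resample_isotropic(vol, spacing_mm=(1.0, 0.5, 0.5), target_mm=0.5)
print(out.shape)  # (20, 10, 10)
```

Resampling a human scan and a macaque scan so that each voxel covers a comparable fraction of the thorax would be one way to test the cross-species transfer suggested here.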
Another potential application of automated segmentation is as a platform to enable highly resolved phenotyping of lung-lesions (
• Gattinoni L
• Chiumello D
• Caironi P
• et al.
COVID-19 pneumonia: different respiratory treatments for different phenotypes?.
), i.e., incorporating segmented-abnormality characteristics (e.g., density, shape, and within-lesion radiomic features) toward specific subtyping of consolidation, GGOs, and interstitial thickening (crazy paving). Our proposed method may be applied to this challenging segmentation task and to various infectious-disease analyses, such as quantifying disease longitudinally (

Shan F, Gao Y, Wang J, et al., Lung infection quantification of COVID-19 in CT images with deep learning. arXiv preprint 2020;arXiv:2003.04655. doi:10.48550/arXiv.2003.04655.

), including assessing the response to therapeutic intervention, diagnostic classification, grading of disease severity, or defining or discovering novel imaging biomarkers. Disease progression may be estimated for each subject by constructing and comparing a percentage-lesion-volume curve over time (
• Solomon J
• Douglas D
• Johnson R
• et al.
New image analysis technique for quantitative longitudinal assessment of lung pathology on CT in infected rhesus macaques.
).
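A percentage-lesion-volume curve of the kind described above reduces each scan to a single scalar: lesion volume as a fraction of whole-lung volume. A minimal sketch with toy masks (the values are illustrative, not study data):

```python
import numpy as np

def percent_lesion_volume(lesion_mask, lung_mask):
    """Lesion burden as a percentage of whole-lung voxel volume."""
    lung_voxels = np.count_nonzero(lung_mask)
    lesion_in_lung = np.count_nonzero(np.logical_and(lesion_mask, lung_mask))
    return 100.0 * lesion_in_lung / lung_voxels

# Toy longitudinal series: lesion appears, grows, then partially resolves
lung = np.ones((4, 4, 4), dtype=bool)            # 64 "lung" voxels
timepoints = []
for n_lesion in (0, 8, 16, 4):                   # lesion voxels per scan day
    lesion = np.zeros_like(lung)
    lesion.flat[:n_lesion] = True
    timepoints.append(percent_lesion_volume(lesion, lung))
print(timepoints)  # [0.0, 12.5, 25.0, 6.25]
```

Plotting such values against days post-exposure yields the per-subject progression curve; with physical voxel spacing, the same counts convert directly to absolute volumes in mL.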

CONCLUSIONS

In a preclinical NHP model, we developed and characterized a novel multi-model deep-learning-based method for automated whole-lung and lung-lesion segmentation and quantification, notably despite inconsistent ground truths. We believe this work serves as a platform to enable better quantification of the amount and type of lung disease in a SARS-CoV-2 NHP model in need of improved measures of disease, rather than measures of viral infection alone. This work further enables the characterization and quantification of SARS-CoV-2-related pulmonary disease that is otherwise clinically invisible; moreover, these meaningful disease readouts are critical additions to the evaluation of therapeutic countermeasures in the NHP model. If and as more severe NHP SARS-CoV-2/COVID-19 models are developed, we expect these tools to serve a similar purpose on the severe end of the disease spectrum.
Future work may explore applying this approach to other high-consequence viral diseases with lung involvement, including NHP disease models of Lassa virus, Ebola virus, and Nipah virus infections. Future work may also include a boosted ensemble (

Reza S, Butman JA, Park DM, et al. AdaBoosted deep ensembles: getting maximum performance out of small training datasets. In: International Workshop on Machine Learning in Medical Imaging. Springer, Cham; 2020; 12436: 572-582. https://doi.org/10.1007/978-3-030-59861-7_58.

) method to combine the outputs of the multiple models instead of averaging them. More broadly, deep-learning tools for image processing need to be developed and customized to specific preclinical settings, in the context of specific preclinical hypotheses, to provide greater automation, reproducibility, and accuracy. Further cross-fertilization between clinical and preclinical data science will be required for deep-learning tools to achieve their intended impact.
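The contrast between plain averaging and a boosted combination can be sketched as a weighted vote. The weights below are illustrative placeholders for values a boosting procedure (e.g., AdaBoost-style weights derived from each model's training error) would supply; this is not the cited method's implementation:

```python
import numpy as np

def weighted_vote(prob_maps, weights, threshold=0.5):
    """Combine model probability maps by a weighted average, then binarize.

    With equal weights this reduces to the plain ensemble mean; a boosting
    procedure would instead assign larger weights to stronger models.
    """
    w = np.asarray(weights, dtype=float)
    w /= w.sum()                                   # normalize to sum to 1
    combined = np.tensordot(w, np.stack(prob_maps), axes=1)
    return combined >= threshold

# Toy case: model 1 is trusted 3x as much as models 2 and 3
maps = [np.array([0.9, 0.3]), np.array([0.2, 0.8]), np.array([0.6, 0.4])]
print(weighted_vote(maps, weights=[3.0, 1.0, 1.0]).astype(int))  # [1 0]
```

With equal weights the second voxel would be labeled positive (mean 0.5); the weighted vote instead follows the most trusted model, which is the behavior a boosted ensemble is intended to exploit.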

Acknowledgments

The authors thank the members of the Comparative Medicine team at the National Institutes of Health (NIH) National Institute of Allergy and Infectious Diseases (NIAID) Division of Clinical Research (DCR) Integrated Research Facility at Fort Detrick (IRF-Frederick), Fort Detrick, Frederick, MD, USA, for handling and caring for the macaques. We thank Philip Sayre (IRF-Frederick) for collecting the imaging data. We thank Claudia Calcagno, Venkatesh Mani, and Anya Crane (IRF-Frederick) for critically editing the manuscript and Jiro Wada (IRF-Frederick) for preparing figures.
This work was supported in part through Laulima Government Solutions, LLC, prime contract with NIAID (Contract No. HHSN272201800013C). M.A.C., T.K.C., S.B., S.M.A., and G.W. performed this work as employees or affiliates of Laulima Government Solutions, LLC. C.L.F. and J.H.K. performed this work as employees of Tunnell Government Services (TGS), a subcontractor of Laulima Government Solutions, LLC, under Contract No. HHSN272201800013C. This work was also supported in part with federal funds from the NIH National Cancer Institute (NCI), under Contract No. 75N910D00024, Task Order No. 75N91019F00130. (I.C. and J.S. were supported by the Clinical Monitoring Research Program Directorate, Frederick National Laboratory for Cancer Research). This project has also been partially funded by the NIH Clinical Center, Center for Infectious Disease Imaging (CIDI) (W.T.C. and S.R.). This work was also supported by the NIH Center for Interventional Oncology and the NIH Intramural Targeted Anti-COVID-19 (ITAC) Program, funded by the National Institute of Allergy and Infectious Diseases.
The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Department of Health and Human Services or of the institutions and companies affiliated with the authors, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.
References

• Wang G
• Liu X
• Li C
• et al.
A noise-robust framework for automatic segmentation of COVID-19 pneumonia lesions from CT images.
IEEE Trans Med Imaging. 2020; 39: 2653-2663
• Muñoz-Fontela C
• Dowling WE
• Funnell SGP
• et al.
Animal models for COVID-19.
Nature. 2020; 586 (Epub 2020 Sep 23): 509-515. https://doi.org/10.1038/s41586-020-2787-6
• Jacob ST.
Ebola virus disease.
Nat Rev Dis Primers. 2020; 6: 13. https://doi.org/10.1038/s41572-020-0147-3
• Sattler RA
• Paessler S
• Ly H
• et al.
Animal models of Lassa fever.
Pathogens. 2020; 9: 197. https://doi.org/10.3390/pathogens9030197
• de Wit E
• Munster VJ.
Animal models of disease shed light on Nipah virus pathogenesis and transmission.
J Pathol. 2015; 235: 196-205. https://doi.org/10.1002/path.4444
• Byrum R
• Keith L
• Bartos C
• et al.
Safety precautions and operating procedures in an (A) BSL-4 laboratory: 4. medical imaging procedures.
J Vis Exp. 2016; 116: e53601
• Jelicks LA
• Tanowitz HB
• Albanese C
Small animal imaging of human disease: from bench to bedside and back.
Am J Pathol. 2012; 182: 294-295
• Jin C
• Chen W
• Cao Y
• et al.
Development and evaluation of an artificial intelligence system for COVID-19 diagnosis.
Nat Commun. 2020; 11: 1-14
• Shi F
• Wang J
• Shi J
• et al.
Review of artificial intelligence techniques in imaging data acquisition, segmentation, and diagnosis for COVID-19.
IEEE Rev Biomed Eng. 2020; 14: 4-15
• Gao K
• Su J
• Jiang Z
• et al.
Dual-branch combination network (DCN): Towards accurate diagnosis and lesion segmentation of COVID-19 using CT images.
Med Image Anal. 2021; 67: 101836
• Yang D
• Xu Z
• Li W
• et al.
Federated semi-supervised learning for COVID region segmentation in chest CT using multi-national data from China, Italy, Japan.
Med Image Anal. 2021; 70: 101992
• Fan DP
• Zhou T
• Ji GP
• et al.
Inf-net: Automatic covid-19 lung infection segmentation from ct images.
IEEE Trans Med Imaging. 2020; 39: 2626-2637
• Ronneberger O
• Fischer P
• Brox T
U-net: Convolutional networks for biomedical image segmentation.
in: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham; 2015
• Hofmanninger J
• Prayer F
• Pan J
• et al.
Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem.
Eur Radiol Exp. 2020; 4: 1-13
1. https://github.com/JoHof/lungmask. Accessed December 22, 2023.

• Chaganti S
• Grenier P
• Balachandran A
• et al.
Automated quantification of CT patterns associated with COVID-19 from chest CT.
• Wu YH
• Gao SH
• Mei J
• et al.
Jcs: An explainable covid-19 diagnosis system by joint classification and segmentation.
IEEE Trans Image Process. 2021; 30: 3113-3126
• Vakalopoulou M
• Chassagnon G
• Bus N
• et al.
Atlasnet: Multi-atlas non-linear deep networks for medical image segmentation.
in: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham; 2018
2. Chassagnon G, Vakalopoulou M, Battistella E, et al., AI-Driven CT-based quantification, staging and short-term outcome prediction of COVID-19 pneumonia, arXiv preprint 2020;arXiv:2004.12852. doi:10.48550/arXiv.2004.12852.

• Ouyang X
• Huo J
• Xia L
• et al.
Dual-sampling attention network for diagnosis of COVID-19 from community acquired pneumonia.
IEEE Trans Med Imaging. 2020; 39: 2595-2605
3. https://www.ncbi.nlm.nih.gov/sars-cov-2/. Accessed December 22, 2022.

• Xie L
• Zhou Q
• Liu S
• et al.
Normal thoracic radiographic appearance of the cynomolgus monkey (Macaca fascicularis).
PLoS One. 2014; 9: e84599. https://doi.org/10.1371/journal.pone.0084599
• Yang S
• Kim J
• Choi SJ
• et al.
Determining average linear dimension and volume of Korean lungs for lung phantom construction.
Health Phys. 2021; 120: 487-494. https://doi.org/10.1097/HP.0000000000001280
• Kilkenny C
• Browne WJ
• Cuthill IC
• et al.
Improving bioscience research reporting: the ARRIVE guidelines for reporting animal research.
PLoS Biol. 2010; 8: e1000412
• Çiçek Ö
• Lienkamp SS
• et al.
3D U-Net: learning dense volumetric segmentation from sparse annotation.
in: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham; 2016
• Milletari F
• Navab N
V-net: Fully convolutional neural networks for volumetric medical image segmentation.
in: 2016 fourth international conference on 3D vision (3DV). IEEE, 2016
• Szegedy C
• Liu W
• Jia Y
• et al.
Going deeper with convolutions.
in: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015
• Lin TY
• Dollár P
• Girshick R
• et al.
Feature pyramid networks for object detection.
in: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017
• Kirillov A
• Girshick R
• He K
• et al.
Panoptic feature pyramid networks.
in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019
4. Zhao Q, Sheng T, Wang Y, et al. M2det: A single-shot object detector based on multi-level feature pyramid network, In: Proceedings of the AAAI conference on artificial intelligence 2019;33.

• Reza SMS
• Aiosa N
• et al.
Deep learning for automated liver segmentation to aid in the study of infectious diseases in nonhuman primates.
• Lin T-Y
• Goyal P
• Girshick R
• et al.
Focal loss for dense object detection.
in: Proceedings of the IEEE international conference on computer vision. 2017
5. https://loli.github.io/medpy/metric.html. Accessed December 15, 2022.

• Reza SMS
• Roy S
• Park DM
• et al.
Cascaded convolutional neural networks for spine chordoma tumor segmentation from MRI.
Medical Imaging 2019: Biomedical Applications in Molecular, Structural, and Functional Imaging. 10953. SPIE, San Diego, California, United States; 2019: 487-493. https://doi.org/10.1117/12.2514000
• Aswathy AL
• Vinod Chandra SS
Cascaded 3D UNet architecture for segmenting the COVID-19 infection from lung CT volume.
Sci Rep. 2022; 12: 3090
• Castro MA
• Reza S
• Chu WT
• et al.
Determination of reliable whole-lung CT features for robust standard radiomics and delta-radiomics analysis in a crab-eating macaque model of COVID-19: stability and sensitivity analysis.
Medical Imaging 2022: Biomedical Applications in Molecular, Structural, and Functional Imaging. SPIE, San Diego, California, United States. 2022; 12036: 548-564
• Doel T
• Gavaghan DJ
• Grau V
Review of automatic pulmonary lobe segmentation methods from CT.
Comput Med Imaging Graph. 2015; 40: 13-29
• Gattinoni L
• Chiumello D
• Caironi P
• et al.
COVID-19 pneumonia: different respiratory treatments for different phenotypes?.
Intensive Care Med. 2020; 46: 1099-1102. https://doi.org/10.1007/s00134-020-06033-2
6. Shan F, Gao Y, Wang J, et al., Lung infection quantification of COVID-19 in CT images with deep learning. arXiv preprint 2020;arXiv:2003.04655. doi:10.48550/arXiv.2003.04655.

• Solomon J
• Douglas D
• Johnson R
• et al.
New image analysis technique for quantitative longitudinal assessment of lung pathology on CT in infected rhesus macaques.
in: 2014 IEEE 27th International Symposium on Computer-Based Medical Systems. IEEE, 2014. https://doi.org/10.1109/CBMS.2014.59
7. Reza S, Butman JA, Park DM, et al. AdaBoosted deep ensembles: getting maximum performance out of small training datasets. In: International Workshop on Machine Learning in Medical Imaging. Springer, Cham; 2020; 12436: 572-582. https://doi.org/10.1007/978-3-030-59861-7_58.