Giulio Antonelli, Paraskevas Gkolfakis, Georgios Tziatzios, Ioannis S Papanikolaou, Konstantinos Triantafyllou, Cesare Hassan
Abstract: Artificial intelligence (AI) systems, especially after the successful application of convolutional neural networks, are revolutionizing modern medicine. Gastrointestinal endoscopy has proven to be fertile terrain for the development of AI systems that aim to aid endoscopists in various aspects of their daily activity. Lesion detection is one of the two main areas in which AI can increase the diagnostic yield and abilities of endoscopists. In colonoscopy, it is well known that a substantial proportion of neoplasia is still missed, representing the major cause of interval cancer. In addition, extremely high variability in adenoma detection rate, the main key quality indicator in colonoscopy, has been extensively reported. The other domain in which AI is believed to have a considerable impact on everyday clinical practice is lesion characterization and aid in “optical diagnosis”. By predicting in vivo histology, pathology costs may be averted by the implementation of two separate but synergistic strategies, namely the “leave-in-situ” strategy for < 5 mm hyperplastic lesions in the rectosigmoid tract, and “resect and discard” for the other diminutive lesions. In this opinion review we present the currently available evidence regarding the role of AI in improving lesion detection and characterization during colonoscopy.
Key Words: Artificial intelligence; Colonoscopy; Polyp; Adenoma; Detection; Characterization
Colonoscopy and polypectomy are the mainstay in the prevention of colorectal cancer (CRC), and have been shown to reduce its incidence and mortality[1-3]. The development of quality improvement programs and performance measures, their measurement with audit and eventual retraining have led to an increase in adenoma detection rate (ADR), which is directly associated with a decrease in interval cancer (i.e., a cancer that is identified before the next recommended screening or surveillance examination)[4-6]. Notwithstanding the increasing awareness and the ever-improving quality, a substantial proportion of colorectal neoplasia is still missed during colonoscopy, variably reported between 5% and 25%, leading to an interval colorectal cancer rate ranging between 0.5 and 1 per 1000 person-years[7]. The main reasons identified for missed colorectal neoplasia are failure to recognise a lesion that is fully visible on the endoscopy screen, due to attention or recognition issues, failure to expose enough colorectal mucosa, and incomplete resection. While mucosal exposure depends on the endoscopist’s examination technique and the quality of bowel preparation, failure to recognise a polyp when visible on the endoscopy screen can be addressed and improved by the application of artificial intelligence (AI), or “deep learning” systems[8,9]. Unlike human-programmed computer systems, “deep learning” systems autonomously learn to distinguish the characteristics within the images provided using multiple levels of processing[10]. In this way, AI systems can recognize discriminatory characteristics between images that differ from those commonly used and elaborated by the human brain. In addition, AI systems developed with deep learning techniques can process images fast enough to be used in real time during an endoscopic examination. Consequently, AI systems can flag the suspect area during the endoscopic examination.
These systems have shown high accuracy when retrospectively applied to still images or stored videos, and more recently have been tested in trials during endoscopic examinations[10]. The other domain in which AI is believed to have a considerable impact on everyday clinical practice is lesion characterization and aid in “optical diagnosis”. Given the volume of colonoscopies performed, covering between 1% and 6% of the target general population per year, the financial and economic burden is substantial[11]. A relevant share of this burden is represented by the cost of post-polypectomy histology, mostly attributable to diminutive polyps, which represent over 90% of all resected lesions[12-14]. By predicting in vivo histology, such pathology costs may be averted by the implementation of two strategies, namely the “leave-in-situ” strategy for < 5 mm hyperplastic lesions in the rectosigmoid and the “resect and discard” strategy for the other diminutive lesions[15,16]. Despite acceptance by experts, the accuracy of optical diagnosis in the community setting has been suboptimal, preventing the implementation of these cost-saving interventions[17,18]. In addition, the clinical relevance of these lesions has been debated, as they are mostly either non-advanced adenomas or indolent hyperplastic polyps. Moreover, the role of pathology as reference standard has been questioned because of possible high interobserver variability, inadequate orientation or insufficient material[19]. By automating the perception phase, AI may overcome these pitfalls in both detection and characterization. Computer Aided Detection (CADe) based on deep learning can recognize in real time lesions that are present on the screen and that may have been missed by the endoscopist. Similarly, Computer Aided Characterization (CADx) can predict the histology of the lesion, providing a predicted classification to the endoscopist[8].
We present here an overview of recent literature regarding the real-time clinical application of CADe and CADx for colorectal neoplasia.
The goal of AI system development is to build a mathematical model from a set of premarked data (e.g., images) that will allow interpretation of new, unknown data with a reasonable degree of accuracy[10]. Deep learning systems autonomously “learn” (i.e., build their own algorithms) starting from libraries of labelled data (images containing a polyp) and subsequently acquire parameters that recognize a polyp in an image they have never been presented with before. AI system development can be summarised in three phases: training, validation and testing. In the training phase, an exceptionally large number of images labelled for the regions/features of interest are presented to the system, which learns to recognise the labelled features by building its own algorithms. The system is then initially tested on another set of unknown images, the validation set, on which its performance is evaluated, and the system is fine-tuned through “hyperparameters”, settings calibrated by the programmers to optimise the system’s performance. Lastly, a third, unseen set of data (the “test” set) is presented to the system to evaluate its standalone performance. Ideally, the test set should be a library of unseen images completely different from those of the training and validation sets. The final step is to test the system in a randomised controlled clinical trial, where it faces the pitfalls of clinical practice, such as suboptimal preparation, patient compliance and operator skills.
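The three-way partition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_by_patient`, the 70/15/15 proportions and the record identifiers are hypothetical, and grouping images by patient before splitting is one common way (not necessarily the one used by the cited systems) to keep the test set free of images that resemble the training data.

```python
import random
from collections import defaultdict


def split_by_patient(records, train_frac=0.70, val_frac=0.15, seed=0):
    """Partition (patient_id, image_id) pairs into train/validation/test sets.

    All images of one patient land in the same set, so the test set
    contains no frames from patients seen during training or validation.
    """
    by_patient = defaultdict(list)
    for patient_id, image_id in records:
        by_patient[patient_id].append(image_id)

    # Shuffle patients (not images) with a fixed seed for reproducibility.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n_train = int(len(patients) * train_frac)
    n_val = int(len(patients) * val_frac)
    groups = {
        "train": patients[:n_train],
        "validation": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],  # whatever remains
    }
    return {
        name: [img for p in ids for img in by_patient[p]]
        for name, ids in groups.items()
    }


# Hypothetical library: 100 patients with 5 labelled frames each.
records = [
    (f"patient_{p:03d}", f"patient_{p:03d}_frame_{k}")
    for p in range(100)
    for k in range(5)
]
splits = split_by_patient(records)
print({name: len(images) for name, images in splits.items()})
```

The validation set is the one used while tuning hyperparameters; the test set is consulted only once, for the final standalone estimate.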
In the near future it is conceivable that many different AI systems will be available. Taking into account that many endoscope manufacturers will probably include computer aided diagnosis (CAD) in their new hardware releases and that different systems will be applicable in different endoscopy systems, it will be at the discretion of the different centres to decide how to implement CAD systems in their endoscopy suites.
After a long experimental phase, in the last two years the results of the first clinical trials testing the performance of CADe systems in real-life clinical practice have been published, mostly from Chinese groups[20-25]. A summary of AI systems that are currently available is found in Table 1[20,23,26,27].
Interestingly, no clinical trial showed differences in colonoscopy withdrawal times between groups undergoing CADe examinations and controls. All published trials showed an ADR increase in the CADe groups: Wang et al[20] reported a significant (P < 0.001) ADR increase from 20% in the control group to 29% in the CADe group. Su et al[22] and Liu et al[25] reported a significant ADR increase, 39% vs 24% (P < 0.001) and 29% vs 17% (P < 0.001), respectively. All three trials had the limitation of a low ADR in the control group, raising concerns about whether AI might merely compensate, albeit partially, for poor operator technique. However, in a trial by Repici et al[23] a high baseline ADR of 40.4% in the control group was surpassed by a 54.8% ADR in the CADe group. A recent meta-analysis of published randomised controlled trials[28] has shown that the increase in ADR was consistent across all trials. Among the 4354 included patients, the ADR in the control and the CADe groups was 25.2% and 36.6%, respectively, with risk ratio (RR) = 1.44 [95% confidence interval (CI), 1.27-1.62; P < 0.01]. Sub-analysis revealed that the increase in ADR was mainly due to detection of more diminutive adenomas in all studies included in the meta-analysis. No study showed an increase in the detection of advanced adenomas (> 10 mm), while only Repici et al[23] showed higher detection rates for adenomas measuring between 6 and 9 mm in the CADe group (12.7% vs 17.2%, P < 0.05)[28]. So far, only one study[24], not included in the aforementioned meta-analysis, has shown a role of CADe in increasing the advanced adenoma detection rate. This study showed an increase in advanced adenoma detection from 1% in the control group to 3% in the CADe group, and this difference proved statistically significant. However, in this study the overall ADR was very low (8% in the control group, 16% in the CADe group); thus concerns are raised regarding the interpretation of the results.
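For readers less familiar with the risk ratio reported above, the following Python sketch shows how an RR and an approximate 95% confidence interval are obtained from the counts of a single two-arm comparison, using the standard log-transformation method. The function name `risk_ratio` and the arm sizes are assumptions made for illustration: the source gives only the pooled percentages (36.6% vs 25.2%), so 1000 patients per arm is a hypothetical choice. A crude ratio computed this way need not match the pooled RR of 1.44, because a meta-analysis weights each trial separately.

```python
import math


def risk_ratio(events_cade, n_cade, events_control, n_control, z=1.96):
    """Return (RR, lower, upper) using the log-transformation method."""
    risk_cade = events_cade / n_cade
    risk_control = events_control / n_control
    rr = risk_cade / risk_control

    # Standard error of log(RR) for two independent proportions.
    se_log_rr = math.sqrt(
        1 / events_cade - 1 / n_cade + 1 / events_control - 1 / n_control
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# HYPOTHETICAL counts: 1000 patients per arm, chosen only so that the
# detection rates equal the pooled ADRs quoted in the text.
rr, lower, upper = risk_ratio(366, 1000, 252, 1000)
print(f"RR = {rr:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```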
Table 1 Current standalone performance of approved and not approved computer aided diagnosis systems
CADe has shown interesting results for the other colonoscopy quality indicators as well. In detail, the “doppelganger” of ADR, namely adenoma miss rate (AMR), was recently reported in a back-to-back randomised trial by Wang et al[29], with an impressive improvement from a worrying 40% in the control group to a quite low (though not negligible) 13% in the CADe group. In this study the authors also performed an elegant analysis of the difference in miss rate between “visible” (i.e., exposed, but not recognised by the operating endoscopist) and “invisible” (i.e., not exposed by the endoscopist) polyps. Interestingly, they confirmed that when mucosa containing a polyp is effectively exposed by the endoscopist, CADe almost never misses the polyp [AMR-visible in the CADe group: 1.59%; polyp miss rate (PMR)-visible in the CADe group: 2.36%]. This observation further confirms the growing awareness of the importance of effectively exposing all colonic mucosa to increase neoplasia detection. Reduction in AMR was significant for diminutive (39.6% vs 13.1%, P = 0.001) and small polyps (46.9% vs 13.7%, P < 0.0001), but not for adenomas larger than 10 mm (15.3% vs 33.3%), confirming that the detection of advanced adenomas is independent of CAD use[29]. Regarding polyp detection rate (PDR), meta-analyses[28,30] have shown significantly improved colonoscopy performance in CAD groups overall (50.3% vs 34.6%; RR 1.43; 95%CI, 1.34-1.53; P < 0.01). CADe use was also associated with a higher adenoma per colonoscopy (APC) rate, irrespective of polyp size: overall APC was 0.58 vs 0.36 [RR (95%CI): 1.70 (1.53-1.89), P < 0.01], while for polyps < 5 mm, 6-9 mm and ≥ 10 mm the RR (95%CI) was 1.69 (1.48-1.84), 1.44 (1.19-1.75) and 1.46 (1.04-2.06), respectively. Lastly, a meta-analysis showed improved serrated lesion detection rates with CADe (0.06 vs 0.04, RR: 1.52; 95%CI, 1.14-2.02; P < 0.01)[30].
However, the serrated lesion miss rate was not significantly different between the two groups in the back-to-back study by Wang et al[29]. This discrepancy could be explained either by an inadequate sample size for this specific indicator, or by a CAD system that has yet to be optimized (through improved training) for serrated adenoma detection.
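The indicators discussed above follow simple definitions, sketched here in Python. The definitions used are the conventional ones for back-to-back (tandem) studies, namely adenomas found only on the second pass divided by all adenomas found in either pass, and total adenomas divided by the number of colonoscopies; the source does not spell them out. The function names and all counts are hypothetical and were chosen only to give round figures.

```python
def adenoma_miss_rate(found_first_pass, found_second_pass_only):
    """AMR in a tandem design: share of adenomas missed by the first pass."""
    total = found_first_pass + found_second_pass_only
    if total == 0:
        raise ValueError("no adenomas detected in either pass")
    return found_second_pass_only / total


def adenomas_per_colonoscopy(total_adenomas, n_colonoscopies):
    """APC: mean number of adenomas detected per examination."""
    if n_colonoscopies == 0:
        raise ValueError("no colonoscopies performed")
    return total_adenomas / n_colonoscopies


# HYPOTHETICAL counts for one study arm.
amr = adenoma_miss_rate(found_first_pass=87, found_second_pass_only=13)
apc = adenomas_per_colonoscopy(total_adenomas=58, n_colonoscopies=100)
print(f"AMR = {amr:.1%}, APC = {apc:.2f}")
```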
CADx is the other promising field of clinical application of AI in colonoscopy. While the human operator depends on the application of virtual or physical chromoendoscopy to improve visualisation of mucosal and vascular patterns in order to predict lesion histology, an AI system adequately trained on a wide library should be able to predict histology regardless of the optical visualisation modality[30]. Currently, no randomised clinical trial evaluating the performance of CADx systems is available. However, many systems are under development and their standalone performance has been evaluated. A recent meta-analysis summarised the existing literature, showing that among the 3 prospective studies on CADx[30], AI achieved an impressive 92.3% (95%CI, 88.8%-94.9%) sensitivity for polyp histology prediction and a high specificity of 89.8% (95%CI, 85.3%-93.0%). Among the considerable number of retrospective studies, similar pooled results were found[30]. It is important to note that the majority of these systems are shallow machine learning systems, since abandoned in favour of deep learning systems, and that solid data from randomised trials using real-life images will be needed before a true estimate of CADx performance can be made. Performance during live colonoscopy, where pitfalls such as inadequate bowel preparation or incomplete lesion visualisation are common, is of course the main concern for this kind of system.
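The sensitivity and specificity figures quoted above are computed by comparing each CADx prediction with the pathology result for the same polyp. A minimal Python sketch follows, assuming a binary adenoma vs non-adenoma task; the function name `sensitivity_specificity`, the label strings and the eight example polyps are hypothetical and do not come from any cited study.

```python
def sensitivity_specificity(predicted, reference, positive="adenoma"):
    """Compare CADx predictions with histology, polyp by polyp."""
    tp = fp = tn = fn = 0
    for pred, ref in zip(predicted, reference):
        if ref == positive:
            if pred == positive:
                tp += 1  # adenoma correctly predicted
            else:
                fn += 1  # adenoma predicted as non-adenoma
        else:
            if pred == positive:
                fp += 1  # non-adenoma predicted as adenoma
            else:
                tn += 1  # non-adenoma correctly predicted
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# HYPOTHETICAL data: histology (reference standard) and CADx output.
histology = ["adenoma"] * 4 + ["hyperplastic"] * 4
cadx_output = [
    "adenoma", "adenoma", "adenoma", "hyperplastic",
    "hyperplastic", "hyperplastic", "hyperplastic", "adenoma",
]
sens, spec = sensitivity_specificity(cadx_output, histology)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Because histology serves as the reference standard here, any interobserver variability among pathologists carries over directly into these estimates.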
It is well known that extensive training is needed for an endoscopist to achieve acceptable results in predicting in vivo histology of encountered lesions and that this knowledge must be regularly updated and retrained. Thus, studies measuring the advantage of CADx over the optical diagnosis performance of expert and non-expert endoscopists are to be expected. According to the limited evidence available, AI performs similarly to experts but better than non-expert endoscopists in lesion characterisation[29]. Therefore, a significant improvement in non-experts’ performance through CADx could be of great interest, both for training and for quality assurance.
A possible drawback of CADe is the potentially large number of false positive results[31]. As previously discussed, CADe systems autonomously learn their own detection algorithms, and their output therefore carries some unpredictability in the clinical setting that must be interpreted cautiously. Indeed, the system may flag frames that the endoscopist would never have selected as suspicious areas and consequently reduce colonoscopy efficiency. The endoscopist might spend an excessive amount of time discriminating between an actual false positive and a possible false negative result. Furthermore, although areas flagged by CADe must always be interpreted by trained endoscopists, it is still possible that a false positive area may result in unnecessary polypectomy with related avoidable adverse events. In a recent study[31], the authors performed a post-hoc analysis of a randomized controlled trial (RCT) on CADe performance, in which they measured the burden and clinical relevance of false positives and classified them into two broad categories: artefacts from the bowel wall and artefacts from bowel content. Overall, they found a mean of 27.3 false positive activations per colonoscopy, with nearly 90% of them due to artefacts from the bowel wall (folds, ileocecal valve, diverticula, appendiceal orifice, etc.). Interestingly, according to their measurements, less than 10% of the false positive activations resulted in additional time spent by the endoscopist in examining the flagged area, while the majority were instantly dismissed as not relevant. These results must be confirmed with other systems and in other settings.
Another domain in which CADe performance has yet to be improved is the detection of non-polypoid lesions[32]. These colorectal lesions account for a large portion of missed colorectal neoplasia and may be associated with a more aggressive biological behaviour. A recent review[32] has shown that among the published RCTs on CADe systems, some did not report the number of flat lesions included in the training sets and others did not report a sub-analysis of AI performance specifically for flat lesions. The authors concluded that, in the future development and refinement of CADe systems, additional training and validation for the recognition of the individual subtypes of non-polypoid lesions, especially non-granular laterally spreading tumors (LST-NG), is urgently needed. The authors speculate that a joint partnership between Eastern and Western centres should be prioritized to create datasets with a large number of flat lesions.
In the era of colonoscopy quality measurement and improvement, CAD systems that can integrate quality measurement and reporting have been initially evaluated[4,24]. The indicators that have been measured with CAD are caecal intubation rate, withdrawal time, and even slipping of the scope that can leave areas of the colon uninspected.
The cost-effectiveness of CAD systems has yet to be fully analysed. Only one preliminary study has been published so far, by Mori et al[33], focusing on the implementation of AI alongside a “diagnose and leave behind” strategy and showing that this can lead to substantial reductions in the annual reimbursement for colonoscopies conducted under public health insurance in Japan, England, Norway, and the United States. Additional cost-effectiveness models could further tailor this analysis, for example to the setting of organised screening programs, which in most of the Western world currently account for the greater part of the colonoscopy burden in public health systems. A considerable improvement in AI-aided colonoscopy would be the implementation of systems that integrate CADe and CADx in the same machine[34]. This could reduce costs and make clinical use more practical. Randomised trials and cost-effectiveness models combining the additional detection provided by CADe with the optical diagnosis improvement provided by CADx could pave the way to a swift implementation of these systems in clinical practice.
Artificial intelligence is a major breakthrough for the whole medical field, and endoscopy is a very fertile terrain for its development and refinement. However, it may not come without harm. Excessive reliance on AI systems may trigger a relaxation in endoscopic performance with the (un-)conscious thought that “the system is watching”. Moreover, implementation of AI may discourage endoscopists from improving their optical diagnosis skills or updating their knowledge. As already discussed, the presence of false positives may also push novice or non-expert endoscopists to perform unnecessary resections or biopsies, increasing cost and pathology burden.
In the case of CADe, this problem seems less severe, since regardless of the level of expertise we can affirm that endoscopists will be able to confirm or discard the region flagged by the AI system with a reasonable level of confidence. The “one and done” issue of ADR might be taken into account, but this is true irrespective of the presence of a CAD system.
On the contrary, when dealing with CADx, only a trained endoscopist with good confidence in optical diagnosis will be able to accept or refuse the AI characterization output and give the final diagnosis with its consequent actions. It is conceivable that non-experts might passively accept the CADx prediction without the competence to challenge it, which also raises the legal issue of who bears final responsibility for an incorrect diagnosis: the operator, the AI system developer, or the health system?
This argues against using AI accuracy to bypass suboptimal competence in optical diagnosis, and actually strengthens guideline recommendations that specifically affirm that optical diagnosis can only be performed by endoscopists who are proficient in the technique and are actively trained and audited.
We strongly believe that in every domain in which we seek AI assistance, competence is the prerequisite and not the final outcome of AI implementation.
World Journal of Gastroenterology, 2020, Issue 47