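HUANG Qi, ZHANG Lei, SHU Xin, LÜ Qing
(1. College of Computer Science, Sichuan University, Chengdu 610065; 2. Department of Breast Surgery, West China Hospital, Sichuan University, Chengdu 610065)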
Breast cancer is one of the most common cancers threatening the lives of women worldwide. Among current methods for breast cancer prevention, mammography is a widely used and effective imaging technique for early breast cancer screening and diagnosis [1]. However, traditional analysis of mammograms places a heavy workload on breast surgeons, so computer-aided detection (CAD) techniques have been introduced. Because the main component of many typical CAD approaches relies on machine learning and deep learning methods [2-4], a large amount of data is needed first. To collect enough mammograms with ground truth labels in a short time, we collaborated with doctors at West China Hospital, Sichuan University, devised an annotation process, and developed an online annotation tool that lets breast surgeons annotate mammograms collaboratively. The tool offers functions such as drawing a box on a sample to mark where the lesion lies, labeling the lesion as benign or malignant, querying a specific mammogram, checking the annotation progress, and querying the distribution of positive and negative samples.
In the first part of this paper, we review some current annotation methods used to collect mammogram data. The next two parts cover the design of the annotation process and its application in the tool. The former presents a simplified annotation design with a cross-validation process, and the latter describes the features of the tool and details its functions. Finally, we discuss the prospects of the tool and some possible extensions.
Several mammogram data collections are already available to researchers. One is the Digital Database for Screening Mammography (DDSM) [5]. In the DDSM annotation process, doctors are required to draw an irregular boundary covering each possible lesion area and to assign a pathology label of benign, benign without call-back, or malignant. The annotation also requires some detailed attribute information. Another collection is INbreast [6], whose annotation requirements are more complex: the specialists need to draw different types of boundaries according to six types of pathologies. Annotation processes of this kind can be complicated and tedious, and specialists can easily make mistakes when drawing a highly specific contour. Moreover, annotating a single case in this way can take a great deal of time, which slows down mammogram data collection and makes cross validation more time-consuming. We want to design a simpler and more convenient annotation process that helps to collect reliable mammogram data in a short time.
With simplicity and convenience in mind, we propose the following annotation process for mammogram data. For each mammogram sample, the annotation specialist is required to do three things:
1. Give the pathology result of the mammogram sample as a label: benign, malignant, or normal.
2. Draw a rectangular box, at the pixel level, covering the lesion patch if one exists.
3. Assign a confidence level of 0 or 1 to complete the annotation.
Having only three kinds of label keeps the annotation simple, since the specialist only needs to judge whether a lesion is benign or malignant. Drawing a plain rectangular box also reduces the workload, and it frees the specialist from having to draw so precisely that mistakes creep in. In addition, a rectangular box preserves as much detail of the lesion region as possible and can be used directly when developing CAD systems for the early diagnosis of breast cancer. The confidence level expresses how sure the specialist is about the result given. We set only two levels of confidence for the same reason that we limited the number of pathology labels.
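The three-part annotation described above could be represented by a small record type. The following is a minimal sketch, not the tool's actual data model: the class name `Annotation`, its field names, and the `(x, y, width, height)` box layout are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# The three pathology labels named in the annotation process.
LABELS = ("benign", "malignant", "normal")


@dataclass(frozen=True)
class Annotation:
    """One specialist's annotation of one mammogram (hypothetical layout)."""

    label: str
    # Assumed box layout: (x, y, width, height) in pixels, where (x, y) is
    # the top-left corner. None means that no box was drawn.
    box: Optional[Tuple[int, int, int, int]]
    confidence: int  # 1 = high, 0 = low

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")
        if self.confidence not in (0, 1):
            raise ValueError("confidence must be 0 or 1")
        if self.box is not None and (self.box[2] <= 0 or self.box[3] <= 0):
            raise ValueError("box width and height must be positive")
```

Validating at construction time means that a malformed annotation is rejected before it can enter the cross-validation process.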
Because this process is simple and quick, we put forward a cross-validation process in which several users annotate the same samples repeatedly. The process is illustrated in Fig. 1:
Fig. 1 Cross-validation flowchart
The whole procedure consists of the following steps:
1. Create three new empty datasets: S1, S2, S3.
2. Divide a new collection of unannotated mammograms into several batches.
3. Assign each batch of mammogram data to three different annotation specialists. Note that three is only an example and can be adjusted according to the number of specialists currently available.
4. Once a mammogram sample has been annotated three times, the annotation tool processes the pathology labels as follows: if all three labels are the same, add the sample to S1; if two of the labels are the same, add it to S2; otherwise, add it to S3.
5. For each sample in S1 and S2, the annotation tool processes the bounding boxes of the patch as follows: for any two different bounding boxes, if their overlap region accounts for more than 70 percent of their union region, merge them using the top-left coordinate of one and the bottom-right coordinate of the other; otherwise, move the sample to S3. Repeat step 5 until every sample in S1 and S2 contains only one bounding box.
6. For each sample in S1 and S2, the annotation tool takes the average of the three confidence levels as the final value.
7. For all the samples in S3, redo the annotation process in steps 4 to 7 until S3 is empty or no longer changes.
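The label handling in step 4 and the confidence averaging in step 6 can be sketched as follows. This is a minimal illustration that assumes exactly three annotations per sample; the function names `triage` and `mean_confidence` are hypothetical and are not taken from the tool.

```python
from collections import Counter


def triage(labels):
    """Map three pathology labels to a dataset name and the agreed label.

    Returns ("S1", label) when all three labels agree, ("S2", label) when
    exactly two agree, and ("S3", None) when all three differ.
    """
    if len(labels) != 3:
        raise ValueError("this sketch assumes exactly three annotations")
    top_label, top_count = Counter(labels).most_common(1)[0]
    if top_count == 3:
        return "S1", top_label
    if top_count == 2:
        return "S2", top_label
    return "S3", None


def mean_confidence(confidences):
    """Average the 0/1 confidence levels given by the annotators."""
    return sum(confidences) / len(confidences)


if __name__ == "__main__":
    print(triage(["benign", "benign", "malignant"]))
    print(mean_confidence([1, 1, 0]))
```

With more than three specialists per batch, the thresholds for S1 and S2 would have to be redefined; the source does not specify how.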
The merge operation in step 5 is illustrated in Fig. 2. The mammogram contains two boxes: bounding box 1 and bounding box 2. When the overlap region of box 1 and box 2 accounts for more than 70 percent of their union region, the two boxes are merged into box 3.
Fig. 2 The merge operation in step 5
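The merge test can be sketched as an intersection-over-union check. The code below is an illustration under stated assumptions rather than the tool's implementation: boxes are taken to be `(x1, y1, x2, y2)` corner tuples, the function names `overlap_ratio` and `merge_boxes` are hypothetical, and the merged box is assumed to be the smallest rectangle enclosing both inputs, which is one reading of "the top-left coordinate of one and the bottom-right coordinate of the other".

```python
def overlap_ratio(a, b):
    """Overlap area divided by union area for two (x1, y1, x2, y2) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_boxes(a, b, threshold=0.7):
    """Return the merged box, or None when the boxes overlap too little.

    Assumption: the merged box is the smallest rectangle enclosing both.
    A None result corresponds to moving the sample to S3 in step 5.
    """
    if overlap_ratio(a, b) > threshold:
        return (min(a[0], b[0]), min(a[1], b[1]),
                max(a[2], b[2]), max(a[3], b[3]))
    return None


if __name__ == "__main__":
    # Overlap 90, union 110: the ratio is about 0.82, so the boxes merge.
    print(merge_boxes((0, 0, 10, 10), (1, 0, 11, 10)))
```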
Each sample is allocated to all the specialists. We believe that letting the number of duplicate annotations per batch vary does not reduce efficiency much, while it can improve quality. The 70 percent threshold was chosen as a rule of thumb. After the procedure, researchers obtain two kinds of annotated mammogram data, S1 and S2: S1 contains the most reliable data and S2 the second most reliable. Both kinds can be used, especially when a large amount of data is needed. As for S3, we simply put its data back into the original dataset, since it may contain errors.
Alongside the annotation process, we developed a web-based annotation tool called LabelMamoX, which we also used to examine how the process described in the previous section works in practice. The tool provides a unified interface that can be accessed from any common platform. It is online and can be used concurrently by several doctors. The tool also offers a search function for querying the mammogram of a specific case and its annotation state.
In this section, we briefly present some key features of the tool's annotation module to illustrate how it can help to collect and build a large medical mammogram dataset and thereby accelerate the development of CAD systems for mammography.
When a doctor logs into the tool, he or she first sees the homepage, which provides a basic tutorial on using the tool. A typical mammogram data form presented to the doctors is designed as shown in Table 1.
Table 1: Mammogram data form design.
The page that contains the mammogram data form shows the doctor some basic attributes, such as which batch the table data belong to and the overall progress of that batch. Such a table makes it easy for the doctor to query and examine the data.
When the doctor chooses a specific item to annotate and enters its page, the mammogram is displayed on the left and an operation menu appears below it. (A sketch is shown in Fig. 3.)
Fig. 3 The sketch of the annotation page (a)-(b)
The operation menu provides operations such as moving, zooming in and out, and checking the state of the ROI patch. To draw the rectangular box covering a possible lesion patch, the doctor selects the second menu item, clicks anywhere on the mammogram, and drags while holding the button down. As noted earlier, such a rectangular annotation is direct and preserves as much of the raw information in the mammogram as possible. The doctor finishes drawing by ending the drag. In the bottom-right corner, the doctor can see the coordinate of the top-left corner of the rectangular ROI, as well as its width and height. These data can be exported directly and used to create patch images for CAD system development or other tasks. In addition, the tool shows a close-up of the ROI in the top-right corner, which allows the doctor to observe the details of the ROI more closely.
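As an illustration of how the exported values could be used, the sketch below cuts a patch out of an image given the top-left coordinate, width, and height of the ROI. It assumes that the image is a row-major list of pixel rows; the function name `crop_roi` is hypothetical, and the tool's actual export format is not described here.

```python
def crop_roi(image, x, y, width, height):
    """Return the ROI patch as a new list of rows.

    `image` is assumed to be a row-major list of equally long pixel rows,
    and (x, y) is the top-left corner of the ROI. Parts of the box that
    fall outside the image are clipped.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(cols, x + width), min(rows, y + height)
    return [row[x0:x1] for row in image[y0:y1]]


if __name__ == "__main__":
    # A 4 x 5 toy image whose pixel value encodes its row and column.
    image = [[r * 10 + c for c in range(5)] for r in range(4)]
    print(crop_roi(image, 1, 1, 2, 2))  # [[11, 12], [21, 22]]
```

The same slicing pattern carries over to array libraries, where the patch would typically be taken as rows `y` to `y + height` and columns `x` to `x + width`.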
After drawing the ROI patch, the doctor selects either the "benign" or the "malignant" label. If the doctor judges that the mammogram is normal, he or she can select the "normal" label (in this case there may be no need to draw a rectangular box). To ensure the quality of the collected images, the doctor also states the confidence with which he or she gives the annotation. We provide two levels of confidence [5], high and low (denoted by the values 1 and 0, respectively). The doctor can modify or even discard a sample or its annotation if there is any problem. Doing so is not strictly necessary, because a problematic sample will usually be caught by the cross-validation process. All the annotation results, including the ROIs and their labels, are uploaded by the tool to the server and handled by the process we put forward. These results are also presented to all subsequent doctors. Beyond that, the tool collects statistics about the data and the annotation state to facilitate quality control and error correction.
In this paper, we proposed a cross-validation process built on a simple annotation procedure and developed a web-based mammogram annotation tool that allows breast surgeons to give ground truth labels for mammograms. The tool helps to build a large and credible mammogram dataset and will accelerate the development of CAD systems for detecting and diagnosing breast cancer at an early stage. Furthermore, the tool can be extended to annotate other medical images, such as ultrasound images, or even more general objects for new research in computer graphics and image processing.