Transfer Learning and Fine Tuning in Breast Mammogram Abnormalities Classification on CBIS-DDSM Database

Article history: Received: 15 January, 2020 Accepted: 18 February, 2020 Online: 11 March, 2020


Introduction
This work is an extension of our work originally presented in IWS-SIP 2019 [1] about mammogram abnormalities classification using Transfer Learning (TL) with Mobilenet [2] and Nasnet [3]. In this paper, we again address the classification of mammogram abnormalities using the CBIS-DDSM [4] dataset, but we extend the transfer learning experiments to other ImageNet pre-trained convolutional neural network (ConvNet) models such as Resnet, Resnext, and Xception, to name a few. Finally, Fine Tuning (FT) is used to address the overfitting problem and improve on previous results on the CBIS-DDSM dataset.
Despite the increase in understanding of breast cancer as a disease, it is still a major public health problem worldwide because of the incidence and mortality rates it presents [5]. According to the International Agency for Research on Cancer (IARC), this illness is the second most frequent form of cancer among women worldwide with 2,088,849 (11.6%) new cases and 626,679 (6.6%) deaths [6]. The mammogram exam remains the gold standard for screening, mainly because it is the only screening test that has been proven to reduce mortality [7]. However, mammography has some limitations, such as the variability of its sensitivity, which is inversely proportional to breast density, the false positive and false negative rates, and the patient's exposure to radiation [7]. Other screening tests available are ultrasound, magnetic resonance imaging (MRI), tomosynthesis, and infrared thermography [8]-[9]; in most cases, these screening tests are used as adjunct tests.
The diagnosis from a mammogram exam relies on the radiologist's experience for detection. However, 10% of all women screened for cancer are called back for additional testing, and as few as 0.5% of them are diagnosed with breast cancer [10]. This shows that it is important to design CAD systems that aid specialists, and help train new ones, in breast lesion detection. A "classic" CAD system comprises 5 main stages: image pre-processing, image segmentation or region of interest (ROI) definition, feature extraction and selection, classification, and performance evaluation [9,11]. However, this model is changing due to advances in the field of machine learning, specifically in Deep Learning, which makes it possible to automatically learn representations of data with multiple levels of abstraction through deep convolutional neural networks [12]. For instance, in the field of computer vision, the classification of natural images has shown a marked increase in performance since 2012, when the AlexNet ConvNet model presented in [13], designed to classify natural images into 1000 categories, achieved a 15.3% top-5 test error rate. As a matter of fact, the stages of feature extraction and classification can be solved directly by a ConvNet [14]. This reduces the need for hand-engineered features, which were traditionally used to create the feature vector, because a ConvNet is able to synthesize its own feature vector [15]. All of this confirms the point indicated in [11] that developments in both imaging techniques and computer science enhance the interpretation of medical images.
The success of ConvNets and deep learning in computer vision tasks such as image classification relies heavily on the number of examples used to train the model under the supervised learning paradigm [16]. Unfortunately, public mammogram datasets are not "deep enough". In this context, transfer learning and fine tuning are deep learning techniques that can aid the development of sufficiently accurate classifiers by transferring knowledge from another domain where large datasets are available. One of the main problems when training with a small number of examples is overfitting. Transfer Learning and Fine Tuning help to overcome this disadvantage of working with ConvNets.
In this work, we aim to classify region of interest images from mass tumors of the CBIS-DDSM [4] dataset. We extend our previous experiments presented in [1] by using different ConvNet models and also the Fine Tuning technique in order to increase the performance of the classifier of breast mammogram abnormalities into benign and malignant. Our research results indicate that Fine Tuning is able to train an accurate classifier and overcome overfitting. Also, we have included the ROC curve metric to measure the performance of the classifiers studied here.
The remainder of this paper is organized as follows: In Section 2, we perform a review of machine learning concepts related to the current research; specifically convolutional neural networks, transfer learning and fine tuning. Literature review of related works in the field is in Section 3. Our proposed experimental method, dataset, and model are presented in Section 4. Section 5 presents experimental data results. Finally, discussion and future works are presented in Section 6 and 7, respectively.

Convolutional Neural Networks
Models based on the Convolutional Neural Network (ConvNet) architecture have been able to achieve highly accurate results in image classification and detection tasks on the ImageNet [17] dataset, under the supervised learning paradigm and back-propagation. That is the case of residual networks, proposed in [18], which achieved a 3.57% error rate on the ImageNet test set in 2015.
Traditional pattern recognition classifiers rely on a hand-designed feature extractor that derives relevant information from the raw input data [16]. Thus, the feature extraction step aims to reduce the dimension of the data while characterizing the raw input data (image, sound, etc.) meaningfully so that a trainable classifier is able to categorize its feature vectors [16,19]. However, the design of the feature extractor requires specialized knowledge about the data (hand-engineering) that, in some cases, could be unknown [19]. On the contrary, ConvNets eliminate the feature extraction process by absorbing it into their architecture [16]. As pointed out in [14], the structure of a ConvNet combines both the feature extraction and classification steps in one single model that is trained with backpropagation; the feature extraction task is, therefore, learned from data in the first layers of the model, while the last fully connected layers constitute the classifier. The LeNet-5 ConvNet, proposed in [15] to solve the handwritten classification task, reduces the input image of 32 × 32 into a vector of 120 elements that is called the feature vector. Thus, the feature vector can be used with any type of trainable classifier to solve the classification task. In fact, this approach is used in [19], where LeNet-5 is used as a black box feature extractor for several Support Vector Machines (SVM) that are trained based on it. Something similar is performed by [14], where the authors build an AlexNet-like [13] model, train it on a large dataset, and then use the trained model as a feature extractor to train new classifiers; however, their approach is more similar to a transfer learning set-up, as will be discussed in Section 2.2.
The depth of ConvNets, measured in number of layers, has been increasing since 2012 in order to obtain better results. However, the number of layers cannot be increased indefinitely due to the vanishing and exploding gradient problems. In order to overcome these problems, the structure of the traditional ConvNet, comprised basically of convolutional and pooling layers, has been revisited. Examples of those architecture designs are found in the Residual Networks [18], MobileNet [2], Inception [20] and NasNet [3] ConvNets. Also, the study of regularization functions has helped to avoid the overfitting of the networks during training [21,22]. For a review of the state of the art in Convolutional Neural Networks the reader may consult [12,15,23,24].
Thus, ConvNets have some advantages compared with traditional artificial neural networks (ANN): reduction of training parameters by shared weights, local connections and object location invariance [12]. An ANN depends on all the connections between its layers, which increases the number of parameters to train and makes the training of the model more computationally expensive. On the contrary, a ConvNet reduces the number of parameters trained because the convolution operation is local. Yet, the main disadvantages of the ConvNet model are the training time, which may be long since it is a hyper-parametrized model, and the susceptibility to overfitting. The most important aspect of ConvNets, as a deep learning method, is that they are a Representation Learning method that automatically discovers a representation of the data that is used for classification and detection tasks [12], as previously discussed. This is especially important because it implies that carefully hand-engineered feature extractors are not required. Recall that the feature extractor is embedded in the design of the ConvNet and, therefore, learned from data.
The success of a ConvNet relies on its architecture as well as on training the model with a sufficient number of samples. Unfortunately, public mammogram datasets do not have as many examples as the number used per category in the ImageNet dataset. In order to overcome this difficulty, TL and FT are studied as a means to train deep ConvNets to classify mammogram abnormalities. In this Section we review the concepts of TL and FT as they are used in this article.
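The parameter saving obtained from local connections and shared weights can be illustrated with a small count. The sketch below compares a 5 × 5 convolution with 6 filters applied to a 32 × 32 single-channel image (the input size mentioned for LeNet-5) against a hypothetical fully connected layer that produces the same number of outputs. The function names and the choice of 6 filters are assumptions made only for this illustration.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter; independent of the image size."""
    return filters * (kernel_h * kernel_w * in_channels + 1)


def dense_params(n_inputs, n_outputs):
    """One weight per input-output pair plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


# 32 x 32 grayscale input, 5 x 5 kernels, no padding -> 28 x 28 maps per filter.
n_inputs = 32 * 32 * 1
n_outputs = 28 * 28 * 6

conv_count = conv2d_params(5, 5, in_channels=1, filters=6)
dense_count = dense_params(n_inputs, n_outputs)

print(conv_count)   # 156
print(dense_count)  # 4821600
```

Under these assumptions the convolutional layer needs several orders of magnitude fewer trainable parameters than the dense layer with the same output size.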

Transfer Learning
A definition of the term is found in [25] and [26]. In their work, the purpose of TL is defined as improving the performance of a learning algorithm in a target learning task $T_T$ (i.e. pathology classification) over a target domain $D_T$ (i.e. mammogram ROI images) by using the knowledge of the learning algorithm trained in a source learning task $T_S$ (i.e. 1000 category classification) over a source domain $D_S$ (i.e. natural images) which is larger than the target domain, where:

$$D_S \neq D_T \tag{1}$$

$$T_S \neq T_T \tag{2}$$

Depending on the relations defined by (1) and (2), different categories of TL are defined in the literature. However, it is important to consider that computer vision tasks are particular and different from their corresponding data mining tasks. Thus, despite the fact that a mammogram image is very different from a natural image, the visual properties of objects in an image are general (e.g. edges, textures, shapes, etc.).
In a ConvNet, the knowledge is represented by the value of the weights trained by the back-propagation algorithm on each layer. Therefore, TL implies using the pre-trained ConvNet as a feature extractor or replacing the last original layer with a set of layers that are trained to obtain the target learning task desired. The latter is the approach that we have followed in our experiments. One of the main advantages in this technique is that training time is reduced by not re-training the whole ConvNet, but only the added layers or a trainable classifier in the case of using the pre-trained ConvNet as a feature extractor itself.

Fine Tuning
In this case, some of the last layers of the pre-trained ConvNet are re-trained with the new images $I \in D_T$. Thus, the ConvNet is divided in two parts. Let us define γ as the layer from which the re-training of the ConvNet will occur. If the original ConvNet model has L layers, similarly to TL, we can replace the last original layer and add some layers in order to obtain the desired target learning task. Unlike TL, we also choose a number r of layers before the last one that are also to be trained. This means that the original weights from layer 0 to layer γ − 1 are preserved, or frozen. FT requires more computing resources and training time, since the number of parameters to be trained is increased by the r layers that are added to the training queue.
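The freezing rule described above can be sketched in a few lines. The `Layer` class is a hypothetical stand-in for a deep learning framework layer that exposes a boolean `trainable` flag, and the 6-layer network is invented for the illustration; neither is taken from the experiments reported here.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Stand-in for a framework layer exposing a `trainable` flag."""
    name: str
    trainable: bool = True


def freeze_before(layers, gamma):
    """Freeze layers 0..gamma-1 and leave layers gamma..L-1 trainable."""
    for index, layer in enumerate(layers):
        layer.trainable = index >= gamma
    return layers


# Hypothetical 6-layer network, fine tuned from layer gamma = 4.
network = [Layer(f"layer_{i}") for i in range(6)]
freeze_before(network, gamma=4)
print([layer.trainable for layer in network])
# [False, False, False, False, True, True]
```

Setting γ equal to the number of original layers would freeze the whole pre-trained model, which corresponds to the TL set-up in which only the added layers are trained.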

Over and Underfitting
A machine learning algorithm may suffer from two problems when training: overfitting and underfitting. The former reduces the capacity of the model to predict new unseen data, which means that the model has a high variance. The latter means that the model is not complex enough to reflect the nature of the data and find a pattern [27]. ConvNet models are characterized by being overparameterized, which means that the number of parameters of the model exceeds the size of the training data [28]. The overfitting problem becomes visible when plotting the train vs validation accuracy curves of the model: the difference between the train and validation curves should be minimal. In order to overcome overfitting in ConvNets, Data Augmentation, Regularization, and Early Stopping are usually used. Data augmentation is a basic strategy that consists of increasing the size of the dataset by applying transformations to the original images (e.g. rotation, zoom, reflection, etc.). On the other hand, regularization techniques aim to penalize extreme parameter weight values (e.g. $L_2$ regularization) [27] or to control the co-adaptation between neurons (e.g. Dropout) [13]. Early Stopping is also considered a regularization technique; it interrupts training when the performance of the ConvNet degrades on the validation set. This prevents the model from learning a form of statistical noise [29].
As discussed earlier, TL and FT may also prevent overfitting since the ConvNet model is not retrained as a whole; in other words, TL and FT have fewer parameters to learn compared to training a model from randomly initialized weights. In the present work, Dropout [30] and Early Stopping have been used as regularization techniques together with data augmentation; these are changes introduced in this work that differ from our previous experiment.
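A minimal sketch of the Early Stopping rule is given below. It assumes that the monitored quantity is the validation loss and that training stops after `patience` consecutive epochs without improvement; the loss values are hypothetical and the function name is illustrative.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the epoch index at which training stops, or None if it never does.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses: improvement stalls after epoch 2.
losses = [0.70, 0.62, 0.58, 0.59, 0.60, 0.61, 0.57]
print(early_stopping_epoch(losses, patience=3))  # 5
```

A larger patience tolerates longer plateaus: with `patience=5` the same sequence is never interrupted, because the loss improves again at epoch 6.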

Search Process
In our previous work [1], we used the methodology by [31] in order to find relevant works for study. Table 1 shows the search string designed to retrieve information from: Springer Link, Science Direct (Elsevier), IEEE Xplore, Scopus, Web of Science, ACM digital library, and PubMed. A total of 174 studies (including our previous work) were gathered from the repositories, as shown in Table 2. From these studies, a total of 32 primary documents were retrieved according to a study selection process in which documents were required to have an experimental methodology with results regarding the use of transfer learning in mammogram breast cancer classification.

Table 1 (search string): "breast cancer" AND ("classification" OR "detection" OR "prediction") AND ("ensemble learning" OR "transfer learning") AND mammo*

Literature review discussion
In our literature review, the most common ConvNet used for transfer learning is the model proposed in [13], named AlexNet, with a total of 12 cases. The second most frequent model found is VGG16. These results are shown in Figure 1. In mammogram mass abnormality classification and detection there are two main approaches found in the literature: a) processing the whole mammogram image and b) processing the region of interest. The former is found in [32,33]; their aim is to find an "end to end design". Our approach belongs to the latter, where the ROI image is extracted. In fact, processing whole mammogram images in their original size seems to be a problem in itself, because mammogram images far exceed the input size traditionally used by many ConvNets trained on the ImageNet dataset. An interesting approach for an end to end design is proposed in [34]; in their work, the principles of the YOLO [35] architecture are used. However, the mammogram is also resized to 448 × 448.
As in the case of mammogram classification, Transfer Learning and Fine Tuning are used differently by different authors in the literature. In the next subsections these differences are highlighted and discussed.

Transfer Learning as a Feature Extractor
In this case, the pre-trained ConvNet is used to extract a feature vector which is later used to train another kind of classifier algorithm like Support Vector Machines (SVM). This case is illustrated in [36]; the author extracts several feature vectors from different layers of the pre-trained AlexNet and trains Support Vector Machines (SVM) for each case. In the end, the author builds an ensemble of SVM. Other similar examples are found in [37,38].

Transfer Learning as a new ConvNet Classifier
In this case, the last fully connected layer of the pre-trained ConvNet is substituted either with a set of additional layers, where the last fully connected layer has only one neuron with a logistic (Sigmoid) output for binary classification, or with just the number of randomly initialized neurons required by the new classification task. For instance, in the case of benign vs malignant classification of the mammogram abnormality, this can be achieved with a single neuron or with two. Only the added layers are trained while the rest of the ConvNet's weights remain frozen. This approach is tested in both our current and previous work [1]. In a similar fashion, VGG16 [39], GoogLeNet [40] and AlexNet [13] are trained in TL in [41].
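The remark that binary classification can be achieved with a single neuron or with two can be checked numerically: a two-neuron softmax output assigns the same probability as a single Sigmoid neuron fed with the difference of the two logits. The sketch below uses hypothetical logit values and illustrative function names.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax2(z_benign, z_malignant):
    """Two-class softmax; returns (p_benign, p_malignant)."""
    m = max(z_benign, z_malignant)  # subtract the max for numerical stability
    e_b = math.exp(z_benign - m)
    e_m = math.exp(z_malignant - m)
    total = e_b + e_m
    return e_b / total, e_m / total


# Hypothetical logits for one ROI image.
z_b, z_m = 0.3, 1.5
_, p_malignant = softmax2(z_b, z_m)

# A single sigmoid neuron fed the logit difference gives the same probability.
print(abs(p_malignant - sigmoid(z_m - z_b)) < 1e-12)  # True
```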

Transfer Learning as weight initialization
In this case the whole ConvNet is re-trained, but the weights of the ImageNet pre-trained ConvNet model are used as initial values. The last fully connected layer with 1000 categories is substituted by one or two neurons to address the binary classification problem [42]-[43].

Fine Tuning
This is the most common technique found in the literature. In this case, the model's last fully connected layer is substituted with the number of neurons needed for the new classification task, or a set of new layers is added before the output layer. Unlike transfer learning as a new ConvNet classifier, some of the last layers of the model are retrained with the new data, as indicated in Section 2.3. For example, VGG16 [39], InceptionV3 [20], and ResNet50 [18] are fine tuned in [44]; the author found that when the number of convolutional blocks exceeds 2, the accuracy of the fine tuned model drops. Also, a comparison of the classification performance between training VGG16 in FT and using it as a feature extractor is explored in [45].

Data Augmentation and Pre-processing
In the literature there is some discussion about the impact of both data augmentation and pre-processing of the medical image. As stated by [46], the achievements of deep learning in medical image visual tasks rely not only on the ConvNet model but also on the pre-processing of images. For instance, some of the pre-processing methods found in the literature are: global contrast normalization (GCN), local contrast normalization, and Otsu's threshold segmentation. However, there is some discussion about whether improving the image quality helps. In [37] it is reported that global contrast normalization did not improve the experimental results presented in that paper.
Since datasets are not very large, data augmentation is used by almost all researchers. Some of the most common techniques used are rotations and cropping. However, the rotation operation introduces distortion of the original image. For this reason, right angle rotations are preferred to random rotation angles [37,43,44].
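The preference for right angle rotations can be illustrated with a small example: a 90 degree rotation only permutes pixel positions, so no interpolation is needed and no pixel value is altered. The sketch below works on a hypothetical 2 × 3 patch stored as nested lists; the function name is illustrative.

```python
def rotate90(image):
    """Rotate a 2D list of pixels 90 degrees clockwise without interpolation."""
    return [list(row) for row in zip(*image[::-1])]


# Hypothetical 2 x 3 patch of pixel values.
patch = [[1, 2, 3],
         [4, 5, 6]]

rotated = rotate90(patch)
print(rotated)  # [[4, 1], [5, 2], [6, 3]]

# Four right-angle rotations return the original patch exactly.
restored = patch
for _ in range(4):
    restored = rotate90(restored)
print(restored == patch)  # True
```

An arbitrary rotation angle, in contrast, maps pixel centers to non-integer positions, so the rotated image has to be resampled.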

Transfer Learning and Fine Tuning Model
The problem to solve is to classify ROI patch mammogram images $I$ into two classes $Y = \{\text{benign}, \text{malignant}\}$. In a supervised learning paradigm this means finding a prediction function $\varphi(\cdot)$ that maps an input space $X$ formed by ROI patch mammogram images ($I \in X$) to the output space $Y$, as indicated in (3):

$$\varphi : X \rightarrow Y \tag{3}$$

However, since TL and FT are to be used to improve the performance of $\varphi(\cdot)$, (3) may be written as indicated in (4). Function $\varphi_T(\cdot)$ is to be trained by using ConvNets pre-trained on the ImageNet dataset, which is denoted as $X_S$. Therefore, our approach satisfies the relations indicated in (1) and (2), since both the images and the classification task are different between source and target.
Our approach in both TL and FT consists of replacing the last fully connected layer related to the original ImageNet classification task with a set of layers as indicated in Table 3. The global average pooling (GAvg) layer flattens the output of the layer that precedes the original 1000-category fully connected SoftMax classifier. The classification layer is comprised of 1 neuron and the Sigmoid function. In TL, only the last layers indicated in Table 3 are to be trained. In FT, we define the γ value that indicates the layer from which the training of the weights is to be performed. It is important to remember that all weights before γ keep their original ImageNet values.
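To give an idea of how many weights are trained in TL when only the added layers are updated, the count below assumes a head made of global average pooling, one fully connected layer of 4096 neurons, dropout, and a single Sigmoid neuron. The 4096 value is the one reported in the Model Training section; the layer ordering and the 512-channel feature map (the depth of VGG16's last convolutional block) are assumptions for this illustration, since Table 3 is not reproduced here.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


# Assumed head: global average pooling -> FC(4096) -> dropout -> 1 sigmoid neuron.
# Pooling and dropout have no trainable parameters.
base_channels = 512   # assumption: channel depth of the frozen base's last feature map
fc_units = 4096

fc_count = dense_params(base_channels, fc_units)
output_count = dense_params(fc_units, 1)
print(fc_count, output_count, fc_count + output_count)
# 2101248 4097 2105345
```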

Dataset
In the present study, as well as in our previous work, we use the Curated Breast Imaging Subset of DDSM (CBIS-DDSM) [4], which is an updated and standardized version of the Digital Database for Screening Mammography (DDSM). The dataset includes a subset of the DDSM data selected and curated by a trained mammographer. For our experiments we extract the ROI images from the mammogram images. We have only considered mass problems, leaving micro calcifications for future work. The dataset is originally organized in train and test sets. The number of images per abnormality class and set type is shown in Table 4.

Figure 2: A diagram depicting the methodology followed in our research. First, the mammogram and the corresponding mass binary mask are read. The mask is used to extract the ROI from the mammogram image. Next, the ROI image is enhanced (pre-processing). As a third step, data augmentation is used to increase the number of samples and create train, validation and test sets. Finally, transfer learning is performed and evaluated. The best non-overfitting model is selected to be fine tuned.

Methodology
The methodology followed in the current experiments is shown in Figure 2. First, the ROI images are extracted from the mammogram image by using the binary segmentation masks provided in the CBIS-DDSM dataset. After that, images are pre-processed to enhance contrast. Next, a single dataset is formed in order to use data augmentation and create three sets of data: train, validation and test. In this work, we do not use the Otsu algorithm to segment the previously obtained ROI by creating an intermediate binary mask. This is because our previous work showed that training with the ROI image segmented with Otsu did not improve on the results obtained with the original background image. Finally, the models are trained in TL and FT, and their performance is evaluated. These steps are described in more detail below.

Image Pre-processing
In this section we present the steps performed in the pre-processing stage of our proposed method. The original mammogram image has pixel values which range up to 65 535 (i.e. 16 bit resolution).
In order to generate a dataset of PNG images saved on disk, we change the resolution of the images by normalizing them between 0 and 255, considering the minimum and maximum pixel values in the original DICOM mammogram image. After that, and unlike our previous work, we used Contrast Limited Adaptive Histogram Equalization (CLAHE) [47] to improve image quality. By using the provided binary masks, we extract the ROI through the coordinates of a bounding box around the suspicious mass. A second normalization of the pixel values is carried out on the ROI considering the minimum and maximum pixel values in it. This produces ROI images with different widths and heights. As in our previous work, the aspect ratio is considered. It is defined in (5), where r is the aspect ratio, and w and h are the width and height of the image, respectively:

$$r = \frac{w}{h} \tag{5}$$

The aspect ratio was considered before resizing the image in order to preserve the best possible quality from the original image in both upsampling and downsampling procedures. For upsampling, cubic interpolation was used, whereas for downsampling, area interpolation gives the best results. Also, images with an aspect ratio below 0.4 or above 1.5 were removed from the dataset, so that the retained images satisfy (6):

$$0.4 \leq r \leq 1.5 \tag{6}$$
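The min-max normalization from the 16-bit range to the 0 to 255 range described above can be sketched as follows. The pixel values are hypothetical, the image is represented as a flat list for brevity, and rounding to the nearest integer is an assumption, since the rounding rule is not stated.

```python
def rescale_to_8bit(pixels):
    """Linearly map pixel values so that min -> 0 and max -> 255."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [0 for _ in pixels]  # flat image: avoid division by zero
    return [round(255 * (p - lo) / (hi - lo)) for p in pixels]


# Hypothetical 16-bit pixel values from one image (flattened).
raw = [1200, 8000, 32767, 65535]
print(rescale_to_8bit(raw))  # [0, 27, 125, 255]
```

The same function applies to the second normalization, in which the minimum and maximum are taken over the ROI instead of the whole mammogram.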
ROI images were resized to a final size of 320 × 320. This was done only for images whose aspect ratio was inside the limits presented in (6). Resizing the ROI images consisted of two parts: 1) resizing the ROI image to 328 × 328, and 2) cropping around the center of the image. In other words, we have resized the original ROI image to a bigger size (with a padding of 8 pixels) and then cropped around the center of the image to obtain the desired size. Finally, the image is filtered with the fast non-local means denoising algorithm [48]. This is because CBIS-DDSM images are film mammograms and appear to have some noise that has not been removed from the image. Figure 3 presents the steps carried out in image pre-processing for a sample of the training set. Both the full mammogram image and its corresponding binary mass mask are shown in sub-figures A and C respectively. Image C shows the identification of the bounding box. B presents the application of CLAHE on the mammogram image. Finally, E presents the processed ROI image after performing the center crop on D, which was extracted from B through the bounding box.
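The aspect ratio filter and the resize-then-crop geometry can be summarized in a short sketch. It assumes that the aspect ratio is computed as width over height, that the limits 0.4 and 1.5 are inclusive, and that the 8 extra pixels are split evenly between the two sides; the function names and the ROI sizes are hypothetical.

```python
def aspect_ratio_ok(width, height, low=0.4, high=1.5):
    """Keep ROIs whose aspect ratio r = w / h lies within [low, high]."""
    r = width / height
    return low <= r <= high


def center_crop_box(size, target):
    """Return (left, top, right, bottom) of a centered target x target crop."""
    offset = (size - target) // 2
    return offset, offset, offset + target, offset + target


print(aspect_ratio_ok(300, 400))   # True  (r = 0.75)
print(aspect_ratio_ok(900, 400))   # False (r = 2.25)
print(center_crop_box(328, 320))   # (4, 4, 324, 324)
```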

Data Augmentation and Dataset Generation
In our previous work we did not use data augmentation and our models presented overfitting. In order to overcome this difficulty, we implemented the output structure indicated in Table 3, which uses dropout. As pointed out in Section 3.2.5, performing transformations on the images distorts them. Because of that, some researchers use right angles. In our case, we have used the Augmentor library [49], which has been designed to permit rotations of the images while limiting the degree of distortion. Additionally, the Augmentor library allows other data augmentation operations to be applied, such as zoom, brightness and shear. Each operation uses a probability value to control the number of artificially created images. To augment the dataset, we first join both Train and Test sets. The dataset was increased to a total of 60 000 images, where 80% of the images are used for training, 10% for validation and 10% for testing. The augmentation operations used are depicted in Table 5.
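The 80/10/10 partition of the 60 000 augmented images can be sketched as a shuffled split. The shuffling step, the fixed seed and the function name are assumptions for the illustration; the text above only states the proportions.

```python
import random


def split_dataset(items, train=0.8, valid=0.1, seed=0):
    """Shuffle and split items into train, validation and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])


# Image indices stand in for the 60 000 augmented ROI images.
train_set, valid_set, test_set = split_dataset(range(60000))
print(len(train_set), len(valid_set), len(test_set))  # 48000 6000 6000
```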

Model Training
The generated augmented dataset is tested first in Transfer Learning. A total of 4096 neurons are used for the FC layer, with a dropout value of 0.2 and a single neuron in the output layer (or classification layer, see Table 3) for binary classification. Early Stopping, with a patience of 50, was enabled in order to stop training when the performance degrades. A maximum number of 1 000 epochs is set. The learning rate for TL is $1 \times 10^{-5}$. The loss function is set to binary cross entropy and the optimization algorithm used is RMSProp [50]. Binary cross entropy was chosen because the classification layer of our model uses the Sigmoid activation function to discriminate between benign and malignant mass pathology by using one neuron. RMSProp, an adaptive gradient algorithm, is frequently used in computer vision tasks [51]. For instance, the results achieved in [52] indicate a better training result obtained by using RMSProp instead of Stochastic Gradient Descent (SGD). Also, as presented in [53], RMSProp outperforms other common optimization algorithms [51]. This has inspired the theoretical research in [51], where the authors establish the reasons for the success of RMSProp in deep learning and propose new algorithms for optimization. In our study, we consider RMSProp convenient for our image classification task for the aforementioned reasons.
Transfer Learning was carried out on 20 models provided by the Keras API in Tensorflow v1.13.1 [54]. The models used are shown in Table 6. The performance of each model is evaluated on the train-validation accuracy curve. If the difference between train accuracy and test accuracy is over 10%, we consider that overfitting has occurred and the model is rejected. Once a suitable non-overfitting model is found, FT is performed in order to further increase the performance of the selected classifier.
The FT training parameters are: a learning rate of $2 \times 10^{-7}$, with Dropout remaining at 0.2. With respect to the FC neurons, only global average pooling is performed. Binary cross entropy is set as the cost function and the optimization algorithm used is also RMSProp.
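The pairing of a single Sigmoid output neuron with binary cross entropy can be made concrete with a small numerical sketch. The labels, the logits and the encoding 1 = malignant, 0 = benign are hypothetical; the clipping constant `eps` is a common numerical safeguard and is not taken from the experiments.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical labels (1 = malignant, 0 = benign) and output-neuron logits.
labels = [1, 0, 1, 0]
logits = [2.0, -1.0, 0.0, 0.5]
probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy(labels, probs)
print(round(loss, 4))
```

The loss is smallest when the Sigmoid output is close to 1 for malignant samples and close to 0 for benign ones; an undecided output of 0.5 contributes ln 2 per sample.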

Computing Resources
In order to train deep learning models it is necessary to use Graphical Processing Units (GPU). In our case we have used: Nvidia Tesla K80, with 12 GB of memory, Nvidia Tesla K40, with 12 GB of memory, and Nvidia GeForce RTX 2080, with 8 GB of memory.
Regarding software resources, our program used Tensorflow v1.13.1 [54] as the machine learning framework. For image data augmentation, as indicated before, the Augmentor library [49] was used.

Experimental Results
In this Section, we present our experimental results. Unlike our previous work, we include additional metrics to evaluate the performance of the classification model, such as the area under the ROC curve and the $F_1$ Score. These metrics are described below. According to the methodology proposed in Figure 2, it is important to estimate the overfitting of the trained model $\varphi_T(\cdot)$. In order to do so, let us define β as the overfitting ratio, which compares train accuracy (train acc) and validation accuracy (valid acc) as in (7):

$$\beta = \frac{\text{train acc}}{\text{valid acc}} \tag{7}$$

If β ≈ 1, then train acc ≈ valid acc and therefore there is little overfitting. Values of β > 1 reflect a considerable difference between train and validation accuracy, meaning that the model has overfitted.
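The overfitting ratio and the 10% rejection rule stated in the Model Training section can be computed as in the sketch below. The accuracy values are hypothetical and are not taken from the tables; the function names are illustrative.

```python
def overfitting_ratio(train_acc, valid_acc):
    """beta = train accuracy / validation accuracy."""
    return train_acc / valid_acc


def is_overfitting(train_acc, valid_acc, max_gap=0.10):
    """Rejection rule assumed here: gap between accuracies above max_gap."""
    return (train_acc - valid_acc) > max_gap


# Hypothetical accuracies, not values reported in the tables.
print(round(overfitting_ratio(0.90, 0.75), 2))  # 1.2
print(is_overfitting(0.90, 0.75))               # True
print(is_overfitting(0.80, 0.78))               # False
```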

Performance Metrics
The confusion matrix compares the prediction of the trained classifier with the true labels provided in the test set. It consists of four main measures: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The total positive cases are P = TP + FN; similarly, the total negative cases are N = TN + FP. From these measures, the true positive rate (8), false positive rate (9), and true negative rate (10) are derived.
The elements on the diagonal belong to TP and TN and reflect all the correct classifications made by the model. The FP and FN correspond to wrongly classified cases. For instance, FN corresponds to a true malignant tumor that is classified as benign. From these measurements, metrics like Accuracy (11), $F_1$ Score (12), and the area under the ROC curve (13) are defined.

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \tag{11}$$
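The rates and scores named above can be computed from the four confusion-matrix counts as in the following sketch. It uses the standard definitions of the true positive rate, false positive rate, true negative rate, accuracy and F1 score, which are assumed to correspond to the numbered equations of this section; the counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard rates and scores derived from confusion-matrix counts."""
    p = tp + fn  # total positive cases
    n = tn + fp  # total negative cases
    return {
        "tpr": tp / p,                         # sensitivity / recall
        "fpr": fp / n,
        "tnr": tn / n,                         # specificity
        "acc": (tp + tn) / (p + n),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts for illustration only.
m = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(m["tpr"], m["fpr"], m["tnr"], m["acc"])  # 0.8 0.1 0.9 0.85
```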

Transfer Learning Experiment
The ImageNet pre-trained ConvNets presented in Table 6 are trained under TL to predict mammogram abnormality classes in mammogram ROI images. The results are presented in Table 7. The overfitting ratio stated in (7) is presented in Table 8. The results indicate that Resnet-50 has the best AUC; however, its β shows that there is overfitting. On the contrary, both VGG16 and VGG19 do not present overfitting, but their classification performance is lower compared to Resnet-50. These results are reflected in Figures 4 and 5, where the plot of train vs validation accuracy is presented for Resnet-50 and VGG16, respectively.

Fine Tuning Experiment
According to the results indicated in Tables 7 and 8, we proceeded to train the VGG16 in Fine Tuning. Different deepness levels where tried in order to search for classification performance improvement. This is indicated through the γ value. VGG16 was trained from layers γ = 8, γ = 10. In order to denominate the trained model, we propose to use the pre-trained ConvNet name followd by the layer from which fine tunning occured and added the keyword FT to distinguish the model from those trained in TL mode. For instance, VGG16-10-FT means that VGG16 was fine tuned from layer 10. Similarly, we trained VGG19 at γ = 17. In all cases, only global average pooling followed by dropout and the classification layer were used; except for the case of VGG16-8-FT, where a full connecting layer of 4096 neurons was used.
The results obtained are shown in Table 9. We have complemented the experimental results with the Fine Tuning of the models Xception, Resnet101, Resnet152 and Resnet50. The best results are achieved by the VGG models, and the best overall corresponds to VGG16-8-FT. However, Table 10 suggests that the second-best result (VGG16-10-FT) has less overfitting and is therefore preferred. In fact, the β values for VGG16-10-FT and VGG19-17-FT are similar, but the performance of the latter is poorer.
In Figure 6, the plot of train and validation accuracy vs. the number of epochs for VGG16-10-FT is presented. Figure 7 shows the ROC curve obtained, whereas Figure 8 presents the confusion matrix generated for the test set.
Figure 4: Comparison of the train and validation accuracy for Resnet50-TL when trained in transfer learning. Although Resnet50 achieves an accuracy of ACC = 0.86, the model presents overfitting, as shown by the distance between both curves and described by the ratio β = 1.16.
Figure 5: Comparison of the train and validation accuracy for VGG16-TL when trained in transfer learning. VGG16 achieves a lower accuracy than Resnet50 (ACC = 0.64, β = 1.01), but there is no overfitting, as shown by the small distance between the train and validation curves.
Fine Tuning the VGG16 model from layer 10 (γ = 10) helps to increase the performance of the classifier while controlling the overfitting (β = 1.07), which means that train accuracy is 7% over the test accuracy.

Conclusion
The present study compared the performance of TL and FT on different ConvNet models pre-trained on the ImageNet dataset, such as VGG, DenseNet, Inception, Resnet, Resnext and Xception. In our previous work, we experimented with Nasnet and Mobilenet without data augmentation, which showed that the models tended to overfit. To overcome this problem and increase the performance of a fine-tuned model, in this work we used data augmentation as indicated in Section 4.3.2. Our experiments showed that increasing the dataset up to 30 000 images per category helped to achieve good results. Special care was taken to increase the dataset using the Augmentor library [49], which allows images to be rotated without excessive distortion. Compared to our previous work, the image pre-processing has also changed: instead of histogram equalization, CLAHE was selected to enhance the contrast of the images. Also, the image was resized to a larger size so that it could be cropped around the center. Finally, image filtering was applied to reduce noise in the image, as indicated in Section 4.3.1.
To estimate overfitting, we proposed a simple ratio described in (7). This allowed us to conclude that models could achieve good results in classification metrics and still overfit. Considering this, we decided to increase the complexity of the model by using the fine tuning technique, where weights from layer 1 to γ − 1 are frozen and weights from layer γ to the end are trained with back-propagation. This increased the performance of both VGG16 and VGG19. Increasing the number of neurons in the FC layer of VGG16-8-FT improved the results, but with slight overfitting.
The CBIS-DDSM dataset has at most 1 696 examples for mass abnormality classification, and its samples present some artifacts and noise, which is why we used pre-processing and filtering algorithms to improve the images. The artifacts and noise are probably due to the fact that the original images are of film type. Our experiments suggest that TL and FT of pre-trained ConvNets are able to classify film mammogram ROI images. However, the increase of the original dataset through data augmentation is considerable. Therefore, public datasets of Full Digital Mammography, which has better quality than film, with enough sample data would help to train better classifiers.
With respect to our previous work, we have also included new metrics to evaluate the performance of the classifier. This is important because the accuracy metric can be distorted when the dataset is skewed or unbalanced; in other words, it tends to favor the majority class.
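As a worked illustration with hypothetical counts (not data from this study), consider a test set with 90 benign and 10 malignant cases and a classifier that labels every case as benign, taking malignant as the positive class:

```latex
TP = 0,\quad FN = 10,\quad TN = 90,\quad FP = 0
\;\Longrightarrow\;
\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{0 + 90}{100} = 0.90,
\qquad
\mathrm{TPR} = \frac{TP}{TP + FN} = \frac{0}{10} = 0 .
```

The accuracy looks high even though no malignant case is detected, which illustrates why rate-based metrics are useful alongside accuracy on unbalanced data.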

Future Works
The interest in improving CAD systems in mammography is clear, since the disease is a public health problem with high rates of both incidence and mortality. In this and our previous work, we have mainly used the CBIS-DDSM dataset. In future work, we aim to evaluate the performance of TL and FT on other datasets such as INbreast [55] and Mias [56].
We will also address the problem of localization and detection of the mass in the mammogram image. This problem is also of interest and could be formulated from the classification problem addressed here. An important point is the size of mammogram images compared to the input size used by ImageNet-trained ConvNets: mammogram images are considerably large, and it could be of interest to design both the classifier and the object detector without excessively resizing the original image.

Conflict of Interest
The authors declare no conflict of interest.