Classifying Garments from Fashion-MNIST Dataset Through CNNs

ARTICLE INFO
Article history: Received: 27 October, 2020; Accepted: 24 January, 2021; Online: 16 February, 2021

ABSTRACT
The online fashion market is constantly growing, and an algorithm capable of identifying garments can help companies in the clothing sales sector understand the profile of potential buyers, target sales at specific niches, develop campaigns based on customers' tastes, and improve user experience. Artificial Intelligence approaches able to understand and label people's clothes are therefore needed, and can be used to improve sales or to better understand users. Convolutional Neural Network models have proven efficient in image classification. This paper presents four different Convolutional Neural Network models trained on the Fashion-MNIST dataset. Fashion-MNIST was created to help researchers find models to classify products such as clothes, and the paper that describes it compares the main classification methods to find the one that best labels this kind of data. The main goal of this project is to provide future research with better comparisons between classification methods. This paper presents a Convolutional Neural Network approach to this problem and compares the classification results with the original ones. This method improved accuracy from 89.7% (the best result in the original paper, using SVM) to 99.1% (with a new CNN model called cnn-dropout-3).


Introduction
This paper is an extension of work originally presented in the Iberian Conference on Information Systems and Technologies [1].
The fashion market has changed dramatically over the last 30 years, resulting in an evolution in that industry [2]. Understanding customer tastes and better directing sales are ways to increase profit [3].
The rise of internet business lets people buy their clothes through websites, faster and more easily. Introducing methods that improve users' experience when searching for items on these platforms is decisive [4].
Classifying clothes is part of the broad task of classifying scenes [5][6][7][8][9]. The automatic generation of image labels that describe those products can alleviate human annotators' workload [10]. This kind of information may also help label scenes and better understand users' tastes, culture, and financial status [5].
In [11], the authors present the Fashion-MNIST dataset, based on images from Zalando, Europe's largest online fashion platform. Fashion-MNIST has 70,000 products shown as 28x28 pixel greyscale images, divided into 10 categories: t-shirt, trouser, pullover, dress, coat, sandals, shirt, sneaker, bags and ankle boots.

ASTESJ ISSN: 2415-6698
CNNs can achieve better results than SVMs [13]. Knowing that, this paper proposes the use of Convolutional Neural Networks (CNNs) to label the Fashion-MNIST dataset. The main goal is to compare those results with the original ones, so that future research can easily choose the most suitable classification method. In this paper, we present the development of four different CNN models and compare their results with the original ones. Our original work [1] used these four models with TensorFlow 1 (TF1) to get the results. Now, we present this extension using TensorFlow 2 (TF2) and GPU computing (with tensorflow-gpu and keras).

Feature Learning
Machine Learning refers to computer systems capable of learning and modifying their behavior, in response to external stimuli or through experiences accumulated during their operation [14].
The main objective of Machine Learning is to generalize beyond the examples in the training set because, regardless of the amount of existing data, it is very unlikely that the same examples will appear during testing [15].
Conventional Machine Learning techniques are limited in their ability to process natural data in its raw form. To build a model capable of pattern recognition, it is necessary to develop a feature extractor, which transforms raw data into a representation from which the classifier can detect patterns [16,17].
The group of methods that allow systems (based on Machine Learning) to discover the necessary representations for detecting and classifying raw data is known as Feature Learning [18].
These methods can learn not only how to map features to a result, but also how to build the representation itself, often resulting in better performance than representations developed by a specialist [19].
The growing scientific interest in this topic has been followed by notable success in both academia and industry, mainly in the areas of speech recognition, signal processing, object recognition and natural language processing [20].
Some of the Feature Learning techniques that have been producing promising results belong to Deep Learning. The more data becomes available, the more successful this technique is expected to be [19].

Deep Learning
There are several techniques capable of Feature Learning, and some of them are known as Deep Learning. They are models composed of multiple nonlinear transformations that produce more abstract and more useful representations.
These models transform one representation into another that is more abstract than the previous one. In classification models, for example, more representation layers tend to amplify aspects of the data that are important for classification and hide irrelevant variations. By adding more layers, these models can represent more complex functions [21].
The key aspect of Deep Learning is that layers of representations are not specified by a human specialist. They are learned through data, using common Machine Learning procedures. In the Deep Learning context, we can use Convolutional Neural Networks for this.

Convolutional Neural Network
Convolutional Neural Networks [22] are a variation of the MLP (Multilayer Perceptron) and are based on the behavior of the visual cortex, where the neurons of the initial regions are responsible for detecting simple geometric shapes in the image, such as corners and edges, and later neurons detect more complex graphic shapes. The process is repeated throughout the cortex until neurons in the final region detect characteristics at a higher level of abstraction, such as specific faces [23].
CNNs are used in problems where it is necessary to find relevant information implicit in a data set, through operations that occur in convolution and pooling layers. For the image classification task, variants of Convolutional Neural Networks have been prevalent in the literature [24], and they demonstrate excellent results on the MNIST, CIFAR and ImageNet datasets [11,25].
A convolutional layer is composed of several neurons, each one responsible for applying a filter to a piece of the input matrix [23]. The convolution operation consists of applying a series of filters that slide over the entire input matrix; the result of applying each filter is called a feature map [22].
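As an illustration of the sliding-filter operation, the following minimal sketch computes one feature map in plain Python. It assumes no padding and a stride of 1, and it uses the unflipped filter, as CNN libraries commonly do; the function name `convolve2d`, the toy image and the `vertical_edge` filter are hypothetical and are not taken from the models in this paper.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the image patch currently covered by the filter.
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Toy 4x4 image: dark on the left half, bright on the right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2x2 filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, vertical_edge)
```

In this toy case the 3x3 feature map is nonzero only in its middle column, where the filter straddles the edge. In a trained CNN the filter values are learned rather than chosen by hand.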
A pooling layer implements a nonlinear subsampling function that reduces dimensionality and captures small invariances. Pooling reduces the dimensionality of the input feature map and produces a new feature map, which acts as a summary of the input.
Within each pooling window of a feature map, either the highest value is selected (max pooling) or the average is calculated (average pooling). Pooling speeds up training and makes the CNN more robust to the position and size of the most important characteristics of the training data.
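The two pooling variants can be sketched in a few lines of plain Python. The sketch assumes non-overlapping square windows (window size equal to the stride), and the function name `pool2d` and the sample feature map are hypothetical.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Summarize each non-overlapping size x size window by its max or its average."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
max_pooled = pool2d(feature_map, mode="max")      # 4x4 input becomes 2x2
avg_pooled = pool2d(feature_map, mode="average")  # same shape, averaged values
```

With a 2x2 window, each spatial dimension is halved, so the output holds a quarter of the input values.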

Dropout
One of the most common challenges in training a Convolutional Neural Network is overfitting. There are several ways to mitigate this problem in a Deep Learning model, such as reducing the number or size of layers, or using a technique known as dropout.
The term dropout refers to "dropping" units (neurons) from a neural network, that is, temporarily removing them from the model. The choice of which neurons to remove is random, and the amount is controlled by a constant: for example, a constant of 0.5 means that each neuron is dropped with probability 0.5, so about half of the neurons are removed on average. This technique has considerably improved the accuracy and performance of neural networks in several applications, such as object classification, speech recognition and document classification, among others [6]. This suggests that the technique generalizes across a wide range of problems.
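A minimal sketch of this mechanism is given below. It assumes the "inverted" variant of dropout, in which the surviving activations are scaled by 1 / (1 - rate) during training so that their expected sum is unchanged and nothing needs to be done at test time. The function name `dropout`, its parameters and the sample activations are hypothetical.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        # At test time every unit is kept and no scaling is needed.
        return list(activations)
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10000
dropped = dropout(activations, rate=0.5, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)  # close to 0.5
```

With a rate of 0.5, roughly half of the units are zeroed in each call, and a different random subset is removed every time the function is applied.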

Related Work
In [6], the authors presented a context-sensitive grammar in an And-Or graph representation that can produce a large set of composite graphical templates of cloth configurations.
In [5], the authors introduce a pipeline for recognizing and classifying clothes in natural scenes. To do so, they used a multiclass learner based on a Random Forest. Feature maps were extracted from the data, converted to histograms, and used as inputs for a Random Forest classifier (for clothing types) and an SVM classifier (for clothing attributes). As a result, their pipeline can describe the clothes in a scene, as shown in Figure 1. They also created their own dataset with 80000 images labeled in 15 classes.
In [26], the authors propose a knowledge-guided fashion analysis network for clothing landmark localization and classification. To do so, they used a Bidirectional Convolutional Neural Network. As a result, the model can predict not only landmarks, but also category and attributes, as shown in Figure 2.

Data Set
Fashion-MNIST is a direct drop-in alternative to the original MNIST dataset for benchmarking machine learning algorithms [11]. MNIST [27] is a collection of handwritten digits and contains 70000 greyscale 28x28 images associated with 10 labels, where 60000 are part of the training set and 10000 of the testing set. Fashion-MNIST has exactly the same structure, but the images are fashion products, not digits. A sample of this set can be seen in Figure 3. The dataset can be obtained as two 785-column CSV files, one with the training images and the other with the testing ones. Each CSV row is an image: one column holds the label (enumerated from 0 to 9) and the 784 remaining columns describe the 28x28 pixel image, with values from 0 to 255 representing pixel luminosity.
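The row layout described above can be decoded as in the sketch below. It assumes that the label is the first column and that the 784 pixel columns are stored row by row; neither detail is stated above, so both should be checked against the actual files. The function name `parse_row` is hypothetical, and the row is synthetic rather than real Fashion-MNIST data.

```python
import csv
import io


def parse_row(row):
    """Split one 785-value CSV row into (label, 28x28 image as a list of rows)."""
    values = [int(v) for v in row]
    if len(values) != 785:
        raise ValueError("expected 1 label column plus 784 pixel columns")
    label, pixels = values[0], values[1:]  # assumption: label comes first
    # Assumption: pixels are stored row by row (row-major order).
    image = [pixels[r * 28:(r + 1) * 28] for r in range(28)]
    return label, image


# Synthetic one-row CSV: label 9 followed by 784 pixel values in 0..255.
sample_csv = ",".join(["9"] + [str(i % 256) for i in range(784)])
reader = csv.reader(io.StringIO(sample_csv))
label, image = parse_row(next(reader))
```

When reading a real file, any header row would need to be skipped before calling `parse_row`.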

CNN Models
To label this dataset, four CNN models were built in Python with Keras and TensorFlow. Training was executed in a Jupyter notebook, using a GPU. We also used Weights and Biases [14] to record information about training and hardware usage.
The proposed models were named cnn-dropout-1, cnn-dropout-2, cnn-dropout-3 and cnn-simple. The goal of those models was to label the dataset without requiring much training or much processing at prediction time, so that developers can use them in real-time applications such as online stores and search websites.

cnn-dropout-1 and cnn-dropout-3
Both models use two consecutive blocks, each containing a convolution, a max pooling, and finally a dropout. These blocks are followed by two fully connected layers, which are connected to an output layer of ten neurons, each one representing a category. The only difference between these two models is that cnn-dropout-3 has considerably lower dropout values, as shown in Figure 4, where the first dropout value is for cnn-dropout-1 and the second one for cnn-dropout-3. This topology has 44426 trainable parameters.

cnn-dropout-2
This proposed model is very similar to the cnn-dropout-1 model. However, it has two convolution layers before each max pooling. This model has about 32340 trainable parameters, as shown in Figure 5.

cnn-simple
Cnn-simple is a model with fewer layers. It has only two convolutions, followed by a fully connected layer, in addition to the respective dropout and max pooling, as in the other models. This model has 110968 trainable parameters and is shown in Figure 6. Since this model has only one max pooling, the image reaches the dense layer with a size of 14x14 pixels (four times the 7x7 size in the other models). So, training the dense layer is expected to be slower.
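The 14x14 and 7x7 sizes can be checked with a short calculation. The sketch below assumes 2x2 max pooling and convolutions that preserve the spatial size, which is consistent with the sizes quoted above but is not stated explicitly; the function name `spatial_size_after_pooling` is hypothetical.

```python
def spatial_size_after_pooling(input_size, num_poolings, pool=2):
    """Side length of a square feature map after repeated pool x pool poolings."""
    size = input_size
    for _ in range(num_poolings):
        size //= pool
    return size


one_pool = spatial_size_after_pooling(28, 1)   # cnn-simple: one max pooling
two_pools = spatial_size_after_pooling(28, 2)  # dropout models: two max poolings
# Ratio of spatial positions per feature map reaching the dense layer.
ratio = (one_pool ** 2) / (two_pools ** 2)
```

Under these assumptions each feature map contributes 196 values to the dense layer in cnn-simple, against 49 in the other models.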
All those models were built on the Keras Sequential model. Convolutional and dense layers used Rectified Linear Unit (ReLU) activation functions, except for the last dense layer of each model (the output layer), where Softmax was used. The optimizer used was Adadelta [28]. Batch size was 128 and we trained the models for 12 epochs. To improve results, image pixel luminosity values were normalized to floating-point numbers between 0 and 1.
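For reference, the normalization and the two activation functions mentioned above can be written in plain Python as in the sketch below. It is independent of Keras and only illustrates the definitions; the function names and sample values are hypothetical, and dividing by 255 is an assumption about how the 0 to 255 range was mapped to the 0 to 1 range.

```python
import math


def normalize(pixels):
    """Map luminosity values from 0..255 to floats in 0..1 (assumes division by 255)."""
    return [p / 255.0 for p in pixels]


def relu(values):
    """Rectified Linear Unit: negative inputs become 0, others pass through."""
    return [max(0.0, v) for v in values]


def softmax(logits):
    """Turn raw output scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


scaled = normalize([0, 51, 255])
hidden = relu([-1.5, 0.0, 2.0])
probabilities = softmax([1.0, 2.0, 3.0])
predicted_class = probabilities.index(max(probabilities))
```

In a ten-class output layer, the index of the largest Softmax probability is taken as the predicted category.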
Regarding time, the two most accurate models were also the fastest to train. Figure 10 presents the relation between time and training epoch; the slope of each line shows how fast the training was. Since we are using the same dataset, it is possible to compare these models with the traditional non-convolutional machine learning algorithms presented in [11]. Table 1 presents the comparison between our models and theirs. The results show that even our worst model (cnn-dropout-1) performed better than the best result in [11] (SVC, 89.70%).

Conclusions
The results obtained show that classifying fashion products with CNNs can be more accurate than using other conventional machine learning models. In addition, it was observed that the dropout technique combined with more convolutional layers is effective in reducing the bias of a model.
Using TensorFlow 2 and a GPU for training, we achieved not only better training times, but also better accuracies. Table 2 shows the differences between our original work and the present one. We were also able to decrease loss and bias, which were our main problems. We could not fairly evaluate the improvement in runtime, since we used different hardware than in the original run.
Our original work found that the best model was cnn-simple, but with these new results we discovered that cnn-dropout-3 is better (using TF2). This is good news because this model is faster to train, since it has an extra max-pooling layer that reduces the number of dense layer inputs to a quarter.
Regarding our goals, we were able to compare the results obtained with the ones from the original Fashion-MNIST paper, and they show that CNNs can be great classifiers for garments. Table 1 contains our main results on this and can be used in future work to help researchers and developers find the best classification technique.