Machine Learning Framework for Image Classification

In this paper, we are interested in feature extraction and classification methods for image classification and recognition applications. We evaluate the performance of models trained with different classifier algorithms on the Caltech 101 image categories. For feature extraction, we evaluate the use of the classical SURF technique against global color feature extraction. The purpose of our work is to determine the best machine learning framework techniques for recognizing stop sign images. The trained model will be integrated into a robotic system in future work.


I. INTRODUCTION
The purpose of this paper is twofold. On the one hand, it is an introduction to the image classification paradigm. On the other hand, it attempts to give a comparison between different feature extraction and classification algorithms.
The rest of this paper is structured as follows: Section II provides background information on machine learning. Section III presents a detailed description of the Bag of Features paradigm; it also exposes the SURF detector of image Regions of Interest (ROI) and highlights the unsupervised K-means algorithm. In Section IV we describe learning and recognition based on Bag of Words (BoW) models. Section V discusses the experiments carried out to evaluate different classifiers on the Caltech 101 image dataset. In the conclusion we synthesize the obtained results and present the current direction of our research.

II. MACHINE LEARNING PARADIGM
Machine learning (ML) is a set of algorithms especially suited to prediction. These ML methods are often easier to implement and perform better than classical statistical approaches [1].
Instead of starting with a data model, ML learns the relationship between the response and its predictors by means of algorithms. During the learning phase, ML algorithms observe inputs and responses in order to find dominant patterns.
In this work we are interested in computer vision. We deploy and test a machine learning based framework for image category classification. To carry out tests we use the Caltech 101 dataset.
As the main issue in image classification is image feature extraction, we use in our research the Bag of Features (BoF) technique described in Section III.

III. BAG OF FEATURES PARADIGM FOR IMAGE CLASSIFICATION
The Bag of Features (BoF) model is inspired by the Bag of Words (BoW) model. In document classification (text documents), a BoW is a vector that represents the frequency of vocabulary words in a text document.
In computer vision, the BoW model is used to classify images. In that case, image features are considered as words and an image is considered as a document. To define "words" in images, three stages are used: feature extraction, feature description (Section III.A) and codebook generation (Section III.B) [5][6][7][8][9][10][11].
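As a concrete illustration of the counting step (not part of the original framework, which operates on image features), a text BoW vector can be sketched in a few lines of Python:

```python
from collections import Counter

def bow_vector(document_tokens, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document_tokens)
    return [counts.get(word, 0) for word in vocabulary]

vocab = ["cat", "dog", "fish"]
doc = ["dog", "cat", "dog", "bird"]   # "bird" is out of vocabulary, so ignored
vec = bow_vector(doc, vocab)
print(vec)  # [1, 2, 0]
```

The image case replaces vocabulary words by cluster centroids of descriptors, as described in Section III.B.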

A. Speeded-Up Robust Features (SURF) Technique
For feature detection and extraction, we use the Speeded-Up Robust Features (SURF) method. Salient features and descriptors are extracted from each image. This method is chosen over the Scale-Invariant Feature Transform (SIFT) because of its more compact descriptor. In SURF, a descriptor vector of length 64 is generated from sums of Haar wavelet responses computed in the local neighborhood around each keypoint [12]. To analyze an image and extract features, SURF processes grey-level images only, as they contain enough information [13].
In this paper, the SURF implementation is provided by the Matlab R2015a library.
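The implementation used here is Matlab's; purely to illustrate the 64-dimensional descriptor layout (4x4 spatial cells, 4 values per cell), a simplified Python sketch follows. It is not real SURF: actual SURF uses Haar wavelet responses on an integral image plus scale and orientation handling, whereas this sketch uses plain image gradients.

```python
import numpy as np

def surf_like_descriptor(patch):
    """Simplified sketch of the SURF descriptor layout (NOT real SURF).
    The patch around a keypoint is split into a 4x4 grid; each cell
    contributes (sum dx, sum dy, sum |dx|, sum |dy|), giving
    4 * 4 * 4 = 64 dimensions, normalized to unit length."""
    dy, dx = np.gradient(patch.astype(float))
    h, w = patch.shape
    desc = []
    for i in range(4):
        for j in range(4):
            rows = slice(i * h // 4, (i + 1) * h // 4)
            cols = slice(j * w // 4, (j + 1) * w // 4)
            desc += [dx[rows, cols].sum(), dy[rows, cols].sum(),
                     np.abs(dx[rows, cols]).sum(), np.abs(dy[rows, cols]).sum()]
    desc = np.array(desc)
    return desc / (np.linalg.norm(desc) + 1e-12)  # unit length, as in SURF

patch = np.random.default_rng(0).random((20, 20))  # stand-in grey-level patch
d = surf_like_descriptor(patch)
print(d.shape)  # (64,)
```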

B. Descriptors clustering: K-Means
After descriptors have been extracted from the training images, an unsupervised learning algorithm such as K-means is used to group them into N clusters of visual words. The metric used to assign a descriptor to a cluster centroid is the Euclidean distance: each extracted descriptor is assigned to its closest cluster centroid.
To generate the histogram of counts, a cluster centroid's number of occupants is incremented each time a descriptor is mapped to it.
At the end of this process, each image is characterized by a histogram vector of length N. To make this representation invariant to the number of descriptors extracted per image, each histogram is normalized by its L2-norm.
To group the descriptors and construct the N visual words we use K-means clustering. This approach is selected over Expectation Maximization (EM) because many experimental studies have confirmed the computational efficiency of K-means compared to EM [14].
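The vocabulary construction and histogram encoding described above can be sketched as follows (an illustrative Python/NumPy stand-in for the Matlab implementation, using plain Lloyd's K-means and random vectors in place of real SURF descriptors):

```python
import numpy as np

def kmeans(descriptors, k, iters=20, seed=0):
    """Plain Lloyd's K-means with Euclidean distance (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    centroids = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest centroid
        dists = np.linalg.norm(descriptors[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):                      # recompute centroids
            if (labels == c).any():
                centroids[c] = descriptors[labels == c].mean(axis=0)
    return centroids

def bof_histogram(image_descriptors, centroids):
    """Map each descriptor to its nearest visual word, count, L2-normalize."""
    dists = np.linalg.norm(image_descriptors[:, None] - centroids[None], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centroids)).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-12)

rng = np.random.default_rng(1)
train_desc = rng.random((200, 64))   # descriptors pooled from training images
vocab = kmeans(train_desc, k=10)     # N = 10 visual words
h = bof_histogram(rng.random((30, 64)), vocab)  # encode one "image"
print(h.shape)  # (10,)
```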

IV. LEARNING AND RECOGNITION BASED ON BOF MODEL
Research in the computer vision field has led to many learning approaches that leverage the BoF model for image recognition. For multi-class classification problems, the evaluation metric used is the confusion matrix.
A confusion matrix is a table that makes it possible to visualize the accuracy of a supervised learning algorithm. Matrix columns represent the instances of a predicted class whereas rows represent the instances of an actual class (or vice versa). The name stems from the fact that the matrix makes it easy to see whether the system confuses two categories (i.e., mislabels one as another) [15].
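The construction of a confusion matrix, and the average accuracy read off its diagonal, can be sketched as follows (illustrative Python; labels and values are hypothetical):

```python
import numpy as np

def confusion_matrix(actual, predicted, n_classes):
    """Rows: actual class; columns: predicted class (one of the two
    conventions mentioned above)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        cm[a, p] += 1
    return cm

actual    = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(actual, predicted, 3)
accuracy = np.trace(cm) / cm.sum()   # diagonal = correctly classified
print(cm)
print(accuracy)  # 4 of 6 correct -> 0.666...
```

Off-diagonal entries expose confusions directly: here, one class-2 sample was mislabeled as class 0.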
In this work we investigate several supervised learning algorithms, such as SVM [16], k-nearest neighbors [17] and Boosted Regression Trees [18, 19], to classify an image. Each image in the dataset is encoded by its BoF histogram vector.

V. EXPERIMENTS
In the following we provide a summary of the different experiments used to evaluate the performance of our image classification machine learning framework. Our results are reported on the Caltech 101 image dataset, to which we have added some new images of existing categories. We are interested in stop sign category recognition.

A. SURF Local Feature Extractor and Descriptor
In this experiment, we test the SURF local feature extractor and its robustness in matching features even after image rotation and scaling (Fig. 3, Fig. 4).

B. Bag of Features Image Encoding
We use BoF to encode each image of the dataset into a feature vector that represents the histogram of visual word occurrences it contains (Fig. 5).

C. Classifier Training Process
The encoded training images are fed into a classifier training process to generate a predictive model. In this section, we are interested in measuring the classifier's average accuracy and its confusion matrix.
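To make the training/prediction step concrete, a minimal k-nearest-neighbors classifier over BoF histograms can be sketched as below. This is an illustrative Python stand-in (with hand-made toy histograms), not the Matlab classifiers evaluated in the experiments:

```python
import numpy as np

def knn_predict(train_hists, train_labels, query_hist, k=3):
    """k-nearest-neighbor majority vote over BoF histogram vectors,
    using Euclidean distance (illustrative sketch)."""
    dists = np.linalg.norm(train_hists - query_hist, axis=1)
    nearest = np.argsort(dists)[:k]      # indices of the k closest images
    votes = train_labels[nearest]
    return int(np.bincount(votes).argmax())

# toy training set: 3-word histograms for two hypothetical categories
train_hists = np.array([
    [0.9, 0.1, 0.0],   # category 0: mass on visual word 0
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],   # category 1: mass on visual word 2
    [0.0, 0.2, 0.8],
])
train_labels = np.array([0, 0, 1, 1])
pred = knn_predict(train_hists, train_labels, np.array([0.85, 0.15, 0.0]), k=3)
print(pred)  # 0 (the query histogram resembles category 0)
```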
The image categories used from the Caltech 101 dataset are described in TABLE I.

1) Experiment 1: Classifier Evaluation Based on the Number of Image Categories.
For these tests we use the SURF extractor and the Linear SVM classifier. We use 30% of the images for training and the remainder for validation. The obtained confusion matrices are shown in TABLE II, TABLE III, TABLE IV and TABLE V. As shown in Fig. 6, the classifier's average accuracy is influenced by the number of categories in the training dataset: this metric decreases as the number of categories increases.

2) Experiment 2: Evaluating the Image Category Classifier Using a Color Extractor.
The Linear SVM classifier is applied to the stop sign, ferry and laptop categories. We use a global color feature extractor instead of the SURF technique.
The achieved average accuracy is 0.76, as shown in TABLE VI.
We notice that in our approach it is better to use a local feature extractor (SURF) than a global feature extractor. This result is expected, as global feature extraction is better suited to scene categorization than to object classification [20].

3) Experiment 3: Training Learner Evaluation.
We next fix the number of categories to 4 and the feature extraction technique to SURF, and evaluate models while varying the classifier algorithm among the SVM, KNN and ensemble classifier categories. We then generate the histogram (Fig. 7) of average accuracy for each training classifier. Measurements show that the image classification process performs best with SVM classifiers: the Cubic SVM yields an average accuracy that reaches 90%. The KNN techniques offer an average accuracy of around 65%. Among the ensemble classifier trainers (the last two tested algorithms), bagged trees achieve the best accuracy.

VI. CONCLUSION
In this paper, we presented the different techniques and algorithms used in our machine learning framework for image classification. We reviewed the machine learning state of the art applied to computer vision. We introduced the Bag of Features paradigm and highlighted SURF as its technique for image feature extraction and description. Through experiments we showed that using the SURF local feature extractor and an SVM (Cubic SVM) training classifier yields the best average accuracy. In test scenarios we focused on stop sign images, as we plan to apply the trained classifier in a robotic system.