Figure 1 shows the general structure of the proposed method for image classification. After the corresponding descriptors are extracted for each training sample, the training data form a matrix in which each row is the feature vector of a sample and each column is a feature in the feature space. The training data are labeled, so the class of each sample is known at the start of the proposed algorithm. A parametric or non-parametric model is created from these training samples to predict the class of a test image. As shown in Fig. 1, feature extraction uses the HOG method, the LBP method, or a combination of the two. Classification uses one of the three methods mentioned above: KNN, SVM, and NB. When new samples (test images) arrive, the desired features are extracted, and the label of each unknown sample is estimated using the created model. After the test samples are classified, a confusion matrix is generated for each classification model. The performance of the different methods is examined using the confusion matrices.

Fig. 1. General structure of the proposed method, including HOG, LBP, KNN, SVM, and NB approaches


The HOG was introduced in 2005 for pedestrian detection in static images. Today, this method plays an important role in identifying humans in movies [29, 30], as well as in various other applications such as sketch-based image retrieval [31] and real-time vehicle detection [32].

To extract the HOG, the image is first filtered with horizontal and vertical derivative masks to obtain the image gradients in the x- and y-directions.

$${G}_x={D}_x\ast I$$

$${G}_y={D}_y\ast I$$

$${D}_y=\left[\begin{array}{c}-1\\ 0\\ 1\end{array}\right],\quad {D}_x=\left[\begin{array}{ccc}-1& 0& 1\end{array}\right]$$


In the above, I is the original image, and Dx and Dy are the filter masks in the x and y directions, respectively, defined as the vectors in Eq. (3). In ref. [29], more complex masks, including Sobel operators, were also used to calculate the image gradients, and their performance was evaluated. That study found that the masks of Eq. (3), in addition to being simple, lead to better results in pedestrian detection. In Eqs. (1) and (2), Gx and Gy denote the gradients of the image in the x and y directions, respectively, and the symbol * indicates the convolution operation. In a convolution, the neighbors of a central pixel are summed with specified weights, and the result becomes the new value of that pixel. The weights are given by a weight matrix, or convolution mask. After Gx and Gy are calculated, the magnitude and orientation of the gradient at each pixel are obtained as follows:

$$\left|G\left(i,j\right)\right|=\sqrt{{\left[{G}_x\left(i,j\right)\right]}^2+{\left[{G}_y\left(i,j\right)\right]}^2},\quad {\theta}_G={\tan}^{-1}\left[\frac{{G}_y\left(i,j\right)}{{G}_x\left(i,j\right)}\right]$$


Here, |G| is the magnitude of the gradient, θG is the gradient direction, and i and j index the rows and columns of the image, respectively. To calculate the gradient histogram in each cell, the gradient orientation is first limited to the range 0–180°, as follows:

$${\theta}_G^{\prime }=f\left({\theta}_G\right)=\left\{\begin{array}{ll}{\theta}_G,& 0{}^{\circ}\le {\theta}_G<180{}^{\circ}\\ {\theta}_G-180{}^{\circ},& 180{}^{\circ}\le {\theta}_G\le 360{}^{\circ}\end{array}\right.$$
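The gradient, magnitude, and orientation steps above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The function names are hypothetical, the image is assumed to be a list of rows of gray levels, border pixels are simply left at zero, and the masks are applied as a correlation (right minus left, lower minus upper). Flipping the masks, as a strict convolution would, negates both gradients and therefore leaves the folded orientation unchanged. The sketch also uses `atan2` followed by a modulo instead of the two-step arctangent and folding of the equations above.

```python
import math

def gradients(image):
    """Apply the [-1, 0, 1] masks horizontally and vertically (borders left at 0)."""
    rows, cols = len(image), len(image[0])
    gx = [[0.0] * cols for _ in range(rows)]
    gy = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if 0 < j < cols - 1:
                # Right neighbor minus left neighbor
                gx[i][j] = float(image[i][j + 1] - image[i][j - 1])
            if 0 < i < rows - 1:
                # Lower neighbor minus upper neighbor
                gy[i][j] = float(image[i + 1][j] - image[i - 1][j])
    return gx, gy

def magnitude_and_orientation(gx_ij, gy_ij):
    """Gradient magnitude and unsigned orientation (in degrees) for one pixel."""
    magnitude = math.hypot(gx_ij, gy_ij)
    # atan2 returns an angle in [-180, 180] degrees; the modulo folds it into the 0-180 range
    orientation = math.degrees(math.atan2(gy_ij, gx_ij)) % 180.0
    return magnitude, orientation
```

For an image whose gray level increases by one per column, the interior pixels have Gx = 2 and Gy = 0, giving a magnitude of 2 and an orientation of 0°.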


To calculate the histogram of gradients, the range 0–180° is divided into n equal intervals, where n is the number of gradient directions, i.e., the number of histogram bars (bins). Each interval forms a histogram channel. The range 0–180° is used instead of 0–360° because covering the full 360° range usually requires additional bars; the smaller range therefore reduces the feature extraction time. Experimental observations have also shown that using the 360° range improves the results little relative to the 180° range. As discussed in ref. [29], a nine-bar histogram achieved better results in experiments; accordingly, the present study uses the same number of bars to calculate the HOG.

To calculate the histogram, the image is divided into several cells. Each pixel then votes for one of the histogram channels, based on its gradient orientation. The votes are weighted by the magnitude of the gradient at that pixel. This generates, for each cell, a histogram that describes the gradients of its pixels. In some cases, the HOG is calculated for a block (consisting of several cells) by concatenating the histograms of adjacent cells (Fig. 2).

Fig. 2
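The voting scheme described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: each pixel votes only for the single bin that contains its orientation (implementations of ref. [29] typically also interpolate between neighboring bins and normalize each block), and the names `cell_histogram` and `block_descriptor` are hypothetical.

```python
def cell_histogram(magnitudes, orientations, n_bins=9):
    """Magnitude-weighted orientation histogram for the pixels of one cell.

    magnitudes and orientations are flat sequences of equal length;
    orientations are in degrees, in the 0-180 range.
    """
    bin_width = 180.0 / n_bins
    hist = [0.0] * n_bins
    for mag, theta in zip(magnitudes, orientations):
        # Hard assignment to the bin containing theta; the modulo wraps 180 back to bin 0
        index = int(theta // bin_width) % n_bins
        hist[index] += mag
    return hist

def block_descriptor(cell_histograms):
    """Concatenate the histograms of the cells that make up one block."""
    return [value for hist in cell_histograms for value in hist]
```

With nine bins, a block of four cells yields a descriptor of length 36.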


The LBP is another type of visual descriptor used in various machine vision applications [33,34,35]. It provides a powerful feature extraction method for tasks such as face description [36] and the analysis of wear in carpets [37].

The feature vector in the LBP for gray-scale images is calculated as follows.

  • The desired image is divided into several blocks, and each block is divided into several cells.

  • The following calculations are performed for each pixel in a cell.

  • Each pixel is compared to its eight neighboring pixels. The neighboring pixels are examined individually in a particular direction (e.g., clockwise).

  • When the center pixel is larger than the neighboring pixel, the number ‘0’ is written; otherwise, the number ‘1’ is written. In this manner, an eight-bit number is obtained by comparing the central pixel with its eight neighbors. For convenience, this number is usually converted to a decimal number in the range 0–255 (Fig. 3).

  • A histogram of the numbers obtained in the previous step is calculated for each cell. This histogram has 256 bars (from 0–255), and each bar shows the number of repetitions of a specific number in that cell.

  • If necessary, the desired histogram is normalized.

  • The histogram of the entire block is obtained by concatenating the histograms of its cells. Thus, if a block contains four cells, the generated feature vector has a length of 256 × 4.

Fig. 3

Figure 3 shows an example of binary-pattern calculations in a 3 × 3 neighborhood. The binary pattern 00010011 is assigned to the central pixel, with a gray level of ‘5’ in the image on the left. After completing the calculations for all of the blocks, the generated feature vectors are processed by using an appropriate model to categorize the desired images. These classifiers can be used to categorize objects, recognize faces, analyze textures, and so on.
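The per-pixel comparison and the cell histogram from the steps listed above can be sketched in a few lines. This is an illustrative sketch only. The starting neighbor (top-left) and the bit order (first neighbor as the most significant bit) are assumptions, since any fixed clockwise order works as long as it is applied consistently. The function names are hypothetical, and the image is assumed to be a list of rows of gray levels.

```python
# Clockwise neighbor offsets (row, column), starting at the top-left pixel.
# The starting point and bit order are assumptions of this sketch.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                    (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(image, i, j):
    """Eight-bit LBP code of the interior pixel (i, j) of a gray-scale image."""
    center = image[i][j]
    code = 0
    for di, dj in NEIGHBOR_OFFSETS:
        # The bit is 0 when the center is larger than the neighbor, 1 otherwise
        bit = 0 if center > image[i + di][j + dj] else 1
        code = (code << 1) | bit
    return code

def lbp_histogram(cell):
    """256-bar histogram of the LBP codes of the interior pixels of one cell."""
    hist = [0] * 256
    for i in range(1, len(cell) - 1):
        for j in range(1, len(cell[0]) - 1):
            hist[lbp_code(cell, i, j)] += 1
    return hist
```

For the hypothetical neighborhood `[[6, 2, 1], [7, 5, 3], [9, 8, 4]]`, the neighbors read clockwise from the top-left are 6, 2, 1, 3, 4, 8, 9, 7, which gives the pattern 10000111, i.e., the decimal value 135.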

Different variants of the LBP algorithm have been proposed, each modifying the original algorithm in some way. One of the most useful and widely used variants is the uniform pattern, which can significantly reduce the length of the feature vector [35]. The idea stems from the observation that certain binary patterns, called uniform binary patterns, occur particularly often.

A binary pattern generated for a pixel is called uniform if it contains at most two 0–1 or 1–0 transitions. For example, 00010000 is a uniform pattern with two transitions (one 0–1 and one 1–0), whereas the pattern 01010111, with five transitions, is not uniform. The most frequently occurring uniform binary patterns correspond to basic features of the image, such as edges, corners, and other important points [34].

Therefore, uniform patterns can be regarded as indicators of the main features of an image. Each uniform pattern is assigned its own histogram bin, and all non-uniform patterns share a single bin. As there are 58 uniform patterns among the 256 values in the range 0–255, the uniform LBP feature vector has a length of 59, a significant reduction from the length of 256 in an ordinary LBP.
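The count of 58 uniform patterns, and the resulting 59 bins, can be checked with a short sketch. One assumption is made here: transitions are counted circularly, so the last bit is also compared with the first. Under this convention 01010111 has six transitions rather than the five counted along the string, but both conventions select the same set of uniform patterns. The function names are hypothetical.

```python
def count_transitions(pattern, bits=8):
    """Number of 0-1 and 1-0 changes when the bit string is read circularly."""
    mask = (1 << bits) - 1
    # Rotate left by one bit so that each bit lines up with its neighbor
    rotated = ((pattern << 1) | (pattern >> (bits - 1))) & mask
    return bin(pattern ^ rotated).count("1")

def uniform_lbp_bins(bits=8):
    """Map each pattern to a bin: one bin per uniform pattern, one shared bin for the rest."""
    uniform = [p for p in range(1 << bits) if count_transitions(p, bits) <= 2]
    shared_bin = len(uniform)
    lookup = {p: shared_bin for p in range(1 << bits)}
    for index, p in enumerate(uniform):
        lookup[p] = index
    return lookup
```

Applying `uniform_lbp_bins()` to the eight-bit codes of a cell maps them to bin indices 0–58, where index 58 collects all non-uniform patterns.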

In this manner, by comparing local neighborhoods and calculating a uniform LBP histogram, an image signature is created to represent the type of texture. The generated signatures are sufficiently distinctive for images falling into different classes. Therefore, the LBP can be used to classify the textures.

KNN algorithm

The KNN algorithm is one of the most common non-parametric classification methods [10,11,12, 38]. Non-parametric methods do not need to estimate model parameters in the learning phase. Instead, the algorithm uses the training data directly to decide which class a new sample belongs to. The advantage of these methods is that they do not require parameter estimation and are usually more accurate than parametric methods; their main limitation is that they require all of the training samples to classify a new sample. This increases the memory and computational costs, especially for large datasets. The KNN classification steps used to categorize images are as follows.

  • The training dataset and its labels are loaded, and a value of K is chosen as the number of neighbors.

  • The distance between the test image and each training sample is calculated.

  • The training samples are sorted in ascending order of the distances calculated in the previous step.

  • The first K items are selected from the sorted set.

  • The labels of the items selected in the previous step are checked.

  • The class that occurs most frequently among these labels is selected as the predicted class for the test sample.
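The steps above translate almost line by line into a minimal sketch. The Euclidean distance is an assumption (the text does not name a distance measure), and `knn_predict` is a hypothetical name rather than the authors' code. Feature vectors are assumed to be sequences of numbers of equal length.

```python
import math
from collections import Counter

def knn_predict(train_features, train_labels, test_feature, k):
    """Predict the label of one test sample by majority vote among its k nearest neighbors."""
    # Distance from the test sample to every training sample (Euclidean is an assumption)
    distances = [math.dist(x, test_feature) for x in train_features]
    # Sort the training indices by ascending distance and keep the first k
    nearest = sorted(range(len(train_features)), key=distances.__getitem__)[:k]
    # Majority vote over the labels of the selected neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

Because every prediction scans all training samples, the cost of this sketch grows linearly with the size of the training set, which is the memory and computation limitation noted above.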

Figure 4 shows an example of two-class classification using the KNN algorithm. In this case, if K = 3, the test sample is assigned to class B, because two of the three closest neighbors are labeled B and one is labeled A.

Fig. 4. Two-class classification using the KNN algorithm

If K = 6, the situation is different: four neighbors are labeled A and two are labeled B, so the test sample is assigned to class A. The accuracy of the classification algorithm can be verified by comparing the predicted labels with the actual labels of the test samples.
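Comparing predicted and actual labels is commonly summarized in a confusion matrix, from which the accuracy follows directly. The sketch below is a minimal illustration with hypothetical function names; it assumes that rows index the actual class and columns the predicted class, a convention the text does not specify.

```python
def confusion_matrix(actual, predicted, classes):
    """Count samples per (actual class, predicted class) pair."""
    index = {c: n for n, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e., correctly classified samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total
```

For example, with actual labels A, A, B, B, B and predicted labels A, B, B, B, A, the matrix is [[1, 1], [1, 2]] and the accuracy is 3/5.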

SVM-based classifier

The SVM is currently one of the most widely used classification methods [5, 6]. Its current popularity can be compared with that of neural networks over the past decade. The SVM is based on a linear classification of data. Figure 5 shows an example of a dataset that can be separated linearly, together with several candidate separating lines. Among these, the SVM tries to select the line with the most reliable (widest) margin. The optimal linear separator is found using quadratic programming, a well-known method for solving constrained optimization problems.

Fig. 5

The basic idea of the SVM is that, assuming the categories are linearly separable, a line with the maximum margin is found to separate them. To find such a separator, two boundary lines are drawn parallel to the separating line and are moved apart until they touch the data. Among the linear separators, the one that maximizes the margin to the training data minimizes the generalization error. The training data closest to the separating line are called the support vectors (Fig. 5). Notably, for dimensions greater than two, the term ‘hyperplane’ is used instead of ‘line.’ A hyperplane is a geometric concept that generalizes the plane to n dimensions. In other words, a hyperplane defines a k-dimensional subspace of an n-dimensional space with k < n (for a separating hyperplane, k = n − 1).
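For reference, the maximum-margin problem described above is usually written as the following quadratic program. This is the standard textbook hard-margin formulation and is not reproduced from this paper; the symbols w, b, x_i, y_i, and N are notation introduced here.

$$\min_{\mathbf{w},b}\ \frac{1}{2}{\left\Vert \mathbf{w}\right\Vert}^2\quad \mathrm{subject}\ \mathrm{to}\quad {y}_i\left({\mathbf{w}}^{\top }{\mathbf{x}}_i+b\right)\ge 1,\quad i=1,\dots ,N$$

Here, x_i are the training feature vectors and y_i ∈ {−1, +1} are their labels. The two boundary hyperplanes are w⊤x + b = ±1, and the distance between them is 2/‖w‖, so minimizing ‖w‖ maximizes the margin. The support vectors are the training samples that satisfy the constraint with equality.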

One of the most useful properties of the SVM is that, when the data are not linearly separable, it maps the data to a higher-dimensional space using a nonlinear mapping function Φ. The data can then be linearly separated in this new space.

This implies that samples that are not linearly separable in their original space, I, are moved to a new feature space, F, in which a hyperplane can be created to separate them. When this hyperplane is mapped back to the original space I, it forms a nonlinear curve. As shown in Fig. 6, the input data are not linearly separable, and no line can accurately represent the boundary between the two classes. However, by mapping them from a two-dimensional space to a three-dimensional space, it is possible to create a hyperplane for separating the boundaries of these two classes.

Fig. 6. Mapping from two-dimensional space to three-dimensional space
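The effect of such a mapping can be illustrated with a toy example. The mapping below, which appends the squared distance from the origin as a third coordinate, and the sample points are hypothetical choices made for illustration; they are not the mapping or the data shown in Fig. 6.

```python
def lift(point):
    """Hypothetical mapping from 2-D to 3-D: append the squared distance from the origin."""
    x1, x2 = point
    return (x1, x2, x1 * x1 + x2 * x2)

# Toy data: class A lies near the origin and class B surrounds it,
# so no straight line in the plane separates the two classes.
class_a = [(0.0, 0.5), (0.5, 0.0), (-0.4, -0.3), (0.2, -0.6)]
class_b = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5), (1.4, 1.4)]

# In the lifted space, the plane z = 1.5 separates the classes.
threshold = 1.5
a_below = all(lift(p)[2] < threshold for p in class_a)
b_above = all(lift(p)[2] > threshold for p in class_b)
print(a_below, b_above)  # prints: True True
```

Mapped back to the original plane, the separating plane z = 1.5 corresponds to the circle x1² + x2² = 1.5, which is a nonlinear boundary.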

NB classifier

NB is a probability-based machine-learning algorithm that can be used for a wide range of classification problems [7,8,9]. Common applications of the NB algorithm include spam filtering, document classification, and emotion prediction.

The NB algorithm uses Bayes’ theorem to produce results, based on an assumption of strong independence between the features. This implies that changing the value of one feature does not directly affect the value of any other feature. Although this assumption is naive for real-world datasets (as the algorithm’s name implies), the NB classifier has nevertheless found a worthy place among classification algorithms.

Suppose X = (x1, x2, …, xn) expresses a data sample as a vector of n independent variables. To calculate the probability P[Ck | (x1, x2, …, xn)] that the sample belongs to class Ck, it is sufficient to write the joint probability of the class and the features and to simplify it using the conditional independence of the variables.
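Under the independence assumption, this simplification gives the usual NB decision rule, in which the posterior is proportional to the class prior multiplied by the per-feature likelihoods:

$$P\left[{C}_k|\left({x}_1,{x}_2,\dots ,{x}_n\right)\right]\propto P\left({C}_k\right)\prod_{i=1}^nP\left({x}_i|{C}_k\right)$$

The sketch below applies this rule with Gaussian per-feature likelihoods. The Gaussian model is an assumption made here for continuous features; the text does not state which likelihood model is used. The function names are hypothetical, and logarithms are used to avoid numerical underflow.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(samples, labels):
    """Estimate the prior and the per-feature mean and variance of each class."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    model = {}
    for label, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance positive (an assumption of this sketch)
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(samples), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the largest log P(Ck) + sum over i of log P(xi | Ck)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for value, mean, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)
```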

Deep learning methods

YOLO is a state-of-the-art real-time object detection system. It applies a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region. The bounding boxes are weighted by the predicted probabilities.

To use the YOLO algorithm, it is first necessary to prepare training images.

Preparing a training set for the YOLO algorithm differs from the approach used for traditional methods such as SVM or KNN. A bounding box must be drawn around each object, and the user specifies the corresponding class. Various programs can be used to draw bounding boxes and tag them. In this study, the MAKESENSE program is used to tag the training images.

In this manner, a text file is created for each training image, in which each row gives the specifications of one object. The first number is the class of the object; it is followed by the coordinates of the center of the rectangle and the rectangle's length and width. The number of rows in the file is equal to the number of objects in the image. Figure 7 shows an example of the training data and the file generated by the MAKESENSE program. This program is also used to tag images for the other deep learning method (Faster R-CNN).

Fig. 7. Preparing training data set for deep learning methods. (a): Labeling training images; (b): Output text file
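A label file of this kind can be read with a short sketch. It relies on two assumptions that the text does not state explicitly but that match the usual YOLO text format: the four values after the class index are ordered as center x, center y, width, height, and they are expressed as fractions of the image width and height. The function names are hypothetical.

```python
def parse_yolo_line(line, image_width, image_height):
    """Convert one label row to (class_id, x_min, y_min, x_max, y_max) in pixels."""
    fields = line.split()
    class_id = int(fields[0])
    # Assumed: the four values are fractions of the image width and height
    x_center, y_center, width, height = (float(v) for v in fields[1:5])
    x_min = (x_center - width / 2) * image_width
    y_min = (y_center - height / 2) * image_height
    x_max = (x_center + width / 2) * image_width
    y_max = (y_center + height / 2) * image_height
    return class_id, x_min, y_min, x_max, y_max

def parse_yolo_file(text, image_width, image_height):
    """Return one bounding box per non-empty row of the label file."""
    return [parse_yolo_line(row, image_width, image_height)
            for row in text.splitlines() if row.strip()]
```

For a 640 × 480 image, the row `0 0.5 0.5 0.25 0.5` describes an object of class 0 whose box spans columns 240–400 and rows 120–360.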

Faster R-CNN is a deep convolutional network used for object detection, and works as a single, end-to-end, unified network for the user. The network can predict the locations of multiple objects in a short time. Researchers at UC Berkeley developed R-CNN [25] in 2014. The R-CNN is a deep convolutional network capable of detecting 80 different types of objects in images. In comparison with the generic pipeline for the object detection methods, the foremost contribution of R-CNN is the extraction of features based on a CNN.

The R-CNN consists of three principal modules. The first module produces 2000 region proposals using a selective search algorithm. After each region proposal is resized to a fixed predefined size, the second module extracts a feature vector of length 4096 from it. The third module uses a pre-trained SVM algorithm to classify the region proposal into one of the object classes or as background. The R-CNN model has several weaknesses. It is a multistage model in which each stage is an independent part; consequently, it cannot be trained end-to-end. It caches the features extracted by a pre-trained CNN on disk to train the SVMs, which requires bulk storage on the order of gigabytes. It depends on a selective search algorithm for creating region proposals, which takes a long time; in addition, this algorithm cannot be customized for the detection problem. Finally, each region proposal is fed to the CNN independently for feature extraction, which makes the R-CNN unsuitable for real-time use. As an extension of the R-CNN model, the Fast R-CNN model was proposed [24] to overcome some of these limitations.
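The storage requirement mentioned above can be estimated with simple arithmetic. The sketch below uses the 2000 proposals and 4096-dimensional features given in the text; the 4 bytes per value (32-bit floats, uncompressed) and the 5000-image dataset are hypothetical assumptions chosen only to illustrate the order of magnitude.

```python
def cached_feature_bytes(num_images, proposals_per_image=2000,
                         feature_length=4096, bytes_per_value=4):
    """Bytes needed to cache one feature vector per region proposal."""
    return num_images * proposals_per_image * feature_length * bytes_per_value

per_image_mb = cached_feature_bytes(1) / 1e6    # 32.768 MB for a single image
dataset_gb = cached_feature_bytes(5000) / 1e9   # 163.84 GB for a hypothetical 5000-image set
```

Under these assumptions, even a few thousand training images already require storage well into the gigabyte range.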
