The K-Nearest Neighbor (KNN) Classification Example in R

   K-Nearest Neighbor (KNN) is a supervised machine learning algorithm used to solve classification and regression problems. The basic idea is to predict the target class of a given data point from the training points nearest to it, where nearness is measured with a distance metric (Minkowski, Euclidean, Manhattan, etc.).
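To make the distance metrics concrete, here is a small sketch that computes the Manhattan and Euclidean distances between two points as special cases of the Minkowski distance. The helper name minkowski is hypothetical and written only for this illustration; base R's dist() function is used as a cross-check.

```r
# Two points with four numeric features each (values of the first two iris rows).
a <- c(5.1, 3.5, 1.4, 0.2)
b <- c(4.9, 3.0, 1.4, 0.2)

# Hypothetical helper: Minkowski distance of order p.
# p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance.
minkowski <- function(u, v, p) sum(abs(u - v)^p)^(1 / p)

d_manhattan <- minkowski(a, b, p = 1)
d_euclidean <- minkowski(a, b, p = 2)

# Base R's dist() computes the same quantities between the rows of a matrix.
m <- rbind(a, b)
d_manhattan_base <- as.numeric(dist(m, method = "manhattan"))
d_euclidean_base <- as.numeric(dist(m, method = "euclidean"))

print(c(manhattan = d_manhattan, euclidean = d_euclidean))
```

The choice of metric changes which points count as "nearest", so it can change the predicted class.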

   Based on the distance values, KNN finds the K closest items in the dataset and decides the classification. K is an integer and refers to the number of training samples closest to the element that needs to be classified. To predict a label, the model searches the entire training dataset for the k nearest instances and assigns the label observed most frequently among them. KNN does not build a model from the dataset in advance; it computes the result from the stored training data at prediction time, and is therefore called a lazy learner.
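The procedure described above can be sketched in a few lines of base R. This is only an illustration of the idea, assuming Euclidean distance and a simple majority vote; it is not how the 'class' package implements knn, and the helper name knn_one is hypothetical.

```r
# Hypothetical helper: classify one query point by majority vote among
# its k nearest training rows, using Euclidean distance.
knn_one <- function(train_x, train_y, query, k = 3) {
  train_x <- as.matrix(train_x)
  # Distance from the query to every training row.
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  # Labels of the k closest training rows.
  nearest <- train_y[order(d)[seq_len(k)]]
  # Most frequent label among them (a tie goes to the first maximum).
  votes <- table(nearest)
  names(votes)[which.max(votes)]
}

data("iris")
# Hold out row 1 and row 150 as query points; train on the rest.
held_out <- c(1, 150)
train_x <- iris[-held_out, 1:4]
train_y <- iris$Species[-held_out]

pred_1 <- knn_one(train_x, train_y, unlist(iris[1, 1:4]), k = 3)
pred_150 <- knn_one(train_x, train_y, unlist(iris[150, 1:4]), k = 3)
print(c(pred_1, pred_150))
```

Note that all the work happens inside the prediction call, which is what "lazy learning" refers to.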

   In this tutorial, we'll learn how to classify the Iris dataset with the KNN model in R. We use the 'class' package's 'knn' function. The tutorial covers:
  1. Preparing the data
  2. Defining the model
  3. Source code listing

We'll start by loading the required packages in R.

library(class)
library(caret)


Preparing the data

We use the Iris dataset in this tutorial. First, we'll load the dataset. You can check its content with the str() command.


data("iris")
str(iris)

'data.frame': 150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Next, we'll split the dataset into train and test parts. Here 85 percent of the rows go to the train part and the rest to the test part.

indexes = createDataPartition(iris$Species, p = .85, list = FALSE)
train = iris[indexes, ]
test = iris[-indexes, ]
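Because the partition is drawn at random, each run produces a different split. The sketch below fixes a seed so the split is repeatable and prints the number of rows per class in each part. The seed value 123 is an arbitrary choice for this illustration and is not part of the original tutorial.

```r
library(caret)

data("iris")
set.seed(123)  # arbitrary seed, used only to make the split repeatable
indexes <- createDataPartition(iris$Species, p = .85, list = FALSE)
train <- iris[indexes, ]
test <- iris[-indexes, ]

# Rows per class in each part; createDataPartition samples within each class,
# so the class proportions are preserved.
print(table(train$Species))
print(table(test$Species))
```

With 50 rows per class, each class contributes 7 rows to the test part, which matches the counts in the confusion matrix shown later.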

Next, we'll separate the feature columns (x) from the label column (y) in both the train and test data. Column 5, Species, is the label.

xtrain = train[,-5]
ytrain = train[,5]
xtest = test[,-5]
ytest = test[, 5]


Defining the model

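   The knn function takes the train features, the test features, and a classification factor (cl) parameter, which is the label part of the train data. We'll set the number of neighbors parameter, k, to 3. Since KNN has no separate training step, the function directly returns the predicted classes for the test data.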

yhat = knn(xtrain, xtest, ytrain, k=3)
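The value k = 3 is one reasonable choice among several. As a rough sketch, the following compares test accuracy for a few values of k on a single split. The seed and the list of k values are assumptions made for this illustration, and accuracy on one small test set is only a coarse guide.

```r
library(class)
library(caret)

data("iris")
set.seed(123)  # arbitrary seed; knn also breaks ties at random
indexes <- createDataPartition(iris$Species, p = .85, list = FALSE)
xtrain <- iris[indexes, -5]
ytrain <- iris[indexes, 5]
xtest <- iris[-indexes, -5]
ytest <- iris[-indexes, 5]

# Fraction of correctly classified test rows for each candidate k.
k_values <- c(1, 3, 5, 7, 9)
accuracy <- sapply(k_values, function(k) {
  yhat <- knn(xtrain, xtest, ytrain, k = k)
  mean(yhat == ytest)
})
print(data.frame(k = k_values, accuracy = accuracy))
```

Odd values of k are often preferred because they reduce the chance of tied votes.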

Next, we'll check the prediction accuracy with the confusion matrix function.

cm = confusionMatrix(data = yhat, reference = ytest)
print(cm)

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa          7          0         0
  versicolor      0          7         0
  virginica       0          0         7

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.8389, 1)
    No Information Rate : 0.3333     
    P-Value [Acc > NIR] : 9.56e-11   
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           1.0000
Specificity                 1.0000            1.0000           1.0000
Pos Pred Value              1.0000            1.0000           1.0000
Neg Pred Value              1.0000            1.0000           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.3333
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            1.0000           1.0000
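The printed report can also be read programmatically. The sketch below rebuilds the model on a seeded split (the seed is an assumption for this illustration, so the numbers may differ from the output above) and pulls individual values out of the object returned by confusionMatrix, passing the predictions as data and the true labels as reference.

```r
library(class)
library(caret)

data("iris")
set.seed(123)  # arbitrary seed, used only to make the run repeatable
indexes <- createDataPartition(iris$Species, p = .85, list = FALSE)
xtrain <- iris[indexes, -5]
ytrain <- iris[indexes, 5]
xtest <- iris[-indexes, -5]
ytest <- iris[-indexes, 5]

yhat <- knn(xtrain, xtest, ytrain, k = 3)
cm <- confusionMatrix(data = yhat, reference = ytest)

acc <- cm$overall["Accuracy"]            # overall accuracy
counts <- cm$table                       # prediction vs. reference counts
sens <- cm$byClass[, "Sensitivity"]      # per-class sensitivity

print(acc)
print(counts)
print(sens)
```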


   In this tutorial, we've briefly learned how to classify data with the knn model in R. The full source code is listed below.


Source code listing

library(class)
library(caret)
 
data("iris")
str(iris)
 
indexes = createDataPartition(iris$Species, p = .85, list = FALSE)
train = iris[indexes, ]
test = iris[-indexes, ]
 
xtrain = train[,-5]
ytrain = train[,5]
xtest = test[,-5]
ytest = test[, 5]
 
yhat = knn(xtrain, xtest, ytrain, k=3)
 
cm = confusionMatrix(data = yhat, reference = ytest)
print(cm)

