Based on distance values, KNN finds the K closest items in a dataset and decides the classification by a majority vote among them. K is an integer value and it refers to the number of training samples that are closest to the element that needs to be classified. To predict an observation, the model searches the entire training dataset and returns the most frequently observed label among the k nearest neighbors. KNN does not learn a model from the dataset in advance; it computes the result directly from the input data at prediction time, and for this reason it is called lazy learning.
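To make the idea concrete, below is a minimal sketch of a single KNN prediction in plain R. The tiny training table, labels, and query point are made-up values used only for illustration; they are not part of the iris example that follows.

# A minimal sketch of one KNN prediction (made-up data, for illustration only)
train_x <- data.frame(f1 = c(1.0, 1.2, 3.5, 3.7, 3.9),
                      f2 = c(2.0, 1.8, 0.5, 0.7, 0.6))
train_y <- factor(c("A", "A", "B", "B", "B"))
query   <- c(f1 = 3.6, f2 = 0.6)
k <- 3

# Euclidean distance from the query point to every training item
d <- sqrt(rowSums(sweep(as.matrix(train_x), 2, query)^2))

# Take the k closest items and vote on their labels
nearest <- order(d)[1:k]
votes   <- table(train_y[nearest])
names(votes)[which.max(votes)]   # predicted label ("B" here)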
In this tutorial, we'll learn how to classify the Iris dataset with the KNN model in R. We use the 'class' package's 'knn' function. The tutorial covers:
- Preparing the data
- Defining the model
- Source code listing
library(class)   # provides the knn() function
library(caret)   # provides createDataPartition() and confusionMatrix()
Preparing the data
We use the iris dataset in this tutorial. First, we'll load the dataset; you can check its content with the str() command.
data("iris")
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ..
Next, we'll split the dataset into train and test parts.
indexes=createDataPartition(iris$Species, p=.85, list = F)
train = iris[indexes, ]
test = iris[-indexes, ]
We'll extract the x (feature) and y (label) parts from the train and test data. Here, column 5 is Species, which is the label column.
xtrain = train[,-5]
ytrain = train[,5]
xtest = test[,-5]
ytest = test[, 5]
Defining the model
The knn function requires a classification factor (cl) parameter, which is the label part of the train data. We'll set the number of neighbors parameter, k, to 3. The model computes the predictions directly from the input data.
yhat = knn(xtrain, xtest, ytrain, k=3)
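The value k = 3 is a simple default choice here. As an optional side step, k could also be tuned with cross-validation through caret's train() function; the sketch below is one possible way to do this, and the fold count and candidate k values are arbitrary choices for illustration.

# Optional sketch: tune k with 5-fold cross-validation (settings are illustrative)
ctrl <- trainControl(method = "cv", number = 5)
fit  <- train(Species ~ ., data = train, method = "knn",
              trControl = ctrl, tuneGrid = data.frame(k = seq(1, 15, 2)))
fit$bestTune   # the k value with the best cross-validated accuracy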
Next, we'll check the prediction accuracy with caret's confusionMatrix() function. The predicted labels go in the first argument and the actual labels in the second.
cm = confusionMatrix(yhat, ytest)
print(cm)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 7 0 0
versicolor 0 7 0
virginica 0 0 7
Overall Statistics
Accuracy : 1
95% CI : (0.8389, 1)
No Information Rate : 0.3333
P-Value [Acc > NIR] : 9.56e-11
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 1.0000 1.0000
Specificity 1.0000 1.0000 1.0000
Pos Pred Value 1.0000 1.0000 1.0000
Neg Pred Value 1.0000 1.0000 1.0000
Prevalence 0.3333 0.3333 0.3333
Detection Rate 0.3333 0.3333 0.3333
Detection Prevalence 0.3333 0.3333 0.3333
Balanced Accuracy 1.0000 1.0000 1.0000
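As a quick sanity check, the overall accuracy can also be computed directly by comparing the predicted labels with the actual test labels.

# Fraction of test samples whose predicted label matches the actual label
mean(yhat == ytest)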
In this tutorial, we've briefly learned how to classify data with the KNN model in R. The full source code is listed below.
Source code listing
library(class)
library(caret)
data("iris")
str(iris)
indexes=createDataPartition(iris$Species, p=.85, list = F)
train = iris[indexes, ]
test = iris[-indexes, ]
xtrain = train[,-5]
ytrain = train[,5]
xtest = test[,-5]
ytest = test[, 5]
yhat = knn(xtrain, xtest, ytrain, k=3)
cm = confusionMatrix(yhat, ytest)
print(cm)