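In this post, we'll learn how to use the lof() function to extract outliers from a dataset using a decision threshold. LOF (Local Outlier Factor) scores each point by comparing its local density with the density around its nearest neighbors. For this tutorial, we'll need the 'Rlof' package in R. We'll start by installing the package.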
install.packages("Rlof")
Then we can load the package.
library(Rlof)
Preparing the data
First, we'll generate a sample dataset for this tutorial and visualize it in a plot.
set.seed(124)
test = runif(100)*10
test[sample(1:100, 6)] = sample(-10:30, 6)
plot(test, col="blue", type='p', pch=19)
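Because the injected values are drawn from -10:30, some of them can land inside the 0 to 10 range of the regular data and will not look like outliers at all. The sketch below is a variant of the data generation that keeps the injected positions and values in named variables so this can be checked. The names idx, injected, and outside are introduced here for illustration, and the draws are not guaranteed to match the exact values used in the post.

```r
# Variant of the data generation that keeps track of what was injected.
# idx, injected and outside are illustrative names, not part of the original code.
set.seed(124)
test <- runif(100) * 10            # regular data in the range 0..10
injected <- sample(-10:30, 6)      # candidate outlier values
idx <- sample(1:100, 6)            # positions to overwrite
test[idx] <- injected

# Only values outside 0..10 can stand out from the regular data
outside <- injected < 0 | injected > 10
print(data.frame(index = idx, value = injected, outside_range = outside))
```

Injected values that fall inside the regular range are unlikely to receive a high LOF score, so fewer than six points may be flagged later.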
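Computing the LOF scores
Next, we calculate the LOF score for each element in the test data. Here, we set the k argument to 5, which is the k-th nearest-neighbor distance used to calculate the LOFs. Scores close to 1 mean a point is about as dense as its neighbors, while scores well above 1 suggest an outlier. We can print the first few scores with head().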
mlof = lof(test, k=5)
head(mlof)
[1] 1.0379733 1.0355735 1.0372052 0.9481038 0.9537252 1.1713114
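To make the role of k more concrete, here is a minimal base-R sketch of the LOF calculation itself. It is not the Rlof implementation: lof_sketch is a hypothetical helper written for illustration, it assumes there are no duplicate values or tied distances, and it always uses exactly k neighbors, so its scores may differ slightly from what lof() returns.

```r
# Minimal base-R sketch of LOF for a numeric vector.
# lof_sketch is a hypothetical helper, not a function from Rlof.
# Assumes no duplicate values / tied distances.
lof_sketch <- function(x, k = 5) {
  n <- length(x)
  d <- as.matrix(dist(x))   # pairwise distances
  diag(d) <- Inf            # a point is not its own neighbor

  # k nearest neighbors of each point (one column per point)
  nn <- matrix(apply(d, 1, function(row) order(row)[1:k]), nrow = k)

  # k-distance: distance from each point to its k-th nearest neighbor
  kdist <- sapply(1:n, function(i) d[i, nn[k, i]])

  # Local reachability density: inverse of the mean reachability distance,
  # where reach-dist(p, o) = max(k-distance(o), d(p, o))
  lrd <- sapply(1:n, function(i) {
    reach <- pmax(kdist[nn[, i]], d[i, nn[, i]])
    1 / mean(reach)
  })

  # LOF: mean density of the neighbors divided by the point's own density
  sapply(1:n, function(i) mean(lrd[nn[, i]]) / lrd[i])
}

# Usage: 30 values in 0..1 plus one value far away at position 31
set.seed(1)
x <- c(runif(30), 5)
scores <- lof_sketch(x, k = 3)
print(which.max(scores))
```

A larger k compares each point against a wider neighborhood, which smooths the scores, while a small k makes them more sensitive to tiny local clusters.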
Next, we check the quantiles of the mlof scores to see how they are distributed.
quantile(mlof)
        0%        25%        50%        75%       100%
 0.9007470  0.9794372  1.0363230  1.1577253 26.9784639
Here, I use the 97th percentile of the scores as the threshold for deciding whether a value is an outlier. You may change it according to your data density.
quantile(mlof, .97)
     97%
1.976995
thr = quantile(mlof, .97)
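The choice of percentile directly controls how many points get flagged. The sketch below illustrates this on a made-up vector of LOF-like scores (the scores values are hypothetical, not the ones computed in this post), and also shows a fixed cutoff as an alternative. The value 1.5 is an arbitrary example, not a recommended setting.

```r
# Hypothetical LOF-like scores: 97 values near 1 plus three clearly larger ones
scores <- c(seq(0.9, 1.2, length.out = 97), 2.5, 4.0, 27.0)

# How the number of flagged points changes with the percentile
for (p in c(0.90, 0.95, 0.97, 0.99)) {
  thr_p <- quantile(scores, p)
  cat(sprintf("p = %.2f  threshold = %.3f  flagged = %d\n",
              p, thr_p, sum(scores >= thr_p)))
}

# Alternative: a fixed cutoff on the score itself (1.5 is an arbitrary example)
fixed_flagged <- which(scores > 1.5)
print(fixed_flagged)
```

A percentile threshold always flags roughly the same share of the data, even when there are no real outliers, whereas a fixed cutoff flags nothing if every score stays near 1.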
Next, we'll extract the elements that are equal to or higher than the threshold value from test data.
out_index = which(mlof >= thr)
print(out_index)
[1] 14 19 34
print(test[out_index])
[1] 18 -4 -3
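When there are more than a few flagged points, it can help to collect the index, value, and score in one table and sort it by score. The following is a small sketch of that idea using stand-in vectors. The values, scores, and thr used here are hypothetical and chosen only for illustration, and in the tutorial they would correspond to test, mlof, and thr.

```r
# Stand-in data (hypothetical): observations and matching LOF-like scores
values <- c(4.2, 18, 5.1, -4, 6.3, -3, 5.5)
scores <- c(1.0, 9.5, 1.1, 3.2, 0.95, 2.4, 1.05)
thr <- 2   # illustrative threshold

# Collect the flagged points in a data frame, highest score first
idx <- which(scores >= thr)
outliers <- data.frame(index = idx, value = values[idx], lof = scores[idx])
outliers <- outliers[order(-outliers$lof), ]
print(outliers)
```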
Finally, we'll plot the results to check the outliers in a chart.
plot(test, col="blue", type='p', pch=19)
points(x=out_index, y=test[out_index], pch=19, col="red")
The plot shows the detected outliers in the test data as red points.
In this post, we've briefly learned how to use the lof() function to find outliers in a dataset.
Source code listing
install.packages("Rlof")
library(Rlof)
set.seed(124)
test = runif(100)*10
test[sample(1:100, 6)] = sample(-10:30, 6)
plot(test, col="blue", type='p', pch=19)
mlof = lof(test, k=5)
print(mlof)
quantile(mlof)
quantile(mlof, .97)
thr = quantile(mlof, .97)
out_index = which(mlof >= thr)
print(out_index)
print(test[out_index])
plot(test, col="blue", type='p', pch=19)
points(x=out_index, y=test[out_index], pch=19, col="red")