The empirical rule, or 68-95-99.7 rule, gives the percentage of data that lies within 1, 2, and 3 standard deviations of the mean in a normal distribution.
In this post, we'll briefly go over standard deviation and the empirical rule in R.
Preparing the data
First, we'll generate sample data for this tutorial. We can create it with the sample() command in R.
set.seed(1234)
x <- sample(-50:50, 100, replace = TRUE)
# Overwrite 80 randomly chosen positions with values from a narrower range
x[sample(1:100, 80)] <- sample(-20:20, 80, replace = TRUE)
We'll visualize the x data in a plot.
plot(x, type = "l", col = "blue")
Before moving on to the standard deviation, we need to understand the mean value of the given vector.
The mean (μ) is the central value of the elements in a numerical set. We can get the mean value of the x vector with the mean() command in R.
mean(x)
[1] -0.93
Standard deviation measures how much the elements of a set vary from the mean value. It is often written as σ, std, or SD. To get the σ value, we'll use the sd() command in R.
sd(x)
[1] 16.07951
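Variance measures the average squared deviation of the elements from the mean value of a set (R's var() divides by n - 1, which gives the sample variance). It equals the square of the standard deviation, so we can get it with either of the commands below.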
var(x)
[1] 258.5506
sd(x)^2
[1] 258.5506
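As a rough sketch of what mean(), var(), and sd() compute, the formulas can be written out by hand. The vector v below is a toy example chosen only for illustration, not the x data of this post, and the variable names are made up for this sketch.

```r
# Toy vector, assumed for illustration only (not the post's x)
v <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(v)

mu <- sum(v) / n               # mean: sum of the elements divided by their count
ss <- sum((v - mu)^2)          # sum of squared deviations from the mean
var_sample <- ss / (n - 1)     # sample variance, the same quantity var() returns
sd_sample <- sqrt(var_sample)  # standard deviation, the same quantity sd() returns

# Compare the hand-written results with the built-in functions
all.equal(mu, mean(v))
all.equal(var_sample, var(v))
all.equal(sd_sample, sd(v))
```

Each all.equal() call should return TRUE, which confirms that sd() is the square root of var() and that both divide by n - 1.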
Empirical or 68-95-99.7 rule
For normally distributed data, about 68%, 95%, and 99.7% of the values lie within 1, 2, and 3 standard deviations of the mean, respectively. The 68-95-99.7 rule takes its name from those percentages. Here, 1σ denotes the range from μ - σ to μ + σ, 2σ the range from μ - 2σ to μ + 2σ, and 3σ the range from μ - 3σ to μ + 3σ.
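The three percentages can be checked against the normal cumulative distribution function. The short sketch below uses pnorm() from base R on a standard normal distribution (mean 0, standard deviation 1); the names k and coverage are only illustrative.

```r
# Number of standard deviations on each side of the mean
k <- 1:3

# Probability that a standard normal value falls between -k and k
coverage <- pnorm(k) - pnorm(-k)

round(100 * coverage, 2)
# [1] 68.27 95.45 99.73
```

The rounded values 68, 95, and 99.7 are where the rule gets its name.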
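We can visualize the sigma ranges by plotting the density curve of a normal distribution that has the mean and standard deviation of x, and marking μ and the 1σ, 2σ, and 3σ boundaries with vertical lines.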
s <- sd(x)
m <- mean(x)
index <- seq(min(x), max(x), length.out = 100)
dn <- dnorm(index, mean = m, sd = s)
# Density curve of a normal distribution with the mean and sd of x
plot(index, dn, type = "l", lwd = 2)
grid()
# Vertical line and label at the mean
abline(v = m)
text(m - 2, .02, "μ", pos = 3)
# Vertical lines and labels at 1 standard deviation from the mean
abline(v = m + s, col = "green")
abline(v = m - s, col = "green")
text(m + s + 2, .02, "σ", pos = 3, col = "darkgreen")
text(m - s + 2, .02, "-σ", pos = 3, col = "darkgreen")
# 2 standard deviations
abline(v = m - 2 * s, col = "blue")
abline(v = m + 2 * s, col = "blue")
text(m + 2 * s + 3, .02, "2σ", pos = 3, col = "blue")
text(m - 2 * s + 3, .02, "-2σ", pos = 3, col = "blue")
# 3 standard deviations
abline(v = m + 3 * s, col = "red")
abline(v = m - 3 * s, col = "red")
text(m + 3 * s + 3, .02, "3σ", pos = 3, col = "red")
text(m - 3 * s + 3, .02, "-3σ", pos = 3, col = "red")
Calculating the percentages
Finally, we'll count the values that fall in the 1σ [-σ:σ], 2σ [-2σ:2σ], and 3σ [-3σ:3σ] ranges. Because the mean of x (-0.93) is close to 0, the ranges below are taken around 0 instead of around μ.
sigma1 <- x >= -s & x <= s
sum(sigma1)
[1] 67
sigma2 <- x >= -2 * s & x <= 2 * s
sum(sigma2)
[1] 95
sigma3 <- x >= -3 * s & x <= 3 * s
sum(sigma3)
[1] 99
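The counts, 67, 95, and 99 out of 100, are close to the expected 68%, 95%, and 99.7%. The x data is not drawn from a normal distribution, so the match is only approximate. For normally distributed data, a larger sample brings the percentages closer to 68-95-99.7.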
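To illustrate the effect of sample size, here is a minimal sketch that repeats the calculation on a much larger sample. It assumes data drawn with rnorm() from a standard normal distribution, which is not the x data used above, and within_sigma() is a hypothetical helper written for this example. Unlike the counts above, the helper centers the range on the sample mean.

```r
# Hypothetical helper: share of values within k standard deviations of the mean
within_sigma <- function(v, k) {
  mean(abs(v - mean(v)) <= k * sd(v))
}

set.seed(1234)
z <- rnorm(100000)  # assumed: a large sample from a standard normal distribution

# Percentage of values inside the 1, 2, and 3 sigma ranges
pct <- sapply(1:3, function(k) 100 * within_sigma(z, k))
round(pct, 1)
```

The same helper can be applied to the x vector, for example within_sigma(x, 1), to get the share of values within one standard deviation of the mean of x.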
In this post, we've briefly covered standard deviation and the 68-95-99.7 rule in R. The full source code is listed below.
Source code listing
set.seed(1234)
x <- sample(-50:50, 100, replace = TRUE)
# Overwrite 80 randomly chosen positions with values from a narrower range
x[sample(1:100, 80)] <- sample(-20:20, 80, replace = TRUE)
plot(x, type = "l", col = "blue")
mean(x)
sd(x)
var(x)
sd(x)^2
s <- sd(x)
m <- mean(x)
index <- seq(min(x), max(x), length.out = 100)
dn <- dnorm(index, mean = m, sd = s)
# Density curve of a normal distribution with the mean and sd of x
plot(index, dn, type = "l", lwd = 2)
grid()
# Vertical line and label at the mean
abline(v = m)
text(m - 2, .02, "μ", pos = 3)
# Vertical lines and labels at 1 standard deviation from the mean
abline(v = m + s, col = "green")
abline(v = m - s, col = "green")
text(m + s + 2, .02, "σ", pos = 3, col = "darkgreen")
text(m - s + 2, .02, "-σ", pos = 3, col = "darkgreen")
# 2 standard deviations
abline(v = m - 2 * s, col = "blue")
abline(v = m + 2 * s, col = "blue")
text(m + 2 * s + 3, .02, "2σ", pos = 3, col = "blue")
text(m - 2 * s + 3, .02, "-2σ", pos = 3, col = "blue")
# 3 standard deviations
abline(v = m + 3 * s, col = "red")
abline(v = m - 3 * s, col = "red")
text(m + 3 * s + 3, .02, "3σ", pos = 3, col = "red")
text(m - 3 * s + 3, .02, "-3σ", pos = 3, col = "red")
# Count the values in each sigma range (taken around 0, since the mean is close to 0)
sigma1 <- x >= -s & x <= s
sum(sigma1)
sigma2 <- x >= -2 * s & x <= 2 * s
sum(sigma2)
sigma3 <- x >= -3 * s & x <= 3 * s
sum(sigma3)