In this post, we'll briefly learn how to use a logistic regression model to classify data in R. We use the glm() function to define the model.
Creating data
First, we create a dataset and split it into train and test parts. Binary-class data is the easiest way to understand logistic regression. The dataset contains exam scores with a binary 'result' output (1 - pass, 0 - fail).
set.seed(123)
exam = data.frame(test = sample(40:100, 200, replace = T),
                  paper = sample(30:100, 200, replace = T))
exam = cbind(exam, result = ifelse(exam$test > 65 & exam$paper > 40, 1, 0))
index = sample(1:nrow(exam), size = .80 * nrow(exam))
train = exam[index, ]
test = exam[-index, ]
head(train)
    test paper result
198   80    37      0
28    76    69      1
180   75    98      1
114   97    33      0
78    77    63      1
88    94    46      1
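Since the split is random, it is worth confirming that both classes are well represented in each part. A quick check with base R (this step is an extra sketch on top of the walkthrough; it uses only the 'train' and 'test' data frames defined above):

table(train$result)  # counts of fail (0) and pass (1) in the training part
table(test$result)   # and in the test part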
Building the model
Next, we build a logistic regression model with the glm() function, using the binomial family.
exam_glm = glm(result ~ test + paper, data = train, family = "binomial")
summary(exam_glm)

Call:
glm(formula = result ~ test + paper, family = "binomial", data = train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.40363  -0.43696  -0.07681   0.44613   2.08213  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -15.52439    2.35453  -6.593 4.30e-11 ***
test          0.16269    0.02443   6.658 2.77e-11 ***
paper         0.06139    0.01417   4.332 1.48e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 221.41  on 159  degrees of freedom
Residual deviance: 102.86  on 157  degrees of freedom
AIC: 108.86

Number of Fisher Scoring iterations: 6
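The coefficients are on the log-odds scale, so exponentiating them gives odds ratios; both 'test' and 'paper' have positive coefficients, meaning higher scores increase the odds of passing. A minimal sketch of this interpretation step (it assumes only the fitted 'exam_glm' object from above):

# each one-point increase in a score multiplies the odds of passing by exp(coef)
exp(coef(exam_glm))

# approximate 95% Wald confidence intervals on the odds-ratio scale
exp(confint.default(exam_glm))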
Now, we can draw a plot of our model with the ggplot() function.
library(ggplot2)
ggplot(exam_glm, aes(x = test + paper, y = result)) +
  geom_point() +
  stat_smooth(method = "glm", method.args = list(family = "binomial"), se = F)
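To see the characteristic S-shaped curve for a single predictor, we can also plot fitted probabilities over a grid of 'test' scores while holding 'paper' fixed. This is a sketch beyond the original walkthrough; the grid range and the choice of the median 'paper' score are assumptions:

# fitted probability of passing as 'test' varies, with 'paper' held at its median
grid = data.frame(test = seq(40, 100, by = 1), paper = median(train$paper))
grid$prob = predict(exam_glm, newdata = grid, type = "response")
ggplot(grid, aes(x = test, y = prob)) + geom_line()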
Predicting the class with logistic regression
Logistic regression predicts the probability of a class, a value in the range [0, 1]. To turn these probabilities into a categorical output, we apply a threshold: if the predicted value is higher than 0.5, we assign class 1 (pass); otherwise, class 0 (fail).
pred = predict(exam_glm, test, type = "response")
test = cbind(test, pred_result = ifelse(pred > .5, 1, 0))

# confusion matrix
table(test$result, test$pred_result)

     0  1
  0 15  5
  1  4 16
head(test)
   test paper result pred_result
4    93    66      1           1
7    72    55      1           0
12   67    47      1           0
19   60    88      0           0
21   94    50      1           1
24  100    78      1           1
We check the accuracy.
acc = mean(test$result == test$pred_result)
cat("Accuracy: ", acc)
Accuracy:  0.775
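Accuracy alone can hide class-specific errors, so the confusion matrix above can also be summarized as precision and recall. A minimal base-R sketch (it assumes only the 'test' data frame with the 'pred_result' column added above):

cm = table(test$result, test$pred_result)  # rows: actual, cols: predicted
precision = cm[2, 2] / sum(cm[, 2])        # of predicted passes, how many were real
recall    = cm[2, 2] / sum(cm[2, ])        # of real passes, how many were caught
cat("Precision:", precision, " Recall:", recall)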
In this post, we've briefly learned how to use a logistic regression model to classify data in R.
The full source code is listed below.
library(ggplot2)

set.seed(123)
exam = data.frame(test = sample(40:100, 200, replace = T),
                  paper = sample(30:100, 200, replace = T))
exam = cbind(exam, result = ifelse(exam$test > 65 & exam$paper > 40, 1, 0))
head(exam)

index = sample(1:nrow(exam), size = .80 * nrow(exam))
train = exam[index, ]
test = exam[-index, ]
head(train)

exam_glm = glm(result ~ test + paper, data = train, family = "binomial")
summary(exam_glm)

ggplot(exam_glm, aes(x = test + paper, y = result)) +
  geom_point() +
  stat_smooth(method = "glm", method.args = list(family = "binomial"), se = F)

pred = predict(exam_glm, test, type = "response")
test = cbind(test, pred_result = ifelse(pred > .5, 1, 0))
table(test$result, test$pred_result)
head(test)

acc = mean(test$result == test$pred_result)
cat("Accuracy: ", acc)