> devtools::install_github("rstudio/keras")
> library(keras)
> install_keras()
First, we'll generate a sample dataset for this tutorial and split it into train and test parts.
n <- 2000 # number of samples
a <- sample(1:20, n, replace = TRUE)
b <- sample(1:50, n, replace = TRUE)
c <- sample(1:100, n, replace = TRUE)
flag <- ifelse(a > 15 & b > 30 & c > 60, "red",
               ifelse(a <= 9 & b < 25 & c <= 35, "yellow", "green"))
df <- data.frame(a = a,
                 b = b,
                 c = c,
                 flag = as.factor(flag))
> tail(df,15)
a b c flag
1986 3 50 91 green
1987 9 12 56 green
1988 10 21 14 green
1989 13 6 22 green
1990 6 14 9 yellow
1991 10 27 86 green
1992 4 16 6 yellow
1993 18 31 33 green
1994 4 50 51 green
1995 2 31 34 green
1996 18 8 88 green
1997 7 36 89 green
1998 16 34 91 red
1999 9 17 80 green
2000 9 22 91 green
indexes = sample(1:nrow(df), size = .95 * nrow(df))
train <- df[indexes, ]
test <- df[-indexes, ]
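Because the split is random and green dominates the data, it can be worth checking how many rows of each class end up in each part. The following is a minimal sketch in base R that regenerates the data the same way as above; the `set.seed(123)` call and the `round()` around the sample size are my additions for repeatability, not part of the original code.

```r
set.seed(123)  # assumption: fixed seed for repeatability; the original code does not set one

n <- 2000
a <- sample(1:20, n, replace = TRUE)
b <- sample(1:50, n, replace = TRUE)
c <- sample(1:100, n, replace = TRUE)
flag <- ifelse(a > 15 & b > 30 & c > 60, "red",
               ifelse(a <= 9 & b < 25 & c <= 35, "yellow", "green"))
df <- data.frame(a = a, b = b, c = c, flag = as.factor(flag))

# 95% of the rows go to train, the remaining 5% to test
indexes <- sample(1:nrow(df), size = round(0.95 * nrow(df)))
train <- df[indexes, ]
test  <- df[-indexes, ]

# Row counts of each part
split_sizes <- c(train = nrow(train), test = nrow(test))
print(split_sizes)

# Class counts per part; rare classes may be thin in the small test set
print(table(train$flag))
print(table(test$flag))
```

If a class is missing from the test part entirely, drawing a new split (or a stratified one) is probably the safer choice before training.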
Next, we'll convert the X input data into a matrix and the Y output labels into one-hot encoded category vectors.
# factor levels are alphabetical, so green = 0, red = 1, yellow = 2
train.x <- as.matrix(train[, 1:3])
train.y <- to_categorical(matrix(as.numeric(train[, 4]) - 1))
test.x <- as.matrix(test[, 1:3])
test.y <- to_categorical(matrix(as.numeric(test[, 4]) - 1))
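To see what the label conversion produces, here is a small base R sketch that mimics one-hot encoding without calling keras. It is only an illustration of the idea on four hand-picked labels, not the implementation of `to_categorical()`.

```r
# A few labels with the same factor levels as in the tutorial
labels <- factor(c("green", "red", "yellow", "green"))

# as.numeric() on a factor returns level indices (1-based, levels sorted alphabetically);
# subtracting 1 gives 0-based class ids
ids <- as.numeric(labels) - 1

# One-hot encoding in base R: row i has a 1 in column ids[i] + 1
one_hot <- diag(nlevels(labels))[ids + 1, ]
colnames(one_hot) <- levels(labels)
print(one_hot)
```

Each row contains exactly one 1, in the column of its class, which is the target shape a softmax output layer with three units is trained against.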
Building a model
Here, input_shape is 3 (the number of input columns a, b, and c), the output layer has 3 units (one per label: red, green, and yellow), and its activation is 'softmax', which suits multi-class classification.
model <- keras_model_sequential()
model %>%
  layer_dense(units = 64, activation = "relu", input_shape = c(3)) %>%
  layer_dense(units = 3, activation = "softmax")
model %>% compile(optimizer = "rmsprop",
                  loss = "categorical_crossentropy",
                  metrics = c("accuracy"))
> print(model)
Layer (type) Output Shape Param #
dense_356 (Dense) (None, 64) 256
dense_357 (Dense) (None, 3) 195
Total params: 451
Trainable params: 451
Non-trainable params: 0
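The parameter counts in the summary can be checked by hand: a dense layer has one weight per input/unit pair plus one bias per unit. A short sketch of that arithmetic, where `dense_params` is a hypothetical helper name of my own:

```r
# Parameters of a dense layer: one weight per input-unit pair plus one bias per unit
dense_params <- function(n_inputs, n_units) {
  n_inputs * n_units + n_units
}

hidden <- dense_params(3, 64)   # 3 inputs (a, b, c) -> 64 units
output <- dense_params(64, 3)   # 64 hidden units -> 3 classes
total  <- hidden + output

print(c(hidden = hidden, output = output, total = total))
```

The three values agree with the 256, 195, and 451 reported in the summary.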
We'll fit the model on the training data and then use it to predict the test data.
model %>% fit(train.x, train.y,
              epochs = 50,
              batch_size = 50)
pred <- model %>% predict(test.x)
To make the results readable, I'll change the format of the output.
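# round to two decimals; format() turns the matrix into fixed-width strings
pred <- format(round(pred, 2), nsmall = 2)
# columns follow the factor level order: green, red, yellow
result <- data.frame("green" = pred[, 1], "red" = pred[, 2], "yellow" = pred[, 3],
                     "predicted" = ifelse(max.col(pred[, 1:3]) == 1, "green",
                                   ifelse(max.col(pred[, 1:3]) == 2, "red", "yellow")),
                     original = test[, 4])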
> head(result,20)
green red yellow predicted original
1 1.00 0.00 0.00 green green
2 1.00 0.00 0.00 green green
3 1.00 0.00 0.00 green green
4 1.00 0.00 0.00 green green
5 0.45 0.55 0.00 red red
6 1.00 0.00 0.00 green green
7 0.93 0.00 0.07 green green
8 0.52 0.36 0.12 green green
9 0.96 0.04 0.00 green green
10 1.00 0.00 0.00 green green
11 0.28 0.04 0.68 yellow yellow
12 1.00 0.00 0.00 green green
13 1.00 0.00 0.00 green green
14 1.00 0.00 0.00 green green
15 1.00 0.00 0.00 green green
16 1.00 0.00 0.00 green green
17 0.73 0.27 0.00 green green
18 1.00 0.00 0.00 green green
19 0.52 0.38 0.10 green green
20 0.34 0.00 0.66 yellow yellow
Evaluating the model accuracy and loss.
scores <- model %>% evaluate(test.x, test.y)
> print(scores)
[1] 0.08444449
[1] 0.99
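For intuition about the loss value, categorical cross-entropy is the negative log of the probability the model assigned to the true class, averaged over the rows. The sketch below applies that to rows 5, 8, and 11 of the result table above; those probabilities are rounded to two decimals, so the numbers are only approximate and will not match the reported loss for the full test set.

```r
# Rounded probabilities copied from rows 5, 8 and 11 of the result table above
probs <- matrix(c(0.45, 0.55, 0.00,
                  0.52, 0.36, 0.12,
                  0.28, 0.04, 0.68),
                ncol = 3, byrow = TRUE,
                dimnames = list(NULL, c("green", "red", "yellow")))
truth <- c("red", "green", "yellow")

# Probability the model assigned to the true class of each row
p_true <- probs[cbind(seq_along(truth), match(truth, colnames(probs)))]

# Categorical cross-entropy: negative log of that probability, averaged over rows
row_loss <- -log(p_true)
mean_loss <- mean(row_loss)
print(round(row_loss, 4))
print(round(mean_loss, 4))
```

Confident correct predictions (probability near 1) contribute almost nothing to the loss, which is why the overall test loss is small even though these three borderline rows have larger values.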
Checking the confusion matrix with caret
> cfm <- caret::confusionMatrix(factor(result$predicted, levels = levels(result$original)), result$original)
> print(cfm)
Confusion Matrix and Statistics
Prediction green red yellow
green 89 0 0
red 1 2 0
yellow 0 0 8
Overall Statistics
Accuracy : 0.99
95% CI : (0.9455, 0.9997)
No Information Rate : 0.9
P-Value [Acc > NIR] : 0.0003217
Kappa : 0.9479
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: green Class: red Class: yellow
Sensitivity 0.9889 1.0000 1.00
Specificity 1.0000 0.9898 1.00
Pos Pred Value 1.0000 0.6667 1.00
Neg Pred Value 0.9091 1.0000 1.00
Prevalence 0.9000 0.0200 0.08
Detection Rate 0.8900 0.0200 0.08
Detection Prevalence 0.8900 0.0300 0.08
Balanced Accuracy 0.9944 0.9949 1.00
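Several of these statistics can be recomputed directly from the counts, which helps when reading the report. A minimal base R sketch, with the counts typed in from the confusion matrix above (rows are predictions, columns are reference labels):

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference)
cm <- matrix(c(89, 0, 0,
                1, 2, 0,
                0, 0, 8),
             nrow = 3, byrow = TRUE,
             dimnames = list(prediction = c("green", "red", "yellow"),
                             reference  = c("green", "red", "yellow")))

accuracy    <- sum(diag(cm)) / sum(cm)   # correct predictions / all predictions
sensitivity <- diag(cm) / colSums(cm)    # per class: correct / actual members
ppv         <- diag(cm) / rowSums(cm)    # per class: correct / predicted members

print(accuracy)
print(round(sensitivity, 4))
print(round(ppv, 4))
```

The single error, a green row predicted as red, explains both the 0.9889 sensitivity for green and the 0.6667 positive predictive value for red.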
The full source code is listed below.
If you have any comments about the post, please leave them below. Thank you!
Thank you for reading!
I do not understand this part of your code.
"# collecting everything in data frame to read it easily
result <- data.frame("green" = pred[,1],
"red" = pred[,2],
"yellow" = pred[,3],
"predicted" = ifelse(max.col(pred[ ,1:3]) == 1, "green",
ifelse(max.col(pred[ ,1:3]) == "2", "red", "yellow")),
original = test[ ,4])"
My dataset has 60 columns, the last of which is the label, and the label column has 11 classes. Could you please help me with this issue?
That code just prints the original and predicted values alongside the class probabilities and the chosen label. In the "predicted" column, we convert the probability values to a label by selecting the column with the highest value as the final output.
In your case, you will have 11 probability columns in your prediction. Your job is to pick the highest predicted value in each row as the final result. You can use the same method as above or apply another method to convert the probabilities to labels. Hope this helps!
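For a case with many classes, such as the 11 mentioned in the question, nested ifelse() calls become unwieldy. One alternative is to index a vector of class names with the result of max.col(). The sketch below uses a made-up probability matrix in place of real predict() output, and the class names `class_1` to `class_11` are assumptions; in practice they would probably come from levels() of the label factor.

```r
set.seed(1)  # only so the made-up numbers are repeatable

# Hypothetical stand-in for predict() output: 5 rows, 11 class probabilities each
class_names <- paste0("class_", 1:11)   # assumed names; use levels() of your label factor
raw <- matrix(runif(5 * 11), nrow = 5)
pred <- raw / rowSums(raw)              # normalise rows so they sum to 1, like softmax output
colnames(pred) <- class_names

# max.col() returns the column index of the largest value in each row;
# indexing the class names with it maps probabilities to labels
predicted <- class_names[max.col(pred, ties.method = "first")]

result <- data.frame(round(pred, 2), predicted = predicted)
print(result)
```

Passing ties.method = "first" makes the choice deterministic; the default for max.col() breaks ties at random.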