I already introduced the business problem for classification in the first file, so here I will only outline the steps of the classification analysis, which mirrors the regression workflow in several ways. My approach: 1. Since EDA is already done, it is not repeated here. 2. Find the best model using accuracy as the metric. 3. A special twist: switch the metric from accuracy to higher recall. This might sound confusing now, but it will all become clear later on.
Most importantly, ENJOY!
#This is where the magic starts: special categorization (>=6 is good)
new_df = df %>%
mutate(good = ifelse(quality >= 6, "1", "0") %>% as.factor()) %>%
select(-quality); dim(new_df)
table(new_df$good)
#split into test/train
p=ncol(new_df)
idx0 <- createDataPartition(new_df$good, p = .9,
list = FALSE,
times = 1)
off_train.x = new_df[idx0, -p] %>% as.matrix()
dim(off_train.x)
off_train.x = off_train.x %>% scale(center=TRUE)
off_train.y = new_df[idx0, p]
official_train = data.frame(off_train.x, good=off_train.y)
dim(official_train)
off_test.x = new_df[-idx0, -p] %>% as.matrix()
off_test.x = off_test.x %>% scale(., center=attr(off_train.x, "scaled:center"),
scale=attr(off_train.x, "scaled:scale"))
off_test.y = new_df[-idx0, p]
official_test = data.frame(off_test.x, good=off_test.y)
#split train into train/validation
idx = createDataPartition(official_train$good, p = 0.9, list=F)
train.x = official_train[idx, -p] %>% as.matrix(); dim(train.x)
train.y = official_train[idx, p]; table(train.y)
train = data.frame(train.x, good=train.y)
test.x = official_train[-idx, -p] %>% as.matrix(); dim(test.x)
test.y = official_train[-idx, p]; table(test.y)
test = data.frame(test.x, good=test.y)
We start with the reduced model, as requested.
#logistic regression only on few vars for interpretation purposes
set.seed(2)
fitprova = glm(good~alcohol*density,data = train, family = binomial)
summary(fitprova)
##
## Call:
## glm(formula = good ~ alcohol * density, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0339 -1.1167 0.5045 0.9038 1.5124
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.89500 0.05737 15.601 < 2e-16 ***
## alcohol 1.19244 0.08559 13.931 < 2e-16 ***
## density 0.12214 0.08285 1.474 0.14
## alcohol:density -0.27584 0.06106 -4.517 6.26e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3972.8 on 3252 degrees of freedom
## Residual deviance: 3434.3 on 3249 degrees of freedom
## AIC: 3442.3
##
## Number of Fisher Scoring iterations: 5
The fitted model is: log-odds(good) = β0 + β1·alcohol + β2·density + β3·alcohol·density. Since the predictors are standardized, a one-unit change means one standard deviation.

- e^(β0) ≈ 2.45: the odds of a wine being good when alcohol and density are at their means.
- e^(β0+β1) ≈ 8.06: the odds of a wine being good when alcohol is 1 sd above its mean and density is at its mean.
- e^(β2) ≈ 1.13: the odds ratio for a 1 sd increase in density at mean alcohol.
- e^(β2+β3) ≈ 0.86: the odds ratio for a 1 sd increase in density when alcohol is 1 sd above its mean.

We conclude that increasing alcohol or density individually raises the chance of a wine being good, but increasing them together reverses the density effect (the wine gets worse). We also see that alcohol has a much bigger effect than density on whether the wine is good. Note that density is not statistically significant (p ≈ 0.14), so take this part of the analysis with a pinch of salt.
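These quantities can be computed directly from the fitted coefficients; a quick sketch, assuming the `fitprova` object fitted above:

```r
b = coef(fitprova)

# Odds of a good wine when alcohol and density are at their means
# (predictors are standardized, so 0 is the mean)
exp(b["(Intercept)"])

# Odds of a good wine when alcohol is +1 sd and density is at its mean
exp(b["(Intercept)"] + b["alcohol"])

# Odds ratio for a +1 sd increase in density at mean alcohol
exp(b["density"])

# Odds ratio for a +1 sd increase in density when alcohol is at +1 sd
# (density main effect plus interaction)
exp(b["density"] + b["alcohol:density"])
```
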
Let's fit some models on this classification problem, starting with logistic regression. For now let's use a threshold of 0.5, but later I came up with a way to find the optimal threshold for our business problem.
#logistic regression with a threshold of 0.5 for now
threshold = 0.5
glm_mod = glm(good ~. , data = train, family = binomial)
glm_prob = predict(glm_mod, data.frame(test.x), type="response")
glm_class = as.factor(ifelse(glm_prob >= threshold, 1, 0))
cm = confusionMatrix(glm_class, test.y, positive = "1")
glm_eval = list(confusionMatrix = cm); glm_eval
#Now Lasso with 10-fold CV
cvlasso = cv.glmnet(train.x, train.y,
family = "binomial",
type.measure = "auc")
coef(cvlasso, s=cvlasso$lambda.1se)
cvlasso_prob = predict(cvlasso,
test.x,
type="response",
s=cvlasso$lambda.1se)
lasso_class = as.factor(ifelse(cvlasso_prob >= threshold, 1, 0))
cm_lasso = confusionMatrix(lasso_class, test.y, positive = "1")
cvlasso_eval = list(confusionMatrix = cm_lasso); cvlasso_eval
#Decision tree but this time classification problem with boosting
dt = C5.0(train.x, train.y, trials = 10)
dt_pred = predict(dt, test.x)
confMat = confusionMatrix(dt_pred, test.y, positive="1")
dtboost_eval = list(confusionMatrix = confMat); dtboost_eval
#Now, let's fit a RF in 2 ways: a VANILLA one and one with CV
rfcat = randomForest(good~., data=train, ntree=1000, mtry=sqrt(p))
rfcat_prob = predict(rfcat, test.x)
confMat = confusionMatrix(rfcat_prob, test.y, positive="1")
rfcat_eval = list(confusionMatrix = confMat)
rfcat_eval
#Now, for the RF with CV
ct = trainControl(method = "repeatedcv", number = 5, repeats = 2)
grid_rf = expand.grid(.mtry = c(p/3, 3.5, sqrt(p)))
cvrfcat = train(good~., data = train,
method = 'rf',
metric = "Accuracy",
trControl = ct,
tuneGrid = grid_rf)
cvrfcat_pred = predict(cvrfcat, test.x)
confMat = confusionMatrix(cvrfcat_pred, test.y, positive = "1")
cvrfcat_eval = list(confusionMatrix = confMat); cvrfcat_eval
importance(rfcat)
#Tried to do reduced model, but as expected got a better result with full model
#(even though the most important variables were selected)
cvrfcat.reduced = caret::train(good~alcohol+density+residual.sugar+chlorides, data = train,
method = 'rf',
metric = "Accuracy",
trControl = ct,
tuneGrid = grid_rf)
cvrfcat_pred.reduced = predict(cvrfcat.reduced, test.x)
confMat = confusionMatrix(cvrfcat_pred.reduced, test.y, positive = "1")
cvrfcat_eval.reduced = list(confusionMatrix = confMat); cvrfcat_eval.reduced
Following the trouble I went through in the regression problem only to observe an ever-so-slight improvement in model performance, for the boosted XGBoost model I will simply plug in the most common hyperparameter values I have observed, rather than writing a 150-line script to search for the best possible ones.
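The XGBoost fit that produces `xg_eval` (referenced in the comparison below) is not shown above; a minimal sketch of how it could look with the `xgboost` package, where the hyperparameter values are illustrative assumptions rather than the ones actually used:

```r
library(xgboost)

# xgboost needs a numeric 0/1 label, so convert the factor back
dtrain = xgb.DMatrix(train.x, label = as.numeric(as.character(train.y)))

# Commonly seen hyperparameter values (assumed, not tuned here)
xg_mod = xgboost(data = dtrain,
                 objective = "binary:logistic",
                 nrounds = 200,
                 eta = 0.1,
                 max_depth = 6,
                 verbose = 0)

# Predicted probabilities on the validation features, then classes at 0.5
xg_prob = predict(xg_mod, test.x)
xg_class = as.factor(ifelse(xg_prob >= threshold, 1, 0))
cm_xg = confusionMatrix(xg_class, test.y, positive = "1")
xg_eval = list(confusionMatrix = cm_xg)
```
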
glm_ = c(glm_eval$confusionMatrix$overall,
glm_eval$confusionMatrix$byClass)
lasso_cv = c(cvlasso_eval$confusionMatrix$overall,
cvlasso_eval$confusionMatrix$byClass)
classif_tree_boost = c(dtboost_eval$confusionMatrix$overall,
dtboost_eval$confusionMatrix$byClass)
rf = c(rfcat_eval$confusionMatrix$overall,
rfcat_eval$confusionMatrix$byClass)
cv_rf = c(cvrfcat_eval$confusionMatrix$overall,
cvrfcat_eval$confusionMatrix$byClass)
rf_cv_reduced = c(cvrfcat_eval.reduced$confusionMatrix$overall,
cvrfcat_eval.reduced$confusionMatrix$byClass)
xgb_cv = c(xg_eval$confusionMatrix$overall,
xg_eval$confusionMatrix$byClass)
all = cbind(glm_, lasso_cv,
classif_tree_boost,
rf, cv_rf, rf_cv_reduced, xgb_cv) %>% data.frame()
knitr::kable(all %>% round(3),
caption = "comparing all models")
| | classif_tree_boost | rf | cv_rf | rf_cv_reduced | xgb_cv |
|---|---|---|---|---|---|
| Accuracy | 0.809 | 0.845 | 0.848 | 0.825 | 0.842 |
| Kappa | 0.523 | 0.609 | 0.613 | 0.562 | 0.603 |
| AccuracyLower | 0.764 | 0.803 | 0.806 | 0.782 | 0.800 |
| AccuracyUpper | 0.848 | 0.881 | 0.883 | 0.863 | 0.878 |
| AccuracyNull | 0.701 | 0.701 | 0.701 | 0.701 | 0.701 |
| AccuracyPValue | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| McnemarPValue | 0.054 | 0.011 | 0.003 | 0.023 | 0.017 |
| Sensitivity | 0.897 | 0.929 | 0.937 | 0.913 | 0.925 |
| Specificity | 0.602 | 0.648 | 0.639 | 0.620 | 0.648 |
| Pos Pred Value | 0.841 | 0.861 | 0.859 | 0.849 | 0.860 |
| Neg Pred Value | 0.714 | 0.795 | 0.812 | 0.753 | 0.787 |
| Precision | 0.841 | 0.861 | 0.859 | 0.849 | 0.860 |
| Recall | 0.897 | 0.929 | 0.937 | 0.913 | 0.925 |
| F1 | 0.868 | 0.894 | 0.896 | 0.880 | 0.891 |
| Prevalence | 0.701 | 0.701 | 0.701 | 0.701 | 0.701 |
| Detection Rate | 0.629 | 0.651 | 0.657 | 0.640 | 0.648 |
| Detection Prevalence | 0.748 | 0.756 | 0.765 | 0.753 | 0.753 |
| Balanced Accuracy | 0.750 | 0.789 | 0.788 | 0.767 | 0.787 |
Now let's see the result of the best model on the official test set (not the validation set we were using before).
cvrfcat_off = train(good~., data = official_train,
method = 'rf',
metric = "Accuracy",
trControl = ct,
tuneGrid = grid_rf)
cvrfcat_pred_off = predict(cvrfcat_off, off_test.x)
confMat = confusionMatrix(cvrfcat_pred_off, off_test.y, positive = "1")
cvrfcat_eval_off = list(confusionMatrix = confMat); cvrfcat_eval_off
## $confusionMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 72 19
## 1 48 262
##
## Accuracy : 0.8329
## 95% CI : (0.7927, 0.8681)
## No Information Rate : 0.7007
## P-Value [Acc > NIR] : 7.771e-10
##
## Kappa : 0.572
##
## Mcnemar's Test P-Value : 0.0006245
##
## Sensitivity : 0.9324
## Specificity : 0.6000
## Pos Pred Value : 0.8452
## Neg Pred Value : 0.7912
## Prevalence : 0.7007
## Detection Rate : 0.6534
## Detection Prevalence : 0.7731
## Balanced Accuracy : 0.7662
##
## 'Positive' Class : 1
##
Maximizing accuracy is a good metric, but let's think a bit outside the box: how do we pick the best threshold?

Above, I picked the threshold to maximize accuracy. But is accuracy what really matters? Let me ask you this: how many times did you really think a wine tasted horrible? Very rarely, right? Let's be real: most of us are not connoisseurs and will not notice a huge difference between a wine that is "not bad", one that is "good", and one that is "very good". This is why, when first marketing the product to the jury, we will use accuracy as the metric to show them that we are right more often than not; and once the product goes to market, we will use our secret tactic, allowing more FP in exchange for fewer FN. Our threshold will therefore be below 0.5, and we will use a cost matrix in a fictional scenario that I created myself.
In other words, the per-outcome costs are: TP = −40, TN = −40, FN = 25, FP = 10, which gives:

Cost = 25·FN + 10·FP − 40·TP − 40·TN

Let me explain further: when we predict that a wine is bad but it is actually good, "good" only means good according to the quality cutoff we fed into the model, not necessarily good according to the user. Indeed, a user won't tell the difference between a wine rated 5.8 or 5.9 and one rated 6.1 or 6.2. We can profit from this margin by penalizing FN more heavily. The concept I have in mind is hard to explain in words, but I hope to convince you of it in my presentation!
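The cost curves plotted below can be obtained by sweeping candidate thresholds over a model's predicted probabilities; a minimal sketch using the cost values above and the logistic-regression probabilities `glm_prob` from earlier (the threshold grid and the base-graphics plot are illustrative choices):

```r
thresholds = seq(0.05, 0.95, by = 0.01)
truth = as.numeric(as.character(test.y))

# Total cost at each threshold: 25 per FN, 10 per FP, -40 per TP, -40 per TN
costs = sapply(thresholds, function(t) {
  pred = ifelse(glm_prob >= t, 1, 0)
  FN = sum(pred == 0 & truth == 1)
  FP = sum(pred == 1 & truth == 0)
  TP = sum(pred == 1 & truth == 1)
  TN = sum(pred == 0 & truth == 0)
  25 * FN + 10 * FP - 40 * TP - 40 * TN
})

# The threshold minimizing total cost
best_t = thresholds[which.min(costs)]
plot(thresholds, costs, type = "l", xlab = "threshold", ylab = "total cost")
abline(v = best_t, lty = 2)
```
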
#par(mfrow) has no effect on grid graphics; grid.arrange handles the layout itself
grid.arrange(plot, plot2, ncol = 2)
We look for the x-axis value corresponding to the minimum on the y-axis: this will be our threshold.
I will not spell out blunt conclusions to my business problems here, because they might seem absurd in writing; they will be much clearer once I explain them in person. I have provided all the methodologies and techniques used, and will draw the big conclusions in the third file. I hope my project was enjoyable and pleasant! Looking forward to presenting it.