I already introduced the business problem for classification in the first file, so here I will only outline the steps of the classification analysis, which mirrors the regression workflow in several ways. My approach: 1. Since EDA is already done, it is not repeated here. 2. Find the best model using accuracy as the metric. 3. A special twist: switch the metric from accuracy to higher recall. This might sound confusing now, but it will all become clear later on.
Most importantly, ENJOY!
#This is where the magic starts: special categorization (>=6 is good)
new_df = df %>%
mutate(good = ifelse(quality >= 6, "1", "0") %>% as.factor()) %>%
select(-quality); dim(new_df)
table(new_df$good)
#split into test/train
p=ncol(new_df)
idx0 <- createDataPartition(new_df$good, p = .9,
list = FALSE,
times = 1)
off_train.x = new_df[idx0, -p] %>% as.matrix()
dim(off_train.x)
off_train.x = off_train.x %>% scale(center=TRUE)
off_train.y = new_df[idx0, p]
official_train = data.frame(off_train.x, good=off_train.y)
dim(official_train)
off_test.x = new_df[-idx0, -p] %>% as.matrix()
off_test.x = off_test.x %>% scale(., center=attr(off_train.x, "scaled:center"),
scale=attr(off_train.x, "scaled:scale"))
off_test.y = new_df[-idx0, p]
official_test = data.frame(off_test.x, good=off_test.y)
#split train into train/validation
idx = createDataPartition(official_train$good, p = 0.9, list=F)
train.x = official_train[idx, -p] %>% as.matrix(); dim(train.x)
train.y = official_train[idx, p]; table(train.y)
train = data.frame(train.x, good=train.y)
test.x = official_train[-idx, -p] %>% as.matrix(); dim(test.x)
test.y = official_train[-idx, p]; table(test.y)
test = data.frame(test.x, good=test.y)
We start with the reduced model, as requested.
#logistic regression only on few vars for interpretation purposes
set.seed(2)
fitprova = glm(good~alcohol*density,data = train, family = binomial)
summary(fitprova)
##
## Call:
## glm(formula = good ~ alcohol * density, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0339 -1.1167 0.5045 0.9038 1.5124
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.89500 0.05737 15.601 < 2e-16 ***
## alcohol 1.19244 0.08559 13.931 < 2e-16 ***
## density 0.12214 0.08285 1.474 0.14
## alcohol:density -0.27584 0.06106 -4.517 6.26e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3972.8 on 3252 degrees of freedom
## Residual deviance: 3434.3 on 3249 degrees of freedom
## AIC: 3442.3
##
## Number of Fisher Scoring iterations: 5
The fitted model is: log-odds(good) = β0 + β1·alcohol + β2·density + β3·alcohol·density. Since the predictors are standardized, a one-unit change means one standard deviation.

- e^(β0) ≈ 2.45: the odds of a wine being good when alcohol and density are at their means.
- e^(β0+β1) ≈ 8.06: the odds of a wine being good when alcohol is 1 sd above its mean and density is at its mean.
- e^(β2) ≈ 1.13: the odds ratio for a 1 sd increase in density at mean alcohol.
- e^(β2+β3) ≈ 0.86: the odds ratio for a 1 sd increase in density when alcohol is 1 sd above its mean.

We conclude that increasing alcohol or density individually raises the chance of a wine being good, but increasing them together reverses the density effect (the wine gets worse). We also see that alcohol has a much bigger effect than density on whether the wine is good. Note that density is not statistically significant (p ≈ 0.14), so take this part of the analysis with a pinch of salt.
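These quantities can be computed directly from the fitted coefficients; a quick sketch, assuming the `fitprova` object fitted above:

```r
b = coef(fitprova)

# Odds of a good wine when alcohol and density are at their means
# (predictors are standardized, so 0 is the mean)
exp(b["(Intercept)"])

# Odds of a good wine when alcohol is +1 sd and density is at its mean
exp(b["(Intercept)"] + b["alcohol"])

# Odds ratio for a +1 sd increase in density at mean alcohol
exp(b["density"])

# Odds ratio for a +1 sd increase in density when alcohol is at +1 sd
# (density main effect plus interaction)
exp(b["density"] + b["alcohol:density"])
```
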
Let's fit some models on this classification problem, starting with logistic regression. For now let's use a threshold of 0.5, but later I came up with a way to find the optimal threshold for our business problem.
#logistic regression with a threshold of 0.5 for now
threshold = 0.5
glm_mod = glm(good ~. , data = train, family = binomial)
glm_prob = predict(glm_mod, data.frame(test.x), type="response")
glm_class = as.factor(ifelse(glm_prob >= threshold, 1, 0))
cm = confusionMatrix(glm_class, test.y, positive = "1")
glm_eval = list(confusionMatrix = cm); glm_eval
#Now Lasso with 10-fold CV
cvlasso = cv.glmnet(train.x, train.y,
family = "binomial",
type.measure = "auc")
coef(cvlasso, s=cvlasso$lambda.1se)
cvlasso_prob = predict(cvlasso,
test.x,
type="response",
s=cvlasso$lambda.1se)
lasso_class = as.factor(ifelse(cvlasso_prob >= threshold, 1, 0))
cm_lasso = confusionMatrix(lasso_class, test.y, positive = "1")
cvlasso_eval = list(confusionMatrix = cm_lasso); cvlasso_eval
#Decision tree but this time classification problem with boosting
dt = C5.0(train.x, train.y, trials = 10)
dt_pred = predict(dt, test.x)
confMat = confusionMatrix(dt_pred, test.y, positive="1")
dtboost_eval = list(confusionMatrix = confMat); dtboost_eval
#Now, let's fit a RF in 2 ways: a VANILLA one and one with CV
rfcat = randomForest(good~., data=train, ntree=1000, mtry=sqrt(p))
rfcat_prob = predict(rfcat, test.x)
confMat = confusionMatrix(rfcat_prob, test.y, positive="1")
rfcat_eval = list(confusionMatrix = confMat)
rfcat_eval
#Now, for the RF with CV
ct = trainControl(method = "repeatedcv", number = 5, repeats = 2)
grid_rf = expand.grid(.mtry = c(p/3, 3.5, sqrt(p)))
cvrfcat = train(good~., data = train,
method = 'rf',
metric = "Accuracy",
trControl = ct,
tuneGrid = grid_rf)
cvrfcat_pred = predict(cvrfcat, test.x)
confMat = confusionMatrix(cvrfcat_pred, test.y, positive = "1")
cvrfcat_eval = list(confusionMatrix = confMat); cvrfcat_eval
importance(rfcat)
#Tried to do reduced model, but as expected got a better result with full model
#(even though the most important variables were selected)
cvrfcat.reduced = caret::train(good~alcohol+density+residual.sugar+chlorides, data = train,
method = 'rf',
metric = "Accuracy",
trControl = ct,
tuneGrid = grid_rf)
cvrfcat_pred.reduced = predict(cvrfcat.reduced, test.x)
confMat = confusionMatrix(cvrfcat_pred.reduced, test.y, positive = "1")
cvrfcat_eval.reduced = list(confusionMatrix = confMat); cvrfcat_eval.reduced
Following the trouble I went through in the regression problem only to observe an ever-so-slight improvement in model performance, for the boosted XGBoost model I will simply plug in the most common hyperparameter values I have observed, rather than writing a 150-line script to search for the best possible ones.
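The XGBoost fit that produces `xg_eval` (referenced in the comparison below) is not shown above; a minimal sketch of how it could look with the `xgboost` package, where the hyperparameter values are illustrative assumptions rather than the ones actually used:

```r
library(xgboost)

# xgboost needs a numeric 0/1 label, so convert the factor back
dtrain = xgb.DMatrix(train.x, label = as.numeric(as.character(train.y)))

# Commonly seen hyperparameter values (assumed, not tuned here)
xg_mod = xgboost(data = dtrain,
                 objective = "binary:logistic",
                 nrounds = 200,
                 eta = 0.1,
                 max_depth = 6,
                 verbose = 0)

# Predicted probabilities on the validation features, then classes at 0.5
xg_prob = predict(xg_mod, test.x)
xg_class = as.factor(ifelse(xg_prob >= threshold, 1, 0))
cm_xg = confusionMatrix(xg_class, test.y, positive = "1")
xg_eval = list(confusionMatrix = cm_xg)
```
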
glm_ = c(glm_eval$confusionMatrix$overall,
glm_eval$confusionMatrix$byClass)
lasso_cv = c(cvlasso_eval$confusionMatrix$overall,
cvlasso_eval$confusionMatrix$byClass)
classif_tree_boost = c(dtboost_eval$confusionMatrix$overall,
dtboost_eval$confusionMatrix$byClass)
rf = c(rfcat_eval$confusionMatrix$overall,
rfcat_eval$confusionMatrix$byClass)
cv_rf = c(cvrfcat_eval$confusionMatrix$overall,
cvrfcat_eval$confusionMatrix$byClass)
rf_cv_reduced = c(cvrfcat_eval.reduced$confusionMatrix$overall,
cvrfcat_eval.reduced$confusionMatrix$byClass)
xgb_cv = c(xg_eval$confusionMatrix$overall,
xg_eval$confusionMatrix$byClass)
all = cbind(glm_, lasso_cv,
classif_tree_boost,
rf, cv_rf, rf_cv_reduced, xgb_cv) %>% data.frame()
knitr::kable(all %>% round(3),
caption = "comparing all models")
| | classif_tree_boost | rf | cv_rf | rf_cv_reduced | xgb_cv |
|---|---|---|---|---|---|
| Accuracy | 0.809 | 0.845 | 0.848 | 0.825 | 0.842 |
| Kappa | 0.523 | 0.609 | 0.613 | 0.562 | 0.603 |
| AccuracyLower | 0.764 | 0.803 | 0.806 | 0.782 | 0.800 |
| AccuracyUpper | 0.848 | 0.881 | 0.883 | 0.863 | 0.878 |
| AccuracyNull | 0.701 | 0.701 | 0.701 | 0.701 | 0.701 |
| AccuracyPValue | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| McnemarPValue | 0.054 | 0.011 | 0.003 | 0.023 | 0.017 |
| Sensitivity | 0.897 | 0.929 | 0.937 | 0.913 | 0.925 |
| Specificity | 0.602 | 0.648 | 0.639 | 0.620 | 0.648 |
| Pos Pred Value | 0.841 | 0.861 | 0.859 | 0.849 | 0.860 |
| Neg Pred Value | 0.714 | 0.795 | 0.812 | 0.753 | 0.787 |
| Precision | 0.841 | 0.861 | 0.859 | 0.849 | 0.860 |
| Recall | 0.897 | 0.929 | 0.937 | 0.913 | 0.925 |
| F1 | 0.868 | 0.894 | 0.896 | 0.880 | 0.891 |
| Prevalence | 0.701 | 0.701 | 0.701 | 0.701 | 0.701 |
| Detection Rate | 0.629 | 0.651 | 0.657 | 0.640 | 0.648 |
| Detection Prevalence | 0.748 | 0.756 | 0.765 | 0.753 | 0.753 |
| Balanced Accuracy | 0.750 | 0.789 | 0.788 | 0.767 | 0.787 |
Now let's see the result of the best model on the official test set (not the validation set we were using before).
cvrfcat_off = train(good~., data = official_train,
method = 'rf',
metric = "Accuracy",
trControl = ct,
tuneGrid = grid_rf)
cvrfcat_pred_off = predict(cvrfcat_off, off_test.x)
confMat = confusionMatrix(cvrfcat_pred_off, off_test.y, positive = "1")
cvrfcat_eval_off = list(confusionMatrix = confMat); cvrfcat_eval_off
## $confusionMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 72 19
## 1 48 262
##
## Accuracy : 0.8329
## 95% CI : (0.7927, 0.8681)
## No Information Rate : 0.7007
## P-Value [Acc > NIR] : 7.771e-10
##
## Kappa : 0.572
##
## Mcnemar's Test P-Value : 0.0006245
##
## Sensitivity : 0.9324
## Specificity : 0.6000
## Pos Pred Value : 0.8452
## Neg Pred Value : 0.7912
## Prevalence : 0.7007
## Detection Rate : 0.6534
## Detection Prevalence : 0.7731
## Balanced Accuracy : 0.7662
##
## 'Positive' Class : 1
##
Maximizing accuracy is a good metric, but let's think a bit outside the box: how do we pick the best threshold?

Above, I picked the threshold to maximize accuracy. But is accuracy what really matters? Let me ask you this: how many times did you really think a wine tasted horrible? Very rarely, right? Let's be real: most of us are not connoisseurs and will not notice a huge difference between a wine that is "not bad", one that is "good", and one that is "very good". This is why, when first marketing the product to the jury, we will use accuracy as the metric to show them that we are right more often than not; and once the product goes to market, we will use our secret tactic, allowing more FP in exchange for fewer FN. Our threshold will therefore be below 0.5, and we will use a cost matrix in a fictional scenario that I created myself.
In other words, the per-outcome costs are: TP = −40, TN = −40, FN = 25, FP = 10, which gives:

Cost = 25·FN + 10·FP − 40·TP − 40·TN

Let me explain further: when we predict that a wine is bad but it is actually good, "good" only means good according to the quality cutoff we fed into the model, not necessarily good according to the user. Indeed, a user won't tell the difference between a wine rated 5.8 or 5.9 and one rated 6.1 or 6.2. We can profit from this margin by penalizing FN more heavily. The concept I have in mind is hard to explain in words, but I hope to convince you of it in my presentation!
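The cost curves plotted below can be obtained by sweeping candidate thresholds over a model's predicted probabilities; a minimal sketch using the cost values above and the logistic-regression probabilities `glm_prob` from earlier (the threshold grid and the base-graphics plot are illustrative choices):

```r
thresholds = seq(0.05, 0.95, by = 0.01)
truth = as.numeric(as.character(test.y))

# Total cost at each threshold: 25 per FN, 10 per FP, -40 per TP, -40 per TN
costs = sapply(thresholds, function(t) {
  pred = ifelse(glm_prob >= t, 1, 0)
  FN = sum(pred == 0 & truth == 1)
  FP = sum(pred == 1 & truth == 0)
  TP = sum(pred == 1 & truth == 1)
  TN = sum(pred == 0 & truth == 0)
  25 * FN + 10 * FP - 40 * TP - 40 * TN
})

# The threshold minimizing total cost
best_t = thresholds[which.min(costs)]
plot(thresholds, costs, type = "l", xlab = "threshold", ylab = "total cost")
abline(v = best_t, lty = 2)
```
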
#par(mfrow) has no effect on grid graphics; grid.arrange handles the layout itself
grid.arrange(plot, plot2, ncol = 2)
We look for the x-axis value corresponding to the minimum on the y-axis: this will be our threshold.
I will not spell out blunt conclusions to my business problems here, because they might seem absurd in writing; they will be much clearer once I explain them in person. I have provided all the methodologies and techniques used, and will draw the big conclusions in the third file. I hope my project was enjoyable and pleasant! Looking forward to presenting it.