Welcome to my wine project! I hope you will appreciate the journey! For a long time, wine tasters have been seen as the only trustworthy source for a reliable opinion on the quality of wine. More recently, Vivino has come into the picture: it relies on community-based ratings. My company “Weeno” (aka We Know) is a newcomer to the field of wine tasting and has to define business problems to make an effective entry into this market. I have defined 2 business goals: one short-term and one long-term.

Our short-term business problem is customer acquisition. A crucial part of achieving this goal is how reliable we are, because, let’s be real, community-based platforms are currently taking a lot of markets by storm. To compete with them, we must convince consumers that our results are as accurate as possible. We will use regression to answer this first business problem by predicting the quality of a wine given a set of inputs (nothing too crazy!). Our focus here must be on lowering our error rates as much as possible (you will see later which error metrics I chose). Our second task is customer retention. It is one thing to give accurate predictions, but what if we go one step further and predict whether the user will like the wine or not? This is achieved through a classification problem, which you can find in Part 2. Through this classification problem, we predict whether a wine is “good” or “bad”, and this will help keep our customers loyal because they will think we are giving them these predictions based on their preferences, when in reality it is all pre-programmed. However, this must remain a secret between us.

Eventually, when the business expands and my reach increases significantly, I will be able to ask users for their ratings and display not only the algorithm’s prediction but also what others think. Now you must be thinking: but isn’t the community part exactly Vivino’s model? To that I say “YES”. We would be taking customers from Vivino, increasing customer loyalty, and then using their own model in an indirect way! However, this is long-term thinking and not in the scope of this project.

PART 1: Regression

With the scenario set, Part 1 of this file covers the regression problem, which I tackle in 3 simple steps: 1. Data cleaning + data exploration; 2. Find the best-performing VANILLA model; 3. Optimize the best model found in step 2.

————————-Step -1: import libraries ——
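The library chunk itself does not appear in this file; judging from the functions called below, the imports were presumably along these lines (treat the exact list as an assumption on my part):

library(tidyverse)     # dplyr, purrr (map) and ggplot2
library(gridExtra)     # grid.arrange
library(ggpubr)        # ggarrange
library(GGally)        # ggcorr
library(caret)         # createDataPartition, train, confusionMatrix, preProcess
library(glmnet)        # lasso / ridge / elastic net
library(rpart)         # regression tree
library(rpart.plot)
library(randomForest)
library(gam)           # smoothing-spline GAMs
library(C50)           # boosted classification trees (C5.0)
library(xgboost)       # backend for caret's "xgbTree"
# library(e1071)       # skewness(), only needed for the commented transformation code
# set.seed(123)        # optional: fix a seed before the data splits for reproducibility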

–Step 0: import the data

df <- read.csv(file = "WineQuality.csv", 
                     header=T,
                     sep=",",
                     dec=".",
                     stringsAsFactors = T)

————Step 1: Data exploration + Data cleaning

df %>% map(~ sum(is.na(.)))
## $fixed.acidity
## [1] 0
## 
## $volatile.acidity
## [1] 0
## 
## $citric.acid
## [1] 0
## 
## $residual.sugar
## [1] 0
## 
## $chlorides
## [1] 0
## 
## $free.sulfur.dioxide
## [1] 0
## 
## $total.sulfur.dioxide
## [1] 0
## 
## $density
## [1] 0
## 
## $pH
## [1] 0
## 
## $sulphates
## [1] 0
## 
## $alcohol
## [1] 0
## 
## $quality
## [1] 0
#Wonderful, no null values

Understanding the variables:
1) Residual sugar: the amount of sugar left after the fermentation stops
2) pH: the level of acidity
3) Free sulfur dioxide: prevents microbial growth and the oxidation of wine
4) Fixed acidity: non-volatile acids that do not evaporate immediately
5) Density: the sweeter the wine, the higher the density
6) Chlorides: the amount of salt in the wine
7) Alcohol: we all know what it is…
8) Volatile acidity: leads to an unpleasant vinegar taste
9) Sulphates: a wine additive that contributes to SO2 levels
10) Total sulfur dioxide: amount of free + bound forms of SO2
11) Citric acid: acts as a preservative and increases acidity

Before starting the exploration phase, allow me to remove some blatant outliers. This step might seem unnecessary because we sacrifice a fair number of observations. However, these observations heavily distort the picture and leave the results unsatisfactory. I ran the experiment multiple times, fitting the algorithms both with and without the outliers, and there was a significant improvement in performance (on the test set) once the outliers were removed.

#For outlier cleaning I use two criteria: the per-variable boxplot rule and Cook's distance, which flags data points with significant leverage/influence.
table(df$quality)
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
# Get rid of outliers
outliers = c()
for ( i in 1:11 ) {
  stats = boxplot.stats(df[[i]])$stats
  bottom_outlier_rows = which(df[[i]] < stats[1])
  top_outlier_rows = which(df[[i]] > stats[5])
  outliers = c(outliers , top_outlier_rows[ !top_outlier_rows %in% outliers ] )
  outliers = c(outliers , bottom_outlier_rows[ !bottom_outlier_rows %in% outliers ] )
}

#Detect/Remove outliers
mod = lm(quality ~ ., data = df)
cooksd = cooks.distance(mod)
lapply(1:6, function(x) plot(mod, which=x, labels.id= 1:nrow(df))) %>% invisible()

#these graphs show what I have been talking about earlier regarding the high 
#leverage that some data points have
abline(h = 20*mean(cooksd, na.rm = T), col = "red")

head(df[cooksd > 20 * mean(cooksd, na.rm=T), ])
##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1418           8.6            0.550        0.35          15.55     0.057
## 1689           6.7            0.250        0.26           1.55     0.041
## 1932           7.1            0.490        0.22           2.00     0.047
## 2782           7.8            0.965        0.60          65.80     0.074
## 3308           9.4            0.240        0.29           8.50     0.037
## 4746           6.1            0.260        0.25           2.90     0.047
##      free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1418                35.5                366.5 1.00010 3.04      0.63    11.0
## 1689               118.5                216.0 0.99490 3.55      0.63     9.4
## 1932               146.5                307.5 0.99240 3.24      0.37    11.0
## 2782                 8.0                160.0 1.03898 3.39      0.69    11.7
## 3308               124.0                208.0 0.99395 2.90      0.38    11.0
## 4746               289.0                440.0 0.99314 3.44      0.64    10.5
##      quality
## 1418       3
## 1689       3
## 1932       3
## 2782       6
## 3308       3
## 4746       3
coutliers = as.numeric(rownames(df[cooksd > 20 * mean(cooksd, na.rm=T), ]))
outliers = c(outliers , coutliers[ !coutliers %in% outliers ] )
df = df[-outliers, ]
mod = lm(quality ~ ., data = df)
par(mfrow=c(2,3))
lapply(1:6, function(x) plot(mod, which=x, labels.id= 1:nrow(df))) %>% invisible()

#much better results after the omission of the outliers
range(df$quality)
## [1] 3 9

Quality values generally range on a scale from 0 to 10, but in our dataset the minimum is 3 and the maximum is 9, so there are no exceptional or truly horrible wines.

table(df$quality)
## 
##    3    4    5    6    7    8    9 
##    8   89 1106 1866  797  145    4

Very few wines have a quality of 3 or 9. We will see a bar chart of quality below; first, let’s look at the structure of the data.

str(df)
## 'data.frame':    4015 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

Only numerical variables, all continuous except for quality (which is the target variable)

This makes our life easier in the univariate EDA, as we only have to draw histograms!

p1 <- ggplot(df, aes(x=fixed.acidity)) + ggtitle("Fixed Acidity") + 
  xlab("Fixed acidity") + geom_histogram(fill="salmon")
p2 <- ggplot(df, aes(x=volatile.acidity )) + ggtitle("Volatile Acidity") + 
  xlab("Volatile acidity") + geom_histogram(fill="salmon") 
p3 <- ggplot(df, aes(x=citric.acid)) + ggtitle("Citric acid") + xlab("Citric acid") + 
  geom_histogram(fill="salmon") 
p4 <- ggplot(df, aes(x=residual.sugar)) + ggtitle("residual sugar") +
  xlab("residual sugar") + geom_histogram(fill="salmon")
grid.arrange(p1, p2, p3, p4, ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p5 <- ggplot(df, aes(x=chlorides)) + ggtitle("Chlorides") + 
  xlab("Chlorides") + geom_histogram(fill="salmon")
p6 <- ggplot(df, aes(x=free.sulfur.dioxide )) + ggtitle("Free sulfur dioxide")+ 
  xlab("Free sulfur dioxide") + geom_histogram(fill="salmon") 
p7 <- ggplot(df, aes(x=total.sulfur.dioxide)) + ggtitle("Total sulfur dioxide") + xlab("Total sulfur dioxide") + 
  geom_histogram(fill="salmon") 
p8 <- ggplot(df, aes(x=pH)) + ggtitle("pH") + 
  xlab("pH") + geom_histogram(fill="salmon")
grid.arrange(p5, p6, p7, p8, ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p9 <- ggplot(df, aes(x=sulphates)) + ggtitle("Sulphates") + 
  xlab("Sulphates") + geom_histogram(fill="salmon")
p10 <- ggplot(df, aes(x=alcohol )) + ggtitle("Alcohol")+ 
  xlab("alcohol") + geom_histogram(fill="salmon") 
p11 <- ggplot(df, aes(x=quality)) + ggtitle("Quality") + xlab("quality") + 
  geom_bar(fill="salmon") 
grid.arrange(p9, p10, p11, ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

grid.arrange(p4,p5,p6, ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We notice that some variables are quite skewed, so it is worth applying transformations (there is a variety of options such as log, sqrt, Box-Cox, Yeo-Johnson and so on; I start with log and sqrt to see which works better. Typically, we want skewness < 1). An alternative, more automatic route is to simply rely on the centering and scaling applied later on; strictly speaking that only standardizes the variables and does not remove skewness, but it keeps the pipeline simple. I therefore leave the transformation code below commented out as an alternative solution.

# skewness(df$residual.sugar)
# df <- df %>% mutate(residual.sugar = log(residual.sugar))
# skewness(df$residual.sugar)
# 
# skewness(df$chlorides)
# df <- df %>% mutate(chlorides = log(chlorides))
# skewness(df$chlorides)
# 
# skewness(df$free.sulfur.dioxide)
# df<- df %>% mutate(free.sulfur.dioxide = sqrt(free.sulfur.dioxide))
# skewness(df$free.sulfur.dioxide)
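If we wanted a more automatic transformation instead of hand-picking log/sqrt, caret's preProcess can estimate a Yeo-Johnson transformation per column. A sketch of how it could be wired in, left commented out like the code above (the exact call is my assumption; it was not run in this project):

# num_vars <- df %>% select(-quality)
# pp <- preProcess(num_vars, method = "YeoJohnson")   # estimate one transformation per predictor
# df <- predict(pp, num_vars) %>% mutate(quality = df$quality)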

#Check for correlation among variables
df %>% ggcorr()

Nothing really alarming except the high correlation between density and residual sugar. We will see later whether stepwise selection and the lasso take care of this.

I could spend more time analyzing measures of location (mean, median, quantiles) and spread, but apart from the clearly skewed distributions of some variables there is not much to comment on from a univariate perspective. With that being said, I think it is much more interesting to move on to the bivariate analysis.

#bivariate analysis:
g1 <- ggplot(df, aes(factor(quality), fixed.acidity, fill=factor(quality))) + 
  geom_boxplot() +
  labs(x = "quality", y = "fixed.acidity", title = "Boxplot of Quality vs. fixed.acidity") + 
  theme(legend.position = 'none', plot.title = element_text(size = 9, hjust=0.5))

g2 <- ggplot(df, aes(factor(quality), volatile.acidity, fill=factor(quality))) + 
  geom_boxplot() +
  labs(x = "quality", y = "volatile.acidity", title = "Boxplot of Quality vs. volatile.acidity") + 
  theme(legend.position = 'none', plot.title = element_text(size = 9, hjust=0.5))

g3 <- ggplot(df, aes(factor(quality), citric.acid, fill=factor(quality))) + 
  geom_boxplot() +
  labs(x = "quality", y = "citric.acid", title = "Boxplot of Quality vs. citric.acid") + 
  theme(legend.position = 'none', plot.title = element_text(size = 9, hjust=0.5))

g4 <- ggplot(df, aes(factor(quality), residual.sugar, fill=factor(quality))) + 
  geom_boxplot() +
  labs(x = "quality", y = "residual.sugar", title = "Boxplot of Quality vs. residual.sugar") + 
  theme(legend.position = 'none', plot.title = element_text(size = 9, hjust=0.5))
ggarrange(g1, g2, g3, g4, nrow = 2, ncol =2)

g5 <- ggplot(df, aes(factor(quality), chlorides, fill=factor(quality))) + 
  geom_boxplot() +
  labs(x = "Quality", y = "chlorides", title = "Boxplot of Quality vs. chlorides") + 
  theme(legend.position = 'none', plot.title = element_text(size = 9, hjust=0.5))

g6 <- ggplot(df, aes(factor(quality), free.sulfur.dioxide, fill=factor(quality))) + 
  geom_boxplot() +
  labs(x = "quality", y = "free.sulfur.dioxide", title = "Boxplot of quality vs. free.sulfur.dioxide") + 
  theme(legend.position = 'none', plot.title = element_text(size = 9, hjust=0.5))

g7 <- ggplot(df, aes(factor(quality), total.sulfur.dioxide, fill=factor(quality))) + 
  geom_boxplot() +
  labs(x = "quality", y = "total.sulfur.dioxide", title = "Boxplot of quality vs. total.sulfur.dioxide") + 
  theme(legend.position = 'none', plot.title = element_text(size = 9, hjust=0.5))

g8 <- ggplot(df, aes(factor(quality), density, fill=factor(quality))) + 
  geom_boxplot() +
  labs(x = "quality", y = "density", title = "Boxplot of quality vs. density") + 
  theme(legend.position = 'none', plot.title = element_text(size = 9, hjust=0.5))
ggarrange(g5, g6, g7, g8, nrow = 2, ncol =2)

g9 <- ggplot(df, aes(factor(quality), pH, fill=factor(quality))) + 
  geom_boxplot() +
  labs(x = "quality", y = "pH", title = "Boxplot of Quality vs. pH") + 
  theme(legend.position = 'none', plot.title = element_text(size = 9, hjust=0.5))

g10 <- ggplot(df, aes(factor(quality), sulphates, fill=factor(quality))) + 
  geom_boxplot() +
  labs(x = "quality", y = "sulphates", title = "Boxplot of quality vs. sulphates") + 
  theme(legend.position = 'none', plot.title = element_text(size = 9, hjust=0.5))

g11 <- ggplot(df, aes(factor(quality), alcohol, fill=factor(quality))) + 
  geom_boxplot() +
  labs(x = "quality", y = "alcohol", title = "Boxplot of quality vs. alcohol") + 
  theme(legend.position = 'none', plot.title = element_text(size = 9, hjust=0.5))
ggarrange(g9, g10, g11, nrow = 2, ncol =2)

We notice an increasing trend between quality and alcohol level: the higher the quality, the higher the alcohol level. The same can be said for pH and quality, except that this relationship is less evident. We will investigate further when fitting the models.

——————–END OF DATA EXPLORATION + CLEANING

#split data into train/test set
p=ncol(df)
train_index <- createDataPartition(df$quality, p = .9, 
                                   list = FALSE, 
                                   times = 1)
traintot <- df[train_index, ]
test <- df[-train_index, ]

#As mentioned above, we simply center and scale the predictors here (no skewness transformations)
trainScalenum <- traintot %>% select(-all_of(p)) %>% scale(center=TRUE, scale=TRUE)
trainScale <- cbind(trainScalenum, quality=traintot[,p]) 
head(trainScale)
testScalenum <- test %>% select(-all_of(p)) %>% scale(., center=attr(trainScalenum, "scaled:center"),
                                                      scale=attr(trainScalenum, "scaled:scale"))
testScale <- cbind(testScalenum, quality=test[,p])

trainScale <- data.frame(trainScale)
testScale <- data.frame(testScale)


#Split train (already scaled) into train/validation set
train_index2 <- createDataPartition(trainScale$quality, p = .8, 
                                   list = FALSE, 
                                   times = 1)
trainScaled <- trainScale[train_index2, ]
validationScaled <- trainScale[-train_index2, ]



trainScaled <- data.frame(trainScaled)
validationScaled <- data.frame(validationScaled)
set.seed(1)
#Analysis of some variables with respect to quality to get some interesting insights
#as requested by you and as I did in the midterm. Lower dimensional model
myfirstfit = lm(quality~alcohol*pH, data=trainScaled)
summary(myfirstfit)
## 
## Call:
## lm(formula = quality ~ alcohol * pH, data = trainScaled)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5435 -0.5768  0.0192  0.4286  2.7147 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.94079    0.01448 410.210  < 2e-16 ***
## alcohol      0.35230    0.01465  24.044  < 2e-16 ***
## pH           0.02759    0.01433   1.926   0.0543 .  
## alcohol:pH   0.09295    0.01484   6.265 4.27e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7753 on 2889 degrees of freedom
## Multiple R-squared:  0.1743, Adjusted R-squared:  0.1735 
## F-statistic: 203.3 on 3 and 2889 DF,  p-value: < 2.2e-16

Let’s analyze what we get. Write X1 = alcohol and X2 = pH, both standardized, so the model is quality = b0 + b1(X1)’ + b2(X2)’ + b3(X1)’(X2)’.
b0 = 5.94: the expected quality when alcohol and pH are both at their means.
b1 = 0.35: when pH is at its mean, a one-standard-deviation increase in alcohol is associated with an expected increase in quality of 0.35, which is not negligible at all.
b1 + b3 = 0.45: when pH is one standard deviation above its mean, the same one-sd increase in alcohol is associated with an expected increase of 0.45. This improves slightly on the effect at mean pH, but it is clear that most of the increase comes from alcohol itself rather than from the alcohol:pH interaction.
b1 - b3 = 0.26: when pH is one standard deviation below its mean, a one-sd increase in alcohol is associated with an expected increase of 0.26. As mentioned before, alcohol is the main driving force.
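To make the arithmetic explicit, these contrasts can be pulled straight from the fitted coefficients; given the summary above this should return roughly 0.35, 0.45 and 0.26:

b <- coef(myfirstfit)
# expected change in quality for +1 sd of alcohol, at pH = mean, mean + 1 sd and mean - 1 sd
round(unname(c(b["alcohol"], b["alcohol"] + b["alcohol:pH"], b["alcohol"] - b["alcohol:pH"])), 2)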

#Function that will allow us to easily evaluate our models based on specific metrics. This will allow us to have a similar comparison at the end between the different models
myval = function(pred, actual,title= "") {
  rmse = sqrt(mean((pred - actual)^2))
  mae = mean(abs(pred - actual))
  par(mfrow = c(1,3), oma = c(0, 0, 3, 0))
  resid = pred - actual
  plot(resid)
  plot(jitter(actual, factor = 1), 
       jitter(pred, factor = 0.5), 
       pch = 4, asp = 1,
       xlab = "Truth", ylab = "Predicted") 
  abline(0,1, lty = 2)
  hist(resid, breaks = 20, main = NULL)
  mtext(paste0(title, " predicted vs. actual using test set"), outer = TRUE)
  return(list(rmse = rmse,
              mae = mae
  ))
}

——–Step 2: Vanilla models

#Real talk: time to start fitting models. The procedure I follow is to first fit VANILLA models and then optimize the best ones (lowest MAE and RMSE)
lm.fit = lm(quality~., data = trainScaled)
lm.pred = predict(lm.fit, validationScaled[,-p])
lm.myval = myval(lm.pred, validationScaled$quality, title = "linear model: predicted vs true"); unlist(lm.myval)

#Difference between predicted and true quality is not too bad.



# variable selection using stepwise methods
lm_simple = lm(quality ~ ., data = trainScaled)
lm_step = step(lm_simple, ~ fixed.acidity + volatile.acidity + 
                                      citric.acid + residual.sugar +  chlorides + free.sulfur.dioxide +
                                      total.sulfur.dioxide + density + pH + sulphates + alcohol, 
                            direction = "both",trace = 0)
summary(lm_step)

lm_step_pred = predict(lm_step, validationScaled[,-p])
lm_step_myval = myval(lm_step_pred, validationScaled$quality, title="stepwise model: predicted vs true");unlist(lm_step_myval)

#LASSO + Elastic net + Ridge
dat = traintot %>% as_tibble()
X <- model.matrix(quality~.-1, data = dat)
Y <- dat$quality

n <- nrow(dat)

#Let's re-split to satisfy the matrix format glmnet expects
ntr <- round(n*0.8)
nte <- n-ntr
trIdx <- sample(1:n, size=ntr)
Xtr <- X[trIdx,]
Ytr <- Y[trIdx]
Xte <- X[-trIdx,]
Yte <- Y[-trIdx]
XtrS <- Xtr %>% scale()   # This standardizes Xtr
XteS <- Xte %>% scale(., center=attr(XtrS, "scaled:center"),
                            scale=attr(XtrS, "scaled:scale"))
#When standardizing the test set (here actually the validation set), we make sure to use 
#the training-set center and scale, otherwise the scales would not match and information would leak.

nlambdas <- 100
lambdas <- seq(0.001, 2, length.out = nlambdas)
nfolds <- 5

#Lasso model
Lasso.cv <- cv.glmnet(XtrS, Ytr, lambda=lambdas, alpha=1,
                      family="gaussian")
plot(Lasso.cv)


lambdaStar <- Lasso.cv$lambda.1se
lasso_pred = predict(Lasso.cv, XteS, s=lambdaStar)
lasso_myval = myval(lasso_pred, Yte, title="lasso model: predicted vs true"); unlist(lasso_myval)

#Ridge model
Ridge.cv <- cv.glmnet(XtrS, Ytr, lambda=lambdas, alpha=0,
                      family="gaussian")
plot(Ridge.cv)


lambdaStar <- Ridge.cv$lambda.1se
ridge_pred = predict(Ridge.cv, XteS, s=lambdaStar)
ridge_myval = myval(ridge_pred, Yte, title="Ridge model: predicted vs true"); unlist(ridge_myval)

#Elastic net
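The chunk below calls a helper unregister_dopar() that is not defined anywhere in this file. A commonly used definition is the following (an assumption on my part; it simply clears any parallel backend previously registered with foreach/doParallel so that caret falls back to sequential execution):

unregister_dopar <- function() {
  # wipe the state that foreach keeps about the registered parallel backend
  env <- foreach:::.foreachGlobals
  rm(list = ls(name = env), pos = env)
}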

unregister_dopar()

nalphas <- 20
alphas <- seq(0, 1, length.out=nalphas)
ctrl <- trainControl(method="cv", number=nfolds)
elastic.caret <- caret::train(XtrS, Ytr, method = "glmnet", trControl = ctrl,
                     tuneGrid = expand.grid(alpha = alphas, 
                                            lambda = lambdas))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
best_alpha = elastic.caret$bestTune$alpha
best_lambda = elastic.caret$bestTune$lambda

elastic_pred = predict(elastic.caret, XteS)   # caret predicts with its bestTune (alpha, lambda) automatically
elastic_myval = myval(elastic_pred, Yte, title="elastic model: predicted vs actual"); unlist(elastic_myval)

#-------Moving on to tree-based algorithms----------

#Let's run a decision tree algorithm. Here I decided to use rpart instead of the 
#method we saw in class, but I tested both and they gave the same results
dtree_mod = rpart(quality~., data=trainScaled)
rpart.plot(dtree_mod)  
dtree_mod.pred = predict(dtree_mod, validationScaled[,-p])
dtree_mod.myval = myval(dtree_mod.pred, validationScaled$quality, title="decision tree: predicted vs true"); unlist(dtree_mod.myval)

# RF model
rf_mod= randomForest(quality~., data=trainScaled, ntree=1000, mtry=3)
rf_mod_pred = predict(rf_mod, validationScaled[,-p])
rf_mod_myval = myval(rf_mod_pred, validationScaled$quality, title="random forest: predicted vs true"); unlist(rf_mod_myval)

#xgboost
fitControl = trainControl(
  method = "cv",
  number = 10
)

tune_grid =
  expand.grid(
    nrounds = c(50, 80),
    eta = c(0.03,0.01),
    max_depth=c(6, 5),
    subsample = c(0.75, 1),
    colsample_bytree = c(0.4, 0.6),
    min_child_weight = c(1, 3),
    gamma = c(0.05, 0.1)
  )
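The grid and control object above are defined, but the actual xgboost fit does not appear in this chunk, even though the comparison table below reports an xgb column via xg_myval. It was presumably produced along these lines (a sketch, assuming caret's "xgbTree" method with the objects defined above):

xgb_fit = caret::train(quality ~ ., data = trainScaled,
                       method = "xgbTree",
                       trControl = fitControl,
                       tuneGrid = tune_grid)
xg_pred = predict(xgb_fit, validationScaled[,-p])
xg_myval = myval(xg_pred, validationScaled$quality, title="xgboost: predicted vs true"); unlist(xg_myval)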
#GAM model with smoothing splines
library(splines)
fit = smooth.spline(trainScaled$density, trainScaled$alcohol,cv=TRUE)
## Warning in smooth.spline(trainScaled$density, trainScaled$alcohol, cv = TRUE):
## cross-validation with non-unique 'x' values seems doubtful
plot(trainScaled$residual.sugar, trainScaled$quality)

# lines(fit,col="red",lwd=2)
#I found the optimal degrees of freedom for each variable and applied them. We see an improvement
#in results. I will not detail how I found the optimal degrees of freedom for each variable,
#but I can do so upon request
ga.model <- gam(quality ~ s(alcohol,12.23) + s(volatile.acidity,4) + s(sulphates,6.43) + s(chlorides,5.23) +
                  s(total.sulfur.dioxide,5.87) + s(free.sulfur.dioxide,8.5) +  s(citric.acid, 2) + s(pH,4.35) +
                  s(residual.sugar,23.16) + s(density,8) + s(fixed.acidity,4.3),
                data = trainScaled)
summary(ga.model)

gam_preds <- predict(ga.model, validationScaled[,-p])
gam_myval = myval(gam_preds, validationScaled$quality, title="GAM: predicted vs true"); unlist(gam_myval)

#ASIDE: LM with the top variables only, just to see if it improves on the full LM
lm.fit.reduced = lm(quality ~ alcohol + density + chlorides + volatile.acidity + total.sulfur.dioxide, data = trainScaled)
summary(lm.fit.reduced)
lm.pred.reduced = predict(lm.fit.reduced, validationScaled[,-p])
lm_reduced_myval = myval(lm.pred.reduced, validationScaled$quality, "LM with best features: predicted vs true"); unlist(lm_reduced_myval)

knitr::kable(cbind(lm = unlist(lm.myval),
                   lm.step = unlist(lm_step_myval),
                   lasso = unlist(lasso_myval),
                   ridge = unlist(ridge_myval),
                   elastic.net = unlist(elastic_myval),
                   rf = unlist(rf_mod_myval),
                   # rf.cv = unlist(cvrf_myval),
                   # lm.reduced = unlist(lm_reduced_myval),
                   regression.tree = unlist(dtree_mod.myval),
                   xgb = unlist(xg_myval),
                   # xgb_best = unlist(best.xgb.myval),
                   gam = unlist(gam_myval)) ,
             
             
             caption = "Models for regression: ")
Models for regression:

|      |        lm |   lm.step |     lasso |     ridge | elastic.net |        rf | regression.tree |       xgb |       gam |
|------|-----------|-----------|-----------|-----------|-------------|-----------|-----------------|-----------|-----------|
| rmse | 0.7277124 | 0.7276201 | 0.7557775 | 0.7621502 |   0.7459749 | 0.5961657 |       0.7452770 | 0.8190451 | 0.6855082 |
| mae  | 0.5773486 | 0.5772719 | 0.5962616 | 0.5959993 |   0.5895048 | 0.4487771 |       0.6045916 | 0.6359832 | 0.5553117 |

–Step 3: Optimizing best model found in 2.

#Now that we found that RF is the best performing, I will further optimize it
#by using RF with CV
ct = trainControl(method = "repeatedcv", number = 10, repeats = 2)
grid_rf = expand.grid(.mtry = c(3, sqrt(p), p/3))

cvrf = train(quality~., data = trainScaled,
                method = 'rf',
                metric = "RMSE",
                trControl = ct,
                tuneGrid = grid_rf)
cvrf_pred = predict(cvrf, validationScaled[,-p])
cvrf_myval = myval(cvrf_pred, validationScaled$quality, title="rf with CV: predicted vs true"); unlist(cvrf_myval)

##      rmse       mae 
## 0.5950688 0.4476521

We do not see a significant improvement, which suggests the vanilla random forest was already close to optimal for the hyperparameters considered. Of course, increasing the number of repeats in ct might yield slightly better results, but nothing substantial.

Lastly, let’s look at performance on the real, held-out test set.

#fit best model (rf.cv) on TEST set and get predictions
cvrf.off.pred = predict(cvrf, testScale[,-p])
cvrf.off.myval = myval(cvrf.off.pred, testScale$quality, title="rf performance test set"); unlist(cvrf.off.myval)

##      rmse       mae 
## 0.5729061 0.4204933

PART 2: Classification

I already introduced the business problem for classification in Part 1. I will state the steps of my classification analysis, which is similar in spirit to the regression: 1. Since the EDA is already done above, I do not repeat it here. 2. Find the best model using accuracy as the metric. 3. A special twist: change the focus from accuracy to higher recall. This might sound confusing now, but it will all become clear later on.

Most importantly, ENJOY!

#This is where the magic starts: special categorization (>=6 is good)
new_df = df %>% 
  mutate(good = ifelse(quality >= 6, "1", "0") %>% as.factor()) %>%
  select(-quality); dim(df)
table(new_df$good)

#split into test/train
p=ncol(new_df)
idx0 <- createDataPartition(new_df$good, p = .9, 
                                   list = FALSE, 
                                   times = 1)


off_train.x = new_df[idx0, -p] %>% as.matrix()
dim(off_train.x)
off_train.x = off_train.x %>% scale(center=TRUE)
off_train.y = new_df[idx0, p]
official_train = data.frame(off_train.x, good=off_train.y)
dim(official_train)

off_test.x = new_df[-idx0, -p] %>% as.matrix()
off_test.x = off_test.x %>% scale(., center=attr(off_train.x, "scaled:center"),
                                  scale=attr(off_train.x, "scaled:scale"))
off_test.y = new_df[-idx0, p]
official_test = data.frame(off_test.x, good=off_test.y)


#split train into train/validation
idx = createDataPartition(official_train$good, p = 0.9, list=F)

train.x = official_train[idx, -p] %>% as.matrix(); dim(train.x)
train.y = official_train[idx, p]; table(train.y)
train = data.frame(train.x, good=train.y)

test.x = official_train[-idx, -p] %>% as.matrix(); dim(test.x)
test.y = official_train[-idx, p]; table(test.y)
test = data.frame(test.x, good=test.y)

Start with the reduced model, as requested.

#logistic regression only on few vars for interpretation purposes
set.seed(2)
fitprova = glm(good~alcohol*density,data = train, family = binomial)
summary(fitprova)
## 
## Call:
## glm(formula = good ~ alcohol * density, family = binomial, data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.824  -1.116   0.501   0.902   1.639  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      0.87885    0.05748  15.289  < 2e-16 ***
## alcohol          1.19082    0.08528  13.964  < 2e-16 ***
## density          0.12950    0.08263   1.567    0.117    
## alcohol:density -0.28271    0.06070  -4.657  3.2e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3972.8  on 3252  degrees of freedom
## Residual deviance: 3432.7  on 3249  degrees of freedom
## AIC: 3440.7
## 
## Number of Fisher Scoring iterations: 5

The model is: log-odds(good) = b0 + b1*alcohol + b2*density + b3*alcohol*density, with standardized predictors.
e^(b0) = 2.41: the odds of a wine being good when alcohol and density are at their means.
e^(b0+b1) = 7.92: the odds of a wine being good when alcohol is one sd above its mean and density is at its mean.
e^(b2) = 1.14: the odds ratio for a one-sd increase in density when alcohol is at its mean.
e^(b2+b3) = 0.86: the odds ratio for a one-sd increase in density when alcohol is one sd above its mean.
We conclude that higher alcohol (and, to a much smaller extent, higher density on its own) is associated with higher odds of a good wine, but when alcohol is already high, increasing density is associated with lower odds, so the two together work against each other. Alcohol clearly has the bigger effect of the two. Worth noting that the density main effect is not statistically significant, so take this part of the analysis with a pinch of salt.
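Again, to make the arithmetic explicit, the four figures above can be computed directly from the fitted coefficients:

b <- coef(fitprova)
# odds at the means; odds with alcohol +1 sd; density odds ratio at mean alcohol; density odds ratio with alcohol +1 sd
round(exp(unname(c(b["(Intercept)"], b["(Intercept)"] + b["alcohol"],
                   b["density"], b["density"] + b["alcohol:density"]))), 2)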

Let’s fit some models, starting with logistic regression on this classification problem. For now let’s use a threshold of 0.5; later I come up with a more ingenious way to find the optimal threshold for our business problem.

#Logistic regression with a 0.5 threshold (the optimal threshold comes later)
threshold = 0.5
glm_mod = glm(good ~. , data = train, family = binomial)
glm_prob = predict(glm_mod, data.frame(test.x), type="response")
glm_class = as.factor(ifelse(glm_prob >= threshold, 1, 0))
cm = confusionMatrix(glm_class, test.y, positive = "1")
glm_eval = list(confusionMatrix = cm); glm_eval



#Now Lasso with 10-fold CV
cvlasso = cv.glmnet(train.x, train.y, 
                       family = "binomial",
                       type.measure = "auc")
coef(cvlasso, s=cvlasso$lambda.1se) 

cvlasso_prob = predict(cvlasso, 
                          test.x, 
                          type="response", 
                          s=cvlasso$lambda.1se)

lasso_class = as.factor(ifelse(cvlasso_prob >= threshold, 1, 0))
cm_lasso = confusionMatrix(lasso_class, test.y, positive = "1")
cvlasso_eval = list(confusionMatrix = cm_lasso); cvlasso_eval


#Decision tree but this time classification problem with boosting
dt = C5.0(train.x, train.y, trials = 10)
dt_pred = predict(dt, test.x)
confMat = confusionMatrix(dt_pred, test.y, positive="1")
dtboost_eval = list(confusionMatrix = confMat); dtboost_eval




#Now, let's fit a RF in 2 ways: a VANILLA one and one with CV
rfcat = randomForest(good~., data=train, ntree=1000, mtry=sqrt(p))
rfcat_prob = predict(rfcat, test.x)
confMat = confusionMatrix(rfcat_prob, test.y, positive="1")
rfcat_eval = list(confusionMatrix = confMat)
rfcat_eval


#Now, for the RF with CV
ct = trainControl(method = "repeatedcv", number = 5, repeats = 2)
grid_rf = expand.grid(.mtry = c(p/3, 3.5, sqrt(p)))

cvrfcat = train(good~., data = train,
                     method = 'rf',
                     metric = "Accuracy",
                     trControl = ct,
                     tuneGrid = grid_rf)


cvrfcat_pred = predict(cvrfcat, test.x)
confMat = confusionMatrix(cvrfcat_pred, test.y, positive = "1")
cvrfcat_eval = list(confusionMatrix = confMat); cvrfcat_eval

importance(rfcat)

#Tried to do reduced model, but as expected got a better result with full model
#(even though the most important variables were selected)
cvrfcat.reduced = caret::train(good~alcohol+density+residual.sugar+chlorides, data = train,
                     method = 'rf',
                     metric = "Accuracy",
                     trControl = ct,
                     tuneGrid = grid_rf)


cvrfcat_pred.reduced = predict(cvrfcat.reduced, test.x)
confMat = confusionMatrix(cvrfcat_pred.reduced, test.y, positive = "1")
cvrfcat_eval.reduced = list(confusionMatrix = confMat); cvrfcat_eval.reduced

Given the effort I went through in the regression problem only to observe an ever-so-slight improvement in performance, for xgboost I simply plug in the most commonly used hyperparameters I have seen, rather than writing a 150-line script to search for the best possible ones (see the sketch below).
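For reference, here is a sketch of the kind of xgboost classification fit that produces the xg_eval object used in the comparison below; the fit itself is not shown in this file, so treat the hyperparameter values as an assumption:

xgbcat = caret::train(good ~ ., data = train,
                      method = "xgbTree",
                      metric = "Accuracy",
                      trControl = ct,
                      tuneGrid = expand.grid(nrounds = 100, eta = 0.1, max_depth = 6,
                                             gamma = 0, colsample_bytree = 0.8,
                                             min_child_weight = 1, subsample = 0.75))
xgbcat_pred = predict(xgbcat, test.x)
xg_eval = list(confusionMatrix = confusionMatrix(xgbcat_pred, test.y, positive = "1")); xg_eval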

glm_ = c(glm_eval$confusionMatrix$overall, 
        glm_eval$confusionMatrix$byClass)
lasso_cv = c(cvlasso_eval$confusionMatrix$overall, 
             cvlasso_eval$confusionMatrix$byClass)
classif_tree_boost = c(dtboost_eval$confusionMatrix$overall, 
                        dtboost_eval$confusionMatrix$byClass)
rf = c(rfcat_eval$confusionMatrix$overall, 
       rfcat_eval$confusionMatrix$byClass)
cv_rf = c(cvrfcat_eval$confusionMatrix$overall, 
          cvrfcat_eval$confusionMatrix$byClass)
rf_cv_reduced = c(cvrfcat_eval.reduced$confusionMatrix$overall,
          cvrfcat_eval.reduced$confusionMatrix$byClass)
xgb_cv = c(xg_eval$confusionMatrix$overall, 
                          xg_eval$confusionMatrix$byClass)

all = cbind(glm_, lasso_cv, 
             classif_tree_boost,
            rf, cv_rf, rf_cv_reduced, xgb_cv) %>% data.frame()

knitr::kable(all %>% round(3),
             caption = "comparing all models")
comparing all models

|                      | classif_tree_boost |    rf | cv_rf | rf_cv_reduced | xgb_cv |
|----------------------|--------------------|-------|-------|---------------|--------|
| Accuracy             | 0.828              | 0.870 | 0.867 | 0.820         | 0.834  |
| Kappa                | 0.572              | 0.671 | 0.667 | 0.567         | 0.595  |
| AccuracyLower        | 0.785              | 0.831 | 0.828 | 0.776         | 0.791  |
| AccuracyUpper        | 0.866              | 0.903 | 0.900 | 0.858         | 0.871  |
| AccuracyNull         | 0.701              | 0.701 | 0.701 | 0.701         | 0.701  |
| AccuracyPValue       | 0.000              | 0.000 | 0.000 | 0.000         | 0.000  |
| McnemarPValue        | 0.057              | 0.004 | 0.014 | 0.804         | 0.366  |
| Sensitivity          | 0.909              | 0.949 | 0.941 | 0.877         | 0.897  |
| Specificity          | 0.639              | 0.685 | 0.694 | 0.685         | 0.685  |
| Pos Pred Value       | 0.855              | 0.876 | 0.878 | 0.867         | 0.870  |
| Neg Pred Value       | 0.750              | 0.851 | 0.833 | 0.705         | 0.740  |
| Precision            | 0.855              | 0.876 | 0.878 | 0.867         | 0.870  |
| Recall               | 0.909              | 0.949 | 0.941 | 0.877         | 0.897  |
| F1                   | 0.881              | 0.911 | 0.908 | 0.872         | 0.883  |
| Prevalence           | 0.701              | 0.701 | 0.701 | 0.701         | 0.701  |
| Detection Rate       | 0.637              | 0.665 | 0.659 | 0.615         | 0.629  |
| Detection Prevalence | 0.745              | 0.759 | 0.751 | 0.709         | 0.723  |
| Balanced Accuracy    | 0.774              | 0.817 | 0.818 | 0.781         | 0.791  |

Now we look at the result of the best model on the test set (the official held-out one, not the validation set we were using before).

cvrfcat_off = train(good~., data = official_train,
                method = 'rf',
                metric = "Accuracy",
                trControl = ct,
                tuneGrid = grid_rf)


cvrfcat_pred_off = predict(cvrfcat_off, off_test.x)
confMat = confusionMatrix(cvrfcat_pred_off, off_test.y, positive = "1")
cvrfcat_eval_off = list(confusionMatrix = confMat); cvrfcat_eval_off
## $confusionMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  76  24
##          1  44 257
##                                         
##                Accuracy : 0.8304        
##                  95% CI : (0.79, 0.8658)
##     No Information Rate : 0.7007        
##     P-Value [Acc > NIR] : 1.655e-09     
##                                         
##                   Kappa : 0.5754        
##                                         
##  Mcnemar's Test P-Value : 0.02122       
##                                         
##             Sensitivity : 0.9146        
##             Specificity : 0.6333        
##          Pos Pred Value : 0.8538        
##          Neg Pred Value : 0.7600        
##              Prevalence : 0.7007        
##          Detection Rate : 0.6409        
##    Detection Prevalence : 0.7506        
##       Balanced Accuracy : 0.7740        
##                                         
##        'Positive' Class : 1             
## 

Accuracy is a perfectly good metric to maximize, but let’s think a bit outside the box: how do we pick the best threshold? Above, I picked the threshold that gave better accuracy. But is accuracy what really matters? Let me ask you this: how many times did you really think a wine tasted horrible? Very rarely, right? Let’s be real, most of us are not connoisseurs and will not notice a huge difference between a wine that is “not bad”, one that is “good” and one that is “very good”. This is why, at the beginning, when marketing the product in front of the jury, we will use accuracy as the metric to show that we are right more often than not; once the product goes to market, we will use our secret tactic and allow more false positives relative to false negatives. Our threshold will therefore be below 0.5, and we will use a cost matrix in a fictional scenario that I will create myself.

  1. Predict wine good - actually is good: -40
  2. Predict wine bad - actually is bad: -40
  3. Predict wine good- actually is bad: 10
  4. Predict wine bad - actually is good: 25

In other words: TP = -40, TN = -40, FN = 25, FP = 10.

So we get: Cost = 25*FN + 10*FP - 40*TP - 40*TN. Let me explain further: when we predict that a wine is bad but it is actually good, that does not necessarily mean it is good according to the user, just that it is good according to the parameters we fed into our model. Indeed, a user will not notice the difference between a wine rated 5.8 or 5.9 and one rated 6.1 or 6.2. We can profit from this margin by penalizing false negatives more heavily than false positives (25 vs 10). The concept I have in mind is hard to explain in words, but I hope to convince you of it in my presentation! A sketch of the threshold search follows.
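The plot and plot2 objects used in the next chunk are not defined anywhere in this file; they were presumably built from a cost-versus-threshold computation along these lines (a sketch under that assumption, reusing the cost matrix above and the tuned random forest's predicted probabilities):

probs = predict(cvrfcat_off, off_test.x, type = "prob")[, "1"]
thresholds = seq(0.05, 0.95, by = 0.01)
cost = sapply(thresholds, function(t) {
  pred = ifelse(probs >= t, 1, 0)
  TP = sum(pred == 1 & off_test.y == "1"); TN = sum(pred == 0 & off_test.y == "0")
  FP = sum(pred == 1 & off_test.y == "0"); FN = sum(pred == 0 & off_test.y == "1")
  25*FN + 10*FP - 40*TP - 40*TN        # the cost function defined above
})
cost_curve = data.frame(threshold = thresholds, cost = cost)
# object names chosen to match the grid.arrange call below (note that "plot" masks base::plot)
plot  = ggplot(cost_curve, aes(threshold, cost)) + geom_line() + ggtitle("Total cost vs threshold")
plot2 = ggplot(cost_curve, aes(threshold, cost)) + geom_point() +
  geom_vline(xintercept = thresholds[which.min(cost)], color = "red") +
  ggtitle("Cost-minimizing threshold")
thresholds[which.min(cost)]   # the threshold we would actually deploy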

#Note: par(mfrow) has no effect on grid-based graphics; grid.arrange handles the layout itself
grid.arrange(plot, plot2, ncol=2)

We look at the threshold on the x-axis corresponding to the minimum cost on the y-axis: this will be our operating threshold.

I will not spell out blunt conclusions to my business problems here because they might seem absurd when read cold, but they will be much clearer once I explain them in person. I have provided all the methodologies and techniques used and will draw the big conclusions on the 3rd. I hope the project was enjoyable and pleasant to read! Looking forward to presenting it.