Coffee Quality Ratings

Coffee can be a huge productivity boost and I can’t imagine working through the day without it. This data set, originally uploaded from Coffee Quality Database, was re-posted to Kaggle and subsequently featured on TidyTuesday.

In it, coffee from different countries are awarded cup points, scored by panelists who sample the coffee and assess it based on a number of factors such as aroma, acidity, uniformity and sweetness. But do other factors, such as country, altitude and processing method, also affect coffee quality scores?

This blog post will set out to investigate the data with exploratory data analysis. Next, utilizing the tidymodels collection, I will create new predictors with feature engineering, and subsequently specify, tune, compare in-sample results based on RMSE for three popular machine learning models (LASSO, random forest and XGBoost). Variable importance of features will also be compared. Finally, the model that’s able to deliver the lowest out-of-sample RMSE when predicting coffee quality points will be selected.

Libraries

library(tidyverse)
library(tidymodels)

theme_set(theme_minimal())

Exploratory Data Analysis

coffee <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv')

dim(coffee)
## [1] 1339   43

While not big data, this data set has over 1.3k rows and 43 columns! This is the data dictionary, and full descriptions and examples can be found here.

Distribution of total_cup_points

Let’s first get a sense of the distribution of total_cup_points, which are the rating points given to each cup of coffee on a scale of 0-100.

coffee %>% 
  ggplot(aes(total_cup_points)) +
  geom_histogram(binwidth = 1, fill = "#00AFBB", color="#e9ecef", alpha=0.6) +
  labs(
    x = "Total_cup_points",
    y = "Count",
    title = "Analyzing Distribution of Coffee Quality Points: Histogram",
    subtitle = "Majority of scores are clustered between 80-90,\nwith somesignificant outliers - potentially data errors",
    caption = "Source: Coffee Quality Database"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 20),
    plot.subtitle = element_text(size = 17)
  ) 

coffee %>% 
  arrange(total_cup_points) %>% 
  head()
## # A tibble: 6 x 43
##   total_cup_points species owner country_of_orig~ farm_name lot_number mill 
##              <dbl> <chr>   <chr> <chr>            <chr>     <chr>      <chr>
## 1              0   Arabica bism~ Honduras         los hica~ 103        cigr~
## 2             59.8 Arabica juan~ Guatemala        finca el~ <NA>       bene~
## 3             63.1 Arabica expo~ Nicaragua        finca la~ 017-053-0~ bene~
## 4             67.9 Arabica myri~ Haiti            200 farms <NA>       coeb~
## 5             68.3 Arabica juan~ Mexico           el cente~ <NA>       la e~
## 6             69.2 Arabica cade~ Honduras         cerro bu~ <NA>       cade~
## # ... with 36 more variables: ico_number <chr>, company <chr>, altitude <chr>,
## #   region <chr>, producer <chr>, number_of_bags <dbl>, bag_weight <chr>,
## #   in_country_partner <chr>, harvest_year <chr>, grading_date <chr>,
## #   owner_1 <chr>, variety <chr>, processing_method <chr>, aroma <dbl>,
## #   flavor <dbl>, aftertaste <dbl>, acidity <dbl>, body <dbl>, balance <dbl>,
## #   uniformity <dbl>, clean_cup <dbl>, sweetness <dbl>, cupper_points <dbl>,
## #   moisture <dbl>, category_one_defects <dbl>, quakers <dbl>, color <chr>,
## #   category_two_defects <dbl>, expiration <chr>, certification_body <chr>,
## #   certification_address <chr>, certification_contact <chr>,
## #   unit_of_measurement <chr>, altitude_low_meters <dbl>,
## #   altitude_high_meters <dbl>, altitude_mean_meters <dbl>

There’s one outlier with zero total_cup_points - probably a data entry error. Let’s remove that. At the same time, there does not appear to be a unique identifier for each coffee, so let’s add that.

coffee <- coffee %>% 
  filter(total_cup_points > 0) %>% 
  mutate(id = row_number()) %>% 
  select(id, everything())

coffee
## # A tibble: 1,338 x 44
##       id total_cup_points species owner country_of_orig~ farm_name lot_number
##    <int>            <dbl> <chr>   <chr> <chr>            <chr>     <chr>     
##  1     1             90.6 Arabica meta~ Ethiopia         "metad p~ <NA>      
##  2     2             89.9 Arabica meta~ Ethiopia         "metad p~ <NA>      
##  3     3             89.8 Arabica grou~ Guatemala        "san mar~ <NA>      
##  4     4             89   Arabica yidn~ Ethiopia         "yidneka~ <NA>      
##  5     5             88.8 Arabica meta~ Ethiopia         "metad p~ <NA>      
##  6     6             88.8 Arabica ji-a~ Brazil            <NA>     <NA>      
##  7     7             88.8 Arabica hugo~ Peru              <NA>     <NA>      
##  8     8             88.7 Arabica ethi~ Ethiopia         "aolme"   <NA>      
##  9     9             88.4 Arabica ethi~ Ethiopia         "aolme"   <NA>      
## 10    10             88.2 Arabica diam~ Ethiopia         "tulla c~ <NA>      
## # ... with 1,328 more rows, and 37 more variables: mill <chr>,
## #   ico_number <chr>, company <chr>, altitude <chr>, region <chr>,
## #   producer <chr>, number_of_bags <dbl>, bag_weight <chr>,
## #   in_country_partner <chr>, harvest_year <chr>, grading_date <chr>,
## #   owner_1 <chr>, variety <chr>, processing_method <chr>, aroma <dbl>,
## #   flavor <dbl>, aftertaste <dbl>, acidity <dbl>, body <dbl>, balance <dbl>,
## #   uniformity <dbl>, clean_cup <dbl>, sweetness <dbl>, cupper_points <dbl>,
## #   moisture <dbl>, category_one_defects <dbl>, quakers <dbl>, color <chr>,
## #   category_two_defects <dbl>, expiration <chr>, certification_body <chr>,
## #   certification_address <chr>, certification_contact <chr>,
## #   unit_of_measurement <chr>, altitude_low_meters <dbl>,
## #   altitude_high_meters <dbl>, altitude_mean_meters <dbl>

Investigating missingness

How much missing data is in this data set?

coffee %>% 
  skimr::skim() %>% 
  select(skim_variable, complete_rate) %>% 
  arrange(complete_rate)
## # A tibble: 44 x 2
##    skim_variable        complete_rate
##    <chr>                        <dbl>
##  1 lot_number                   0.206
##  2 farm_name                    0.732
##  3 mill                         0.765
##  4 producer                     0.827
##  5 altitude_low_meters          0.828
##  6 altitude_high_meters         0.828
##  7 altitude_mean_meters         0.828
##  8 altitude                     0.831
##  9 variety                      0.831
## 10 color                        0.837
## # ... with 34 more rows

Most of the columns have more than 80% completeness. I’ll filter for columns with more than 70% completeness, and map() a count() across all columns to let me further investigate columns that could be of interest.

In my EDA I do this for all columns but, for the sake of brevity, I’ll only select a few columns to illustrate the output.

coffee %>%
  select(owner_1:processing_method) %>%
  map(~ count(data.frame(x = .x), x, sort = TRUE)) %>%
  map(~ head(., n = 10))
## $owner_1
##                                                                     x   n
## 1                                           Juan Luis Alvarado Romero 155
## 2                                                  Racafe & Cia S.C.A  60
## 3                                      Exportadora de Cafe Condor S.A  54
## 4                                    Kona Pacific Farmers Cooperative  52
## 5                                                     Ipanema Coffees  50
## 6  CQI Taiwan ICP CQI<U+53F0><U+7063><U+5408><U+4F5C><U+5925><U+4F34>  46
## 7                         Lin, Che-Hao Krude <U+6797><U+54F2><U+8C6A>  29
## 8                                                            NUCOFFEE  29
## 9                                                     CARCAFE LTDA CI  27
## 10                                             The Coffee Source Inc.  23
## 
## $variety
##                 x   n
## 1         Caturra 255
## 2         Bourbon 226
## 3            <NA> 226
## 4          Typica 211
## 5           Other 110
## 6          Catuai  74
## 7   Hawaiian Kona  44
## 8  Yellow Bourbon  35
## 9      Mundo Novo  33
## 10        Catimor  20
## 
## $processing_method
##                           x   n
## 1              Washed / Wet 815
## 2             Natural / Dry 258
## 3                      <NA> 169
## 4 Semi-washed / Semi-pulped  56
## 5                     Other  26
## 6    Pulped natural / honey  14
#what I actually do
cols <- coffee %>%
  skimr::skim() %>%
  select(skim_variable, complete_rate) %>%
  arrange(complete_rate) %>%
  filter(complete_rate > 0.7) %>%
  pull(skim_variable)

#what I actually do
coffee %>%
  select(cols) %>%
  map( ~ count(data.frame(x = .x), x, sort = TRUE)) %>%
  map( ~ head(., n = 10))

Let’s dig further into the data before finalizing the columns.

How are total_cup_points calculated?

From this page in the Coffee Institute’s data base, it appears that total_cup_points is the sum of columns aroma to cupper_points.

coffee %>% 
  select(total_cup_points, aroma:cupper_points)
## # A tibble: 1,338 x 11
##    total_cup_points aroma flavor aftertaste acidity  body balance uniformity
##               <dbl> <dbl>  <dbl>      <dbl>   <dbl> <dbl>   <dbl>      <dbl>
##  1             90.6  8.67   8.83       8.67    8.75  8.5     8.42      10   
##  2             89.9  8.75   8.67       8.5     8.58  8.42    8.42      10   
##  3             89.8  8.42   8.5        8.42    8.42  8.33    8.42      10   
##  4             89    8.17   8.58       8.42    8.42  8.5     8.25      10   
##  5             88.8  8.25   8.5        8.25    8.5   8.42    8.33      10   
##  6             88.8  8.58   8.42       8.42    8.5   8.25    8.33      10   
##  7             88.8  8.42   8.5        8.33    8.5   8.25    8.25      10   
##  8             88.7  8.25   8.33       8.5     8.42  8.33    8.5       10   
##  9             88.4  8.67   8.67       8.58    8.42  8.33    8.42       9.33
## 10             88.2  8.08   8.58       8.5     8.5   7.67    8.42      10   
## # ... with 1,328 more rows, and 3 more variables: clean_cup <dbl>,
## #   sweetness <dbl>, cupper_points <dbl>

The code below verifies just that.

set.seed(123)

coffee %>% 
  group_by(id) %>% 
  mutate(sum = sum(across(aroma:cupper_points))) %>% 
  select(id, total_cup_points, sum) %>% 
  ungroup() %>% 
  slice_sample(1:length(id), n = 5)
## # A tibble: 5 x 3
##      id total_cup_points   sum
##   <int>            <dbl> <dbl>
## 1   415             83.3  83.3
## 2   463             83.2  83.2
## 3   179             84.4  84.4
## 4   526             83    83.0
## 5   195             84.2  84.2

Correlation of total_cup_points

library(GGally)

coffee %>% 
  select(total_cup_points, aroma:cupper_points) %>% 
  ggcorr()

As expected, total_cup_points showcases positive correlation with all of underlying scores, but less so for uniformity, clean_cup and sweetness.

Replacing missingness in altitude data

From some googling,

High altitudes are considered ideal for growing the coffee plant, with cooler temperatures delaying the growth cycle. This allows the bean to go through a longer maturation process, thus creating a much fuller, richer, and more pronounced flavour.

If that is indeed the case, we can expect altitude to be a significant predictor. However, upon inspecting the completeness of the altitude-related columns, around 20% of the data is missing.

Coffee at different altitudes

coffee %>% 
  skimr::skim() %>% 
  select(skim_variable, complete_rate) %>% 
  filter(skim_variable %in% c("altitude", "altitude_low_meters", "altitude_high_meters", "altitude_mean_meters"))
## # A tibble: 4 x 2
##   skim_variable        complete_rate
##   <chr>                        <dbl>
## 1 altitude                     0.831
## 2 altitude_low_meters          0.828
## 3 altitude_high_meters         0.828
## 4 altitude_mean_meters         0.828

Before trying to replace these missing values, let’s ask some questions:

Relationship between altitude and altitude_mean_meters?

set.seed(123)

coffee %>% 
  select(altitude, altitude_mean_meters) %>% 
  slice_sample(1:length(altitude), n = 20)
## # A tibble: 20 x 2
##    altitude              altitude_mean_meters
##    <chr>                                <dbl>
##  1 <NA>                                   NA 
##  2 1800                                 1800 
##  3 <NA>                                   NA 
##  4 1600 - 1950 msnm                     1775 
##  5 1500                                 1500 
##  6 1400 msnm                            1400 
##  7 1250                                 1250 
##  8 750m                                  750 
##  9 4300                                 1311.
## 10 <NA>                                   NA 
## 11 <NA>                                   NA 
## 12 1750 msnm                            1750 
## 13 1200                                 1200 
## 14 934                                   934 
## 15 800++                                 800 
## 16 1100                                 1100 
## 17 1130                                 1130 
## 18 de 1.600 a 1.950 msnm                1775 
## 19 1100                                 1100 
## 20 1                                       1

It looks like altitude_mean_meters is a clean version of altitude - so I’ll focus on using `altitude_mean_meters for now.

Missingness of other altitude columns

When altitude_mean_meters is missing, are altitude_low_meters and altitude_high_meters missing too?

coffee %>% 
  filter(is.na(altitude_mean_meters)) %>% 
  select(contains("meters")) %>% 
  #checking for NAs
  summarise(not_na = 
              sum(!is.na(
                across(everything())
                )))
## # A tibble: 1 x 1
##   not_na
##    <int>
## 1      0

Yes, we can expect altitude_low_meters and altitude_high_meters to show missing values when altitude_mean_metersis missing. I was hoping to use the former two columns to replace missing values in altitude_mean_meters.

Standardizing altitude measurements to meters

After converting all altitude measurements made in foot to meters, are there any inconsistencies? (1 ft = 0.3048 meters)

outlier <- coffee %>% 
  mutate(meters = case_when(
    str_detect(unit_of_measurement, "ft") ~ altitude_mean_meters * 0.3048,
    TRUE ~ altitude_mean_meters),
    country_of_origin = fct_lump(country_of_origin, 4)) %>%
  filter(!is.na(meters)) %>% 
  filter(meters > 8000) %>% 
  pull(id)

library(fishualize)
library(ggforce)

coffee %>%
  mutate(
    meters = case_when(
      str_detect(unit_of_measurement, "ft") ~ altitude_mean_meters * 0.3048,
      TRUE ~ altitude_mean_meters
    ),
    country_of_origin = fct_lump(country_of_origin, 4)
  ) %>%
  filter(!is.na(country_of_origin)) %>%
  ggplot(aes(total_cup_points, meters)) +
  geom_point(aes(colour = country_of_origin),
             size = 1.5, alpha = 0.9) +
  geom_mark_ellipse(aes(
    filter = id %in% outlier)) +
  scale_colour_fish_d(option = "Etheostoma_spectabile") +
  scale_y_log10(labels = comma) +
  labs(
    x = "Total_cup_points",
    y = "Meters (log scale)",
    colour = "Country of Origin",
    title = "Plotting Altitude (meters) against Coffee Quality Points",
    subtitle = "Outliers are circled and are likely data entry errors",
    caption = "Source: Coffee Quality Database"
  ) +
  theme(plot.title = element_text(face = "bold", size = 20),
        plot.subtitle = element_text(size = 17)) 

As a sanity check, the highest mountain in the world is Mount Everest with its peak at 8,848 meters. Clearly there are some data entry errors recording over 100,000 meters in altitude. Let’s exclude any altitude records above 8000m.

coffee <- coffee %>%
  mutate(meters = case_when(
    str_detect(unit_of_measurement, "ft") ~ altitude_mean_meters * 0.3048,
    TRUE ~ altitude_mean_meters
  )) %>%
  #explicitly keep NAs because missing values will be replaced later
  filter(is.na(meters) | meters <= 8000) %>%
  select(-altitude, -altitude_low_meters, -altitude_high_meters, -altitude_mean_meters)

Replacing NAs

Now that all altitude measurements are standardized in the meters column, we can begin to replace NAs.

sum(is.na(coffee$meters))
## [1] 230
coffee %>% 
  filter(is.na(meters)) %>% 
  count(country_of_origin, region, sort = TRUE)
## # A tibble: 67 x 3
##    country_of_origin      region                     n
##    <chr>                  <chr>                  <int>
##  1 United States (Hawaii) kona                      64
##  2 Colombia               huila                     18
##  3 Guatemala              oriente                   14
##  4 Colombia               <NA>                       9
##  5 Thailand               <NA>                       9
##  6 United States (Hawaii) <NA>                       7
##  7 Brazil                 monte carmelo              6
##  8 Ethiopia               sidamo                     5
##  9 Brazil                 campos altos - cerrado     4
## 10 Brazil                 cerrado                    4
## # ... with 57 more rows
coffee %>% 
  filter(country_of_origin == "United States (Hawaii)",
         is.na(meters)) %>% 
  count(country_of_origin, region, sort = T)
## # A tibble: 2 x 3
##   country_of_origin      region     n
##   <chr>                  <chr>  <int>
## 1 United States (Hawaii) kona      64
## 2 United States (Hawaii) <NA>       7

In total we have 230 NAs for meters, of which Hawaii accounts for 71 or 31%.

coffee %>% 
  select(id, country_of_origin, region, meters) %>% 
  filter(str_detect(country_of_origin, "(Hawaii)")) %>% 
  na.omit()
## # A tibble: 2 x 4
##      id country_of_origin      region meters
##   <int> <chr>                  <chr>   <dbl>
## 1    14 United States (Hawaii) kona     186.
## 2    52 United States (Hawaii) kona     130.

Unfortunately, we only have two values for Hawaii. Verifying it with some googling, our two points of data look about right:

The Kona growing region is about 2 miles long and ranges in altitude from 600 ft. (183m) to 2500 ft (762m).

I’ll replace all NAs related to Hawaii with the mean of our existing data points. The function coalesce fills the NA from the first vector with values from the second vector at corresponding positions.

hawaii <- coffee %>% 
  filter(str_detect(country_of_origin, "(Hawaii)")) %>% 
  select(id, meters) %>% 
  mutate(meters = replace_na(meters, (186+130)/2))

coffee_refined <- coffee %>% 
  left_join(hawaii, by = "id") %>%
  mutate(meters = coalesce(meters.x, meters.y)) %>%
  select(-meters.x, -meters.y)
coffee_refined %>% 
  filter(is.na(meters)) %>% 
  count(country_of_origin, region, sort = TRUE)
## # A tibble: 65 x 3
##    country_of_origin region                     n
##    <chr>             <chr>                  <int>
##  1 Colombia          huila                     18
##  2 Guatemala         oriente                   14
##  3 Colombia          <NA>                       9
##  4 Thailand          <NA>                       9
##  5 Brazil            monte carmelo              6
##  6 Ethiopia          sidamo                     5
##  7 Brazil            campos altos - cerrado     4
##  8 Brazil            cerrado                    4
##  9 Brazil            <NA>                       4
## 10 Ethiopia          yirgacheffe                4
## # ... with 55 more rows

Let’s do the same for Huila (Colombia) and Oriente (Guatemala).

huila <- coffee %>% 
  filter(str_detect(region, "huila")) %>% 
  select(id, meters) %>% 
  mutate(meters = replace_na(meters, mean(meters, na.rm = TRUE)))

coffee_refined <- coffee_refined %>% 
  left_join(huila, by = "id") %>%
  mutate(meters = coalesce(meters.x, meters.y)) %>%
  select(-meters.x, -meters.y)

oriente <- coffee %>% 
  filter(str_detect(region, "oriente")) %>% 
  select(id, meters) %>% 
  mutate(meters = replace_na(meters, mean(meters, na.rm = TRUE)))

coffee_refined <- coffee_refined %>% 
  left_join(oriente, by = "id") %>%
  mutate(meters = coalesce(meters.x, meters.y)) %>%
  select(-meters.x, -meters.y)

We’ll replace the remaining missing values shortly using the recipes package during our feature pre-processing stage.

Analyzing variety missingness

variety appears to be another interesting column …

coffee_refined %>% 
  count(variety, sort = T)
## # A tibble: 29 x 2
##    variety            n
##    <chr>          <int>
##  1 Caturra          255
##  2 <NA>             226
##  3 Bourbon          224
##  4 Typica           211
##  5 Other            109
##  6 Catuai            74
##  7 Hawaiian Kona     44
##  8 Yellow Bourbon    35
##  9 Mundo Novo        33
## 10 Catimor           20
## # ... with 19 more rows

… but missing values (NAs) are constitute nearly 17% of the data.

coffee_refined %>% 
  skimr::skim() %>% 
  select(skim_variable, complete_rate) %>% 
  filter(skim_variable == "variety")
## # A tibble: 1 x 2
##   skim_variable complete_rate
##   <chr>                 <dbl>
## 1 variety               0.831

Let’s visualize the missingness of data in variety. Specifically, is there a relationship between country_of_origin and missing data in variety?

coffee_refined %>% 
  group_by(country_of_origin) %>% 
  filter(sum(is.na(variety)) > 10) %>% 
  ungroup() %>% 
  ggplot(aes(total_cup_points, meters, colour = is.na(variety))) +
  geom_point(size = 1.5) +
  scale_colour_fish_d(option = "Etheostoma_spectabile") +
  facet_wrap(~ country_of_origin) +
  labs(
    x = "Total_cup_points",
    y = "Meters",
    colour = "Is Variety Missing?",
    title = "Which countries contain the highest amount of missing data in Variety?",
    subtitle = "Filtering for countries with more than 10 missing variety-related data entries",
    caption = "Source: Coffee Quality Database"
  ) +
  theme(plot.title = element_text(face = "bold", size = 20),
        plot.subtitle = element_text(size = 17)) 

Most of the missing variety data are from a select few countries - Colombia, Ethiopia, India, Thailand and Uganda. The missing values for variety will also be addressed later on using recipes.

Visualizing highest scores across countries

library(ggridges)

country_freq <- coffee_refined %>% 
  group_by(country_of_origin) %>% 
  add_count(name = "total_entry") %>% 
  ungroup() %>% 
  mutate(freq = total_entry / sum(n())) %>% 
  distinct(country_of_origin, .keep_all = TRUE) %>% 
  select(country_of_origin, freq) %>% 
  arrange(desc(freq)) %>% 
  slice(1:10) %>% 
  pull(country_of_origin)

coffee_refined %>% 
  filter(country_of_origin %in% country_freq,
         total_cup_points > 75) %>% 
  mutate(country_of_origin = fct_reorder(country_of_origin, total_cup_points)) %>% 
  ggplot(aes(total_cup_points, country_of_origin, fill = country_of_origin)) +
  geom_density_ridges(scale = 1, alpha = 0.8, show.legend = F) +
  scale_fill_fish_d(option = "Antennarius_multiocellatus", begin = 0.5, end = 0) +
  theme_ridges(center_axis_labels = TRUE) +
  labs(
    x = "Total Cup Points",
    y = NULL,
    fill = "",
    title = "Visualizing Distribution of Coffee Ratings Across Countries",
    subtitle = "Below: Top 10 countries based on absolute # of coffee ratings given,\nsorted according to score",
    caption = "Source: Coffee Quality Database"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 20),
    plot.subtitle = element_text(size = 17),
    strip.background = element_blank(),
    strip.text = element_text(face = "bold", size = 15),
    legend.title = element_text(face = "bold", size = 15)
  ) 

Finalizing the dataset

That was quite a bit of EDA just to replace missing values! Finalizing the data set, we have included:

  • Unique identifier: id
  • Outcome (what we want to predict): total_cup_points
  • Predictors: country_of_origin, in_country_partner, certification_body, variety, processing_method, meters

Recall meters is an engineered feature that was earlier created from various altitude column.

coffee_df <- coffee_refined %>% 
    select(id, total_cup_points, 
           country_of_origin, 
           in_country_partner, certification_body, variety, processing_method,
           #aroma:cupper_points, 
           meters) %>% 
  #converting all character columns to factors
  mutate(across(where(is.character), as.factor))

Excluding aroma:cupper_points

Importantly, you might have realized I have chosen not to include columns aroma:cupper_points; reason being because our outcome is the sum of these columns, and using these predictors will lead to all three models ignoring all other predictors that don’t below in these columns i.e. variable importance for these columns are overwhelmingly high.

Thus, the intention was that by excluding these columns, I wanted to make it more challenging for the models to predict the outcome - somewhat akin to the complexity faced in real-life predictive modeling.

Modeling

The objective is to run three models: LASSO, Random Forest and XGboost, and compare performance in predicting total_cup_points.

Initial Split

set.seed(2020)
coffee_split <- initial_split(coffee_df)
coffee_train <- training(coffee_split)
coffee_test <- testing(coffee_split)

Resampling

All three models will undergo hyperparameter tuning using crossfold validation. Here, I opt to use 10-fold cross validation.

set.seed(2020)
folds <- vfold_cv(coffee_train, v = 10)

LASSO Model

Model Specification

Here I’m specifying the LASSO model I intend to fit. Hyperparameters tagged to tune() will be subsequently tuned using a grid search.

model_lasso <- linear_reg(penalty = tune(), mixture = 1) %>% 
  set_engine("glmnet") %>% 
  set_mode("regression") 

Feature Preprocessing

I came across this great illustration by Allison Horst describing the recipes package: recipes

Just a quick check on which columns have missing data:

coffee_train %>% 
  map_df(~ sum(is.na(.))) %>% 
  t()
##                    [,1]
## id                    0
## total_cup_points      0
## country_of_origin     0
## in_country_partner    0
## certification_body    0
## variety             165
## processing_method   128
## meters               91

Here we specify the recipe:

  • step_other: Collapse factors into “other” if they don’t meet a predefined threshold
  • step_dummy: Turns nominal (character/factor) columns into numeric binary data. Necessary because the LASSO can only process numeric data
  • step_knnimpute: Imputes the remainder of missing values using knn (default is 5). Here, after imputing missing values of meters, I used it to impute missing values of variety, and subsequently used both for imputing missing values of processing_method
  • step_normalize: Normalizes numeric data to have a standard deviation of one and a mean of zero. Necessary since the LASSO is sensitive to outliers
coffee_rec <- coffee_train %>%
  recipe(total_cup_points ~ .) %>%
  update_role(id, new_role = "id") %>%
  step_other(
    country_of_origin,
    in_country_partner,
    certification_body,
    variety,
    processing_method,
    threshold = 0.02,
    other = "uncommon"
  ) %>%
  step_dummy(all_nominal(), -variety, -processing_method) %>%
  step_knnimpute(meters,
                 impute_with = imp_vars(contains(c(
                   "country", "certification"
                 )))) %>%
  step_knnimpute(variety,
                 impute_with = imp_vars(contains(c(
                   "country", "certification", "meters"
                 )))) %>%
  step_knnimpute(processing_method,
                 impute_with = imp_vars(contains(
                   c("country", "certification", "meters", "variety")
                 ))) %>%
  step_dummy(variety, processing_method) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_knnimpute(all_predictors())

prep() estimates the required parameters from the training set, and juice() applies these parameters on the training data and returns us the data in a tibble. The code below indicates that there are no more missing values after our pre-processing.

coffee_rec %>% 
  prep() %>% 
  juice() %>%
  summarise(is_na = sum(is.na(across(everything())))) 
## # A tibble: 1 x 1
##   is_na
##   <int>
## 1     0

Workflows

The workflows package introduces workflow objects that can help manage modeling pipelines more easily - akin to pieces that fit together like Lego blocks.

lasso_wf <- workflow() %>%
  add_recipe(coffee_rec) %>%
  add_model(model_lasso)

Hyperparameter Tuning

I’m setting up three respective grids for our three models. First up - for the LASSO model, I’ll be using grid_random to generate a random grid.

set.seed(2020)
lasso_grid <- grid_random(penalty(), size = 50)

Once parallel processing has been set up, the tuning can now commence!

all_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)

set.seed(2020)

lasso_res <- tune_grid(
  object = lasso_wf,
  resamples = folds,
  grid = lasso_grid,
  control = control_grid(save_pred = TRUE)
  )

Training Performance Assessment

The results can be obtained with collect_metrics() and subsequently visualized; and the best tuned hyperparameters associated with the lowest in-sample RMSE can be obtained with show_best().

lasso_res %>%
  collect_metrics() %>%
  ggplot(aes(penalty, mean)) +
  geom_line(aes(color = .metric), size = 1.5, show.legend = FALSE) +
  facet_wrap(. ~ .metric, nrow = 2) +
  scale_x_log10(label = scales::number_format()) +
  labs(
    x = "Penalty",
    y = "RMSE",
    title = "LASSO: Assessing In-Sample Performance of Tuned Hyperparameters",
    subtitle = "RMSE appears to be minimized when penalty is <0.01"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 20),
    plot.subtitle = element_text(size = 17),
    strip.background = element_blank(),
    strip.text = element_text(face = "bold", size = 15),
  ) 

lasso_res %>% 
  show_best("rmse")
## # A tibble: 5 x 7
##   penalty .metric .estimator  mean     n std_err .config
##     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <fct>  
## 1 0.0153  rmse    standard    2.48    10   0.135 Model39
## 2 0.0183  rmse    standard    2.48    10   0.135 Model40
## 3 0.00954 rmse    standard    2.48    10   0.136 Model38
## 4 0.00441 rmse    standard    2.49    10   0.136 Model37
## 5 0.0436  rmse    standard    2.49    10   0.134 Model41

Finalizing Hyperparameters

Let’s use select_best() to obtain the optimal penalty hyperparameters that minimizes RMSE, and finalize_workflow() is used to fit the optimal hyperparameters to the LASSO model and the training data.

lasso_best <- lasso_res %>% 
  select_best("rmse")

lasso_final_wf <- lasso_wf %>% 
  finalize_workflow(lasso_best)

Random Forest & XGBoost Models

Just like what was done for the LASSO model, the random forest & XGBoost models can be specificed the same way. The process is also similar when creating respective workflows, tuning hyperparameters, selecting the hyperparameters that corresponds to the lowest RMSE, and finalizing the workflow.

Model Specification

Let’s set trees = 1000 for both random forest and XGBoost, and tune the remaining hyperparameters.

model_rf <- rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_engine("ranger", importance = "permutation") %>% 
  set_mode("regression")

model_xgboost <- boost_tree(
  trees = 1000,
  #model complexity
  tree_depth = tune(), min_n = tune(), loss_reduction = tune(),
  #randomness
  sample_size = tune(), mtry = tune(),
  #step size
  learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

Workflows

The beauty of tidymodels is that we can conveniently re-use the same preprocessing recipe, coffee_rec, in conjunction with the random forest and XGBoost model workflows.

rf_wf <- workflow() %>%
  add_recipe(coffee_rec) %>%
  add_model(model_rf)

xgb_wf <- workflow() %>%
  add_recipe(coffee_rec) %>%
  add_model(model_xgboost)

Hyperparameter Tuning

Similar to the LASSO, a grid_random will be used for the random forest model. Note that finalize() was used, together with our training set, to determine the upper-bound for our mtry() hyperparameter (representing number of predictors that will be randomly sampled at each split when creating the tree models).

For the XGBoost model, however, we are using a space-filling latin hypercube grid that employs a statistical method for generating a near-random sample of parameter values from a multidimensional distribution.

set.seed(2020)
rf_grid <- grid_random(finalize(mtry(), coffee_train),
                       min_n(), size = 50) 

set.seed(2020)
xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), coffee_train),
  learn_rate(),
  size = 50
)

With the grids set up, now we can tune both models’ hyperparameters.

all_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)

set.seed(2020)
rf_res <- tune_grid(
  object = rf_wf,
  resamples = folds,
  grid = rf_grid,
  control = control_grid(save_pred = TRUE)
  )

set.seed(2020)
xgb_res <- tune_grid(
  object = xgb_wf,
  resamples = folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
  )

Training Performance Assessment

We can also visually assess the performance across all tuned hyperparameters and their effect on RMSE for the random forest model

rf_res %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>% 
  pivot_longer(mtry:min_n, 
               names_to = "parameter",
               values_to = "value") %>% 
  ggplot(aes(value, mean)) +
  geom_point(aes(color = parameter), size = 2, show.legend = FALSE) +
  facet_wrap(. ~ parameter, scales = "free_x") +
  labs(x = "", 
       y = "RMSE",
       title = "Random Forest: Assessing In-Sample Performance of Tuned Hyperparameters",
       subtitle = "RMSE appears to be minimized at low levels of min_n and mtry") +
  theme(
    plot.title = element_text(face = "bold", size = 20),
    plot.subtitle = element_text(size = 17),
    strip.background = element_blank(),
    strip.text = element_text(face = "bold", size = 15),
  ) 

rf_res %>% 
  show_best("rmse")
## # A tibble: 5 x 8
##    mtry min_n .metric .estimator  mean     n std_err .config
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <fct>  
## 1     6    18 rmse    standard    2.45    10   0.141 Model45
## 2     5    14 rmse    standard    2.46    10   0.140 Model25
## 3     6    25 rmse    standard    2.46    10   0.140 Model49
## 4     7    17 rmse    standard    2.46    10   0.139 Model36
## 5     6    24 rmse    standard    2.46    10   0.139 Model30

Let’s do the same for the XGBoost model.

xgb_res %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  select(mean, mtry:sample_size) %>%
  pivot_longer(mtry:sample_size,
               names_to = "parameter",
               values_to = "value") %>%
  ggplot(aes(value, mean)) +
  geom_point(aes(color = parameter),
             size = 2,
             show.legend = FALSE) +
  facet_wrap(. ~ parameter, scales = "free_x") +
  labs(
    x = "",
    y = "RMSE",
    title = "XGBoost: Assessing In-Sample Performance of Tuned Hyperparameters",
    subtitle = "Several combinations of parameters do well to minimize RMSE"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 20),
    plot.subtitle = element_text(size = 17),
    strip.background = element_blank(),
    strip.text = element_text(face = "bold", size = 15),
  ) 

xgb_res %>% 
  show_best("rmse")
## # A tibble: 5 x 12
##    mtry min_n tree_depth learn_rate loss_reduction sample_size .metric
##   <int> <int>      <int>      <dbl>          <dbl>       <dbl> <chr>  
## 1     4     6         14    0.0225        4.08e- 1       0.797 rmse   
## 2     5    11          3    0.0810        3.43e- 4       0.948 rmse   
## 3     8    28          2    0.00648       1.20e-10       0.804 rmse   
## 4     6    29         13    0.0336        2.95e+ 1       0.465 rmse   
## 5     5    13          8    0.0655        5.60e- 1       0.985 rmse   
## # ... with 5 more variables: .estimator <chr>, mean <dbl>, n <int>,
## #   std_err <dbl>, .config <fct>

Finalizing Hyperparameters

The workflows for both the random forest and XGBoost models are now finalized.

rf_best <- rf_res %>% 
  select_best("rmse")

xgb_best <- xgb_res %>% 
  select_best("rmse")

rf_final_wf <- rf_wf %>% 
  finalize_workflow(rf_best)

xgb_final_wf <- xgb_wf %>% 
  finalize_workflow(xgb_best)

Variable Importance

Before running the models on the test data, let’s compare variable importance for our models. It can be useful know which, if any, of the predictors in a fitted model are relatively influential on the predicted outcome.

library(vip) #variable importance plots
library(patchwork) #combining plots

p1 <- lasso_final_wf %>% 
  fit(coffee_train) %>% 
  pull_workflow_fit() %>% 
  vip(geom = "point") +
  ggtitle("LASSO") 

p2 <- rf_final_wf %>% 
  fit(coffee_train) %>% 
  pull_workflow_fit() %>% 
  vip(geom = "point") +
  ggtitle("Random Forest")

p3 <- xgb_final_wf %>% 
  fit(coffee_train) %>% 
  pull_workflow_fit() %>% 
  vip(geom = "point") +
  ggtitle("XGBoost")

p1 + p2 / p3 + plot_annotation(
  title = 'Assessing Variable Importance Across Models',
  subtitle = 'The engineered feature \"meters\" is heavily favoured by the tree-based models,\nwhile also ranking fourth for the LASSO in terms of variable importance',
  theme = theme(
    plot.title = element_text(size = 18),
    plot.subtitle = element_text(size = 14)
  )
) 

Earlier, Ethiopia stood out as the country with the highest mean scores when visualizing coffee rating scores across countries earlier. Here, it makes sense to see country_of_origin_Ethiopia having relatively high variable importance for the LASSO and XGBoost model. The engineered feature, meters, has a significant variable importance contribution too for our tree-based models.

Model Selection

As a recap, here are the corresponding RMSE values for the set of hyperparameters that was selected for our models.

lasso_res %>%
  show_best(metric = "rmse") %>%
  mutate(model = "lasso") %>%
  bind_rows(rf_res %>%
              show_best(metric = "rmse") %>%
              mutate(model = "randomforest")) %>%
  bind_rows(xgb_res %>%
              show_best(metric = "rmse") %>%
              mutate(model = "xgboost")) %>%
  group_by(model) %>%
  summarise(lowest_training_rmse = round(min(mean), 3))
## # A tibble: 3 x 2
##   model        lowest_training_rmse
##   <chr>                       <dbl>
## 1 lasso                        2.48
## 2 randomforest                 2.46
## 3 xgboost                      2.47

As the final step, for all three models, we perform a last_fit using the split data, coffee_split.

This seeks to emulates the process where, after determining the best model, the final fit on the entire training set is used to evaluate the test set, coffee_test (which has not been touched since the initial split).

lasso_final_wf %>%
  last_fit(coffee_split) %>%
  collect_metrics() %>%
  mutate(model = "lasso") %>%
  bind_rows(
    rf_final_wf %>%
      last_fit(coffee_split) %>%
      collect_metrics() %>%
      mutate(model = "randomforest")
  ) %>%
  bind_rows(
    xgb_final_wf %>%
      last_fit(coffee_split) %>%
      collect_metrics() %>%
      mutate(model = "xgboost")
  ) %>% 
  filter(.metric =="rmse")
## # A tibble: 3 x 4
##   .metric .estimator .estimate model       
##   <chr>   <chr>          <dbl> <chr>       
## 1 rmse    standard        2.35 lasso       
## 2 rmse    standard        2.30 randomforest
## 3 rmse    standard        2.33 xgboost

Conclusion

Test results for all three models are lower than their training scores, indicating the absence of overfitting. Both tree-based models also performed slightly better than the LASSO model, which could mean there are interaction effects at play. While I would select the random forest model in this case, the difference in test RMSE is so close that I’d still be inclined to compare all three models’ performance in future.