This will be my first attempt at machine learning using the tidymodels package, with a dataset taken from TidyTuesday. The code used to scrape this data can be found here.

The underlying study analyses data from the Food and Agriculture Organization of the United Nations (FAO) to determine the quantity of 11 food types supplied for consumption in each country studied. Using CO2 emissions data, the carbon footprint per capita is then calculated for each food type.

I heavily borrow intuition gleaned from this blog post by Julia Silge, who does an amazing job of detailing the intricacies of the tidymodels metapackage in her various blog posts.

library(tidyverse)

theme_set(theme_minimal())

food_consumption <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv")
food_consumption
## # A tibble: 1,430 x 4
##    country   food_category            consumption co2_emmission
##    <chr>     <chr>                          <dbl>         <dbl>
##  1 Argentina Pork                           10.5          37.2 
##  2 Argentina Poultry                        38.7          41.5 
##  3 Argentina Beef                           55.5        1712   
##  4 Argentina Lamb & Goat                     1.56         54.6 
##  5 Argentina Fish                            4.36          6.96
##  6 Argentina Eggs                           11.4          10.5 
##  7 Argentina Milk - inc. cheese            195.          278.  
##  8 Argentina Wheat and Wheat Products      103.           19.7 
##  9 Argentina Rice                            8.77         11.2 
## 10 Argentina Soybeans                        0             0   
## # ... with 1,420 more rows

From the Data Dictionary, these are some important definitions:

  • Consumption is measured in kg/person/year
  • CO2 emission is measured in kg CO2/person/year (the ratio of the two is sketched just below)
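As a quick back-of-the-envelope check of those units (my own aside, not part of the data dictionary), dividing emissions by consumption gives an implied intensity in kg CO2 per kg of food consumed. For Argentina's beef row above, that is roughly 1712 / 55.5 ≈ 30.8 kg CO2 per kg of beef:

food_consumption %>%
  # rows with zero consumption (e.g. Soybeans above) will return NaN
  mutate(co2_per_kg = co2_emmission / consumption)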

A preliminary skim shows that the dataset has 130 unique countries, each tagged with 11 different categories of food, and no missing values.

library(skimr)

skim(food_consumption)
Table 1: Data summary

Name                      food_consumption
Number of rows            1430
Number of columns         4
_______________________
Column type frequency:
  character               2
  numeric                 2
________________________
Group variables           None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
country                0              1    3   22      0       130           0
food_category          0              1    4   24      0        11           0

Variable type: numeric

skim_variable  n_missing  complete_rate   mean      sd  p0   p25    p50    p75     p100  hist
consumption            0              1  28.11   49.82   0  2.37   8.89  28.13   430.76  ▇▁▁▁▁
co2_emmission          0              1  74.38  152.10   0  5.21  16.53  62.60  1712.00  ▇▁▁▁▁

food_consumption %>% 
  count(food_category)
## # A tibble: 11 x 2
##    food_category                n
##    <chr>                    <int>
##  1 Beef                       130
##  2 Eggs                       130
##  3 Fish                       130
##  4 Lamb & Goat                130
##  5 Milk - inc. cheese         130
##  6 Nuts inc. Peanut Butter    130
##  7 Pork                       130
##  8 Poultry                    130
##  9 Rice                       130
## 10 Soybeans                   130
## 11 Wheat and Wheat Products   130

Which food category, on average, contributes the most to CO2 emissions?

food_consumption %>% 
  group_by(food_category) %>% 
  summarize(avg.co2 = mean(co2_emmission)) %>% 
  ggplot(aes(fct_reorder(food_category, avg.co2), avg.co2, fill = food_category)) +
  geom_col(show.legend = F) +
  coord_flip() +
  labs(
    x = "Food Categories",
    y = "Average CO2 Emissions"
  )

How about inspecting the median values and country outliers through a boxplot?

library(ggrepel) #ggplot identification of data points

food_consumption %>%
  group_by(food_category) %>%
  ggplot(aes(
    fct_reorder(food_category, co2_emmission),
    co2_emmission,
    fill = food_category
  )) +
  geom_boxplot(show.legend = F) +
  geom_text_repel(
    data = . %>%
      filter(food_category == "Beef") %>% 
      mutate(percentile = co2_emmission >= quantile(co2_emmission, 0.95, na.rm = T)) %>%
      filter(percentile == 1),
    aes(label = country)
  ) +
  coord_flip() +
  labs(
    x = "Food Categories",
    y = "CO2 Emissions"
  )

Filtering for Beef only, let’s take a closer look at the top 20 countries.

food_consumption %>% 
  filter(food_category == "Beef") %>%
  arrange(desc(consumption)) %>% 
  top_n(20, consumption) %>% 
  ggplot(aes(fct_reorder(country, consumption), consumption, fill = country)) +
  geom_col(show.legend = F) +
  coord_flip() +
  labs(
    x = "Country",
    y = "Beef Consumption"
  )

I did not expect Argentina to be on top of that list by such a significant margin. A bit of googling to fact-check confirms that Argentinians are indeed lovers of red meat. However, that per-person consumption also seems to be coming down, due to a combination of rampant inflation and growing health awareness of non-meat options. But I digress.

Quite a number of North and South American countries show up as heavy beef consumers. This gives me an idea: can I predict, using co2_emmission, whether a country belongs to the Americas or not?

I use the countrycode package to create a new column identifying which continent a country belongs to, and from that create a binary classification of whether or not the country belongs to the Americas.

library(countrycode)
library(janitor)

food <- food_consumption %>%
  select(-consumption) %>%
  pivot_wider(names_from = food_category,
              values_from = co2_emmission) %>%
  clean_names() %>%
  mutate(continent = countrycode(country,
                                 origin = "country.name",
                                 destination = "continent")) %>%
  mutate(americas = case_when(continent == "Americas" ~ "Americas",
                              TRUE ~ "Other")) %>%
  select(-country,-continent) %>%
  mutate_if(is.character, as_factor)

food %>% 
  select(americas, everything())
## # A tibble: 130 x 12
##    americas  pork poultry  beef lamb_goat   fish  eggs milk_inc_cheese
##    <fct>    <dbl>   <dbl> <dbl>     <dbl>  <dbl> <dbl>           <dbl>
##  1 Americas  37.2    41.5 1712       54.6   6.96 10.5             278.
##  2 Other     85.4    49.5 1045.     346.   28.2   7.82            334.
##  3 Other     38.5    14.2  694.     536.    6.15 11.4             433.
##  4 Other     76.8    28.9  412.     740.  119.    7.57            322.
##  5 Other     78.9    37.6  694.     662.   32.5   9.1             196.
##  6 Americas  97.8    53.7 1118.      15.1  19.7  13.4             363.
##  7 Americas  59.6    29.5  898.     288.   10.4  12.1             300.
##  8 Other    154.     23.0  922.      58.5  36.9  13.4             364.
##  9 Americas  44.6    48.3 1211.      21.7  16.0   8.25            213.
## 10 Other     36.7    19.7  721.     335.    8.32  7.62            410.
## # ... with 120 more rows, and 4 more variables: wheat_and_wheat_products <dbl>,
## #   rice <dbl>, soybeans <dbl>, nuts_inc_peanut_butter <dbl>

Given that all of our variables are numeric, a ggscatmat - a scatterplot matrix - can be used:

  • The diagonals represent density plots
  • The top half represents correlation coefficients
  • The bottom half represents scatterplots

library(GGally)

food %>% 
  ggscatmat(columns = 1:11, color = "americas", alpha = 0.7)

Nothing too out of the ordinary, although poultry and beef tend to contribute more CO2 in the Americas than in the rest of the world - very likely a function of higher consumption.

With that, let’s dive into modelling.

Data Splitting

We begin by splitting the dataset into training and test sets. I opt to use stratified random sampling, accomplished by selecting samples at random within each class. As Kuhn & Johnson (2019) put it, this approach ensures that the frequency distribution of the outcome is approximately equal within the training and test sets (a quick check of this follows right after the split).

library(tidymodels)
set.seed(2020)

food_split <- initial_split(food, strata = americas)
food_split
## <Training/Validation/Total>
## <99/31/130>
food_train <- training(food_split)
food_test <- testing(food_split)
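As a quick sanity check (my own addition, not in the original post), the class proportions in the training split should closely mirror those of the full dataset thanks to the stratification:

food %>% count(americas) %>% mutate(prop = n / sum(n))        # full data
food_train %>% count(americas) %>% mutate(prop = n / sum(n))  # training split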

Data Preprocessing

Next, we use the recipes package. The intuition behind this package is to define a recipe or blueprint that sequentially specifies the encodings and preprocessing of the data (i.e. feature engineering).

food_rec <- recipe(americas ~ ., data = food_train) %>%
  step_corr(all_numeric()) %>%
  prep()

food_rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         11
## 
## Training data contained 99 data points and no missing data.
## 
## Operations:
## 
## Correlation filter removed no terms [trained]

step_corr potentially removes variables that have large absolute correlations with other variables. Even though the correlation filter was applied, no terms were removed, implying that no pair of predictors exceeded the default correlation threshold of 0.9.
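To see what the filter was evaluating (an optional check of my own, not part of the original workflow), we can compute the pairwise correlations on the training data directly:

food_train %>%
  select(-americas) %>% # only the numeric predictors that step_corr() considers
  cor() %>%
  round(2)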

Model Specification

Let’s begin with the specification of two models - Logistic Regression and Random Forest.

log_spec <- logistic_reg(mode = "classification") %>%
  set_engine("glm")

rf_spec <- rand_forest(mode = "classification") %>%
  set_engine("ranger")

Now we fit (train) the models with a juice-d version of the data. juice essentially returns the training data with the preprocessing steps defined in the (prepped) recipe already applied.
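(A side note of my own: juice extracts the already-processed training data from the prepped recipe, while bake applies the same steps to new data, which is how the test set will be handled later.)

juice(food_rec)   # the processed training data used for fitting below
# bake(food_rec, new_data = food_test) would apply the same steps to new data (used for the test set later)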

log_fit <- log_spec %>%
  fit(americas ~ .,
      data = juice(food_rec))
log_fit
## parsnip model object
## 
## Fit time:  0ms 
## 
## Call:  stats::glm(formula = formula, family = stats::binomial, data = data)
## 
## Coefficients:
##              (Intercept)                      pork                   poultry  
##                -0.367402                  0.010330                 -0.127695  
##                     beef                 lamb_goat                      fish  
##                -0.002816                  0.006490                  0.074879  
##                     eggs           milk_inc_cheese  wheat_and_wheat_products  
##                -0.210197                 -0.000185                  0.302985  
##                     rice                  soybeans    nuts_inc_peanut_butter  
##                 0.013010                 -0.176848                  0.206664  
## 
## Degrees of Freedom: 98 Total (i.e. Null);  87 Residual
## Null Deviance:       102.3 
## Residual Deviance: 45.77     AIC: 69.77
rf_fit <- rf_spec %>% 
  fit(americas ~ .,
      data = juice(food_rec))
rf_fit 
## parsnip model object
## 
## Fit time:  51ms 
## Ranger result
## 
## Call:
##  ranger::ranger(formula = formula, data = data, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      99 
## Number of independent variables:  11 
## Mtry:                             3 
## Target node size:                 10 
## Variable importance mode:         none 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1171945

Model Evaluation

Now to compare results on the training data versus the test data (which we have not touched up until now).

Mostly for my own benefit, I list down the definitions of performance metrics that I’ll be using:

  • Accuracy is the total number of correct predictions divided by the total number of predictions made for a dataset.
  • Precision quantifies how many of the predictions made for the positive class actually belong to the positive class.
  • Recall/Sensitivity quantifies how many of all the positive examples in the dataset were correctly predicted as positive.
  • F-Measure provides a single score that balances precision and recall (it is their harmonic mean); these are sketched as formulas below.
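For reference, here is how those four metrics relate to the counts of a confusion matrix (an illustrative sketch of my own; in practice yardstick's conf_mat() and summary() compute them for us below):

# tp/fp/fn/tn = true/false positives/negatives for the "Americas" class
accuracy_fn  <- function(tp, fp, fn, tn) (tp + tn) / (tp + fp + fn + tn)
precision_fn <- function(tp, fp, fn, tn) tp / (tp + fp)
recall_fn    <- function(tp, fp, fn, tn) tp / (tp + fn)
f_meas_fn    <- function(tp, fp, fn, tn) {
  p <- precision_fn(tp, fp, fn, tn)
  r <- recall_fn(tp, fp, fn, tn)
  2 * p * r / (p + r)
}
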
results_train <- log_fit %>%
  predict(new_data = food_train) %>%
  mutate(truth = food_train$americas) %>%
  conf_mat(truth, .pred_class) %>%
  summary() %>%
  mutate(model = "log") %>%
  bind_rows(
    rf_fit %>%
      predict(new_data = food_train) %>%
      mutate(truth = food_train$americas) %>%
      conf_mat(truth, .pred_class) %>%
      summary() %>%
      mutate(model = "rf")
  )

results_test <- log_fit %>%
  predict(new_data = bake(food_rec, food_test)) %>%
  mutate(truth = food_test$americas) %>%
  conf_mat(truth, .pred_class) %>%
  summary() %>%
  mutate(model = "log") %>%
  bind_rows(
    rf_fit %>%
      predict(new_data = bake(food_rec, food_test)) %>%
      mutate(truth = food_test$americas) %>%
      conf_mat(truth, .pred_class) %>%
      summary() %>%
      mutate(model = "rf")
  )

library(patchwork) # to combine ggplots

p1 <- results_train %>%
  filter(.metric %in% c("accuracy", "precision", "recall", "f_meas")) %>%
  ggplot(aes(.metric, .estimate, fill = model)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = round(.estimate, 2)),
            position = position_dodge(width = 0.9), vjust = -0.5) + 
  labs(
    x = "Performance Metrics (Training Data)",
    y = "Score"
  )

p2 <- results_test %>%
  filter(.metric %in% c("accuracy", "precision", "recall", "f_meas")) %>%
  ggplot(aes(.metric, .estimate, fill = model)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = round(.estimate, 2)),
            position = position_dodge(width = 0.9), vjust = -0.5) + 
  labs(
    x = "Performance Metrics (Test Data)",
    y = "Score"
  )

p1 + p2 + plot_layout(guides = "collect") & theme(legend.position = 'bottom')

In the training results, random forest does a stellar job with close to perfect scores across these metrics.

However, in our test results, our logistic regression model handily outperforms random forest. This difference in performance suggests that the random forest model is overfitting the training data. Resampling will give us more honest performance estimates from the training data and make this overfitting easier to spot.

Resampling

From Kuhn & Johnson (2019), resampling methods can generate different versions of our training set that can be used to simulate how well models would perform on new data.

In each case, a resampling scheme generates a subset of the data to be used for modeling and another that is used for measuring performance. This stems from the need to understand the effectiveness of the model without resorting to the test set. Simply repredicting the training set is problematic so a procedure is needed to get an appraisal using the training set.

Here, we opt for a 10-fold cross-validation data frame.

food_cv_folds <- food_train %>%
  vfold_cv()

food_cv_folds
## #  10-fold cross-validation 
## # A tibble: 10 x 2
##    splits          id    
##    <named list>    <chr> 
##  1 <split [89/10]> Fold01
##  2 <split [89/10]> Fold02
##  3 <split [89/10]> Fold03
##  4 <split [89/10]> Fold04
##  5 <split [89/10]> Fold05
##  6 <split [89/10]> Fold06
##  7 <split [89/10]> Fold07
##  8 <split [89/10]> Fold08
##  9 <split [89/10]> Fold09
## 10 <split [90/9]>  Fold10

Using our 10-fold CV data frame, we re-fit our models.
Subsequently, using the collect_metrics function, we can succinctly summarize our specified performance metrics for both models, join them into a tibble, and use patchwork once again to visualize them all in a single diagram.

log_refit <- fit_resamples(
  log_spec,
  americas ~ .,
  resamples = food_cv_folds,
  control = control_resamples(save_pred = T),
  metrics = metric_set(accuracy, f_meas, precision, recall)
)

rf_refit <- fit_resamples(
  rf_spec,
  americas ~ .,
  resamples = food_cv_folds,
  control = control_resamples(save_pred = T),
  metrics = metric_set(accuracy, f_meas, precision, recall)
)

results_train_refit <- log_refit %>%
  collect_metrics() %>%
  mutate(model = "log") %>%
  bind_rows(rf_refit %>%
              collect_metrics() %>%
              mutate(model = "rf"))

p3 <- results_train_refit %>%
  ggplot(aes(.metric, mean, fill = model)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = round(mean, 2)),
            position = position_dodge(width = 0.9), vjust = -0.5) + 
  labs(
    x = "Performance Metrics (Training Data, Resampled)",
    y = "Score"
  )

(p1 | p3) / p2 + 
  plot_annotation(title = "Training vs Training (Resampled) vs Test Data",
                  subtitle = "After resampling, overfitting is less apparent\nwith training performance metrics more closely resembling test data") +
  plot_layout(guides = "collect") & theme(legend.position = 'bottom')

Conclusion

I’ve barely explored the tip of the iceberg that is machine learning, but I’m excited. There’s plenty of room for improvement, especially with regard to data preprocessing via the recipes package and hyperparameter tuning via the tune package (which I skipped here, though a rough sketch follows below), both of which I aim to document in future blog posts.
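Purely as a teaser, and as a rough, untested sketch of my own rather than anything run in this post, tuning the random forest with tune might look roughly like this, reusing the CV folds from above:

rf_tune_spec <- rand_forest(mode = "classification",
                            mtry = tune(),
                            min_n = tune()) %>%
  set_engine("ranger")

rf_tune_res <- tune_grid(
  rf_tune_spec,
  americas ~ .,
  resamples = food_cv_folds,
  grid = 10, # a 10-point grid over mtry and min_n
  metrics = metric_set(accuracy, f_meas, precision, recall)
)

show_best(rf_tune_res, metric = "accuracy")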