Food Consumption and CO2 Emissions
This will be my first attempt at machine learning using the tidymodels package, with a dataset taken from TidyTuesday. The code used to scrape this data can be found here.
The study analyses data from the Food and Agriculture Organization of the United Nations (FAO) to determine the quantity of produce supplied for consumption of 11 food types for all countries researched. Using CO2 emissions data, the carbon footprint per capita is then calculated for each food type.
I heavily borrow intuition gleaned from this blog post by Julia Silge, who does an amazing job of detailing the intricacies of the tidymodels metapackage in her various blog posts.
library(tidyverse)
theme_set(theme_minimal())
food_consumption <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv")
food_consumption
# # A tibble: 1,430 x 4
# country food_category consumption co2_emmission
# <chr> <chr> <dbl> <dbl>
# 1 Argentina Pork 10.5 37.2
# 2 Argentina Poultry 38.7 41.5
# 3 Argentina Beef 55.5 1712
# 4 Argentina Lamb & Goat 1.56 54.6
# 5 Argentina Fish 4.36 6.96
# 6 Argentina Eggs 11.4 10.5
# 7 Argentina Milk - inc. cheese 195. 278.
# 8 Argentina Wheat and Wheat Products 103. 19.7
# 9 Argentina Rice 8.77 11.2
# 10 Argentina Soybeans 0 0
# # ... with 1,420 more rows
From the Data Dictionary, these are some important definitions:
- Consumption is measured in kg/person/year
- CO2 emission is measured in kg CO2/person/year
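As a quick sanity check on these units (a sketch of my own, not part of the original analysis), dividing the two columns gives an implied emissions intensity in kg CO2 per kg consumed for each food category:

# Implied emissions intensity: kg CO2 emitted per kg consumed.
# Rows with zero consumption are dropped to avoid dividing by zero.
food_consumption %>%
  filter(consumption > 0) %>%
  mutate(co2_per_kg = co2_emmission / consumption) %>%
  group_by(food_category) %>%
  summarize(median_co2_per_kg = median(co2_per_kg)) %>%
  arrange(desc(median_co2_per_kg))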
Using a preliminary skim, the dataset has 130 unique countries, each tagged with 11 different categories of food, and no missing values.
library(skimr)
skim(food_consumption)
| | |
|---|---|
| Name | food_consumption |
| Number of rows | 1430 |
| Number of columns | 4 |
| Column type frequency: character | 2 |
| Column type frequency: numeric | 2 |
| Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
country | 0 | 1 | 3 | 22 | 0 | 130 | 0 |
food_category | 0 | 1 | 4 | 24 | 0 | 11 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
consumption | 0 | 1 | 28.11 | 49.82 | 0 | 2.37 | 8.89 | 28.13 | 430.76 | ▇▁▁▁▁ |
co2_emmission | 0 | 1 | 74.38 | 152.10 | 0 | 5.21 | 16.53 | 62.60 | 1712.00 | ▇▁▁▁▁ |
food_consumption %>%
count(food_category)
# # A tibble: 11 x 2
# food_category n
# <chr> <int>
# 1 Beef 130
# 2 Eggs 130
# 3 Fish 130
# 4 Lamb & Goat 130
# 5 Milk - inc. cheese 130
# 6 Nuts inc. Peanut Butter 130
# 7 Pork 130
# 8 Poultry 130
# 9 Rice 130
# 10 Soybeans 130
# 11 Wheat and Wheat Products 130
Which food category, on average, contributes the most to CO2 emissions?
food_consumption %>%
group_by(food_category) %>%
summarize(avg.co2 = mean(co2_emmission)) %>%
ggplot(aes(fct_reorder(food_category, avg.co2), avg.co2, fill = food_category)) +
geom_col(show.legend = F) +
coord_flip() +
labs(
x = "Food Categories",
y = "Average CO2 Emissions"
)
How about inspecting the median values and country outliers through a boxplot?
library(ggrepel)

food_consumption %>%
  group_by(food_category) %>%
  ggplot(aes(
    fct_reorder(food_category, co2_emmission),
    co2_emmission,
    fill = food_category
  )) +
  geom_boxplot(show.legend = F) +
  geom_text_repel(
    data = . %>%
      filter(food_category == "Beef") %>%
      mutate(percentile = co2_emmission >= quantile(co2_emmission, 0.95, na.rm = T)) %>%
      filter(percentile == 1),
    aes(label = country)
  ) +
  coord_flip() +
  labs(
    x = "Food Categories",
    y = "CO2 Emissions"
  )
Filtering for Beef only, let’s take a closer look at the top 20 countries.
food_consumption %>%
filter(food_category == "Beef") %>%
arrange(desc(consumption)) %>%
top_n(20, consumption) %>%
ggplot(aes(fct_reorder(country, consumption), consumption, fill = country)) +
geom_col(show.legend = F) +
coord_flip() +
labs(
x = "Country",
y = "Beef Consumption"
)
I did not expect Argentina to top that list by such a significant margin. A bit of googling to fact-check confirms that Argentinians are indeed lovers of red meat, though per-person consumption appears to be coming down due to a combination of rampant inflation and growing health awareness of non-meat options. But I digress.
Quite a number of North and South American countries show up as heavy beef consumers. This gives me an idea: can I predict, using co2_emmission, whether a country belongs to the Americas or not?
I use the countrycode package to create a new column identifying which continent a country belongs to, and from that a binary label of whether or not the country is in the Americas.
library(countrycode)
library(janitor)
food <- food_consumption %>%
select(-consumption) %>%
pivot_wider(names_from = food_category,
values_from = co2_emmission) %>%
clean_names() %>%
mutate(continent = countrycode(country,
origin = "country.name",
destination = "continent")) %>%
mutate(americas = case_when(continent == "Americas" ~ "Americas",
TRUE ~ "Other")) %>%
select(-country,-continent) %>%
mutate_if(is.character, as_factor)
food %>%
select(americas, everything())
# # A tibble: 130 x 12
# americas pork poultry beef lamb_goat fish eggs milk_inc_cheese
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Americas 37.2 41.5 1712 54.6 6.96 10.5 278.
# 2 Other 85.4 49.5 1045. 346. 28.2 7.82 334.
# 3 Other 38.5 14.2 694. 536. 6.15 11.4 433.
# 4 Other 76.8 28.9 412. 740. 119. 7.57 322.
# 5 Other 78.9 37.6 694. 662. 32.5 9.1 196.
# 6 Americas 97.8 53.7 1118. 15.1 19.7 13.4 363.
# 7 Americas 59.6 29.5 898. 288. 10.4 12.1 300.
# 8 Other 154. 23.0 922. 58.5 36.9 13.4 364.
# 9 Americas 44.6 48.3 1211. 21.7 16.0 8.25 213.
# 10 Other 36.7 19.7 721. 335. 8.32 7.62 410.
# # ... with 120 more rows, and 4 more variables: wheat_and_wheat_products <dbl>,
# # rice <dbl>, soybeans <dbl>, nuts_inc_peanut_butter <dbl>
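Before modelling, a quick sketch (my own addition) to check the class balance of the outcome:

# How many countries fall in each class? This informs the
# stratified sampling used in the next section.
food %>%
  count(americas)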
Given that all of our predictors are numeric, a ggscatmat - a scatterplot matrix - can be used:
- The diagonal shows density plots
- The upper half shows correlation coefficients
- The lower half shows scatterplots
library(GGally)
food %>%
ggscatmat(columns = 1:11, color = "americas", alpha = 0.7)
Nothing too out of the ordinary, although poultry and beef tend to contribute more CO2 in the Americas than in the rest of the world - very likely as a function of higher consumption.
With that, let’s dive into modelling.
Data Splitting
We begin by splitting the dataset into training and test sets. I opt for stratified random sampling, accomplished by sampling at random within each class. As Kuhn & Johnson (2019) put it, this approach ensures that the frequency distribution of the outcome is approximately equal within the training and test sets.
library(tidymodels)
set.seed(2020)
food_split <- initial_split(food, strata = americas)
food_split
# <Training/Validation/Total>
# <99/31/130>
food_train <- training(food_split)
food_test <- testing(food_split)
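As a quick sanity check (a sketch; the exact counts depend on the seed), stratification should leave both sets with similar class proportions:

# Compare outcome proportions across the two sets.
food_train %>%
  count(americas) %>%
  mutate(prop = n / sum(n))

food_test %>%
  count(americas) %>%
  mutate(prop = n / sum(n))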
Data Preprocessing
Next, we use the recipes
package. The intuition behind this package is to define a recipe or blueprint that can be used to sequentially define the encodings and preprocessing of the data (i.e. feature engineering).
food_rec <- recipe(americas ~ ., data = food_train) %>%
step_corr(all_numeric()) %>%
prep()
food_rec
# Data Recipe
#
# Inputs:
#
# role #variables
# outcome 1
# predictor 11
#
# Training data contained 99 data points and no missing data.
#
# Operations:
#
# Correlation filter removed no terms [trained]
step_corr: potentially removes variables that have large absolute correlations with other variables. Even though the correlation filter was applied, no terms were removed - implying low multicollinearity among our predictors.
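To see exactly which columns a trained step dropped, each recipe step has a tidy method; a small sketch (assuming step_corr is the first and only step, as above):

# The terms column lists variables removed by the correlation
# filter - empty here, since nothing was dropped.
tidy(food_rec, number = 1)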
Model Specification
Let’s begin with the specification of two models - Logistic Regression and Random Forest.
log_spec <- logistic_reg(mode = "classification") %>%
set_engine("glm")
rf_spec <- rand_forest(mode = "classification") %>%
set_engine("ranger")
Now we fit (train) the models on a juice-d version of the data. juice essentially runs the preprocessing steps defined in the recipe and returns the processed variables.
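A quick sketch of what that looks like - the preprocessed training data comes back as a tibble:

# Peek at the preprocessed training set: the outcome plus 11 predictors.
juice(food_rec) %>%
  glimpse()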
log_fit <- log_spec %>%
fit(americas ~ .,
data = juice(food_rec))
log_fit
# parsnip model object
#
# Fit time: 0ms
#
# Call: stats::glm(formula = formula, family = stats::binomial, data = data)
#
# Coefficients:
# (Intercept) pork poultry
# -0.367402 0.010330 -0.127695
# beef lamb_goat fish
# -0.002816 0.006490 0.074879
# eggs milk_inc_cheese wheat_and_wheat_products
# -0.210197 -0.000185 0.302985
# rice soybeans nuts_inc_peanut_butter
# 0.013010 -0.176848 0.206664
#
# Degrees of Freedom: 98 Total (i.e. Null); 87 Residual
# Null Deviance: 102.3
# Residual Deviance: 45.77 AIC: 69.77
rf_fit <- rf_spec %>%
fit(americas ~ .,
data = juice(food_rec))
rf_fit
# parsnip model object
#
# Fit time: 51ms
# Ranger result
#
# Call:
# ranger::ranger(formula = formula, data = data, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
#
# Type: Probability estimation
# Number of trees: 500
# Sample size: 99
# Number of independent variables: 11
# Mtry: 3
# Target node size: 10
# Variable importance mode: none
# Splitrule: gini
# OOB prediction error (Brier s.): 0.1171945
Model Evaluation
Now to compare training results against test results (which we have not touched until now).
Mostly for my own benefit, here are the definitions of the performance metrics I’ll be using, followed by a small worked example:
- Accuracy is the total number of correct predictions divided by the total number of predictions made for a dataset.
- Precision quantifies the number of positive class predictions that actually belong to the positive class.
- Recall/Sensitivity quantifies the number of positive class predictions made out of all positive examples in the dataset.
- F-Measure combines precision and recall into a single score - their harmonic mean.
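As a worked sketch of how these fall out of a 2x2 confusion matrix (the counts below are made up purely for illustration):

# Hypothetical confusion matrix counts - illustrative only.
tp <- 20; fp <- 5; fn <- 10; tn <- 65

accuracy  <- (tp + tn) / (tp + fp + fn + tn)                # 0.85
precision <- tp / (tp + fp)                                 # 0.80
recall    <- tp / (tp + fn)                                 # ~0.67
f_meas    <- 2 * precision * recall / (precision + recall)  # ~0.73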
results_train <- log_fit %>%
predict(new_data = food_train) %>%
mutate(truth = food_train$americas) %>%
conf_mat(truth, .pred_class) %>%
summary() %>%
mutate(model = "log") %>%
bind_rows(
rf_fit %>%
predict(new_data = food_train) %>%
mutate(truth = food_train$americas) %>%
conf_mat(truth, .pred_class) %>%
summary() %>%
mutate(model = "rf")
)
results_test <- log_fit %>%
predict(new_data = bake(food_rec, food_test)) %>%
mutate(truth = food_test$americas) %>%
conf_mat(truth, .pred_class) %>%
summary() %>%
mutate(model = "log") %>%
bind_rows(
rf_fit %>%
predict(new_data = bake(food_rec, food_test)) %>%
mutate(truth = food_test$americas) %>%
conf_mat(truth, .pred_class) %>%
summary() %>%
mutate(model = "rf")
)
library(patchwork) #to combine ggplots
p1 <- results_train %>%
filter(.metric %in% c("accuracy", "precision", "recall", "f_meas")) %>%
ggplot(aes(.metric, .estimate, fill = model)) +
geom_col(position = "dodge") +
geom_text(aes(label = round(.estimate, 2)),
position = position_dodge(width = 0.9), vjust = -0.5) +
labs(
x = "Performance Metrics (Training Data)",
y = "Score"
)
p2 <- results_test %>%
filter(.metric %in% c("accuracy", "precision", "recall", "f_meas")) %>%
ggplot(aes(.metric, .estimate, fill = model)) +
geom_col(position = "dodge") +
geom_text(aes(label = round(.estimate, 2)),
position = position_dodge(width = 0.9), vjust = -0.5) +
labs(
x = "Performance Metrics (Test Data)",
y = "Score"
)
p1 + p2 + plot_layout(guides = "collect") & theme(legend.position = 'bottom')
In the training results, random forest does a stellar job with near-perfect scores across these metrics.
However, on the test data, our logistic regression model handily outperforms random forest. This gap in performance suggests the random forest model is overfitting. Resampling will let us detect this kind of overfitting using the training data alone.
Resampling
From Kuhn & Johnson (2019), resampling methods can generate different versions of our training set that can be used to simulate how well models would perform on new data.
In each case, a resampling scheme generates a subset of the data to be used for modeling and another that is used for measuring performance. This stems from the need to understand the effectiveness of the model without resorting to the test set. Simply repredicting the training set is problematic so a procedure is needed to get an appraisal using the training set.
Here, we opt for 10-fold cross-validation, stored as a data frame of splits.
food_cv_folds <- food_train %>%
vfold_cv()
food_cv_folds
# # 10-fold cross-validation
# # A tibble: 10 x 2
# splits id
# <named list> <chr>
# 1 <split [89/10]> Fold01
# 2 <split [89/10]> Fold02
# 3 <split [89/10]> Fold03
# 4 <split [89/10]> Fold04
# 5 <split [89/10]> Fold05
# 6 <split [89/10]> Fold06
# 7 <split [89/10]> Fold07
# 8 <split [89/10]> Fold08
# 9 <split [89/10]> Fold09
# 10 <split [90/9]> Fold10
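Each split holds an analysis set (used for fitting) and an assessment set (held out for measuring performance); rsample exposes both through accessor functions. A quick sketch on the first fold:

# The first fold: 89 rows for fitting, 10 held out for assessment.
first_fold <- food_cv_folds$splits[[1]]
dim(analysis(first_fold))
dim(assessment(first_fold))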
Using our 10-fold CV data frame, we re-fit our models. Subsequently, the collect_metrics function lets us succinctly summarize our specified performance metrics for both models, join them into a single tibble, and use patchwork once again to visualize them all in a single diagram.
log_refit <- fit_resamples(
log_spec,
americas ~ .,
resamples = food_cv_folds,
control = control_resamples(save_pred = T),
metrics = metric_set(accuracy, f_meas, precision, recall)
)
rf_refit <- fit_resamples(
rf_spec,
americas ~ .,
resamples = food_cv_folds,
control = control_resamples(save_pred = T),
metrics = metric_set(accuracy, f_meas, precision, recall)
)
results_train_refit <- log_refit %>%
collect_metrics() %>%
mutate(model = "log") %>%
bind_rows(rf_refit %>%
collect_metrics() %>%
mutate(model = "rf"))
p3 <- results_train_refit %>%
ggplot(aes(.metric, mean, fill = model)) +
geom_col(position = "dodge") +
geom_text(aes(label = round(mean, 2)),
position = position_dodge(width = 0.9), vjust = -0.5) +
labs(
x = "Performance Metrics (Training Data, Resampled)",
y = "Score"
)
(p1 | p3) / p2 +
plot_annotation(title = "Training vs Training (Resampled) vs Test Data",
subtitle = "After resampling, overfitting is less apparent\nwith training performance metrics more closely resembling test data") +
plot_layout(guides = "collect") & theme(legend.position = 'bottom')
Conclusion
I’ve barely explored the tip of the iceberg that is machine learning, but I’m excited. There’s plenty of room for improvement, especially with regards to data preprocessing via the recipes package and hyperparameter tuning via the tune package (which I skipped) - both of which I aim to document in future blog posts.
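As a teaser for that future post, here is a minimal, untested sketch of what tuning the random forest’s mtry and min_n with tune could look like, reusing the folds defined above:

# Mark two hyperparameters as tunable instead of fixed.
rf_tune_spec <- rand_forest(mode = "classification",
                            mtry = tune(),
                            min_n = tune()) %>%
  set_engine("ranger")

# Evaluate a small grid of candidate values across the resamples.
rf_tuned <- tune_grid(
  rf_tune_spec,
  americas ~ .,
  resamples = food_cv_folds,
  grid = 10,
  metrics = metric_set(accuracy, f_meas)
)

# Rank the candidates by accuracy.
show_best(rf_tuned, metric = "accuracy")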