Spotify Data? That's Music To My Ears!
This dataset comes from the very popular TidyTuesday GitHub repo. Given my love for music, a Spotify dataset was the perfect excuse to have a go at visualization.
In the spirit of “Perfect is the enemy of good”, this will be a short post aimed at answering just a couple of questions with EDA and visualization.
Datasets from TidyTuesday are usually cleaned (or at least there’ll be instructions/hints on where to start), so I begin by importing the data and exploring it via skimr.
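library(tidyverse)   # read_csv(), dplyr verbs, stringr, ggplot2
library(tidytext)    # reorder_within(), scale_x_reordered()
library(hrbrthemes)  # theme_ipsum()
spotify_songs <-
  read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
skimr::skim(spotify_songs)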
Name | spotify_songs |
Number of rows | 32833 |
Number of columns | 23 |
_______________________ | |
Column type frequency: | |
character | 10 |
numeric | 13 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
track_id | 0 | 1 | 22 | 22 | 0 | 28356 | 0 |
track_name | 5 | 1 | 1 | 144 | 0 | 23449 | 0 |
track_artist | 5 | 1 | 2 | 69 | 0 | 10692 | 0 |
track_album_id | 0 | 1 | 22 | 22 | 0 | 22545 | 0 |
track_album_name | 5 | 1 | 1 | 151 | 0 | 19743 | 0 |
track_album_release_date | 0 | 1 | 4 | 10 | 0 | 4530 | 0 |
playlist_name | 0 | 1 | 6 | 120 | 0 | 449 | 0 |
playlist_id | 0 | 1 | 22 | 22 | 0 | 471 | 0 |
playlist_genre | 0 | 1 | 3 | 5 | 0 | 6 | 0 |
playlist_subgenre | 0 | 1 | 4 | 25 | 0 | 24 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
track_popularity | 0 | 1 | 42.48 | 24.98 | 0.00 | 24.00 | 45.00 | 62.00 | 100.00 | ▆▆▇▆▁ |
danceability | 0 | 1 | 0.65 | 0.15 | 0.00 | 0.56 | 0.67 | 0.76 | 0.98 | ▁▁▃▇▃ |
energy | 0 | 1 | 0.70 | 0.18 | 0.00 | 0.58 | 0.72 | 0.84 | 1.00 | ▁▁▅▇▇ |
key | 0 | 1 | 5.37 | 3.61 | 0.00 | 2.00 | 6.00 | 9.00 | 11.00 | ▇▂▅▅▆ |
loudness | 0 | 1 | -6.72 | 2.99 | -46.45 | -8.17 | -6.17 | -4.64 | 1.27 | ▁▁▁▂▇ |
mode | 0 | 1 | 0.57 | 0.50 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▆▁▁▁▇ |
speechiness | 0 | 1 | 0.11 | 0.10 | 0.00 | 0.04 | 0.06 | 0.13 | 0.92 | ▇▂▁▁▁ |
acousticness | 0 | 1 | 0.18 | 0.22 | 0.00 | 0.02 | 0.08 | 0.26 | 0.99 | ▇▂▁▁▁ |
instrumentalness | 0 | 1 | 0.08 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 0.99 | ▇▁▁▁▁ |
liveness | 0 | 1 | 0.19 | 0.15 | 0.00 | 0.09 | 0.13 | 0.25 | 1.00 | ▇▃▁▁▁ |
valence | 0 | 1 | 0.51 | 0.23 | 0.00 | 0.33 | 0.51 | 0.69 | 0.99 | ▃▇▇▇▃ |
tempo | 0 | 1 | 120.88 | 26.90 | 0.00 | 99.96 | 121.98 | 133.92 | 239.44 | ▁▂▇▂▁ |
duration_ms | 0 | 1 | 225799.81 | 59834.01 | 4000.00 | 187819.00 | 216000.00 | 253585.00 | 517810.00 | ▁▇▇▁▁ |
A lot of interesting labels are associated with the data, some of which include danceability, instrumentalness and valence. Full definitions can be found in the associated data dictionary.
I proceed to wrangle the data by adding my own labels to indicate the decade in which each track/album was released.
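spotify <- spotify_songs %>%
  # keep one row per track/artist pair
  distinct(track_name, track_artist, .keep_all = TRUE) %>%
  # release dates start with a four-digit year
  mutate(year = str_extract(track_album_release_date, "^\\d{4}"))
# right = FALSE gives bins [1950, 1960), [1960, 1970), ...
# so that 1960 is labelled "60s" rather than "50s"
spotify$decades <- cut(
  as.numeric(spotify$year),
  c(1950, 1960, 1970, 1980, 1990, 2000, 2010, 2021),
  labels = c("50s", "60s", "70s", "80s", "90s", "2000s", "2010s"),
  right = FALSE
)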
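Since `cut()` closes its intervals on the right by default, it is worth checking which label a boundary year such as 1960 receives. Below is a minimal sketch in base R; the `years` vector is made up purely to probe the edges, and the break points are illustrative assumptions rather than anything read from the dataset.

```r
# Hypothetical boundary years, chosen only to probe the interval edges
years  <- c(1959, 1960, 1969, 1970, 2009, 2010)
breaks <- c(1950, 1960, 1970, 1980, 1990, 2000, 2010, 2021)
labels <- c("50s", "60s", "70s", "80s", "90s", "2000s", "2010s")

# Default (right = TRUE): intervals are (lower, upper]
right_closed <- as.character(cut(years, breaks, labels = labels))

# right = FALSE: intervals are [lower, upper)
left_closed <- as.character(cut(years, breaks, labels = labels, right = FALSE))

# Arithmetic alternative: floor each year to the start of its decade
decade_start <- floor(years / 10) * 10

# Side-by-side comparison of the three labelling schemes
data.frame(years, right_closed, left_closed, decade_start)
```

With the default, 1960 and 1970 fall into the earlier decade; with `right = FALSE` they start the new one. The `floor(years / 10) * 10` line is an alternative that avoids break points altogether.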
Using track popularity as a gauge, how have subgenres evolved over the decades?
spotify %>%
  # keep sub-genres with more than 5 tracks in a given decade
  group_by(decades, playlist_subgenre) %>%
  add_count(playlist_subgenre) %>%
  filter(n > 5) %>%
  # order sub-genres by popularity separately within each decade facet
  ggplot(aes(
    reorder_within(playlist_subgenre, track_popularity, decades),
    track_popularity
  )) +
  geom_boxplot(aes(fill = playlist_genre)) +
  coord_flip() +
  facet_wrap(decades ~ ., nrow = 2, scales = "free_y") +
  scale_x_reordered() +
  theme_ipsum() +
  labs(
    title = "Popularity of Genres Through The Decades",
    subtitle = "Recent Decades Saw An Explosion of Music Genres - Led by Rock and R&B",
    caption = "\nSource: TidyTuesday\nVisualization: Desmond Choy (Twitter @Norest)",
    fill = "Music Genres", # legend title
    x = "Music Sub-Genres",
    y = "Track Popularity"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 25),
    plot.subtitle = element_text(size = 15),
    strip.background = element_blank(),
    strip.text = element_text(face = "bold", size = 15),
    legend.position = "top",
    legend.box = "horizontal",
    legend.text = element_text(size = 10)
  ) +
  # the legend belongs to the fill aesthetic, so lay it out on one row via fill
  guides(fill = guide_legend(nrow = 1))
Permanent wave stands out as a rock sub-genre that, until 2010, stood the test of time in terms of popularity.
Trouble is… as an avid music fan, I had never heard of permanent wave at all! Somewhat horrified, I dug into the dataset a little more and discovered that permanent wave actually includes a few of my all-time favourite artists. I’ve been a closet permanent wave fan all this while!
spotify %>%
  filter(playlist_subgenre == "permanent wave") %>%
  count(track_artist, sort = TRUE)
# # A tibble: 471 x 2
# track_artist n
# <chr> <int>
# 1 Muse 19
# 2 The Smiths 19
# 3 David Bowie 13
# 4 Depeche Mode 12
# 5 The Cure 12
# 6 Foo Fighters 11
# 7 New Order 11
# 8 Red Hot Chili Peppers 11
# 9 George Harrison 9
# 10 Oasis 9
# # ... with 461 more rows
How about some suggestions for danceable EDM tracks that I could listen to when out for a run?
We filter by Danceability, defined as how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
spotify %>%
  select(playlist_genre, playlist_subgenre, track_name, danceability) %>%
  filter(playlist_genre == "edm") %>%
  distinct(track_name, .keep_all = TRUE) %>%
  group_by(playlist_subgenre) %>%
  # 20 most danceable tracks per sub-genre (ties are kept)
  top_n(n = 20, wt = danceability) %>%
  ggplot(aes(reorder_within(track_name, danceability, playlist_subgenre), danceability)) +
  # colour only separates the facets, so the legend is switched off
  geom_point(aes(colour = playlist_subgenre), size = 3, show.legend = FALSE) +
  coord_flip() +
  facet_wrap(. ~ playlist_subgenre, nrow = 2, scales = "free_y") +
  scale_x_reordered() +
  theme_ipsum() +
  labs(
    title = "What are some of the most danceable EDM tracks?",
    subtitle = "Danceability describes how suitable a track is for dancing based on a combination of musical elements\nA value of 0.0 is least danceable and 1.0 is most danceable.",
    caption = "\nSource: TidyTuesday\nVisualization: Desmond Choy (Twitter @Norest)",
    x = "Track Name",
    y = "Danceability"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 25),
    plot.subtitle = element_text(size = 15),
    strip.background = element_blank(),
    strip.text = element_text(face = "bold", size = 15)
  )
Finally, how about some curated suggestions? Based on the criteria listed below, which sub-genres would be recommended?
Instrumentalness
: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content.
Acousticness
: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
Valence
: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
So my approach was to create a single criteria score by summing Instrumentalness, Acousticness and Valence. Sub-genres with the highest average score would then be picked… right?
spotify %>%
  mutate(criteria = instrumentalness + acousticness + valence) %>%
  select(playlist_genre, playlist_subgenre, track_album_name, criteria) %>%
  distinct(track_album_name, .keep_all = TRUE) %>%
  group_by(playlist_subgenre) %>%
  summarise(criteria = mean(criteria)) %>%
  arrange(desc(criteria))
# # A tibble: 24 x 2
# playlist_subgenre criteria
# <chr> <dbl>
# 1 hip hop 1.08
# 2 tropical 0.945
# 3 reggaeton 0.907
# 4 neo soul 0.888
# 5 latin pop 0.866
# 6 classic rock 0.865
# 7 electro house 0.852
# 8 hip pop 0.850
# 9 urban contemporary 0.830
# 10 latin hip hop 0.823
# # ... with 14 more rows
Hip hop?? When you think of acoustic and instrumental tunes… hip hop doesn’t quite come to mind.
spotify %>%
  mutate(criteria = instrumentalness + acousticness + valence) %>%
  select(playlist_genre, playlist_subgenre, track_artist, track_album_name, criteria) %>%
  distinct(track_album_name, .keep_all = TRUE) %>%
  filter(playlist_subgenre == "hip hop") %>%
  arrange(desc(criteria)) %>%
  head(20)
# # A tibble: 20 x 5
# playlist_genre playlist_subgenre track_artist track_album_name criteria
# <chr> <chr> <chr> <chr> <dbl>
# 1 rap hip hop Goldenninjah Moods 2.86
# 2 rap hip hop oofoe double oo tape 2.73
# 3 rap hip hop luvwn sanya 2.70
# 4 rap hip hop Brenky Previsão 2.7
# 5 rap hip hop Bluedoom 4:20 PM 2.66
# 6 rap hip hop Sarah, the Ill~ Pocket Full of Cry~ 2.62
# 7 rap hip hop Loop Schrauber Repeat 2.62
# 8 rap hip hop Chris Keys Detour 2.62
# 9 rap hip hop Ymori Better Things 2.60
# 10 rap hip hop Leavv essence 2.58
# 11 rap hip hop Flynn Cycles 2.58
# 12 rap hip hop junyii. junyii·dr!p 2.52
# 13 rap hip hop Smeyeul. Bedroom Skits 2.52
# 14 rap hip hop Nathan Kawanis~ Yokohama 2.52
# 15 rap hip hop Brenky Winter Flakes 2.52
# 16 rap hip hop Chill Children bob le head 2.49
# 17 rap hip hop Mr Mantega Fire to Hire 2.44
# 18 rap hip hop jrd. Reflections 2.43
# 19 rap hip hop David Chief Sands EP 2.42
# 20 rap hip hop Made in M Flashlight 2.41
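A composite score like this hides which ingredient is doing the work, so one sanity check is to average each component per sub-genre before summing. The sketch below does that on a made-up `toy` tibble (the sub-genre names and values are hypothetical, not taken from the Spotify data). Because the mean of a sum equals the sum of the means, its `criteria` column matches what averaging the per-track sums would give.

```r
library(dplyr)

# Hypothetical toy data: names and values are made up for illustration
toy <- tibble(
  subgenre         = c("subgenre_a", "subgenre_a", "subgenre_b", "subgenre_b"),
  instrumentalness = c(0.9, 0.7, 0.0, 0.2),
  acousticness     = c(0.5, 0.3, 0.1, 0.1),
  valence          = c(0.4, 0.6, 0.8, 0.6)
)

component_means <- toy %>%
  group_by(subgenre) %>%
  # average each ingredient of the score separately
  summarise(
    across(c(instrumentalness, acousticness, valence), mean),
    .groups = "drop"
  ) %>%
  # rebuild the composite score from the per-group averages
  mutate(criteria = instrumentalness + acousticness + valence) %>%
  arrange(desc(criteria))

component_means
```

Swapping `toy` for `spotify` and `subgenre` for `playlist_subgenre` should show which of the three features lifts a sub-genre up the ranking.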
I initially thought there was an error in the data or my code. But I picked a few tunes to sample and it turns out I genuinely enjoyed all of them! This was an amazingly fruitful and productive exploration of new music to widen my aural horizons.
Here’s a Top 20 playlist, based on my criteria.
spotify %>%
  mutate(criteria = instrumentalness + acousticness + valence) %>%
  select(playlist_genre, playlist_subgenre, track_artist, track_album_name, criteria) %>%
  distinct(track_album_name, .keep_all = TRUE) %>%
  arrange(desc(criteria)) %>%
  head(20)
# # A tibble: 20 x 5
# playlist_genre playlist_subgenre track_artist track_album_name criteria
# <chr> <chr> <chr> <chr> <dbl>
# 1 rap hip hop Goldenninjah Moods 2.86
# 2 latin tropical Kavv Cruise Control 2.77
# 3 latin tropical S-Ilo Targa 2.73
# 4 rap hip hop oofoe double oo tape 2.73
# 5 rap hip hop luvwn sanya 2.70
# 6 rap hip hop Brenky Previsão 2.7
# 7 r&b urban contempora~ Paco de Lucía La Búsqueda (Remas~ 2.68
# 8 latin tropical Reyna Tropical Como Fuego 2.68
# 9 rap hip hop Bluedoom 4:20 PM 2.66
# 10 rock classic rock Booker T. & th~ Green Onions 2.63
# 11 rap hip hop Sarah, the Ill~ Pocket Full of Cry~ 2.62
# 12 rap hip hop Loop Schrauber Repeat 2.62
# 13 rap hip hop Chris Keys Detour 2.62
# 14 rap hip hop Ymori Better Things 2.60
# 15 rap hip hop Leavv essence 2.58
# 16 rap hip hop Flynn Cycles 2.58
# 17 r&b urban contempora~ Grey Goodnight, Universe 2.57
# 18 pop indie poptimism Joe Corfield Chillhop Essential~ 2.54
# 19 latin tropical S-Ilo Ascent 2.53
# 20 rap hip hop junyii. junyii·dr!p 2.52
As always, the R Markdown document can be found on my GitHub should you wish to replicate these results.