Tidy Tuesday - World Cup Edition

tidytuesday
analysis
Author

Sam Weiner

Published

December 11, 2022

TidyTuesday

Tidy Tuesday is an awesome initiative by the R for Data Science Community where data sets are published every week for the R community share any analysis and visualizations that can be conjured from the data!

World Cup Data

This data set contains data on every World Cup since 1930.

library(tidyverse)
library(tidytuesdayR)
library(tidymodels)
theme_set(theme_light())

Load in the World Cup data

tt <- tt_load("2022-11-29")
tt

This episode of Tidy Tuesday contains two data sets. wcmatches (matches from now on) contains 900 rows, one for each game played on the World Cup stage. It has variables for who played who, what stage (or round) of the tournament the game was played, who won, and the goals scored by each team.

matches <- tt$wcmatches
glimpse(matches)
Rows: 900
Columns: 15
$ year           <dbl> 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1…
$ country        <chr> "Uruguay", "Uruguay", "Uruguay", "Uruguay", "Uruguay", …
$ city           <chr> "Montevideo", "Montevideo", "Montevideo", "Montevideo",…
$ stage          <chr> "Group 1", "Group 4", "Group 2", "Group 3", "Group 1", …
$ home_team      <chr> "France", "Belgium", "Brazil", "Peru", "Argentina", "Ch…
$ away_team      <chr> "Mexico", "United States", "Yugoslavia", "Romania", "Fr…
$ home_score     <dbl> 4, 0, 1, 1, 1, 3, 0, 0, 1, 6, 1, 0, 0, 4, 3, 6, 6, 4, 2…
$ away_score     <dbl> 1, 3, 2, 3, 0, 0, 4, 3, 0, 3, 0, 1, 4, 0, 1, 1, 1, 2, 3…
$ outcome        <chr> "H", "A", "A", "A", "H", "H", "A", "A", "H", "H", "H", …
$ win_conditions <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ winning_team   <chr> "France", "United States", "Yugoslavia", "Romania", "Ar…
$ losing_team    <chr> "Mexico", "Belgium", "Brazil", "Peru", "France", "Mexic…
$ date           <date> 1930-07-13, 1930-07-13, 1930-07-14, 1930-07-14, 1930-0…
$ month          <chr> "Jul", "Jul", "Jul", "Jul", "Jul", "Jul", "Jul", "Jul",…
$ dayofweek      <chr> "Sunday", "Sunday", "Monday", "Monday", "Tuesday", "Wed…

worldcups (wcups from now on) has information about each world cup on a high level. It contains information like the name of the host country each World Cup, the names of the top four finishers, total tournament goals scored, and the number of teams, games, and attendants.

wcups <- tt$worldcups
glimpse(wcups)
Rows: 21
Columns: 10
$ year         <dbl> 1930, 1934, 1938, 1950, 1954, 1958, 1962, 1966, 1970, 197…
$ host         <chr> "Uruguay", "Italy", "France", "Brazil", "Switzerland", "S…
$ winner       <chr> "Uruguay", "Italy", "Italy", "Uruguay", "West Germany", "…
$ second       <chr> "Argentina", "Czechoslovakia", "Hungary", "Brazil", "Hung…
$ third        <chr> "USA", "Germany", "Brazil", "Sweden", "Austria", "France"…
$ fourth       <chr> "Yugoslavia", "Austria", "Sweden", "Spain", "Uruguay", "W…
$ goals_scored <dbl> 70, 70, 84, 88, 140, 126, 89, 89, 95, 97, 102, 146, 132, …
$ teams        <dbl> 13, 16, 15, 13, 16, 16, 16, 16, 16, 16, 16, 24, 24, 24, 2…
$ games        <dbl> 18, 17, 18, 22, 26, 35, 32, 32, 32, 38, 38, 52, 52, 52, 5…
$ attendance   <dbl> 434000, 395000, 483000, 1337000, 943000, 868000, 776000, …

Explore the Data

Let’s first look the most common match scores for all the games. It looks like low scoring games are the most common.

matches %>% 
  count(home_score, away_score, sort = TRUE)
# A tibble: 49 × 3
   home_score away_score     n
        <dbl>      <dbl> <int>
 1          1          0    94
 2          2          1    90
 3          1          1    87
 4          0          1    78
 5          0          0    71
 6          2          0    59
 7          1          2    51
 8          0          2    40
 9          3          1    36
10          2          2    34
# … with 39 more rows

I want to visualize how many games there were for total goals scored between the home and away teams. We can do this by creating a new column (total_score) adding the two scores together and counting the number of games with that score.

We can use ggplot2 to easily visualize our data. Also, by turning total_score into a factor, I can reorder the levels of the factors to reflect the number of matches each for each score. This makes the plot easily comprehensible.

library(viridis)

matches %>% 
  mutate(total_score = home_score + away_score) %>%
  count(total_score, sort = TRUE) %>% 
  mutate(total_score = factor(total_score)) %>% 
  mutate(total_score = fct_reorder(total_score, n)) %>% 
  ggplot() +
  geom_col(aes(total_score, n), fill = "dark blue") +
  xlab("Total Match Score") +
  ylab("Number of Matches") +
  ggtitle("Number of World Cup Matches by Total Score") +
  theme(legend.position = "none") +
  coord_flip()

It looks like total score of 1, 2, and 3 are by far the most common score outcomes for world cup matches. Let’s now look at the number of matches each country has one since 1930. We are going to visualize teams that have one more than 5 games at world cup and see who ranks on top.

matches |> 
  filter(outcome != "D") |>
  count(winning_team, sort = TRUE)|>
  filter(n > 5) |> 
  ggplot() +
  geom_col(aes(fct_reorder(winning_team, n, max), n), fill = "dark green") + 
  coord_flip()

Brazil is by far the most winningest team in World Cup history! But I which country has won the most games in the past 20 years?

matches |> 
  filter(outcome != "D", year >= 2000) |>
  count(winning_team, sort = TRUE)|>
  filter(n > 5) |> 
  ggplot() +
  geom_col(aes(fct_reorder(winning_team, n, max), n), fill = "dark red") + 
  coord_flip()

Germany has overcome Brazil!

Just for fun, let’s see which day of the week most World Cup matches are played. Hopefully this information can help you figure out how to spend your vacation days if you want to see the most action.

We are going to also stratify by which stage the games are being played, so if you are more interested in the games deeper in the bracket, we can see on what days those are played most often.

Here I am removing the number associoted with each Group to get an overall Group category. Next, I am releveling the days of week variable to make them look a little nicer.

matches <- matches %>% 
  mutate(stage = if_else(str_detect(stage, "^Group"), "Group",stage)) %>% 
  mutate(stage = if_else(str_detect(stage, "^ Group"), "Group",stage)) %>% 
  mutate(dayofweek = factor(dayofweek, c("Monday", "Tuesday", "Wednesday",
                                         "Thursday", "Friday", "Saturday", 
                                         "Sunday")))

The plot shows that most of the games are played on Saturday and Sunday. So no need to take a vacation day! But Tuesday and Wednesday seem to have the most Semifinal matches.

matches %>%
  filter(stage != "Group")|>
  ggplot(aes(dayofweek, fill = stage)) + 
  geom_bar()

World Cup dataset

As was mentioned above, the World Cup dataset contains more general information about each World Cup.

wcups
# A tibble: 21 × 10
    year host        winner      second third fourth goals…¹ teams games atten…²
   <dbl> <chr>       <chr>       <chr>  <chr> <chr>    <dbl> <dbl> <dbl>   <dbl>
 1  1930 Uruguay     Uruguay     Argen… USA   Yugos…      70    13    18  434000
 2  1934 Italy       Italy       Czech… Germ… Austr…      70    16    17  395000
 3  1938 France      Italy       Hunga… Braz… Sweden      84    15    18  483000
 4  1950 Brazil      Uruguay     Brazil Swed… Spain       88    13    22 1337000
 5  1954 Switzerland West Germa… Hunga… Aust… Urugu…     140    16    26  943000
 6  1958 Sweden      Brazil      Sweden Fran… West …     126    16    35  868000
 7  1962 Chile       Brazil      Czech… Chile Yugos…      89    16    32  776000
 8  1966 England     England     West … Port… Sovie…      89    16    32 1614677
 9  1970 Mexico      Brazil      Italy  West… Urugu…      95    16    32 1673975
10  1974 Germany     West Germa… Nethe… Pola… Brazil      97    16    38 1774022
# … with 11 more rows, and abbreviated variable names ¹​goals_scored,
#   ²​attendance

We can count the number of total goals scored in a World Cup and rank them by which tournament had the most goals scored.

wcups %>% 
  count(year, wt = goals_scored, sort = TRUE)
# A tibble: 21 × 2
    year     n
   <dbl> <dbl>
 1  1998   171
 2  2014   171
 3  2018   169
 4  2002   161
 5  2006   147
 6  1982   146
 7  2010   145
 8  1994   141
 9  1954   140
10  1986   132
# … with 11 more rows

Let’s also visualize the relationship between number of goals scored and the number of matches played in each World Cup.

wcups %>% 
  ggplot(aes(year, goals_scored)) +
  geom_line() +
  geom_line(aes(year, games))