---
title: "Beijing Olympics - Data cleaning"
format: html
---
# Data Cleaning and Summarization Module

In this document, you will be lead through cleaning one of the two data sets associated with the medal counts at the Beijing Olympics. Then, for the other data set, you will be asked to clean it using similar functions. Finally, you will investigate how to combine the two data sets together.

## 1. Loading in a dataset
```{r}
#| output: false
# These are the necessary packages to load in a csv file. Tidyverse contains many different packages that we will be using in this module.
library(tidyverse)
library(here)
```

How to load in csv files into a data frame.
```{r}
#| output: false
athletes <- read_csv("https://github.com/SCOREnetworkorg/sports-data-repository/raw/main/data/beijing_athletes.csv")
medals <- read_csv("https://github.com/SCOREnetworkorg/sports-data-repository/raw/main/data/beijing_medals.csv")
```

## 2. Data Cleaning
Firstly, lets look at an example of what the data looks like starting with the medals data set.
```{r}
View(medals)
```

The first thing that has to be adjusted is the medal_code data and the athlete_sex data. Currently, the data type is a double and character respectively, but it should really be a factored datatype with 3 levels. To do this, we can take advantage of the forcats and dplyr packages in tidyverse.

```{r}
medalsTidy <- 
  medals |>
  mutate(medal_code = as_factor(medal_code),
         athlete_sex = as_factor(
           case_when(
             athlete_sex == 'M' ~ 'M',
             athlete_sex == 'W' ~ 'W',
             .default = 'O'
             )
           )
         )
# mutate allows you to adjust variables in the data set
```

We should also change the medal_date data type as it currently holds minutes that as shown previously are all 0. To do this we will take advantage of the strftime function and the lubridate package. 

```{r}
medalsTidy <-
  medalsTidy |>
  mutate(medal_date = strftime(medal_date,
                               format = '%Y-%m-%d')) |>
  mutate(medal_date = as_date(medal_date))
```

Another good step to take might be to get rid of the athlete_short_name, and then to separate athlete_name into a first name and last name variable.
```{r}
medalsTidy <-
  medalsTidy |>
  select(-athlete_short_name)
```

```{r}
medalsTidy <-
  medalsTidy |>
  separate(athlete_name,
           sep = '^\\S*\\K\\s+',
           into = c('last_name', 'first_name'))
```

Next, we should get rid of some variables that aren't necessary to include. Such as things like athlete_link, country code for simplification, and discipline code as it will be hard to interpret what the discipline is from the code. We can do this with the select function in dplyr. Lets also change ROC to Russian Olympic Committee so as to be the official name.
```{r}
medalsTidy <-
  medalsTidy |>
  select(-(c('athlete_link', 'discipline_code', 'country_code'))) |>
  mutate(country = ifelse(country == 'ROC', 'Russian Olympic Committee', country))
```

Here is what our cleaned data set now looks like.
```{r}
knitr::kable(head(medalsTidy))
```

### Exercise 1
See if you can apply some of these techniques shown above to the athletes data set. Here is a list of what you should do.

1a. Create a first and last name column and get rid of the short_name column.

1b. Make gender a factored variable.

1c. Get rid of country_code, discipline_code, residence_place, residence_country, and url.

1d. See if you can figure out how to separate the height variable into just meters and convert it into a numeric value.

1e. Notice all the missing values in the height variable, calculate the proportion of missing values and comment on how using height in an analysis may be impacted by the missing values.

1f. Calculate the proportion of missing height values for Ice Hockey and discuss the difference from the previous result.

1g. Calculate the proportion of missing height values for each sport, and sort them by descending order.

## 3. Data Summarization
Next lets look at summarizing some statistics using our cleaned medals data set. First, lets see which country has the most medal winners in our data set using dplyr.
```{r}
medalsTidy |>
  group_by(country) |>
  summarise(medal_count = n()) |>
  arrange(desc(medal_count))
```

### Exercise 2
Using the example above and the filter function in the dplyr package, lets see which country has the most gold medal winners using medal_type.

### Exercise 3
Now try finding which country has the most silver medals using medal_code.

### Exercise 4
Finally, figure out in which discipline the United States of America has received the most medals.

## 4. Joining
### Exercise 5
As one last additional exercise, think about how you would go about combining both the athletes and medals data sets. In particular, what would the `left_join` function represent? (Hint: It depends on which data set is assigned to which role.)