---
title: "Examining Analysis of Variance Using the NBA Draft"
---

In this activity, we will use data from the NBA draft to explore Analysis of Variance (ANOVA).

## Part 1 - Introduction

#### The NBA Draft

Each year, the National Basketball Association (NBA) holds a draft in which prospective players can be selected to join one of the league's 30 professional teams across the United States and Canada.

To be eligible for the draft, a player must be at least 19 years old and at least one year removed from high school. Before 2006, this rule was not in effect, and players could be drafted straight out of high school.

Each year, the draft consists of 60 players selected over two rounds of 30 picks each. Teams pick in an order based on their performance in the previous season, with teams that performed poorly getting earlier picks in order to create a seemingly more level playing field. Note that each round did not always have 30 selections; the number rose to 30 as more teams were added to the NBA.

#### Playing Basketball

The goal in basketball is to score as many points as possible by throwing the ball through the other team's hoop.

In the middle of the court, a half-court line divides the two sides. On each side, an arc called the three-point line surrounds the hoop. Inside the three-point line is the free-throw line, from which a player shoots after a foul is called.

**Scoring**

Players can shoot towards the hoop from any point on the court. A different number of points is awarded based on where the player is standing when they release the ball. The three shots are explained below.

-   Field Goal: Worth 2 points, scored by shooting from inside the three-point line.
-   Three Pointer: Worth 3 points, scored by shooting from outside the three-point line.
-   Free Throw: Worth 1 point, taken from the free-throw line after a foul.

A rebound occurs when a player gains possession of the ball after a missed shot attempt. At the end of the game, which is played in four twelve-minute quarters, the team with the most points wins.

For this activity, you will be using the data set `nba_draft.csv`. It includes players who were selected in the NBA drafts from 1990 through 2021. Each row represents a different player and holds statistics from that player's NBA career; some of these careers are still ongoing. Key variables include `mins_per_game`, the average number of minutes played per game over a player's career; `round_picked`, the round in which the player was selected; and `pick_in_round`, the player's pick number within that round.

## Part 2 - Loading and Cleaning the Data 

**1.  Load in the data file `nba_draft.csv`.** 

```{r echo=TRUE, warning=FALSE, message=FALSE}
# Make sure nba_draft.csv is downloaded and saved in your working directory.
library(tidyverse, quietly = TRUE)
nba_draft <- read_csv("nba_draft.csv", show_col_types = FALSE)
```

**2.  Right now, we only want to look at players that were selected in the first round of the draft. This information is found in the `round_picked` variable. Write the code that selects only those players who were chosen in the first round. Store it in an object called `nba_round1s`.**

```{r echo=TRUE}
nba_round1s <- nba_draft %>% filter(round_picked == 1)
```

**3. In order to begin our ANOVA exploration, we need to look at different groups within the round one selections. Since we have the order in which the players were selected (stored in the `pick_in_round` variable), we can make an additional categorical variable to keep track of where in the round the player was selected. Make a new variable, called `pick_category`, which satisfies the following conditions:**

- `pick_category` = "1-10" if the player was selected in picks 1-10, 
- `pick_category` = "11-20" if the player was selected in picks 11-20,
- `pick_category` = "21-30" if the player was selected in picks 21-30

```{r}
nba_round1s <- nba_round1s %>%
  mutate(
    pick_category = case_when(
      pick_in_round >= 1 & pick_in_round <= 10 ~ "1-10",
      pick_in_round > 10 & pick_in_round <= 20 ~ "11-20",
      pick_in_round > 20 & pick_in_round <= 30 ~ "21-30"
    )
  )
```

**4. Make the `pick_category` variable a factor with three levels.**

```{r}
nba_round1s$pick_category <- factor(nba_round1s$pick_category,
                                    levels = c("1-10", "11-20", "21-30"))
```
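
As an alternative sketch (not part of the original solution), base R's `cut()` can bin the pick numbers and build the factor in a single step. The vector `toy_picks` below is made up for illustration and only covers the boundary picks; with the activity data, the same call would presumably be applied to `pick_in_round`.

```{r}
# Hypothetical pick numbers chosen to hit each bin boundary (not real draft data)
toy_picks <- c(1, 10, 11, 20, 21, 30)

# cut() uses right-closed intervals by default: (0,10], (10,20], (20,30]
toy_category <- cut(toy_picks,
                    breaks = c(0, 10, 20, 30),
                    labels = c("1-10", "11-20", "21-30"))

# The result is already a factor whose levels follow the order of `labels`
table(toy_category)
```

Because `cut()` returns a factor directly, this route would make the separate `factor()` call unnecessary; `case_when()` remains the more readable choice when the conditions are not simple numeric ranges.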

## Part 3 - ANOVA

**1. Create a box plot that compares the average minutes played per game (`mins_per_game`) across the first-round pick groups (`pick_category`). Try to include a title, a color-coded legend, and axis labels.**

```{r}
ggplot(nba_round1s,
       mapping = aes(x = mins_per_game,
                     y = pick_category,
                     fill = pick_category)) +
  geom_boxplot() +
  labs(title = "Boxplot of MPG based on Round 1 NBA draft picks",
       x = "Minutes Per Game",
       y = "Category in Draft",
       fill = "Pick in NBA Draft") +
  theme_minimal()
```

**2. Below is a density plot and its corresponding code comparing the average minutes played per game based on a player's selection in the first round of the NBA draft. Based on this density plot and the box plots you created above, do you think ANOVA is appropriate? What conclusions can you draw at first glance from the two data visualizations? Are there any concerns?**

![Density Plot](DensityPlot.jpg)

```{r eval=FALSE}
# Mean minutes per game within each pick category
means <- nba_round1s %>%
  group_by(pick_category) %>%
  summarize(mean_value = mean(mins_per_game))

# Overlaid densities by pick category, with a dashed line at each group mean
ggplot(nba_round1s,
       mapping = aes(x = mins_per_game,
                     fill = pick_category)) +
  geom_density(alpha = 0.4) +
  geom_vline(data = means,
             mapping = aes(xintercept = mean_value, color = pick_category),
             linetype = "dashed",
             linewidth = 1,  # `linewidth` replaces `size` for lines in ggplot2 >= 3.4.0
             show.legend = FALSE) +
  labs(title = "Distribution of MPG based on Round 1 NBA draft picks",
       x = "Minutes Per Game",
       y = "Density",
       fill = "Pick in NBA Draft") +
  theme_minimal()
```

The box plots show that the spread within each of the three groups is fairly similar, which supports the equal-variance condition for ANOVA. There are few extreme outliers, which is reassuring when judging the distributions and variances. The distributions of the 11-20 and 21-30 groups look roughly normal, and their means are fairly close together. The 1-10 group is skewed slightly to the right, which is a mild concern. Overall, a one-way ANOVA seems appropriate: the conditions appear to be met, and the plots suggest there could be a significant difference among the groups.
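
The visual check of similar spread can be backed up with a quick numeric one. The sketch below compares group standard deviations on made-up numbers; the data frame `toy` and its columns `group` and `value` are hypothetical stand-ins for `nba_round1s`, `pick_category`, and `mins_per_game`.

```{r}
# Made-up values for three groups of four observations each (not the NBA data)
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 4)),
  value = c(10, 12, 14, 16, 20, 23, 26, 29, 30, 31, 35, 36)
)

# Standard deviation of the response within each group
group_sd <- tapply(toy$value, toy$group, sd)
group_sd

# Ratio of the largest to the smallest group standard deviation
sd_ratio <- max(group_sd) / min(group_sd)
sd_ratio
```

A common rule of thumb in introductory courses treats a ratio below about 2 as consistent with the equal-variance condition. It is a rough guide only and does not replace looking at the plots.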

**3. Assuming an ANOVA test is most likely appropriate for our data, write and interpret in context the null and alternative hypotheses we'll be using to determine if the average number of minutes played per game (`mins_per_game`) differs based on the group in which the player was picked in the first round (`pick_category`).**

* $H_0: \mu_1 = \mu_2 = \mu_3$

  (where $\mu_1$, $\mu_2$, and $\mu_3$ are the population mean minutes played per game for players picked 1-10, 11-20, and 21-30, respectively)
  
  * The mean number of minutes played per game is the same for players who were picked 1-10, 11-20, and 21-30 in the first round of the NBA draft.

* $H_a:$ $\mu_i \neq \mu_j$ for at least one pair of groups $i \neq j$

  * At least one group of draft picks has a mean number of minutes played per game that differs from that of another group.
  
**4. Run an ANOVA test in R and print the summary table** 

```{r}
mpg_aov <- aov(mins_per_game ~ pick_category,
               data = nba_round1s)

summary(mpg_aov)
```

**5. Is there sufficient evidence of a difference in the mean number of minutes played per game based on the group in which a player was selected in the first round of the draft?**

  * There is convincing evidence to reject the null hypothesis that the mean number of minutes played per game is the same regardless of when a player was selected in the first round of the draft. The p-value reported in the ANOVA table is less than $2 \times 10^{-16}$, far below any conventional significance level, so we conclude that at least one group's mean differs from the others.
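
To see where the numbers in an ANOVA table come from, the sketch below rebuilds the F statistic and p-value by hand and compares them with `aov()`. It uses made-up numbers; the data frame `toy` and its columns are hypothetical, not the NBA data. It also shows one way to pull the unrounded p-value out of the summary table, which can be handy when the printout only gives a bound for very small values.

```{r}
# Made-up values for three groups of four observations each (not the NBA data)
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 4)),
  value = c(10, 12, 14, 16, 20, 23, 26, 29, 30, 31, 35, 36)
)

k <- nlevels(toy$group)  # number of groups
n <- nrow(toy)           # total number of observations

group_mean <- tapply(toy$value, toy$group, mean)
group_n    <- tapply(toy$value, toy$group, length)
grand_mean <- mean(toy$value)

# Between-group and within-group sums of squares
ss_between <- sum(group_n * (group_mean - grand_mean)^2)
ss_within  <- sum((toy$value - ave(toy$value, toy$group))^2)

# F = mean square between groups / mean square within groups
f_manual <- (ss_between / (k - 1)) / (ss_within / (n - k))

# The upper tail of the F distribution gives the p-value
p_manual <- pf(f_manual, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# The same quantities pulled from the aov() summary table
toy_table <- summary(aov(value ~ group, data = toy))[[1]]
c(manual = f_manual, aov = toy_table[["F value"]][1])
c(manual = p_manual, aov = toy_table[["Pr(>F)"]][1])
```

A large F means the group means vary much more than would be expected from the variation within groups, which is what drives the small p-value.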
  

**6. Between which groups are we *most likely* to see a significant difference in minutes played? Between which groups are we *least likely* to see one? Provide evidence.**

* Judging by the means in the density plot and box plot, we are most likely to find a difference between players selected 1-10 and those selected 21-30, because their mean minutes played differ the most. We are least likely to find a difference between players selected 11-20 and 21-30, because their means differ the least.

**7. Is there evidence of a difference in average minutes played per game between players drafted in picks 1-10 and players drafted in picks 11-20? If so, what is the difference? Provide evidence as well as a specific number.**

```{r}
TukeyHSD(mpg_aov)
```

* The output gives convincing evidence of a difference in mean minutes played per game between players selected 1-10 and those selected 11-20: the adjusted p-value for that comparison is displayed as 0.0000, well below 0.05. The estimated difference is 7.035, meaning players selected 1-10 played on average about 7.035 more minutes per game than players selected 11-20.
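
Reading a Tukey table correctly depends on knowing which group is subtracted from which. The sketch below runs `TukeyHSD()` on made-up numbers to show the layout; the data frame `toy` and its columns are hypothetical, not the NBA data.

```{r}
# Made-up values for three groups of four observations each (not the NBA data)
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 4)),
  value = c(10, 12, 14, 16, 20, 23, 26, 29, 30, 31, 35, 36)
)

toy_tukey <- TukeyHSD(aov(value ~ group, data = toy))

# One row per pair of groups; columns are diff, lwr, upr, and "p adj"
toy_tukey$group

# Row "B-A" is mean(B) minus mean(A), so a positive diff says B is higher
toy_tukey$group["B-A", "diff"]
```

Each row subtracts the earlier factor level from the later one. With levels ordered "1-10", "11-20", "21-30", the first comparison would be mean(11-20) minus mean(1-10), so a higher average for picks 1-10 would show up as a negative `diff`. An adjusted p-value printed as all zeros has been rounded for display and is not exactly zero.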

**8. Were your predictions in question 6 correct? Are there any comparisons that don't have a significant difference?** 

* The predictions from question 6 were correct. The difference in mean minutes played between players picked 1-10 and 21-30 is the largest (8.98), and the difference between players picked 11-20 and 21-30 is the smallest (1.95). All three differences are significant, with adjusted p-values below 0.05.

 