Cluster analysis is a statistical tool that partitions the observations in a data set into sub-populations with similar characteristics. This process can be useful, because similar observations often behave and respond to stimuli in similar ways. Identifying clusters can allow researchers to predict and draw conclusions about the behavior of certain groups. Cluster analysis is used in many fields, including risk analysis, marketing, real estate, insurance, medical research, and earthquake research.
In this module, we’ll use the clustering of NBA players as an example. Suppose you were an NBA General Manager interested in constructing a high-quality team. The best teams use lots of different kinds of players to achieve their goals. Golden State Warriors Guard Stephen Curry is an incredible shooter and ball-handler, but the Warriors need other kinds of players, too. A team composed entirely of Stephen Curry and his clones would struggle to defend or rebound the ball. The team would also struggle to give each Stephen Curry the playing time and shots that he has come to expect. Instead, General Managers can separate potential players into groups, which helps them identify their team’s needs. This is where cluster analysis proves useful.
For this exercise, imagine that you are the General Manager of the Dallas Mavericks. You are tasked with creating a strong, balanced team. Later in the module, you will have an opportunity to create hypothetical trade scenarios that could benefit the team.
Getting Started
Required Packages
We will be using the following packages in this module. Take the time now to make sure these packages are installed and loaded on your computer.
library("parameters")
library("factoextra")
library("tidyverse")
theme_set(theme_minimal())
library("ClusterR")
library("mclust")
library("easystats")
library("here")
library("knitr")
library("kableExtra")
library("condformat")
library("formattable")
library("reactablefmtr")
library("scales")
library("plotly")
library("flextable")
The Data
Our data for this exercise comes from the 2021-2022 NBA Season. This season, the Mavericks finished 4th in the Western Conference with 52 wins and 30 losses under coach Jason Kidd. They exceeded expectations and made the Western Conference Finals.
Our data includes 374 players. Each of these 374 players met our requirements of appearing in at least 25 games and averaging at least 12 minutes per game (a complete game is 48 minutes). Because of midseason trades or acquisitions, some players appear in our data twice, because they met our playing-time requirements for two different teams in the same season. The second entry for such a player is marked with a 1 following his name (e.g. Smith becomes Smith1). We’ve divided the variables into two data sets.
The first set of variables focuses on the influence a player has on the game. Some of these variables are the players’ minutes per game, total games played and started, points and rebounds per game, and field goal attempts per game. This will be helpful in clustering the players into groups of stars, average starters, and reserves. We’ve termed this data set “usage”. Below is a data dictionary for the first set of variables.
| Variable | Explanation | Example |
|---|---|---|
| Name | NBA player's first and last name | Trae Young or Trae Young1 |
| POS | playing position | PG (point guard), SG (shooting guard), SF (small forward), PF (power forward), C (center) |
| Team | abbreviation of city of player's team | atl (Atlanta), bos (Boston), etc. |
| GP | total games played | 46, 70, etc. |
| GS | total games started | 7, 56, etc. |
| MIN | minutes per game | 18.2, 30.2, etc. |
| PTS | points per game | 6.8, 14.9, etc. |
| AST | assists per game | 1.1, 3.5, etc. |
| TO | turnovers per game | 0.8, 1.7, etc. |
| STL | steals per game | 0.5, 1.1, etc. |
| OR | offensive rebounds per game | 0.5, 1.4, etc. |
| DR | defensive rebounds per game | 2.3, 4.1, etc. |
| BLK | blocks per game | 0.2, 0.6, etc. |
| PF | personal fouls per game | 1.5, 2.4, etc. |
| FGM | field goals made per game | 2.6, 5.5, etc. |
| FGA | field goals attempted per game | 5.4, 12.2, etc. |
| 3PM | 3-point field goals made per game | 0.6, 1.9, etc. |
| 3PA | 3-point field goals attempted per game | 1.9, 5.2, etc. |
| FTM | free throws made per game | 0.8, 2.2, etc. |
| FTA | free throws attempted per game | 1.1, 2.8, etc. |
| PER | player efficiency rating metric | 11.74, 17.27, etc. |
| SC-EFF | scoring efficiency | 1.162, 1.332, etc. |
| SH-EFF | shooting efficiency | 0.48, 0.56, etc. |
And here is a small slice of the usage data set.
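| Name | POS | Team | GP | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF | SH-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Trae Young | PG | atl | 76 | 76 | 34.9 | 28.4 | 9.7 | 4.0 | 0.9 | 0.7 | 3.1 | 0.1 | 1.7 | 9.4 | 20.3 | 3.1 | 8.0 | 6.6 | 7.3 | 25.48 | 1.396 | 0.54 |
| John Collins | PF | atl | 54 | 53 | 30.8 | 16.2 | 1.8 | 1.1 | 0.6 | 1.7 | 6.1 | 1.0 | 3.0 | 6.3 | 11.9 | 1.2 | 3.3 | 2.5 | 3.1 | 18.75 | 1.360 | 0.58 |
| Bogdan Bogdanovic | SG | atl | 63 | 27 | 29.3 | 15.1 | 3.1 | 1.1 | 1.1 | 0.5 | 3.5 | 0.2 | 2.1 | 5.4 | 12.6 | 2.7 | 7.3 | 1.5 | 1.8 | 15.49 | 1.196 | 0.54 |
| De'Andre Hunter | SF | atl | 53 | 52 | 29.8 | 13.4 | 1.3 | 1.3 | 0.7 | 0.5 | 2.8 | 0.4 | 2.9 | 4.8 | 10.8 | 1.4 | 3.7 | 2.4 | 3.1 | 10.66 | 1.233 | 0.51 |
| Kevin Huerter | SG | atl | 74 | 60 | 29.6 | 12.1 | 2.7 | 1.2 | 0.7 | 0.4 | 3.0 | 0.4 | 2.5 | 4.7 | 10.3 | 2.2 | 5.6 | 0.6 | 0.7 | 11.91 | 1.174 | 0.56 |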
The second set of variables is helpful in determining a player’s role or function in the game. Some of these variables are Field Goal Percentage, Height, and Weight. Many of the common counting statistics have been converted into per-minute values so that they measure how often a player does something rather than how long he is on the court. These players will be divided into sub-groups like scorers, big men, and wings. We’ve termed this data set “role”. Below is a data dictionary for the second set of variables.
| Variable | Explanation | Example |
|---|---|---|
| Name | NBA player's first and last name | Trae Young or Trae Young1 |
| POS | playing position | PG (point guard), SG (shooting guard), SF (small forward), PF (power forward), C (center) |
| Team | abbreviation of city of player's team | atl (Atlanta), bos (Boston), etc. |
| Height | height in inches | 76, 81, etc. |
| Weight | weight in pounds | 200, 234, etc. |
| PTSPerMin | points per minute | 0.356, 0.515, etc. |
| ASTPerMin | assists per minute | 0.055, 0.133, etc. |
| TOPerMin | turnovers per minute | 0.036, 0.065, etc. |
| STLPerMin | steals per minute | 0.023, 0.038, etc. |
| ORPerMin | offensive rebounds per minute | 0.022, 0.066, etc. |
| DRPerMin | defensive rebounds per minute | 0.101, 0.175, etc. |
| BLKPerMin | blocks per minute | 0.009, 0.027, etc. |
| PFPerMin | fouls per minute | 0.064, 0.099, etc. |
| FGP | field goal percentage | 0.417, 0.496, etc. |
| FGMPerMin | field goals made per minute | 0.131, 0.192, etc. |
| FGAPerMin | field goals attempted per minute | 0.284, 0.419, etc. |
| 3PP | 3-point percentage | 0.306, 0.379, etc. |
| 3PMPerMin | 3-point field goals made per minute | 0.029, 0.072, etc. |
| 3PAPerMin | 3-point field goals attempted per minute | 0.094, 0.192, etc. |
| FTP | free throw percentage | 0.709, 0.842, etc. |
| FTMPerMin | free throws made per minute | 0.039, 0.087, etc. |
| FTAPerMin | free throws attempted per minute | 0.053, 0.112, etc. |
And here is a small slice of the role data set.
| Name | POS | Team | Height | Weight | PTSPerMin | ASTPerMin | TOPerMin | STLPerMin | ORPerMin | DRPerMin | BLKPerMin | PFPerMin | FGP | FGMPerMin | FGAPerMin | 3PP | 3PMPerMin | 3PAPerMin | FTP | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Trae Young | PG | atl | 73 | 180 | 0.814 | 0.278 | 0.115 | 0.026 | 0.020 | 0.089 | 0.003 | 0.049 | 0.460 | 0.269 | 0.582 | 0.382 | 0.089 | 0.229 | 0.904 | 0.189 | 0.209 |
| John Collins | PF | atl | 81 | 235 | 0.526 | 0.058 | 0.036 | 0.019 | 0.055 | 0.198 | 0.032 | 0.097 | 0.526 | 0.205 | 0.386 | 0.364 | 0.039 | 0.107 | 0.793 | 0.081 | 0.101 |
| Bogdan Bogdanovic | SG | atl | 78 | 220 | 0.515 | 0.106 | 0.038 | 0.038 | 0.017 | 0.119 | 0.007 | 0.072 | 0.431 | 0.184 | 0.430 | 0.368 | 0.092 | 0.249 | 0.843 | 0.051 | 0.061 |
| De'Andre Hunter | SF | atl | 80 | 225 | 0.450 | 0.044 | 0.044 | 0.023 | 0.017 | 0.094 | 0.013 | 0.097 | 0.442 | 0.161 | 0.362 | 0.379 | 0.047 | 0.124 | 0.765 | 0.081 | 0.104 |
| Kevin Huerter | SG | atl | 79 | 190 | 0.409 | 0.091 | 0.041 | 0.024 | 0.014 | 0.101 | 0.014 | 0.084 | 0.454 | 0.159 | 0.348 | 0.389 | 0.074 | 0.189 | 0.808 | 0.020 | 0.024 |
Part 1: Idea of similarity/distance - Interactive
Below is a set of ten Dallas Mavericks players from 2021-2022 who met our playing-time restrictions. Kristaps Porzingis was traded in the middle of the season, but he still met our playing-time qualifications for the Dallas Mavericks. For this example, we’ve combined a few of the variables from both the usage and role data sets. Consider the players Sterling Brown, Maxi Kleber, Dwight Powell, and Josh Green.
| Name | Height | Weight | MIN | PTS | OR | DR | AST | STL | BLK | TO | 2PA | 2PP | 3PA | 3PP | 3PAPerMin | ORPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Luka Doncic | 79 | 230 | 35.4 | 28.4 | 0.9 | 8.3 | 8.7 | 1.2 | 0.6 | 4.5 | 12.8 | 0.528 | 8.8 | 0.353 | 0.249 | 0.025 |
| Kristaps Porzingis | 87 | 240 | 29.5 | 19.2 | 1.9 | 5.8 | 2.0 | 0.7 | 1.7 | 1.6 | 9.9 | 0.537 | 5.1 | 0.283 | 0.173 | 0.064 |
| Jalen Brunson | 73 | 190 | 31.9 | 16.3 | 0.5 | 3.4 | 4.8 | 0.8 | 0.0 | 1.6 | 9.6 | 0.545 | 3.2 | 0.373 | 0.100 | 0.016 |
| Tim Hardaway Jr. | 77 | 205 | 29.6 | 14.2 | 0.3 | 3.4 | 2.2 | 0.9 | 0.1 | 0.8 | 5.4 | 0.473 | 7.2 | 0.336 | 0.243 | 0.010 |
| Dorian Finney-Smith | 79 | 220 | 33.1 | 11.0 | 1.5 | 3.2 | 1.9 | 1.1 | 0.5 | 1.0 | 3.2 | 0.599 | 5.4 | 0.395 | 0.163 | 0.045 |
| Dwight Powell | 82 | 240 | 21.9 | 8.7 | 2.1 | 2.8 | 1.2 | 0.5 | 0.5 | 0.8 | 4.4 | 0.703 | 0.5 | 0.351 | 0.023 | 0.096 |
| Reggie Bullock | 78 | 205 | 28.0 | 8.6 | 0.5 | 3.1 | 1.2 | 0.6 | 0.2 | 0.6 | 1.6 | 0.550 | 5.8 | 0.360 | 0.207 | 0.018 |
| Maxi Kleber | 82 | 240 | 24.6 | 7.0 | 1.2 | 4.7 | 1.2 | 0.5 | 1.0 | 0.8 | 1.7 | 0.586 | 4.3 | 0.325 | 0.175 | 0.049 |
| Josh Green | 77 | 200 | 15.5 | 4.8 | 0.8 | 1.6 | 1.2 | 0.7 | 0.2 | 0.7 | 2.7 | 0.573 | 1.2 | 0.359 | 0.077 | 0.052 |
| Sterling Brown | 77 | 219 | 12.8 | 3.3 | 0.5 | 2.5 | 0.7 | 0.3 | 0.1 | 0.5 | 1.3 | 0.492 | 1.9 | 0.304 | 0.148 | 0.039 |
Exercise 1
For these four players, compare their available statistics.
Which of the four players are most similar kinds of players? Which variables make them similar?
On which variables do they differ most? Which of the four players are the most “different”? Which variables differentiate them the most? Are they similar in any of the categories?
One common and effective way to compare the similarity of two points (or in this case, players) is the Euclidean distance. For two players $p$ and $q$ measured on $n$ variables, the distance is given by the following formula:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
You can visualize this as drawing the shortest line possible between two points and then measuring it. Right now, our variables are in different units (inches, pounds, points, percentage, etc.), so we’ll standardize each of the variables (more on this later) to put them on a common scale. This helps each variable to have equal importance in our distance formula.
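To make the distance calculation concrete, the sketch below standardizes four variables for the four players discussed above and computes their pairwise Euclidean distances with base R's `scale()` and `dist()`. The choice of these four variables and the object names (`players`, `standardized`, `distances`, `manual`) are assumptions for illustration. The module's own distances were presumably computed from a different set of variables, so the numbers produced here should not be expected to match them.

```r
# Values copied from the Mavericks table above (four players, four variables).
# Which variables to include is an assumption made for this sketch.
players <- data.frame(
  Height = c(82, 82, 77, 77),
  Weight = c(240, 240, 200, 219),
  MIN    = c(21.9, 24.6, 15.5, 12.8),
  PTS    = c(8.7, 7.0, 4.8, 3.3),
  row.names = c("Dwight Powell", "Maxi Kleber", "Josh Green", "Sterling Brown")
)

# Standardize each column to mean 0 and standard deviation 1
standardized <- scale(players)

# Pairwise Euclidean distances between the standardized rows
distances <- dist(standardized, method = "euclidean")
print(round(distances, 3))

# The same distance for one pair, written out from the formula
manual <- sqrt(sum((standardized["Dwight Powell", ] -
                    standardized["Maxi Kleber", ])^2))
```

Standardizing before calling `dist()` is what keeps Weight (measured in the hundreds) from dominating Points or Minutes.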
Below is a table of the distances between each of the players. Match up the player in the column with the player in the row and you’ll find the distance between them. The smaller the value, the more similar the players are.
| | Dwight Powell | Maxi Kleber | Josh Green |
|---|---|---|---|
| Maxi Kleber | 4.269475 | | |
| Josh Green | 4.554980 | 4.270914 | |
| Sterling Brown | 5.846063 | 3.940473 | 3.102775 |
Below is a visualization of the distances. As the distances increase, the color changes from red to blue. Players matched with themselves will be dark red, because their distance is 0.
fviz_dist(Distance, gradient = list(low = "indianred3", mid = "white", high = "dodgerblue3"))
Exercise 2
Do the tabulated results agree with your previous assessment?
Which is more accurate: your original assessment or the similarity metric?
Part 2: Performing a Cluster Analysis
Calculating the distance between points is the first step in a distance-based cluster analysis. The players with the smallest distance (or with the most similarity) between them are naturally placed in a cluster together.
How does the clustering actually work? As an illustration, we’ll use a basic plot of the Offensive Rebounds and 3-Point Shooting of our Dallas Mavericks players. We’ve standardized the results by adjusting them to per-minute values.
Exercise 3
What do you notice about the data? How would you group the players?
How would you describe these groupings?
In a cluster analysis, every point needs to belong to a cluster. Do any points not seem to have a cluster?
Cluster analysis is the process of partitioning the data into sub-populations or clusters. This is done so that observations in the same cluster are more similar to each other than observations in a different group. These clusters then can be analyzed.
One common distance-based method for dividing the data into clusters is the K-Means Algorithm. It is performed in an unsupervised fashion, which means that the clusters are found by the algorithm and not predetermined by the researcher. In the NBA example, we cannot determine our clusters beforehand. The algorithm may confirm our original intuition, but this is not guaranteed.
The K-Means Algorithm assigns the data to clusters so that the sum of squared distances between each observation and the center (or mean) of its cluster is minimized. In other words, it tries to make the variance of the points within each cluster as small as possible. One downside of the K-Means Algorithm is that users must choose the number of clusters in advance. This is entered as the parameter K. Let’s say we want to separate our data into K = 2 clusters. The K-Means Algorithm will go through four basic steps:
1. Randomly select two initial cluster centers.
2. Assign each observation to the closest center.
3. Calculate the mean of all the observations within each cluster. These cluster means become the new center of each cluster.
4. Repeat steps 2-3 until no further changes are made.
As these steps are followed, the clusters will move closer and closer to their final positions. Since the first step is to randomly assign cluster centers, the K-Means approach can occasionally yield different results. It’s worth trying it a few different times with different starting points.
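The four steps can be sketched in a few lines of base R. This is an illustrative implementation under simplifying assumptions (initial centers are drawn from the observations themselves, and an empty cluster simply keeps its old center), not the routine that `kmeans()` uses internally. The function name `simple_kmeans`, the matrix `mavs`, and the shortened player labels are hypothetical; the values are the per-minute columns from the Mavericks table.

```r
# 3-point attempts and offensive rebounds per minute, from the Mavericks table
mavs <- cbind(
  ThreePAPerMin = c(0.249, 0.173, 0.100, 0.243, 0.163, 0.023, 0.207, 0.175, 0.077, 0.148),
  ORPerMin      = c(0.025, 0.064, 0.016, 0.010, 0.045, 0.096, 0.018, 0.049, 0.052, 0.039)
)
rownames(mavs) <- c("Doncic", "Porzingis", "Brunson", "Hardaway Jr.", "Finney-Smith",
                    "Powell", "Bullock", "Kleber", "Green", "Brown")

simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: pick k distinct observations as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared distance from every observation to every center,
    # then assign each observation to its closest center
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_assignment <- max.col(-d2, ties.method = "first")
    # Step 4: stop once no observation changes cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Step 3: move each center to the mean of its current members
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centers)
}

set.seed(1)  # arbitrary seed so the random start is repeatable
fit <- simple_kmeans(mavs, k = 2)
split(rownames(mavs), fit$cluster)
```

Rerunning with a different seed changes the starting centers, which is why it is worth trying several starts and comparing the results.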
Before you look below, provide your estimation of the two clusters of our Dallas Mavericks players. Where would you anticipate the cluster centers to be located?
The code below runs the k-means algorithm. In the kmeans function, the first argument is the data, the second is the number of clusters to be fit (i.e. \(k\)) and nstart is the number of random starting points to use for the algorithm.
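As a minimal sketch of such a call, the block below runs `kmeans()` with two centers on the per-minute values from the Mavericks table. The matrix name `mavs`, the shortened player labels, and the seed are assumptions for illustration; this is not the module's own `dallasKMeans_prep` object, which may have been prepared differently.

```r
# 3-point attempts and offensive rebounds per minute, from the Mavericks table
mavs <- cbind(
  ThreePAPerMin = c(0.249, 0.173, 0.100, 0.243, 0.163, 0.023, 0.207, 0.175, 0.077, 0.148),
  ORPerMin      = c(0.025, 0.064, 0.016, 0.010, 0.045, 0.096, 0.018, 0.049, 0.052, 0.039)
)
rownames(mavs) <- c("Doncic", "Porzingis", "Brunson", "Hardaway Jr.", "Finney-Smith",
                    "Powell", "Bullock", "Kleber", "Green", "Brown")

set.seed(1)  # arbitrary seed, used only to make the sketch repeatable

# data first, then the number of clusters; nstart = 50 tries 50 random
# starts and keeps the solution with the smallest total within-cluster
# sum of squares
fit2 <- kmeans(mavs, centers = 2, nstart = 50)

fit2$centers                        # coordinates of the two cluster centers
fit2$size                           # number of players in each cluster
split(rownames(mavs), fit2$cluster) # which players landed in which cluster
```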
Notice the large points in the middle of each cluster. These are the cluster centers. Are they where you expected?
How do you think the groupings will change with three clusters?
We can easily tell K-Means to start from three randomly chosen centers, and the process of assigning points to the nearest cluster mean will continue exactly as before.
set.seed(3)
dallas3Means <- kmeans(dallasKMeans_prep, centers = 3, nstart = 50)
dallas3fviz <- fviz_cluster(dallas3Means, dallasKMeans_prep,
                            show.clust.cent = TRUE, stand = FALSE,
                            labelsize = 7, pointsize = 1,
                            main = "Mavericks K = 3 Clusters",
                            xlab = "3 Point Attempts Per Minute",
                            ylab = "Offensive Rebounds Per Minute")
dallas3fviz
Or four clusters?
set.seed(22329)
dallas4Means <- kmeans(dallasKMeans_prep, centers = 4, nstart = 50)
dallas4fviz <- fviz_cluster(dallas4Means, dallasKMeans_prep,
                            show.clust.cent = TRUE, stand = FALSE,
                            labelsize = 7, pointsize = 1,
                            main = "Mavericks K = 4 Clusters",
                            xlab = "3 Point Attempts Per Minute",
                            ylab = "Offensive Rebounds Per Minute")
dallas4fviz
Exercise 5
What happens to Dwight Powell when we increase $k$ to 4?
Would Dwight be considered an outlier? Why? Is this helpful from a clustering perspective?
Now consider five clusters.
set.seed(102)
dallas5Means <- kmeans(dallasKMeans_prep, centers = 5, nstart = 50)
dallas5fviz <- fviz_cluster(dallas5Means, dallasKMeans_prep,
                            show.clust.cent = TRUE, stand = FALSE,
                            labelsize = 7, pointsize = 1,
                            main = "Mavericks K = 5 Clusters",
                            xlab = "3 Point Attempts Per Minute",
                            ylab = "Offensive Rebounds Per Minute")
dallas5fviz
At some point, the power of clustering the points begins to fade. Does Dwight Powell deserve to be in a cluster of his own? Possibly. Does Reggie Bullock? Definitely not.
Exercise 6
Which of the four values of K did you find most useful or accurate?
Were there ever too few or too many clusters?
Part 4: Choosing the Number of Clusters
So, how can we choose the optimal number of clusters?
It’s helpful to evaluate the effectiveness of the clusters for each value of K. There are plenty of ways to test this effectiveness, but we’ll walk through a common one called the Elbow Method. The Elbow Method totals up the squared distances between each observation and the center of its cluster. This is called the Total Within Summed Squares (TWSS). As K increases and more clusters are added to the model, the TWSS will decrease. Eventually, the value of each additional cluster diminishes. The Elbow Method plots the TWSS against K, and the user can look for the point where increasing the number of clusters no longer proves useful. Often, this point looks like an elbow.
The graph demonstrates that the value of each additional cluster decreases as more clusters are added. The bends in the graph indicate that clusters beyond four have little value. Despite being common, the Elbow Method is often ambiguous and difficult to interpret. Look for the bend in the Elbow Plot. K = 2, K = 3, and K = 4 all seem like reasonable conclusions.
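An elbow plot of this kind can be sketched with base R alone by fitting `kmeans()` for several values of K and collecting `tot.withinss`, the total within-cluster sum of squares. The matrix `mavs`, the range of K, and the seed are assumptions for illustration, so the curve drawn here is not necessarily the one discussed above.

```r
# 3-point attempts and offensive rebounds per minute, from the Mavericks table
mavs <- cbind(
  ThreePAPerMin = c(0.249, 0.173, 0.100, 0.243, 0.163, 0.023, 0.207, 0.175, 0.077, 0.148),
  ORPerMin      = c(0.025, 0.064, 0.016, 0.010, 0.045, 0.096, 0.018, 0.049, 0.052, 0.039)
)

set.seed(1)  # arbitrary seed so the sketch is repeatable
ks <- 1:6

# Total within-cluster sum of squares for each candidate K
twss <- sapply(ks, function(k) kmeans(mavs, centers = k, nstart = 50)$tot.withinss)

# Look for the K after which the curve flattens out
plot(ks, twss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow plot (sketch)")
```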
The Elbow plot is just one test to determine the optimal number of clusters. Two other popular methods are the Average Silhouette Method and the Gap Statistic Method. In all, there are dozens of methods to determine the ideal number of clusters and they often disagree. We’ll take a consensus of 27 methods and proceed from there.
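As one example of an alternative criterion, the sketch below computes the average silhouette width for a few values of K. It relies on `silhouette()` from the `cluster` package, which is an assumption here because the module does not load that package itself (it ships with standard R installations). The matrix `mavs` and the seed are also illustrative. This shows a single criterion only and does not reproduce the consensus of methods mentioned above.

```r
library(cluster)  # provides silhouette(); not among the packages loaded earlier

# 3-point attempts and offensive rebounds per minute, from the Mavericks table
mavs <- cbind(
  ThreePAPerMin = c(0.249, 0.173, 0.100, 0.243, 0.163, 0.023, 0.207, 0.175, 0.077, 0.148),
  ORPerMin      = c(0.025, 0.064, 0.016, 0.010, 0.045, 0.096, 0.018, 0.049, 0.052, 0.039)
)

set.seed(1)  # arbitrary seed so the sketch is repeatable
ks <- 2:5    # the silhouette is undefined for a single cluster

avg_sil <- sapply(ks, function(k) {
  fit <- kmeans(mavs, centers = k, nstart = 50)
  sil <- silhouette(fit$cluster, dist(mavs))
  mean(sil[, "sil_width"])  # average silhouette width across all players
})
names(avg_sil) <- ks

# Larger average widths suggest tighter, better separated clusters
round(avg_sil, 3)
```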
The tests give varied estimates for the optimal number of clusters, and it is up to you to decide how many clusters to use in your K-Means Algorithm. It’s common practice to choose several values and compare the results of each.
From there, we would conduct our analysis of each cluster and examine the results.
After the clustering is completed, how can we analyze our clustering solution?
We want to reduce the Total Within Summed Squares (TWSS) or distance from each observation to its cluster mean, but we also want to minimize the total number of clusters used.
Two helpful measurements to summarize these preferences for our clusters are intra-class similarity and inter-class similarity.
Intra-class similarity tests the relationship between observations of the same cluster. We want this similarity to be high. We want all the observations in a cluster to exhibit similar features.
Inter-class similarity tests the relationship between different clusters. We want this relationship to be low. Ideally, each cluster is distinct and the observations within can be clearly assigned to a cluster.
As we increase the number of clusters, K, the intra-class similarity will increase, because observations will be assigned to smaller clusters that are more representative. However, the inter-class similarity will also increase, because the cluster centers are now closer together. This is why it is impractical to choose a large value for K.
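One rough way to watch this trade-off is to track, for each K, the total within-cluster sum of squares (smaller means members sit closer to their own center) alongside the distance between the two closest cluster centers (smaller means neighboring clusters are harder to tell apart). These two summaries are stand-ins chosen for this sketch, not formal definitions of intra-class and inter-class similarity, and `mavs`, `summary_by_k`, and the seed are hypothetical names and settings.

```r
# 3-point attempts and offensive rebounds per minute, from the Mavericks table
mavs <- cbind(
  ThreePAPerMin = c(0.249, 0.173, 0.100, 0.243, 0.163, 0.023, 0.207, 0.175, 0.077, 0.148),
  ORPerMin      = c(0.025, 0.064, 0.016, 0.010, 0.045, 0.096, 0.018, 0.049, 0.052, 0.039)
)

set.seed(1)  # arbitrary seed so the sketch is repeatable

summary_by_k <- t(sapply(2:5, function(k) {
  fit <- kmeans(mavs, centers = k, nstart = 50)
  c(K = k,
    within_ss = fit$tot.withinss,          # tightness of the clusters
    closest_centers = min(dist(fit$centers)))  # gap between the nearest two centers
}))

round(summary_by_k, 4)
```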
Recall our clustering for the Dallas Mavericks players.
Which value of K has the highest intra-class similarity?
Which cluster specifically?
Which value of K has the highest inter-class similarity?
Part 5: A Larger Dataset
Let’s focus now on our larger data set with many more variables and observations. It seems like it’d be more complicated, but the process is almost exactly the same. One important distinction to remember is that the large number of dimensions makes the data difficult to visualize. There are different methods that aid in this visualization. We’ll walk you through the usage data set and demonstrate appropriate analysis, and then allow you to work through the role data set.
Remember the usage data set? It contains variables aimed at categorizing the workload and skill of the players. We hope to divide players into sub-groups like stars and bench players.
It is very important that we standardize the data first. Many of our variables have different units, so Games Played and Blocks per game are hard to compare without scaling. Without standardizing, variables with large values, like Games Started or Games Played, will exert too much influence on the distances. After standardizing, each value is described in relation to the other observations. For example, Trae Young’s standardized assists value is 3.656, meaning he averages about 3.7 standard deviations more assists than the average player in our data set. Often, the standardized data is difficult to contextualize, so we’ll want to convert the data back for analysis. Below is a small glimpse into what our standardized data looks like.
| Name | POS | Team | GP | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF | SH-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Trae Young | PG | atl | 1.212 | 1.690 | 1.478 | 2.819 | 3.656 | 3.186 | 0.346 | -0.422 | -0.191 | -0.961 | -0.433 | 2.418 | 2.446 | 2.049 | 1.883 | 3.457 | 2.948 | 2.481 | 0.910 | 0.085 |
| John Collins | PF | atl | -0.221 | 0.820 | 0.903 | 0.802 | -0.386 | -0.291 | -0.479 | 0.857 | 1.514 | 1.296 | 1.704 | 0.984 | 0.623 | -0.071 | -0.116 | 0.540 | 0.502 | 0.924 | 0.676 | 0.797 |
| Bogdan Bogdanovic | SG | atl | 0.365 | -0.163 | 0.692 | 0.620 | 0.279 | -0.291 | 0.896 | -0.678 | 0.036 | -0.710 | 0.224 | 0.568 | 0.775 | 1.603 | 1.585 | -0.172 | -0.255 | 0.170 | -0.389 | 0.085 |
| De'Andre Hunter | SF | atl | -0.286 | 0.782 | 0.762 | 0.339 | -0.642 | -0.052 | -0.204 | -0.678 | -0.362 | -0.209 | 1.539 | 0.290 | 0.385 | 0.152 | 0.054 | 0.469 | 0.502 | -0.948 | -0.149 | -0.449 |
| Kevin Huerter | SG | atl | 1.082 | 1.085 | 0.734 | 0.124 | 0.074 | -0.172 | -0.204 | -0.806 | -0.248 | -0.209 | 0.882 | 0.244 | 0.276 | 1.045 | 0.862 | -0.812 | -0.895 | -0.658 | -0.532 | 0.441 |
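Standardizing and converting back can be sketched on a single variable. The block below uses the points per game of the five Atlanta players from the usage slice shown earlier and assumes the usual z-score (subtract the mean, divide by the standard deviation). Because it standardizes over only these five players rather than all 374, its z-scores will differ from the values in the table above. The object names are hypothetical.

```r
# Points per game for the five Atlanta players in the usage slice
pts <- c("Trae Young" = 28.4, "John Collins" = 16.2, "Bogdan Bogdanovic" = 15.1,
         "De'Andre Hunter" = 13.4, "Kevin Huerter" = 12.1)

# z-score by hand: how many standard deviations from the mean is each player?
z <- (pts - mean(pts)) / sd(pts)
round(z, 3)

# scale() applies the same calculation to every column of a matrix or data frame
z_scaled <- scale(pts)

# Converting back to the original units reverses the two steps
back <- z * sd(pts) + mean(pts)
```

When a whole data set has been passed through `scale()`, the column means and standard deviations needed for the reverse step are stored in its `"scaled:center"` and `"scaled:scale"` attributes.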
Let’s begin by taking a look at the Elbow plot of the usage dataset.
The Elbow plot shows that the algorithm experiences diminishing returns after K = 2 and K = 3. From the Elbow Plot, we would expect that the consensus lies somewhere between 2 and 5 clusters. Now consider the multiple methods for the selection of $k$.
The tests favor three clusters. Some tests also prefer two and four clusters, so those models are worth a look.
But before we begin, let’s first look through the variables in our analysis and see which ones have the most influence on the clustering. If some have little or no influence, we can simplify our analysis by removing them.
The visualization below demonstrates the differences between our two clusters. The variables that have large differences are important in the clustering assignment. They greatly influence the assignment of an observation.
as_tibble(usage2Means$centers, rownames = "cluster") %>%
  pivot_longer(cols = c(GP:`SH-EFF`), names_to = "variable") %>%
  group_by(variable) %>%
  summarise(Influence = abs(mean(value))) %>%
  mutate(variable = factor(variable, levels = usage_levels)) %>%
  ggplot(aes(x = variable, y = Influence)) +
  geom_bar(stat = "identity", fill = "cadetblue3") +
  labs(title = "Influence on Cluster Assignment", x = "", y = "") +
  theme(axis.text.y = element_blank(),
        legend.position = "none",
        axis.text.x = element_text(angle = -45, size = 9))
This type of exercise is essential for clustering analysis, because it allows one to see which variables are important to consider when classifying an observation.
This visualization scales the centers of the variables for each cluster and contrasts them. Variables with large positive or negative values have a large influence on the clustering. These variables help differentiate the cluster. Variables with an influence close to 0 have less importance.
We see a great diversity in the variables that possess significant influence on the clustering.
Exercise 8
Which variables seem to contribute the most to the clustering result?
Which variables contribute the least to the clustering result?
Scoring Efficiency and Shooting Efficiency both lack influence. Games Played, Offensive Rebounds, and Blocks also contribute little to our clustering. We chose to remove Shooting Efficiency and keep the other four, though we could easily have removed them as well.
Now that we’ve trimmed the variables, let’s see how many observations fall within each cluster.
| Cluster | Size |
|---|---|
| 1 | 119 |
| 2 | 255 |
The clusters are not identical in size, and the difference is large enough that we should keep an eye on it. It’s important to verify that each of the clusters contains a meaningful number of observations. Still, as we saw with Dwight Powell earlier, small clusters can sometimes tell us valuable information about the observations they contain.
The K-Means Algorithm will assign each observation a cluster and print out descriptive statistics of each cluster. This can give us a good idea of what makes up each cluster. We went back and unstandardized the data.
Generally, it looks like cluster 1 contains starter caliber players and cluster 2 includes the bench players. This helps to explain why cluster 1 is a bit smaller than cluster 2.
Now, let’s look at the clusters graphically. This can help us to see how different the clusters really are from each other. The graph is created by combining the values of all the variables into two summary dimensions that are easy to plot. This is done through a process called Principal Component Analysis (PCA).
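As a rough sketch of what PCA does, the block below reduces four per-game variables for the ten Mavericks players from the earlier table to two principal components with base R's `prcomp()` and plots the players on those two axes. The choice of variables, the shortened player labels, and the object names are assumptions for illustration; this is not the usage data set.

```r
# Four per-game variables for the ten Mavericks players, from the earlier table
mavs_stats <- cbind(
  MIN = c(35.4, 29.5, 31.9, 29.6, 33.1, 21.9, 28.0, 24.6, 15.5, 12.8),
  PTS = c(28.4, 19.2, 16.3, 14.2, 11.0, 8.7, 8.6, 7.0, 4.8, 3.3),
  DR  = c(8.3, 5.8, 3.4, 3.4, 3.2, 2.8, 3.1, 4.7, 1.6, 2.5),
  AST = c(8.7, 2.0, 4.8, 2.2, 1.9, 1.2, 1.2, 1.2, 1.2, 0.7)
)
rownames(mavs_stats) <- c("Doncic", "Porzingis", "Brunson", "Hardaway Jr.", "Finney-Smith",
                          "Powell", "Bullock", "Kleber", "Green", "Brown")

# Center and scale each variable, then find the directions of greatest variation
pca <- prcomp(mavs_stats, center = TRUE, scale. = TRUE)

# Each player's coordinates on the first two principal components
scores <- pca$x[, 1:2]

# Share of the total variance captured by each component
variance_share <- pca$sdev^2 / sum(pca$sdev^2)
round(variance_share, 3)

# Two-dimensional view of four-dimensional data
plot(scores, type = "n", main = "Mavericks on the first two principal components")
text(scores, labels = rownames(scores), cex = 0.7)
```

The more variance the first two components capture, the more faithfully a two-dimensional plot represents the distances in the full data.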
usage2fviz <- fviz_cluster(usage2Means, usageKMeans_prep,
                           geom = "point",
                           show.clust.cent = TRUE, stand = FALSE,
                           pointsize = 1,
                           main = "Usage K = 2 Clusters")
usage2fviz
Many of the observations in both clusters lie close to the border. This indicates that the division between the clusters was close and there may be some observations that could have been placed in either cluster. The centers are fairly close and located at about (-3,0) and (2,0).
There are several large outliers in both clusters, especially in the lower portion of the visualization and in the left portion of cluster 1.
Prototypes
To help us understand the clusters better, let’s look at some players that fall very close to the cluster center. We’ll call the players that represent the cluster well prototype players.
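One way to find such players is to measure each observation's distance to the center of its own cluster and sort by it. The sketch below does this for the two-variable Mavericks example rather than the usage data; `mavs`, `ranked`, the shortened player labels, and the seed are hypothetical names and settings.

```r
# 3-point attempts and offensive rebounds per minute, from the Mavericks table
mavs <- cbind(
  ThreePAPerMin = c(0.249, 0.173, 0.100, 0.243, 0.163, 0.023, 0.207, 0.175, 0.077, 0.148),
  ORPerMin      = c(0.025, 0.064, 0.016, 0.010, 0.045, 0.096, 0.018, 0.049, 0.052, 0.039)
)
rownames(mavs) <- c("Doncic", "Porzingis", "Brunson", "Hardaway Jr.", "Finney-Smith",
                    "Powell", "Bullock", "Kleber", "Green", "Brown")

set.seed(1)  # arbitrary seed so the sketch is repeatable
fit <- kmeans(mavs, centers = 2, nstart = 50)

# Look up, for every player, the coordinates of his own cluster's center
own_center <- fit$centers[fit$cluster, , drop = FALSE]

# Euclidean distance from each player to that center
dist_to_center <- sqrt(rowSums((mavs - own_center)^2))

ranked <- data.frame(player = rownames(mavs),
                     cluster = fit$cluster,
                     distance = dist_to_center)

# Smallest distances first: the top rows of each cluster are its best representatives
ranked <- ranked[order(ranked$cluster, ranked$distance), ]
ranked
```

Sorting the same column in decreasing order lists the players who sit farthest from their cluster's center.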
Consider Khris Middleton, Miles Bridges, and Gordon Hayward. The three players all play a similar position; one that allows them to contribute in all areas of the game. There was significant variety in the number of Games Played, but they Started in each game and received a lot of playing time. They all played over 30 Minutes per game and scored about 20 Points a game. Their Rebound, Assist, Block, and Turnover totals vary a little bit, but they are all fairly high. They all took and made roughly the same number of shots per game (15.2-15.9 FGA) and (6.8-7.5 FGM).
Let’s move on to cluster 2. First, notice how much smaller the distances are from the cluster 2 center. More observations lie close to cluster 2’s center than to cluster 1’s. This is not entirely surprising, as cluster 2 contains more than twice as many players as cluster 1.
Again consider potential prototypes for the second cluster.
Once again, the prototypes look like an average NBA player. They each played around 55 Games and Started in very few of them. They played about 17.1-20.3 Minutes a game and scored from 6.4-8.1 Points a game. Their Rebound, Assist, Steal, Block, Turnover, and Foul values are fairly low and generally close together. They also don’t take as many shots as cluster 1 - only about 6 Field Goal Attempts per game.
Outliers
Now, let’s look through some of the players that fall farthest from the center of their cluster. These players are cluster outliers. In these cases, the clustering least represents the observation. These players are very different from the center. It can be helpful to identify and explain outliers by comparing them to our prototype players. How do they differ? What attributes led to their classification?
Sometimes, you’ll need to do some digging on the outliers. We chose to show you Khris Middleton and Blake Griffin’s characteristics again for comparison. Joel Embiid, Giannis Antetokounmpo, and Myles Turner represent two very different kinds of outliers. Embiid and Giannis are superstars. They finished second and third in the MVP voting in the 2021-2022 season. They are very far from the prototype of cluster 1, but they are even further from the prototype of cluster 2. These are the points near (-10, -5) in the visualization.
Myles Turner, however, possesses some attributes that could be classified as cluster 1 and others that could be classified as cluster 2. He played lots of Minutes, Started most games, and had strong Rebounding values. However, his shooting numbers fall right between the clusters, and he doesn’t tally very many Points, Assists, Steals, or Turnovers. This point is likely the (-5, -9) outlier in the visualization. He is a borderline case.
These cluster 2 outliers are all similar players. Robert Williams III, Mitchell Robinson, and Clint Capela are all big men. Like Myles Turner, they are players that play a lot of Games and Minutes, get lots of Rebounds and Blocks, but don’t shoot very much. Our data emphasizes shooting a lot and perhaps this leaves players like these without an appropriate cluster. They are borderline candidates that perhaps would benefit from another cluster.
Now, let’s analyze the strength of K = 2 clusters. For reference, we’ve repeated the visualization below.
usage2fviz
The two clusters possess strong inter-class differences. For only two clusters, cluster 1 and cluster 2 are fairly distinct. The centers are far apart and demonstrate two different classifications of players. Cluster 1 is clearly a sub-population of starting, high-volume players and cluster 2 is a sub-population of bench players. Still, we’ve analyzed the outliers and found some players that could fall in either cluster. There could be some confusion for players like Robert Williams and Myles Turner. These players seem more similar to each other than most of the players in their own cluster. These outliers fall around (-2, -7). Check the visualizations again to see the cluster of players near there.
The intra-class similarity is fairly low. The clusters are large and have many outliers in each of the directions. Players like Giannis Antetokounmpo, Khris Middleton, and Myles Turner have little in common, but they are all grouped into cluster 1. Yet, most of cluster 1 produce larger values and most of cluster 2 have smaller numbers.
K = 3 Clusters - Interactive
# keep in case of reset
set.seed(4)
usage3Means <- kmeans(usageKMeans_prep, centers = 3, nstart = 50)
Now, let’s look at the consensus tests’ most popular number of clusters: K = 3. Here, we’d like you to produce your own analysis of the results. If you need help, look back at the K = 2 example.
As you progress, fill out this table with descriptors of the three clusters. This will be helpful for you as you try to identify their distinctions.
Once again, let’s first look through the variables in our analysis and see which ones have the most influence on the clustering.
This visualization plots the centers for each variable in a cluster. At a glance, this helps us to understand the characteristics of each cluster. We can see that cluster 2, for example, has high offensive rebounds and blocks per game, but low 3 point attempts and 3 point makes.
It can also tell us which variables are unimportant. If a variable has a similar mean in all three clusters, then the variable does not help us to distinguish between the clusters. If a variable has a large positive value in one cluster and a large negative value in another, then that variable is very useful for classifying our data.
# creates a dataset of each variable and the standardized center and graphs it
as_tibble(usage3Means$centers, rownames = "cluster") %>%
  pivot_longer(cols = c(GP:`SH-EFF`), names_to = "variable") %>%
  mutate(variable = factor(variable, usage_levels)) %>%
  ggplot(aes(x = variable, y = value, fill = cluster)) +
  geom_bar(stat = "identity") +
  facet_grid(rows = vars(cluster)) +
  theme(axis.text.x = element_text(angle = -45, hjust = 0, size = 10)) +
  scale_y_continuous(position = "right") +
  labs(title = "Influence on the Cluster Assignment", x = "", y = "Cluster") +
  theme(axis.text.y = element_blank(),
        legend.position = "none")
Before you analyze, remember that variables with a strong negative value still have large influence. It’s just a negative association with a variable instead of a positive association.
What do you notice about the variables? Which kinds of variables possess significant influence? Some variables have a strong influence in one cluster, but a weak influence in another cluster. Why is this?
After analyzing, would you choose to remove any variables from the data?
Is there a better way to look at the variables and remove the less influential ones?
We chose to remove the Games Played variable, because its influence was close to 0 in all three clusters. All of the other variables had a large effect in some category.
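One possible way to make this judgment less visual is to rank the variables by how far apart their standardized cluster centers are. The sketch below is only an illustration on synthetic data: the column names `scoring`, `rebounding`, `passing`, and `noise` are hypothetical, and the spread measure (range of the centers) is one reasonable choice among several, not the method used in this module.

```r
set.seed(2)
# Hypothetical data: three variables separate the groups, one is pure noise.
group <- rep(1:3, each = 40)
toy <- cbind(
  scoring    = rnorm(120, mean = c(-3, 0, 3)[group]),
  rebounding = rnorm(120, mean = c(3, -3, 0)[group]),
  passing    = rnorm(120, mean = c(0, 3, -3)[group]),
  noise      = rnorm(120)
)
toy <- scale(toy)  # z-score every column before clustering

toy_km <- kmeans(toy, centers = 3, nstart = 25)

# Range of each variable's center across the clusters.
# A value near 0 means the variable barely distinguishes the clusters.
center_spread <- apply(toy_km$centers, 2, function(v) diff(range(v)))
sort(center_spread)
```

A variable at the top of the sorted output (smallest spread) plays the role that Games Played played above.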
# reproducing K = 3 means without insignificant variables
set.seed(4)
usage3Means <- usageKMeans_prep %>%
  select(-GP) %>%
  kmeans(centers = 3, nstart = 50)
# creating a second usage without those variables so i don't have to reproduce it 800 million times.
usage3 <- usage %>%
  select(-GP)
usage_rm3 <- usage_rm %>%
  select(-GP)
Now that we’ve removed a variable, let’s see how many observations are within each cluster.
What do you notice about the cluster size? What could this tell us about the clusters?
The clusters are not identical in size, but the clusters are each large enough that there is no reason to be concerned.
# un-standardizing and calculating the mean
usage3centers <- as_tibble(usage3Means$cluster) %>%
  mutate(Name = usage$Name) %>%
  rename(Clusters = value) %>%
  left_join(usage3, by = "Name") %>%
  group_by(Clusters) %>%
  summarise(across(where(is.numeric), mean)) %>%
  mutate(across(where(is.numeric), round, digits = 3))
usage3centers %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = c(2:15), width = .5)
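| Clusters | GS | MIN | PTS | AST | TO | STL | OR | DR | BLK | PF | FGM | FGA | 3PM | 3PA | FTM | FTA | PER | SC-EFF | SH-EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 57.539 | 33.089 | 19.346 | 4.773 | 2.355 | 1.071 | 0.983 | 4.652 | 0.513 | 2.289 | 6.966 | 15.224 | 2.066 | 5.755 | 3.342 | 4.123 | 17.816 | 1.267 | 0.525 |
| 2 | 36.295 | 22.354 | 9.580 | 1.551 | 1.175 | 0.649 | 2.220 | 4.575 | 0.954 | 2.438 | 3.818 | 6.684 | 0.351 | 1.034 | 1.611 | 2.315 | 18.374 | 1.455 | 0.604 |
| 3 | 17.185 | 20.735 | 7.992 | 1.773 | 0.902 | 0.667 | 0.709 | 2.519 | 0.333 | 1.669 | 2.924 | 6.708 | 1.139 | 3.254 | 1.005 | 1.304 | 12.231 | 1.193 | 0.520 |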
What do you notice about the cluster means? Without looking any further, how would you describe the three clusters? Jot down some notes in your table.
Now, let’s look at the clusters graphically.
usage3fviz <- fviz_cluster(usage3Means, usageKMeans_prep,
                           geom = "point",
                           show.clust.cent = TRUE, stand = FALSE,
                           pointsize = 1,
                           main = "Usage K = 3 Clusters")
usage3fviz
What do you notice about the visualization? Are there a lot of observations that reside on the border? Where are the centers and outliers of each cluster?
Compare the new visualization with the K = 2 visualization. Where did the third cluster come from? What kinds of players?
If you were to create a fourth cluster, what points would you group together?
Let’s look at our prototype and outlier players. We’ve compiled them all into a table for you to compare and contrast.
# standardizing the distances between the players
usage3Means_scale <- as_tibble(usage3Means$centers) %>%
  mutate(cluster = 1:3)
# creating appropriate tibble for distance formula
usage_fitted3Means <- usage3Means$cluster %>%
  as_tibble() %>%
  rename(cluster = value) %>%
  left_join(usage3Means_scale) %>%
  select(-cluster)
Joining with `by = join_by(cluster)`
# distance from cluster center
distances <- sqrt(rowSums((usage_rm3 - usage_fitted3Means)^2)) %>%
  as_tibble() %>%
  rename(distance = value) %>%
  mutate(Name = usage$Name,
         Cluster = usage3Means$cluster)
# creating a master document with all of the prototypes and all of the outliers.
master_distances <- distances %>%
  group_by(Cluster) %>%
  mutate(outlier_rank = order(order(distance, decreasing = TRUE)),
         proto_rank = order(order(distance, decreasing = FALSE))) %>%
  filter(outlier_rank < 4 | proto_rank < 4) %>%
  mutate(Category = if_else(proto_rank < 4, "Prototype", "Outlier")) %>%
  select(Name, Cluster, Category) %>%
  left_join(usage3) %>%
  arrange(Cluster, desc(Category))
Use the above tables to summarize each of the 6 categories. What kind of players belong in each category? Is there a lot of variation within the prototypes? Is there a lot of variation within the outliers? Which of the outliers are closest to a different cluster? Would you reclassify any of the outliers?
After looking through the clusters, why do you think cluster 2 is so much smaller?
Let’s analyze the overall strength of K = 3 clusters. How does the intra-class similarity compare with K = 2? The inter-class similarity?
# usage3fviz
Comparing K = 2 to K = 3 - Mix
Often, it is interesting to compare the cluster results. Here, we tabulated the cluster assignments between K = 2 and K = 3. This can help us to see how the clustering with K = 2 overlaps with K = 3.
# creating a tibble of the cluster of each player for each K
clusters <- tibble(player = usage$Name,
                   Cluster = usage2Means$cluster,
                   clus3 = usage3Means$cluster,
                   clus4 = usage4Means$cluster)
# tabulating K = 2 and K = 3 clusters
compare_K2K3 <- with(clusters, table(Cluster, clus3)) %>%
  as_tibble() %>%
  pivot_wider(names_from = clus3, values_from = n)
# printing table using flextable
compare_K2K3 %>%
  flextable() %>%
  align(align = "center", part = "all")
| Cluster | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 102 | 13 | 4 |
| 2 | 0 | 48 | 207 |
What do you notice about the clustering distribution?
We can see that most players in cluster 1 from K = 2 stayed in cluster 1 when K = 3. We identified both of these clusters as the “starters,” so this makes a lot of intuitive sense. Most of cluster 2 from K = 2 moved into cluster 3 when K = 3. The interesting transition comes with the middle cluster of K = 3. This cluster is full of big men that don’t score a lot. They came from both cluster 1 and cluster 2 of K = 2. We saw this in our outlier analysis earlier.
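The same kind of cross-tabulation can be built with base R alone. The sketch below uses synthetic data (the `toy` matrix and the object names `km2` and `km3` are assumptions for illustration) to show how two K-Means fits on the same rows are compared.

```r
set.seed(4)
# Synthetic stand-in for a standardized player-by-variable matrix.
toy <- scale(matrix(rnorm(300), ncol = 3))

km2 <- kmeans(toy, centers = 2, nstart = 25)
km3 <- kmeans(toy, centers = 3, nstart = 25)

# Rows: assignment under K = 2; columns: assignment under K = 3.
cross_tab <- table(K2 = km2$cluster, K3 = km3$cluster)
cross_tab
```

Each row sums to the size of the corresponding K = 2 cluster, so reading along a row shows where that cluster's members ended up when a third cluster was allowed. Keep in mind that cluster numbers are arbitrary labels, so "cluster 1" in one fit need not correspond to "cluster 1" in the other.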
Exercise 10
What are the benefits and costs of both K = 2 and K = 3? Which would you choose?
Part 6: Role Data Set
Now we move on to a second data set and we want to give you a lot more autonomy to test different clusters or outliers yourself. The data set is different, but the process is almost exactly the same. If you have questions, we’ll give you hints or you can look back to the usage data set for a clear example.
Remember the role data set? It contains variables aimed at categorizing the function and specific characteristics of the players. We hope to divide players into sub-groups like scorers, 3-point shooters, and rebounders.
Even though most of our data has been converted to “per minute” quantities, it is still very important that we standardize the data first. Otherwise, variables with larger typical values, like points per minute, will outweigh variables with smaller typical values, like blocks per minute. After standardizing, each variable is on the same scale. Standardized data is often difficult to contextualize, so we’ll want to convert the data back for analysis. Below is a small glimpse into what our standardized data looks like.
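To see why standardizing matters, here is a small base R sketch. The four rows of per-minute values are made up for illustration, and base R's `scale()` is used as a stand-in for the `standardize()` function used in this module (the assumption being that both compute ordinary z-scores by default).

```r
# Hypothetical per-minute values for four players (illustrative, not real data).
toy <- data.frame(
  PTSPerMin = c(0.80, 0.40, 0.30, 0.55),
  BLKPerMin = c(0.02, 0.07, 0.01, 0.03)
)

# Raw Euclidean distances: the points column has much larger differences,
# so it contributes most of each distance.
raw_dist <- dist(toy)

# After z-scoring, each column has mean 0 and standard deviation 1,
# so both variables contribute on the same scale.
toy_z  <- scale(toy)
z_dist <- dist(toy_z)

# Converting back to the original units for interpretation.
back <- sweep(sweep(toy_z, 2, attr(toy_z, "scaled:scale"), "*"),
              2, attr(toy_z, "scaled:center"), "+")

round(raw_dist, 3)
round(z_dist, 3)
```

Comparing the two distance tables shows which pairs of players look close only because blocks were drowned out by points in the raw data.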
# initializing our datasets a second time in case student decides to remove a variable.
# For some reason, when I round to 3 digits, the elbow plot no longer suggests K = 7. This is very surprising. So I've decided to keep it rounding to 4 digits, because I have done so much work for K = 7.
role <- nba %>%
  select(Name, POS, Team, Height, Weight, PTSPerMin, ASTPerMin, TOPerMin, STLPerMin, ORPerMin, DRPerMin, BLKPerMin, PFPerMin, FGP, FGMPerMin, FGAPerMin, `3PP`, `3PMPerMin`, `3PAPerMin`, FTP, FTMPerMin, FTAPerMin)
# standardizing the data for KMeans
roleKMeans_prep <- role %>%
  mutate(across(where(is.numeric), standardize))
# displaying the standardized data for student
roleKMeans_prep %>%
  slice(1:5) %>%
  mutate(across(where(is.numeric), round, digits = 3)) %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(2:5), width = .6) %>%
  width(j = c(6:12), width = .95)
| Name | POS | Team | Height | Weight | PTSPerMin | ASTPerMin | TOPerMin | STLPerMin | ORPerMin | DRPerMin | BLKPerMin | PFPerMin | FGP | FGMPerMin | FGAPerMin | 3PP | 3PMPerMin | 3PAPerMin | FTP | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Trae Young | PG | atl | -1.655 | -1.496 | 2.824 | 3.171 | 2.751 | -0.499 | -0.698 | -0.944 | -1.083 | -1.288 | -0.085 | 2.237 | 2.300 | 0.525 | 1.354 | 1.164 | 1.376 | 3.112 | 2.499 |
| John Collins | PF | atl | 0.842 | 0.774 | 0.620 | -0.708 | -0.773 | -1.019 | 0.269 | 1.024 | 0.734 | 0.474 | 0.820 | 0.857 | 0.313 | 0.357 | -0.356 | -0.443 | 0.259 | 0.360 | 0.275 |
| Bogdan Bogdanovic | SG | atl | -0.094 | 0.155 | 0.539 | 0.129 | -0.692 | 0.469 | -0.780 | -0.392 | -0.840 | -0.457 | -0.483 | 0.426 | 0.757 | 0.394 | 1.468 | 1.427 | 0.762 | -0.405 | -0.529 |
| De'Andre Hunter | SF | atl | 0.530 | 0.362 | 0.036 | -0.970 | -0.420 | -0.689 | -0.788 | -0.852 | -0.435 | 0.471 | -0.332 | -0.069 | 0.069 | 0.497 | -0.081 | -0.219 | -0.022 | 0.343 | 0.344 |
| Kevin Huerter | SG | atl | 0.218 | -1.083 | -0.277 | -0.129 | -0.558 | -0.676 | -0.877 | -0.719 | -0.429 | 0.005 | -0.168 | -0.118 | -0.078 | 0.590 | 0.857 | 0.637 | 0.410 | -1.193 | -1.304 |
# finishing prepping data for KMeans procedure
roleKMeans_prep <- roleKMeans_prep %>%
  column_to_rownames(var = "Name") %>%
  select(-Team, -POS)
Let’s check our Elbow plot to get an idea of the clustering.
# removing text for visualizations and standardizing
role_rm <- role %>%
  select(-Name, -POS, -Team) %>%
  mutate(across(where(is.numeric), standardize))
fviz_nbclust(role_rm, kmeans, method = "wss", k.max = 24) +
  theme_minimal() +
  labs(title = "The Elbow Method")
Exercise 11
a) What do you see from the Elbow plot? At what point do the returns diminish?
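If it helps to see what the Elbow plot is measuring, the sketch below computes the same quantity by hand in base R: the total within-cluster sum of squares for each K. The data are synthetic (three loose groups) and the object names are assumptions for illustration, so the numbers will not match the role data set.

```r
set.seed(5)
# Synthetic stand-in data with three loose groups in two variables.
toy <- scale(rbind(
  matrix(rnorm(120, mean = 0), ncol = 2),
  matrix(rnorm(120, mean = 4), ncol = 2),
  cbind(rnorm(60, mean = 0), rnorm(60, mean = 4))
))

k_values <- 1:8

# Total within-cluster sum of squares for each candidate K.
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Improvement gained by each additional cluster.
# The "elbow" is roughly where these improvements flatten out.
wss_drop <- -diff(wss)

plot(k_values, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```

Looking at `wss_drop` alongside the plot can make "diminishing returns" easier to pin down than the curve alone.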
plot(roleClust) +
  labs(title = "Optimal Number of Clusters", x = "")
There’s a lot of variation in the preferred number of clusters. How many clusters would you choose to analyze? How many values of K would you like to analyze? This is totally up to you. Feel free to move back and forth through this section to analyze the data as much as you like.
Exercise 12
We will be using K = 7 for the trade scenario portion, so we recommend that you work through K = 7.
# assume that they want K = 7.
stu_cluster <- 7
Ok, you’ve chosen K = 7. Here is an empty table for you to describe each of the clusters. As you grow in understanding of each of the clusters, fill it out with a few distinguishing words. Make sure you can glance at the table and understand what separates one cluster from another.
We’ll begin by looking at the mean for each variable of a cluster. Remember, this can help us identify variables that are not useful and get a general understanding of the characteristics of each cluster.
There may be a lot of variables, so we flipped the coordinates of the plot to make it easier to read. A bar to the right indicates a positive association and a bar to the left indicates a negative association.
# creating factor levels for role
role_levels <- colnames(role)
# creates a dataset of each variable and the standardized center and graphs it
as_tibble(roleKMeans$centers, rownames = "cluster") %>%
  pivot_longer(cols = c(Height:FTAPerMin), names_to = "variable") %>%
  mutate(variable = factor(variable, role_levels)) %>%
  ggplot(aes(x = variable, y = value, fill = cluster)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  geom_hline(yintercept = 0) +
  facet_grid(cols = vars(cluster), switch = "both") +
  labs(title = "Influence on the Cluster Assignment", x = "", y = "Cluster") +
  theme(axis.text.x = element_blank(),
        legend.position = "none")
Sift through the variables to see if any are unused throughout the clusters. If so, this indicates that the variable does not help differentiate the data into clusters. You can remove it here:
# if the student wants to remove a variable enter it here
role_var_rm <- 0
# reproducing roleKMeans without the removed variables
set.seed(100)
roleKMeans <- roleKMeans_prep %>%
  select(-all_of(role_var_rm)) %>%
  kmeans(centers = stu_cluster, nstart = 50)
role <- role %>%
  select(-role_var_rm)
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(role_var_rm)
# Now:
data %>% select(all_of(role_var_rm))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
role_rm <- role_rm %>%
  select(-role_var_rm)
If you chose a large number of clusters, it may be difficult to use this visualization to remove unimportant variables. Instead, you should be able to see some of the important attributes of each of the clusters. Be thinking of identifiers for each cluster. Which variables are important throughout?
Let’s begin to analyze the numeric values of the centers. Look through each cluster’s characteristics. What sticks out to you?
Which clusters are scorers? Which are rebounders? Which have higher assist numbers? Higher 3-point shooting? Are any two clusters similar? What differentiates them?
At this point, give a short descriptor of each cluster. Each cluster should be uniquely described.
Does this surprise you? Which clusters are large and small? Does this fit with your perception of the makeup of NBA teams?
Let’s look at the distribution of the players.
rolefviz <- fviz_cluster(roleKMeans, roleKMeans_prep,
                         geom = "point",
                         show.clust.cent = TRUE, stand = FALSE,
                         pointsize = 1,
                         main = "Role K Clusters")
rolefviz
What do you notice from the visualization? Remember, the dimensions cannot represent all the data, so we may have clusters that overlap. Imagine that there is a third dimension “Z” that explains another 30%-40% of the data.
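The two axes of the cluster plot are principal components, so it is possible to check how much of the data each one captures. The sketch below uses base R's `prcomp()` on a synthetic matrix (the `toy` data and object names are assumptions for illustration) to compute the share of variance carried by each dimension, including the hypothetical third dimension mentioned above.

```r
set.seed(6)
# Synthetic stand-in for a standardized player-by-variable matrix.
toy <- scale(matrix(rnorm(100 * 6), ncol = 6))

pca <- prcomp(toy)

# Proportion of total variance captured by each principal component.
var_share <- pca$sdev^2 / sum(pca$sdev^2)
round(var_share, 3)

# Variance visible in a two-dimensional plot ...
sum(var_share[1:2])
# ... and what a third axis would add.
var_share[3]
```

When the first two shares add up to well under 100%, clusters that appear to overlap in the plot may still be separated along dimensions that are not drawn.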
Where are the cluster centers and outliers? Which clusters seem to be the closest together? Furthest away? Are any clusters more isolated than others? Is this supported by your previous analysis?
If you had to add another cluster where would it be? If you had to remove a cluster, where would it be?
Let’s look at our prototype and outlier analysis.
First, we need to verify that our prototypes and outliers really are prototypes and outliers. Now that we can change the number of clusters, it’s possible that you have some pretty small clusters. With a smaller sample size, we want to ensure that all our prototypes are indeed close to the cluster center and that all our outliers are indeed far away. In our K = 2 usage analysis, our prototypes were about 1-2.3 units away from the center. Our outliers were about 6-8.5. However, as K increases, the outlier distances should fall. Let’s look at the distances from the center of our top 3 prototypes and outliers from each cluster to see how they compare.
# standardizing the distances between the players
roleKMeans_scale <- as_tibble(roleKMeans$centers) %>%
  mutate(cluster = 1:n())
# creating appropriate tibble for distance formula
role_fittedKMeans <- roleKMeans$cluster %>%
  as_tibble() %>%
  rename(cluster = value) %>%
  left_join(roleKMeans_scale) %>%
  select(-cluster)
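As a self-contained illustration of the distance calculation described above, here is a minimal base R sketch. It uses synthetic data with made-up row names (`player_1`, `player_2`, ...), so every name in it is hypothetical. Each row is matched to the center of its own cluster, the Euclidean distance is computed, and the three closest and three farthest rows per cluster are kept.

```r
set.seed(3)
# Hypothetical standardized data; row names stand in for player names.
toy <- scale(matrix(rnorm(200), ncol = 4))
rownames(toy) <- paste0("player_", seq_len(nrow(toy)))

toy_km <- kmeans(toy, centers = 3, nstart = 25)

# Center coordinates of each row's assigned cluster, one row per player.
own_center <- toy_km$centers[toy_km$cluster, , drop = FALSE]

# Euclidean distance from each player to the center of its own cluster.
dist_to_center <- sqrt(rowSums((toy - own_center)^2))

# Rank within cluster: smallest distance = prototype, largest = outlier.
proto_rank   <- ave(dist_to_center, toy_km$cluster, FUN = rank)
outlier_rank <- ave(-dist_to_center, toy_km$cluster, FUN = rank)

result <- data.frame(
  name     = rownames(toy),
  cluster  = toy_km$cluster,
  distance = round(dist_to_center, 3)
)
result[proto_rank <= 3 | outlier_rank <= 3, ]
```

Printing the distances next to the labels makes it easy to spot a "prototype" that is not actually close to its center, or an "outlier" that is not actually far away.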
Which prototypes are the strongest prototypes? Which prototypes do you trust the most? Which are the strongest outliers? Would you disqualify any outliers or prototypes from the analysis (i.e., a supposed outlier that is not far enough from the center, or a labeled prototype that is too far from the center)?
If you wish to disqualify a player from analysis, do it here:
For illustration, we assume that the student disqualifies Nic Claxton.
Look again at the size of each cluster. Does this help explain any of your findings?
These outliers can be very different from each other. We’ll need to look into them to see what kind of players they are. Once again, we’ll show you the top 3 of each category first, and afterward a smaller table with only the top player.
# creating a master document with all of the prototypes and all of the outliers.
mast_dist_slice <- distances %>%
  group_by(Cluster) %>%
  mutate(outlier_rank = order(order(distance, decreasing = TRUE)),
         proto_rank = order(order(distance, decreasing = FALSE))) %>%
  filter(outlier_rank < 4 | proto_rank < 4) %>%
  mutate(Category = if_else(proto_rank < 4, "Prototype", "Outlier")) %>%
  select(Name, Cluster, Category) %>%
  left_join(role) %>%
  arrange(Cluster, desc(Category)) %>%
  filter(Name != disqualify)
Look through the prototypes and outliers. Compare their results with your previous findings. Do the prototypes of each cluster match up with your summary of the cluster? How do the outliers fit in? Two outliers can be very different. Pick a few outliers and determine their closest two clusters.
rolefviz
Analyze the K = 7 clusters as a whole. Are the clusters good? Do they have high intra-class similarity? What about low inter-class similarity? If you were to do the analysis again, would you choose the same number of clusters?
Compare lots of Ks
Select two values of K (between 2 and 10) to compare. This table can become very complex. Remember, the rows are the cluster assignment with the first value of K and the columns are the cluster assignment with the second value. Isolate and analyze one row or column at a time.
# let's say the student wants to compare K = 3 and K = 7
stu_clus1 <- 7
stu_clus2 <- 3
# ensures that the first chosen cluster is lower.
if (stu_clus1 > stu_clus2) {
  space = stu_clus1
  stu_clus1 = stu_clus2
  stu_clus2 = space
}
set.seed(100)
roleKMeans <- kmeans(roleKMeans_prep, centers = stu_clus1, nstart = 50)
set.seed(100)
roleK2Means <- kmeans(roleKMeans_prep, centers = stu_clus2, nstart = 50)
# creating a tibble of the cluster of each player for each K
clusters <- tibble(player = role$Name,
                   Cluster = roleKMeans$cluster,
                   clusK2 = roleK2Means$cluster)
compare_table <- with(clusters, table(Cluster, clusK2)) %>%
  as_tibble() %>%
  pivot_wider(names_from = clusK2, values_from = n)
# tabulating clusters
compare_table %>%
  flextable() %>%
  align(align = "center", part = "all")
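| Cluster | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 9 | 17 | 0 | 8 | 38 | 0 |
| 2 | 1 | 35 | 79 | 0 | 0 | 0 | 90 |
| 3 | 50 | 1 | 0 | 26 | 12 | 0 | 8 |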
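One way to isolate a row or column at a time is to convert the counts to proportions. The sketch below re-enters the K = 3 versus K = 7 counts shown in this section into a base R matrix (the object names are assumptions for illustration) and computes row and column shares.

```r
# Counts copied from the K = 3 (rows) versus K = 7 (columns) comparison table.
compare_counts <- matrix(
  c( 0,  9, 17,  0,  8, 38,  0,
     1, 35, 79,  0,  0,  0, 90,
    50,  1,  0, 26, 12,  0,  8),
  nrow = 3, byrow = TRUE,
  dimnames = list(K3 = as.character(1:3), K7 = as.character(1:7))
)

# Share of each K = 3 cluster (row) that lands in each K = 7 cluster.
row_share <- round(prop.table(compare_counts, margin = 1), 2)
row_share

# Share of each K = 7 cluster (column) that came from each K = 3 cluster.
col_share <- round(prop.table(compare_counts, margin = 2), 2)
col_share
```

A column share close to 1 means that a K = 7 cluster sits almost entirely inside a single K = 3 cluster, while a column split across rows marks a cluster that draws from several of the broader groups.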
Part 7: GM of Dallas Mavericks
Let’s return to the Dallas Mavericks and look at how the Mavericks players were clustered in our role dataset. We’ll use K = 7. If you did not analyze K = 7 earlier, it is worth a look.
Below are a few visual reminders of each cluster’s characteristics.
# initializing our datasets a third time in case student decided to remove a variable
role <- nba %>%
  select(Name, POS, Team, Height, Weight, FGP, `3PP`, FTP, PTSPerMin, ORPerMin, DRPerMin, ASTPerMin, STLPerMin, BLKPerMin, TOPerMin, PFPerMin, FGMPerMin, FGAPerMin, `3PMPerMin`, `3PAPerMin`, FTMPerMin, FTAPerMin) %>%
  mutate(across(where(is.numeric), round, digits = 4))
# standardizing the data for KMeans
roleKMeans_prep <- role %>%
  mutate(across(where(is.numeric), standardize)) %>%
  column_to_rownames(var = "Name") %>%
  select(-Team, -POS)
# creating K = 7 K-Means
set.seed(100)
role7Means <- kmeans(roleKMeans_prep, centers = 7, nstart = 50)
# bar graph of centers
as_tibble(role7Means$centers, rownames = "cluster") %>%
  pivot_longer(cols = c(Height:FTAPerMin), names_to = "variable") %>%
  mutate(variable = factor(variable, role_levels)) %>%
  ggplot(aes(x = variable, y = value, fill = cluster)) +
  geom_bar(stat = "identity") +
  geom_hline(yintercept = 0) +
  coord_flip() +
  facet_grid(cols = vars(cluster), switch = "both") +
  labs(title = "Influence on the Cluster Assignment", x = "", y = "Cluster") +
  theme(axis.text.x = element_blank(),
        legend.position = "none")
Before moving on, fill out this table to describe each cluster. Write a few descriptive words that distinguish each cluster. This will help you to organize your thoughts on each cluster. If you already completed this for K = 7 in the role dataset, then you are free to proceed.
What do you notice about the player assignments? How many clusters do the Mavericks have represented? Which cluster is the most common on the Mavericks team?
Why is cluster 7 the most common? What kind of player is in cluster 7?
The Mavericks experienced a bit of turnover in the 2022 offseason. They’d already traded away C Kristaps Porzingis for SG Spencer Dinwiddie at the end of the 2022 season, and they lost productive SG Jalen Brunson to free agency. They traded away SF Sterling Brown and other assets for C Christian Wood during the 2022 Summer.
Let’s assess the offseason moves of the Dallas Mavericks by looking at the opening day roster for 2023 and its cluster distribution. Below are the eleven players on the Dallas Mavericks roster at Game 1 of the 2023 season, a loss against the Phoenix Suns.
dallas_role2023 <- role7Means_players %>%
  filter(Name %in% c("JaVale McGee", "Reggie Bullock", "Dorian Finney-Smith",
                     "Spencer Dinwiddie", "Luka Doncic", "Tim Hardaway Jr.",
                     "Maxi Kleber", "Christian Wood", "Josh Green",
                     "Dwight Powell", "Davis Bertans")) %>%
  select(-Team) %>%
  arrange(Cluster)
dallas_role2023 %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(2:10), width = .6) %>%
  width(j = c(11:14), width = .95)
| Name | Cluster | POS | MIN | Height | Weight | FGP | 3PP | FTP | PTSPerMin | ORPerMin | DRPerMin | ASTPerMin | STLPerMin | BLKPerMin | TOPerMin | PFPerMin | FGMPerMin | FGAPerMin | 3PMPerMin | 3PAPerMin | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dwight Powell | 1 | C | 21.9 | 82 | 240 | 0.671 | 0.351 | 0.783 | 0.3973 | 0.0959 | 0.1279 | 0.0548 | 0.0228 | 0.0228 | 0.0365 | 0.1233 | 0.1507 | 0.2237 | 0.0091 | 0.0228 | 0.0913 | 0.1187 |
| Tim Hardaway Jr. | 3 | SF | 29.6 | 77 | 205 | 0.394 | 0.336 | 0.757 | 0.4797 | 0.0101 | 0.1149 | 0.0743 | 0.0304 | 0.0034 | 0.0270 | 0.0608 | 0.1689 | 0.4257 | 0.0811 | 0.2432 | 0.0642 | 0.0845 |
| Spencer Dinwiddie | 3 | PG | 30.2 | 77 | 215 | 0.376 | 0.310 | 0.811 | 0.4172 | 0.0265 | 0.1291 | 0.1921 | 0.0199 | 0.0066 | 0.0563 | 0.0795 | 0.1391 | 0.3709 | 0.0530 | 0.1689 | 0.0861 | 0.1093 |
| Davis Bertans | 3 | SF | 14.7 | 82 | 225 | 0.351 | 0.319 | 0.933 | 0.3878 | 0.0136 | 0.1088 | 0.0340 | 0.0204 | 0.0136 | 0.0272 | 0.1088 | 0.1224 | 0.3401 | 0.0952 | 0.2857 | 0.0544 | 0.0612 |
| JaVale McGee | 4 | C | 15.8 | 84 | 270 | 0.629 | 0.222 | 0.699 | 0.5823 | 0.1392 | 0.2848 | 0.0380 | 0.0190 | 0.0696 | 0.0823 | 0.1519 | 0.2468 | 0.3924 | 0.0000 | 0.0063 | 0.0886 | 0.1266 |
| Christian Wood | 5 | C | 30.8 | 82 | 214 | 0.501 | 0.390 | 0.623 | 0.5812 | 0.0519 | 0.2760 | 0.0747 | 0.0260 | 0.0325 | 0.0617 | 0.0812 | 0.2110 | 0.4188 | 0.0617 | 0.1591 | 0.0974 | 0.1591 |
| Luka Doncic | 6 | PG | 35.4 | 79 | 230 | 0.457 | 0.353 | 0.744 | 0.8023 | 0.0254 | 0.2345 | 0.2458 | 0.0339 | 0.0169 | 0.1271 | 0.0621 | 0.2797 | 0.6102 | 0.0876 | 0.2486 | 0.1582 | 0.2119 |
| Dorian Finney-Smith | 7 | PF | 33.1 | 79 | 220 | 0.471 | 0.395 | 0.675 | 0.3323 | 0.0453 | 0.0967 | 0.0574 | 0.0332 | 0.0151 | 0.0302 | 0.0695 | 0.1239 | 0.2628 | 0.0665 | 0.1631 | 0.0211 | 0.0302 |
| Reggie Bullock | 7 | SF | 28.0 | 78 | 205 | 0.401 | 0.360 | 0.833 | 0.3071 | 0.0179 | 0.1107 | 0.0429 | 0.0214 | 0.0071 | 0.0214 | 0.0571 | 0.1071 | 0.2643 | 0.0750 | 0.2071 | 0.0214 | 0.0250 |
| Maxi Kleber | 7 | PF | 24.6 | 82 | 240 | 0.398 | 0.325 | 0.708 | 0.2846 | 0.0488 | 0.1911 | 0.0488 | 0.0203 | 0.0407 | 0.0325 | 0.0935 | 0.0976 | 0.2439 | 0.0569 | 0.1748 | 0.0325 | 0.0447 |
| Josh Green | 7 | SG | 15.5 | 77 | 200 | 0.508 | 0.359 | 0.689 | 0.3097 | 0.0516 | 0.1032 | 0.0774 | 0.0452 | 0.0129 | 0.0452 | 0.1097 | 0.1226 | 0.2452 | 0.0258 | 0.0774 | 0.0323 | 0.0452 |
The roster looks somewhat similar, but what classification of player did the Mavericks lose in the 2022 season and not return in the 2023 season? What classification of player did the Mavericks gain in the 2023 season?
Answer: They lost a cluster 2 player, lost a cluster 7 player, gained two cluster 3 players, and a cluster 4 player.
What kind of player is in cluster 2? What would losing this kind of player do to a team?
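A quick tally makes the gap in the roster easy to see. The sketch below re-enters the cluster labels from the 2023 opening-day roster table in this section into a small data frame (the object names are assumptions for illustration) and counts players per cluster, keeping clusters with no players visible as zeros.

```r
# Cluster labels copied from the 2023 opening-day roster table in this section.
dallas_2023 <- data.frame(
  Name = c("Dwight Powell", "Tim Hardaway Jr.", "Spencer Dinwiddie",
           "Davis Bertans", "JaVale McGee", "Christian Wood", "Luka Doncic",
           "Dorian Finney-Smith", "Reggie Bullock", "Maxi Kleber", "Josh Green"),
  Cluster = c(1, 3, 3, 3, 4, 5, 6, 7, 7, 7, 7)
)

# Count players per cluster; the factor levels keep empty clusters as zeros.
roster_counts <- table(factor(dallas_2023$Cluster, levels = 1:7))
roster_counts

# Clusters with no representative on the roster.
names(roster_counts)[roster_counts == 0]
```

Running the same tally on another season's roster and comparing the two counts side by side shows which kinds of players were gained and lost.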
Dallas Mavericks Trade
Let’s say you’re the GM of the Dallas Mavericks after game 1 of the 2022-2023 season. Which players would you consider trading and what cluster of player would you hope to acquire? Which players are you willing to give up?
Answer: I think the correct answer here is to give up any cluster 3 or cluster 7 player for a cluster 2 player. Maxi Kleber is the most expendable, because he has some features of clusters 1, 4, and 5 and some of cluster 7, and the Mavericks have an excess of these players.
Select four players you are willing to trade and one cluster that you are looking for.
# let's say the student is smart and chooses
trading <- c("Davis Bertans", "Spencer Dinwiddie", "Maxi Kleber", "Dwight Powell")
# and is looking for a player in cluster...
looking <- 2
looking_clus <- role7Means_players %>%
  filter(Cluster == looking)
looking_clus %>%
  flextable() %>%
  align(align = "center", part = "all") %>%
  width(j = 1, width = 1.3) %>%
  width(j = c(2:5), width = .6) %>%
  width(j = c(6:12), width = .95)
| Name | Cluster | POS | MIN | Team | Height | Weight | FGP | 3PP | FTP | PTSPerMin | ORPerMin | DRPerMin | ASTPerMin | STLPerMin | BLKPerMin | TOPerMin | PFPerMin | FGMPerMin | FGAPerMin | 3PMPerMin | 3PAPerMin | FTMPerMin | FTAPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lou Williams | 2 | SG | 14.3 | atl | 73 | 175 | 0.391 | 0.363 | 0.859 | 0.4406 | 0.0210 | 0.0909 | 0.1329 | 0.0350 | 0.0070 | 0.0559 | 0.0629 | 0.1538 | 0.3986 | 0.0490 | 0.1259 | 0.0839 | 0.0979 |
| Dennis Schroder | 2 | PG | 29.2 | bos | 75 | 172 | 0.440 | 0.349 | 0.848 | 0.4932 | 0.0205 | 0.0959 | 0.1438 | 0.0274 | 0.0034 | 0.0719 | 0.0822 | 0.1781 | 0.4075 | 0.0479 | 0.1336 | 0.0856 | 0.1027 |
| Marcus Smart | 2 | PG | 32.3 | bos | 75 | 220 | 0.418 | 0.331 | 0.793 | 0.3746 | 0.0186 | 0.0991 | 0.1827 | 0.0526 | 0.0093 | 0.0681 | 0.0712 | 0.1300 | 0.3127 | 0.0526 | 0.1579 | 0.0619 | 0.0774 |
| Ish Smith | 2 | PG | 13.8 | cha | 72 | 175 | 0.395 | 0.400 | 0.632 | 0.3261 | 0.0217 | 0.0870 | 0.1884 | 0.0362 | 0.0217 | 0.0725 | 0.0652 | 0.1449 | 0.3623 | 0.0217 | 0.0507 | 0.0217 | 0.0362 |
| Lonzo Ball | 2 | PG | 34.6 | chi | 78 | 190 | 0.423 | 0.423 | 0.750 | 0.3757 | 0.0289 | 0.1272 | 0.1474 | 0.0520 | 0.0260 | 0.0665 | 0.0694 | 0.1329 | 0.3150 | 0.0896 | 0.2139 | 0.0173 | 0.0231 |
| Alex Caruso | 2 | SG | 28.0 | chi | 76 | 186 | 0.398 | 0.333 | 0.795 | 0.2643 | 0.0286 | 0.1000 | 0.1429 | 0.0607 | 0.0143 | 0.0500 | 0.0929 | 0.0893 | 0.2214 | 0.0357 | 0.1107 | 0.0500 | 0.0643 |
| Ricky Rubio | 2 | PG | 28.5 | cle | 75 | 190 | 0.363 | 0.339 | 0.854 | 0.4596 | 0.0140 | 0.1298 | 0.2316 | 0.0491 | 0.0070 | 0.0912 | 0.0772 | 0.1544 | 0.4246 | 0.0596 | 0.1789 | 0.0912 | 0.1053 |
| Brandon Goodwin | 2 | G | 13.9 | cle | 72 | 180 | 0.416 | 0.345 | 0.632 | 0.3453 | 0.0288 | 0.1079 | 0.1799 | 0.0504 | 0.0000 | 0.0719 | 0.0791 | 0.1295 | 0.3094 | 0.0360 | 0.1079 | 0.0504 | 0.0791 |
| Jalen Brunson | 2 | PG | 31.9 | dal | 73 | 190 | 0.502 | 0.373 | 0.840 | 0.5110 | 0.0157 | 0.1066 | 0.1505 | 0.0251 | 0.0000 | 0.0502 | 0.0596 | 0.2006 | 0.4013 | 0.0376 | 0.1003 | 0.0721 | 0.0846 |
| Facundo Campazzo | 2 | PG | 18.2 | den | 70 | 195 | 0.361 | 0.301 | 0.769 | 0.2802 | 0.0220 | 0.0769 | 0.1868 | 0.0549 | 0.0220 | 0.0549 | 0.1044 | 0.0879 | 0.2527 | 0.0495 | 0.1648 | 0.0495 | 0.0659 |
| Cory Joseph | 2 | PG | 24.6 | det | 75 | 200 | 0.445 | 0.414 | 0.885 | 0.3252 | 0.0163 | 0.0894 | 0.1463 | 0.0244 | 0.0122 | 0.0528 | 0.0935 | 0.1098 | 0.2520 | 0.0407 | 0.0976 | 0.0610 | 0.0691 |
| Killian Hayes | 2 | PG | 25.0 | det | 77 | 195 | 0.383 | 0.263 | 0.770 | 0.2760 | 0.0200 | 0.1040 | 0.1680 | 0.0480 | 0.0200 | 0.0680 | 0.1120 | 0.1080 | 0.2800 | 0.0280 | 0.1000 | 0.0360 | 0.0440 |
| Saben Lee | 2 | PG | 16.3 | det | 74 | 183 | 0.390 | 0.233 | 0.789 | 0.3436 | 0.0307 | 0.1166 | 0.1779 | 0.0613 | 0.0184 | 0.0613 | 0.0736 | 0.1166 | 0.2945 | 0.0245 | 0.0982 | 0.0920 | 0.1166 |
| Draymond Green | 2 | PF | 28.9 | gs | 78 | 230 | 0.525 | 0.296 | 0.659 | 0.2595 | 0.0346 | 0.2180 | 0.2422 | 0.0450 | 0.0381 | 0.1038 | 0.1038 | 0.1003 | 0.1938 | 0.0104 | 0.0415 | 0.0450 | 0.0692 |
| Kevin Porter Jr. | 2 | SG | 31.3 | hou | 76 | 203 | 0.415 | 0.375 | 0.642 | 0.4984 | 0.0224 | 0.1182 | 0.1981 | 0.0351 | 0.0128 | 0.0990 | 0.0831 | 0.1757 | 0.4217 | 0.0799 | 0.2173 | 0.0639 | 0.1022 |
| Josh Christopher | 2 | SG | 18.0 | hou | 77 | 215 | 0.448 | 0.296 | 0.735 | 0.4389 | 0.0389 | 0.1000 | 0.1111 | 0.0500 | 0.0111 | 0.0833 | 0.0722 | 0.1667 | 0.3778 | 0.0444 | 0.1444 | 0.0611 | 0.0833 |
| D.J. Augustin | 2 | G | 15.0 | hou | 71 | 183 | 0.404 | 0.406 | 0.868 | 0.3600 | 0.0133 | 0.0667 | 0.1467 | 0.0200 | 0.0000 | 0.0867 | 0.0333 | 0.1067 | 0.2667 | 0.0733 | 0.1867 | 0.0667 | 0.0733 |
| Tyrese Haliburton | 2 | PG | 36.1 | ind | 77 | 185 | 0.502 | 0.416 | 0.849 | 0.4848 | 0.0222 | 0.0970 | 0.2659 | 0.0499 | 0.0166 | 0.0886 | 0.0526 | 0.1717 | 0.3435 | 0.0609 | 0.1468 | 0.0776 | 0.0914 |
| T.J. McConnell | 2 | PG | 24.1 | ind | 73 | 190 | 0.481 | 0.303 | 0.826 | 0.3527 | 0.0290 | 0.1079 | 0.2033 | 0.0456 | 0.0166 | 0.0456 | 0.0830 | 0.1535 | 0.3195 | 0.0166 | 0.0498 | 0.0290 | 0.0373 |
| Keifer Sykes | 2 | G | 17.7 | ind | 71 | 167 | 0.363 | 0.300 | 0.882 | 0.3164 | 0.0169 | 0.0678 | 0.1073 | 0.0226 | 0.0056 | 0.0565 | 0.0904 | 0.1243 | 0.3333 | 0.0452 | 0.1582 | 0.0282 | 0.0282 |
| Eric Bledsoe | 2 | SG | 25.2 | lac | 73 | 214 | 0.421 | 0.313 | 0.761 | 0.3929 | 0.0198 | 0.1151 | 0.1667 | 0.0516 | 0.0159 | 0.0833 | 0.0635 | 0.1429 | 0.3452 | 0.0357 | 0.1190 | 0.0635 | 0.0873 |
| De'Anthony Melton | 2 | SG | 22.7 | mem | 74 | 200 | 0.404 | 0.374 | 0.750 | 0.4758 | 0.0396 | 0.1586 | 0.1189 | 0.0617 | 0.0220 | 0.0661 | 0.0793 | 0.1674 | 0.4185 | 0.0837 | 0.2247 | 0.0529 | 0.0705 |
| Tyus Jones | 2 | PG | 21.2 | mem | 72 | 196 | 0.451 | 0.390 | 0.818 | 0.4104 | 0.0094 | 0.1038 | 0.2075 | 0.0425 | 0.0000 | 0.0283 | 0.0189 | 0.1604 | 0.3585 | 0.0519 | 0.1321 | 0.0330 | 0.0425 |
| Kyle Lowry | 2 | PG | 33.9 | mia | 72 | 196 | 0.440 | 0.377 | 0.851 | 0.3953 | 0.0147 | 0.1180 | 0.2212 | 0.0324 | 0.0088 | 0.0796 | 0.0826 | 0.1298 | 0.2950 | 0.0678 | 0.1799 | 0.0678 | 0.0826 |
| Gabe Vincent | 2 | PG | 23.4 | mia | 75 | 200 | 0.417 | 0.368 | 0.815 | 0.3718 | 0.0128 | 0.0641 | 0.1325 | 0.0385 | 0.0085 | 0.0598 | 0.0983 | 0.1325 | 0.3205 | 0.0769 | 0.2051 | 0.0256 | 0.0342 |
| Jrue Holiday | 2 | PG | 32.9 | mil | 75 | 205 | 0.501 | 0.411 | 0.761 | 0.5562 | 0.0304 | 0.1064 | 0.2067 | 0.0486 | 0.0122 | 0.0821 | 0.0608 | 0.2158 | 0.4316 | 0.0608 | 0.1459 | 0.0608 | 0.0821 |
| Patrick Beverley | 2 | PG | 25.4 | min | 73 | 180 | 0.406 | 0.343 | 0.722 | 0.3622 | 0.0433 | 0.1220 | 0.1811 | 0.0472 | 0.0354 | 0.0512 | 0.1181 | 0.1220 | 0.2953 | 0.0551 | 0.1654 | 0.0669 | 0.0906 |
| Jordan McLaughlin | 2 | PG | 14.5 | min | 71 | 185 | 0.440 | 0.318 | 0.750 | 0.2621 | 0.0276 | 0.0828 | 0.2000 | 0.0621 | 0.0138 | 0.0414 | 0.0621 | 0.0966 | 0.2207 | 0.0276 | 0.0966 | 0.0345 | 0.0414 |
| Jose Alvarado | 2 | PG | 15.4 | no | 72 | 179 | 0.446 | 0.291 | 0.679 | 0.3961 | 0.0325 | 0.0909 | 0.1818 | 0.0844 | 0.0065 | 0.0455 | 0.0909 | 0.1558 | 0.3506 | 0.0390 | 0.1299 | 0.0455 | 0.0649 |
| Josh Giddey | 2 | SG | 31.5 | okc | 80 | 205 | 0.419 | 0.263 | 0.709 | 0.3968 | 0.0571 | 0.1905 | 0.2032 | 0.0286 | 0.0127 | 0.1016 | 0.0508 | 0.1651 | 0.3937 | 0.0317 | 0.1238 | 0.0317 | 0.0476 |
| Theo Maledon | 2 | PG | 17.8 | okc | 76 | 175 | 0.375 | 0.293 | 0.790 | 0.3989 | 0.0225 | 0.1236 | 0.1236 | 0.0337 | 0.0112 | 0.0730 | 0.0730 | 0.1292 | 0.3483 | 0.0506 | 0.1629 | 0.0843 | 0.1124 |
| Jalen Suggs | 2 | SG | 27.2 | orl | 76 | 205 | 0.361 | 0.214 | 0.773 | 0.4338 | 0.0184 | 0.1103 | 0.1618 | 0.0441 | 0.0147 | 0.1103 | 0.1103 | 0.1507 | 0.4191 | 0.0331 | 0.1507 | 0.0956 | 0.1250 |
| R.J. Hampton | 2 | PG | 21.9 | orl | 76 | 175 | 0.383 | 0.350 | 0.641 | 0.3470 | 0.0183 | 0.1233 | 0.1142 | 0.0320 | 0.0091 | 0.0639 | 0.0731 | 0.1233 | 0.3242 | 0.0457 | 0.1324 | 0.0548 | 0.0822 |
| Chris Paul | 2 | PG | 32.9 | phx | 72 | 175 | 0.493 | 0.317 | 0.837 | 0.4468 | 0.0091 | 0.1216 | 0.3283 | 0.0578 | 0.0091 | 0.0729 | 0.0638 | 0.1702 | 0.3435 | 0.0304 | 0.0942 | 0.0790 | 0.0942 |
| Cameron Payne | 2 | PG | 22.0 | phx | 73 | 183 | 0.409 | 0.336 | 0.843 | 0.4909 | 0.0182 | 0.1182 | 0.2227 | 0.0318 | 0.0136 | 0.0818 | 0.0955 | 0.1864 | 0.4591 | 0.0545 | 0.1636 | 0.0591 | 0.0682 |
| Dennis Smith Jr. | 2 | PG | 17.3 | por | 74 | 205 | 0.418 | 0.222 | 0.656 | 0.3237 | 0.0289 | 0.1040 | 0.2081 | 0.0694 | 0.0173 | 0.0809 | 0.0809 | 0.1214 | 0.2948 | 0.0116 | 0.0405 | 0.0636 | 0.0983 |
| Tyrese Haliburton1 | 2 | PG | 34.5 | sac | 77 | 185 | 0.457 | 0.413 | 0.837 | 0.4145 | 0.0232 | 0.0899 | 0.2145 | 0.0493 | 0.0203 | 0.0667 | 0.0406 | 0.1536 | 0.3333 | 0.0580 | 0.1420 | 0.0493 | 0.0580 |
| Davion Mitchell | 2 | PG | 27.7 | sac | 74 | 205 | 0.418 | 0.316 | 0.659 | 0.4152 | 0.0144 | 0.0650 | 0.1516 | 0.0253 | 0.0108 | 0.0542 | 0.0686 | 0.1697 | 0.4043 | 0.0469 | 0.1552 | 0.0253 | 0.0397 |
| Derrick White1 | 2 | PG | 30.3 | sa | 76 | 190 | 0.426 | 0.314 | 0.869 | 0.4752 | 0.0165 | 0.0990 | 0.1848 | 0.0330 | 0.0297 | 0.0594 | 0.0792 | 0.1650 | 0.3828 | 0.0561 | 0.1749 | 0.0924 | 0.1089 |
| Tre Jones | 2 | PG | 16.6 | sa | 73 | 185 | 0.490 | 0.196 | 0.780 | 0.3614 | 0.0241 | 0.1084 | 0.2048 | 0.0361 | 0.0060 | 0.0422 | 0.0663 | 0.1446 | 0.2952 | 0.0060 | 0.0422 | 0.0602 | 0.0783 |
| Malachi Flynn | 2 | PG | 12.2 | tor | 73 | 175 | 0.393 | 0.333 | 0.625 | 0.3525 | 0.0164 | 0.0984 | 0.1311 | 0.0410 | 0.0082 | 0.0246 | 0.0820 | 0.1311 | 0.3443 | 0.0574 | 0.1639 | 0.0246 | 0.0410 |
| Mike Conley | 2 | PG | 28.6 | utah | 73 | 175 | 0.435 | 0.408 | 0.796 | 0.4790 | 0.0245 | 0.0839 | 0.1853 | 0.0455 | 0.0105 | 0.0594 | 0.0699 | 0.1678 | 0.3846 | 0.0804 | 0.2028 | 0.0629 | 0.0804 |
| Ish Smith1 | 2 | PG | 22.0 | wsh | 72 | 175 | 0.457 | 0.357 | 0.600 | 0.3909 | 0.0227 | 0.1136 | 0.2364 | 0.0455 | 0.0227 | 0.0682 | 0.0727 | 0.1818 | 0.4000 | 0.0227 | 0.0682 | 0.0045 | 0.0091 |
| Raul Neto | 2 | PG | 19.6 | wsh | 73 | 180 | 0.463 | 0.292 | 0.769 | 0.3827 | 0.0102 | 0.0867 | 0.1582 | 0.0408 | 0.0000 | 0.0561 | 0.0765 | 0.1480 | 0.3214 | 0.0255 | 0.0867 | 0.0612 | 0.0765 |
| Aaron Holiday | 2 | G | 16.2 | wsh | 72 | 185 | 0.467 | 0.343 | 0.800 | 0.3765 | 0.0123 | 0.0864 | 0.1173 | 0.0370 | 0.0123 | 0.0617 | 0.0926 | 0.1481 | 0.3210 | 0.0370 | 0.0988 | 0.0432 | 0.0556 |
From the list, choose a player you like from a team that has several of these types of players. Such a team would be more likely to part with one of them. Assess the strengths of the pertinent players and propose a trade! How does it look?
Feel free to make the trades as complex as you wish, but try to choose something that the opposing team would agree to.
Defend your proposed trade using the cluster information. You may add in some basketball knowledge if you like.
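To find teams that hold several cluster 2 players, it can help to count the rows of the cluster 2 table by team. The sketch below re-enters the Team column of that table, in order, as a base R vector (the object names are assumptions for illustration). Note that the table appears to list one row per player-team stint, so names with a trailing 1 likely mark a player who changed teams during the season.

```r
# Team codes of the 45 rows in the cluster 2 table, copied in order.
cluster2_teams <- c(
  "atl", "bos", "bos", "cha", "chi", "chi", "cle", "cle", "dal", "den",
  "det", "det", "det", "gs", "hou", "hou", "hou", "ind", "ind", "ind",
  "lac", "mem", "mem", "mia", "mia", "mil", "min", "min", "no", "okc",
  "okc", "orl", "orl", "phx", "phx", "por", "sac", "sac", "sa", "sa",
  "tor", "utah", "wsh", "wsh", "wsh"
)

# Number of cluster 2 rows per team, largest first.
team_counts <- sort(table(cluster2_teams), decreasing = TRUE)

# Teams holding the most cluster 2 players are natural first calls.
team_counts[team_counts == max(team_counts)]
```

The count is only a starting point: it says nothing about minutes, contracts, or whether a listed player was still on that team at the end of the season.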
What do you think of this process? What are the strengths and weaknesses of evaluating a team based on cluster membership?