A Joint Analysis for Field Goal Attempts and Percentages of Professional Basketball Players: Bayesian Nonparametric Resource

Understanding shooting patterns among different players is a fundamental problem in basketball game analyses. In this paper, we quantify the shooting pattern via the field goal attempts and percentages over twelve non-overlapping regions around the front court. A joint Bayesian nonparametric mixture model is developed to find latent clusters of players based on their shooting patterns. We apply our proposed model to learn the heterogeneity among selected players from the National Basketball Association (NBA) games over the 2018–2019 regular season and 2019–2020 bubble season. Thirteen clusters are identified for the 2018–2019 regular season and seven clusters are identified for the 2019–2020 bubble season. We further examine the shooting patterns of players in these clusters and discuss their relation to players’ other available information. The results shed new light on the effect of the NBA COVID bubble and may provide useful guidance for players’ shot selection and teams’ in-game and recruiting strategy planning.


Introduction
The National Basketball Association (NBA) is undergoing a massive revolution partly due to the recent advances in data science. Data analytics has attracted a great deal of attention in many aspects of the NBA over the years, including player drafting and evaluation, in-game strategy planning, team and player match-ups, and adoption of the three-point shot. An important question of interest in basketball analytics is to find effective ways to quantify and understand an individual player's shooting patterns, which can often be related to analyzing the "hot spots" of players, that is, where the most field goals have been attempted and made over the court. The shot chart is a widely used tool for this purpose because it provides an easy-to-understand graphical representation of shot locations and the associated field goal attempts over different locations. In the literature, several spatial models and machine learning methods have been proposed to study players' shot charts. For example, Miller et al. (2014) used spatial point processes to account for the random nature of shot locations. Franks et al. (2015) characterized shot attempt locations by combining spatial and spatio-temporal processes, matrix factorization methods and hierarchical models. Jiao et al. (2021) proposed a joint model for simultaneously making inference on random locations and outcomes of field goal attempts. Hu et al. (2021a) proposed a distance-based clustering approach to model the heterogeneity of shot selection based on the log Gaussian Cox process. Hu et al. (2021b) considered a model-based clustering approach and used a Bayesian zero-inflated Poisson regression to handle the large proportion of zero shot attempts over the court. Yin et al. (2022a) proposed a Bayesian mixture model of matrix normal distributions to study the spatial heterogeneity in the shot charts. Yin et al. (2022b) proposed a novel nonparametric Bayesian method for learning the underlying intensity surface of shot charts for selected NBA players.
Compared to field goals attempted, the field goal percentage, which is defined as the ratio between goals made and attempted, seems to have received less attention in the literature, yet it remains an important quantity for investigation. For example, identifying players that share similar "hot spots" and field goal percentages can provide valuable information for coaches and teams that aim to find and trade specific types of offensive players. Defensive strategies can also be planned accordingly based on field goal percentages. Reich et al. (2006) developed a spatially varying coefficients model for shot chart data analysis, where the court is divided into multiple, small, non-overlapping regions, and the field goal percentages in these regions were fitted by a multinomial logit model. However, that paper ignored the player-level variability in its analysis.
Our focus in this paper is to simultaneously model the field goal attempts and shot percentages over different shot locations while accounting for the heterogeneity among players. In other words, players are clustered based on the locations on the basketball court where they shoot most often and how accurately they shoot from there. This information could be useful for roster construction for NBA general managers or avid fantasy basketball fans. To model both the field goal attempts and shot percentages at the same time, a joint modeling framework is adopted. Despite its popularity in the statistics literature, including environmental health (Xu et al., 2019) and longitudinal and survival data analysis (Ibrahim et al., 2010; Long and Mills, 2018), joint modeling has not been well explored in basketball analytics. The only work that we are aware of is Jiao et al. (2021), where a joint model was first used for shot chart data analysis and then a hierarchical clustering was performed based on the obtained joint model parameter estimates. However, the inherent uncertainty in the estimation of the cluster number was ignored in that paper.
Unlike the aforementioned clustering approaches, we propose a Bayesian nonparametric mixture model under the joint modeling framework to cluster NBA players based on their shooting patterns and goal percentages. Bayesian models such as the Dirichlet process (DP; Ferguson, 1973) and its mixture are well known for their flexibility in providing a natural way to simultaneously estimate the number of groups and the group configurations in clustering problems. Moreover, the joint modeling framework allows us to conveniently learn the player shooting patterns by accounting for the dependence between the shot attempts and field goal percentages using the shared random effects. By combining the strengths of these two resources, our proposed method bypasses the need for pre-specifying the cluster number and provides deeper insights into players' shooting hot spots than what the traditional position classification can offer. The utility of our method is demonstrated by applications to 2018-2019 and 2019-2020 NBA regular season data.
The rest of this paper is organized as follows. We discuss a motivating example of 2018-2019 NBA regular season data in Section 2. In Section 3, we introduce the joint Bayesian nonparametric mixture model framework and provide details of Bayesian inference and posterior sampling implementation. Applications of the proposed method to NBA player data are reported in Section 4. We conclude with a discussion of future directions in Section 5. For ease of exposition, additional numerical results are given in the supplementary material.

Motivating Data
The shot chart information was collected via the publicly available website (https://github.com/swar/nba_api). We only discuss the data collected from the 2018-2019 NBA regular season, which comprise records of all 219,458 shots taken by 526 players. Each shot record has information on the game in which the shot was taken (location, opponents, date), the nature of the shot (jumper, lay-up, floater, etc.), the time in the game when the shot was taken, the shot location (Cartesian coordinates), whether or not the shot was made, and who took the shot.
We only considered the shot attempts from the front court and divided the front court into twelve zones as shown in Figure 1. Shot attempts from the back court (beyond the half-court line) were omitted from this analysis because these shots were relatively rare (n = 466) and they did not play a major role in game strategy. Shooting patterns were then defined to be the frequency of shot attempts and field goal percentage from twelve non-overlapping regions around the front court throughout this paper. Only players who had taken at least four shots from each of the twelve front court regions were included in the clustering analysis. There were 167 players remaining which accounted for 123,944 of the total shots. Information on player height, weight, salary, and years of NBA experience was used to assess trends within the clusters. Players who were traded (changed teams mid-season) during the 2018-2019 (or 2019-2020) regular season had their statistics counted across the two teams and combined into a single player record.
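The inclusion rule described above (at least four attempts in each of the twelve front-court zones) can be sketched in a few lines. This is a minimal illustration rather than the authors' preprocessing code: the record layout `(player, zone, made)` and the function names are hypothetical, and each shot is assumed to have already been mapped to a zone label.

```python
from collections import Counter, defaultdict


def eligible_players(shots, zones, min_shots=4):
    """Return players with at least `min_shots` attempts in every zone.

    `shots` is an iterable of (player, zone, made) tuples (hypothetical layout).
    Shots whose zone is not in `zones` (e.g. back-court shots) are ignored.
    """
    attempts = defaultdict(Counter)
    for player, zone, _made in shots:
        if zone in zones:
            attempts[player][zone] += 1
    return sorted(
        p for p, counts in attempts.items()
        if all(counts[z] >= min_shots for z in zones)
    )


def zone_summary(shots, player, zones):
    """Per-zone (attempts, field goal percentage) for one player."""
    att, made = Counter(), Counter()
    for p, zone, was_made in shots:
        if p == player and zone in zones:
            att[zone] += 1
            made[zone] += int(was_made)
    # The percentage is undefined (None) for a zone with no attempts.
    return {z: (att[z], made[z] / att[z] if att[z] else None) for z in zones}
```

For an eligible player, `zone_summary` returns the two quantities that enter the joint model: the attempt count and the field goal percentage in each zone.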
To illustrate the shooting pattern heterogeneity, in Figure 2, we show the histogram of shot attempts over 12 regions for three representative players, DeMar DeRozan, Karl-Anthony Towns, and Stephen Curry, during season 2018-2019. These three players took a similar number of shot attempts in total (1313, 1314, and 1340, respectively), but exhibited quite different shooting patterns. DeMar DeRozan had more mid-range shot attempts than the other two players. Most of Karl-Anthony Towns' shot attempts were in the paint or restricted area. Meanwhile Stephen Curry tended to shoot more beyond the three-point line. These findings certainly relate to and will help understand the player's game play styles and team's strategic planning.

Statistical Methodology
After dividing the front court into twelve zones, the number of shots that each player took from each zone is tallied and the percentage of their shots that they successfully made in each zone is computed. Let $Y_i = (Y_{i,1}, \ldots, Y_{i,12})$ be the vector of attempted shots from the twelve regions and $Z_i = (Z_{i,1}, \ldots, Z_{i,12})$ be the vector of field goal percentages from the twelve regions for player $i$. In our Bayesian model, $i$ is the player index taking values from 1 to 167 (number of players) and $j$ is the shooting region index ranging from 1 to 12 (number of regions). Here "IW" represents the inverse Wishart distribution, "IG" represents the inverse gamma distribution, and $\theta_i \in \{1, \ldots, K\}$ is the cluster assignment for player $i$, where $K$ denotes the number of clusters. We define $\omega_i = (\omega_{i,1}, \ldots, \omega_{i,12})$ as the collection of random effects for player $i$; $\mu_i$ and $\eta_i$ are defined in a similar way. Note that $\omega_i$ is the shared random effect that links the outcomes $Y_{ij}$ and $Z_{ij}$ together. In our model, $\delta_i$ is a scale factor for the random effect on the logit scale of the field goal percentage for player $i$, $\Sigma$ is the common covariance matrix for each $\omega_i$, and $X \in \mathbb{R}^{12 \times 5}$ is the matrix of basis functions, which can belong to any arbitrary class of spatial basis functions (e.g., thin-plate basis functions (Yang and Bradley, 2022), radial basis functions, and wavelets (Lim, 2021)). We choose the Moran's I basis function (Moran, 1950; Li et al., 2007; Bradley et al., 2015) in our analysis, and this shared spatial basis function expansion can capture dependence over the basketball court between these two different types of responses (Xu et al., 2019). The results may be sensitive to the choice of the number of basis functions for the spatial pattern (Bradley et al., 2011). The $\alpha_{\theta_i}$ and $\beta_{\theta_i}$ are the regression coefficients that determine the rate and success probabilities for cluster $\theta_i$.
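The notation in this paragraph is consistent with a hierarchical specification along the following lines. This is a hedged sketch, not the authors' exact display: the Poisson likelihood for the counts, the zero-mean multivariate normal distribution for the random effects, the symbol $\Sigma$ for the common covariance matrix, and the generic hyperparameters $(\Psi, \nu, a, b)$ are assumptions introduced for illustration, and $X_j$ denotes the $j$-th row of $X$.

```latex
\begin{aligned}
Y_{ij} \mid \mu_{ij} &\sim \mathrm{Poisson}\{\exp(\mu_{ij})\}, & \mu_{ij} &= X_j \alpha_{\theta_i} + \omega_{ij}, \\
\mathrm{logit}(Z_{ij}) \mid \eta_{ij}, \sigma_j^2 &\sim \mathrm{N}(\eta_{ij}, \sigma_j^2), & \eta_{ij} &= X_j \beta_{\theta_i} + \delta_i\, \omega_{ij}, \\
\omega_i \mid \Sigma &\sim \mathrm{MVN}(0, \Sigma), & \Sigma &\sim \mathrm{IW}(\Psi, \nu), \\
\sigma_j^2 &\sim \mathrm{IG}(a, b), & i &= 1, \ldots, 167, \quad j = 1, \ldots, 12.
\end{aligned}
```

Under this reading, a positive $\delta_i$ means that a region where player $i$ shoots more often than the cluster-level pattern predicts is also a region where that player tends to shoot more accurately, while a negative $\delta_i$ indicates the opposite.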
Moran's I basis is a transformation based on the adjacency matrix of spatial locations and it has been commonly used in spatial statistics. It is a class of functions used to model areal spatial processes in a reduced dimensional space. In our case, it is a transformation of the adjacency matrix for the twelve regions over the half court. More specifically, any two regions on the court that share the same border are considered neighbors. For example, the neighbors of left corner three are mid-range left, mid-range left center, and above the break three left. Meanwhile the only neighbor for restricted area is paint (non-restricted area). Considering other options of adjacency matrix (e.g., marking all of the three point regions as neighbors or connecting the corner regions due to the symmetry of the court) is an interesting direction for future investigation. We choose not to pursue this direction because players have tendencies to drive to one side of the court, generally related to their dominant hand. By only connecting the adjacent regions, the asymmetry of player behavior is still captured.
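One common construction of the Moran's I basis in the areal-data literature can be sketched as follows. Here $A$ is the $12 \times 12$ binary adjacency matrix described above. The intercept-only projection and the choice to keep the eigenvectors associated with the five largest eigenvalues (matching $X \in \mathbb{R}^{12 \times 5}$) are assumptions, since the text does not state these details.

```latex
\begin{aligned}
P^{\perp} &= I_{12} - \tfrac{1}{12}\,\mathbf{1}\mathbf{1}^{\top}, \\
M &= P^{\perp} A\, P^{\perp} = \Phi \Lambda \Phi^{\top}, \qquad \Lambda = \mathrm{diag}(\lambda_1 \ge \cdots \ge \lambda_{12}), \\
X &= [\phi_1, \ldots, \phi_5].
\end{aligned}
```

Eigenvectors with large positive eigenvalues correspond to patterns that vary smoothly over neighboring regions, which is why a few leading columns can summarize positive spatial dependence in a reduced dimension.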
In our model for $\mathrm{logit}(Z_{ij})$, we assume a constant variance $\sigma_j^2$ for a fixed shooting region $j$ over different players. This assumption can be relaxed by considering a more general variance term, i.e., replacing $\sigma_j^2$ with a variance term specified through some function $f(\cdot)$. In our data analysis, we have checked this assumption using residual plots and found it to hold. Therefore we keep the current variance specification and leave its extension to a more complex form as a future work direction.
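Since the number of clusters, $K$, is unknown, one popular way to model the joint distribution of $\theta_1, \ldots, \theta_n$ is the Dirichlet process mixture model (DPMM) (Antoniak, 1974), which can be written as follows:

$$Y_i \mid \vartheta_i \sim F(\cdot \mid \vartheta_i), \qquad \vartheta_i \mid G \sim G, \qquad G \sim \mathrm{DP}(\gamma G_0).$$

The process $G$ is parameterized by a base measure $G_0$ and a concentration parameter $\gamma > 0$. With $\vartheta_i$ for $i = 1, \ldots, n$ drawn from $G$, a conditional prior distribution for a newly drawn $\vartheta_{n+1}$ can be obtained via integration (Blackwell et al., 1973):

$$p(\vartheta_{n+1} \mid \vartheta_1, \ldots, \vartheta_n) = \frac{1}{n + \gamma} \sum_{i=1}^{n} \delta_{\vartheta_i}(\vartheta_{n+1}) + \frac{\gamma}{n + \gamma}\, G_0(\vartheta_{n+1}),$$

with $\delta_{\vartheta_i}(\vartheta_j) = I(\vartheta_j = \vartheta_i)$ being the point mass at $\vartheta_i$. The model can be equivalently represented by introducing group memberships $\theta_i$ and having $K$, the number of groups, approach infinity (Neal, 2000):

$$Y_i \mid \theta_i, \vartheta^* \sim F(\cdot \mid \vartheta^*_{\theta_i}), \qquad \theta_i \mid \pi \sim \mathrm{Discrete}(\pi_1, \ldots, \pi_K), \qquad \vartheta^*_c \sim G_0, \qquad \pi \sim \mathrm{Dirichlet}(\gamma / K, \ldots, \gamma / K), \tag{3}$$

where $\pi = (\pi_1, \ldots, \pi_K)$. It can be seen that under this construction, the group-specific distribution $F(\cdot \mid \vartheta^*_c)$ solely depends on the vector of parameters $\vartheta^*_c$. In Equation 3, the prior distribution of $(\theta_1, \ldots, \theta_n)$, which would allow for automatic inference on the number of groups $K$, can be obtained by integrating out $\pi$, the mixing proportions. This is also known as the Chinese restaurant process (CRP; Aldous, 1985; Pitman, 1995; Neal, 2000). The conditional distribution for $\theta_i$ is defined through the metaphor of a Chinese restaurant (Blackwell et al., 1973):

$$P(\theta_i = c \mid \theta_1, \ldots, \theta_{i-1}) \propto \begin{cases} |c|, & \text{at an existing group labeled } c, \\ \gamma, & \text{at a new group,} \end{cases}$$

where $|c|$ denotes the size of group $c$.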
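The Chinese restaurant metaphor translates directly into a sequential sampler: each new individual joins an existing group with probability proportional to its current size, or opens a new group with probability proportional to the concentration parameter. The sketch below is illustrative only; the function name `sample_crp` and the zero-based integer labels are choices made here, not part of the paper.

```python
import random


def sample_crp(n, gamma, seed=None):
    """Draw labels theta_1, ..., theta_n sequentially from a CRP with concentration gamma > 0."""
    rng = random.Random(seed)
    labels = []   # labels[i] is the group of individual i (0-based)
    sizes = []    # sizes[c] is |c|, the current size of group c
    for _ in range(n):
        # Existing group c has weight |c|; the extra last slot (a new group) has weight gamma.
        weights = sizes + [gamma]
        c = rng.choices(range(len(weights)), weights=weights)[0]
        if c == len(sizes):
            sizes.append(1)      # open a new group
        else:
            sizes[c] += 1        # join an existing group
        labels.append(c)
    return labels


labels = sample_crp(167, gamma=1.0, seed=2019)
num_groups = len(set(labels))
```

Repeating the draw with different seeds gives different values of `num_groups`, which illustrates how the prior places a distribution on the number of groups instead of fixing it in advance. Larger values of `gamma` favor more groups.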
By adapting the DPMM to our model setting for clustering, the cluster assignments $(\theta_1, \ldots, \theta_n)$ in our model are given a CRP prior with concentration parameter $\gamma$, where "CRP" represents the Chinese restaurant process in Equation (3). Our model can be conveniently fitted using the R package nimble (de Valpine et al., 2017, 2021a), in which a Gibbs sampler is implemented. The implementation code is given in the supplementary materials, which will be available online. Two chains were run for 20,000 iterations each with the first 5,000 discarded as burn-in. Adequacy of the mixing was assessed by viewing trace plots for individual parameters and checking autocorrelation between samples; the convergence results were satisfactory. Another important step in posterior inference is to obtain clustering labels. Note that the Chinese restaurant process prior on the cluster assignments allows for different numbers of clusters between samples, and the labels on the clusters do not carry any meaning. Therefore posterior inference for the group configurations needs to be carried out based on posterior samples of $\{\theta_1, \theta_2, \ldots, \theta_n\}$. Further complicating matters, taking the average assignment may not lead to a valid configuration of the individuals. In our case, we conduct posterior inference on the clustering labels based on Dahl's method (Dahl and Vannucci, 2006), which proceeds as follows. Define a membership matrix $A^{(\ell)}$ for the $\ell$-th of $L$ posterior samples as:

$$A^{(\ell)}(i, j) = I\big(\theta_i^{(\ell)} = \theta_j^{(\ell)}\big), \qquad i, j = 1, \ldots, n,$$

where $I(\cdot)$ is the indicator function, and let $A^* = \frac{1}{L} \sum_{\ell=1}^{L} A^{(\ell)}$ be the element-wise mean of the membership matrices, where the summation is also element-wise. The posterior iteration with the smallest squared distance to $A^*$ is obtained by

$$\ell^* = \arg\min_{\ell \in \{1, \ldots, L\}} \big\| A^{(\ell)} - A^* \big\|_F^2,$$

where "F" is the Frobenius norm.
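The label-selection step can be illustrated with a small pure-Python sketch of Dahl's method as described above: build a membership matrix for each posterior draw, average them element-wise, and return the draw closest to the average in squared Frobenius distance. The function names and the toy draws are hypothetical; in practice the inputs would be the retained MCMC samples of the cluster labels.

```python
def membership(labels):
    """n x n matrix with entry 1.0 when individuals i and j share a cluster, else 0.0."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)] for i in range(n)]


def dahl_representative(samples):
    """Return (index, labels) of the draw closest to the element-wise mean membership matrix."""
    mats = [membership(s) for s in samples]
    n, num_draws = len(samples[0]), len(samples)
    mean = [[sum(m[i][j] for m in mats) / num_draws for j in range(n)] for i in range(n)]

    def sq_frobenius(m):
        return sum((m[i][j] - mean[i][j]) ** 2 for i in range(n) for j in range(n))

    best = min(range(num_draws), key=lambda l: sq_frobenius(mats[l]))
    return best, samples[best]


# Toy posterior draws for four individuals; the first two draws are the same
# partition with the labels swapped, so their membership matrices coincide.
draws = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1], [0, 1, 1, 1]]
best_index, best_labels = dahl_representative(draws)
```

Because the membership matrix depends only on which individuals are grouped together, the procedure is unaffected by label switching and always returns a valid partition that was actually visited by the sampler.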

Exploratory Data Analysis
The most natural way to cluster players a priori is by their position. In basketball, each team fields five players and the typical positions are point guard (PG), shooting guard (SG), small forward (SF), power forward (PF), and center (C). At any given time there is typically one player from each of these position groups on the floor. Table 1 shows the height, weight, salary, and years of experience of the players who we have data on for these categories (470 out of 526). Stratifying by position in both the subsample of players who took enough shots to be included and across the entire league shows that there are physical similarities within each position group. On average, point guards are the smallest, followed in an increasing order by shooting guards, small forwards, power forwards, and centers. The physical characteristics within positions between the sample and the whole league are similar, while the mean salary and years of playing experience are greater in the sample than the whole league, but with wide standard deviations.
The position group shooting patterns are shown in Figure 3 for players in the sample and numerically summarized in Table 2 for all players. Figure 3 shows the average location of shots made from each region, the field goal percentage from each region, and a contour density of shooting locations for each position group. Across different positions, it is clear that the proportion of 3-pointers (among all shots) varies. Within each position, shot tendencies also differ between the whole league and the sample. In general, the players in the sample took more shots, as expected, because we only choose players who have taken at least the minimum number of shots in each region. The goal percentages are generally similar between the players in the whole league and those in the sample.

Cluster Analysis
We apply the clustering method proposed in Section 3 to the shot chart data of the 167 selected NBA players in the 2018-2019 regular season. Model diagnostics showed signs of adequate mixing. The final clustering results (e.g., $A^*$) contain thirteen clusters. Complete cluster assignments are summarized in Table 1 of the supplementary file. When splitting the data by position, there are clear physical differences among the players in each group. Interestingly, in our obtained clusters, the height and weight differences between groups are less pronounced. More specifically, the mean heights and weights are more similar and the standard deviations of height and weight are larger, indicating overlaps in height and weight among clusters. Though the physical characteristics are similar, the shooting patterns differ noticeably, as shown in Figure 4.
We first discuss clusters 3, 6, 7, 12, and 13. Players from these clusters take a relatively high frequency of shots from beyond the three-point arc. Interestingly, the shooting accuracy of players in these clusters is not necessarily higher than that of the rest. Clusters 3 and 6 have among the highest accuracy for three-point attempts combined with a high volume of attempts from all around the arc. Cluster 3 features Joe Ingles, Kyle Korver, and Stephen Curry while cluster 6 includes Buddy Hield and Seth Curry. All of these players are known for their shooting abilities from distance. The shot locations of cluster 13 are similar to those of clusters 3 and 6, but the efficiency is lower in this cluster. Cluster 13 has the lowest average height (75.54 in.) and weight (189.54 lbs.). Despite this, Dirk Nowitzki (84 in., 245 lbs.) is in this cluster; 2018-2019 was the last season of his career. Clusters 7 and 12 take fewer three-point shots, and these are generally located in the corners or on the wings, regions that are considered easier to make shots from. Clusters 2, 8, 9, and 11 focus their shot attempts much more in the interior, closer to the basket. Several players in these clusters are former NBA Slam Dunk Contest winners (cluster 8: Aaron Gordon, Zach LaVine, Blake Griffin; 9: Donovan Mitchell; 11: John Wall). This is consistent with these clusters housing players who tend to take their shots near the basket. Of the clusters with more than six players, clusters 8 and 9 on average have the highest salary (although the standard deviations are quite large) and also take the most shots.
Clusters 1, 4, 5, and 10 all have six or fewer players. Clusters 1 and 10 have high salaried players who on average take many shot attempts over the course of the season. Some notable players in these two clusters are James Harden, Bradley Beal and Damian Lillard. Clusters 4 and 5 are primarily made up of bench players, which is reflected in their lower salaries and fewer shot attempts.

Figure 4: The shooting charts for the players in the nine clusters with at least ten players. The colored dots are the centers for each of the zones. Each cluster's collective field goal percentage in each zone is shown on the right. The contour lines show where the players in each cluster tended to shoot from more frequently.

2019-2020 NBA Bubble Season
Another dataset of interest is collected from the 2019-2020 NBA season. The 2019-2020 bubble season is a unique season since it was interrupted by the coronavirus pandemic approximately three quarters of the way into the season. After a four-month suspension, play resumed with all of the remaining games hosted at a single site in Florida with no fans in attendance. Another interesting feature is the "seeding" games scheme, where twenty-two out of thirty NBA teams were invited to participate in the Florida bubble and each team played eight additional regular season games to determine playoff seeding. As a result, the total number of games for the 2019-2020 bubble season is smaller than that of a regular season. Moreover, players and coaches deemed "high-risk individuals" by their team, or players who had already suffered season-ending injuries prior to the season suspension, were not permitted to play. Players who were medically cleared could also decline to participate in the bubble season. With these restrictions, there were only 138 players who met the shot requirement for each shooting zone.
We conduct the clustering analysis for this new data set and find eight clusters among 138 players. In Table 5, we provide the demographics for each of the eight clusters. Similar to the findings for the 2018-2019 season, the average height and weight differences between groups are not obvious. The shooting patterns, on the other hand, show a high level of heterogeneity in Figure 5. The estimation results are shown in Appendix C.

To better understand the shooting pattern heterogeneity, we further examine these obtained clusters. Players in clusters 2, 3, 6, 7, and 8 take more shots from beyond the three-point arc, although their shooting accuracy does not seem higher than the average. In order, cluster 7 has the highest frequency of shots beyond the three-point arc, followed by clusters 2, 3, 6 and 8.
Cluster 7 shoots from a wider range of locations around the three-point arc than the other four clusters. Clusters 2 (e.g., Kentavious Caldwell-Pope and Kent Bazemore), 3 (e.g., Kyrie Irving and Kemba Walker), and 6 (e.g., Lonzo Ball and Fred VanVleet) share similar patterns from the left 45-degree angle to the right 45-degree angle. The main difference is in the corner region beyond the three-point arc, i.e., cluster 2 has more shots from the corner than cluster 6, which in turn has more than cluster 3. Cluster 8 shoots three-pointers from only a limited range of angles compared to the other four clusters.
Cluster 4 and Cluster 5 are very similar. The main difference is that Cluster 4 has a higher FG% in the perimeter area than Cluster 5. Clusters 4 (e.g., LaMarcus Aldridge and Anthony Davis) and 5 (e.g., Carmelo Anthony and Russell Westbrook) differ from the rest of the clusters since their shots are mainly taken from the restricted area and are closer to the basket. Between them, the shooting tendency is still quite different as shown in Table 6, where cluster 4 has almost 50% more shot attempts than cluster 5. In terms of player demographics, players in cluster 4 have almost double the salaries of those in cluster 5.

To summarize, compared to the clustering results obtained from the previous season of 2018-2019, the bubble season has fewer clusters, probably due to the smaller number of games (and players considered), but the clustering patterns remain similar in general. In other words, the COVID bubble does not seem to have a noticeable effect on players' shooting patterns.

Expanded Analysis for 2018-2019 Regular Season with 359 Players
In our previous analysis in Section 4.1, we excluded some players who are primarily centers and forwards and did not take enough three-point shots. These players tend to play closer to the basket and take fewer shots from long distance (which is why they were omitted). Among them is Giannis Antetokounmpo, the 2018-2019 NBA Most Valuable Player. Here we provide an expanded analysis by enrolling the players who made more than 100 shot attempts over the offensive court. With this threshold, we have 359 players in our analysis. We find nineteen clusters among those players. We provide the shooting patterns in Figure 6, Figure 7, and Figure 8. We also provide the cluster demographics for each cluster in Table 7 and cluster shooting tendencies in Table 8.

Figure 6: The shooting charts for the expanded analysis. The colored dots are the centers for each of the zones. Each cluster's collective field goal percentage in each zone is shown on the right. The contour lines show where the players in each cluster tended to shoot from more frequently.
These nineteen clusters can be classified into three main groups. The first group includes clusters 1, 2, 4, 11, 15, 16 and 17. The second group includes clusters 3, 5, 6 and 12, and the remaining clusters are in the third group. For the first group, most shot attempts are in the interior and close to the basket. The shooting tendencies indicate some interesting differences among clusters, as shown in Table 8. Players in the second group generally take more shots beyond the three-point line than those in the first group, but from limited areas, especially the corners. In the third group, players take a relatively high frequency of shots from beyond the three-point arc.
Compared to the clustering results in Section 4.1, the number of players we consider here is roughly doubled, the number of clusters increases by about half (from 13 to 19), and the average cluster size increases by about half as well (from 12.8 to 18.9). While a larger number of clusters and players certainly increases the difficulty of interpretation, we do observe some common patterns between these two sets of clusters. For example, the small clusters (with 6 players in the original analysis and 10 in the expanded analysis) often consist of "outlier" players who take either a lot of or very few overall shots or 3-pointers. For instance, cluster 1 in Table 4 takes the most shot attempts (and 3-pointers) among all 13 clusters, and cluster 5 takes the fewest shot attempts (and 3-pointers). In Table 8, cluster 1 takes the most shot attempts and the 4th highest number of 3-pointers among the 19 clusters, while cluster 19 takes a good number of 3-pointers and overall shots, but its overall shooting percentage (.41) is the lowest among all clusters. Some of the large clusters also share similar patterns. For example, the second largest cluster from the original analysis, cluster 11 in Tables 3 and 4, takes very similar values in terms of average player height/weight and shooting attempts/percentage to the largest cluster from the expanded analysis, cluster 11 in Tables 7 and 8.
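The size comparisons quoted in this paragraph follow from the player and cluster counts reported in the text, and can be checked with a few lines of arithmetic. The variable names below are illustrative.

```python
# Player and cluster counts reported for the two 2018-2019 analyses.
players = {"original": 167, "expanded": 359}
clusters = {"original": 13, "expanded": 19}

avg_size = {k: players[k] / clusters[k] for k in players}

player_ratio = players["expanded"] / players["original"]      # about 2.15
cluster_ratio = clusters["expanded"] / clusters["original"]   # about 1.46
size_ratio = avg_size["expanded"] / avg_size["original"]      # about 1.47

print({k: round(v, 1) for k, v in avg_size.items()})
# prints {'original': 12.8, 'expanded': 18.9}
```

The number of players grows by slightly more than a factor of two, while both the number of clusters and the average cluster size grow by a little under one half.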

Discussion
In this paper, we propose a flexible nonparametric Bayesian clustering approach for jointly modeling the shot attempts and goal percentage in NBA games. The clustering results provide useful insights in addition to player position and demographic information. For example, latent factors such as nature of shots that were taken (driving lay-ups compared with post-ups in the paint) are apparent in the clusters although they cannot be directly measured. Several future research directions remain open. First, we only use raw counts in the data analysis and it will be of interest to adjust for players' number of games (or minutes). For example, two players could appear to be low volume shooters, but one player might be a high volume shooter who was injured or plays few minutes while the other player plays many minutes but does not carry a heavy offensive load for their team. Intuitively these players should not be clustered together, but under the current framework they possibly could be.

Secondly, our method can be applied for analyzing other related data sets, such as to investigate clustering patterns in the playoffs. In the NBA, 16 of the 30 teams qualify for the annual postseason knockout tournament that ultimately determines the season champion. During playoffs, the level of competition is increased and the top players tend to play longer minutes while the less "valuable" players may play fewer minutes in the playoffs so the number of shots they take around the court will be reduced. It could be interesting to see if clustering patterns change as competition increases in quality and intensity and how it varies for players of different talent and skill levels.

In our model, we use a Dirichlet process mixture for clustering, and it will be of interest to consider other types of nonparametric priors such as the mixture of finite mixtures prior (Miller and Harrison, 2018). Many other statistics such as rebounds (offensive and defensive) and assists are recorded for every game and could be added to the model. This would allow the clusters to capture players' overall offensive contributions. In addition, several defensive statistics could also be added as auxiliary information to help us achieve a better understanding of the relationship between the offense and defense of NBA players. We leave these directions for future exploration.

Table 8: Cluster Shooting Tendencies for expanded analysis (column groups: 3-pointers, All FG): The proportion of made and attempted 3-pointers and averaged across all shots (2-pointers and 3-pointers). The goal percentage (Made/Att) is the percent across all shots taken by players in that cluster. Attempts (Att) is the average number of attempts per player in that cluster.