Community Detection in Google Searches Related to “Coronavirus”

Coronavirus and the COVID-19 pandemic have substantially altered the ways in which people learn, interact, and discover information. In the absence of everyday in-person interaction, how do people self-educate while living in isolation during such times? More specifically, do communities emerge in Google search trends related to coronavirus? Using a suite of network and community detection algorithms, we scrape and mine all Google search trends in America related to an initial search for “coronavirus,” from the first Google search on the term (January 16, 2020) through August 11, 2020. Results indicate a near-constant shift in the structure of how people educate themselves on coronavirus. Queries in the earliest days focus on “Wuhan” and “China,” then shift to “stimulus checks” at the height of the virus in the U.S., and finally shift to local surges of new cases in later days. A few communities emerge surrounding terms more overtly related to coronavirus (e.g., “cases” and “symptoms”). Yet, given the shift in related Google queries and the broader information environment, clear community structure for the full search space does not emerge.


Introduction
The impact of COVID-19 on the global society has been striking. From isolation and fear of contracting coronavirus to the slowing of many local economies and limitation of in-person interactions, the effects of COVID-19 have threatened the modern flow of life. People across the world are often living and working in isolation, or at least are more separated from local communities than prior to the pandemic. Thus, how do people process such an unprecedented global crisis? In other words, absent in-person interaction, how do people self-educate and learn about coronavirus?
Despite the newness of COVID-19 and coronavirus, there is an impressive amount of extant work on related topics. Much of the research surrounding online behavior and COVID-19 tends to focus on misinformation (Pennycook et al., 2020; Bastani and Bahrami, 2020), fake news (Apuke and Omar, 2020; van der Linden et al., 2020), and smartphone-enabled self-diagnosis (Collado-Borrell et al., 2020). Further, research on COVID-19 and online behavior is becoming increasingly creative and nuanced as the virus wears on, such as exploring online shopping (Laato et al., 2020), binge television watching (Dixit et al., 2020), and changes in sexual behavior (Lehmiller et al., 2020).
Yet, beyond explicitly pernicious or unique contexts, online knowledge dissemination (Chan et al., 2020) and information diffusion (Croce et al., 2020; Dinh and Parulian, 2020) surrounding the virus have also informed a good deal of recent work, such as followers and likes on social media, real-time scientific information distribution (Song and Karako, 2020), and even country-specific effects (Meier et al., 2020; Prem et al., 2020; Tuite Bogoch et al., 2020).
This body of rapidly developing research has also given rise to exploration of the social and behavioral effects of COVID-19, which are frequently evolving. For example, Kim and Bostwick (2020) explored the dimension of racial inequality, Sher (2020) and Gunnell et al. (2020) explored suicide, and Venkatesh and Edirappuli (2020) and Saltzman et al. (2020) explored mental health, all of which point to deepening and lasting effects of COVID-19 that will persist long after the immediate threat of the virus is over. Further still, Waggoner (2020) placed policymaking on COVID-19 into historical context by focusing on the American case, which adds to a growing body of work on COVID-19 in a policymaking context (Cairney and Wellstead, 2020; Hartley and Jarvis, 2020).
In sum, whether focusing on news media, education and social networks (Elmer et al., 2020), infection and mortality rates (Baud et al., 2020), or forecasting (Anastassopoulou et al., 2020; Perc et al., 2020), the COVID-19 pandemic has captured the attention of the global research community. The result is an ever-growing store of research aimed at unpacking and understanding the far-reaching effects of the virus on the global population (Grech, 2020), policymakers (Waggoner, 2020), and medical workers (Grasselli et al., 2020).
Building on, though departing from, some of the existing research on COVID-19 and online communities, as well as recent similar work (Effenberger et al., 2020; Rovetta and Bhagavathula, 2020), we address this question at the individual level, starting at the place most people start when interested in learning more about any topic: Google. To construct an exploratory research design to address this question, we leverage a suite of network and community detection algorithms to mine Google search trends related to an initial search for "coronavirus." We are interested in exploring that which people pair with the term "coronavirus" to understand how these individuals self-educate via Google (e.g., "coronavirus symptoms", where "symptoms" is the term of interest).
Results across several stages show that Google behavior related to "coronavirus" has trended with the various waves of the virus since its initial outbreak in December 2019. There is clear structure around commonly associated terms such as "cases" and around terms tied to specific regions in the United States, such as "county." Supporting results at a monthly interval of time demonstrate that while this clear aggregate structure exists at a relatively intuitive level, the structure of the Google search space has continued to evolve as time spent living with the virus has worn on. For instance, in the earliest days of the virus, searches tended to focus on approaches to combatting coronavirus, whereas at the virus's initial peak in March of 2020 in America, a spike in Googling of "stimulus checks" occurred. Community detection results show evidence of a few consistent communities in the aggregate, yet a rapidly evolving set of communities at the periphery, where sparse connections are uncovered.

Empirical Strategy
This analysis is focused on uncovering the structure of this Google search space of terms paired with an initial search for "coronavirus." To take a step in understanding this structure, we first use text mining techniques to build a corpus of all scraped Google search queries in America related to an initial search for "coronavirus" from January 16, 2020 (coronavirus's first Google query in the United States) to August 11, 2020.

Data: The Google Search Space
Prior to fitting and diagnosing networks of the Google search space related to "coronavirus," it is useful to first explore the trend of interest in "coronavirus" over time, both to validate selection of the primary search term ("coronavirus" over either "COVID" or "COVID-19") and to visualize the contours of relative interest in this term over time. To the latter point, the value is to place coronavirus into the context of the American search space of Google. In other words, visualization of Google searches for "coronavirus" relative to "COVID" or "COVID-19," shown in Figure 1, offers a baseline and thus justifies focusing only on "coronavirus" in the remainder of the analysis, with the ability to generalize only for Google usage in America. Figure 1 shows relative, or scaled, interest in "coronavirus" in the aggregate, scaled by the day with the highest level of interest (i.e., every day's count is divided by the count on the most frequent day of Googling). A few trends are notable. First, as expected, there are no searches prior to December 2019, which makes sense given the reported beginning and spread of the virus in the global community beginning in December 2019. Also, Figure 1 shows the peak of interest in Googling the virus in America was late March/early April, with a drop in interest after this point. This point is when coronavirus was at its initial height in America.
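The scaling described for Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the toy counts, and the choice to express the peak as 100 (rather than 1) are all assumptions made here for readability.

```python
def scale_interest(daily_counts):
    """Rescale raw daily counts so that the busiest day equals 100.

    `daily_counts` maps a date string to a raw search count. The 0-100
    scale is an assumption of this sketch; dividing by the peak without
    multiplying by 100 would give the same shape on a 0-1 scale.
    """
    peak = max(daily_counts.values())
    if peak == 0:
        # No searches at all: every day has zero relative interest.
        return {day: 0.0 for day in daily_counts}
    return {day: 100.0 * count / peak for day, count in daily_counts.items()}


# Hypothetical counts for illustration only (not data from this study).
toy_counts = {"2020-03-01": 40, "2020-03-15": 200, "2020-04-01": 120}
scaled = scale_interest(toy_counts)
```

Because every day is divided by the same peak value, the scaled series preserves the shape of the raw series and only changes its units.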
Building on the descriptive temporal trends in Figure 1, another helpful way to understand general patterns of Googling interest in "coronavirus" is to examine the geographic spread of the data. These results across the full study period are shown in Figure 2.
Surprisingly, in Figure 2, the highest relative interest in the United States is not in the (initial) hot-spot regions such as New York or Texas. Rather, some of the highest interest in "coronavirus" averaged across the full study period is in Idaho, New Mexico, and Michigan. This suggests that contraction of the virus is not necessarily linked to interest in the virus. Though this is not a causal claim, these descriptive patterns in Figure 2 suggest that explicit probing of the connection between interest and virus contraction may lead to unexpected, albeit interesting, patterns. For present purposes, the relative interest in "coronavirus" appears to be geographically diffuse. Such an exploratory look at the Google search space, combined with the descriptive temporal pattern shown in Figure 1, sets the stage for diving into the Google searches explicitly to deepen an understanding of the structure of self-education via Google in the time of COVID-19.

Methods: Text Mining and Networks
In this research we are interested in mining these text data using a network science framework. We opt for this approach to explore the space for several reasons. The global use of and access to Google implies some meta-level of connection and community across the full user base. Relatedly, as the users of Google are connected in general through a common interface, the use of Google to explore and learn about COVID-19 and coronavirus should also be community-based, where common threads in searching behavior should emerge given the common use of Google to self-educate on a variety of topics (e.g., Atlas et al., 2018; Aljilani and Kadobayashi, 2015; Ward et al., 2018). Also, community detection is a powerful approach to explore some space and learn from it when, as in our case, there is no clear causal question or framework. As with unsupervised machine learning, there are no sets of rules that define the searching and learning process in a formal way (Fortunato and Hric, 2016). As a result, the use of networks and community detection-based mining to home in on commonalities and differences across the Google search space is a reasonable approach to explore and learn the structure of some space. The corpus of Google searches is preprocessed by removing extraneous characters and stopwords (as well as the base search term "coronavirus" in order to focus on related terms), stripping whitespace, and making all terms lowercase.
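The preprocessing steps listed above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the tiny stopword list, the function name `preprocess`, the rule that only letters are kept, and the example queries are all hypothetical, and the software actually used for the analysis is not specified in the text.

```python
import re

# A deliberately tiny stopword list for illustration; a real analysis
# would use a fuller list (this choice is an assumption of the sketch).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "on"}
BASE_TERM = "coronavirus"  # dropped so that only the paired terms remain


def preprocess(query):
    """Lowercase a query, strip extraneous characters, and drop stopwords."""
    text = query.lower()
    # Replace anything that is not a lowercase letter or whitespace.
    # Note: this also removes digits, which may or may not be desired.
    text = re.sub(r"[^a-z\s]", " ", text)
    # str.split() with no argument also collapses repeated whitespace.
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS and t != BASE_TERM]


# Hypothetical queries, not taken from the scraped corpus.
queries = [
    "Coronavirus symptoms!",
    "  coronavirus cases in the   county ",
    "Coronavirus stimulus check",
]
cleaned = [preprocess(q) for q in queries]
```

Each cleaned query is a list of the terms paired with the base search term, which is the unit carried forward into the matrices described next.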
Figure 2: Visualization of relative interest in "Coronavirus" from Google trends searches by state across America from January 16, 2020 to August 11, 2020.
Upon preprocessing, staging the text data from related Google searches includes three steps. First, we create a term-document matrix (TDM), which gives the frequencies of terms across Google searches. Then, we translate the TDM into a term-term matrix (TTM), which gives the frequencies of co-occurrence across all terms with each other. The TTM is a necessary step allowing for the final stage, which is to create the adjacency matrix, A, a square matrix whose elements A_ij are either 0, indicating no connection between vertices v_i and v_j, or 1, indicating a connection between them (e.g., all elements on the main diagonal are of value 1). That is, A_ij = 1 if vertices v_i and v_j are connected and A_ij = 0 otherwise. Based on the adjacency matrix, we constructed an undirected graph, G = (V, E), on the full search space, where V contains the full set of vertices and E contains all edges between vertices v_i and v_j such that A_ij = 1. We also built individual networks for each month of Google searches to show the evolution of the structure of the Google search space over time. The month-based results are presented in the Supplementary Material in Figures 2-9. The full network is shown in Figure 3 and is used in the final stage of analysis. The patterns of node degrees are displayed in Figure 1 in the Supplementary Material.
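The three staging steps (TDM, then TTM, then adjacency matrix) can be illustrated on a toy corpus. This is a sketch under assumptions: the toy searches are hypothetical, and the rule that two terms are connected when they co-occur at least once is an assumption made here, since the text does not state the exact threshold used.

```python
from itertools import combinations

# Hypothetical preprocessed searches (each inner list is one "document").
docs = [["cases", "county"], ["symptoms", "cases"], ["stimulus", "check"]]

# Vocabulary in a fixed (alphabetical) order so rows and columns line up.
terms = sorted({t for doc in docs for t in doc})
n = len(terms)

# Step 1: term-document matrix, rows = terms, columns = searches.
tdm = [[doc.count(t) for doc in docs] for t in terms]

# Step 2: term-term matrix, i.e. TDM multiplied by its transpose.
# Entry (i, j) counts how often terms i and j appear in the same search.
ttm = [
    [sum(tdm[i][d] * tdm[j][d] for d in range(len(docs))) for j in range(n)]
    for i in range(n)
]

# Step 3: binary adjacency matrix. ASSUMPTION: any positive co-occurrence
# count is treated as a connection. The diagonal is 1 because every term
# co-occurs with itself, matching the description in the text.
adjacency = [[1 if ttm[i][j] > 0 else 0 for j in range(n)] for i in range(n)]

# Edge list of the undirected graph G = (V, E), ignoring self-loops.
edges = [
    (terms[i], terms[j])
    for i, j in combinations(range(n), 2)
    if adjacency[i][j] == 1
]
```

In this toy corpus "cases" links to both "county" and "symptoms", while "stimulus" and "check" form a separate pair, which is a miniature version of the dense and sparse regions discussed later.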
In the final stage, we fit and compare three widely used community detection algorithms to deepen the exploration of the Google search space. There are many ways to conceptualize community detection (Orman et al., 2012; Fortunato, 2010). For example, we could think of the formation of communities in a single space as a random walk, where modules of vertices in the network are recovered by taking brief, random walks. The algorithm for this approach is called "walktrap" (Pons and Latapy, 2005), and is premised on the assumption that short random walks tend to stay within the same densely connected region, so vertices reached by short walks likely share local structure, whereas longer walks indicate less local similarity across the network. Though there are many others, we proceed with three widely used approaches and compare results to strengthen reliability of the patterns. If patterns across all three are consistent, then this would be strong evidence that communities likely characterize this search space. If results vary, then perhaps the search space is not well partitioned, meaning people self-educate on coronavirus in unique, non-constant ways. The three algorithms are: the Girvan-Newman algorithm (often called "edge betweenness") (Girvan and Newman, 2002; Despalatović et al., 2014), propagating labels (Raghavan et al., 2007), and the Clauset-Newman-Moore algorithm (often called "greedy optimization of modularity") (Clauset et al., 2004). Edge betweenness iteratively removes the edges with the highest betweenness, that is, the edges crossed by the largest number of shortest paths, resulting in a rooted tree (dendrogram) structure with labels derived at different splits in the tree. The propagating labels algorithm is local, resembling a neighbor-based approach to clustering vertices, where labels are derived based on the majority of similar vertices in a small neighborhood. These labels, which are derived on the basis of local looks at the data, are iteratively updated and the algorithm stops when the partitioning of the space no longer changes.
Greedy optimization of modularity operates by greedily searching the space to derive labels on the basis of the maximum modularity scores. Though a computationally efficient and widely used local method, some have discovered problems with module size and scaling (see, e.g., Fortunato and Barthelemy, 2007).
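As a rough illustration of fitting and comparing the three algorithms, the sketch below runs them on a small toy graph. Several things here are assumptions rather than the authors' setup: the use of the Python library networkx (the text does not name the software used), the toy "barbell" graph in place of the Google search network, taking only the first Girvan-Newman split, and the random seed for label propagation.

```python
import networkx as nx
from networkx.algorithms import community

# Toy graph: two 4-vertex cliques (0-3 and 4-7) joined by one edge (3-4).
# This stands in for the search network purely for illustration.
G = nx.barbell_graph(4, 0)

# Girvan-Newman ("edge betweenness"): take the first split it produces.
gn = [set(c) for c in next(community.girvan_newman(G))]

# Label propagation (asynchronous variant); the result can depend on the seed.
lpa = [set(c) for c in community.asyn_lpa_communities(G, seed=1)]

# Clauset-Newman-Moore ("greedy optimization of modularity").
cnm = [set(c) for c in community.greedy_modularity_communities(G)]

results = {"edge betweenness": gn, "label propagation": lpa, "greedy modularity": cnm}
for name, parts in results.items():
    # Report the number of communities and the modularity of each partition.
    print(name, len(parts), round(community.modularity(G, parts), 3))
```

Comparing the number of communities and their modularity across the three partitions is one simple way to judge whether the algorithms agree on the structure of a network.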
The value of community detection in this and similar applications is to offer more precise exploration as to whether local communities characterize some network space, which in our case is the Google search space related to "coronavirus." If communities are uncovered, the network would be composed of more densely connected modules or "clusters", implying uniformity in self-education. This pattern is compared to the alternative of a sparse space, suggesting people are searching for and learning about coronavirus in very different ways, such that no structure emerges. The value of these three algorithms, and thus the justification for selecting them, is their widespread usage, understanding, and theoretical grounding. Understanding of the algorithms and the patterns that emerge, then, will be less burdensome, compared to other less well-known algorithms that may limit interpretability.
The first algorithm calculates the betweenness of an edge as the number of shortest paths between pairs of vertices that run along that edge. Girvan and Newman (2002) applied this definition to edges to locate the links that run between modules. Because edges connecting different modules carry many shortest paths (high betweenness), whereas edges inside a module carry few, removing high-betweenness edges lets us iteratively home in on a likely module/community. Through a process of progressively removing the edges with the highest edge betweenness scores, the algorithm constructs a hierarchical tree ("dendrogram"), which represents the latent structure of the network.
The second "propagating labels" algorithm finds an optimal representation of community when the labels for all vertices, v ∈ V, have broken ties and thus look like a majority of the other labels surrounding the candidate vertex, v_i. Importantly, this algorithm, which is detailed in section three of Raghavan et al. (2007), shuffles the vertices into random order, and then attempts to recover a version of the network where smaller neighborhoods of vertices are surrounded by a majority of like-labeled vertices. This strategy informs the labels assigned to vertices, which is the indication of the community structure.
Finally, for a different approach to community detection, the third algorithm is interested in finding structure of subgraphs within the network based on a maximal number of edges included in a community, compared to some expected value of a random version of the same network. This process of attempting to find structure from a randomized version of the data is indeed common and is similar to other techniques concerned with data representation, e.g., uniform manifold approximation and projection (McInnes et al., 2018). The formalization is not reproduced here to avoid redundancy, but is clearly and simply laid out in "Equations (1)-(7)" in "Section 2" of Clauset et al. (2004).
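To make the edge betweenness score concrete, the sketch below counts shortest paths through each edge of a small unweighted graph using breadth-first search. It is an illustrative implementation written for this example (in the spirit of Brandes-style accumulation), not the code used in the analysis; the function name and the toy graph of two triangles joined by a bridge are assumptions.

```python
from collections import deque


def edge_betweenness(adj):
    """Count shortest paths through each edge of an undirected, unweighted graph.

    `adj` maps each vertex to the set of its neighbours. When several
    shortest paths tie, each contributes a fractional share.
    """
    scores = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        # Breadth-first search, recording the number of shortest paths.
        dist = {source: 0}
        n_paths = {v: 0 for v in adj}
        n_paths[source] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Walk back from the farthest vertices, crediting each edge used.
        credit = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = n_paths[v] / n_paths[w] * (1.0 + credit[w])
                scores[frozenset((v, w))] += share
                credit[v] += share
    # Every unordered pair of endpoints was visited from both sides.
    return {edge: s / 2.0 for edge, s in scores.items()}


# Toy graph: triangles a-b-c and d-e-f joined by the bridge c-d.
toy_adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
scores = edge_betweenness(toy_adj)
bridge = max(scores, key=scores.get)  # the first edge Girvan-Newman would cut
```

In the toy graph every shortest path between the two triangles must cross the bridge, so it receives the highest score and would be removed first, splitting the graph into its two triangles.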
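The label-updating idea can be sketched in plain Python. This is a simplified illustration in the spirit of Raghavan et al. (2007), not a faithful reproduction of their procedure or of the software used here: the function name, the seed, the sweep limit, and the toy graph are assumptions of the sketch.

```python
import random
from collections import Counter


def label_propagation(adj, seed=0, max_sweeps=100):
    """Assign community labels by repeated majority vote among neighbours.

    `adj` maps each vertex to the set of its neighbours. Each vertex starts
    with its own label; ties between equally common labels are broken at random.
    """
    rng = random.Random(seed)
    labels = {v: v for v in adj}
    vertices = sorted(adj)
    for _ in range(max_sweeps):
        rng.shuffle(vertices)  # visit vertices in a fresh random order
        changed = False
        for v in vertices:
            if not adj[v]:
                continue  # isolated vertices keep their own label
            counts = Counter(labels[w] for w in adj[v])
            top = max(counts.values())
            best = sorted(lab for lab, c in counts.items() if c == top)
            if labels[v] in best:
                continue  # already carries a majority label
            labels[v] = rng.choice(best)
            changed = True
        if not changed:
            break  # the partition no longer changes
    return labels


# Toy graph: triangles a-b-c and d-e-f joined by the bridge c-d.
toy_adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
labels = label_propagation(toy_adj, seed=0)
```

Because ties are broken at random, different seeds can yield different partitions of the same graph, which is one reason to compare this method against the other two algorithms.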
In sum, all of these algorithms have a common goal: to search for and uncover latent, nonrandom structure in the network. Yet, though the goal is common, the process for uncovering this structure is quite different as previously discussed. The value of this empirical approach for present purposes is to offer three very differently constructed, but similarly motivated methods for a more holistic and thorough search of the Google search space.
The empirical strategy of this analysis is based on uncovering and understanding how people self-educate in the current global pandemic when they are more restricted in in-person interaction, and whether their search patterns are, on average, more similar or more different.

Network of the Full Google Search Space
The network of the full Google search space related to "coronavirus" is shown in Figure 3. There are two notable trends that emerge. The first is that the full search space is partitioned into roughly two halves.
The first half includes a few dense regions around search terms like "cases," "symptoms," and "county." These make sense at an intuitive level given people's desires to know where new cases and outbreaks are occurring, and also to find information on how best to self-diagnose symptoms, in line with work showing increased individual behavior online for similar goals (Collado-Borrell et al., 2020). These are indeed core aspects of self-education relating to a rapidly spreading virus like coronavirus in the context of a rapidly growing information environment (Gupta et al., 2020).
The other half of the search space though is much sparser, with very little connection across related searches. For example, there is a range of searches on distinct, nuanced topics like "Tom Hanks," "student loan forgiveness," "ibuprofen," and "Ecuador." This suggests that roughly half of the people Googling coronavirus are interested in more niche aspects often only tangentially related to the virus, but perhaps more directly related to personal contexts ("student loan forgiveness") or interests (e.g., "Tom Hanks" in Figure 4 in the Supplementary Material). At this stage, it appears that the likely communities that characterize the former, denser part of the search space may be concentrated around more overtly related topics of broader interest.

Community Detection
Community detection is most useful to explore whether latent structure exists in a network. Latent structure in this context translates to patterns that are unobserved and uncoordinated (in that people are Googling in isolation), yet arise because people are self-educating in similar ways. Similarity in network structure manifests in the form of modules or "clusters." That which defines a module is twofold: the density of connections within a module, and the sparsity of connections between modules. If the former and the latter are high, then the space is considered to have high modularity, substantively suggesting a clear partitioning of the Google search space. Our goal in this final section, then, is to explore whether latent structure exists in this space or not, which will push us closer to understanding how people self-educate in a global pandemic of this sort. The results for all three algorithms, including the base network in the upper left panel, are shown in Figure 4. The layout in all of the networks in Figure 4 is derived using the Fruchterman-Reingold algorithm, which uses a force-directed approach to ensure consistent placement of vertices, which helps in cases such as the current analysis, where different versions of a single graph are displayed and compared (Fruchterman and Reingold, 1991). Color in all plots varies by group labels found from each community detection algorithm.
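The notion of modularity described above can be made concrete with a small calculation. The sketch uses the standard definition for an undirected, unweighted graph (the share of edges inside each community minus the share expected if edges were placed at random with the same degrees); the function name and toy graph are assumptions of this example.

```python
def modularity(edges, communities):
    """Modularity of a partition of an undirected, unweighted graph.

    For each community: (internal edges / m) - (degree sum / 2m) ** 2,
    summed over communities, where m is the total number of edges.
    """
    m = len(edges)
    membership = {v: k for k, group in enumerate(communities) for v in group}
    internal = [0] * len(communities)
    degree_sum = [0] * len(communities)
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(
        internal[k] / m - (degree_sum[k] / (2 * m)) ** 2
        for k in range(len(communities))
    )


# Toy graph: triangles a-b-c and d-e-f joined by the bridge c-d.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
             ("d", "e"), ("d", "f"), ("e", "f")]

q_split = modularity(toy_edges, [{"a", "b", "c"}, {"d", "e", "f"}])
q_single = modularity(toy_edges, [{"a", "b", "c", "d", "e", "f"}])
```

Splitting the toy graph into its two triangles gives a modularity of 5/14 (about 0.357), whereas treating the whole graph as one community gives 0, reflecting dense connections within each triangle and a single connection between them.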
Figure 4: Results comparing three community detection algorithms used to explore the Google search space.
The structure found across all algorithms mirrors the structure found in the full network in Figure 3, where about half of the space is densely connected, suggesting only a few communities. The other half of the space is not as densely connected, where many individual vertices are treated as unique communities in their own right. Across the full search space, all of the algorithms found around 150 communities. In light of this sparsity, future work might consider constraining the search space to only include frequently occurring search terms relative to some benchmark (e.g., terms appearing at least 20 times). Though such an approach would technically provide a cleaner look at the space, it would also screen out some of the nuance found here. As this analysis is an exploratory first look at the Google search space, no such constraint was employed to allow for natural structure to emerge.
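The frequency constraint suggested for future work could look like the sketch below. It was not applied in this analysis; the function names, the toy corpus, and the use of 20 as the default benchmark (taken from the example in the text) are illustrative assumptions.

```python
from collections import Counter


def frequent_terms(docs, min_count=20):
    """Return the set of terms that appear at least `min_count` times overall."""
    counts = Counter(t for doc in docs for t in doc)
    return {t for t, c in counts.items() if c >= min_count}


def filter_docs(docs, keep):
    """Drop every term that is not in `keep` from each search."""
    return [[t for t in doc if t in keep] for doc in docs]


# Hypothetical corpus: a common pairing repeated 25 times, a rare one 3 times.
toy_docs = [["cases", "county"]] * 25 + [["tom", "hanks"]] * 3

keep = frequent_terms(toy_docs, min_count=20)
filtered = filter_docs(toy_docs, keep)
```

In the toy corpus the rare pairing disappears entirely after filtering, which illustrates the trade-off noted above: a cleaner network at the cost of the niche searches.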
In addition to the sparsity of part of the network, all three algorithms detected some consistent and dense trends, including a community surrounding the search term "county." Further, there is similarity in other densely connected regions surrounding "cases," "symptoms," and "stimulus check." The similarities across the algorithms relating to these key terms are important, because regardless of the specific algorithm, we would expect clear, non-random communities to emerge if such regions are truly present in the data. As these regions are indeed clear and consistent across the algorithms, the structure of this Google search space is at least in part indeed non-random. This is the benefit of selecting three different, but widely understood algorithms to detect community pattern.
Yet, despite the similarities, there is also variance in the labels found by the algorithms across these communities. For example, the propagating labels algorithm treated the "county" community similarly to the "update" and "symptoms" communities, though these few communities remain largely stable across each algorithm. Interestingly, the partitioning of the dense communities varied across all algorithms, with surprisingly few communities found by the greedy optimization of modularity algorithm in the lower right plot. This is surprising because of the algorithm's focus on locally and greedily searching for communities, rather than a broader and more global focus. A naive expectation, thus, might be more fragmentation in communities from greedy optimization, but this is not what is seen in Figure 4.
In sum, these algorithms are all picking up on Googling interest in geolocations of COVID-19 outbreaks, in line with work showing such variation is fast moving and widespread at both the U.S. county level (Javan et al., 2020), and abroad (Pobiruchin et al., 2020). And from a technical perspective, the value of fitting multiple algorithms to a common data space such as in our case, is that different versions of community construction are able to emerge, which ultimately offer the researcher greater flexibility in interpretation. Such a step is valuable in an exploratory study of this sort to pave the way for future, more targeted and causal studies.

Concluding Remarks
There are a few broad conclusions to draw based on these results. First, people tend to search in an uncoordinated, though still common fashion relating to more obvious terms associated with coronavirus (e.g., "location," "cases," "symptoms"). Yet, for much of the space (and over time), there remain shifting and tangential-type queries that vary widely, suggesting the space is not clearly partitioned.
These two themes make intuitive sense as a start. Where the information environment surrounding the Coronavirus is both in flux with new details emerging frequently but also stable at a higher level regarding common themes like outbreak locations and symptoms, the individual search space takes on a unique character. About half of the search space includes a few prominent communities related to more common and understandable topics previously described, whereas the other half of the search space is less stable, and more rapidly evolving.
There are a number of future studies that could build on these exploratory findings. First, time is a likely driver of structure of this space given the rapidly shifting information environment referenced throughout. For example, do key exogenous events such as the U.S. electoral cycle prompt structural shifts in search patterns or not? Further, future work could pair these search data with social media data to explore whether parallel trends exist. Namely, do we see a similar type of split between a more stable, but narrower battery of search terms/topics paired with a wide array of tangential, shifting search terms/topics? Finally, anomaly detection would be useful to ask similar questions, but with more data as we live with the Coronavirus longer. For example, might these patterns be "seasonal," or could they be anomalous, where people's attention moves away from these types of issues over time?
Ultimately, beyond providing a deeper understanding of the contours of self-education in this unprecedented season of the Coronavirus and COVID-19, this research corroborates the base understanding of the extreme value of Google in modern society as a tool for self-education, which is a critical skill in a time of isolation and fear.

Supplementary Material
The