Automated Linking PUBMED Documents with GO Terms Using SVM

We have developed an automated scheme for linking PUBMED citations with GO terms using SVM (Support Vector Machine), a classification algorithm. The PUBMED database, with over 12 million citations, has been essential to life science researchers. More recently, GO (Gene Ontology) has provided a graph structure for the biological process, cellular component, and molecular function of genomic data. By mining the textual content of PUBMED citations and associating them with GO terms, we have built an ontological map for these databases so that users can search PUBMED via GO terms and, conversely, GO entries via PUBMED classification. Consequently, some interesting and unexpected knowledge may be captured from them for further data analysis and biological experimentation. This paper reports our results on the SVM implementation and the need to parallelize the training phase.


Introduction
With the exponential growth of biomedical data, life science researchers face a new challenge: how to systematically exploit the relationships between genes, sequences, and the biomedical literature (Yandell and Majoros, 2002). Most known genes are described in the biomedical literature, and PUBMED is a valuable database for this kind of information. PUBMED, developed by the U.S. National Library of Medicine (NLM), is a database of indexed bibliographic citations and abstracts (National Library of Medicine). It covers over 4,600 biomedical journals. PUBMED citations and abstracts are searchable via PUBMED or the NLM Gateway. The biomedical literature has much to say about gene sequences, but it also seems that sequences can tell us much about the biomedical literature. Currently, highly trained biologists read the literature and manually select appropriate Gene Ontology (GO) terms to annotate it. The Gene Ontology database was created more recently to provide an ontological graph structure for the biological process, cellular component, and molecular function of genomic data. McCray et al. (2002) show that GO is suitable as a resource for natural language processing (NLP) applications because a large percentage (79%) of the GO terms passed the NLP parser. They also show that 35% of the GO terms were found in a corpus collected from the MEDLINE database and 27% of the GO terms were found in the current edition of the Unified Medical Language System (UMLS). Recent work by Raychaudhuri et al. employs a "maximum entropy" technique to categorize 21 GO terms using training and test documents extracted from PUBMED with handcrafted keyword queries. Their models, trained on PUBMED documents published prior to 2001, achieved an accuracy of 72.8% when tested on documents published in 2001 (Raychaudhuri et al., 2002). Other work by T. C. Smith et al., completed in April 2003, shows that about 110,000 PUBMED abstracts can be linked to the Gene Ontology (Smith and Cleary, 2003). For comparison with Raychaudhuri et al. (2002), it was run on the same 21 categories and achieved an accuracy of 70.5% (at the precision-recall breakeven point).
Although these studies demonstrate that NLP is applicable to GO and that the PUBMED database can be linked to GO terms, fully exploiting both the PUBMED and GO databases poses inherent challenges. One is that there are too many class categories (i.e., GO terms), because GO is itself a large, complex graph. For example, the GO database released as of February 2005 contained a total of 17,593 terms (Gene Ontology Consortium). Furthermore, GO grows in coverage and evolves on a monthly cycle. Finally, PUBMED contains over 12 million article citations, and since 2002 over 2,000 new references have been added daily (National Library of Medicine).
In order to organize the PUBMED contents in a systematic and useful way, we believe that text classification and text clustering should be exploited extensively. Given the large scale of PUBMED, it is also important to look for parallel and scalable algorithms. Text classification is a "boiling down" of the specific content of a document into a set of one or more pre-defined labels (Hearst, 1999). Text clustering groups similar documents into a set of clusters based on features shared among subsets of the documents (Chakrabarti, 2000; Chen et al., 1996; Kohonen, 1998). In this paper, we have implemented a text classification system using SVM that can automatically link PUBMED citations with GO terms. Classification performance on three datasets of small, medium, and large size is excellent, with the exception of training time. We then examine how the training time of the SVM algorithm scales. From the performance results on the three dataset sizes, we conclude that SVM must be scaled up using grid computers for its most computation-intensive task: training.

Implementation
First we consider the basic terminology of text classification. Given a fixed set of classes and a training set S of labeled documents, the goal in document classification is to infer a classification rule from the training set S so that it classifies new examples with high accuracy (Joachims, 2001).
The naive Bayes (NB) classifier is a probabilistic classification method (Lewis, 1998). NB is based on Bayes' theorem and the naive Bayes independence assumption. Bayes' theorem implies that, to achieve the highest classification accuracy, a document $d$ should be assigned to the class $c_i$ for which $P(c_i \mid d)$ is highest. The naive Bayes independence assumption states that the probability of a word $w_i$ is independent of any other word $w_j$ given that the class is known. Although this assumption is clearly false, it allows easy estimation of the conditional probability $P(w_j \mid c_i)$. In the learning phase, NB estimates the class prior probabilities $P(c_i)$ and the conditional probability of each attribute $w_j$ given the class:

$$P(c_i) = \frac{|c_i|}{|S|},$$

where $|c_i|$ denotes the number of training documents in class $c_i$ and $|S|$ is the total number of training documents. Given a new document $d = \langle w_1, \ldots, w_m \rangle$, NB predicts the class as the one with the highest probability:

$$c^{*} = \arg\max_{c_i} P(c_i) \prod_{j=1}^{m} P(w_j \mid c_i), \qquad P(w_j \mid c_i) = \frac{1 + TF(w_j, c_i)}{|V| + \sum_{w \in V} TF(w, c_i)},$$

where $|V|$ is the total number of attributes in $V$ and $TF(w, c_i)$ is the overall number of times word $w$ occurs within the documents in class $c_i$.
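The NB estimates described above (class priors from document counts, word probabilities from per-class term frequencies) can be sketched in a few lines of Python. This is only an illustration: it assumes add-one (Laplace) smoothing and log-space scoring, and the function names and toy documents are hypothetical rather than taken from the experiments reported here.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Estimate log priors and smoothed log word probabilities.

    docs   -- list of token lists (assumed already stemmed, stop words removed)
    labels -- list of class names, one per document
    """
    vocab = {w for doc in docs for w in doc}
    class_doc_counts = Counter(labels)       # |c_i| for every class
    term_freq = defaultdict(Counter)         # TF(w, c_i)
    for doc, label in zip(docs, labels):
        term_freq[label].update(doc)

    # P(c_i) = |c_i| / |S|
    log_prior = {c: math.log(n / len(docs)) for c, n in class_doc_counts.items()}
    log_cond = {}
    for c in class_doc_counts:
        # Assumed add-one smoothing: (1 + TF(w, c)) / (|V| + sum of TF over V)
        denom = len(vocab) + sum(term_freq[c].values())
        log_cond[c] = {w: math.log((1 + term_freq[c][w]) / denom) for w in vocab}
    return log_prior, log_cond


def predict_nb(model, doc):
    """Return the class with the highest log P(c) + sum of log P(w | c)."""
    log_prior, log_cond = model

    def score(c):
        # Words never seen in training are skipped (an illustrative choice).
        return log_prior[c] + sum(log_cond[c][w] for w in doc if w in log_cond[c])

    return max(log_prior, key=score)


# Hypothetical toy corpus with two classes.
toy_docs = [["kinase", "phosphorylation"], ["kinase", "signal"],
            ["membrane", "transport"], ["membrane", "channel"]]
toy_labels = ["process", "process", "component", "component"]
toy_model = train_nb(toy_docs, toy_labels)
print(predict_nb(toy_model, ["kinase", "signal"]))  # process
```

Both functions make a single pass over the documents and the vocabulary, which matches the linear training and classification cost attributed to NB.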
At training time, NB requires time linear in both the number of training documents and the number of features, so its computational requirements are minimal. At classification time, a new example can also be classified in time linear in both the number of features and the number of classes. NB is particularly well suited to high-dimensional inputs and, owing to its simplicity and effectiveness, can often outperform more sophisticated classification methods (Liu et al., 1998).
Support Vector Machine (SVM) is an important classification method for binary classification problems (Joachims, 2001). SVM maps a given set of n-dimensional input vectors nonlinearly into a high-dimensional feature space and separates the two classes of data with a maximum-margin hyperplane.
For the multi-class classification problem, a binary SVM is generally trained for each class $c_i$, each on its own binary classification problem. Given a new document $d$ to be classified, each SVM estimates $P(c_i \mid d)$. The document is assigned to the class $c_i$ for which the corresponding $P(c_i \mid d)$ is highest. This reduction of a multi-class problem into $m$ binary tasks is called the one-versus-all method (Joachims, 2001).
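The one-versus-all decision step can be illustrated as follows. The sketch assumes that each binary classifier is linear and has already been trained, and it uses the raw decision value w . x + b as a stand-in for the per-class score; the class names, weights, and feature vector are hypothetical.

```python
def linear_score(weights, bias, x):
    """Decision value w . x + b of one binary linear classifier."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def one_vs_all_predict(classifiers, x):
    """Return the class whose binary classifier gives the highest score.

    classifiers -- dict mapping class name to a (weights, bias) pair
    x           -- feature vector of the document to classify
    """
    return max(classifiers, key=lambda c: linear_score(*classifiers[c], x))


# Hypothetical classifiers for three classes over three features.
toy_classifiers = {
    "class_a": ([1.0, -0.5, 0.0], -0.2),
    "class_b": ([-0.5, 1.0, 0.0], -0.2),
    "class_c": ([0.0, -0.5, 1.0], -0.2),
}
print(one_vs_all_predict(toy_classifiers, [0.0, 0.9, 0.1]))  # class_b
```

Because every class is scored independently, the m binary classifiers can be trained and evaluated separately, at the cost of one training run per class.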

Results and Discussion
In this section, we report our experimental results on the performance of the Support Vector Machine (SVM). Experiments were performed on a 2.8 GHz Pentium IV PC with 1 GB of memory running Linux. Algorithms were coded in GNU C/C++. For SVM, we chose a linear SVM because of its popularity and fast training time in text classification compared to non-linear SVMs (e.g., polynomial, radial basis function, or sigmoid SVMs) (Joachims, 2001). For the linear SVM it is important to select a good value of C, the amount of training error tolerated by the SVM. Among the candidate values C ∈ {0.05, 0.1, 0.5, 1.0, 5, 10, 1000}, we chose C = 5, since the linear SVM with C = 5 gave the best classification accuracy on our datasets. We used the SVM multiclass package by Joachims, which is an implementation of the multi-class SVM.
In this experiment we constructed three datasets (small, medium, and large) to evaluate the performance of the SVM algorithm. We used the holdout method to randomly divide each dataset into two parts: a training set and a test set. Table 1 lists detailed information on the datasets used in this experiment; each dataset contains 10 classes. After stemming and stop word removal, we obtained vocabularies of 47,436, 217,872, and 357,953 unique words for the small, medium, and large datasets, respectively.
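The holdout split and the choice of C over a fixed grid can be sketched as below. The split ratio is not stated in this paper, so the 30% test fraction is an assumption, and `accuracy_for_c` is a hypothetical callable that would train a linear SVM with the given C and return its accuracy on held-out data.

```python
import random

# Candidate values of C listed in the text.
C_GRID = [0.05, 0.1, 0.5, 1.0, 5, 10, 1000]


def holdout_split(items, test_fraction=0.3, seed=0):
    """Randomly divide items into a training list and a test list.

    test_fraction=0.3 is an assumed default, not a value from the paper.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def pick_best_c(c_values, accuracy_for_c):
    """Return the C with the highest accuracy.

    accuracy_for_c is a hypothetical function: C -> classification accuracy.
    Ties go to the value that appears earlier in c_values.
    """
    return max(c_values, key=accuracy_for_c)
```

Fixing the random seed makes the split reproducible, which is useful when comparing classifiers on the same training and test sets.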
We investigated how many documents are contained in multiple relevant classes in our datasets. Table 2 lists the number of citations in each dataset and the number of documents with N classes (1 ≤ N ≤ 4).
To construct training and test data, we first surveyed how many GO terms appear in PUBMED citations. For each GO term, we built a query statement, limiting the results to the Medical Subject Heading (MeSH) major topic field and to citations with abstracts in English (National Library of Medicine). After submitting all query statements to PUBMED, we found that 564 of the 17,593 GO terms occur in PUBMED citations. Table 3 lists the 10 most frequently occurring GO terms in PUBMED.
For evaluating performance, we use the standard recall, precision, and F1 measures. Recall (r) is the number of correct predictions made by the classification system divided by the total number of correct class assignments in the test data. Precision (p) is the number of correct predictions made by the classification system divided by the total number of the system's predictions. The $F_1$ measure combines recall and precision into a single, equally weighted measure:

$$F_1 = \frac{2pr}{p + r}.$$

As a feature selection method, mutual information (or information gain) (Cover and Thomas, 1991) was used to select 200, 600, and 1000 features that have the highest average mutual information with the class variable for each dataset. Table 4 summarizes the performance scores, precision (p), recall (r), and $F_1$, on the three datasets for vocabulary sizes of 200, 600, and 1000 words. Compared to NB, SVM performs extremely well, except for classifier training time (in CPU seconds; Table 5). Training required lengthy runs of 29190, 32071, and 50065 CPU seconds. Despite this problem, SVM can be scaled up using grid computers of size 200 and above, which are commonly operated in large research labs. Although PUBMED gains about 2,000 new entries every day, we will not retrain on the whole data collection. Table 5 shows the time to train and classify with the NB and SVM algorithms for each dataset.
The training time of linear SVM tends to increase dramatically with an increasing training set size and feature set size, although a linear SVM can be trained much faster than a nonlinear SVM (Joachims, 2001).
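For reference, the precision, recall, and F1 measures used in this evaluation can be computed as in the following sketch. It assumes micro-averaging over all document-label pairs, which is one common choice; the averaging scheme actually used is not stated here, and the label sets in the example are hypothetical.

```python
def precision_recall_f1(predicted, actual):
    """Micro-averaged precision, recall, and F1.

    predicted, actual -- lists of label sets, one set per document
    """
    correct = sum(len(p & a) for p, a in zip(predicted, actual))
    n_predicted = sum(len(p) for p in predicted)
    n_actual = sum(len(a) for a in actual)
    p = correct / n_predicted if n_predicted else 0.0
    r = correct / n_actual if n_actual else 0.0
    # F1 is the harmonic mean of p and r: 2pr / (p + r)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical predictions for three documents.
toy_predicted = [{"a"}, {"a", "b"}, {"c"}]
toy_actual = [{"a"}, {"b"}, {"b"}]
toy_p, toy_r, toy_f1 = precision_recall_f1(toy_predicted, toy_actual)
```

In this toy case two of the four predicted labels are correct and two of the three true labels are recovered, so p = 1/2, r = 2/3, and F1 = 4/7.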
The experiments were run on a 2.8 GHz Pentium IV PC with 1 GB of memory running Linux. If we scale up the environment to high-performance computing (e.g., 200+ CPUs), we believe that SVM is a viable algorithm for the automated linking of PUBMED documents with GO terms. Linear SVM is amenable to parallelization because its two mathematical operations can both be parallelized:
1. A linear mapping of an input vector into a high-dimensional feature space that is hidden from the input and output.
2. Construction of an optimal hyperplane from the features discovered in Step 1.
This hyperplane is a decision surface constructed to separate members of different classes in such a way as to maximize the distance between them. Finding the hyperplane is a convex optimization problem. The simplest solution is the gradient ascent approach, which follows the steepest ascent path to the optimal solution. A more efficient way is to use the chunking decomposition algorithms. The basic idea of parallelism is derived from these two algorithms: distribute the dataset to different processors and then aggregate the results until convergence. The convergence criterion is the Karush-Kuhn-Tucker conditions. It is straightforward to show that the parallel algorithm will always converge for distributed datasets.
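To make the convergence criterion concrete, the sketch below checks the Karush-Kuhn-Tucker conditions of the standard soft-margin SVM dual for a set of training examples. It is an illustration under stated assumptions rather than the implementation used in this work: the function name, the tolerance, and the example values are hypothetical. In a distributed setting, each processor could run such a check on its own chunk, and training would stop once no chunk reports a violator.

```python
def kkt_violators(alphas, margins, C, tol=1e-3):
    """Return indices of examples that violate the KKT conditions.

    alphas  -- dual variables alpha_i, each in [0, C]
    margins -- functional margins y_i * f(x_i) under the current model
    C       -- upper bound on alpha_i (the SVM regularization parameter)
    tol     -- numerical tolerance (an assumed value)
    """
    violators = []
    for i, (a, m) in enumerate(zip(alphas, margins)):
        if a <= tol:
            # alpha_i = 0: example must lie on or outside the margin
            ok = m >= 1 - tol
        elif a >= C - tol:
            # alpha_i = C: example must lie on or inside the margin
            ok = m <= 1 + tol
        else:
            # 0 < alpha_i < C: example must lie exactly on the margin
            ok = abs(m - 1) <= tol
        if not ok:
            violators.append(i)
    return violators


# Hypothetical state for four examples with C = 5.
print(kkt_violators([0.0, 2.5, 5.0, 0.0], [1.7, 1.0, 0.4, 0.6], C=5))  # [3]
```

The last example has a zero multiplier but falls inside the margin, so it is the only one that would be passed to the next optimization step.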