Biomedical Information Extraction
Biomedical information extraction (IE) is the process of scanning unstructured text for objects of interest (such as genes, proteins, or small molecules) and facts about these objects (such as the location or interactions of proteins within a cell). The large amount of data available coupled with the small number of relevant facts makes this a challenging problem for machine learning. Given a set of Medline journal abstracts manually tagged with biological relationships, our task is to learn a hypothesis that extracts only these relations from a set of abstracts and performs well on unseen abstracts. One of our major contributions in this domain was the creation and re-labeling of large datasets for use in ILP. We constructed background knowledge in first-order logic from many sources: we use the Sundance parser to find the sentence structure, then incorporate semantic properties from the Medical Subject Headings, the Gene Ontology, and the Online Medical Dictionary, as well as lexical properties such as 'alphanumeric' and 'capitalized,' and finally calculated statistical word frequency on the training set. Overall, we defined 215 logical predicates for use in describing biomedical domains, and have released our large protein-localization dataset as a challenge task for other researchers.
Recall-Precision Curves
When applying ILP to large relational datasets with a small number of positive examples, the most common way to measure performance in these domains is to use precision and recall, two evaluation metrics which focus on the correct classification of the positive examples. It is even more useful to create a recall-precision curve to illustrate the trade-off between these two measurements. We investigated the properties of recall-precision curves and found that there were some distinct similarities and differences with ROC curves, another standard machine learning evaluation metric. One can find the 'best' achievable curve in recall-precision space as an analog to the ROC convex hull, however, interpolation in recall-precision space is counter-intuitive and often performed incorrectly, and designing algorithms to optimize one metric will not necessarily optimize the other. Our findings were reported at the International Conference on Machine Learning in 2006.
Inductive Logic Programming
ILP algorithms attempt to construct clauses based on logical background structure that will separate positive and negative examples for a given concept. Gleaner is a parallel randomized-search method that quickly collects good clauses from a broad spectrum of points along the recall dimension in recall-precision curves. Clauses are generated in parallel for many seeds using a Rapid Random Restart search in Aleph, a widely used ILP algorithm, and clauses with high precision on a training set are saved for many different recall points. It then employs a thresholding method to combine the selected clauses and create disjoint theories, each focused on a different area of recall-precision space. Finally we evaluate the predictions of these theories on a testset to create our final recall-precision curve. We have extended the Gleaner algorithm in a number of directions. First, we have learned more diverse sets of clauses while maintaining the parallel nature of Gleaner in hopes of improving overall performance. Second, we have explored alternate combination schemes for the learned clauses in our theories and giving weight to the clause's individual performance. Finally, we can combine the output of Gleaner with other machine learning algorithms to predict probabilities for each testing example, as opposed to our usual recall-precision metrics. Our work with Gleaner was awarded the Best Student Paper award at the 2004 international conference on Inductive Logic Programming; an extended article on our Gleaner algorithm was recently published in the Machine Learning Journal.
Future Research Plans
I have enjoyed the interdisciplinary nature of my thesis research, and I see great value in using problems from outside domains to interest motivated undergraduates in computer science research. One such interdisciplinary field I wish to explore is computational social science, where the complex systems of individuals and their interactions are modeled by a computational simulation. My prior work in this area consists of developing a simulation of voting and political parties, along with modeling the evolutionary success of greedy versus altruistic strategies in a simple bargaining game; other researchers have simulated the foraging behavior of insects, communication within social networks, and biological interactions at the cellular level. Most often, the simulated agents are equipped with very rudimentary rule-based decision-making abilities. I plan to continue research on the application of ILP and other machine learning algorithms to these social simulations, first to enhance the abilities of agents within these simulations, and second to answer questions about the interlinked relationships among agents and their environment.
I am also interested in pursuing research on computer science education. The exponential growth in information and pervasiveness of computers in our society has been strangely accompanied by a marked decline in the number of students pursuing degrees in computer science. I plan to research how computer science is introduced and taught, both in the undergraduate introductory courses and earlier throughout the K-12 levels of education. I believe we need to explore interactive pedagogical methods and environments to engage student's interest in the fundamentals of computer science logic. For this purpose, I have joined the Computer Science Teachers Association, a special interest group of the ACM focused on K-12 computer science education, and I will continue to engage my students to think not only about what they are learning but how they learn.
