Esra Erdem Sabanci University Istanbul, Turkey
Recent advances in health and life sciences have led to generation of a large amount of biomedical data. To facilitate access to its desired parts, such a big mass of data has been represented in structured forms, like biomedical ontologies and databases. On the other hand, representing these ontologies and databases in different forms, constructing them independently from each other, and storing them at different locations have brought about many challenges for answering queries about the knowledge represented in these knowledge resources.
One of the challenges is for the users to be able represent a complex query in a natural language, and get its answers in an understandable form. Another challenge is to answer complex queries that require extraction and appropriate integration of relevant knowledge from different knowledge resources and/or that require auxiliary definitions, such as, chains of drug-drug interactions, cliques of genes based on gene-gene relations, or similarity/diversity of genes/drugs. Furthermore, once an answer is found for a complex query, the experts may need further explanations about the answer. Consider, for instance, the following queries:
- Q1 What are the genes that are targeted by the drug Epinephrine and that interact with the gene DLG4?
- Q2 What are the genes that are targeted by all the drugs that belong to the category Hmg-coa reductase inhibitors?
- Q3 What are the genes related to the gene ADRB1 via a gene-gene relation chain of length at most 3?
- Q4 What are the 3 most similar genes that are targeted by the drug Epinephrine?
Most of the existing biomedical querying systems (e.g, web services built over the available knowledge resources) support keyword search but not complex queries like Q1–Q4. Some of these complex queries, such as Q1 and Q2, can be represented in a formal query language (e.g., SQL/SPARQL) and then answered using Semantic Web technologies. However, queries, like Q3, that require auxiliary recursive definitions (such as transitive closure) cannot be directly represented in these languages; and thus such queries cannot be answered directly using Semantic Web technologies. The experts usually compute auxiliary relations in advance, for instance, by enumerating all drug-drug interaction chains or gene cliques, and then use these auxiliary relations to represent and answer a query like Q3. Similarity/diversity queries, like Q4, cannot be represented directly in these languages either, and require a sophisticated reasoning algorithm. Also, none of the existing systems can provide informative explanations about the answers, but point to related web pages of the knowledge resources available online.
We have introduced methods to handle these challenges using Answer Set Programming (ASP) [1]. To address the first challenge, based on our discussions with some experts, we have developed a controlled natural language for biomedical queries about drug discovery; this language is called BIOQUERY-CNL [2]. For instance, the queries Q1–Q4 are in BIOQUERY-CNL. We have also developed an algorithm to translate a query in BIOQUERY-CNL into ASP [2]. Then we have designed an intelligent user interface that allows users to enter biomedical queries in BIOQUERY-CNL and that presents the answers (possibly with explanations and related links, if requested) in BIOQUERY-CNL [3]. To address the second challenge, we have developed a rule layer over biomedical ontologies and databases, that not only extracts and integrates the concepts in these knowledge resources but also provides definitions of auxiliary concepts [4]. To extract knowledge from ontologies in RDF(S)/OWL, we could use the special external predicates of the ASP solver DLVHEX [5]; for other forms of knowledge resources, we have implemented special transformations into ASP. We have introduced an algorithm to identify the relevant parts of the rule layer and the knowledge resources with respect to the given query, and used ASP solvers to answer queries considering these relevant parts [6]. To address the third challenge, we have developed an algorithm to generate a shortest explanation for a given answer, with respect to the query and the relevant parts of the rule layer and the knowledge resources [6],[7].
Based on these methods, we have developed a software, called BIOQUERY-ASP, to find answers and generate explanations to complex biomedical queries. We have shown the applicability of BIOQUERY-ASP over large biomedical knowledge resources about genes, drugs and diseases, such as PHARMGKB, DRUGBANK, BIOGRID, CTD, and SIDER [6]. For queries that are not concerned about similarity/diversity of genes/drugs, we have used the ASP solver CLASP [8]. For similarity/diversity queries, we have utilized the online methods of Eiter et al. [9] for finding similar/diverse solutions, and used the solver CLASP-NK [10] — a variation of CLASP that computes similar/diverse answer sets for a program with respect to a given distance measure. For similarity/diversity of genes, we have considered the semantic and functional similarity measure that is defined over the gene ontology [11]. A demo of BIOQUERY-ASP is available at http://krr.sabanciuniv.edu/projects/BioQuery-ASP/ .
ACKNOWLEDGMENTS
Our work on biomedical query answering for drug discovery has involved, at various stages, contributions of the previous/current students Zeynep Coban, Mahircan Doganay, Halit Erdogan, Hilal Kosucu, Umut Oztok, Reyyan Yeniterzi, as well as collaborations with the experts Olivier Bodenreider (NIH) and Yelda Erdem (Sanovel Pharmaceuticals), and useful discussions with many other researchers. Parts of this work have been supported by TUBITAK Grant 108E229.
REFERENCES
[1] Lifschitz, V.: What is answer set programming? In: Proc. of AAAI (2008)
[2] Erdem, E., Yeniterzi, R.: Transforming controlled natural language biomedical queries into answer set programs. In: Proc. of the Workshop on BioNLP. pp. 117–124 (2009)
[3] Erdogan, H., Oztok, U., Erdem, Y., Erdem, E.: Querying biomedical ontologies in natural language using answer set programming. In: Proc. of SWAT4LS (2010)
[4] Bodenreider, O., Coban, Z.H., Doganay, M.C., Erdem, E.: A preliminary report on answering complex queries related to drug discovery using answer set programming. In: Proc. of ALPSWS (2008)
[5] Eiter, T., G.Ianni, R.Schindlauer, H.Tompits: Effective integration of declarative rules with external evaluations for Semantic-Web reasoning. In: Proc. of ESWC. (2006) http://www.kr.tuwien.ac.at/research/systems/dlvhex/
[6] Erdem, E., Erdem, Y., Erdogan, H., Oztok, U.: Finding answers and generating explanations for complex biomedical queries. In: Proc. of AAAI (2011)
[7] Oztok, U., Erdem, E.: Generating explanations for complex biomedical queries. In: Proc. of AAAI (2011)
[8] Gebser, M., Kaufmann, B., Neumann, A., Schaub, T.: CLASP: A Conflict-Driven Answer Set Solver. In: Proc. of LPNMR. pp. 260–265 (2007) http://www.cs.uni-potsdam.de/clasp/
[9] Eiter, T., Erdem, E., Erdogan, H., Fink, M.: Finding similar or diverse solutions in answer set programming. In: Proc. of ICLP. pp. 342–356 (2009)
[10] CLASP-NK: http://krr.sabanciuniv.edu/projects/SimilarDiverseSolnsInASP/ .
[11] Wang, J.Z., Du, Z., Payattakool, R., Yu, S.P., Chen, C.F.: A new method to measure the semantic similarity of go terms. Bioinformatics 23, 1274–1281 (2007)