A list of publications authored and co-authored by Robert Hoehndorf.
Note: Group members supervised by Robert are bolded. ✉ indicates corresponding authorship.
Molecular basis and cellular effects of Janus-class–driven cytoplasmic PYK2 coacervates
Colombo, Giovanni, Salem, Israa, Szczepski, Kacper, Yu, Piao, Alfaiyz, Shaden, Guzmán-Vega, Francisco Javier, Abogosh, Ahmed, Kulmanov, Maxat, Al-Harthi, Samah, Kadaré, Gress, Hoehndorf, Robert, Girault, Jean-Antoine, Jaremko, Łukasz, Momin, Afaque A. and Arold, Stefan T. Communications Biology (2026)
Bioinformatics
Kinase activity is increasingly linked to biomolecular phase separation. Focal adhesion kinase (FAK) forms membrane-associated condensates with paxillin to promote adhesion. Here we show that its paralogue, proline-rich tyrosine kinase 2 (PYK2), undergoes phase separation via a distinct mechanism. PYK2 forms cytoplasmic condensates primarily driven by its kinase-FAT linker (KFL) region. Overexpression of PYK2 induces condensates enriched in its autophosphorylated form, which sequester paxillin from focal adhesions and impair cell adhesion. We uncover an autoregulatory mechanism involving the KFL, linking self-association, autophosphorylation, and condensation. Uncommon among known phase separation drivers, KFL condensation is phosphorylation-independent and its sequence belongs to the "Janus" class. Using a transformer-based protein language model, we identified non-homologous sequences with similar features, many from adhesion and cytoskeletal regulators. We validated the phase-separating potential of several of these sequences in cells. These findings reveal a mechanism linking phase separation with kinase activation, and demonstrate distinct condensation behavior in homologs. Our results also highlight how protein concentration modulates condensate function, with implications for disease, and expand the landscape of phase separation drivers.
VarLand: A pipeline to map the structural landscape of missense variants at the proteome scale
Guzmán-Vega, Francisco J., Cardona-Londoño, Kelly J., González-Álvarez, Ana C., Peña-Guerra, Karla A., Althagafi, Azza, Khan, Tanisha, Hoehndorf, Robert and Arold, Stefan T. Journal of Biological Chemistry, vol. 302(2), pp. 111071 (2026)
Bioinformatics, Disease genetics
Genomic diversity and antimicrobial resistance of Staphylococcus aureus in Saudi Arabia: a nationwide study using whole-genome sequencing
Alarawi, Mohammed S., Altammami, Musaad, Abutarboush, Mohammed, Kulmanov, Maxat, Alkuraithy, Dalal M., Kafkas, Senay, Radley, Robert, Abdelhakim, Marwa, Aldakhil, Hind, Bawazeer, Reema A., Alolayan, Mohammed A., Alnafjan, Basel M., Huraysi, Abdulaziz A., Almaabadi, Amani, Suliman, Bandar A., Aljohani, Areej G., Hemeg, Hassan A., Almogbel, Mohammed S., Alazmi, Meshari, Bazaid, Abdulrahman S., Abujamel, Turki S., Hashem, Anwar M., Al-Zahrani, Ibrahim A., Abdoh, Mohammed S., Hobani, Haya I., Felemban, Rakan F., Alhazmi, Wafaa A., Hong, Pei-Ying, Alghoribi, Majed F., Aljohani, Sameera, Balkhy, Hanan, Alswaji, Abdulrahman, Alzayer, Maha, Alalwan, Bassam, Kaaki, Mai M., Hala, Sharif M., Fallatah, Omniya Ahmad, Bahitham, Wesam, Zakri, Samer, Alshehri, Mohammad A., Kameli, Nader, Algaissi, Abdullah, Alamer, Edrous, Alhazmi, Abdulaziz, Shajri, Amjad A., Darraj, Majid Ahmed, Kameli, Bandar, Sufyani, O. O., Rahama, Badreldin S., Bakr, Abrar A., Alhoshani, Fahad M., Alquait, Azzam A., Somily, Ali M., Albarrag, Ahmed M., Alosaimi, Lamia, Aldakeel, Sumayh A., Bahwerth, Fayez S., Khan, Mushtaq A., Abdelrahman, Tamir T., Fanning, Séamus, Tawfik, Essam A., Alyamani, Essam J., Gojobori, Takashi, Miyazaki, Satoru, Al-Fageeh, Mohammed B. and Hoehndorf, Robert Microbial Genomics, vol. 11(11) (2025)
Bioinformatics
Methicillin-resistant Staphylococcus aureus (MRSA) surveillance in regions with mass gatherings presents unique challenges for public health systems. Saudi Arabia, hosting millions of pilgrims annually, provides a distinctive setting for studying how human mobility shapes bacterial populations, yet comprehensive genomic surveillance data from this region remain limited. Here, we present an integrated analysis of S. aureus isolates collected across seven Saudi Arabian regions, combining whole-genome sequencing with extensive antimicrobial susceptibility testing and standardized metadata following findability, accessibility, interoperability and reusability data principles. Our analysis revealed striking differences between pilgrimage and non-pilgrimage cities. Pilgrimage cities showed significantly higher genetic diversity and antimicrobial resistance rates, harbouring numerous international strains, including recognized clones from diverse geographic origins. Reported lineage dynamics are changing, expanding toward community clones. While genomic prediction of antimicrobial resistance showed high accuracy for some antibiotics, particularly beta-lactams, with varying performance for others, it highlights the necessity for phenotypic testing in clinical settings. Our findings demonstrate how mass gatherings drive bacterial population structures and emphasize the importance of integrated surveillance approaches in regions with significant global connectivity and travel.
CAGI6 ID panel challenge: assessment of phenotype and variant predictions in 415 children with neurodevelopmental disorders (NDDs)
Aspromonte, Maria Cristina, Del Conte, Alessio, Zhu, Shaowen, Tan, Wuwei, Shen, Yang, Zhang, Yexian, Li, Qi, Wang, Maggie Haitian, Babbi, Giulia, Bovo, Samuele, Martelli, Pier Luigi, Casadio, Rita, Althagafi, Azza, Toonsi, Sumyyah, Kulmanov, Maxat, Hoehndorf, Robert, Katsonis, Panagiotis, Williams, Amanda, Lichtarge, Olivier, Xian, Su, Surento, Wesley, Pejaver, Vikas, Mooney, Sean D., Sunderam, Uma, Srinivasan, Rajgopal, Murgia, Alessandra, Piovesan, Damiano, Tosatto, Silvio C. E. and Leonardi, Emanuela Human Genetics, vol. 144(2–3), pp. 227–242 (2025)
Disease genetics
The Genetics of Neurodevelopmental Disorders Lab in Padua provided a new intellectual disability (ID) Panel challenge for computational methods to predict patient phenotypes and their causal variants in the context of the Critical Assessment of Genome Interpretation, 6th edition (CAGI6). Eight research teams submitted a total of 30 models to predict phenotypes based on the sequences of 74 genes (VCF format) in 415 pediatric patients affected by Neurodevelopmental Disorders (NDDs). NDDs are clinically and genetically heterogeneous conditions, with onset in infancy. Here, we assess the ability and accuracy of computational methods to predict comorbid phenotypes based on clinical features described in each patient and their causal variants. We also evaluated predictions for possible genetic causes in patients without a clear genetic diagnosis. Like the previous ID Panel challenge in CAGI5, seven clinical features (ID, ASD, ataxia, epilepsy, microcephaly, macrocephaly, hypotonia), and variants (Pathogenic/Likely Pathogenic, Variants of Uncertain Significance and Risk Factors) were provided. The phenotypic traits and variant data of 150 patients from the CAGI5 ID Panel Challenge were provided as a training set for predictors. The CAGI6 challenge confirms CAGI5 results that predicting phenotypes from gene panel data is highly challenging, with AUC values close to random, and no method able to predict relevant variants with both high accuracy and precision. However, a significant improvement is noted for the best method, with recall increasing from 66% to 82%. Several groups also successfully predicted difficult-to-detect variants, emphasizing the importance of variants initially excluded by the Padua NDD Lab.
Bouchama, Abderrezak, Gomez, Maria, Abdullah, Mashan L., Al Mahri, Saeed, Malik, Shuja Shafi, Yezli, Saber, Mohammad, Sameer, Lehe, Cynthia, Abuyassin, Bisher and Hoehndorf, Robert Journal of Applied Physiology, vol. 138(4), pp. 964–978 (2025)
Bioinformatics
Heat-related mortality remains a health challenge exacerbated by climate change, with sex-based differences in outcomes, yet the underlying mechanisms remain poorly understood. This study examined transcriptomic responses to heat exposure in peripheral blood mononuclear cells from 19 patients with heat stroke (HS; 8 males, mean age 64.8 ± 6.6 yr; 11 females, mean age 49.7 ± 11 yr) and 19 controls (11 males, mean age 48.9 ± 9.6 yr; 8 females, mean age 44.9 ± 11.8 yr). At admission, gene expression revealed upregulation of heat shock protein genes, and pathway analysis demonstrated activation of heat shock and unfolded protein responses across both sexes consistent with proteotoxic stress. However, distinct metabolic, oxidative stress, cell cycle control, and immune responses were observed within each sex. Females displayed inhibition of protein synthesis, oxidative phosphorylation, and metabolic pathways, including glucose metabolism, indicative of a hypometabolic state. Males maintained metabolic activity precooling and enhanced adenosine triphosphate production postcooling. Females activated nuclear factor erythroid 2-related factor 2 (NRF2)-mediated oxidative stress responses and inhibited DNA replication and mitosis, potentially mitigating genomic instability, whereas these pathways showed limited regulation in males. Females promoted innate immunity via interleukin (IL)-6, inflammasome, and triggering receptor expressed on myeloid cells 1 (TREM1) signaling, whereas males showed suppression of both innate and adaptive immunity, including IL-12, Th1, and T-cell receptor pathways. Upstream analysis identified over 100 transcription factors in both sexes. Males primarily relied on transcriptional mechanisms, whereas females also exhibited translational regulation via La ribonucleoprotein 1 (LARP1), fragile X messenger ribonucleoprotein 1 (FMR1), insulin-like growth factor 2 mRNA binding protein 1 (IGF2BP1), and eukaryotic translation initiation factor 6 (EIF6).
These findings suggest distinct, sex-specific molecular adaptations to heat stroke, underscoring the need for targeted therapeutic strategies to mitigate heat-induced morbidity and mortality. NEW & NOTEWORTHY Heat-related mortality continues to rise with climate change. Our transcriptomic analysis reveals distinct sex-specific metabolic strategies to heat stroke: females enter a hypometabolic state, an evolutionary adaptation that conserves energy, whereas males sustain metabolic activity. Transcription factors and a subset of translation regulators in females modulate proteostasis and bioenergetics, driving these sex-specific pathways. These novel findings highlight the critical need to consider sex-specific differences in heat-related illnesses and inform carefully targeted interventions to improve patient outcomes.
Ontology Embedding: A Survey of Methods, Applications and Resources
Chen, Jiaoyan, Mashkova, Olga, Zhapa-Camacho, Fernando, Hoehndorf, Robert, He, Yuan and Horrocks, Ian IEEE Transactions on Knowledge and Data Engineering, vol. 37(7), pp. 4193–4212 (2025)
Applied Ontology
Ontologies are widely used for representing domain knowledge and metadata, playing an increasingly important role in Information Systems, the Semantic Web, Bioinformatics and many other domains. However, the logical reasoning that ontologies can directly support is quite limited in learning, approximation and prediction. One straightforward solution is to integrate statistical analysis and machine learning. To this end, automatically learning vector representations for the knowledge of an ontology, i.e., ontology embedding, has been widely investigated. Numerous papers have been published on ontology embedding, but a lack of systematic reviews hinders researchers from gaining a comprehensive understanding of this field. To bridge this gap, we write this survey paper, which first introduces different kinds of semantics of ontologies and formally defines ontology embedding as well as its property of faithfulness. Based on this, it systematically categorizes and analyses a relatively complete set of over 80 papers, according to the ontologies they aim at and their technical solutions including geometric modeling, sequence modeling and graph propagation. This survey also introduces the applications of ontology embedding in ontology engineering, machine learning augmentation and life sciences, presents a new library mOWL and discusses the challenges and future directions.
Age-related differences in gene expression and pathway activation following heatstroke
Gomez, Maria, Al Mahri, Saeed, Abdullah, Mashan, Malik, Shuja Shafi, Yezli, Saber, Yassin, Yara, Khan, Anas, Lehe, Cynthia, Mohammad, Sameer, Hoehndorf, Robert and Bouchama, Abderrezak Physiological Genomics, vol. 57(2), pp. 65–79 (2025)
Bioinformatics
This study investigates the molecular responses to heatstroke in young and old patients by comparing whole-genome transcriptomes between age groups. We analyzed transcriptomic profiles from patients categorized into two age-defined cohorts: young (mean age = 44.9 ± 6 yr) and old (mean age = 66.1 ± 4 yr). Control subjects, exposed to similar environmental heat conditions but without developing heatstroke, were also included in the analysis to provide a baseline for comparison. Despite uniform heatstroke severity at admission, as indicated by core body temperature, consciousness level, and organ damage markers, notable gene expression differences emerged. Old patients showed 37% fewer differentially expressed genes compared with young patients at admission, with a shift toward gene upregulation, deviating from the usual downregulation seen in heat stress responses. Both age groups exhibited increased heat shock protein gene expression and activated the heat stress and unfolded protein responses, indicating comparable proteotoxic stress. Nonetheless, age-specific differences were evident in critical regulatory pathways like Sirtuin, mTOR, and p53 signaling, along with key pathways related to proteostasis, energy metabolism, oxidative stress, and immune responses. Following cooling, older adults exhibited a decline in the heat stress response and a cessation of the unfolded protein response, in contrast to the sustained responses seen in younger individuals. This pattern suggests an age-related adaptability or a diminished protective response capacity with aging. These findings provide insights into the biological mechanisms that may contribute to age-specific vulnerabilities to heat. NEW & NOTEWORTHY Our study reveals distinct molecular responses to heatstroke across age groups, with older adults showing fewer differentially expressed genes and an atypical pattern of gene upregulation, contrasting with the downregulation in usual heat stress responses.
It also uncovers a reduced heat stress response and an abbreviated unfolded protein response in older adults, likely impairing their cellular repair mechanisms. This contributes to increased vulnerability during severe heat waves, underscoring the urgent need for age-specific interventions.
The application of Large Language Models to the phenotype-based prioritization of causative genes in rare disease patients
Kafkas, Şenay, Abdelhakim, Marwa, Althagafi, Azza, Toonsi, Sumyyah, Alghamdi, Malak, Schofield, Paul N. and Hoehndorf, Robert Scientific Reports, vol. 15(1) (2025)
Disease genetics
Computational methods for identifying gene-disease associations can use both genomic and phenotypic information to prioritize genes and variants that may be associated with genetic diseases. Phenotype-based methods commonly rely on comparing phenotypes observed in a patient with databases of genotype-to-phenotype associations using measures of semantic similarity. They are constrained by the quality and completeness of these resources as well as the quality and completeness of patient phenotype annotation. Genotype-to-phenotype associations used by these methods are largely derived from the literature and coded using phenotype ontologies. Large Language Models (LLMs) have been trained on large amounts of text and data and have shown their potential to answer complex questions across multiple domains. Here, we evaluate the effectiveness of LLMs in prioritizing disease-associated genes compared to existing bioinformatics methods. We show that LLMs can prioritize disease-associated genes as well as, or better than, dedicated bioinformatics methods relying on pre-defined phenotype similarity, when gene sets range from 5 to 100 candidates. We apply our approach to a cohort of undiagnosed patients with rare diseases and show that LLMs can be used to provide diagnostic support that helps in identifying plausible candidate genes. Our results show that LLMs may offer an alternative to traditional bioinformatics methods to prioritize disease-associated genes based on disease phenotypes. They may, therefore, potentially enhance diagnostic accuracy and simplify the process for rare genetic diseases.
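The phenotype-based comparison mentioned in this abstract can be illustrated with a minimal sketch. It assumes a toy ontology and Jaccard similarity over ancestor-closed phenotype sets; the term names, gene names, and the choice of Jaccard are illustrative assumptions and not the measures or data evaluated in the paper.

```python
# Toy phenotype ontology mapping each term to its parents.
# All term and gene names here are hypothetical.
PARENTS = {
    "seizure": {"neuro_abnormality"},
    "ataxia": {"neuro_abnormality"},
    "neuro_abnormality": {"phenotype"},
    "short_stature": {"growth_abnormality"},
    "growth_abnormality": {"phenotype"},
    "phenotype": set(),
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def closure(terms):
    """Ancestor-closed set for a collection of phenotype terms."""
    closed = set()
    for term in terms:
        closed |= ancestors(term)
    return closed


def jaccard(terms_a, terms_b):
    """Jaccard similarity of two ancestor-closed phenotype sets."""
    a, b = closure(terms_a), closure(terms_b)
    return len(a & b) / len(a | b)


def rank_genes(patient_terms, gene_phenotypes):
    """Order genes by decreasing similarity to the patient's phenotypes."""
    return sorted(
        gene_phenotypes,
        key=lambda gene: jaccard(patient_terms, gene_phenotypes[gene]),
        reverse=True,
    )


GENES = {
    "GENE_A": {"seizure"},
    "GENE_B": {"short_stature"},
    "GENE_C": {"ataxia"},
}
ranking = rank_genes({"seizure"}, GENES)
print(ranking)
```

Closing each set under ancestors lets related but non-identical terms (here, two neurological phenotypes) contribute partial similarity, which is the basic idea behind ontology-based semantic similarity.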
Phased genome assemblies and pangenome graphs of human populations of Japan and Saudi Arabia
Background: Retinoblastoma is the most common intraocular malignancy of childhood, yet its genomic landscape remains incompletely defined, particularly in understudied populations. Beyond RB1 loss, the contribution of additional somatic and germline alterations to disease heterogeneity and clinical behavior is unclear. Methods: We performed whole-exome sequencing of 166 retinoblastoma samples from 166 patients with matched germline DNA, representing the largest cohort analyzed to date. Clinical data were available for 160 patients. Variant calling, copy number alteration (CNA) profiling, and integrative analyses were performed to characterize genetic drivers and their associations with clinical features. Results: Pathogenic RB1 variants were identified in 120 patients, and MYCN amplification in 6 patients. Additional recurrent alterations involved BCOR, CCND3, ERBB2, and PDGFRB. Copy number gains of 6p (41.3%) and 17q (8.1%) were significantly associated with high-risk features including rubeosis, subretinal seeding, and tumor extension beyond the lamina cribrosa. Germline ERBB2 variants correlated with orbital invasion, while germline PDGFRB variants were associated with second primary cancers. Together, these findings underscore the genetic heterogeneity of retinoblastoma and reveal novel genotype–phenotype correlations. Conclusions: This study provides the most comprehensive genomic characterization of retinoblastoma to date, expands the known mutational spectrum, and identifies biomarkers with direct clinical relevance. These insights have the potential to refine risk stratification, inform precision therapeutic strategies, and improve long-term outcomes for children with retinoblastoma.
NanoDesigner: resolving the complex-CDR interdependency with iterative refinement
Rios Zertuche, Melissa Maria, Kafkas, Şenay, Renn, Dominik, Rueping, Magnus and Hoehndorf, Robert Journal of Cheminformatics, vol. 17(1) (2025)
Other
Camelid heavy-chain only antibodies consist of two heavy chains and single variable domains (VHHs), which retain antigen-binding functionality even when isolated. The term “nanobody” is now more generally used for describing small, single-domain antibodies. Several antibody generative models have been developed for the sequence and structure co-design of the complementarity-determining regions (CDRs) based on the binding interface with a target antigen. However, these models are not tailored for nanobodies and are often constrained by their reliance on experimentally determined antigen–antibody structures, which are labor-intensive to obtain. Here, we introduce NanoDesigner, a tool for nanobody design and optimization based on generative AI methods. NanoDesigner integrates key stages (structure prediction, docking, CDR generation, and side-chain packing) into an iterative framework based on an expectation maximization (EM) algorithm. The algorithm effectively tackles an interdependency challenge where accurate docking presupposes a priori knowledge of the CDR conformation, while effective CDR generation relies on accurate docking outputs to guide its design. NanoDesigner approximately doubles the success rate of de novo nanobody designs through continuous refinement of docking and CDR generation.
Causal knowledge graph analysis identifies adverse drug effects
Toonsi, Sumyyah, Schofield, Paul N and Hoehndorf, Robert Bioinformatics, vol. 42(1), In: Lu, Zhiyong (Ed.) (2025)
Applied Ontology, Drug mechanisms
The data is available through https://github.com/bio-ontology-research-group/Mediation-Analysis-using-Causal-Knowledge-Graph.
Lattice-Based ALC Ontology Embeddings With Saturation
Zhapa-Camacho, Fernando and Hoehndorf, Robert Neurosymbolic Artificial Intelligence, vol. 1 (2025)
Applied Ontology
Generating vector representations (embeddings) of OWL ontologies is a growing task due to its applications in predicting missing facts and knowledge-enhanced learning in fields such as bioinformatics. The underlying semantics of OWL ontologies are expressed using Description Logics (DLs). Initial approaches to generate embeddings relied on constructing a graph out of ontologies, neglecting the semantics of the logic therein. Recent semantic-preserving embedding methods often target lightweight DL languages such as EL++, ignoring more expressive information in ontologies. Although some approaches aim to embed more descriptive DLs such as ALC, those methods require the existence of individuals, while many real-world ontologies are devoid of them. We propose an ontology embedding method for the ALC DL language that considers the lattice structure of concept descriptions. We use connections between DL and Category Theory to materialize the lattice structure and embed it using an order-preserving embedding method. We show that our method outperforms state-of-the-art methods in several knowledge base completion tasks. This is an extended version of our previous work, where we incorporate saturation procedures that increase the information within the constructed lattices. We make our code and data available at https://github.com/bio-ontology-research-group/catE.
LLM Agent Based Protein Function Prediction
Zhapa-Camacho, Fernando, Mashkova, Olga, Hoehndorf, Robert and Kulmanov, Maxat Biocomputing 2026, pp. 508–519 (2025)
Protein function prediction
Sa1216: Development of colorectal cancer and matched healthy organoids from Saudi patients: a case study
Alhattab, Dana, Barakeh, Duna, Khoja, Basma, Elhadi, Ahmad, Miro, Jameel, Alessy, Saleh A., Alharbi, Ahmed, Bokhary, Manal, Alzahrani, May, Ali, Saga, Almohamdi, Wadha, Hefni, Lama, Moretti, Manola, Liu, Yang, Abdelhakim, Marwa, Abdullah, Abeer, Alomaim, Waleed, Hoehndorf, Robert, Hauser, Charlotte and Alqahtani, Saleh A. Gastroenterology, vol. 169(1), pp. S-400 (2025)
Disease genetics
Su1295: Chemically defined peptide-based matrices enabling the development of colorectal organoid models for therapeutic applications and disease modeling
Alhattab, Dana, Barakeh, Duna, Khoja, Basma, Elhadi, Ahmad, Miro, Jameel, Alessy, Saleh A., Alharbi, Ahmed, Bokhary, Manal, Alzahrani, May, Ali, Saga, Almohamdi, Wadha, Hefni, Lama, Moretti, Manola, Abdullah, Abeer, Alomaim, Waleed, Hoehndorf, Robert, Hauser, Charlotte and Alqahtani, Saleh A. Gastroenterology, vol. 169(1), pp. S-734 (2025)
Disease genetics
Neuro-Symbolic AI in Life Sciences
Hoehndorf, Robert, Pesquita, Catia and Zhapa-Camacho, Fernando Handbook on Neurosymbolic AI and Knowledge Graphs (2025)
Neuro-symbolic AI
Life sciences have a long history of driving advancements in various disciplines, including mathematics, philosophy, and logic. In recent years, life sciences have also become a significant application area for Artificial Intelligence (AI) technologies, including for neuro-symbolic AI methods. The life sciences knowledge infrastructure, characterized by its widespread use of ontologies, complex annotation models, large size, and community standards, presents unique challenges and opportunities for neuro-symbolic AI. We outline how neuro-symbolic methods have been applied and developed to address these challenges. We describe semantic similarity measures, knowledge graph embeddings, ontology embeddings, and knowledge-enhanced learning in the context of formalized life science knowledge. While there has been significant progress, we also outline multiple remaining challenges that provide opportunities for future research.
Computational prediction of protein functional annotations
Kulmanov, Maxat and Hoehndorf, Robert Protein Function Prediction, pp. 3–28 (2025)
Protein function prediction
The informatics of developmental phenotypes
Schofield, Paul N., Hoehndorf, Robert, Gkoutos, Georgios V. and Smith, Cynthia L. Kaufman’s Atlas of Mouse Development Supplement, pp. 457–470 (2025)
Disease genetics
Annotating genomes with DeepGO protein function prediction tools
Tawfiq, Rund, Niu, Kexin, Kulmanov, Maxat and Hoehndorf, Robert Protein Function Prediction, pp. 171–189 (2025)
Bioinformatics, Protein function prediction
Predicting protein functions using positive-unlabeled ranking with ontology-based priors
Zhapa-Camacho, Fernando, Tang, Zhenwei, Kulmanov, Maxat and Hoehndorf, Robert Bioinformatics, vol. 40(Supplement_1), pp. i401–i409 (2024)
Applied Ontology, Protein function prediction
Automated protein function prediction is a crucial and widely studied problem in bioinformatics. Computationally, protein function is a multilabel classification problem where only positive samples are defined and there is a large number of unlabeled annotations. Most existing methods rely on the assumption that the unlabeled set of protein function annotations are negatives, inducing the false negative issue, where potential positive samples are trained as negatives. We introduce a novel approach named PU-GO, wherein we address function prediction as a positive-unlabeled ranking problem. We apply empirical risk minimization, i.e. we minimize the classification risk of a classifier where class priors are obtained from the Gene Ontology hierarchical structure. We show that our approach is more robust than other state-of-the-art methods on similarity-based and time-based benchmark datasets. Data and code are available at https://github.com/bio-ontology-research-group/PU-GO.
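To make the positive-unlabeled idea concrete, the following is a minimal sketch of a generic non-negative PU risk estimate for a single class. It is not PU-GO's ranking loss: the sigmoid surrogate loss, the function names, and the single scalar `prior` are assumptions for illustration, whereas the abstract states that class priors are derived from the Gene Ontology hierarchy.

```python
import math


def sigmoid_loss(score, label):
    """Surrogate loss in (0, 1); small when the sign of score matches label (+1 or -1)."""
    return 1.0 / (1.0 + math.exp(label * score))


def mean(values):
    return sum(values) / len(values)


def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk for one class.

    pos_scores: classifier scores for known positive samples
    unl_scores: classifier scores for unlabeled samples
    prior:      assumed fraction of positives among all samples (hypothetical input)
    """
    # Risk of treating known positives as positive.
    risk_pos_as_pos = mean([sigmoid_loss(s, +1) for s in pos_scores])
    # Risk of treating known positives as negative.
    risk_pos_as_neg = mean([sigmoid_loss(s, -1) for s in pos_scores])
    # Risk of treating unlabeled samples as negative.
    risk_unl_as_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])
    # Unlabeled data is a mixture of positives and negatives, so the negative
    # risk is estimated by subtracting the positive share. It is clipped at
    # zero because a true risk cannot be negative.
    neg_risk = risk_unl_as_neg - prior * risk_pos_as_neg
    return prior * risk_pos_as_pos + max(0.0, neg_risk)


good = nn_pu_risk(pos_scores=[5.0, 5.0], unl_scores=[-5.0, -5.0], prior=0.3)
bad = nn_pu_risk(pos_scores=[-5.0, -5.0], unl_scores=[5.0, 5.0], prior=0.3)
print(good < bad)
```

The key design point is that unlabeled samples are never counted as confirmed negatives; their contribution is corrected by the class prior, which is how this family of estimators avoids the false negative issue described above.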
Prioritizing genomic variants through neuro-symbolic, knowledge-enhanced learning
Althagafi, Azza, Zhapa-Camacho, Fernando and Hoehndorf, Robert Bioinformatics, vol. 40(5) (2024)
Applied Ontology, Disease genetics, Neuro-symbolic AI
EmbedPVP and all evaluation experiments are freely available at https://github.com/bio-ontology-research-group/EmbedPVP.
An open source knowledge graph ecosystem for the life sciences
Callahan, Tiffany J., Tripodi, Ignacio J., Stefanski, Adrianne L., Cappelletti, Luca, Taneja, Sanya B., Wyrwa, Jordan M., Casiraghi, Elena, Matentzoglu, Nicolas A., Reese, Justin, Silverstein, Jonathan C., Hoyt, Charles Tapley, Boyce, Richard D., Malec, Scott A., Unni, Deepak R., Joachimiak, Marcin P., Robinson, Peter N., Mungall, Christopher J., Cavalleri, Emanuele, Fontana, Tommaso, Valentini, Giorgio, Mesiti, Marco, Gillenwater, Lucas A., Santangelo, Brook, Vasilevsky, Nicole A., Hoehndorf, Robert, Bennett, Tellen D., Ryan, Patrick B., Hripcsak, George, Kahn, Michael G., Bada, Michael, Baumgartner, William A. and Hunter, Lawrence E. Scientific Data, vol. 11(1) (2024)
Applied Ontology
Translational research requires data at multiple scales of biological organization. Advancements in sequencing and multi-omics technologies have increased the availability of these data, but researchers face significant integration challenges. Knowledge graphs (KGs) are used to model complex phenomena, and methods exist to construct them automatically. However, tackling complex biomedical integration problems requires flexibility in the way knowledge is modeled. Moreover, existing KG construction methods provide robust tooling at the cost of fixed or limited choices among knowledge representation models. PheKnowLator (Phenotype Knowledge Translator) is a semantic ecosystem for automating the FAIR (Findable, Accessible, Interoperable, and Reusable) construction of ontologically grounded KGs with fully customizable knowledge representation. The ecosystem includes KG construction resources (e.g., data preparation APIs), analysis tools (e.g., SPARQL endpoint resources and abstraction algorithms), and benchmarks (e.g., prebuilt KGs). We evaluated the ecosystem by systematically comparing it to existing open-source KG construction methods and by analyzing its computational performance when used to construct 12 different large-scale KGs. With flexible knowledge representation, PheKnowLator enables fully customizable KGs without compromising performance or usability.
Protein function prediction as approximate semantic entailment
Kulmanov, Maxat, Guzmán-Vega, Francisco J., Duek Roggli, Paula, Lane, Lydie, Arold, Stefan T. and Hoehndorf, Robert Nature Machine Intelligence, vol. 6(2), pp. 220–228 (2024)
Applied Ontology, Protein function prediction
The Gene Ontology (GO) is a formal, axiomatic theory with over 100,000 axioms that describe the molecular functions, biological processes and cellular locations of proteins in three subontologies. Predicting the functions of proteins using the GO requires both learning and reasoning capabilities in order to maintain consistency and exploit the background knowledge in the GO. Many methods have been developed to automatically predict protein functions, but effectively exploiting all the axioms in the GO for knowledge-enhanced learning has remained a challenge. We have developed DeepGO-SE, a method that predicts GO functions from protein sequences using a pretrained large language model. DeepGO-SE generates multiple approximate models of GO, and a neural network predicts the truth values of statements about protein functions in these approximate models. We aggregate the truth values over multiple models so that DeepGO-SE approximates semantic entailment when predicting protein functions. We show, using several benchmarks, that the approach effectively exploits background knowledge in the GO and improves protein function prediction compared to state-of-the-art methods.
A reference quality, fully annotated diploid genome from a Saudi individual
Kulmanov, Maxat, Tawfiq, Rund, Liu, Yang, Al Ali, Hatoon, Abdelhakim, Marwa, Alarawi, Mohammed, Aldakhil, Hind, Alhattab, Dana, Alsolme, Ebtehal A., Althagafi, Azza, Angelov, Angel, Bougouffa, Salim, Driguez, Patrick, Park, Changsook, Putra, Alexander, Reyes-Ramos, Ana M., Hauser, Charlotte A. E., Cheung, Ming Sin, Abedalthagafi, Malak S. and Hoehndorf, Robert Scientific Data, vol. 11(1) (2024)
Bioinformatics
Critical assessment of variant prioritization methods for rare disease diagnosis within the rare genomes project
Stenton, Sarah L., O’Leary, Melanie C., Lemire, Gabrielle, VanNoy, Grace E., DiTroia, Stephanie, Ganesh, Vijay S., Groopman, Emily, O’Heir, Emily, Mangilog, Brian, Osei-Owusu, Ikeoluwa, Pais, Lynn S., Serrano, Jillian, Singer-Berk, Moriel, Weisburd, Ben, Wilson, Michael W., Austin-Tse, Christina, Abdelhakim, Marwa, Althagafi, Azza, Babbi, Giulia, Bellazzi, Riccardo, Bovo, Samuele, Carta, Maria Giulia, Casadio, Rita, Coenen, Pieter-Jan, De Paoli, Federica, Floris, Matteo, Gajapathy, Manavalan, Hoehndorf, Robert, Jacobsen, Julius O. B., Joseph, Thomas, Kamandula, Akash, Katsonis, Panagiotis, Kint, Cyrielle, Lichtarge, Olivier, Limongelli, Ivan, Lu, Yulan, Magni, Paolo, Mamidi, Tarun Karthik Kumar, Martelli, Pier Luigi, Mulargia, Marta, Nicora, Giovanna, Nykamp, Keith, Pejaver, Vikas, Peng, Yisu, Pham, Thi Hong Cam, Podda, Maurizio S., Rao, Aditya, Rizzo, Ettore, Saipradeep, Vangala G., Savojardo, Castrense, Schols, Peter, Shen, Yang, Sivadasan, Naveen, Smedley, Damian, Soru, Dorian, Srinivasan, Rajgopal, Sun, Yuanfei, Sunderam, Uma, Tan, Wuwei, Tiwari, Naina, Wang, Xiao, Wang, Yaqiong, Williams, Amanda, Worthey, Elizabeth A., Yin, Rujie, You, Yuning, Zeiberg, Daniel, Zucca, Susanna, Bakolitsa, Constantina, Brenner, Steven E., Fullerton, Stephanie M., Radivojac, Predrag, Rehm, Heidi L. and O’Donnell-Luria, Anne Human Genomics, vol. 18(1) (2024)
Bioinformatics, Disease genetics
Model methodology and performance was highly variable. Models weighing call quality, allele frequency, predicted deleteriousness, segregation, and phenotype were effective in identifying causal variants, and models open to phenotype expansion and non-coding variants were able to capture more difficult diagnoses and discover new diagnoses. Overall, computational models can significantly aid variant prioritization. For use in diagnostics, detailed review and conservative assessment of prioritized variants against established criteria is needed.
DeepGOMeta for functional insights into microbial communities using deep learning-based protein function prediction
Tawfiq, Rund, Niu, Kexin, Hoehndorf, Robert and Kulmanov, Maxat Scientific Reports, vol. 14(1) (2024)
Protein function prediction
Analyzing microbial samples remains computationally challenging due to their diversity and complexity. The lack of robust de novo protein function prediction methods exacerbates the difficulty in deriving functional insights from these samples. Traditional prediction methods, dependent on homology and sequence similarity, often fail to predict functions for novel proteins and proteins without known homologs. Moreover, most of these methods have been trained on largely eukaryotic data, and have not been evaluated on or applied to microbial datasets. This research introduces DeepGOMeta, a deep learning model designed for protein function prediction as Gene Ontology (GO) terms, trained on a dataset relevant to microbes. The model is applied to diverse microbial datasets to demonstrate its use for gaining biological insights. Data and code are available at https://github.com/bio-ontology-research-group/deepgometa
Causal relationships between diseases mined from the literature improve the use of polygenic risk scores
Toonsi, Sumyyah, Gauran, Iris Ivy, Ombao, Hernando, Schofield, Paul N and Hoehndorf, Robert Bioinformatics, vol. 40(11) (2024)
Disease genetics
The data are available through https://github.com/bio-ontology-research-group/causal-relations-between-diseases.
Semantic units: organizing knowledge graphs into semantically meaningful units of representation
Vogt, Lars, Kuhn, Tobias and Hoehndorf, Robert Journal of Biomedical Semantics, vol. 15(1) (2024)
Applied Ontology
Abstract
Background
In today’s landscape of data management, the importance of knowledge graphs and ontologies is escalating as critical mechanisms aligned with the FAIR Guiding Principles—ensuring data and metadata are Findable, Accessible, Interoperable, and Reusable. We discuss three challenges that may hinder the effective exploitation of the full potential of FAIR knowledge graphs.
Results
We introduce “semantic units” as a conceptual solution, although currently exemplified only in a limited prototype. Semantic units structure a knowledge graph into identifiable and semantically meaningful subgraphs by adding another layer of triples on top of the conventional data layer. Semantic units and their subgraphs are represented by their own resource that instantiates a corresponding semantic unit class. We distinguish statement and compound units as basic categories of semantic units. A statement unit is the smallest, independent proposition that is semantically meaningful for a human reader. Depending on the relation of its underlying proposition, it consists of one or more triples. Organizing a knowledge graph into statement units results in a partition of the graph, with each triple belonging to exactly one statement unit. A compound unit, on the other hand, is a semantically meaningful collection of statement and compound units that form larger subgraphs. Some semantic units organize the graph into different levels of representational granularity, others orthogonally into different types of granularity trees or different frames of reference, structuring and organizing the knowledge graph into partially overlapping, partially enclosed subgraphs, each of which can be referenced by its own resource.
Conclusions
Semantic units, applicable in RDF/OWL and labeled property graphs, offer support for making statements about statements and facilitate graph-alignment, subgraph-matching, knowledge graph profiling, and for management of access restrictions to sensitive data. Additionally, we argue that organizing the graph into semantic units promotes the differentiation of ontological and discursive information, and that it also supports the differentiation of multiple frames of reference within the graph.
Large-Scale Knowledge Integration for Enhanced Molecular Property Prediction
Ghunaim, Yasir and Hoehndorf, Robert Neural-Symbolic Learning and Reasoning, pp. 89-97 (2024)
Zhapa-Camacho, Fernando and Hoehndorf, Robert Neural-Symbolic Learning and Reasoning, pp. 355-369 (2024)
Bioinformatics
The Impact of Mechanical Cues on the Metabolomic and Transcriptomic Profiles of Human Dermal Fibroblasts Cultured in Ultrashort Self-Assembling Peptide 3D Scaffolds
Sherin Abdelrahman, Rui Ge, Hepi H. Susapto, Yang Liu, Faris Samkari, Manola Moretti, Xinzhi Liu, Robert Hoehndorf, Abdul-Hamid Emwas, Mariusz Jaremko, Ranim H. Rawas and Charlotte A. E. Hauser ACS Nano, vol. 17(15), pp. 14508-14531 (2023)
Bioinformatics
Cells' interactions with their microenvironment influence their morphological features and regulate crucial cellular functions including proliferation, differentiation, metabolism, and gene expression. Most biological data available are based on in vitro two-dimensional (2D) cellular models, which fail to recapitulate the three-dimensional (3D) in vivo systems. This can be attributed to the lack of cell-matrix interaction and the limitless access to nutrients and oxygen, in contrast to in vivo systems. Despite the emergence of a plethora of 3D matrices to address this challenge, there are few reports offering a proper characterization of these matrices or studying how the cell-matrix interaction influences cellular metabolism in correlation with gene expression. In this study, two tetrameric ultrashort self-assembling peptide sequences, FFIK and FIIK, were used to create in vitro 3D models using well-described human dermal fibroblast cells. The peptide sequences are derived from naturally occurring amino acids that are capable of self-assembling into stable hydrogels without UV or chemical cross-linking. Our results showed that 2D cultured fibroblasts exhibited distinct metabolic and transcriptomic profiles compared to 3D cultured cells. The observed changes in the metabolomic and transcriptomic profiles were closely interconnected and influenced several important metabolic pathways including the TCA cycle, glycolysis, MAPK signaling cascades, and hemostasis. Data provided here may lead to clearer insights into the influence of the surrounding microenvironment on human dermal fibroblast metabolic patterns and molecular mechanisms, underscoring the importance of utilizing efficient 3D in vitro models to study such complex mechanisms.
Improving the classification of cardinality phenotypes using collections
Sarah M. Alghamdi and Robert Hoehndorf Journal of Biomedical Semantics, vol. 14(1) (2023)
Disease genetics
We reformulate the phenotypes of collections of entities using an ontological theory of collections. By reformulating phenotypes of collections in phenotypes ontologies, we avoid potentially incorrect inferences pertaining to the cardinality of these collections. We apply our method to two phenotype ontologies and show that the reformulation not only removes some problematic inferences but also quantitatively improves biological data analysis.
Genomic landscape in Saudi patients with hepatocellular carcinoma using whole-genome sequencing: a pilot study
Mazen Hassanain, Yang Liu, Weam Hussain, Albandri Binowayn, Duna Barakeh, Ebtehal Alsolme, Faisal AlSaif, Ghaida Almasaad, Mohammed AlSwayyed, Maram Alaqel, Rana Aljunidel, Sherin Abdelrahman, Charlotte A. E. Hauser, Saleh Alqahtani, Robert Hoehndorf and Malak Abedalthagafi Frontiers in Gastroenterology, vol. 2 (2023)
Bioinformatics, Disease genetics
Our findings indicate that most of the HCC patients possess cancer-related genetic variants, and the altered pathways in these patients exhibit similarities. Notably, resistant patients exhibit a higher frequency of aberrations in sorafenib-related genes than do sensitive patients. Specifically, 4 out of 10 resistant individuals demonstrated 13 somatic mutations, whereas none of the three sensitive patients exhibited any. Similarly, 7 out of 10 resistant patients possessed 30 germline mutations, while none were observed in the sensitive group (two-sided Fisher's exact test; somatic: p=0.50, germline: 0.07). These results contribute to our understanding of the genetic landscape of HCC and highlight potential therapeutic targets that could aid in overcoming treatment resistance.
Starvar: symptom-based tool for automatic ranking of variants using evidence from literature and genomes
Șenay Kafkas, Marwa Abdelhakim, Mahmut Uludag, Azza Althagafi, Malak Alghamdi and Robert Hoehndorf BMC Bioinformatics, vol. 24(1) (2023)
Bioinformatics, Disease genetics
Abstract
Background
Identifying variants associated with diseases is a challenging task in medical genetics research. Current studies that prioritize variants within individual genomes generally rely on known variants, evidence from literature and genomes, and patient symptoms and clinical signs. The functionalities of the existing tools, which rank variants based on given patient symptoms and clinical signs, are restricted to the coverage of ontologies such as the Human Phenotype Ontology (HPO). However, most clinicians do not limit themselves to HPO while describing patient symptoms/signs and their associated variants/genes. There is thus a need for an automated tool that can prioritize variants based on freely expressed patient symptoms and clinical signs.
Results
STARVar is a Symptom-based Tool for Automatic Ranking of Variants using evidence from literature and genomes. STARVar uses patient symptoms and clinical signs, either linked to HPO or expressed in free text format. It returns a ranked list of variants based on a combined score from two classifiers utilizing evidence from genomics and literature. STARVar improves over related tools on a set of synthetic patients. In addition, we demonstrated its distinct contribution to the domain on another synthetic dataset covering publicly available clinical genotype–phenotype associations by using symptoms and clinical signs expressed in free text format.
Conclusions
STARVar stands as a unique and efficient tool that has the advantage of ranking variants with flexibly expressed patient symptoms in free-form text. Therefore, STARVar can be easily integrated into bioinformatics workflows designed to analyze disease-associated genomes.
Availability
STARVar is freely available from https://github.com/bio-ontology-research-group/STARVar.
Klarigi: Characteristic explanations for semantic biomedical data
Luke T. Slater, John A. Williams, Paul N. Schofield, Sophie Russell, Samantha C. Pendleton, Andreas Karwath, Hilary Fanning, Simon Ball, Robert Hoehndorf and Georgios V. Gkoutos Computers in Biology and Medicine, pp. 106425 (2023)
Applied Ontology
Annotation of biomedical entities with ontology classes provides for formal semantic analysis and mobilisation of background knowledge in determining their relationships. To date, enrichment analysis has been routinely employed to identify classes that are over-represented in annotations across sets of groups, such as biosample gene expression profiles or patient phenotypes, and is useful for a range of tasks including differential diagnosis and causative variant prioritisation. These approaches, however, usually consider only univariate relationships, make limited use of the semantic features of ontologies, and provide limited information and evaluation of the explanatory power of both singular and grouped candidate classes. Moreover, they are not designed to solve the problem of deriving cohesive, characteristic, and discriminatory sets of classes for entity groups. We have developed a new tool, called Klarigi, which introduces multiple scoring heuristics for identification of classes that are both compositional and discriminatory for groups of entities annotated with ontology classes. The tool includes a novel algorithm for derivation of multivariable semantic explanations for entity groups, makes use of semantic inference through live use of an ontology reasoner, and includes a classification method for identifying the discriminatory power of candidate sets, in addition to significance testing apposite to traditional enrichment approaches. We describe the design and implementation of Klarigi, including its scoring and explanation determination methods, and evaluate its use in application to two test cases with clinical significance, comparing and contrasting methods and results with literature-based and enrichment analysis methods. We demonstrate that Klarigi produces characteristic and discriminatory explanations for groups of biomedical entities in two settings. 
We also show that these explanations recapitulate and extend the knowledge held in existing biomedical databases and literature for several diseases. We conclude that Klarigi provides a distinct and valuable perspective on biomedical datasets when compared with traditional enrichment methods, and therefore constitutes a new method by which biomedical datasets can be explored, contributing to improved insight into semantic data.
mOWL: Python library for machine learning with biomedical ontologies
Fernando Zhapa-Camacho, Maxat Kulmanov and Robert Hoehndorf Bioinformatics, vol. 39(1) (2023)
Applied Ontology
Supplementary data are available at Bioinformatics online.
Exploring the Use of Ontology Components for Distantly-Supervised Disease and Phenotype Named Entity Recognition
Sumyyah Toonsi, Senay Kafkas and Robert Hoehndorf Proceedings of the International Conference on Biomedical Ontologies 2023 together with the Workshop on Ontologies for Infectious and Immune-Mediated Disease Data Science (OIIDDS 2023) and the FAIR Ontology Harmonization and TRUST Data Interoperability Workshop (FOHTI 2023), Brasília, Brazil, August 28 - September 1, 2023, vol. 3603, pp. 13-24, In: Fernanda Farinelli, Amanda Damasceno de Souza and Eduardo Ribeiro Felipe (Ed.) (2023)
Applied Ontology, Disease genetics
From Axioms over Graphs to Vectors, and Back Again: Evaluating the Properties of Graph-based Ontology Embeddings
Fernando Zhapa-Camacho and Robert Hoehndorf Proceedings of the 17th International Workshop on Neural-Symbolic Learning and Reasoning, La Certosa di Pontignano, Siena, Italy, July 3-5, 2023, vol. 3432, pp. 85-102, In: Artur S. d'Avila Garcez, Tarek R. Besold, Marco Gori and Ernesto Jiménez-Ruiz (Ed.) (2023)
Applied Ontology
Evaluating Different Methods for Semantic Reasoning Over Ontologies
Fernando Zhapa-Camacho and Robert Hoehndorf Joint Proceedings of Scholarly QALD 2023 and SemREC 2023 co-located with 22nd International Semantic Web Conference ISWC 2023, Athens, Greece, November 6-10, 2023, vol. 3592, In: Debayan Banerjee, Ricardo Usbeck, Nandana Mihindukulasooriya, Gunjan Singh, Raghava Mutharaju and Pavan Kapanipathi (Ed.) (2023)
Applied Ontology
Updating the CEMO ontology for future epidemiological challenges
Núria Queralt-Rosinach, Paul N. Schofield, Marco Roos and Robert Hoehndorf 14th International Conference on Semantic Web Applications and Tools for Health Care and Life Sciences (SWAT4HCLS 2023), Basel, Switzerland, February 13-16, 2023, vol. 3415, pp. 151-152, In: Atsuko Yamaguchi, Andrea Splendiani, M. Scott Marshall, Chris Baker, Jerven T. Bolleman, Albert Burger, Leyla Jael Castro, Ole Eigenbrod, Sabine Österle, Martin Romacker and Andra Waagmeester (Ed.) (2023)
Applied Ontology
Contribution of model organism phenotypes to the computational identification of human disease genes
Sarah Alghamdi, Paul N. Schofield and Robert Hoehndorf Disease Models & Mechanisms, vol. 15(7) (2022)
Disease genetics
Computing phenotypic similarity helps identify new disease genes and diagnose rare diseases. Genotype-phenotype data from orthologous genes in model organisms can compensate for lack of human data and increase genome coverage. In the past decade, cross-species phenotype comparisons have proven valuable, and several ontologies have been developed for this purpose. The relative contribution of different model organisms to computational identification of disease-associated genes is not fully explored. We used phenotype ontologies to semantically relate phenotypes resulting from loss-of-function mutations in model organisms to disease-associated phenotypes in humans. Semantic machine learning methods were used to measure the contribution of different model organisms to the identification of known human gene-disease associations. We found that mouse genotype-phenotype data provided the most important dataset in the identification of human disease genes by semantic similarity and machine learning over phenotype ontologies. Other model organisms' data did not improve identification over that obtained using the mouse alone, and therefore did not contribute significantly to this task. Our work has implications for the development of integrated phenotype ontologies, as well as for the use of model organism phenotypes in human genetic variant interpretation. This article has an associated First Person interview with the first author of the paper.
Combining biomedical knowledge graphs and text to improve predictions for drug-target interactions and drug-indications
Mona Alshahrani, Abdullah Almansour, Asma Alkhaldi, Maha A. Thafar, Mahmut Uludag, Magbubah Essack and Robert Hoehndorf PeerJ, vol. 10, pp. e13061 (2022)
Applied Ontology, Drug mechanisms
Biomedical knowledge is represented in structured databases and published in biomedical literature, and different computational approaches have been developed to exploit each type of information in predictive models. However, the information in structured databases and literature is often complementary. We developed a machine learning method that combines information from literature and databases to predict drug targets and indications. To effectively utilize information in published literature, we integrate knowledge graphs and published literature using named entity recognition and normalization before applying a machine learning model that utilizes the combination of graph and literature. We then use supervised machine learning to show the effects of combining features from biomedical knowledge and published literature on the prediction of drug targets and drug indications. We demonstrate that our approach using datasets for drug-target interactions and drug indications is scalable to large graphs and can be used to improve the ranking of targets and indications by exploiting features from either structured or unstructured information alone.
DeepSVP: integration of genotype and phenotype for structural variant prioritization using deep learning
Azza Althagafi, Lamia Alsubaie, Nagarajan Kathiresan, Katsuhiko Mineta, Taghrid Aloraini, Fuad Al Mutairi, Majid Alfadhel, Takashi Gojobori, Ahmad Alfares and Robert Hoehndorf Bioinformatics, vol. 38(6), pp. 1677-1684 (2022)
Disease genetics, Semantic interoperability
Supplementary data are available at Bioinformatics online.
A comprehensive update on CIDO: the community-based coronavirus infectious disease ontology
Yongqun He, Hong Yu, Anthony Huffman, Asiyah Yu Lin, Darren A. Natale, John Beverley, Ling Zheng, Yehoshua Perl, Zhigang Wang, Yingtong Liu, Edison Ong, Yang Wang, Philip Huang, Long Tran, Jinyang Du, Zalan Shah, Easheta Shah, Roshan Desai, Hsin-hui Huang, Yujia Tian, Eric Merrell, William D. Duncan, Sivaram Arabandi, Lynn M. Schriml, Jie Zheng, Anna Maria Masci, Liwei Wang, Hongfang Liu, Fatima Zohra Smaili, Robert Hoehndorf, Zoë May Pendlington, Paola Roncaglia, Xianwei Ye, Jiangan Xie, Yi-Wei Tang, Xiaolin Yang, Suyuan Peng, Luxia Zhang, Luonan Chen, Junguk Hur, Gilbert S. Omenn, Brian Athey and Barry Smith Journal of Biomedical Semantics, vol. 13(1) (2022)
Applied Ontology, Bioinformatics, Disease genetics
Abstract
Background
The current COVID-19 pandemic and the previous SARS/MERS outbreaks of 2003 and 2012 have resulted in a series of major global public health crises. We argue that in the interest of developing effective and safe vaccines and drugs and to better understand coronaviruses and associated disease mechanisms it is necessary to integrate the large and exponentially growing body of heterogeneous coronavirus data. Ontologies play an important role in standard-based knowledge and data representation, integration, sharing, and analysis. Accordingly, we initiated the development of the community-based Coronavirus Infectious Disease Ontology (CIDO) in early 2020.
Results
As an Open Biomedical Ontology (OBO) library ontology, CIDO is open source and interoperable with other existing OBO ontologies. CIDO is aligned with the Basic Formal Ontology and Viral Infectious Disease Ontology. CIDO has imported terms from over 30 OBO ontologies. For example, CIDO imports all SARS-CoV-2 protein terms from the Protein Ontology, COVID-19-related phenotype terms from the Human Phenotype Ontology, and over 100 COVID-19 terms for vaccines (both authorized and in clinical trial) from the Vaccine Ontology. CIDO systematically represents variants of SARS-CoV-2 viruses and over 300 amino acid substitutions therein, along with over 300 diagnostic kits and methods. CIDO also describes hundreds of host-coronavirus protein-protein interactions (PPIs) and the drugs that target proteins in these PPIs. CIDO has been used to model COVID-19 related phenomena in areas such as epidemiology. The scope of CIDO was evaluated by visual analysis supported by a summarization network method. CIDO has been used in various applications such as term standardization, inference, natural language processing (NLP) and clinical data integration. We have applied the amino acid variant knowledge present in CIDO to analyze differences between SARS-CoV-2 Delta and Omicron variants. CIDO's integrative host-coronavirus PPIs and drug-target knowledge has also been used to support drug repurposing for COVID-19 treatment.
Conclusion
CIDO represents entities and relations in the domain of coronavirus diseases with a special focus on COVID-19. It supports shared knowledge representation, data and metadata standardization and integration, and has been used in a range of applications.
DeepGOZero: improving protein function prediction from sequence and zero-shot learning based on ontology axioms
Maxat Kulmanov and Robert Hoehndorf Bioinformatics, vol. 38(Supplement_1), pp. i238-i245 (2022)
Applied Ontology, Protein function prediction
Supplementary data are available at Bioinformatics online.
Evaluating semantic similarity methods for comparison of text-derived phenotype profiles
Luke T. Slater, Sophie Russell, Silver Makepeace, Alexander Carberry, Andreas Karwath, John A. Williams, Hilary Fanning, Simon Ball, Robert Hoehndorf and Georgios V. Gkoutos BMC Medical Informatics and Decision Making, vol. 22(1) (2022)
Abstract
Background
Semantic similarity is a valuable tool for analysis in biomedicine. When applied to phenotype profiles derived from clinical text, it has the capacity to enable and enhance ‘patient-like me’ analyses, automated coding, differential diagnosis, and outcome prediction. While a large body of work exists exploring the use of semantic similarity for multiple tasks, including protein interaction prediction and rare disease differential diagnosis, there is less work exploring comparison of patient phenotype profiles for clinical tasks. Moreover, there are no experimental explorations of optimal parameters or better methods in the area.
Methods
We develop a platform for reproducible benchmarking and comparison of experimental conditions for patient phenotype similarity. Using the platform, we evaluate the task of ranking shared primary diagnosis from uncurated phenotype profiles derived from all text narrative associated with admissions in the medical information mart for intensive care (MIMIC-III).
Results
300 semantic similarity configurations were evaluated, as well as one embedding-based approach. On average, measures that did not make use of an external information content measure performed slightly better, however the best-performing configurations when measured by area under receiver operating characteristic curve and Top Ten Accuracy used term-specificity and annotation-frequency measures.
Conclusion
We identified and interpreted the performance of a large number of semantic similarity configurations for the task of classifying diagnosis from text-derived phenotype profiles in one setting. We also provided a basis for further research on other settings and related tasks in the area.
Using SPARQL to Unify Queries over Data, Ontologies, and Machine Learning Models in the PhenomeBrowser Knowledgebase
Ali Syed, Şenay Kafkas, Maxat Kulmanov and Robert Hoehndorf Proceedings of the 13th International Conference on Semantic Web Applications and Tools for Health Care and Life Sciences, SWAT4HCLS 2022 (2022)
Applied Ontology, Semantic interoperability
Positive-Unlabeled Learning with Adversarial Data Augmentation for Knowledge Graph Completion
Zhenwei Tang, Shichao Pei, Zhao Zhang, Yongchun Zhu, Fuzhen Zhuang, Robert Hoehndorf and Xiangliang Zhang Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (2022)
Applied Ontology
Most real-world knowledge graphs (KG) are far from complete and comprehensive. This problem has motivated efforts in predicting the most plausible missing facts to complete a given KG, i.e., knowledge graph completion (KGC). However, existing KGC methods suffer from two main issues, 1) the false negative issue, i.e., the sampled negative training instances may include potential true facts; and 2) the data sparsity issue, i.e., true facts account for only a tiny part of all possible facts. To this end, we propose positive-unlabeled learning with adversarial data augmentation (PUDA) for KGC. In particular, PUDA tailors positive-unlabeled risk estimator for the KGC task to deal with the false negative issue. Furthermore, to address the data sparsity issue, PUDA achieves a data augmentation strategy by unifying adversarial training and positive-unlabeled learning under the positive-unlabeled minimax game. Extensive experimental results on real-world benchmark datasets demonstrate the effectiveness and compatibility of our proposed method.
DeepViral: prediction of novel virus–host interactions from protein sequences and infectious disease phenotypes
Liu-Wei, Wang, Kafkas, Şenay, Chen, Jun, Dimonaco, Nicholas J, Tegnér, Jesper and Hoehndorf, Robert Bioinformatics, vol. 37(17), pp. 2722-2729 (2021)
Bioinformatics, Disease genetics
Infectious diseases caused by novel viruses have become a major public health concern. Rapid identification of virus–host interactions can reveal mechanistic insights into infectious diseases and shed light on potential treatments. Current computational prediction methods for novel viruses are based mainly on protein sequences. However, it is not clear to what extent other important features, such as the symptoms caused by the viruses, could contribute to a predictor. Disease phenotypes (i.e., signs and symptoms) are readily accessible from clinical diagnosis and we hypothesize that they may act as a potential proxy and an additional source of information for the underlying molecular interactions between the pathogens and hosts. We developed DeepViral, a deep learning based method that predicts protein–protein interactions (PPI) between humans and viruses. Motivated by the potential utility of infectious disease phenotypes, we first embedded human proteins and viruses in a shared space using their associated phenotypes and functions, supported by formalized background knowledge from biomedical ontologies. By jointly learning from protein sequences and phenotype features, DeepViral significantly improves over existing sequence-based methods for intra- and inter-species PPI prediction. Code and datasets for reproduction and customization are available at https://github.com/bio-ontology-research-group/DeepViral. Prediction results for 14 virus families are available at https://doi.org/10.5281/zenodo.4429824.
Predicting candidate genes from phenotypes, functions and anatomical site of expression
Jun Chen, Azza Althagafi and Robert Hoehndorf Bioinformatics, vol. 37(6), pp. 853-860 (2021)
Disease genetics
Supplementary data are available at Bioinformatics online.
DTI-Voodoo: machine learning over interaction networks and ontology-based background knowledge predicts drug–target interactions
Tilman Hinnerichs and Robert Hoehndorf Bioinformatics, vol. 37(24), pp. 4835-4843 (2021)
Bioinformatics
Supplementary data are available at Bioinformatics online.
DeepGOWeb: fast and accurate protein function prediction on the (Semantic) Web
Maxat Kulmanov, Fernando Zhapa-Camacho and Robert Hoehndorf Nucleic Acids Research, vol. 49(W1), pp. W140-W146 (2021)
Applied Ontology, Protein function prediction
Understanding the functions of proteins is crucial to understand biological processes on a molecular level. Many more protein sequences are available than can be investigated experimentally. DeepGOPlus is a protein function prediction method based on deep learning and sequence similarity. DeepGOWeb makes the prediction model available through a website, an API, and through the SPARQL query language for interoperability with databases that rely on Semantic Web technologies. DeepGOWeb provides accurate and fast predictions and ensures that predicted functions are consistent with the Gene Ontology; it can provide predictions for any protein and any function in Gene Ontology. DeepGOWeb is freely available at https://deepgo.cbrc.kaust.edu.sa/.
Multi-faceted semantic clustering with text-derived phenotypes
Luke T. Slater, John A. Williams, Andreas Karwath, Hilary Fanning, Simon Ball, Paul N. Schofield, Robert Hoehndorf and Georgios V. Gkoutos Computers in Biology and Medicine, vol. 138, pp. 104904 (2021)
Applied Ontology, Disease genetics
Towards similarity-based differential diagnostics for common diseases
Luke T. Slater, Andreas Karwath, John A. Williams, Sophie Russell, Silver Makepeace, Alexander Carberry, Robert Hoehndorf and Georgios V. Gkoutos Computers in Biology and Medicine, vol. 133, pp. 104360 (2021)
Disease genetics, Semantic similarity
Ontology-based phenotype profiles have been utilised for the purpose of differential diagnosis of rare genetic diseases, and for decision support in specific disease domains. Particularly, semantic similarity facilitates diagnostic hypothesis generation through comparison with disease phenotype profiles. However, the approach has not been applied for differential diagnosis of common diseases, or generalised clinical diagnostics from uncurated text-derived phenotypes. In this work, we describe the development of an approach for deriving patient phenotype profiles from clinical narrative text, and apply this to text associated with MIMIC-III patient visits. We then explore the use of semantic similarity with those text-derived phenotypes to classify primary patient diagnosis, comparing the use of patient-patient similarity and patient-disease similarity using phenotype-disease profiles previously mined from literature. We also consider a combined approach, in which literature-derived phenotypes are extended with the content of text-derived phenotypes we mined from 500 patients. The results reveal a powerful approach, showing that in one setting, uncurated text phenotypes can be used for differential diagnosis of common diseases, making use of information both inside and outside the setting. While the methods themselves should be explored for further optimisation, they could be applied to a variety of clinical tasks, such as differential diagnosis, cohort discovery, document and text classification, and outcome prediction.
Improved characterisation of clinical text through ontology-based vocabulary expansion
Luke T. Slater, William Bradlow, Simon Ball, Robert Hoehndorf and Georgios V Gkoutos Journal of Biomedical Semantics, vol. 12(1) (2021)
Applied Ontology, Disease genetics
A fast, accurate, and generalisable heuristic-based negation detection algorithm for clinical text
Luke T. Slater, William Bradlow, Dino FA. Motti, Robert Hoehndorf, Simon Ball and Georgios V. Gkoutos Computers in Biology and Medicine, vol. 130, pp. 104216 (2021)
Disease genetics
Effects of Negation and Uncertainty Stratification on Text-Derived Patient Profile Similarity
Luke T. Slater, Andreas Karwath, Robert Hoehndorf and Georgios V. Gkoutos Frontiers in Digital Health, vol. 3 (2021)
Disease genetics, Semantic similarity
Semantic similarity is a useful approach for comparing patient phenotypes, and holds the potential of an effective method for exploiting text-derived phenotypes for differential diagnosis, text and document classification, and outcome prediction. While approaches for context disambiguation are commonly used in text mining applications, forming a standard component of information extraction pipelines, their effects on semantic similarity calculations have not been widely explored. In this work, we evaluate how inclusion and disclusion of negated and uncertain mentions of concepts from text-derived phenotypes affects similarity of patients, and the use of those profiles to predict diagnosis. We report on the effectiveness of these approaches and report a very small, yet significant, improvement in performance when classifying primary diagnosis over MIMIC-III patient visits.
Semantic similarity and machine learning with ontologies
Kulmanov, Maxat, Smaili, Fatima Zohra, Gao, Xin and Hoehndorf, Robert Briefings in Bioinformatics (2020)
Applied Ontology, Semantic similarity
Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge and they are employed in almost every major biological database. Recently, ontologies are increasingly being used to provide background knowledge in similarity-based analysis and machine learning models. The methods employed to combine ontologies and machine learning are still novel and actively being developed. We provide an overview over the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in ontologies and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at https://github.com/bio-ontology-research-group/machine-learning-with-ontologies.
DeepGOPlus: improved protein function prediction from sequence
Kulmanov, Maxat and Hoehndorf, Robert Bioinformatics, vol. 36(2), pp. 422-429 (2020)
Bioinformatics, Protein function prediction
Protein function prediction is one of the major tasks of bioinformatics that can help in a wide range of biological problems such as understanding disease mechanisms or finding drug targets. Many methods are available for predicting protein functions from sequence-based features, protein–protein interaction networks, protein structure or literature. However, other than sequence, most of the features are difficult to obtain or not available for many proteins, thereby limiting their scope. Furthermore, the performance of sequence-based function prediction methods is often lower than that of methods that incorporate multiple features, and predicting protein functions may require a lot of time. We developed a novel method for predicting protein functions from sequence alone which combines a deep convolutional neural network (CNN) model with sequence similarity based predictions. Our CNN model scans the sequence for motifs which are predictive for protein functions and combines this with functions of similar proteins (if available). We evaluate the performance of DeepGOPlus using the CAFA3 evaluation measures and achieve an Fmax of 0.390, 0.557 and 0.614 for BPO, MFO and CCO evaluations, respectively. These results would have made DeepGOPlus one of the three best predictors in CCO and the second best performing method in the BPO and MFO evaluations. We also compare DeepGOPlus with state-of-the-art methods such as DeepText2GO and GOLabeler on another dataset. DeepGOPlus can annotate around 40 protein sequences per second on common hardware, thereby making fast and accurate function predictions available for a wide range of proteins. DeepGOPlus is available at http://deepgoplus.bio2vec.net/. Supplementary data are available at Bioinformatics online.
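The Fmax measure reported above can be illustrated with a small sketch. It assumes the commonly described protein-centric definition: for each score threshold, precision is averaged over proteins with at least one prediction at or above the threshold, recall is averaged over all annotated proteins, and Fmax is the best resulting F-score. The function name `fmax`, the dictionary layout and the term identifiers are hypothetical; this is not the DeepGOPlus or CAFA evaluation code.

```python
from typing import Dict, Set


def fmax(truth: Dict[str, Set[str]],
         scores: Dict[str, Dict[str, float]],
         steps: int = 100) -> float:
    """Protein-centric Fmax over thresholds 1/steps, 2/steps, ..., 1.

    truth  maps protein id -> non-empty set of true terms (assumed layout).
    scores maps protein id -> {term: confidence in [0, 1]} (assumed layout).
    """
    best = 0.0
    for i in range(1, steps + 1):
        threshold = i / steps
        precisions = []   # only proteins with at least one prediction
        recall_sum = 0.0  # accumulated over all annotated proteins
        for protein, true_terms in truth.items():
            predicted = {term
                         for term, score in scores.get(protein, {}).items()
                         if score >= threshold}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue  # nothing predicted at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

A full evaluation would also propagate annotations to ancestor terms in the Gene Ontology before scoring; that step is omitted here for brevity.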
DeepPheno: Predicting single gene loss-of-function phenotypes using an ontology-aware hierarchical classifier
Kulmanov, Maxat and Hoehndorf, Robert PLOS Computational Biology, vol. 16(11), pp. 1-22 (2020)
Applied Ontology, Disease genetics
Author summary: Gene–phenotype associations can help to understand the underlying mechanisms of many genetic diseases. However, experimental identification, often involving animal models, is time consuming and expensive. Computational methods that predict gene–phenotype associations can be used instead. We developed DeepPheno, a novel approach for predicting the phenotypes resulting from a loss of function of a single gene. We use gene functions and gene expression as information to predict phenotypes. Our method uses a neural network classifier that is able to account for hierarchical dependencies between phenotypes. We extensively evaluate our method and compare it with related approaches, and we show that DeepPheno results in better performance in several evaluations. Furthermore, we found that many of the new predictions made by our method have been added to phenotype association databases released one year later. Overall, DeepPheno simulates some aspects of human physiology and how molecular and physiological alterations lead to abnormal phenotypes.
Formal axioms in biomedical ontologies improve analysis and interpretation of associated data
Smaili, Fatima Z., Gao, Xin and Hoehndorf, Robert Bioinformatics, vol. 36(7), pp. 2229-2236 (2020)
Applied Ontology
Motivation: There are now over 500 ontologies in the life sciences. Over the past years, significant resources have been invested into formalizing these biomedical ontologies. Formal axioms in ontologies have been developed and used to detect and ensure ontology consistency, find unsatisfiable classes, improve interoperability, guide ontology extension through the application of axiom-based design patterns, and encode domain background knowledge. At the same time, ontologies have extended their amount of human-readable information such as labels and definitions as well as other meta-data. As a consequence, biomedical ontologies now form large formalized domain knowledge bases and have a potential to improve ontology-based data analysis by providing background knowledge and relations between biological entities that are not otherwise connected. Results: We evaluate the contribution of formal axioms and ontology meta-data to the ontology-based prediction of protein-protein interactions and gene-disease associations. We find that the formal axioms that have been created for the Gene Ontology and several other ontologies significantly improve ontology-based prediction models through provision of domain-specific background knowledge. Furthermore, we find that the labels, synonyms and definitions in ontologies can also provide background knowledge that may be exploited for prediction. The axioms and meta-data of different ontologies contribute in varying degrees to improving data analysis. Our results have major implications for the further development of formal knowledge bases and ontologies in the life sciences, in particular as machine learning methods are more frequently being applied. Our findings clearly motivate the need for further development, and the systematic, application-driven evaluation and improvement, of formal axioms in ontologies.
DDIEM: drug database for inborn errors of metabolism
Marwa Abdelhakim, Eunice McMurray, Ali Raza Syed, Senay Kafkas, Allan Anthony Kamau, Paul N Schofield and Robert Hoehndorf Orphanet Journal of Rare Diseases, vol. 15(1) (2020)
Drug mechanisms
Background: Inborn errors of metabolism (IEM) represent a subclass of rare inherited diseases caused by a wide range of defects in metabolic enzymes or their regulation. Of over a thousand characterized IEMs, only about half are understood at the molecular level, and overall the development of treatment and management strategies has proved challenging. An overview of the changing landscape of therapeutic approaches is helpful in assessing strategic patterns in the approach to therapy, but the information is scattered throughout the literature and public data resources. Results: We gathered data on therapeutic strategies for 300 diseases into the Drug Database for Inborn Errors of Metabolism (DDIEM). Therapeutic approaches, including both successful and ineffective treatments, were manually classified by their mechanisms of action using a new ontology. Conclusions: We present a manually curated, ontologically formalized knowledgebase of drugs, therapeutic procedures, and mitigated phenotypes. DDIEM is freely available through a web interface and for download at http://ddiem.phenomebrowser.net.
What is the right sequencing approach? Solo VS extended family analysis in consanguineous populations
Ahmed Alfares, Lamia Alsubaie, Taghrid Aloraini, Aljoharah Alaskar, Azza Althagafi, Ahmed Alahmad, Mamoon Rashid, Abdulrahman Alswaid, Ali Alothaim, Wafaa Eyaid, Faroug Ababneh, Mohammed Albalwi, Raniah Alotaibi, Mashael Almutairi, Nouf Altharawi, Alhanouf Alsamer, Marwa Abdelhakim, Senay Kafkas, Katsuhiko Mineta, Nicole Cheung, Abdallah M. Abdallah, Stine Büchmann-Moller, Yoshinori Fukasawa, Xiang Zhao, Issaac Rajan, Robert Hoehndorf, Fuad Al Mutairi, Takashi Gojobori and Majid Alfadhel BMC Medical Genomics, vol. 13(1) (2020)
Other
There was no difference in the hit rate between solo and extended family members. Trio-based analysis was a better approach than sibship testing, even in a consanguineous population. Finally, each additional family member helped to narrow down the number of variants by 50-75%. Our findings could help clinicians, researchers and testing laboratories select the most cost-effective and appropriate sequencing approach for their patients. Furthermore, using extended family analysis is a very useful tool for complex cases with novel genes.
Combining lexical and context features for automatic ontology extension
Sara Althubaiti, Senay Kafkas, Marwa Abdelhakim and Robert Hoehndorf Journal of Biomedical Semantics, vol. 11, pp. 1 (2020)
Applied Ontology, Ontology engineering
Towards semantic interoperability: finding and repairing hidden contradictions in biomedical ontologies
Luke T. Slater, Georgios V. Gkoutos and Robert Hoehndorf BMC Medical Informatics and Decision Making, vol. 20(S10) (2020)
Applied Ontology, Semantic interoperability
Background: Ontologies are widely used throughout the biomedical domain. These ontologies formally represent the classes and relations assumed to exist within a domain. As scientific domains are deeply interlinked, so too are their representations. While individual ontologies can be tested for consistency and coherency using automated reasoning methods, systematically combining ontologies of multiple domains together may reveal previously hidden contradictions. Methods: We developed a method that tests for hidden unsatisfiabilities in an ontology that arise when combined with other ontologies. For this purpose, we combined sets of ontologies and use automated reasoning to determine whether unsatisfiable classes are present. In addition, we designed and implemented a novel algorithm that can determine justifications for contradictions across extremely large and complicated ontologies, and use these justifications to semi-automatically repair ontologies by identifying a small set of axioms that, when removed, result in a consistent and coherent set of ontologies. Results: We tested the mutual consistency of the OBO Foundry and the OBO ontologies and find that the combined OBO Foundry gives rise to at least 636 unsatisfiable classes, while the OBO ontologies give rise to more than 300,000 unsatisfiable classes. We also applied our semi-automatic repair algorithm to each combination of OBO ontologies that resulted in unsatisfiable classes, finding that only 117 axioms could be removed to account for all cases of unsatisfiability across all OBO ontologies. Conclusions: We identified a large set of hidden unsatisfiability across a broad range of biomedical ontologies, and we find that this large set of unsatisfiable classes is the result of a relatively small amount of axiomatic disagreements. Our results show that hidden unsatisfiability is a serious problem in ontology interoperability; however, our results also provide a way towards more consistent ontologies by addressing the issues we identified.
EMC10 homozygous variant identified in a family with global developmental delay, mild intellectual disability, and speech delay
Muhammad Umair, Mariam Ballow, Abdulaziz Asiri, Yusra Alyafee, Abeer Tuwaijri, Kheloud M. Alhamoudi, Taghrid Aloraini, Marwa Abdelhakim, Azza Thamer Althagafi, Senay Kafkas, Lamia Alsubaie, Muhammad Talal Alrifai, Robert Hoehndorf, Ahmed Alfares and Majid Alfadhel Clinical Genetics, vol. 98(6), pp. 555-561 (2020)
Disease genetics
In recent years, several genes have been implicated in the variable disease presentation of global developmental delay (GDD) and intellectual disability (ID). The endoplasmic reticulum membrane protein complex (EMC) family is known to be involved in GDD and ID. Homozygous variants of EMC1 are associated with GDD, scoliosis, and cerebellar atrophy, indicating the relevance of this pathway for neurogenetic disorders. EMC10 is a bone marrow-derived angiogenic growth factor that plays an important role in infarct vascularization and promoting tissue repair. However, this gene has not been previously associated with human disease. Herein, we describe a Saudi family with two individuals segregating a recessive neurodevelopmental disorder. Both of the affected individuals showed mild ID, speech delay, and GDD. Whole-exome sequencing (WES) and Sanger sequencing were performed to identify candidate genes. Further, to elucidate the functional effects of the variant, quantitative real-time PCR (RT-qPCR)-based expression analysis was performed. WES revealed a homozygous splice acceptor site variant (c.679-1G>A) in EMC10 (chromosome 19q13.33) that segregated perfectly within the family. RT-qPCR showed a substantial decrease in the relative EMC10 gene expression in the patients, indicating the pathogenicity of the identified variant. For the first time in the literature, the EMC10 gene variant was associated with mild ID, speech delay, and GDD. Thus, this gene plays a key role in developmental milestones, with the potential to cause neurodevelopmental disorders in humans.
BioHackathon 2015: Semantics of data for life sciences and reproducible research
Rutger A. Vos, Toshiaki Katayama, Hiroyuki Mishima, Shin Kawano, Shuichi Kawashima, Jin-Dong Kim, Yuki Moriya, Toshiaki Tokimatsu, Atsuko Yamaguchi, Yasunori Yamamoto, Hongyan Wu, Peter Amstutz, Erick Antezana, Nobuyuki P. Aoki, Kazuharu Arakawa, Jerven T. Bolleman, Evan Bolton, Raoul J. P. Bonnal, Hidemasa Bono, Kees Burger, Hirokazu Chiba, Kevin B. Cohen, Eric W. Deutsch, Jesualdo T. Fernández-Breis, Gang Fu, Takatomo Fujisawa, Atsushi Fukushima, Alexander García, Naohisa Goto, Tudor Groza, Colin Hercus, Robert Hoehndorf, Kotone Itaya, Nick Juty, Takeshi Kawashima, Jee-Hyub Kim, Akira R. Kinjo, Masaaki Kotera, Kouji Kozaki, Sadahiro Kumagai, Tatsuya Kushida, Thomas Lütteke, Masaaki Matsubara, Joe Miyamoto, Attayeb Mohsen, Hiroshi Mori, Yuki Naito, Takeru Nakazato, Jeremy Nguyen-Xuan, Kozo Nishida, Naoki Nishida, Hiroyo Nishide, Soichi Ogishima, Tazro Ohta, Shujiro Okuda, Benedict Paten, Jean-Luc Perret, Philip Prathipati, Pjotr Prins, Núria Queralt-Rosinach, Daisuke Shinmachi, Shinya Suzuki, Tsuyosi Tabata, Terue Takatsuki, Kieron Taylor, Mark Thompson, Ikuo Uchiyama, Bruno Vieira, Chih-Hsuan Wei, Mark Wilkinson, Issaku Yamada, Ryota Yamanaka, Kazutoshi Yoshitake, Akiyasu C. Yoshizawa, Michel Dumontier, Kenjiro Kosaki and Toshihisa Takagi F1000Research, vol. 9, pp. 136 (2020)
Applied Ontology
We report on the activities of the 2015 edition of the BioHackathon, an annual event that brings together researchers and developers from around the world to develop tools and technologies that promote the reusability of biological data. We discuss issues surrounding the representation, publication, integration, mining and reuse of biological data and metadata across a wide range of biomedical data types of relevance for the life sciences, including chemistry, genotypes and phenotypes, orthology and phylogeny, proteomics, genomics, glycomics, and metabolomics. We describe our progress to address ongoing challenges to the reusability and reproducibility of research results, and identify outstanding issues that continue to impede the progress of bioinformatics research. We share our perspective on the state of the art, continued challenges, and goals for future research and development for the life sciences Semantic Web.
JOWO 2020: The Joint Ontology Workshops : Proceedings of the Joint Ontology Workshops co-located with the Bolzano Summer of Knowledge (BOSK 2020)
BioHackathon series in 2013 and 2014: improvements of semantic interoperability in life science data and services
Katayama, T, Kawashima, S, Micklem, G, Kawano, S, Kim, JD, Kocbek, S, Okamoto, S, Wang, Y, Wu, H, Yamaguchi, A, Yamamoto, Y, Antezana, E, Aoki-Kinoshita, KF, Arakawa, K, Banno, M, Baran, J, Bolleman, JT, Bonnal, RJP, Bono, H, Fernández-Breis, JT, Buels, R, Campbell, MP, Chiba, H, Cock, PJA, Cohen, KB, Dumontier, M, Fujisawa, T, Fujiwara, T, Garcia, L, Gaudet, P, Hattori, E, Hoehndorf, R, Itaya, K, Ito, M, Jamieson, D, Jupp, S, Juty, N, Kalderimis, A, Kato, F, Kawaji, H, Kawashima, T, Kinjo, AR, Komiyama, Y, Kotera, M, Kushida, T, Malone, J, Matsubara, M, Mizuno, S, Mizutani, S, Mori, H, Moriya, Y, Murakami, K, Nakazato, T, Nishide, H, Nishimura, Y, Ogishima, S, Ohta, T, Okuda, S, Ono, H, Perez-Riverol, Y, Shinmachi, D, Splendiani, A, Strozzi, F, Suzuki, S, Takehara, J, Thompson, M, Tokimatsu, T, Uchiyama, I, Verspoor, K, Wilkinson, MD, Wimalaratne, S, Yamada, I, Yamamoto, N, Yarimizu, M, Kawamoto, S and Takagi, T F1000Research, vol. 8(1677) (2019)
Applied Ontology, Semantic interoperability
A Machine Learning Based Approach for Similarity Search on Biodiversity Knowledge Graphs
Claus Weiland, Maxat Kulmanov, Marco Schmidt and Robert Hoehndorf Biodiversity Information Science and Standards, vol. 3, pp. e37048 (2019)
Applied Ontology, Semantic similarity
Mass biodiversity data from scientific collections will be provided by world-wide digitization efforts like iDigBio in the U.S. and DiSSCo in Europe. This opens up an increasing amount of data on wild type organisms, which enables the building of large biodiversity knowledge graphs comprising, inter alia, sequence, trait and occurrence data. Knowledge graphs model information in the form of entities and their relationships expressed in good practice as ontology-based annotations. Based on ontological descriptions, semantic similarity analysis makes linking of wild type data to genomic and proteomic data of model organisms possible and thus supports knowledge discovery of crop wild relatives and underutilized species of interest for medicine, breeding and agriculture. Since classical similarity measurements focus on recording differences between character states (aiming to describe disease phenotypes), but not the character states in the sense of trait variations itself, new methods for similarity search are required. Machine learning algorithms operate on feature vectors, which are numeric representations of data (images, class labels, etc.) in n-dimensional vector space. We established a machine learning based workflow for similarity search on biodiversity entities using feature learning on ontologies and an associated RDF knowledge graph to project structured trait data into vector space. Vectors are then compared applying a similarity function (e.g. cosine similarity) to determine similarity between taxa based on trait semantics. We will present an application example of machine learning on biodiversity knowledge graphs using a pipeline built upon OPA2Vec, a method to generate feature vectors from the logical content of ontologies (Smaili et al. 2018), to successfully cluster plant species for life form and ecotype (e.g. tree vs. perennial plant) on the basis of their annotations with the Flora Phenotype Ontology (Hoehndorf et al. 2016).
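The vector comparison step mentioned above can be sketched as follows. This is a minimal illustration of cosine similarity and nearest-neighbour ranking only; the function names, the `embeddings` dictionary and any taxon labels are hypothetical, and the vectors would in practice come from an embedding method such as OPA2Vec rather than being written by hand.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query: str,
                 embeddings: Dict[str, Sequence[float]],
                 top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank all other entities by cosine similarity to the query entity."""
    query_vec = embeddings[query]
    ranked = sorted(
        ((name, cosine_similarity(query_vec, vec))
         for name, vec in embeddings.items() if name != query),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_k]
```

Calling `most_similar("taxon_a", embeddings)` on a dictionary of taxon vectors returns the closest taxa first, which is the ranking that a similarity search or a downstream clustering step would consume.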
Quantitative evaluation of ontology design patterns for combining pathology and anatomy ontologies
Sarah M. Alghamdi, Beth A. Sundberg, John P. Sundberg, Paul N. Schofield and Robert Hoehndorf Scientific Reports, vol. 9, pp. 4025 (2019)
Applied Ontology, Ontology engineering
Data are increasingly annotated with multiple ontologies to capture rich information about the features of the subject under investigation. Analysis may be performed over each ontology separately, but recently there has been a move to combine multiple ontologies to provide more powerful analytical possibilities. However, it is often not clear how to combine ontologies or how to assess or evaluate the potential design patterns available. Here we use a large and well-characterized dataset of anatomic pathology descriptions from a major study of aging mice. We show how different design patterns based on the MPATH and MA ontologies provide orthogonal axes of analysis, and perform differently in over-representation and semantic similarity applications. We discuss how such a data-driven approach might be used generally to generate and evaluate ontology design patterns.
Ontology-based prediction of cancer driver genes
Althubaiti, Sara, Karwath, Andreas, Dallol, Ashraf, Noor, Adeeb, Alkhayyat, Shadi Salem, Alwassia, Rolina, Mineta, Katsuhiko, Gojobori, Takashi, Beggs, Andrew D, Schofield, Paul N, Gkoutos, Georgios V and Hoehndorf, Robert Scientific Reports, vol. 9, pp. 17405 (2019)
Applied Ontology, Disease genetics
Identifying and distinguishing cancer driver genes among thousands of candidate mutations remains a major challenge. Accurate identification of driver genes and driver mutations is critical for advancing cancer research and personalizing treatment based on accurate stratification of patients. Due to inter-tumor genetic heterogeneity, many driver mutations within a gene occur at low frequencies, which make it challenging to distinguish them from non-driver mutations. We have developed a novel method for identifying cancer driver genes. Our approach utilizes multiple complementary types of information, specifically cellular phenotypes, cellular locations, functions, and whole body physiological phenotypes as features. We demonstrate that our method can accurately identify known cancer driver genes and distinguish between their role in different types of cancer. In addition to confirming known driver genes, we identify several novel candidate driver genes. We demonstrate the utility of our method by validating its predictions in nasopharyngeal cancer and colorectal cancer using whole exome and whole genome sequencing.
The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens
Zhou, Naihui, Jiang, Yuxiang, Bergquist, Timothy R, Lee, Alexandra J, Kacsoh, Balint Z, Crocker, Alex W, Lewis, Kimberley A, Georghiou, George, Nguyen, Huy N, Hamid, Md Nafiz, Davis, Larry, Dogan, Tunca, Atalay, Volkan, Rifaioglu, Ahmet S, Dalkiran, Alperen, Cetin-Atalay, Rengul, Zhang, Chengxin, Hurto, Rebecca L, Freddolino, Peter L, Zhang, Yang, Bhat, Prajwal, Supek, Fran, Fernández, José M, Gemovic, Branislava, Perovic, Vladimir R, Davidović, Radoslav S, Sumonja, Neven, Veljkovic, Nevena, Asgari, Ehsaneddin, Mofrad, Mohammad RK, Profiti, Giuseppe, Savojardo, Castrense, Martelli, Pier Luigi, Casadio, Rita, Boecker, Florian, Kahanda, Indika, Thurlby, Natalie, McHardy, Alice C, Renaux, Alexandre, Saidi, Rabie, Gough, Julian, Freitas, Alex A, Antczak, Magdalena, Fabris, Fabio, Wass, Mark N, Hou, Jie, Cheng, Jianlin, Hou, Jie, Wang, Zheng, Romero, Alfonso E, Paccanaro, Alberto, Yang, Haixuan, Goldberg, Tatyana, Zhao, Chenguang, Holm, Liisa, Törönen, Petri, Medlar, Alan J, Zosa, Elaine, Borukhov, Itamar, Novikov, Ilya, Wilkins, Angela, Lichtarge, Olivier, Chi, Po-Han, Tseng, Wei-Cheng, Linial, Michal, Rose, Peter W, Dessimoz, Christophe, Vidulin, Vedrana, Dzeroski, Saso, Sillitoe, Ian, Das, Sayoni, Lees, Jonathan Gill, Jones, David T, Wan, Cen, Cozzetto, Domenico, Fa, Rui, Torres, Mateo, Vesztrocy, Alex Wiarwick, Rodriguez, Jose Manuel, Tress, Michael L, Frasca, Marco, Notaro, Marco, Grossi, Giuliano, Petrini, Alessandro, Re, Matteo, Valentini, Giorgio, Mesiti, Marco, Roche, Daniel B, Reeb, Jonas, Ritchie, David W, Aridhi, Sabeur, Alborzi, Seyed Ziaeddin, Devignes, Marie-Dominique, Emily Koo, Da Chen, Bonneau, Richard, Gligorijević, Vladimir, Barot, Meet, Fang, Hai, Toppo, Stefano, Lavezzo, Enrico, Falda, Marco, Berselli, Michele, Tosatto, Silvio CE, Carraro, Marco, Piovesan, Damiano, Rehman, Hafeez Ur, Mao, Qizhong, Zhang, Shanshan, Vucetic, Slobodan, Black, Gage S, Jo, Dane, Larsen, Dallas J, Omdahl, Ashton R, Sagers, Luke W, Suh, Erica, Dayton, Jonathan B, McGuffin, Liam J, Brackenridge, Danielle A, Babbitt, Patricia C, Yunes, Jeffrey M, Fontana, Paolo, Zhang, Feng, Zhu, Shanfeng, You, Ronghui, Zhang, Zihan, Dai, Suyang, Yao, Shuwei, Tian, Weidong, Cao, Renzhi, Chandler, Caleb, Amezola, Miguel, Johnson, Devon, Chang, Jia-Ming, Liao, Wen-Hung, Liu, Yi-Wei, Pascarelli, Stefano, Frank, Yotam, Hoehndorf, Robert, Kulmanov, Maxat, Boudellioua, Imane, Politano, Gianfranco, Di Carlo, Stefano, Benso, Alfredo, Hakala, Kai, Ginter, Filip, Mehryary, Farrokh, Kaewphan, Suwisa, Björne, Jari, Moen, Hans, Tolvanen, Martti E E, Salakoski, Tapio, Kihara, Daisuke, Jain, Aashish, Šmuc, Tomislav, Altenhoff, Adrian, Ben-Hur, Asa, Rost, Burkhard, Brenner, Steven E, Orengo, Christine A, Jeffery, Constance J, Bosco, Giovanni, Hogan, Deborah A, Martin, Maria J, O'Donovan, Claire, Mooney, Sean D, Greene, Casey S, Radivojac, Predrag and Friedberg, Iddo Genome Biology, vol. 20, pp. 244 (2019)
Protein function prediction
Hyaline Arteriolosclerosis in 30 Strains of Aged Inbred Mice
Timothy K. Cooper, Kathleen A. Silva, Victoria E. Kennedy, Sarah M. Alghamdi, Robert Hoehndorf, Beth A. Sundberg, Paul N. Schofield and John P. Sundberg Veterinary Pathology, pp. 0300985819844822 (2019)
Other
During a screen for vascular phenotypes in aged laboratory mice, a unique discrete phenotype of hyaline arteriolosclerosis of the intertubular arteries and arterioles of the testes was identified in several inbred strains. Lesions were limited to the testes and did not occur as part of any renal, systemic, or pulmonary arteriopathy or vasculitis phenotype. There was no evidence of systemic or pulmonary hypertension, and lesions did not occur in ovaries of females. Frequency was highest in males of the SM/J (27/30, 90%) and WSB/EiJ (19/26, 73%) strains, aged 383 to 847 days. Lesions were sporadically present in males from several other inbred strains at a much lower (<20%) frequency. The risk of testicular hyaline arteriolosclerosis is at least partially underpinned by a genetic predisposition that is not associated with other vascular lesions (including vasculitis), separating out the etiology of this form and site of arteriolosclerosis from other related conditions that often co-occur in other strains of mice and in humans. Because of their genetic uniformity and controlled dietary and environmental conditions, mice are an excellent model to dissect the pathogenesis of human disease conditions. In this study, a discrete genetically driven phenotype of testicular hyaline arteriolosclerosis in aging mice was identified. These observations open the possibility of identifying the underlying genetic variant(s) associated with the predisposition and therefore allowing future interrogation of the pathogenesis of this condition.
Ontology based text mining of gene-phenotype associations: application to candidate gene prediction
Kafkas, Şenay and Hoehndorf, Robert Database, vol. 2019, pp. baz019 (2019)
Applied Ontology, Disease genetics
Gene–phenotype associations play an important role in understanding the disease mechanisms which is a requirement for treatment development. A portion of gene–phenotype associations are observed mainly experimentally and made publicly available through several standard resources such as MGI. However, there is still a vast amount of gene–phenotype associations buried in the biomedical literature. Given the large amount of literature data, we need automated text mining tools to alleviate the burden in manual curation of gene–phenotype associations and to develop comprehensive resources. In this study, we present an ontology-based approach in combination with statistical methods to text mine gene–phenotype associations from the literature. Our method achieved AUC values of 0.90 and 0.75 in recovering known gene–phenotype associations from HPO and MGI respectively. We posit that candidate genes and their relevant diseases should be expressed with similar phenotypes in publications. Thus, we demonstrate the utility of our approach by predicting disease candidate genes based on the semantic similarities of phenotypes associated with genes and diseases. To the best of our knowledge, this is the first study using an ontology based approach to extract gene–phenotype associations from the literature. We evaluated our disease candidate prediction model on the gene–disease associations from MGI. Our model achieved AUC values of 0.90 and 0.87 on OMIM (human) and MGI (mouse) datasets of gene–disease associations respectively. Our manual analysis on the text mined data revealed that our method can accurately extract gene–phenotype associations which are not currently covered by the existing public gene–phenotype resources. Overall, results indicate that our method can precisely extract known as well as new gene–phenotype associations from literature. All the data and methods are available at https://github.com/bio-ontology-research-group/genepheno.
DeepPVP: phenotype-based prioritization of causative variants using deep learning
Boudellioua, Imane, Kulmanov, Maxat, Schofield, Paul N, Gkoutos, Georgios V and Hoehndorf, Robert BMC Bioinformatics, vol. 20, pp. 65 (2019)
Disease genetics
Background: Prioritization of variants in personal genomic data is a major challenge. Recently, computational methods that rely on comparing phenotype similarity have shown to be useful to identify causative variants. In these methods, pathogenicity prediction is combined with a semantic similarity measure to prioritize not only variants that are likely to be dysfunctional but those that are likely involved in the pathogenesis of a patient's phenotype. Results: We have developed DeepPVP, a variant prioritization method that combined automated inference with deep neural networks to identify the likely causative variants in whole exome or whole genome sequence data. We demonstrate that DeepPVP performs significantly better than existing methods, including phenotype-based methods that use similar features. DeepPVP is freely available at https://github.com/bio-ontology-research-group/phenomenet-vp Conclusions: DeepPVP further improves on existing variant prioritization methods both in terms of speed as well as accuracy.
Nail abnormalities identified in an ageing study of 30 inbred mouse strains
Linn, Sarah C., Mustonen, Allison M., Silva, Kathleen A., Kennedy, Victoria E., Sundberg, Beth A., Bechtold, Lesley S., Alghamdi, Sarah M., Hoehndorf, Robert, Schofield, Paul N. and Sundberg, John P. Experimental Dermatology, vol. 28(4), pp. 383-390 (2019)
Other
In a large-scale ageing study, 30 inbred mouse strains were systematically screened for histologic evidence of lesions in all organ systems. Ten strains were diagnosed with similar nail abnormalities. The highest frequency was noted in NON/ShiLtJ mice. Lesions identified fell into two main categories: acute to chronic penetration of the third phalangeal bone through the hyponychium with associated inflammation and bone remodelling or metaplasia of the nail matrix and nail bed associated with severe orthokeratotic hyperkeratosis replacing the nail plate. Penetration of the distal phalanx through the hyponychium appeared to be the initiating feature resulting in nail abnormalities. The accompanying acute to subacute inflammatory response was associated with osteolysis of the distal phalanx. Evaluation of young NON/ShiLtJ mice revealed that these lesions were not often found, or affected only one digit. The only other nail unit abnormality identified was sporadic subungual epidermoid inclusion cysts which closely resembled similar lesions in human patients. These abnormalities, being age-related developments, may have contributed to weight loss due to impacts upon feeding and should be a consideration for future research due to the potential to interact with other experimental factors in ageing studies using the affected strains of mice.
Ontology based mining of pathogen–disease associations from literature
Senay Kafkas and Robert Hoehndorf Journal of Biomedical Semantics, vol. 10, pp. 15 (2019)
Applied Ontology, Disease genetics
PathoPhenoDB: linking human pathogens to their disease phenotypes in support of infectious disease research
Kafkas, Senay, Abdelhakim, Marwa, Hashish, Yasmeen, Kulmanov, Maxat, Abdellatif, Marwa, Schofield, Paul N and Hoehndorf, Robert Scientific Data, vol. 6(1), pp. 79 (2019)
Disease genetics
Understanding the relationship between the pathophysiology of infectious disease, the biology of the causative agent and the development of therapeutic and diagnostic approaches is dependent on the synthesis of a wide range of types of information. Provision of a comprehensive and integrated disease phenotype knowledgebase has the potential to provide novel and orthogonal sources of information for the understanding of infectious agent pathogenesis, and support for research on disease mechanisms. We have developed PathoPhenoDB, a database containing pathogen-to-phenotype associations. PathoPhenoDB relies on manual curation of pathogen-disease relations, on ontology-based text mining as well as manual curation to associate phenotypes with infectious disease. Using Semantic Web technologies, PathoPhenoDB also links to knowledge about drug resistance mechanisms and drugs used in the treatment of infectious diseases. PathoPhenoDB is accessible at http://patho.phenomebrowser.net/, and the data is freely available through a public SPARQL endpoint.
EL Embeddings: Geometric construction of models for the Description Logic EL++
Maxat Kulmanov, Wang Liu-Wei, Yuan Yan and Robert Hoehndorf Proceedings of IJCAI 2019 (2019)
Applied Ontology
An embedding is a function that maps entities from one algebraic structure into another while preserving certain characteristics. Embeddings are being used successfully for mapping relational data or text into vector spaces where they can be used for machine learning, similarity search, or similar tasks. We address the problem of finding vector space embeddings for theories in the Description Logic EL++ that are also models of the TBox. To find such embeddings, we define an optimization problem that characterizes the model-theoretic semantics of the operators in EL++ within $\mathbb{R}^n$, thereby solving the problem of finding an interpretation function for an EL++ theory given a particular domain $\Delta$. Our approach is mainly relevant to large EL++ theories and knowledge bases such as the ontologies and knowledge graphs used in the life sciences. We demonstrate that our method can be used for improved prediction of protein–protein interactions when compared to semantic similarity measures or knowledge graph embedding
Semi-Supervised Entity Alignment via Knowledge Graph Embedding with Awareness of Degree Difference
Pei, Shichao, Yu, Lu, Hoehndorf, Robert and Zhang, Xiangliang The World Wide Web Conference, pp. 3130-3136 (2019)
Applied Ontology
Entity alignment associates entities in different knowledge graphs if they are semantically same, and has been successfully used in the knowledge graph construction and connection. Most of the recent solutions for entity alignment are based on knowledge graph embedding, which maps knowledge entities in a low-dimension space where entities are connected with the guidance of prior aligned entity pairs. The study in this paper focuses on two important issues that limit the accuracy of current entity alignment solutions: 1) labeled data of priorly aligned entity pairs are difficult and expensive to acquire, whereas abundant of unlabeled data are not used; and 2) knowledge graph embedding is affected by entity's degree difference, which brings challenges to align high frequent and low frequent entities. We propose a semi-supervised entity alignment method (SEA) to leverage both labeled entities and the abundant unlabeled entity information for the alignment. Furthermore, we improve the knowledge graph embedding with awareness of the degree difference by performing the adversarial training. To evaluate our proposed model, we conduct extensive experiments on real-world datasets. The experimental results show that our model consistently outperforms the state-of-the-art methods with significant improvement on alignment accuracy.
FoodOn: a harmonized food ontology to increase global food traceability, quality control and data integration
Damion M. Dooley, Emma J. Griffiths, Gurinder S. Gosal, Pier L. Buttigieg, Robert Hoehndorf, Matthew C. Lange, Lynn M. Schriml, Fiona S. L. Brinkman and William W. L. Hsiao Science of Food, vol. 2, pp. 23 (2018)
Applied Ontology, Semantic interoperability
Notions of similarity for systems biology models
Ron Henkel, Robert Hoehndorf, Tim Kacprowski, Christian Knüpfer, Wolfgang Liebermeister and Dagmar Waltemath Briefings in Bioinformatics, vol. 19(1), pp. 77-88 (2018)
Semantic similarity
Systems biology models are rapidly increasing in complexity, size and numbers. When building large models, researchers rely on software tools for the retrieval, comparison, combination and merging of models, as well as for version control. These tools need to be able to quantify the differences and similarities between computational models. However, depending on the specific application, the notion of 'similarity' may greatly vary. A general notion of model similarity, applicable to various types of models, is still missing. Here we survey existing methods for the comparison of models, introduce quantitative measures for model similarity, and discuss potential applications of combined similarity measures. To frame model comparison as a general problem, we describe a theoretical approach to defining and computing similarities based on a combination of different model aspects. The six aspects that we define as potentially relevant for similarity are underlying encoding, references to biological entities, quantitative behaviour, qualitative behaviour, mathematical equations and parameters and network structure. We argue that future similarity measures will benefit from combining these model aspects in flexible, problem-specific ways to mimic users' intuition about model similarity, and to support complex model searches in databases.
A Review of Current Standards and the Evolution of Histopathology Nomenclature for Laboratory Animals
Keenan, Charlotte M, McKerlie, Colin, Gkoutos, Georgios V, Ward, Jerrold M, Sundberg, John P, Cesta, Mark F, Schofield, Paul N, Cardiff, Robert, Hoehndorf, Robert and Elmore, Susan A ILAR Journal, pp. 1-11 (2018)
Ontology engineering
The need for international collaboration in rodent pathology has evolved since the 1970s and was initially driven by the new field of toxicologic pathology. First initiated by the World Health Organization’s International Agency for Research on Cancer for rodents, it has evolved to include pathology of the major species (rats, mice, guinea pigs, nonhuman primates, pigs, dogs, fish, rabbits) used in medical research, safety assessment, and mouse pathology. The collaborative effort today is driven by the needs of the regulatory agencies in multiple countries, and by needs of research involving genetically engineered animals, for “basic” research and for more translational preclinical models of human disease. These efforts led to the establishment of an international rodent pathology nomenclature program. Since that time, multiple collaborations for standardization of laboratory animal pathology nomenclature and diagnostic criteria have been developed, and just a few are described herein. Recently, approaches to a nomenclature that is amenable to sophisticated computation have been made available and implemented for large-scale programs in functional genomics and aging. Most terminologies continue to evolve as the science of human and veterinary pathology continues to develop, but standardization and successful implementation remain critical for scientific communication now as ever in the history of veterinary nosology.
OligoPVP: Phenotype-driven analysis of individual genomic information to prioritize oligogenic disease variants
Boudellioua, Imane, Kulmanov, Maxat, Schofield, Paul N, Gkoutos, Georgios V and Hoehndorf, Robert Scientific Reports, vol. 8, pp. 14681 (2018)
Disease genetics
Purpose: An increasing number of Mendelian disorders have been identified for which two or more variants in one or more genes are required to cause the disease, or significantly modify its severity or phenotype. It is difficult to discover such interactions using existing approaches. The purpose of our work is to develop and evaluate a system that can identify combinations of variants underlying oligogenic diseases in individual whole exome or whole genome sequences. Methods: Information that links patient phenotypes to databases of gene-phenotype associations observed in clinical research can provide useful information and improve variant prioritization for Mendelian diseases. Additionally, background knowledge about interactions between genes can be utilized to guide and restrict the selection of candidate disease modules. Results: We developed OligoPVP, an algorithm that can be used to identify variants in oligogenic diseases and their interactions, using whole exome or whole genome sequences together with patient phenotypes as input. We demonstrate that OligoPVP has significantly improved performance when compared to state of the art pathogenicity detection methods. Conclusions: Our results show that OligoPVP can efficiently detect oligogenic interactions using a phenotype-driven approach and identify etiologically important variants in whole genomes.
Onto2Vec: joint vector-based representation of biological entities and their ontology-based annotations
Smaili, Fatima Zohra, Gao, Xin and Hoehndorf, Robert Bioinformatics, vol. 34(13), pp. i52-i60 (2018)
Applied Ontology
Motivation: Biological knowledge is widely represented in the form of ontology-based annotations: ontologies describe the phenomena assumed to exist within a domain, and the annotations associate a (kind of) biological entity with a set of phenomena within the domain. The structure and information contained in ontologies and their annotations make them valuable for developing machine learning, data analysis and knowledge extraction algorithms; notably, semantic similarity is widely used to identify relations between biological entities, and ontology-based annotations are frequently used as features in machine learning applications. Results: We propose the Onto2Vec method, an approach to learn feature vectors for biological entities based on their annotations to biomedical ontologies. Our method can be applied to a wide range of bioinformatics research problems such as similarity-based prediction of interactions between proteins, classification of interaction types using supervised learning, or clustering. To evaluate Onto2Vec, we use the gene ontology (GO) and jointly produce dense vector representations of proteins, the GO classes to which they are annotated, and the axioms in GO that constrain these classes. First, we demonstrate that Onto2Vec-generated feature vectors can significantly improve prediction of protein–protein interactions in human and yeast. We then illustrate how Onto2Vec representations provide the means for constructing data-driven, trainable semantic similarity measures that can be used to identify particular relations between proteins. Finally, we use an unsupervised clustering approach to identify protein families based on their Enzyme Commission numbers. Our results demonstrate that Onto2Vec can generate high quality feature vectors from biological entities and ontologies. Onto2Vec has the potential to significantly outperform the state-of-the-art in several predictive applications in which ontologies are involved.
Availability and implementation: https://github.com/bio-ontology-research-group/onto2vec
OPA2Vec: combining formal and informal content of biomedical ontologies to improve similarity-based prediction
Smaili, Fatima Zohra, Gao, Xin and Hoehndorf, Robert Bioinformatics, pp. bty933 (2018)
Applied Ontology, Semantic similarity
Motivation:
Ontologies are widely used in biology for data annotation, integration and analysis. In addition to formally structured axioms, ontologies contain meta-data in the form of annotation axioms which provide valuable pieces of information that characterize ontology classes. Annotation axioms commonly used in ontologies include class labels, descriptions or synonyms. Despite being a rich source of semantic information, the ontology meta-data are generally unexploited by ontology-based analysis methods such as semantic similarity measures.
Results:
We propose a novel method, OPA2Vec, to generate vector representations of biological entities in ontologies by combining formal ontology axioms and annotation axioms from the ontology meta-data. We apply a Word2Vec model that has been pre-trained on either a corpus or abstracts or full-text articles to produce feature vectors from our collected data. We validate our method in two different ways: first, we use the obtained vector representations of proteins in a similarity measure to predict protein–protein interaction on two different datasets. Second, we evaluate our method on predicting gene–disease associations based on phenotype similarity by generating vector representations of genes and diseases using a phenotype ontology, and applying the obtained vectors to predict gene–disease associations using mouse model phenotypes. We demonstrate that OPA2Vec significantly outperforms existing methods for predicting gene–disease associations. Using evidence from mouse models, we apply OPA2Vec to identify candidate genes for several thousand rare and orphan diseases. OPA2Vec can be used to produce vector representations of any biomedical entity given any type of biomedical ontology.
Availability and implementation: https://github.com/bio-ontology-research-group/opa2vec
In silico exploration of Red Sea Bacillus genomes for natural product biosynthetic gene clusters
Othoum, Ghofran and Bougouffa BMC Genomics, vol. 19(1), pp. 382 (2018)
Bioinformatics
The increasing spectrum of multidrug-resistant bacteria is a major global public health concern, necessitating discovery of novel antimicrobial agents. Here, members of the genus Bacillus are investigated as a potentially attractive source of novel antibiotics due to their broad spectrum of antimicrobial activities. We specifically focus on a computational analysis of the distinctive biosynthetic potential of Bacillus paralicheniformis strains isolated from the Red Sea, an ecosystem exposed to adverse, highly saline and hot conditions.
The anatomy of phenotype ontologies: principles, properties and applications
Georgios V. Gkoutos, Paul N. Schofield and Robert Hoehndorf Briefings in Bioinformatics, vol. 19(5), pp. 1008-1021 (2018)
Applied Ontology, Disease genetics
The past decade has seen an explosion in the collection of genotype data in domains as diverse as medicine, ecology, livestock and plant breeding. Along with this comes the challenge of dealing with the related phenotype data, which is not only large but also highly multidimensional. Computational analysis of phenotypes has therefore become critical for our ability to understand the biological meaning of genomic data in the biological sciences. At the heart of computational phenotype analysis are the phenotype ontologies. A large number of these ontologies have been developed across many domains, and we are now at a point where the knowledge captured in the structure of these ontologies can be used for the integration and analysis of large interrelated data sets. The Phenotype And Trait Ontology framework provides a method for formal definitions of phenotypes and associated data sets and has proved to be key to our ability to develop methods for the integration and analysis of phenotype data. Here, we describe the development and products of the ontological approach to phenotype capture, the formal content of phenotype ontologies and how their content can be used computationally.
Ontology-based validation and identification of regulatory phenotypes
Kulmanov, Maxat, Schofield, Paul N, Gkoutos, Georgios V and Hoehndorf, Robert Bioinformatics, vol. 34(17), pp. i857-i865 (2018)
Applied Ontology, Disease genetics
Motivation: Function annotations of gene products, and phenotype annotations of genotypes, provide valuable information about molecular mechanisms that can be utilized by computational methods to identify functional and phenotypic relatedness, improve our understanding of disease and pathobiology, and lead to discovery of drug targets. Identifying functions and phenotypes commonly requires experiments which are time-consuming and expensive to carry out; creating the annotations additionally requires a curator to make an assertion based on reported evidence. Support to validate the mutual consistency of functional and phenotype annotations as well as a computational method to predict phenotypes from function annotations, would greatly improve the utility of function annotations. Results: We developed a novel ontology-based method to validate the mutual consistency of function and phenotype annotations. We apply our method to mouse and human annotations, and identify several inconsistencies that can be resolved to improve overall annotation quality. We also apply our method to the rule-based prediction of regulatory phenotypes from functions and demonstrate that we can predict these phenotypes with Fmax of up to 0.647. Availability and implementation: https://github.com/bio-ontology-research-group/phenogocon
Alshahrani, Mona and Hoehndorf, Robert Bioinformatics, vol. 34(17), pp. i901-i907 (2018)
Applied Ontology, Disease genetics
Motivation: In the past years, several methods have been developed to incorporate information about phenotypes into computational disease gene prioritization methods. These methods commonly compute the similarity between a disease's (or patient's) phenotypes and a database of gene-to-phenotype associations to find the phenotypically most similar match. A key limitation of these methods is their reliance on knowledge about phenotypes associated with particular genes which is highly incomplete in humans as well as in many model organisms such as the mouse. Results: We developed SmuDGE, a method that uses feature learning to generate vector-based representations of phenotypes associated with an entity. SmuDGE can be used as a trainable semantic similarity measure to compare two sets of phenotypes (such as between a disease and gene, or a disease and patient). More importantly, SmuDGE can generate phenotype representations for entities that are only indirectly associated with phenotypes through an interaction network; for this purpose, SmuDGE exploits background knowledge in interaction networks comprised of multiple types of interactions. We demonstrate that SmuDGE can match or outperform semantic similarity in phenotype-based disease gene prioritization, and furthermore significantly extends the coverage of phenotype-based methods to all genes in a connected interaction network. Availability and implementation: https://github.com/bio-ontology-research-group/SmuDGE
Taxon and trait recognition from digitized herbarium specimens using deep convolutional neural networks
Sohaib Younis, Claus Weiland, Robert Hoehndorf, Stefan Dressler, Thomas Hickler, Bernhard Seeger and Marco Schmidt Botany Letters, vol. 165(3–4), pp. 377-383 (2018)
Bioinformatics
Herbaria worldwide are housing a treasure of hundreds of millions of herbarium specimens, which are increasingly being digitized and thereby more accessible to the scientific community. At the same time, deep-learning algorithms are rapidly improving pattern recognition from images and these techniques are more and more being applied to biological objects. In this study, we are using digital images of herbarium specimens in order to identify taxa and traits of these collection objects by applying convolutional neural networks (CNN). Images of the 1000 species most frequently documented by herbarium specimens on GBIF have been downloaded and combined with morphological trait data, preprocessed and divided into training and test datasets for species and trait recognition. Good performance in both domains suggests substantial potential of this approach for supporting taxonomy and natural history collection management. Trait recognition is also promising for applications in functional ecology.
Ontology based mining of pathogen-disease associations from literature
Senay Kafkas and Robert Hoehndorf Bio-Ontologies COSI (2018)
Applied Ontology, Disease genetics
Ontology-Based Concept Recognition by Using Word Embeddings
Sara Althubaiti, Senay Kafkas and Robert Hoehndorf Bio-Ontologies COSI (2018)
Applied Ontology
Vec2SPARQL: integrating SPARQL queries and knowledge graph embeddings
Maxat Kulmanov, Senay Kafkas, Andreas Karwath, Alexander Malic, Georgios V. Gkoutos, Michel Dumontier and Robert Hoehndorf Proceedings of the 11th International Conference Semantic Web Applications and Tools for Life Sciences, SWAT4LS 2018, Antwerp, Belgium, December 3-6, 2018. (2018)
Applied Ontology
Neuro-symbolic representation learning on biological knowledge graphs
Alshahrani, Mona, Khan, Mohammad Asif, Maddouri, Omar, Kinjo, Akira R., Queralt-Rosinach, Núria and Hoehndorf, Robert Bioinformatics, vol. 33(17), pp. 2723-2730 (2017)
Applied Ontology, Neuro-symbolic AI
Motivation: Biological data and knowledge bases increasingly rely on Semantic Web technologies and the use of knowledge graphs for data integration, retrieval and federated queries. In the past years, feature learning methods that are applicable to graph-structured data are becoming available, but have not yet widely been applied and evaluated on structured biological knowledge. Results: We develop a novel method for feature learning on biological knowledge graphs. Our method combines symbolic methods, in particular knowledge representation using symbolic logic and automated reasoning, with neural networks to generate embeddings of nodes that encode for related information within knowledge graphs. Through the use of symbolic logic, these embeddings contain both explicit and implicit information. We apply these embeddings to the prediction of edges in the knowledge graph representing problems of function prediction, finding candidate genes of diseases, protein-protein interactions, or drug target relations, and demonstrate performance that matches and sometimes outperforms traditional approaches based on manually crafted features. Our method can be applied to any biological knowledge graph, and will thereby open up the increasing amount of Semantic Web based knowledge bases in biology to use in machine learning and data analytics. Availability and implementation: https://github.com/bio-ontology-research-group/walking-rdf-and-owl Contact: robert.hoehndorf@kaust.edu.sa
Data science and symbolic AI: Synergies, challenges and opportunities
Hoehndorf, Robert and Queralt-Rosinach, Núria Data Science, vol. 1(1–2), pp. 27-38 (2017)
Other
Symbolic approaches to artificial intelligence represent things within a domain of knowledge through physical symbols, combine symbols into symbol expressions, and manipulate symbols and symbol expressions through inference processes. While a large part of Data Science relies on statistics and applies statistical approaches to artificial intelligence, there is an increasing potential for successfully applying symbolic approaches as well. Symbolic representations and symbolic inference are close to human cognitive representations and therefore comprehensible and interpretable; they are widely used to represent data and metadata, and their specific semantic content must be taken into account for analysis of such information; and human communication largely relies on symbols, making symbolic representations a crucial part in the analysis of natural language. Here we discuss the role symbolic representations and inference can play in Data Science, highlight the research challenges from the perspective of the data scientist, and argue that symbolic methods should become a crucial component of the data scientists’ toolbox.
DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier
Kulmanov, Maxat, Khan, Mohammed Asif and Hoehndorf, Robert Bioinformatics, vol. 34(4), pp. 660-668 (2017)
Applied Ontology, Protein function prediction
Motivation: A large number of protein sequences are becoming available through the application of novel high-throughput sequencing technologies. Experimental functional characterization of these proteins is time-consuming and expensive, and is often only done rigorously for few selected model organisms. Computational function prediction approaches have been suggested to fill this gap. The functions of proteins are classified using the Gene Ontology (GO), which contains over 40 000 classes. Additionally, proteins have multiple functions, making function prediction a large-scale, multi-class, multi-label problem. Results: We have developed a novel method to predict protein function from sequence. We use deep learning to learn features from protein sequences as well as a cross-species protein–protein interaction network. Our approach specifically outputs information in the structure of the GO and utilizes the dependencies between GO classes as background information to construct a deep learning model. We evaluate our method using the standards established by the Computational Assessment of Function Annotation (CAFA) and demonstrate a significant improvement over baseline methods such as BLAST, in particular for predicting cellular locations. Availability and implementation: Web server: http://deepgo.bio2vec.net, Source code: https://github.com/bio-ontology-research-group/deepgo Contact: robert.hoehndorf@kaust.edu.sa
DES-TOMATO: A Knowledge Exploration System Focused On Tomato Species
Salhi, Adil, Negrão, Sónia, Essack, Magbubah, Morton, Mitchell J. L., Bougouffa, Salim, Mohamad Razali, Rozaimi, Radovanovic, Aleksandar, Marchand, Benoit, Kulmanov, Maxat, Hoehndorf, Robert, Tester, Mark A. and Bajic, Vladimir B. Scientific Reports, vol. 7, pp. 5968 (2017)
Applied Ontology
Tomato is the most economically important horticultural crop used as a model to study plant biology and particularly fruit development. Knowledge obtained from tomato research initiated improvements in tomato and, being transferrable to other such economically important crops, has led to a surge of tomato-related research and published literature. We developed the DES-TOMATO knowledgebase (KB) for exploration of information related to tomato. Information exploration is enabled through terms from 26 dictionaries and combinations of these terms. To illustrate the utility of DES-TOMATO, we provide several examples of how one can efficiently use this KB to retrieve known or potentially novel information. DES-TOMATO is free for academic and nonprofit users and can be accessed at http://cbrc.kaust.edu.sa/des_tomato/, using any of the mainstream web browsers, including Firefox, Safari and Chrome.
Usage of cell nomenclature in biomedical literature
Kafkas, Şenay and Sarntivijai BMC Bioinformatics, vol. 18(17), pp. 561 (2017)
Ontology engineering
Cell lines and cell types are extensively studied in biomedical research, leading to a significant number of publications each year. Identifying cell lines and cell types precisely in publications is crucial for science reproducibility and knowledge integration. There are efforts for standardisation of the cell nomenclature based on ontology development to support FAIR principles of the cell knowledge. However, it is important to analyse the usage of cell nomenclature in publications at a large scale for understanding the level of uptake of cell nomenclature in literature by scientists. In this study, we analyse the usage of cell nomenclature, both in vivo and in vitro, in biomedical literature by using text mining methods and present our results.
Evaluating the effect of annotation size on measures of semantic similarity
Kulmanov, Maxat and Hoehndorf, Robert Journal of Biomedical Semantics, vol. 8(1), pp. 7 (2017)
Applied Ontology, Semantic similarity
Ontologies are widely used as metadata in biological and biomedical datasets. Measures of semantic similarity utilize ontologies to determine how similar two entities annotated with classes from ontologies are, and semantic similarity is increasingly applied in applications ranging from diagnosis of disease to investigation in gene networks and functions of gene products.
In silico screening for candidate chassis strains of free fatty acid-producing cyanobacteria
Motwalli, Olaa and Essack BMC Genomics, vol. 18(1), pp. 33 (2017)
Bioinformatics
Finding a source from which high-energy-density biofuels can be derived at an industrial scale has become an urgent challenge for renewable energy production. Some microorganisms can produce free fatty acids (FFA) as precursors towards such high-energy-density biofuels. In particular, photosynthetic cyanobacteria are capable of directly converting carbon dioxide into FFA. However, current engineered strains need several rounds of engineering to reach the level of production of FFA to be commercially viable; thus new chassis strains that require less engineering are needed. Although more than 120 cyanobacterial genomes are sequenced, the natural potential of these strains for FFA production and excretion has not been systematically estimated.
Semantic prioritization of novel causative genomic variants
Boudellioua, Imane and Mahamad Razali PLOS Computational Biology, vol. 13(4), pp. e1005500 (2017)
Applied Ontology, Disease genetics
Author summary: We address the problem of how to distinguish which of the many thousands of DNA sequence variants carried by an individual with a rare disease is responsible for the disease phenotypes. This can help clinicians arrive at a diagnosis, but also can be instrumental in improving our understanding of the pathobiology of the disease. Many methods are currently available to help with the problem of determining the causative variant, using information about evolutionary conservation and prediction of the functional consequences of the sequence variant. We have developed a novel algorithm (PVP) which augments existing strategies by using the similarity of the patient's phenotype to known phenotype-genotype data in human and model organism databases to further rank potential candidate genes. In a retrospective study, we apply PVP to the interpretation of whole exome sequencing data in patients suffering from congenital hypothyroidism, and find that PVP accurately identifies causative variants in whole exome and whole genome sequencing datasets and provides a powerful resource for the discovery of causal variants.
Phenotype-driven discovery of digenic variants in personal genome sequences
Imane Boudellioua, Maxat Kulmanov, Paul N Schofield, Georgios V Gkoutos and Robert Hoehndorf Proceedings of VarI-SIG (2017)
Bioinformatics, Disease genetics
Identification of variants associated with inherited diseases is a major challenge, in particular in the analysis of clinical sequence data from individual patients. An increasing number of Mendelian diseases have been identified in which two or more variants in multiple genes are required to cause the disease, or significantly modify its severity or phenotype. It is difficult to discover such interactions using existing approaches. Information that links patient phenotypes to databases of gene–phenotype associations observed in clinical and basic research can provide useful information and improve variant prioritization for Mendelian diseases. PhenomeNET is a computational framework that utilizes pan-phenomic data from human and non-human model organisms to prioritize candidate genes in genetically based diseases, and we have recently combined PhenomeNET with genome-wide pathogenicity prediction methods into the PhenomeNET Variant Predictor (PVP) that can be used to prioritize variants in inherited diseases. Here, we illustrate extensions to PVP that can be used to identify variants in oligogenic diseases and their interactions. We inserted multiple variants known to be associated with digenic disease into synthetic genomes and find that PVP can identify sets of causative variants in a hypothesis-neutral manner. Our results show that PVP can efficiently detect oligogenic interactions using a phenotype-driven approach and identify etiologically important variants in whole genomes.
FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation
Bolleman, Jerven T. and Mungall Journal of Biomedical Semantics, vol. 7(1), pp. 39 (2016)
Applied Ontology, Ontology engineering
Nucleotide and protein sequence feature annotations are essential to understand biology on the genomic, transcriptomic, and proteomic level. Using Semantic Web technologies to query biological annotations, there was no standard that described this potentially complex location information as subject-predicate-object triples.
Prediction of Metabolic Pathway Involvement in Prokaryotic UniProtKB Data by Association Rule Mining
Boudellioua, Imane, Saidi, Rabie, Hoehndorf, Robert, Martin, Maria J. and Solovyev, Victor PLoS ONE, vol. 11(7), pp. e0158896 (2016)
Bioinformatics, Protein function prediction
The widening gap between known proteins and their functions has encouraged the development of methods to automatically infer annotations. Automatic functional annotation of proteins is expected to meet the conflicting requirements of maximizing annotation coverage, while minimizing erroneous functional assignments. This trade-off imposes a great challenge in designing intelligent systems to tackle the problem of automatic protein annotation. In this work, we present a system that utilizes rule mining techniques to predict metabolic pathways in prokaryotes. The resulting knowledge represents predictive models that assign pathway involvement to UniProtKB entries. We carried out an evaluation study of our system performance using cross-validation technique. We found that it achieved very promising results in pathway identification with an F1-measure of 0.982 and an AUC of 0.987. Our prediction models were then successfully applied to 6.2 million UniProtKB/TrEMBL reference proteome entries of prokaryotes. As a result, 663,724 entries were covered, where 436,510 of them lacked any previous pathway annotations.
DermO; an ontology for the description of dermatologic disease
Fisher, Hannah M. and Hoehndorf Journal of Biomedical Semantics, vol. 7(1), pp. 38 (2016)
Applied Ontology, Disease genetics
There have been repeated initiatives to produce standard nosologies and terminologies for cutaneous disease, some dedicated to the domain and some part of bigger terminologies such as ICD-10. Recently, formally structured terminologies, ontologies, have been widely developed in many areas of biomedical research. Primarily, these address the aim of providing comprehensive working terminologies for domains of knowledge, but because of the knowledge contained in the relationships between terms they can also be used computationally for many purposes.
The flora phenotype ontology (FLOPO): tool for integrating morphological traits and phenotypes of vascular plants
Hoehndorf, Robert and Alshahrani Journal of Biomedical Semantics, vol. 7(1), pp. 65 (2016)
Applied Ontology, Disease genetics
The systematic analysis of a large number of comparable plant trait data can support investigations into phylogenetics and ecological adaptation, with broad applications in evolutionary biology, agriculture, conservation, and the functioning of ecosystems. Floras, i.e., books collecting the information on all known plant species found within a region, are a potentially rich source of such plant trait data. Floras describe plant traits with a focus on morphology and other traits relevant for species identification in addition to other characteristics of plant species, such as ecological affinities, distribution, economic value, health applications, traditional uses, and so on. However, a key limitation in systematically analyzing information in Floras is the lack of a standardized vocabulary for the described traits as well as the difficulties in extracting structured information from free text.
DESM: portal for microbial knowledge exploration systems
Salhi, Adil, Essack, Magbubah, Radovanovic, Aleksandar, Marchand, Benoit, Bougouffa, Salim, Antunes, Andre, Simoes, Marta Filipa, Lafi, Feras F., Motwalli, Olaa A., Bokhari, Ameerah, Malas, Tariq, Amoudi, Soha Al, Othum, Ghofran, Allam, Intikhab, Mineta, Katsuhiko, Gao, Xin, Hoehndorf, Robert, C. Archer, John A., Gojobori, Takashi and Bajic, Vladimir B. Nucleic Acids Research, vol. 44(D1), pp. D624-D633 (2016)
Applied Ontology
Microorganisms produce an enormous variety of chemical compounds. It is of general interest for microbiology and biotechnology researchers to have means to explore information about molecular and genetic basis of functioning of different microorganisms and their ability for bioproduction. To enable such exploration, we compiled 45 topic-specific knowledgebases (KBs) accessible through DESM portal (www.cbrc.kaust.edu.sa/desm). The KBs contain information derived through text-mining of PubMed information and complemented by information data-mined from various other resources (e.g. ChEBI, Entrez Gene, GO, KOBAS, KEGG, UniPathways, BioGrid). All PubMed records were indexed using 4 538 278 concepts from 29 dictionaries, with 1 638 986 records utilized in KBs. Concepts used are normalized whenever possible. Most of the KBs focus on a particular type of microbial activity, such as production of biocatalysts or nutraceuticals. Others are focused on specific categories of microorganisms, e.g. streptomyces or cyanobacteria. KBs are all structured in a uniform manner and have a standardized user interface. Information exploration is enabled through various searches. Users can explore statistically most significant concepts or pairs of concepts, generate hypotheses, create interactive networks of associated concepts and export results. We believe DESM will be a useful complement to the existing resources to benefit microbiology and biotechnology research.
Using AberOWL for fast and scalable reasoning over BioPortal ontologies
Slater, Luke and Gkoutos Journal of Biomedical Semantics, vol. 7(1), pp. 49 (2016)
Applied Ontology, Ontology engineering
Reasoning over biomedical ontologies using their OWL semantics has traditionally been a challenging task due to the high theoretical complexity of OWL-based automated reasoning. As a consequence, ontology repositories, as well as most other tools utilizing ontologies, either provide access to ontologies without use of automated reasoning, or limit the number of ontologies for which automated reasoning-based access is provided.
Large-Scale Reasoning over Functions in Biomedical Ontologies
Robert Hoehndorf, Liam Mencel, Georgios V. Gkoutos and Paul N. Schofield Formal Ontology in Information Systems, vol. 283, pp. 299-312 (2016)
Applied Ontology
A large number of biomedical resources have been developed to represent the functions of biological entities, and these resources are widely used for data integration and analysis. Expressing functions in biomedical ontologies currently uses formal representation patterns that render basic reasoning tasks to fall in complexity classes beyond polynomial time, thereby limiting the potential of using knowledge-based methods for data integration, querying or quality control. Here, we propose an alternative representation pattern for expressing knowledge about biological functions, together with a biological and ontological justification, which can be expressed using the description logic EL++ and implemented using the OWL 2 EL profile. To demonstrate the utility of our account of biological functions, we apply it to all proteins contained in the SwissProt database and evaluate its utility with respect to answering complex queries as well as with respect to the classification and query times.
To MIREOT or not to MIREOT? A case study of the impact of using MIREOT in the Experimental Factor Ontology (EFO)
Luke Slater, Georgios V. Gkoutos, Paul N Schofield and Robert Hoehndorf International Conference on Biomedical Ontology and BioCreative (ICBO BioCreative 2016) (2016)
Applied Ontology, Ontology engineering
MIREOT is a mechanism for the selective re-use of individual ontology classes in other ontologies. Designed to minimise effort and to support orthogonality, it is now in widespread use. The consequences for ontology integrity and automated reasoning of using the MIREOT mechanism have so far not been fully assessed. In this paper, we perform an analysis of the Experimental Factor Ontology (EFO), an ontology which uses the MIREOT process to gather classes from a large range of other ontologies. Our study examines the effect of combining EFO with the ontologies it references by actually importing them into the EFO. We then evaluate the consistency and status of the combined ontologies. Through our investigation, we reveal that EFO in combination with all its referenced ontologies is logically inconsistent. Furthermore, when EFO is individually combined with many of the ontologies it references, we find a large number of unsatisfiable classes. These results demonstrate a potential problem within a major ontological ecosystem, and reveal possible disadvantages to the use of the MIREOT system for developing ontologies.
Evaluating the effect of annotation size on measures of semantic similarity
Maxat Kulmanov and Robert Hoehndorf Proceedings of Bio-Ontologies SIG (2016)
Applied Ontology, Semantic similarity
Ontologies are widely used as metadata in biological and biomedical datasets. Measures of semantic similarity utilize ontologies to determine how similar two entities annotated with classes from ontologies are, and semantic similarity is increasingly applied in applications ranging from diagnosis of disease to investigation in gene networks and functions of gene products. Here, we analyze a large number of semantic similarity measures and the sensitivity of similarity values to the number of annotations of entities as well as to the difference in annotation size. We find that most similarity measures are sensitive to the number of annotations of entities as well as to the difference in annotation size; well-studied and richly annotated entities will usually show higher similarity than entities with only few annotations even in the absence of any biological relation. Our findings have significant impact on the interpretation of results that rely on measures of semantic similarity.
Integrating phenotype ontologies with PhenomeNET
Miguel Rodriguez-Garcia, Georgios V. Gkoutos, Paul N. Schofield and Robert Hoehndorf Proceedings of Ontology Matching Workshop 2016 (2016)
Applied Ontology, Disease genetics
PhenomeNET is a system for disease gene prioritization that includes as one of its components an ontology designed to integrate phenotype ontologies. While not applicable to matching arbitrary ontologies, PhenomeNET can be used to identify related phenotypes in different species, including human, mouse, zebrafish, nematode worm, fruit fly, and yeast. Here, we apply PhenomeNET to identify related classes from four phenotype and disease ontologies using automated reasoning. We demonstrate that we can identify a large number of mappings, some of which require automated reasoning and cannot easily be identified through lexical approaches alone.
Experiences with Aber-OWL, an Ontology Repository with OWL EL Reasoning
Slater, Luke and Rodriguez-Garcia Ontology Engineering: 12th International Experiences and Directions Workshop on OWL, OWLED 2015, co-located with ISWC 2015, Bethlehem, PA, USA, October 9-10, 2015, Revised Selected Papers, pp. 81-86, In: Tamma, Valentina and Dragoni (Ed.) (2016)
Applied Ontology, Ontology engineering
Reasoning over biomedical ontologies using their OWL semantics has traditionally been a challenging task due to the high theoretical complexity of OWL-based automated reasoning. As a consequence, ontology repositories, as well as most other tools utilizing ontologies, either provide access to ontologies without use of automated reasoning, or limit the number of ontologies for which automated reasoning-based access is provided.
SPARQL2OWL: Towards Bridging the Semantic Gap Between RDF and OWL
Mona Alshahrani, Hussein Almashouq and Robert Hoehndorf Proceedings of the Joint International Conference on Biological Ontology and BioCreative, Corvallis, Oregon, United States, August 1-4, 2016. (2016)
Applied Ontology
Several large databases in biology are now making their information available through the Resource Description Framework (RDF). RDF can be used for large datasets and provides a graph-based semantics. The Web Ontology Language (OWL), another Semantic Web standard, provides a more formal, model-theoretic semantics. While some approaches combine RDF and OWL, for example for querying, knowledge in RDF and OWL is often expressed differently. Here, we propose a method to generate OWL ontologies from SPARQL queries using n-ary relational patterns. Combined with background knowledge from ontologies, the generated OWL ontologies can be used for expressive queries and quality control of RDF data. We implement our method in a prototype tool available at https://github.com/bio-ontology-research-group/SPARQL2OWL
Datamining with Ontologies
Hoehndorf, Robert and Gkoutos Data Mining Techniques for the Life Sciences, pp. 385-397, In: Carugo, Oliviero and Eisenhaber (Ed.) (2016)
Applied Ontology
The use of ontologies has increased rapidly over the past decade and they now provide a key component of most major databases in biology and biomedicine. Consequently, datamining over these databases benefits from considering the specific structure and content of ontologies, and several methods have been developed to use ontologies in datamining applications. Here, we discuss the principles of ontology structure, and datamining methods that rely on ontologies. The impact of these methods in the biological and biomedical sciences has been profound and is likely to increase as more datasets are becoming available using common, shared ontologies.
Aber-OWL: a framework for ontology-based data access in biology
Robert Hoehndorf, Luke Slater, Paul N Schofield and Georgios V Gkoutos BMC Bioinformatics, vol. 16, pp. 26 (2015)
Applied Ontology, Ontology engineering
Background: Many ontologies have been developed in biology and these ontologies increasingly contain large volumes of formalized knowledge commonly expressed in the Web Ontology Language (OWL). Computational access to the knowledge contained within these ontologies relies on the use of automated reasoning. Results: We have developed the Aber-OWL infrastructure that provides reasoning services for bio-ontologies. Aber-OWL consists of an ontology repository, a set of web services and web interfaces that enable ontology-based semantic access to biological data and literature. Aber-OWL is freely available at http://aber-owl.net. Conclusions: Aber-OWL provides a framework for automatically accessing information that is annotated with ontologies or contains terms used to label classes in ontologies. When using Aber-OWL, access to ontologies and data annotated with them is not merely based on class names or identifiers but rather on the knowledge the ontologies contain and the inferences that can be drawn from it.
GFVO: the Genomic Feature and Variation Ontology
Baran, Joachim, Durgahee, Bibi Sehnaaz Begum, Eilbeck, Karen, Antezana, Erick, Hoehndorf, Robert and Dumontier, Michel PeerJ, vol. 3, pp. e933 (2015)
Applied Ontology
Falling costs in genomic laboratory experiments have led to a steady increase of genomic feature and variation data. Multiple genomic data formats exist for sharing these data, and whilst they are similar, they are addressing slightly different data viewpoints and are consequently not fully compatible with each other. The fragmentation of data format specifications makes it hard to integrate and interpret data for further analysis with information from multiple data providers. As a solution, a new ontology is presented here for annotating and representing genomic feature and variation dataset contents. The Genomic Feature and Variation Ontology (GFVO) specifically addresses genomic data as it is regularly shared using the GFF3 (incl. FASTA), GTF, GVF and VCF file formats. GFVO simplifies data integration and enables linking of genomic annotations across datasets through common semantics of genomic types and relations. Availability and implementation: The latest stable release of the ontology is available via its base URI; previous and development versions are available at the ontology’s GitHub repository: https://github.com/BioInterchange/Ontologies; versions of the ontology are indexed through BioPortal (without external class-/property-equivalences due to BioPortal release 4.10 limitations); examples and reference documentation are provided on a separate web-page: http://www.biointerchange.org/ontologies.html. GFVO version 1.0.2 is licensed under the CC0 1.0 Universal license (https://creativecommons.org/publicdomain/zero/1.0) and therefore de facto within the public domain; the ontology can be appropriated without attribution for commercial and non-commercial use.
Best behaviour? Ontologies and the formal description of animal behaviour
Gkoutos, Georgios V, Hoehndorf, Robert, Tsaprouni, Loukia and Schofield, Paul N Mammalian Genome, vol. 26(9–10), pp. 540-547 (2015)
Applied Ontology, Ontology engineering
The development of ontologies for describing animal behavior has proved to be one of the most difficult of all scientific knowledge domains. Ranging from neurological processes to human emotions the range and scope needed for such ontologies is highly challenging, but if data integration and computational tools such as automated reasoning are to be fully applied in this important area the underlying principles of these ontologies need to be better established and development needs detailed coordination. Whilst the state of scientific knowledge is always paramount in ontology and formal description framework design, this is a particular problem with neurobehavioural ontologies where our understanding of the relationship between behaviour and its underlying biophysical basis is currently in its infancy. In this commentary we discuss some of the fundamental problems in designing and using behavior ontologies, and present some of the best developed tools in this domain.
Analysis of mammalian gene function through broad-based phenotypic screens across a consortium of mouse clinics
Martin Hrabě de Angelis, George Nicholson, Mohammed Selloum, Jacqueline K White, Hugh Morgan, Ramiro Ramirez-Solis, Tania Sorg, Sara Wells, Helmut Fuchs, Martin Fray, David J Adams, Niels C Adams, Thure Adler, Antonio Aguilar-Pimentel, Dalila Ali-Hadji, Gregory Amann, Philippe André, Sarah Atkins, Aurelie Auburtin, Abdel Ayadi, Julien Becker, Lore Becker, Elodie Bedu, Raffi Bekeredjian, Marie-Christine Birling, Andrew Blake, Joanna Bottomley, Michael R Bowl, Véronique Brault, Dirk H Busch, James N Bussell, Julia Calzada-Wack, Heather Cater, Marie-France Champy, Philippe Charles, Claire Chevalier, Francesco Chiani, Gemma F Codner, Roy Combe, Roger Cox, Emilie Dalloneau, André Dierich, Armida Di Fenza, Brendan Doe, Arnaud Duchon, Oliver Eickelberg, Chris T Esapa, Lahcen El Fertak, Tanja Feigel, Irina Emelyanova, Jeanne Estabel, Jack Favor, Ann Flenniken, Alessia Gambadoro, Lilian Garrett, Hilary Gates, Anna-Karin Gerdin, George Gkoutos, Simon Greenaway, Lisa Glasl, Patrice Goetz, Isabelle Goncalves Da Cruz, Alexander Götz, Jochen Graw, Alain Guimond, Wolfgang Hans, Geoff Hicks, Sabine M Hölter, Heinz Höfler, John M Hancock, Robert Hoehndorf, Tertius Hough, Richard Houghton, Anja Hurt, Boris Ivandic, Hughes Jacobs, Sylvie Jacquot, Nora Jones, Natasha A Karp, Hugo A Katus, Sharon Kitchen, Tanja Klein-Rodewald, Martin Klingenspor, Thomas Klopstock, Valerie Lalanne, Sophie Leblanc, Christoph Lengger, Elise le Marchand, Tonia Ludwig, Aline Lux, Colin McKerlie, Holger Maier, Jean-Louis Mandel, Susan Marschall, Manuel Mark, David G Melvin, Hamid Meziane, Kateryna Micklich, Christophe Mittelhauser, Laurent Monassier, David Moulaert, Stéphanie Muller, Beatrix Naton, Frauke Neff, Patrick M Nolan, Lauryl M J Nutter, Markus Ollert, Guillaume Pavlovic, Natalia S Pellegata, Emilie Peter, Benoit Petit-Demoulière, Amanda Pickard, Christine Podrini, Paul Potter, Laurent Pouilly, Oliver Puk, David Richardson, Stephane Rousseau, Leticia Quintanilla-Fend, Mohamed M Quwailid, Ildiko Racz, Birgit Rathkolb, Fabrice Riet, Janet Rossant, Michel Roux, Jan Rozman, Edward Ryder, Jennifer Salisbury, Luis Santos, Karl-Heinz Schäble, Evelyn Schiller, Anja Schrewe, Holger Schulz, Ralf Steinkamp, Michelle Simon, Michelle Stewart, Claudia Stöger, Tobias Stöger, Minxuan Sun, David Sunter, Lydia Teboul, Isabelle Tilly, Glauco P Tocchini-Valentini, Monica Tost, Irina Treise, Laurent Vasseur, Emilie Velot, Daniela Vogt-Weisenhorn, Christelle Wagner, Alison Walling, Marie Wattenhofer-Donze, Bruno Weber, Olivia Wendling, Henrik Westerberg, Monja Willershäuser, Eckhard Wolf, Anne Wolter, Joe Wood, Wolfgang Wurst, Ali Önder Yildirim, Ramona Zeh, Andreas Zimmer, Annemarie Zimprich, Chris Holmes, Karen P Steel, Yann Herault, Valérie Gailus-Durner, Ann-Marie Mallon and Steve D M Brown Nature Genetics, vol. 47, pp. 969-978 (2015)
Disease genetics
The function of the majority of genes in the mouse and human genomes remains unknown. The mouse embryonic stem cell knockout resource provides a basis for the characterization of relationships between genes and phenotypes. The EUMODIC consortium developed and validated robust methodologies for the broad-based phenotyping of knockouts through a pipeline comprising 20 disease-oriented platforms. We developed new statistical methods for pipeline design and data analysis aimed at detecting reproducible phenotypes with high power. We acquired phenotype data from 449 mutant alleles, representing 320 unique genes, of which half had no previous functional annotation. We captured data from over 27,000 mice, finding that 83% of the mutant lines are phenodeviant, with 65% demonstrating pleiotropy. Surprisingly, we found significant differences in phenotype annotation according to zygosity. New phenotypes were uncovered for many genes with previously unknown function, providing a powerful basis for hypothesis generation and further investigation in diverse systems.
Ranking Adverse Drug Reactions With Crowdsourcing
Gottlieb, Assaf and Hoehndorf J Med Internet Res, vol. 17(3), pp. e80 (2015)
Drug mechanisms
Background: There is no publicly available resource that provides the relative severity of adverse drug reactions (ADRs). Such a resource would be useful for several applications, including assessment of the risks and benefits of drugs and improvement of patient-centered care. It could also be used to triage predictions of drug adverse events. Objective: The intent of the study was to rank ADRs according to severity. Methods: We used Internet-based crowdsourcing to rank ADRs according to severity. We assigned 126,512 pairwise comparisons of ADRs to 2589 Amazon Mechanical Turk workers and used these comparisons to rank order 2929 ADRs. Results: There is good correlation (rho=.53) between the mortality rates associated with ADRs and their rank. Our ranking highlights severe drug-ADR predictions, such as cardiovascular ADRs for raloxifene and celecoxib. It also triages genes associated with severe ADRs such as epidermal growth-factor receptor (EGFR), associated with glioblastoma multiforme, and SCN1A, associated with epilepsy. Conclusions: ADR ranking lays a first stepping stone in personalized drug risk assessment. Ranking of ADRs using crowdsourcing may have useful clinical and financial implications, and should be further investigated in the context of health care decision making.
Similarity-based search of model organism, disease and drug effect phenotypes
Hoehndorf, Robert, Gruenberger, Michael, Gkoutos, Georgios and Schofield, Paul Journal of Biomedical Semantics, vol. 6(1), pp. 6 (2015)
BACKGROUND: Semantic similarity measures over phenotype ontologies have been demonstrated to provide a powerful approach for the analysis of model organism phenotypes, the discovery of animal models of human disease, novel pathways, gene functions, druggable therapeutic targets, and determination of pathogenicity. RESULTS: We have developed PhenomeNET 2, a system that enables similarity-based searches over a large repository of phenotypes in real-time. It can be used to identify strains of model organisms that are phenotypically similar to human patients, diseases that are phenotypically similar to model organism phenotypes, or drug effect profiles that are similar to the phenotypes observed in a patient or model organism. PhenomeNET 2 is available at http://aber-owl.net/phenomenet. CONCLUSIONS: Phenotype-similarity searches can provide a powerful tool for the discovery and investigation of molecular mechanisms underlying an observed phenotypic manifestation. PhenomeNET 2 facilitates user-defined similarity searches and allows researchers to analyze their data within a large repository of human, mouse and rat phenotypes.
The role of ontologies in biological and biomedical research: a functional perspective
Hoehndorf, Robert, Schofield, Paul N. and Gkoutos, Georgios V. Briefings in Bioinformatics, vol. 16(6), pp. 1069-1080 (2015)
Applied Ontology
Ontologies are widely used in biological and biomedical research. Their success lies in their combination of four main features present in almost all ontologies: provision of standard identifiers for classes and relations that represent the phenomena within a domain; provision of a vocabulary for a domain; provision of metadata that describes the intended meaning of the classes and relations in ontologies; and the provision of machine-readable axioms and definitions that enable computational access to some aspects of the meaning of classes and relations. While each of these features enables applications that facilitate data integration, data access and analysis, a great potential lies in the possibility of combining these four features to support integrative analysis and interpretation of multimodal data. Here, we provide a functional perspective on ontologies in biology and biomedicine, focusing on what ontologies can do and describing how they can be used in support of integrative research. We also outline perspectives for using ontologies in data-driven science, in particular their application in structured data mining and machine learning applications.
Analysis of the human diseasome using phenotype similarity between common, genetic, and infectious diseases
Robert Hoehndorf, Paul N Schofield and Georgios V Gkoutos Scientific Reports, vol. 5, pp. 10888 (2015)
Disease genetics, Semantic similarity
Phenotypes are the observable characteristics of an organism arising from its response to the environment. Phenotypes associated with engineered and natural genetic variation are widely recorded using phenotype ontologies in model organisms, as are signs and symptoms of human Mendelian diseases in databases such as OMIM and Orphanet. Exploiting these resources, several computational methods have been developed for integration and analysis of phenotype data to identify the genetic etiology of diseases or suggest plausible interventions. A similar resource would be highly useful not only for rare and Mendelian diseases, but also for common, complex and infectious diseases. We apply a semantic text-mining approach to identify the phenotypes (signs and symptoms) associated with over 6,000 diseases. We evaluate our text-mined phenotypes by demonstrating that they can correctly identify known disease-associated genes in mice and humans with high accuracy. Using a phenotypic similarity measure, we generate a human disease network in which diseases that have similar signs and symptoms cluster together, and we use this network to identify closely related diseases based on common etiological, anatomical as well as physiological underpinnings.
An ontology approach to comparative phenomics in plants
BACKGROUND: Plant phenotype datasets include many different types of data, formats, and terms from specialized vocabularies. Because these datasets were designed for different audiences, they frequently contain language and details tailored to investigators with different research objectives and backgrounds. Although phenotype comparisons across datasets have long been possible on a small scale, comprehensive queries and analyses that span a broad set of reference species, research disciplines, and knowledge domains continue to be severely limited by the absence of a common semantic framework. RESULTS: We developed a workflow to curate and standardize existing phenotype datasets for six plant species, encompassing both model species and crop plants with established genetic resources. Our effort focused on mutant phenotypes associated with genes of known sequence in Arabidopsis thaliana (L.) Heynh. (Arabidopsis), Zea mays L. subsp. mays (maize), Medicago truncatula Gaertn. (barrel medic or Medicago), Oryza sativa L. (rice), Glycine max (L.) Merr. (soybean), and Solanum lycopersicum L. (tomato). We applied the same ontologies, annotation standards, formats, and best practices across all six species, thereby ensuring that the shared dataset could be used for cross-species querying and semantic similarity analyses. Curated phenotypes were first converted into a common format using taxonomically broad ontologies such as the Plant Ontology, Gene Ontology, and Phenotype and Trait Ontology. We then compared ontology-based phenotypic descriptions with an existing classification system for plant phenotypes and evaluated our semantic similarity dataset for its ability to enhance predictions of gene families, protein functions, and shared metabolic pathways that underlie informative plant phenotypes. CONCLUSIONS: The use of ontologies, annotation standards, shared formats, and best practices for cross-taxon phenotype data analyses represents a novel approach to plant phenomics that enhances the utility of model genetic organisms and can be readily applied to species with fewer genetic resources and less well-characterized genomes. In addition, these tools should enhance future efforts to explore the relationships among phenotypic similarity, gene function, and sequence similarity in plants, and to make genotype-to-phenotype predictions relevant to plant biology, crop improvement, and potentially even human health.
Using Aber-OWL for fast and scalable reasoning over BioPortal ontologies
Luke Slater, Georgios Gkoutos, Paul N. Schofield and Robert Hoehndorf Proceedings of International Conference on Biomedical Ontologies (ICBO), pp. 72-76 (2015)
Applied Ontology, Ontology engineering
Reasoning over biomedical ontologies using their OWL semantics has traditionally been a challenging task due to the high theoretical complexity of OWL-based automated reasoning. As a consequence, ontology repositories, as well as most other tools utilizing ontologies, either provide access to ontologies without use of automated reasoning, or limit the number of ontologies for which automated reasoning-based access is provided. We apply the Aber-OWL infrastructure to provide automated reasoning-based access to all accessible and consistent ontologies in BioPortal (368 ontologies). We perform an extensive performance evaluation to determine query times, both for queries of different complexity as well as for queries that are performed in parallel over the ontologies. We demonstrate that, with the exception of a few ontologies, even complex and parallel queries can now be answered in milliseconds, therefore allowing automated reasoning to be used on a large scale, to run in parallel, and with rapid response times.
AberOWL: an ontology portal with OWL EL reasoning
Luke Slater, Georgios Gkoutos, Paul N. Schofield and Robert Hoehndorf Proceedings of International Conference on Biomedical Ontologies (ICBO), pp. 127-128 (2015)
Applied Ontology, Ontology engineering
The fields of biological and biomedical science quickly generate large quantities of data and knowledge; often, domain knowledge is formalised using ontologies expressed in the Web Ontology Language (OWL). Ontology repositories such as BioPortal and Ontobee have been an important infrastructural component for managing ontologies, specifically to search, browse and download ontologies over the Web. We present the AberOWL system, a novel ontology repository that allows access to multiple ontologies through automated reasoning, utilizing parts of the OWL of the ontologies alongside a web interface and web services. AberOWL contains over 300 ontologies and integrates reasoning over ontologies with access to literature and SPARQL endpoints.
The Semanticscience Integrated Ontology (SIO) for biomedical research and knowledge discovery
Dumontier, Michel, Baker, Christopher, Baran, Joachim, Callahan, Alison, Chepelev, Leonid, Cruz-Toledo, Jose, Del Rio, Nicholas, Duck, Geraint, Furlong, Laura, Keath, Nichealla, Klassen, Dana, McCusker, James, Queralt-Rosinach, Nuria, Samwald, Matthias, Villanueva-Rosales, Natalia, Wilkinson, Mark and Hoehndorf, Robert Journal of Biomedical Semantics, vol. 5(1), pp. 14 (2014)
Applied Ontology
The Semanticscience Integrated Ontology (SIO) is an ontology to facilitate biomedical knowledge discovery. SIO features a simple upper level comprised of essential types and relations for the rich description of arbitrary (real, hypothesized, virtual, fictional) objects, processes and their attributes. SIO specifies simple design patterns to describe and associate qualities, capabilities, functions, quantities, and informational entities including textual, geometrical, and mathematical entities, and provides specific extensions in the domains of chemistry, biology, biochemistry, and bioinformatics. SIO provides an ontological foundation for the Bio2RDF linked data for the life sciences project and is used for semantic integration and discovery for SADI-based semantic web services. SIO is freely available to all users under a creative commons by attribution license. See website for further information: http://sio.semanticscience.org.
BioHackathon series in 2011 and 2012: penetration of ontology and linked data in life science domains.
Katayama, Toshiaki
and Wilkinson Journal of biomedical semantics, vol. 5(1), pp. 5 (2014)
Applied Ontology
The application of semantic technologies to the integration of biological data and the interoperability of bioinformatics analysis and visualization tools has been the common theme of a series of annual BioHackathons hosted in Japan for the past five years. Here we provide a review of the activities and outcomes from the BioHackathons held in 2011 in Kyoto and 2012 in Toyama. In order to efficiently implement semantic technologies in the life sciences, participants formed various sub-groups and worked on the following topics: Resource Description Framework (RDF) models for specific domains, text mining of the literature, ontology development, essential metadata for biological databases, platforms to enable efficient Semantic Web technology development and interoperability, and the development of applications for Semantic Web data. In this review, we briefly introduce the themes covered by these sub-groups. The observations made, conclusions drawn, and software development projects that emerged from these activities are discussed.
Thematic series on biomedical ontologies in JBMS: challenges and new directions
Hoehndorf, Robert, Haendel, Melissa, Stevens, Robert and Rebholz-Schuhmann, Dietrich Journal of Biomedical Semantics, vol. 5(1), pp. 15 (2014)
Applied Ontology
Over the past 15 years, the biomedical research community has increased its efforts to produce ontologies encoding biomedical knowledge, and to provide the corresponding infrastructure to maintain them. As ontologies are becoming a central part of biological and biomedical research, a communication channel to publish frequent updates and latest developments on them would be an advantage. Here, we introduce the JBMS thematic series on Biomedical Ontologies. The aim of the series is to disseminate the latest developments in research on biomedical ontologies and provide a venue for publishing newly developed ontologies, updates to existing ontologies as well as methodological advances, and selected contributions from conferences and workshops. We aim to give this thematic series a central role in the exploration of ongoing research in biomedical ontologies and intend to work closely together with the research community towards this aim. Researchers and working groups are encouraged to provide feedback on novel developments and special topics to be integrated into the existing publication cycles.
Analyzing gene expression data in mice with the Neuro Behavior Ontology
Hoehndorf, R., Hancock, J. M., Hardy, N. W., Mallon, A. M., Schofield, P. N. and Gkoutos, G. V. Mamm Genome, vol. 25(1-2), pp. 32-40 (2014)
Applied Ontology, Bioinformatics
We have applied the Neuro Behavior Ontology (NBO), an ontology for the annotation of behavioral gene functions and behavioral phenotypes, to the annotation of more than 1,000 genes in the mouse that are known to play a role in behavior. These annotations can be explored by researchers interested in genes involved in particular behaviors and used computationally to provide insights into the behavioral phenotypes resulting from differences in gene expression. We developed the OntoFUNC tool and have applied it to enrichment analyses over the NBO to provide high-level behavioral interpretations of gene expression datasets. The resulting increase in the number of gene annotations facilitates the identification of behavioral or neurologic processes by assisting the formulation of hypotheses about the relationships between genes, processes, and phenotypic manifestations resulting from behavioral observations.
Enriched biodiversity data as a resource and service
Rutger Vos, Jordan Biserkov, Bachir Balech, Niall Beard, Matthew Blissett, Christian Brenninkmeijer, Tom van Dooren, David Eades, George Gosline, Quentin Groom, Thomas Hamann, Hannes Hettling, Robert Hoehndorf, Ayco Holleman, Peter Hovenkamp, Patricia Kelbert, David King, Don Kirkup, Youri Lammers, Thibaut DeMeulemeester, Daniel Mietchen, Jeremy Miller, Ross Mounce, Nicola Nicolson, Rod Page, Aleksandra Pawlik, Serrano Pereira, Lyubomir Penev, Kevin Richards, Guido Sautter, David Shorthouse, Marko Tähtinen, Claus Weiland, Alan Williams and Soraya Sierra Biodiversity Data Journal, vol. 2, pp. e1125 (2014)
Other
Background: Recent years have seen a surge in projects that produce large volumes of structured, machine-readable biodiversity data. To make these data amenable to processing by generic, open source “data enrichment” workflows, they are increasingly being represented in a variety of standards-compliant interchange formats. Here, we report on an initiative in which software developers and taxonomists came together to address the challenges and highlight the opportunities in the enrichment of such biodiversity data by engaging in intensive, collaborative software development: The Biodiversity Data Enrichment Hackathon.
Results: The hackathon brought together 37 participants (including developers and taxonomists, i.e. scientific professionals that gather, identify, name and classify species) from 10 countries: Belgium, Bulgaria, Canada, Finland, Germany, Italy, the Netherlands, New Zealand, the UK, and the US. The participants brought expertise in processing structured data, text mining, development of ontologies, digital identification keys, geographic information systems, niche modeling, natural language processing, provenance annotation, semantic integration, taxonomic name resolution, web service interfaces, workflow tools and visualisation. Most use cases and exemplar data were provided by taxonomists.
One goal of the meeting was to facilitate re-use and enhancement of biodiversity knowledge by a broad range of stakeholders, such as taxonomists, systematists, ecologists, niche modelers, informaticians and ontologists. The suggested use cases resulted in nine breakout groups addressing three main themes: i) mobilising heritage biodiversity knowledge; ii) formalising and linking concepts; and iii) addressing interoperability between service platforms. Another goal was to further foster a community of experts in biodiversity informatics and to build human links between research projects and institutions, in response to recent calls to further such integration in this research domain.
Conclusions: Beyond deriving prototype solutions for each use case, areas of inadequacy were discussed and are being pursued further. It was striking how many possible applications for biodiversity data there were and how quickly solutions could be put together when the normal constraints to collaboration were broken down for a week. Conversely, mobilising biodiversity knowledge from their silos in heritage literature and natural history collections will continue to require formalisation of the concepts (and the links between them) that define the research domain, as well as increased interoperability between the software platforms that operate on these concepts.
An integrative, translational approach to understanding rare and orphan genetically based diseases
Hoehndorf, Robert, Schofield, Paul N. and Gkoutos, Georgios V. Interface Focus, vol. 3(2) (2013)
Disease genetics, Semantic interoperability
PhenomeNet is an approach for integrating phenotypes across species and identifying candidate genes for genetic diseases based on the similarity between a disease and animal model phenotypes. In contrast to 'guilt-by-association' approaches, PhenomeNet relies exclusively on the comparison of phenotypes to suggest candidate genes, and can, therefore, be applied to study the molecular basis of rare and orphan diseases for which the molecular basis is unknown. In addition to disease phenotypes from the Online Mendelian Inheritance in Man (OMIM) database, we have now integrated the clinical signs from Orphanet into PhenomeNet. We demonstrate that our approach can efficiently identify known candidate genes for genetic diseases in Orphanet and OMIM. Furthermore, we find evidence that mutations in the HIP1 gene might cause Bassoe syndrome, a rare disorder with unknown genetic aetiology. Our results demonstrate that integration and computational analysis of human disease and animal model phenotypes using PhenomeNet has the potential to reveal novel insights into the pathobiology underlying genetic diseases.
Representing physiological processes and their participants with PhysioMaps
Cook, Daniel, Neal, Maxwell, Hoehndorf, Robert, Gkoutos, Georgios and Gennari, John Journal of Biomedical Semantics, vol. 4(Suppl 1), pp. S2 (2013)
Applied Ontology
BACKGROUND: As the number and size of biological knowledge resources for physiology grow, researchers need improved tools for searching and integrating knowledge and physiological models. Unfortunately, current resources--databases, simulation models, and knowledge bases, for example--are only occasionally and idiosyncratically explicit about the semantics of the biological entities and processes that they describe. RESULTS: We present a formal approach, based on the semantics of biophysics as represented in the Ontology of Physics for Biology, that divides physiological knowledge into three partitions: structural knowledge, process knowledge and biophysical knowledge. We then computationally integrate these partitions across multiple structural and biophysical domains as computable ontologies by which such knowledge can be archived, reused, and displayed. Our key result is the semi-automatic parsing of biosimulation model code into PhysioMaps that can be displayed and interrogated for qualitative responses to hypothetical perturbations. CONCLUSIONS: Strong, explicit semantics of biophysics can provide a formal, computational basis for integrating physiological knowledge in a manner that supports visualization of the physiological content of biosimulation models across spatial scales and biophysical domains.
Evaluating gold standard corpora against gene/protein tagging solutions and lexical resources
Rebholz-Schuhmann, Dietrich, Kafkas, Senay, Kim, Jee-Hyub, Li, Chen, Jimeno Yepes, Antonio, Hoehndorf, Robert, Backofen, Rolf and Lewin, Ian Journal of Biomedical Semantics, vol. 4(1), pp. 28 (2013)
Ontology engineering
MOTIVATION: The identification of protein and gene names (PGNs) from the scientific literature requires semantic resources: Terminological and lexical resources deliver the term candidates into PGN tagging solutions and the gold standard corpora (GSC) train them to identify term parameters and contextual features. Ideally all three resources, i.e. corpora, lexica and taggers, cover the same domain knowledge, and thus support identification of the same types of PGNs and cover all of them. Unfortunately, none of the three serves as a predominant standard and for this reason it is worth exploring how these three resources comply with each other. We systematically compare different PGN taggers against publicly available corpora and analyze the impact of the included lexical resource in their performance. In particular, we determine the performance gains through false positive filtering, which contributes to the disambiguation of identified PGNs. RESULTS: In general, machine learning approaches (ML-Tag) for PGN tagging show higher F1-measure performance against the BioCreative-II and Jnlpba GSCs (exact matching), whereas the lexicon based approaches (LexTag) in combination with disambiguation methods show better results on FsuPrge and PennBio. The ML-Tag solutions balance precision and recall, whereas the LexTag solutions have different precision and recall profiles at the same F1-measure across all corpora. Higher recall is achieved with larger lexical resources, which also introduce more noise (false positive results). The ML-Tag solutions certainly perform best, if the test corpus is from the same GSC as the training corpus. As expected, the false negative errors characterize the test corpora and - on the other hand - the profiles of the false positive mistakes characterize the tagging solutions. LexTag solutions that are based on a large terminological resource in combination with false positive filtering produce better results, which, in addition, provide concept identifiers from a knowledge source in contrast to ML-Tag solutions. CONCLUSION: The standard ML-Tag solutions achieve high performance, but not across all corpora, and thus should be trained using several different corpora to reduce possible biases. The LexTag solutions have different profiles for their precision and recall performance, but with similar F1-measure. This result is surprising and suggests that they cover a portion of the most common naming standards, but cope differently with the term variability across the corpora. The false positive filtering applied to LexTag solutions does improve the results by increasing their precision without compromising significantly their recall. The harmonisation of the annotation schemes in combination with standardized lexical resources in the tagging solutions will enable their comparability and will pave the way for a shared standard.
Evaluation and Cross-Comparison of Lexical Entities of Biological Interest (LexEBI)
Dietrich Rebholz-Schuhmann, Jee-Hyub Kim, Ying Yan, Abhishek Dixit, Caroline Friteyre, Robert Hoehndorf, Rolf Backofen and Ian Lewin PLoS ONE, vol. 8(10), pp. e75185 (2013)
Applied Ontology
Motivation: Biomedical entities, their identifiers and names, are essential in the representation of biomedical facts and knowledge. In the same way, the complete set of biomedical and chemical terms, i.e. the biomedical "term space" (the "Lexeome"), forms a key resource to achieve the full integration of the scientific literature with biomedical data resources: any identified named entity can immediately be normalized to the correct database entry. This goal does not only require that we are aware of all existing terms, but would also profit from knowing all their senses and their semantic interpretation (ambiguities, nestedness). Result: This study compiles a resource for lexical terms of biomedical interest in a standard format (called "LexEBI"), determines the overall number of terms, their reuse in different resources and the nestedness of terms. LexEBI comprises references for protein and gene entries and their term variants and chemical entities amongst other terms. In addition, disease terms have been identified from Medline and PubmedCentral and added to LexEBI. Our analysis demonstrates that the baseforms of terms from the different semantic types show only little polysemous use. Nonetheless, the term variants of protein and gene names (PGNs) frequently contain species mentions, which should have been avoided according to protein annotation guidelines. Furthermore, the protein and gene entities as well as the chemical entities, both do comprise enzymes leading to hierarchical polysemy, and a large portion of PGNs make reference to a chemical entity. Altogether, according to our analysis based on the Medline distribution, 401,869 unique PGNs in the documents contain a reference to 25,022 chemical entities, 3,125 disease terms or 1,576 species mentions. Conclusion: LexEBI delivers the complete biomedical and chemical Lexeome in a standardized representation (http://www.ebi.ac.uk/Rebholz-srv/LexEBI/). The resource provides the disease terms as open source content, and fully interlinks terms across resources.
Semantic Systems Biology: Formal Knowledge Representation in Systems Biology for Model Construction, Retrieval, Validation and Discovery
Dumontier, Michel, Chepelev, Leonid L and Hoehndorf, Robert Systems Biology, pp. 355-373 (2013)
Applied Ontology
With the publication of the human genome, scientists worldwide opened champagne and let out a collective cheer for progress in biology. After all, the untold number of interactions of tens of thousands of genes, a greater number of their products and product derivatives, and tens of thousands of chemicals came much closer to complete characterization. Paradoxically however, while individual efforts produced important biological results, an integrated view of biology from systems perspective seemed ever more distant due to the complexity of data integration from multiple knowledge representation forms, formalisms, modeling paradigms, and conflicting scientific statements. To address this, semantic technologies have risen over the past decade with the promise of truly unifying biological knowledge and allowing cross-domain queries and model integration. In this chapter, we shall examine Semantic Web technologies and their applications to build, publish, query, discover, compare, validate, reason about, and evaluate models and knowledge in Systems Biology. We shall specifically address biological ontologies, open data repositories, modeling and annotation tools, and selected promising applications of Semantic Systems Biology. We firmly believe that it shall soon be possible to completely close the gap between facts, models, and results, and to fully apply the accrued models and facts to evaluate biological hypotheses on a system level, discovering meaning within the vast collection of biological knowledge and taking Systems Biology research to a new, unprecedented level.
Mouse model phenotypes provide information about human drug targets
Hoehndorf, Robert, Hiebert, Tanya, Hardy, Nigel W., Schofield, Paul N., Gkoutos, Georgios V. and Dumontier, Michel Bioinformatics (2013)
Disease genetics, Drug mechanisms
Motivation: Methods for computational drug target identification utilize information from diverse information sources to predict or prioritize drug targets for known drugs. One set of resources that has been relatively neglected for drug repurposing are animal model phenotypes. Results: We investigate the use of mouse model phenotypes for drug target identification. To achieve this goal, we first integrate mouse model phenotypes and drug effects, and then systematically compare the phenotypic similarity between mouse models and drug effect profiles. We find a high similarity between phenotypes resulting from loss-of-function mutations and drug effects resulting from the inhibition of a protein through a drug action, and demonstrate how this approach can be used to suggest candidate drug targets. Availability and implementation: Analysis code and supplementary data files are available on the project website at https://drugeffects.googlecode.com. Contact: roh25@aber.ac.uk
Semantic integration of physiology phenotypes with an application to the Cellular Phenotype Ontology
Hoehndorf, Robert, Harris, Midori A., Herre, Heinrich, Rustici, Gabriella and Gkoutos, Georgios V. Bioinformatics, vol. 28(13), pp. 1783-1789 (2012)
Motivation: The systematic observation of phenotypes has become a crucial tool of functional genomics, and several large international projects are currently underway to identify and characterize the phenotypes that are associated with genotypes in several species. To integrate phenotype descriptions within and across species, phenotype ontologies have been developed. Applying ontologies to unify phenotype descriptions in the domain of physiology has been a particular challenge due to the high complexity of the underlying domain. Results: In this study, we present the outline of a theory and its implementation for an ontology of physiology-related phenotypes. We provide a formal description of process attributes and relate them to the attributes of their temporal parts and participants. We apply our theory to create the Cellular Phenotype Ontology (CPO). The CPO is an ontology of morphological and physiological phenotypic characteristics of cells, cell components and cellular processes. Its prime application is to provide terms and uniform definition patterns for the annotation of cellular phenotypes. The CPO can be used for the annotation of observed abnormalities in domains, such as systems microscopy, in which cellular abnormalities are observed and for which no phenotype ontology has been created. Availability and implementation: The CPO and the source code we generated to create the CPO are freely available on http://cell-phenotype.googlecode.com. Contact: rh497@cam.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.
Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics
Hoehndorf, Robert, Dumontier, Michel and Gkoutos, Georgios V. Bioinformatics, vol. 28(16), pp. 2169-2175 (2012)
Applied Ontology, Bioinformatics, Drug mechanisms
Motivation: Many complex diseases are the result of
abnormal pathway functions instead of single
abnormalities. Disease diagnosis and intervention
strategies must target these pathways while
minimizing the interference with normal
physiological processes. Large scale identification
of disease pathways and chemicals that may be used
to perturb them requires the integration of
information about drugs, genes, diseases and
pathways. This information is currently distributed
over several pharmacogenomics databases. An
integrated analysis of the information in these
databases can reveal disease pathways and facilitate
novel biomedical analyses. Results: We demonstrate
how to integrate pharmacogenomics databases through
integration of the biomedical ontologies that are
used as meta-data in these databases. The additional
background knowledge in these ontologies can then be
used to enable novel analyses. We identify disease
pathways using a novel multi-ontology enrichment
analysis over the Human Disease Ontology, and we
identify significant associations between chemicals
and pathways using an enrichment analysis over a
chemical ontology. The drug-pathway and
disease-pathway associations are a valuable resource
for research in disease and drug mechanisms and can
be used to improve computational drug
repurposing. Contact: rh497@cam.ac.uk
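The enrichment analyses mentioned in this abstract can be illustrated with a generic over-representation test. The sketch below is a plain one-sided hypergeometric test and is not taken from the paper; the function name `enrichment_p_value` and the toy numbers are assumptions made for illustration only.

```python
from math import comb

def enrichment_p_value(population, annotated, sample, hits):
    """One-sided hypergeometric p-value (illustrative, not the paper's method).

    population: total number of entities considered (e.g. all genes)
    annotated:  entities in the population that carry the category of interest
    sample:     size of the set being tested (e.g. genes linked to one disease)
    hits:       entities in the sample that carry the category
    Returns the probability of seeing at least `hits` annotated entities
    when `sample` entities are drawn at random without replacement.
    """
    upper = min(annotated, sample)
    numerator = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return numerator / comb(population, sample)

# Hypothetical toy case: 10 genes, 3 of them in a pathway, and all 3 genes
# of a disease fall into that pathway; the p-value is 1/120.
p = enrichment_p_value(population=10, annotated=3, sample=3, hits=3)
print(p)
```

In practice such p-values would be computed for many categories at once and corrected for multiple testing; that step is omitted here.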
Quantitative comparison of mapping methods between Human and Mammalian Phenotype Ontology
Oellrich, Anika, Gkoutos, Georgios V., Hoehndorf, Robert and Rebholz-Schuhmann, Dietrich Journal of Biomedical Semantics, vol. 3(Suppl 2), pp. S1 (2012)
Researchers use animal studies to better understand human diseases. In recent years, large-scale phenotype studies such as Phenoscape and EuroPhenome have been initiated to identify genetic causes of a species' phenome. Species-specific phenotype ontologies are required to capture and report about all findings and to automatically infer results relevant to human diseases. The integration of the different phenotype ontologies into a coherent framework is necessary to achieve interoperability for cross-species research. Here, we investigate the quality and completeness of two different methods to align the Human Phenotype Ontology and the Mammalian Phenotype Ontology. The first method combines lexical matching with inference over the ontologies' taxonomic structures, while the second method uses a mapping algorithm based on the formal definitions of the ontologies. Neither method could map all concepts. Although the formal definitions method provides mappings for more concepts than the lexical matching method, it does not outperform lexical matching in a biological use case. Our results suggest that combining both approaches will yield better mappings in terms of completeness, specificity and application purposes.
Towards improving phenotype representation in OWL
Loebe, Frank, Stumpf, Frank, Hoehndorf, Robert and Herre, Heinrich Journal of Biomedical Semantics, vol. 3(Suppl 2), pp. S5 (2012)
Applied Ontology, Disease genetics
BACKGROUND: Phenotype ontologies are used in species-specific databases for the annotation of mutagenesis experiments and to characterize human diseases. The Entity-Quality (EQ) formalism is a means to describe complex phenotypes based on one or more affected entities and a quality. EQ-based definitions have been developed for many phenotype ontologies, including the Human and Mammalian Phenotype ontologies. METHODS: We analyze formalizations of complex phenotype descriptions in the Web Ontology Language (OWL) that are based on the EQ model, identify several representational challenges and analyze potential solutions to address these challenges. RESULTS: In particular, we suggest a novel, role-based approach to represent relational qualities such as concentration of iron in spleen, discuss its ontological foundation in the General Formal Ontology (GFO) and evaluate its representation in OWL and the benefits it can bring to the representation of phenotype annotations. CONCLUSION: Our analysis of OWL-based representations of phenotypes can contribute to improving consistency and expressiveness of formal phenotype descriptions.
Ontology-based cross-species integration and analysis of Saccharomyces cerevisiae phenotypes
Gkoutos, Georgios V. and Hoehndorf, Robert Journal of Biomedical Semantics, vol. 3(Suppl 2), pp. S6 (2012)
Ontologies are widely used in the biomedical community for annotation and integration of databases. Formal definitions can relate classes from different ontologies and thereby integrate data across different levels of granularity, domains and species. We have applied this methodology to the Ascomycete Phenotype Ontology (APO), enabling the reuse of various orthogonal ontologies and we have converted the phenotype associated data found in the SGD following our proposed patterns. We have integrated the resulting data in the cross-species phenotype network PhenomeNET, and we make both the cross-species integration of yeast phenotypes and a similarity-based comparison of yeast phenotypes across species available in the PhenomeBrowser. Furthermore, we utilize our definitions and the yeast phenotype annotations to suggest novel functional annotations of gene products in yeast.
Evaluation of research in biomedical ontologies
Hoehndorf, Robert, Dumontier, Michel and Gkoutos, Georgios V. Briefings in Bioinformatics (2012)
Applied Ontology
Ontologies are now pervasive in biomedicine, where they serve as a means to standardize terminology, to enable access to domain knowledge, to verify data consistency and to facilitate integrative analyses over heterogeneous biomedical data. For this purpose, research on biomedical ontologies applies theories and methods from diverse disciplines such as information management, knowledge representation, cognitive science, linguistics and philosophy. Depending on the desired applications in which ontologies are being applied, the evaluation of research in biomedical ontologies must follow different strategies. Here, we provide a classification of research problems in which ontologies are being applied, focusing on the use of ontologies in basic and translational research, and we demonstrate how research results in biomedical ontologies can be evaluated. The evaluation strategies depend on the desired application and measure the success of using an ontology for a particular biomedical problem. For many applications, the success can be quantified, thereby facilitating the objective evaluation and comparison of research in biomedical ontology. The objective, quantifiable comparison of research results based on scientific applications opens up the possibility for systematically improving the utility of ontologies in biomedical research.
The Units Ontology: a tool for integrating units of measurement in science
Gkoutos, Georgios V., Schofield, Paul N. and Hoehndorf, Robert Database, vol. 2012 (2012)
Applied Ontology
Units are basic scientific tools that render meaning to numerical data. Their standardization and formalization caters for the report, exchange, process, reproducibility and integration of quantitative measurements. Ontologies are means that facilitate the integration of data and knowledge allowing interoperability and semantic information processing between diverse biomedical resources and domains. Here, we present the Units Ontology (UO), an ontology currently being used in many scientific resources for the standardized description of units of measurements.
Text-mining solutions for biomedical research: enabling integrative biology
Dietrich Rebholz-Schuhmann, Anika Oellrich and Robert Hoehndorf Nature Reviews Genetics, vol. 13(12), pp. 829-839 (2012)
Bioinformatics
In response to the unbridled growth of information in literature and biomedical databases, researchers require efficient means of handling and extracting information. As well as providing background information for research, scientific publications can be processed to transform textual information into database content or complex networks and can be integrated with existing knowledge resources to suggest novel hypotheses. Information extraction and text data analysis can be particularly relevant and helpful in genetics and biomedical research, in which up-to-date information about complex processes involving genes, proteins and phenotypes is crucial. Here we explore the latest advancements in automated literature analysis and its contribution to innovative research approaches.
Chapter Four - The Neurobehavior Ontology: An Ontology for Annotation and Integration of Behavior and Behavioral Phenotypes
Georgios V. Gkoutos, Paul N. Schofield and Robert Hoehndorf Bioinformatics of Behavior: Part 1, vol. 103, pp. 69 - 87, In: Elissa J. Chesler and Melissa A. Haendel (Eds.) (2012)
In recent years, considerable advances have been made toward our understanding of the genetic architecture of behavior and the physical, mental, and environmental influences that underpin behavioral processes. The provision of a method for recording behavior-related phenomena is necessary to enable integrative and comparative analyses of data and knowledge about behavior. The neurobehavior ontology facilitates the systematic representation of behavior and behavioral phenotypes, thereby improving the unification and integration of behavioral data in neuroscience research.
Argumentation to Represent and Reason over Biological Systems
Adam Wyner, Luke Riley, Robert Hoehndorf and Samuel Croset Proceedings of the 3rd International Conference on Information Technology in Bio- and Medical Informatics (ITBAM 2012) (2012)
Applied Ontology
A translational medicine approach to orphan diseases
Robert Hoehndorf and Georgios V. Gkoutos Proceedings of the Virtual Physiological Human Conference 2012 (VPH2012) (2012)
Disease genetics, Semantic interoperability
Integration of knowledge for personalized medicine: a pharmacogenomics case-study
Robert Hoehndorf, Michel Dumontier and Georgios V. Gkoutos Proceedings of the Virtual Physiological Human Conference 2012 (VPH2012) (2012)
Higgs bosons, mars missions, and unicorn delusions: How to deal with terms of dubious reference in scientific ontologies
Stefan Schulz, Mathias Brochhausen and Robert Hoehndorf Proceedings of the Second International Conference on Biomedical Ontology (2011)
Applied Ontology
Realist ontologies claim to represent what exists. Scientific discourse, however, often contains non-referring terms when describing hypotheses, plans, or ideas. We present a framework in which a realist ontology is embedded in a description logics theory, which is indifferent regarding the existence of class members, and which may include representational units for representing various kinds of non-referring terms. Using a taxonomy of terminological units we are able to distinguish between different kinds of classes in the description logics theory and to identify classes as unsatisfiable, which are put as the extensions of non-referring terms. We also demonstrate how discourse using non-referring terms can be represented without departing from the principle of realist ontologies. An example OWL file can be downloaded from: http://purl.org/steschu/misc/ICBO2011.
Investigation of the fundamental strategy for interoperability of description of biological measurements
Hiroshi Masuya, Georgios V. Gkoutos, Nobuhiko Tanaka, Kazunori Waki, Yoshihiro Okuda, Tatsuya Kushida, Norio Kobayashi, Koji Doi, Kouji Kozaki, Robert Hoehndorf, Shigeharu Wakana, Tetsuro Toyoda and Riichiro Mizoguchi Proceedings of the Second International Conference on Biomedical Ontology (2011)
Applied Ontology, Semantic interoperability
Aiming to facilitate the advanced integration of measurement data across various biological experiments, we have investigated a fundamental methodology for expanding the Phenotypic Quality Ontology (PATO), commonly used for descriptions of biological phenotypes, with the framework of Yet Another More Advanced Top-level Ontology (YAMATO). The mapping of PATO terms to YAMATO's framework represents several advanced aspects, such as the introduction of a classification of quality values to represent scales of measurement, the distinction between different contexts for the comparison of ordinal values, and the establishment of interoperability of quality description formalisms between different top-level ontologies. In this study, we propose a logical base to integrate cross-species and cross-experimental annotations of biological measurements.
Exploring Gene Ontology Annotations with OWL
Simon Jupp, Robert Stevens and Robert Hoehndorf Proceedings of the 13th Bio-Ontology Meeting (2011)
Applied Ontology
Ontologies such as the Gene Ontology (GO) and their use in annotations make cross species comparisons of genes possible, along with a wide range of other activities. Tools, such as AmiGO, allow exploration of genes based on their GO annotations. This human driven exploration and querying of GO is obviously useful, but by taking advantage of the ontological representation we can use these annotations to create a rich polyhierarchy of gene products for enhanced querying. This also opens up possibilities for exploring GO annotations (GOA) for redundancies and defects in annotations.
To do this we have created a set of OWL classes for mouse genes and their GOA. Each gene is represented as a class, with appropriate relationships to the GO aspects with which it has been annotated. We then use defined classes to query these gene product classes and to build a complex hierarchy. This standard use of OWL affords a rich interaction with GO annotations to give a fine partitioning of the gene products in the ontology.
Ontology-based cross-species integration and analysis of Saccharomyces cerevisiae phenotypes
Georgios V. Gkoutos and Robert Hoehndorf Proceedings of the 3rd Workshop for Ontologies in Biomedicine and Life sciences (OBML) (2011)
Ontologies are widely used in the biomedical community for
annotation and integration of databases. Formal definitions can relate
classes from different ontologies and thereby integrate data across
different levels of granularity, domains and species. We have applied
this methodology to the Ascomycete Phenotype Ontology (APO),
enabling the reuse of various orthogonal ontologies and we have
converted the phenotype associated data found in the SGD following
our proposed patterns. We have integrated the resulting data into
a cross-species phenotype network termed PhenomeNET and we
make both the cross-species integration of yeast phenotypes and
a similarity-based comparison of yeast phenotypes across species
available in the PhenomeBrowser.
Quantitative comparison of mapping methods between Human and Mammalian Phenotype Ontology
Anika Oellrich, Robert Hoehndorf, Georgios V. Gkoutos and Dietrich Rebholz-Schuhmann Proceedings of the 3rd Workshop for Ontologies in Biomedicine and Life sciences (OBML) (2011)
Researchers use animal studies to better understand human diseases.
In recent years, large-scale phenotype studies such as Phenoscape and
EuroPhenome have been initiated to identify genetic causes of a species'
phenome. Species-specific phenotype ontologies are required to capture
and report about all findings and to automatically infer results relevant to
human diseases. The integration of the different phenotype ontologies into a
coherent framework is necessary to achieve interoperability for cross-species
research.
Here, we investigate the quality and completeness of two different methods
to align the Human Phenotype Ontology and the Mammalian Phenotype
Ontology. The first method combines lexical matching with inference over the
ontologies' taxonomic structures, while the second method uses a mapping
algorithm based on the formal definitions from the ontologies. Neither method
could map all concepts. Although the formal definitions method provides
mappings for more concepts than the lexical matching method, it does
not outperform lexical matching in a biological use case. Our results
suggest that combining both approaches will yield better mappings in terms
of completeness, specificity and application purposes.
Towards Improving Phenotype Representation in OWL
Frank Loebe, Frank Stumpf, Robert Hoehndorf and Heinrich Herre Proceedings of the 3rd Workshop for Ontologies in Biomedicine and Life sciences (OBML) (2011)
Applied Ontology, Disease genetics
Phenotype ontologies are used in species-specific databases for
the annotation of mutagenesis experiments and to characterize human
diseases. The Entity-Quality (EQ) formalism is a means to
describe complex phenotypes based on one or more affected entities
and a quality. EQ-based definitions have been developed for
many phenotype ontologies, including the Human and Mammalian
Phenotype ontologies. We analyze the OWL-based formalizations of
complex phenotype descriptions based on the EQ model, identify
several representational challenges and analyze potential solutions
to address these challenges. In particular, we suggest a novel,
role-based approach to represent relational qualities such as
Concentration of calcium in blood, discuss its ontological foundation in
the General Formal Ontology (GFO) and evaluate its representation in
OWL and the benefits it can bring to the representation of phenotype
annotations. Our analysis of OWL-based representation of phenotypes
can contribute to improving consistency and expressiveness of
formal phenotype descriptions.
Applying ontology design patterns to the implementation of relations in GENIA
Hoehndorf, Robert, Ngonga Ngomo, Axel-Cyrille, Pyysalo, Sampo, Ohta, Tomoko, Oellrich, Anika and Rebholz-Schuhmann, Dietrich Proceedings of the Fourth Symposium on Semantic Mining in Biomedicine (SMBM 2010) (2010)
Applied Ontology, Ontology engineering
Motivation: Annotated reference corpora such as the GENIA corpus play an important role in biomedical information extraction. A semantic annotation of the natural language texts in these reference corpora using formal ontologies and logic is challenging due to the ambiguous use of natural language and natural language semantics. Providing formal definitions and axioms for these relations would offer the means for developing consistent and verifiable annotation guidelines and allow for the automatic verification of annotations as well as enabling the discovery of new information through deductive inferences. Results: We developed a formal ontology of relations based on the relations used in the recent GENIA corpus annotations. For this purpose, we selected existing axiom systems based on the desired properties of the relations within the domain and provided new axioms for several relations. To apply this ontology of relations to the semantic annotation of natural language texts, we developed and implemented two ontology design patterns. We provide an implementation of the ontology of relations in the Web Ontology Language (OWL). By combining the implementation of the design patterns and that of the relation ontology, we also provide a software application to convert annotated GENIA abstracts into OWL ontologies. In this way, we make these ontologies amenable for automated verification, deductive inferences and other knowledge-based applications. Availability: Documentation, implementation and examples are available from http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/. Contact: rh497@cam.ac.uk
The Ontology of Primary Immunodeficiency Diseases (PIDs): Using PIDs to Rethink the Ontology of Phenotypes
Adams, Nico, Hennig, Christian, Hoehndorf, Robert, Oellrich, Anika, Rebholz-Schuhmann, Dietrich and Hansen, Gesine Proceedings of the 2nd Workshop for Ontologies in Biomedicine and Life sciences (OBML) (2010)
Applied Ontology, Disease genetics
Primary immunodeficiency diseases (PIDs) are the consequence of
genetic disorders and usually manifest themselves in very young patients.
Because of their rarity, they are notoriously difficult to diagnose both for
general practitioners and clinicians. In this paper, we present the foundations
of an ontology of PIDs, which will be at the heart of an expert system
designed to assist the clinician in the diagnosis of these diseases. To achieve
this, the PIDOntology characterises Primary Immunodeficiencies in terms
of Phenotypes. While there are a number of different ontologies already
available that allow the description of phenotypes and phenotypic qualities,
these have a number of associated ontological problems, which we will also
address as part of this paper. We use the subtype of Hyper-IgE Syndrome
caused by STAT3 defects as an example of a primary immunodeficiency
and show how the clinical phenotype of the disease can be modeled in terms
of other phenotypes by introducing the notion of the "phene". Furthermore, we
develop patterns for different types of phenes and show that these patterns
can be mapped onto more traditional entity-quality statements, which are the
current state of the art in phenotypic modeling.
Relational patterns in OWL and their application to OBO
Hoehndorf, Robert, Oellrich, Anika, Dumontier, Michel, Kelso, Janet, Herre, Heinrich and Rebholz-Schuhmann, Dietrich OWL: Experiences and Directions (OWLED) (2010)
Applied Ontology
Directed acyclic graphs are commonly used to represent ontologies in the biomedical domain. They provide an intuitive means to formalize relations that hold between ontological categories. However, their semantics is usually not explicit. We provide a semantics for a part of the OBO Flatfile Format by extending OWL with a method to express relational patterns. These patterns are OWL axioms with variables for classes. The variables can only be filled with named classes. Additionally, we provide a semantics for open patterns in OWL. Our method is applicable to the OBO Flatfile Format, and provides a means to design OWL ontologies using complex ontology design patterns. Therefore, it leads not only to an integration of the OBO Flatfile Format and OWL, but extends OWL with an intuitive interface for designing ontologies using complex definition patterns. A prototypic implementation and test results are available at http://bioonto.de/obo2owl.
OWLDEF: Integrating OBO and OWL
Hoehndorf, Robert, Oellrich, Anika, Dumontier, Michel, Kelso, Janet, Herre, Heinrich and Rebholz-Schuhmann, Dietrich Proceedings of the 13th Annual Bio-Ontologies Meeting (2010)
Applied Ontology
An integration of the OBO Flatfile Format and the Web Ontology Language
(OWL) would enable automated reasoning, inferences and consistency
checking of biomedical ontologies and support the development and
maintenance of ontologies developed in the OBO Flatfile Format. So far, the
translation of relations in the OBO language to OWL is performed according
to a single rigid pattern and in violation of the relation definitions of the
OBO Relationship Ontology. We extend both the OBO Flatfile Format and
the Manchester OWL Syntax to accommodate relation definitions. Based on
these extensions, we implemented and evaluated two software applications.
The first converts the OBO Flatfile Format to an OWL representation. The
second uses automated inferences to convert OWL ontologies back to a
representation in the OBO Flatfile Format. The OWLDEF method is generally
applicable whenever ontologies are developed primarily using patterns and
not a detailed knowledge representation language. The tools and libraries
we developed for the OWLDEF method are available from
http://bioonto.de/obo2owl.
Realism for scientific ontologies
Dumontier, Michel and Hoehndorf, Robert Formal Ontology in Information Systems, Proceedings of the Sixth International Conference, FOIS 2010, vol. 209, pp. 387-399, In: Antony Galton and Riichiro Mizoguchi (Eds.) (2010)
Applied Ontology
Science aims to develop an accurate understanding of reality through a variety of rigorously empirical and formal methods. Ontologies are used to formalize the meaning of terms within a domain of discourse. The Basic Formal Ontology (BFO) is an ontology of particular importance in the biomedical domains, where it provides the top-level for numerous ontologies, including those admitted as part of the OBO Foundry collection. The BFO requires that all classes in an ontology are actually instantiated in reality. Despite the fact that it is hard to show whether entities of some kind exist or do not exist in reality (especially for unobservable entities like elementary particles), this criterion fails to satisfy the need of scientists to communicate their findings and theories unambiguously. We discuss the problems that arise due to the BFO's realism criterion and suggest viable alternatives.
Ontologies in Biology
Kelso, Janet, Hoehndorf, Robert and Prüfer, Kay Theory and Applications of Ontology: Computer Applications, pp. 347-371, In: Poli, Roberto, Healy, Michael and Kameas, Achilles (Eds.) (2010)
Applied Ontology
In recent years ontologies have come to play an increasingly important role in the biomedical domain. Primary applications have been the formalisation of community knowledge in molecular biology, and the provision of a shared vocabulary for the annotation of the growing amount of biological data being generated. Ontologies now play a key role in the analysis and reporting of biological data and act as the basis for new biological services being hosted by various GRID projects. More formal methods from ontology theory are gradually being adopted, and have made the existing ontologies more robust. These approaches will continue to extend the number of potential applications for ontologies in the biomedical domain.
The application of an ontology design pattern for functional abnormalities to phenotype ontologies and the extraction of an ontology of anatomical functions
Hoehndorf, Robert, Ngonga Ngomo, Axel-Cyrille and Kelso, Janet Proceedings of the 3rd International Symposium on Languages in Biology and Medicine (2009)
Functions play an important role throughout biology. Although molecular functions are covered in the Gene Ontology, there is currently no publicly available ontology of anatomical functions. Ontological considerations on the nature of functional abnormalities and their representation in current phenotype ontologies show that we can automatically extract a skeleton for such an ontology of anatomical functions by using a combination of process, phenotype and anatomy ontologies. We provide an ontological analysis of the nature of functions and functional abnormalities. From this analysis, we derive an approach to the automatic extraction of anatomical functions from existing ontologies using a combination of natural language processing, graph-based analysis of the ontologies and formal inferences. Alternatively, we introduce a new relation to relate material objects to processes that realize the function of the object to avoid a needless duplication of processes already present in the Gene Ontology in a new ontology of anatomical functions. We discuss several limitations of the current ontologies that still need to be addressed to ensure a consistent and complete representation of anatomical functions and functional abnormalities.
Developing Consistent and Modular Software Models with Ontologies
Hoehndorf, Robert, Ngonga Ngomo, Axel-Cyrille and Herre, Heinrich SoMeT, pp. 399-412 (2009)
Applied Ontology, Ontology engineering
The development and verification of software models that are applicable
across multiple domains remains a difficult problem. We propose a novel approach
to model-driven software development based on ontologies and Semantic
Web technology. Our approach uses three ontologies to define software models: a
task ontology, a domain ontology and a top-level ontology. The task ontology serves
as the conceptual model for the software, the domain ontology provides domain-specific
knowledge and the top-level ontology integrates the task and domain ontologies.
Our method allows the verification of these models both for consistency
and ontological adequacy. This verification can be performed both at development
and runtime. Domain ontologies are replaceable modules, which enables the comparison
and application of the models built using our method across multiple domains.
We demonstrate the viability of our approach through the design and implementation
of a semantic wiki and a social tagging system, and compare it with
model-driven software development to illustrate its benefits.
Contributions to the formal ontology of functions and dispositions: an application of non-monotonic reasoning
Hoehndorf, Robert, Kelso, Janet and Herre, Heinrich Proceedings of the 12th Annual Bio-Ontologies Meeting (2009)
Applied Ontology
We introduce a basic ontology of functions and dispositions. The
theory we suggest is compatible both with major philosophical
theories of biological functions and with most top-level
ontologies. The particular focus of the suggested formalism is on
the inference of causal relationships from functionality and the
explicit formalization of the normative character of functions using
non-monotonic forms of knowledge representation.
A Formal Ontology of Sequences
Hoehndorf, Robert, Kelso, Janet and Herre, Heinrich Nature Precedings, no. 713 (2009)
Applied Ontology
The Sequence Ontology is an OBO Foundry ontology that provides categories of sequences and sequence features that are applied to the annotation of genomes. To facilitate interoperability with other domain ontologies and to provide a foundation for automated inference, we provide here an axiom system for the Sequence and Junction categories in first- and second-order predicate logics.
Towards Ontological Interpretations for Improved Text Mining
Hoehndorf, Robert, Ngonga Ngomo, Axel-Cyrille and Dannemann, Michael Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008), Turku, Finland, pp. 165-166, In: Salakoski, Tapio, Rebholz-Schuhmann, Dietrich and Pyysalo, Sampo (Eds.) (2008)
Applied Ontology
From Terms to Categories: Testing the Significance of Co-occurrences between Ontological Categories
Hoehndorf, Robert, Ngonga Ngomo, Axel-Cyrille, Dannemann, Michael and Kelso, Janet Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008), Turku, Finland, pp. 53-60, In: Salakoski, Tapio, Rebholz-Schuhmann, Dietrich and Pyysalo, Sampo (Eds.) (2008)
Applied Ontology
The co-occurrence of terms in a text corpus
may indicate the presence of a relation
between the referents of these terms. We
expect co-occurrence-based methods to
identify association relations that cannot be
found using static patterns. We developed
a new method to identify associations
between ontological categories in text
using the co-occurrence of terms that
designate these categories. We use the
taxonomic structure of the ontologies to
cumulate the number of co-occurrences
of terms designating categories. Based
on these cumulated values, we designed
a novel family of statistical tests to
identify associated categories.
These tests take both co-occurrence specificity
and relevance into consideration. We
applied our method to a 2.2 GB text
corpus containing fulltext articles and
used Gene Ontology's biological process
ontology and the Celltype Ontology. The
software and results can be found at
http://bioonto.de/pmwiki.php/Main/ExtractingBiologicalRelations.
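The cumulation step described in this abstract, in which co-occurrences of terms are counted for the categories they designate and for all of their ancestors in the taxonomy, can be sketched as follows. This is a minimal illustration and not the authors' software: the toy taxonomies, the counts and the function names are hypothetical, and the statistical tests themselves are not shown.

```python
from collections import Counter
from itertools import product

# Hypothetical toy taxonomies: category -> list of parent categories (is-a).
PROCESS_PARENTS = {
    "apoptosis": ["cell death"],
    "necrosis": ["cell death"],
    "cell death": [],
}
CELLTYPE_PARENTS = {
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "lymphocyte": [],
}

def ancestors_or_self(category, parents):
    """Return the category together with all of its ancestors."""
    seen = set()
    stack = [category]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def cumulate(direct_counts, parents_a, parents_b):
    """Add every direct co-occurrence count to all pairs of ancestor categories."""
    cumulated = Counter()
    for (a, b), count in direct_counts.items():
        pairs = product(ancestors_or_self(a, parents_a),
                        ancestors_or_self(b, parents_b))
        for pair in pairs:
            cumulated[pair] += count
    return cumulated

# Hypothetical direct co-occurrence counts observed in a corpus.
direct = {("apoptosis", "T cell"): 3, ("necrosis", "B cell"): 2}
totals = cumulate(direct, PROCESS_PARENTS, CELLTYPE_PARENTS)
print(totals[("cell death", "lymphocyte")])  # 3 + 2 = 5
```

A significance test would then compare each cumulated count with what is expected by chance, which is where the specificity and relevance considerations mentioned in the abstract come in.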
BOWiki: An ontology-based wiki for annotation of data and integration of knowledge in biology
Hoehndorf, Robert, Bacher, Joshua, Backhaus, Michael, Gregorio, Sergio E., Loebe, Frank, Prüfer, Kay, Uciteli, Alexandr, Visagie, Johann, Herre, Heinrich and Kelso, Janet Proceedings of the 11th Annual Bio-Ontologies Meeting, In: Lord, Phillip, Shah, Nigam, Sansone, Susanna-Assunta and Cockerill, Matthew (Eds.) (2008)
Ontology development and the annotation of biological data using
ontologies are time-consuming exercises that currently require input from
expert curators. Open, collaborative platforms for biological data annotation
enable the wider scientific community to become involved in developing
and maintaining such resources. However, this openness raises concerns
regarding the quality and correctness of the information added to these
knowledge bases. The combination of a collaborative web-based platform
with logic-based approaches and Semantic Web technology can be used to
address some of these challenges and concerns.
We have developed the BOWiki, a web-based system that includes a
biological core ontology. The core ontology provides background knowledge
about biological types and relations. Against this background, an automated
reasoner assesses the consistency of new information added to the
knowledge base. The system provides a platform for research communities
such as wikis for the description, discussion and annotation of the
functions of genes and gene products [Wang, 2006, Hoehndorf et al.,
2006, Giles, 2007].
However, an open approach like wikis frequently raises concerns
regarding the quality of the information captured. The information
represented in the wiki should adhere to particular quality
criteria such as internal consistency (the wiki content does not
contain contradictory information) and consistency with biological
background knowledge (the wiki content should be semantically
correct). To address some of these concerns, logic-based tools can
be employed.
We have developed the BOWiki, a wiki system that uses a
core ontology together with an automated reasoner to maintain a
consistent knowledge base. It is specifically targeted at small- to
medium-sized communities
to collaboratively integrate information and annotate data.
The BOWiki and supplementary material is available at
http://www.bowiki.net/. The source code is available under the GNU GPL from
http://onto.eva.mpg.de/trac/BoWiki.
Contact: bowiki-users@lists.informatik.uni-leipzig.de
BOWiki - a collaborative annotation and ontology curation framework
Backhaus, Michael, Kelso, Janet, Bacher, Joshua, Herre, Heinrich, Hoehndorf, Robert, Loebe, Frank and Visagie, Johann Proceedings of Workshop on Social and Collaborative Construction of Structured Knowledge (2007)
Applied Ontology, Ontology engineering
As the amount of data being generated in biology has
increased, a major challenge has been how to store
and represent this data in a way that makes it
easily accessible to researchers from diverse
domains. Understanding the relationship between
genotype and phenotype is a major focus of
biological research. Various approaches to providing
the link between genes and their functions have been
undertaken - most require significant and dedicated
manual curation. Advances in web technologies make
possible an alternative route for the construction
of such knowledge bases - large-scale community
collaboration. We describe here a system, the
BOWiki, for the collaborative annotation of gene
information. We argue that a semantic wiki
provides the functionality required for this
project since this can capitalize on the existing
representations in biological ontologies. We
describe our implementation and show how formal
ontologies could be used to increase the usability
of the software through type-checking and automatic
reasoning.
The design of a wiki-based curation system for the Ontology of Functions
Hoehndorf, Robert, Prüfer, Kay, Backhaus, Michael, Visagie, Johann and Kelso, Janet Proceedings of the Joint BioLINK and 9th Bio-Ontologies Meeting (2006)
Applied Ontology, Ontology engineering
Recently, studies argued that statistical and linguistic methods can be applied to extract information from biomedical ontologies, and represent the identified relations in the top-level ontology Ontology of Functions (OF). However, human intervention is required in order to clear noise from the generated data. A simple platform for collaborative work is needed. We propose here the use of a semantic wiki to represent relations between terms. We provide a relationship model for this semantic wiki, and add a core ontology as top-level type system to this wiki. We then discuss a design for the implementation of a platform for the curation of the OF, thereby enabling the community to curate the results of automatic extraction methods, and to add and correct ontology and annotation information. The aim of this collaborative effort is to provide a means to extend and correct the numerous ontologies which are used to describe biological functions in the OF.
A proposal for a gene functions wiki
Hoehndorf, Robert, Prüfer, Kay, Backhaus, Michael, Herre, Heinrich, Kelso, Janet, Loebe, Frank and Visagie, Johann OTM Workshops 2006, no. 4277, pp. 669-678, In: Meersman, R., Tari, Z. and Herrero, P. (Eds.) (2006)
Ontology engineering
Large knowledge bases integrating different domains can provide a foundation for new applications in biology such as data mining or automated reasoning. The traditional approach to the construction of such knowledge bases is manual and therefore extremely time consuming. The ubiquity of the internet now makes large-scale community collaboration for the construction of knowledge bases, such as the successful online encyclopedia "Wikipedia", possible. We propose an extension of this model to the collaborative annotation of molecular data. We argue that a semantic wiki provides the functionality required for this project since this can capitalize on the existing representations in biological ontologies. We discuss the use of a different relationship model than the one provided by RDF and OWL to represent the semantic data. We argue that this leads to a more intuitive and correct way to enter semantic content in the wiki. Furthermore, we show how formal ontologies could be used to increase the usability of the software through type-checking and automatic reasoning.
Statistical tests for associations between two directed acyclic graphs.
Hoehndorf, Robert, Ngonga Ngomo, Axel-Cyrille, Dannemann, Michael and Kelso, Janet PloS ONE, vol. 5(6), pp. e10996+ (Unknown)
Other
Biological data, and particularly annotation data, are increasingly being represented in directed acyclic graphs (DAGs). However, while relevant biological information is implicit in the links between multiple domains, annotations from these different domains are usually represented in distinct, unconnected DAGs, making links between the domains represented difficult to determine. We develop a novel family of general statistical tests for the discovery of strong associations between two directed acyclic graphs. Our method takes the topology of the input graphs and the specificity and relevance of associations between nodes into consideration. We apply our method to the extraction of associations between biomedical ontologies in an extensive use-case. Through a manual and an automatic evaluation, we show that our tests discover biologically relevant relations. The suite of statistical tests we develop for this purpose is implemented and freely available for download.
Applying the functional abnormality ontology pattern to anatomical functions.
Hoehndorf, Robert, Ngonga Ngomo, Axel-Cyrille and Kelso, Janet Journal of biomedical semantics, vol. 1(1), pp. 4+ (Unknown)
Applied Ontology
BACKGROUND: Several biomedical ontologies cover the domain of biological functions, including molecular and cellular functions. However, there is currently no publicly available ontology of anatomical functions. Consequently, no explicit relation between anatomical structures and their functions is expressed in the anatomy ontologies that are available for various species. Such an explicit relation between anatomical structures and their functions would be useful for accurately defining the classes of both the anatomy and the phenotype ontologies. RESULTS: We provide an ontological analysis of functions and functional abnormalities. From this analysis, we derive an approach to the automatic extraction of anatomical functions from existing ontologies which uses a combination of natural language processing, graph-based analysis of the ontologies and formal inferences. Additionally, we introduce a new relation to link material objects to processes that realize the function of these objects. This relation is introduced to avoid a needless duplication of processes already covered by the Gene Ontology in a new ontology of anatomical functions. CONCLUSIONS: Ontological considerations on the nature of functional abnormalities and their representation in current phenotype ontologies show that we can automatically extract a skeleton for an ontology of anatomical functions by using a combination of process, phenotype and anatomy ontologies. We identify several limitations of the current ontologies that still need to be addressed to ensure a consistent and complete representation of anatomical functions and their abnormalities. AVAILABILITY: The source code and results of our analysis are available at http://bioonto.de.
The ontology of biological sequences.
Hoehndorf, Robert, Kelso, Janet and Herre, Heinrich BMC Bioinformatics, vol. 10(1), pp. 377+ (Unknown)
Applied Ontology, Bioinformatics
BACKGROUND: Biological sequences play a major role in molecular and computational biology. They are studied as information-bearing entities that make up DNA, RNA or proteins. The Sequence Ontology, which is part of the OBO Foundry, contains descriptions and definitions of sequences and their properties. Yet the most basic question about sequences remains unanswered: what kind of entity is a biological sequence? An answer to this question benefits formal ontologies that use the notion of biological sequences and analyses in computational biology alike. RESULTS: We provide both an ontological analysis of biological sequences and a formal representation that can be used in knowledge-based applications and other ontologies. We distinguish three distinct kinds of entities that can be referred to as "biological sequence": chains of molecules, syntactic representations such as those in biological databases, and the abstract information-bearing entities. For use in knowledge-based applications and inclusion in biomedical ontologies, we implemented the developed axiom system for use in automated theorem proving. CONCLUSION: Axioms are necessary to achieve the main goal of ontologies: to formally specify the meaning of terms used within a domain. The axiom system for the ontology of biological sequences is the first elaborate axiom system for an OBO Foundry ontology and can serve as starting point for the development of more formal ontologies and ultimately of knowledge-based applications.
BOWiki: an ontology-based wiki for annotation of data and integration of knowledge in biology.
Hoehndorf, Robert, Bacher, Joshua, Backhaus, Michael, Gregorio, Sergio E., Loebe, Frank, Prüfer, Kay, Uciteli, Alexandr, Visagie, Johann, Herre, Heinrich and Kelso, Janet BMC Bioinformatics, vol. 10 Suppl 5(Suppl 5), pp. S5+ (Unknown)
MOTIVATION: Ontology development and the annotation of biological data using ontologies are time-consuming exercises that currently require input from expert curators. Open, collaborative platforms for biological data annotation enable the wider scientific community to become involved in developing and maintaining such resources. However, this openness raises concerns regarding the quality and correctness of the information added to these knowledge bases. The combination of a collaborative web-based platform with logic-based approaches and Semantic Web technology can be used to address some of these challenges and concerns. RESULTS: We have developed the BOWiki, a web-based system that includes a biological core ontology. The core ontology provides background knowledge about biological types and relations. Against this background, an automated reasoner assesses the consistency of new information added to the knowledge base. The system provides a platform for research communities to integrate information and annotate data collaboratively. AVAILABILITY: The BOWiki and supplementary material is available at http://www.bowiki.net/. The source code is available under the GNU GPL from http://onto.eva.mpg.de/trac/BoWiki.
Relations as patterns: bridging the gap between OBO and OWL.
Hoehndorf, Robert, Oellrich, Anika, Dumontier, Michel, Kelso, Janet, Rebholz-Schuhmann, Dietrich and Herre, Heinrich BMC Bioinformatics, vol. 11(1), pp. 441+ (Unknown)
Applied Ontology
BACKGROUND: Most biomedical ontologies are represented in the OBO Flatfile Format, which is an easy-to-use graph-based ontology language. The semantics of the OBO Flatfile Format 1.2 enforces a strict predetermined interpretation of relationship statements between classes. It does not allow flexible specifications that provide better approximations of the intuitive understanding of the considered relations. If relations cannot be accurately expressed then ontologies built upon them may contain false assertions and hence lead to false inferences. Ontologies in the OBO Foundry must formalize the semantics of relations according to the OBO Relationship Ontology (RO). Therefore, being able to accurately express the intended meaning of relations is of crucial importance. Since the Web Ontology Language (OWL) is an expressive language with a formal semantics, it is suitable to define the meaning of relations accurately. RESULTS: We developed a method to provide definition patterns for relations between classes using OWL and describe a novel implementation of the RO based on this method. We implemented our extension in software that converts ontologies in the OBO Flatfile Format to OWL, and also provide a prototype to extract relational patterns from OWL ontologies using automated reasoning. The conversion software is freely available at http://bioonto.de/obo2owl, and can be accessed via a web interface. CONCLUSIONS: Explicitly defining relations permits their use in reasoning software and leads to a more flexible and powerful way of representing biomedical ontologies. Using the extended language and semantics avoids several mistakes commonly made in formalizing biomedical ontologies, and can be used to automatically detect inconsistencies. The use of our method enables the use of graph-based ontologies in OWL, and makes complex OWL ontologies accessible in a graph-based form. Thereby, our method provides the means to gradually move the representation of biomedical ontologies into formal knowledge representation languages that incorporate an explicit semantics. Our method facilitates the use of OWL-based software in the back-end while ontology curators may continue to develop ontologies with an OBO-style front-end.
GFO-Bio: A biomedical core ontology
Hoehndorf, Robert, Loebe, Frank, Poli, Roberto, Kelso, Janet and Herre, Heinrich Applied Ontology, vol. 3(4), pp. 219-227 (Unknown)
Applied Ontology
The rapid increase in the number and use of biological ontologies necessitates developing systems for their integration. In this paper we present a core ontology for biology, and outline its application for integrating biological domain ontologies. Our ontology rests on a foundational ontology, which offers higher-order categories and a theory of levels of reality. The core ontology is implemented in two separate components, each of which adheres to OWL-DL. These can be used independently with efficient DL reasoners, but they will be most effective when used together, which necessitates working with an OWL-Full ontology. The ontology is freely available from our website at: http://bioonto.de/pmwiki.php/Main/GFO-Bio.
Representing default knowledge in biomedical ontologies: Application to the integration of anatomy and phenotype ontologies
Hoehndorf, Robert, Loebe, Frank, Kelso, Janet and Herre, Heinrich BMC Bioinformatics, vol. 8(1) (Unknown)
BACKGROUND: Current efforts within the biomedical ontology community focus on achieving interoperability between various biomedical ontologies that cover a range of diverse domains. Achieving this interoperability will contribute to the creation of a rich knowledge base that can be used for querying, as well as generating and testing novel hypotheses. The OBO Foundry principles, as applied to a number of biomedical ontologies, are designed to facilitate this interoperability. However, semantic extensions are required to meet the OBO Foundry interoperability goals. Inconsistencies may arise when ontologies of properties - mostly phenotype ontologies - are combined with ontologies taking a canonical view of a domain - such as many anatomical ontologies. Currently, there is no support for a correct and consistent integration of such ontologies. RESULTS: We have developed a methodology for accurately representing canonical domain ontologies within the OBO Foundry. This is achieved by adding an extension to the semantics for relationships in the biomedical ontologies that allows for treating canonical information as default. Conclusions drawn from default knowledge may be revoked when additional information becomes available. We show how this extension can be used to achieve interoperability between ontologies, and further allows for the inclusion of more knowledge within them. We apply the formalism to ontologies of mouse anatomy and mammalian phenotypes in order to demonstrate the approach. CONCLUSION: Biomedical ontologies require a new class of relations that can be used in conjunction with default knowledge, thereby extending those currently in use. The inclusion of default knowledge is necessary in order to ensure interoperability between ontologies.
A top-level ontology of functions and its application in the Open Biomedical Ontologies.
Burek, Patryk, Hoehndorf, Robert, Loebe, Frank, Visagie, Johann, Herre, Heinrich and Kelso, Janet Bioinformatics, vol. 22(14), pp. e66-e73 (Unknown)
Applied Ontology
MOTIVATION: A clear understanding of functions in biology is a key component in accurate modelling of molecular, cellular and organismal biology. Using the existing biomedical ontologies it has been impossible to capture the complexity of the community's knowledge about biological functions. RESULTS: We present here a top-level ontological framework for representing knowledge about biological functions. This framework lends greater accuracy, power and expressiveness to biomedical ontologies by providing a means to capture existing functional knowledge in a more formal manner. An initial major application of the ontology of functions is the provision of a principled way in which to curate functional knowledge and annotations in biomedical ontologies. Further potential applications include the facilitation of ontology interoperability and automated reasoning. A major advantage of the proposed implementation is that it is an extension to existing biomedical ontologies, and can be applied without substantial changes to these domain ontologies. AVAILABILITY: The Ontology of Functions (OF) can be downloaded in OWL format from http://onto.eva.mpg.de/. Additionally, a UML profile and supplementary information and guides for using the OF can be accessed from the same website. CONTACT: bioonto@lists.informatik.uni-leipzig.de.
Interoperability between phenotype and anatomy ontologies
Hoehndorf, Robert, Oellrich, Anika and Rebholz-Schuhmann, Dietrich Bioinformatics, vol. 26(24), pp. 3112-3118 (Unknown)
Motivation: Phenotypic information is important for the analysis of the molecular mechanisms underlying disease. A formal ontological representation of phenotypic information can help to identify, interpret and infer phenotypic traits based on experimental findings. The methods that are currently used to represent data and information about phenotypes fail to make the semantics of the phenotypic trait explicit and do not interoperate with ontologies of anatomy and other domains. Therefore, valuable resources for the analysis of phenotype studies remain unconnected and inaccessible to automated analysis and reasoning. Results: We provide a framework to formalize phenotypic descriptions and make their semantics explicit. Based on this formalization, we provide the means to integrate phenotypic descriptions with ontologies of other domains, in particular anatomy and physiology. We demonstrate how our framework leads to the capability to represent disease phenotypes, perform powerful queries that were not possible before and infer additional knowledge. Availability: http://bioonto.de/pmwiki.php/Main/PheneOntology Contact: rh497@cam.ac.uk
A common layer of interoperability for biomedical ontologies based on OWL EL
Robert Hoehndorf, Michel Dumontier, Anika Oellrich, Sarala Wimalaratne, Dietrich Rebholz-Schuhmann, Paul N. Schofield and Georgios V. Gkoutos Bioinformatics, vol. 27(7), pp. 1001-1008 (Unknown)
Applied Ontology, Semantic interoperability
Motivation: Ontologies are essential in biomedical research due to their ability to semantically integrate content from different scientific databases and resources. Their application improves capabilities for querying and mining biological knowledge. An increasing number of ontologies is being developed for this purpose, and considerable effort is invested into formally defining them in order to represent their semantics explicitly. However, current biomedical ontologies do not facilitate data integration and interoperability yet, since reasoning over these ontologies is very complex and cannot be performed efficiently or is even impossible. We propose the use of less expressive subsets of ontology representation languages to enable efficient reasoning and achieve the goal of genuine interoperability between ontologies. Results: We present and evaluate EL Vira, a framework that transforms OWL ontologies into the OWL EL subset, thereby enabling the use of tractable reasoning. We illustrate which OWL constructs and inferences are kept and lost following the conversion and demonstrate the performance gain of reasoning indicated by the significant reduction of processing time. We applied EL Vira to the open biomedical ontologies and provide a repository of ontologies resulting from this conversion. EL Vira creates a common layer of ontological interoperability that, for the first time, enables the creation of software solutions that can employ biomedical ontologies to perform inferences and answer complex queries to support scientific analyses. Availability and implementation: The EL Vira software is available from http://el-vira.googlecode.com and converted OBO ontologies and their mappings are available from http://bioonto.gen.cam.ac.uk/el-ont. Contact: rh497@cam.ac.uk
The RNA Ontology (RNAO): An Ontology for Integrating RNA Sequence and Structure Data
Robert Hoehndorf, Colin Batchelor, Thomas Bittner, Michel Dumontier, Karen Eilbeck, Rob Knight, Chris J. Mungall, Jane S. Richardson, Jesse Stombaugh, Eric Westhof, Craig L. Zirbel and Neocles B. Leontis Applied Ontology, vol. 6(1), pp. 53-89 (Unknown)
Applied Ontology
Biomedical ontologies integrate diverse biomedical data, enable intelligent data-mining and help translate basic research into useful clinical knowledge. We present the RNA Ontology (RNAO), an ontology for integrating diverse RNA data, including RNA sequences and sequence alignments, three-dimensional structures, and biochemical and functional data. For example, individual atomic resolution RNA structures have broader significance as representatives of classes of homologous molecules, which can differ significantly in sequence while sharing core structural features and common roles or functions. Thus, structural data gain value by being linked to homologous sequences in genomic data and databases of sequence alignments. Likewise, the value of genomic data is enhanced by annotation of shared structural features, especially when these can be linked to specific functions. Moreover, the significance of biochemical, functional and mutational analyses of RNA molecules is most fully understood when linked to molecular structures and phylogenies. To achieve these goals, RNAO provides logically rigorous definitions of the components of RNA primary, secondary and tertiary structure and the relations between these entities. RNAO is being developed to comply with the developing standards of the Open Biomedical Ontologies (OBO) Consortium. The RNAO can be accessed at http://code.google.com/p/rnao/.
PhenomeNET: a whole-phenome approach to disease gene discovery
Hoehndorf, Robert, Schofield, Paul N. and Gkoutos, Georgios V. Nucleic Acids Research, vol. 39(18), pp. e119 (Unknown)
Disease genetics
Phenotypes are investigated in model organisms to understand and reveal
the molecular mechanisms underlying disease. Phenotype ontologies
were developed to capture and compare phenotypes within the context
of a single species. Recently, these ontologies were augmented with
formal class definitions that may be utilized to integrate phenotypic
data and enable the direct comparison of phenotypes between different
species. We have developed a method to transform phenotype ontologies
into a formal representation, combine phenotype ontologies with anatomy
ontologies, and apply a measure of semantic similarity to construct
the PhenomeNET cross-species phenotype network. We demonstrate that
PhenomeNET can identify orthologous genes, genes involved in the
same pathway and gene–disease associations through the comparison
of mutant phenotypes. We provide evidence that the Adam19 and Fgf15
genes in mice are involved in the tetralogy of Fallot, and, using
zebrafish phenotypes, propose the hypothesis that the mammalian homologs
of Cx36.7 and Nkx2.5 lie in a pathway controlling cardiac morphogenesis
and electrical conductivity which, when defective, causes the tetralogy
of Fallot phenotype. Our method implements a whole-phenome approach
toward disease gene discovery and can be applied to prioritize genes
for rare and orphan diseases for which the molecular basis is unknown.
Interoperability between biomedical ontologies through relation expansion, upper-level ontologies and automatic reasoning
Robert Hoehndorf, Michel Dumontier, Anika Oellrich, Dietrich Rebholz-Schuhmann, Paul N. Schofield and Georgios V. Gkoutos PLOS ONE, vol. 6(7), pp. e22006 (Unknown)
Applied Ontology, Semantic interoperability
Researchers design ontologies as a means to accurately annotate and integrate experimental data across heterogeneous and disparate data- and knowledge bases. Formal ontologies make the semantics of terms and relations explicit such that automated reasoning can be used to verify the consistency of knowledge. However, many biomedical ontologies do not sufficiently formalize the semantics of their relations and are therefore limited with respect to automated reasoning for large scale data integration and knowledge discovery. We describe a method to improve automated reasoning over biomedical ontologies and identify several thousand contradictory class definitions. Our approach aligns terms in biomedical ontologies with foundational classes in a top-level ontology and formalizes composite relations as class expressions. We describe the semi-automated repair of contradictions and demonstrate expressive queries over interoperable ontologies. Our work forms an important cornerstone for data integration, automatic inference and knowledge discovery based on formal representations of knowledge. Our results and analysis software are available at http://bioonto.de/pmwiki.php/Main/ReasonableOntologies.
Integrating systems biology models and biomedical ontologies
Hoehndorf, Robert, Dumontier, Michel, Gennari, John H., Wimalaratne, Sarala, de Bono, Bernard, Cook, Daniel L. and Gkoutos, Georgios V. BMC Systems Biology, vol. 5(1), pp. 124+ (Unknown)
Applied Ontology
BACKGROUND: Systems biology is an approach to biology that emphasizes the structure and dynamic behavior of biological systems and the interactions that occur within them. To succeed, systems biology crucially depends on the accessibility and integration of data across domains and levels of granularity. Biomedical ontologies were developed to facilitate such an integration of data and are often used to annotate biosimulation models in systems biology. RESULTS: We provide a framework to integrate representations of in silico systems biology with those of in vivo biology as described by biomedical ontologies, and demonstrate this framework using the Systems Biology Markup Language. We developed the SBML Harvester software that automatically converts annotated SBML models into OWL, and we apply our software to those biosimulation models that are contained in the BioModels Database. We utilize the resulting knowledge base for complex biological queries that can bridge levels of granularity, verify models based on the biological phenomenon they represent and provide a means to establish a basic qualitative layer on which to express the semantics of biosimulation models. CONCLUSIONS: We establish an information flow between biomedical ontologies and biosimulation models, and we demonstrate that the integration of annotated biosimulation models and biomedical ontologies enables the verification of models as well as expressive queries. Establishing a bi-directional information flow between systems biology and biomedical ontologies has the potential to enable large-scale analyses of biological systems that span levels of granularity from molecules to organisms.
The RICORDO approach to semantic interoperability for biomedical data and models: strategy, standards and solutions.
de Bono, Bernard, Hoehndorf, Robert, Wimalaratne, Sarala, Gkoutos, Georgios V. and Grenon, Pierre BMC Research Notes, vol. 4(1), pp. 313 (Unknown)
BACKGROUND: The practice and research of medicine generates considerable quantities of data and model resources (DMRs). Although in principle biomedical resources are re-usable, in practice few can currently be shared. In particular, the clinical communities in physiology and pharmacology research, as well as medical education, (i.e. PPME communities) are facing considerable operational and technical obstacles in sharing data and models. FINDINGS: We outline the efforts of the PPME communities to achieve automated semantic interoperability for clinical resource documentation in collaboration with the RICORDO project. Current community practices in resource documentation and knowledge management are overviewed. Furthermore, requirements and improvements sought by the PPME communities to current documentation practices are discussed. The RICORDO plan and effort in creating a representational framework and associated open software toolkit for the automated management of PPME metadata resources is also described. CONCLUSIONS: RICORDO is providing the PPME community with tools to effect, share and reason over clinical resource annotations. This work is contributing to the semantic interoperability of DMRs through ontology-based annotation by (i) supporting more effective navigation and re-use of clinical DMRs, as well as (ii) sustaining interoperability operations based on the criterion of biological similarity. Operations facilitated by RICORDO will range from automated dataset matching to model merging and managing complex simulation workflows. In effect, RICORDO is contributing to community standards for resource sharing and interoperability.
PIDO: The Primary Immunodeficiency Disease Ontology
Adams, Nico, Hoehndorf, Robert, Gkoutos, Georgios V., Hansen, Gesine and Hennig, Christian Bioinformatics (Unknown)
Applied Ontology, Disease genetics
Motivation: Primary Immunodeficiency Diseases (PIDs) are Mendelian conditions of high phenotypic complexity and low incidence. They usually manifest in toddlers and infants, although they can also occur much later in life. Information about PIDs is often widely scattered throughout the clinical as well as the research literature and hard to find for both generalists and experienced clinicians. Semantic Web technologies coupled to clinical information systems can go some way towards addressing this problem. Ontologies are a central component of such a system, containing and centralizing knowledge about primary immunodeficiencies in both a human- and computer-comprehensible form. The development of an ontology of PIDs is therefore a central step towards developing informatics tools, which can support the clinician in the diagnosis and treatment of these diseases. Results: We present PIDO, the Primary Immunodeficiency Disease Ontology. PIDO characterises PIDs in terms of the phenotypes commonly observed by clinicians during a diagnosis process. Phenotype terms in PIDO are formally defined using complex definitions based on qualities, functions, processes, and structures. We provide mappings to biomedical reference ontologies to ensure interoperability with ontologies in other domains. Based on PIDO, we developed the PIDFinder, an ontology-driven software prototype that can facilitate clinical decision support. PIDO connects immunological knowledge across resources within a common framework and thereby enables translational research and the development of medical applications for the domain of immunology and primary immunodeficiency diseases. Availability: The Primary Immunodeficiency Disease Ontology is available under a Creative Commons Attribution 3.0 (CC-BY 3.0) licence at http://code.google.com/p/pido/. The most recent public release of the ontology can always be found at http://purl.org/scimantica/pido/owl/pid.owl. An instance of the PIDFinder software can be found at http://pidfinder.appspot.com. Contact: nico.adams@csiro.au
Ontology design patterns to disambiguate relations between genes and gene products in GENIA
Hoehndorf, Robert, Ngonga Ngomo, Axel-Cyrille, Pyysalo, Sampo, Ohta, Tomoko, Oellrich, Anika and Rebholz-Schuhmann, Dietrich Journal of Biomedical Semantics, vol. 2(Suppl 5), pp. S1 (Unknown)
Applied Ontology, Ontology engineering
MOTIVATION: Annotated reference corpora play an important role in biomedical information extraction. A semantic annotation of the natural language texts in these reference corpora using formal ontologies is challenging due to the inherent ambiguity of natural language. The provision of formal definitions and axioms for semantic annotations offers the means for ensuring consistency as well as enables the development of verifiable annotation guidelines. Consistent semantic annotations facilitate the automatic discovery of new information through deductive inferences.
RESULTS: We provide a formal characterization of the relations used in the recent GENIA corpus annotations. For this purpose, we both select existing axiom systems based on the desired properties of the relations within the domain and develop new axioms for several relations. To apply this ontology of relations to the semantic annotation of text corpora, we implement two ontology design patterns. In addition, we provide a software application to convert annotated GENIA abstracts into OWL ontologies by combining both the ontology of relations and the design patterns. As a result, the GENIA abstracts become available as OWL ontologies and are amenable for automated verification, deductive inferences and other knowledge-based applications.
AVAILABILITY: Documentation, implementation and examples are available from http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/.
New approaches to the representation and analysis of phenotype knowledge in human diseases and their animal models
Schofield, Paul N., Sundberg, John P., Hoehndorf, Robert and Gkoutos, Georgios V. Briefings in Functional Genomics, vol. 10(5), pp. 258-265 (Unknown)
Applied Ontology, Disease genetics
The systematic investigation of the phenotypes associated with genotypes in model organisms holds the promise of revealing genotype–phenotype relations directly and without additional, intermediate inferences. Large-scale projects are now underway to catalog the complete phenome of a species, notably the mouse. With the increasing amount of phenotype information becoming available, a major challenge that biology faces today is the systematic analysis of this information and the translation of research results across species and into an improved understanding of human disease. The challenge is to integrate and combine phenotype descriptions within a species and to systematically relate them to phenotype descriptions in other species, in order to form a comprehensive understanding of the relations between those phenotypes and the genotypes involved in human disease. We distinguish between two major approaches for comparative phenotype analyses: the first relies on evolutionary relations to bridge the species gap, while the other approach compares phenotypes directly. In particular, the direct comparison of phenotypes relies heavily on the quality and coherence of phenotype and disease databases. We discuss major achievements and future challenges for these databases in light of their potential to contribute to the understanding of the molecular mechanisms underlying human disease. In particular, we discuss how the use of ontologies and automated reasoning can significantly contribute to the analysis of phenotypes and demonstrate their potential for enabling translational research.
OBML - Ontologies in Biomedicine and Life Sciences
Herre, Heinrich, Hoehndorf, Robert, Kelso, Janet, Loebe, Frank and Schulz, Stefan Journal of Biomedical Semantics, vol. 2(Suppl 4), pp. I1 (Unknown)
Applied Ontology
The OBML 2010 workshop, held at the University of Mannheim on September 9-10, 2010, is the 2nd in a series of meetings organized by the Working Group "Ontologies in Biomedicine and Life Sciences" of the German Society of Computer Science (GI) and the German Society of Medical Informatics, Biometry and Epidemiology (GMDS). Integrating, processing and applying the rapidly expanding information generated in the life sciences -- from public health to clinical care and molecular biology -- is one of the most challenging problems that research in these fields is facing today. As the amounts of experimental data, clinical information and scientific knowledge increase, there is a growing need to promote interoperability of these resources, support formal analyses, and to pre-process knowledge for further use in problem solving and hypothesis formulation.
The OBML workshop series pursues the aim of gathering scientists who research topics related to life science ontologies, to exchange ideas, discuss new results and establish relationships. The OBML group promotes the collaboration between ontologists, computer scientists, bio-informaticians and applied logicians, as well as the cooperation with physicians, biologists, biochemists and biometricians, and supports the establishment of this new discipline in research and teaching. Research topics of OBML 2010 included medical informatics, Semantic Web applications, formal ontology, bio-ontologies, knowledge representation as well as the wide range of applications of biomedical ontologies to science and medicine. A total of 14 papers were presented, and from these we selected four manuscripts for inclusion in this special issue.
An interdisciplinary audience from all areas related to biomedical ontologies attended OBML 2010. In the future, OBML will continue as an annual meeting that aims to bridge the gap between theory and application of ontologies in the life sciences. The next event emphasizes the special topic of the ontology of phenotypes, in Berlin, Germany on October 6-7, 2011.
Linking PharmGKB to phenotype studies and animal models of disease for drug repurposing
Robert Hoehndorf, Anika Oellrich, Dietrich Rebholz-Schuhmann, Paul N. Schofield and Georgios V. Gkoutos Pacific Symposium on Biocomputing (PSB), pp. 388-399 (Unknown)
Disease genetics, Drug mechanisms
The investigation of phenotypes in model organisms has the potential to reveal the molecular mechanisms underlying disease. The large-scale comparative analysis of phenotypes across species can reveal novel associations between genotypes and diseases. We use the PhenomeNET network of phenotypic similarity to suggest genotype–disease associations, combine them with drug–gene associations available from the PharmGKB database, and infer novel associations between drugs and diseases. We evaluate and quantify our results based on our method's capability to reproduce known drug–disease associations. We find and discuss evidence that levonorgestrel, tretinoin and estradiol are associated with cystic fibrosis ($p<2.65\cdot 10^{-6}$, $p<0.002$ and $p<0.031$, Wilcoxon signed-rank test, Bonferroni correction) and that ibuprofen may be active in chronic lymphocytic leukemia ($p<2.63\cdot 10^{-23}$, Wilcoxon signed-rank test, Bonferroni correction). To enable access to our results, we implement a web server and make our raw data freely available. Our results are the first steps in implementing an integrated system for the analysis and prediction of drug–disease associations for rare and orphan diseases for which the molecular basis is not known.
Mouse genetic and phenotypic resources for human genetics
Schofield, Paul N., Hoehndorf, Robert and Gkoutos, Georgios V. Human Mutation (Unknown)
Other
The use of model organisms to provide information on gene function has proved to be a powerful approach to our understanding of both human disease and fundamental mammalian biology. Large-scale community projects using mice, based on forward and reverse genetics, and now the pan-genomic phenotyping efforts of the International Mouse Phenotyping Consortium (IMPC), are generating resources on an unprecedented scale which will be extremely valuable to human genetics and medicine. We discuss the nature and availability of data, mice and ES cells from these large-scale programmes, the use of these resources to help prioritise and validate candidate genes in human genetic association studies, and how they can improve our understanding of the underlying pathobiology of human disease.
Jupp, Simon, Stevens, Robert and Hoehndorf, Robert Journal of Biomedical Semantics, vol. 3(Suppl 1), pp. S3 (Unknown)
Applied Ontology
MOTIVATION: Ontologies such as the Gene Ontology (GO) and their use in annotations make cross species comparisons of genes possible, along with a wide range of other analytical activities. The bio-ontologies community, in particular the Open Biomedical Ontologies (OBO) community, have provided many other ontologies and an increasingly large volume of annotations of gene products that can be exploited in query and analysis. As many annotations with different ontologies centre upon gene products, there is a possibility to explore gene products through multiple ontological perspectives at the same time. Questions could be asked that link a gene product's function, process, cellular location, phenotype and disease. Current tools, such as AmiGO, allow exploration of genes based on their GO annotations, but not through multiple ontological perspectives. In addition, the semantics of these ontologies' representations should be able to, through automated reasoning, afford richer query opportunities of the gene product annotations than is currently possible.
RESULTS: To do this multi-perspective, richer querying of gene product annotations, we have created the Logical Gene Ontology, or GOAL ontology, in OWL that combines the Gene Ontology, Human Disease Ontology and the Mammalian Phenotype Ontology, together with classes that represent the annotations with these ontologies for mouse gene products. Each mouse gene product is represented as a class, with the appropriate relationships to the GO aspects, phenotype and disease with which it has been annotated. We then use defined classes to query these protein classes through automated reasoning, and to build a complex hierarchy of gene products. We have presented this through a Web interface that allows arbitrary queries to be constructed and the results displayed.
CONCLUSION: This standard use of OWL affords a rich interaction with Gene Ontology, Human Disease Ontology and Mammalian Phenotype Ontology annotations for the mouse, to give a fine partitioning of the gene products in the GOAL ontology. OWL in combination with automated reasoning can be effectively used to query across ontologies to ask biologically rich questions. We have demonstrated that automated reasoning can be used to deliver practical on-line querying support for the ontology annotations available for the mouse.
AVAILABILITY: The GOAL Web page is to be found at http://owl.cs.manchester.ac.uk/goal.
An infrastructure for ontology-based information systems in biomedicine: RICORDO case study
Wimalaratne, Sarala M., Grenon, Pierre, Hoehndorf, Robert, Gkoutos, Georgios V. and de Bono, Bernard Bioinformatics, vol. 28(3), pp. 448-450 (Unknown)
Applied Ontology
Summary: The article presents an infrastructure for supporting the semantic interoperability of biomedical resources based on the management (storing and inference-based querying) of their ontology-based annotations. This infrastructure consists of: (i) a repository to store and query ontology-based annotations; (ii) a knowledge base server with an inference engine to support the storage of and reasoning over ontologies used in the annotation of resources; (iii) a set of applications and services allowing interaction with the integrated repository and knowledge base. The infrastructure is being prototyped and developed and evaluated by the RICORDO project in support of the knowledge management of biomedical resources, including physiology and pharmacology models and associated clinical data.
Availability and implementation: The RICORDO toolkit and its source code are freely available from http://ricordo.eu/relevant-resources.
Contact: sarala@ebi.ac.uk
Neural Multi-hop Logical Query Answering with Concept-Level Answers
Tang, Zhenwei, Pei, Shichao, Peng, Xi, Zhuang, Fuzhen, Zhang, Xiangliang and Hoehndorf, Robert The Semantic Web – ISWC 2023, pp. 522-540, In: Payne, Terry R., Presutti, Valentina, Qi, Guilin, Poveda-Villalón, María, Stoilos, Giorgos, Hollink, Laura, Kaoudi, Zoi, Cheng, Gong and Li, Juanzi (Eds.) (Unknown)
Applied Ontology, Neuro-symbolic AI
Neural multi-hop logical query answering (LQA) is a fundamental task to explore relational data such as knowledge graphs, which aims at answering multi-hop queries with logical operations based on distributed representations of queries and answers. Although previous LQA methods can give specific instance-level answers, they are not able to provide descriptive concept-level answers, where each concept is a description of a set of instances. Concept-level answers are more comprehensible to users and are of great usefulness in the field of applied ontology. In this work, we formulate the problem of LQA with concept-level answers (LQAC), solving which needs to address challenges in incorporating, representing, and operating on concepts. We propose an original solution for LQAC. Firstly, we incorporate description logic-based ontological axioms to provide the source of concepts. Then, we represent concepts and queries as fuzzy sets, i.e., sets whose elements have degrees of membership, to bridge concepts and queries with instances. Moreover, we design operators involving concepts on top of fuzzy set representation of concepts and queries for optimization and inference. Extensive experimental results on three real-world datasets demonstrate the effectiveness of our method for LQAC. In particular, we show that our method is promising in discovering complex logical biomedical facts.
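To make the fuzzy-set idea in this abstract concrete, here is a minimal sketch of fuzzy sets as maps from elements to membership degrees, with the textbook min/max/complement operators. It is not the model from the paper: the choice of operators, the function names, and the toy instance names are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List

# A fuzzy set maps each element to a membership degree in [0, 1].
FuzzySet = Dict[str, float]


def fuzzy_and(a: FuzzySet, b: FuzzySet) -> FuzzySet:
    """Intersection using the minimum t-norm (one common choice)."""
    keys = a.keys() | b.keys()
    return {k: min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys}


def fuzzy_or(a: FuzzySet, b: FuzzySet) -> FuzzySet:
    """Union using the maximum t-conorm."""
    keys = a.keys() | b.keys()
    return {k: max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys}


def fuzzy_not(a: FuzzySet, universe: Iterable[str]) -> FuzzySet:
    """Complement relative to an explicit universe of elements."""
    return {k: 1.0 - a.get(k, 0.0) for k in universe}


def answers(s: FuzzySet, threshold: float) -> List[str]:
    """Elements whose membership degree reaches the threshold, sorted by name."""
    return sorted(k for k, degree in s.items() if degree >= threshold)


# Hypothetical toy data: two concepts described by their instances.
concept_a: FuzzySet = {"x1": 0.75, "x2": 0.25}
concept_b: FuzzySet = {"x1": 0.5, "x3": 1.0}

both = fuzzy_and(concept_a, concept_b)
either = fuzzy_or(concept_a, concept_b)
```

In this toy setting, a conjunctive query over the two concepts keeps only the instances that belong to both to some degree, and `answers(both, 0.5)` selects those above a chosen cut-off.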
Interactively Exploring Graph Coloring Algorithms in a Bilingual Web Platform with Gamification
Maha Alrashed, Lujain Alharbi, Omamah Talal Al-Muhammadi, Salha Bahadiq, Robert Hoehndorf and Liam Mencel Proceedings of EdMedia: World Conference on Educational Media and Technology 2017, pp. 298-302, In: Joyce P. Johnston (Ed.) (2017)
Bioinformatics
Graph coloring is a concept in graph theory with many real-world applications, such as scheduling and map coloring, which makes it an essential part of a computer science curriculum. Most graph theory courses are taught using standard methods such as textbooks or a blackboard. Such methods introduce graph theory without giving students an adequate understanding of the process and the computational complexity of this NP-complete problem, which in turn hinders their ability to understand, implement, and develop graph coloring algorithms. In this paper, we describe the conceptual design of a bilingual web platform developed to provide students with interactive exploration of different graph coloring algorithms.
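For readers unfamiliar with the problem, the following sketch shows greedy coloring, one of the simplest graph coloring heuristics. The abstract does not say which algorithms the platform covers, so this is an assumed example, and the function names and sample graph are hypothetical. Greedy coloring always yields a proper coloring but does not guarantee the minimum number of colors.

```python
from typing import Dict, Hashable, List, Optional, Sequence

Graph = Dict[Hashable, List[Hashable]]  # vertex -> list of neighbours


def greedy_coloring(graph: Graph, order: Optional[Sequence[Hashable]] = None) -> Dict[Hashable, int]:
    """Assign each vertex the smallest color index not used by its colored neighbours."""
    colors: Dict[Hashable, int] = {}
    for vertex in (order if order is not None else sorted(graph)):
        used = {colors[n] for n in graph[vertex] if n in colors}
        color = 0
        while color in used:
            color += 1
        colors[vertex] = color
    return colors


def is_proper(graph: Graph, colors: Dict[Hashable, int]) -> bool:
    """Check that no edge joins two vertices of the same color."""
    return all(colors[v] != colors[n] for v in graph for n in graph[v])


# Hypothetical example: a triangle A-B-C with an extra vertex D attached to C.
example: Graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}
coloring = greedy_coloring(example)
```

The result depends on the vertex order, which is one reason interactive exploration is useful: trying different orders on the same graph shows how the heuristic can use more colors than necessary.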