This article was reviewed by Paul Sternberg, PhD, from the California Institute of Technology and the Gene Ontology Consortium (GOC).


A scientist in a lab coat analyzes data on a monitor
Scientists can use GO to analyze the results of high-throughput data and make inferences about the active biological processes in a specimen. 
iStock: Wiseman

What Is Gene Ontology?

The Gene Ontology (GO) database is an effort to create an explicitly defined dictionary to describe the roles of gene products and their relationships to each other.1,2  “It is essentially a vocabulary that allows scientists or biologists across all domains of life to describe what they are seeing… Every term is defined and has a relation to other terms,” said Paul Sternberg, a molecular biologist and geneticist at the California Institute of Technology and a principal investigator with the Gene Ontology Consortium (GOC).3 The GOC consists of a group of scientists dedicated to developing and maintaining the GO. Through the GOC-run GO, all biologists have a universal vocabulary to discuss the function of genes, even if they are working with vastly different species.

What Information Is Included in an Individual Ontology? 

The most basic unit of a specific ontology is a “term,” which refers to an attribute used to describe a specific gene function or product.3 GO terms are themselves composed of several core elements.

  • An alphanumeric ID, which serves as a unique identifier
  • A human-readable name
  • The domain to which it belongs (molecular function, cellular component, or biological process)
  • A definition of the function along with a cited source
  • Its relationship to other terms in the ontology
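
As a rough illustration, the five elements above could be held in a small record type. The Python sketch below is only one possible layout: the class and field names are hypothetical and do not come from any GO software, and the definition and source strings are placeholders rather than official GO wording. It assumes that a GO ID is the prefix "GO:" followed by seven digits, as in GO:0006281 (DNA repair).

```python
import re
from dataclasses import dataclass
from enum import Enum


class Domain(Enum):
    """The three domains a GO term can belong to."""
    MOLECULAR_FUNCTION = "molecular function"
    CELLULAR_COMPONENT = "cellular component"
    BIOLOGICAL_PROCESS = "biological process"


# Assumed ID shape: the prefix "GO:" followed by exactly seven digits.
GO_ID_PATTERN = re.compile(r"GO:\d{7}")


@dataclass(frozen=True)
class GOTerm:
    """Hypothetical container for the core elements of one GO term."""
    go_id: str                  # unique identifier
    name: str                   # human-readable name
    domain: Domain              # one of the three domains
    definition: str             # what the term means
    definition_source: str      # citation for the definition
    relationships: tuple = ()   # pairs of (edge type, ID of the related term)

    def __post_init__(self):
        # Reject anything that does not look like a GO ID.
        if not GO_ID_PATTERN.fullmatch(self.go_id):
            raise ValueError(f"not a GO ID: {self.go_id!r}")


# Example record; the definition and source are placeholders, not GO text.
dna_repair = GOTerm(
    go_id="GO:0006281",
    name="DNA repair",
    domain=Domain.BIOLOGICAL_PROCESS,
    definition="<placeholder: definition text not reproduced here>",
    definition_source="<placeholder: cited source>",
)
```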

Every GO term falls under one of three broad biological domains.4

  • Molecular Function: Activities conducted by gene products at the molecular level.
  • Cellular Component: The cellular entity in which the gene product carries out its function, or the larger complex the gene product is a part of.
  • Biological Process: The overarching process to which the gene product contributes.

Components of a gene ontology: A gene ontology becomes less specific as you move from the bottom to the top. The key base term for an individual ontology is at the bottom of the graph. As you move up, terms that describe the role and function of the base term and subsequent terms are added in layers that become progressively more general. These terms are connected by color-coded arrows, called edges, which indicate the relationships between terms.
The Scientist

Specific ontologies are loosely hierarchical structures that graph the relationships between these terms. They are represented using directed acyclic graphs (DAGs), a system in which general “parent” terms give rise to more specific “child” terms, like a flowchart. Unlike a strict tree, a DAG allows a term to have more than one parent. The terms are called “nodes,” and their relationships are described using “edges.” Common edge phrases include the following.

  • Is_a: This indicates that a “child” node is a subclass of a “parent” node. For example: the endoplasmic reticulum is an organelle.
  • Part_of: This indicates a “child” node is a component of a “parent” node. In these designations, the “child” cannot exist without the “parent.” For example, the endoplasmic reticulum is part of the cell.
  • Regulates: This indicates that one process directly modifies another. For example, certain terms may regulate transcription or regulate apoptosis. Edges that specify positive or negative regulations also exist.
  • Has_part: This indicates that a “parent” node absolutely requires the “child” node to exist or function. For example, the electron transport chain has part Coenzyme Q (CoQ). 
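
To make the graph structure more concrete, here is a minimal Python sketch of a DAG that stores child-to-parent edges and collects every ancestor of a term. It is an illustration under stated assumptions, not how GO tools are implemented: the class and method names are hypothetical, terms are identified by name rather than by GO ID, and only is_a and part_of edges are accepted because both point from child to parent. The two endoplasmic reticulum edges come from the examples above.

```python
from collections import defaultdict

# Edge types this sketch accepts; both point from a child to a parent.
CHILD_TO_PARENT_EDGES = {"is_a", "part_of"}


class OntologyGraph:
    """Hypothetical minimal DAG of ontology terms."""

    def __init__(self):
        # Maps each child term to a list of (edge type, parent term) pairs.
        self._parents = defaultdict(list)

    def add_edge(self, child, edge_type, parent):
        if edge_type not in CHILD_TO_PARENT_EDGES:
            raise ValueError(f"unsupported edge type: {edge_type}")
        # A cycle would form if the parent can already reach the child.
        if child == parent or child in self.ancestors(parent):
            raise ValueError("edge would create a cycle")
        self._parents[child].append((edge_type, parent))

    def ancestors(self, term):
        """Return every term reachable by following edges upward."""
        seen = set()
        stack = [term]
        while stack:
            node = stack.pop()
            for _edge_type, parent in self._parents.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


graph = OntologyGraph()
graph.add_edge("endoplasmic reticulum", "is_a", "organelle")
graph.add_edge("endoplasmic reticulum", "part_of", "cell")

# One child with two parents: the reason the structure is a DAG, not a tree.
print(sorted(graph.ancestors("endoplasmic reticulum")))
```

Walking upward like this is what lets a tool treat a gene annotated to a specific term as also belonging to that term's more general ancestors.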

Gene Ontology Annotations

Annotations are another vital aspect of GO. They establish a connection between a GO term (a function) and the gene or gene product it describes.5 In simpler terms, GO annotations use GO terms to attribute a function to a gene or gene product. For instance, tumor protein 53 (TP53), breast cancer gene 1 (BRCA1), ABL proto-oncogene 1 (ABL1), and vascular endothelial growth factor (VEGF) are all genes or gene products. The GO terms included in their respective annotations could encompass DNA-binding transcription factor activity (GO:0003700), DNA repair (GO:0006281), protein tyrosine kinase activity (GO:0004713), and positive regulation of angiogenesis (GO:0045766). To ensure accuracy, annotations must reference published scientific literature that backs the association between a GO term and a gene or gene product.
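
One way to picture an annotation is as a record that ties a gene to a GO term and to the reference that supports the link. The Python sketch below uses the four examples from the paragraph above, pairing genes and terms in the order they are listed, which is an assumption. The record layout and helper names are hypothetical, and the reference field is left empty because the text cites no specific paper for these pairs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record linking a gene to a GO term."""
    gene: str
    go_id: str
    term_name: str
    reference: Optional[str]  # supporting publication; None if not recorded


# Gene/term pairs follow the order of the lists in the text (an assumption).
annotations = [
    Annotation("TP53", "GO:0003700",
               "DNA-binding transcription factor activity", None),
    Annotation("BRCA1", "GO:0006281", "DNA repair", None),
    Annotation("ABL1", "GO:0004713",
               "protein tyrosine kinase activity", None),
    Annotation("VEGF", "GO:0045766",
               "positive regulation of angiogenesis", None),
]


def terms_for(gene, records):
    """GO IDs annotated to one gene."""
    return [a.go_id for a in records if a.gene == gene]


def genes_for(go_id, records):
    """Genes annotated to one GO term."""
    return [a.gene for a in records if a.go_id == go_id]


def missing_reference(records):
    """Records that would fail the literature-reference requirement."""
    return [a for a in records if a.reference is None]


print(terms_for("BRCA1", annotations))
print(genes_for("GO:0045766", annotations))
```

In this sketch every record would be flagged by missing_reference, which mirrors the rule that a real annotation must point to published evidence.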

Gene Ontology Knowledgebase 

Altogether, the standard ontologies, annotations, and GO Causal Activity Models (GO-CAMs), which combine annotations into structured descriptions of biological functions, make up the GO Knowledgebase. As of April 2024, the GO Knowledgebase contained 42,255 GO terms, 7,671,375 annotations, 1,536,921 gene products, and 5,404 species.7 Included species range from model organisms such as Schizosaccharomyces pombe (fission yeast), Danio rerio (zebrafish), and Saccharomyces cerevisiae (brewer's or baker's yeast) to lesser-known organisms such as Gallus gallus (red junglefowl) and Dictyostelium discoideum (an amoeba).

The enormous range of terms, annotations, gene products, and represented species illustrates the significant expansion GO has undergone since its creation in 1998.1 It continues to grow to this day; ontologies are considered “dynamic” in that they grow and shift as scientific knowledge expands. Members of the GO Consortium collaborate with genomic databases, such as UniProt, MGI, and Reactome, to constantly update, review, and revise the information included in the Knowledgebase. The Consortium also welcomes community feedback for revisions and submissions for new annotations.

Using GO for Research

Scientists mostly use GO to analyze the results of high-throughput data, such as next-generation sequencing and microarray results. Most often, GO is used in descriptive “-omics” papers that compare the genomic, proteomic, or transcriptomic differences between two cell types or conditions. Most of these experiments yield thousands of genes, making it impractical to sift through them individually and assign a function to each. Instead, researchers first use analysis tools to narrow the results down to a subset of genes with significant differential expression. They then use GO-specific programs to draw inferences about the pathways and processes those genes represent. Two foundational GO analyses are as follows.

  • Over-Representation Analysis (ORA) / Enrichment Analysis:8 This is one of the simplest and most common GO analyses. It assigns the input genes of interest to pathways and calculates whether each pathway is more or less represented than would be expected. This first requires placing differential expression results in the context of the original raw inputs. For example, say an analysis started with 1,000 genes, 100 of which were related to apoptosis. If 100 significantly differentially expressed genes were identified, 10 would be expected to be related to apoptosis. A deviation from that expected number would be flagged, but it could also happen purely by chance. To determine which over- or under-representations are statistically significant, GO tools use statistical approaches such as the hypergeometric distribution, Fisher’s exact test, or the chi-square test to calculate a p-value.
  • Functional Class Scoring (FCS):9 Functional class scoring refers to a class of methodologies, the best known of which is gene set enrichment analysis (GSEA). This approach takes ORA further by ranking genes based on their differential expression value and direction of change. Next, an enrichment score for various pathways is determined based on where their associated genes fall on the ranked list. Unlike ORA, GSEA does not depend on a significance cutoff because it takes all genes as inputs, not just those that were significantly differentially expressed.
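
The arithmetic behind both analyses can be sketched in a few lines of Python. The first two functions reproduce the apoptosis example (1,000 genes, 100 annotated, 100 selected, 10 expected) and compute an upper-tail hypergeometric p-value using only the standard library. The third is a deliberately simplified, unweighted running-sum score in the spirit of GSEA; published GSEA also weights genes by their ranking statistic and assesses significance by permutation, which this sketch omits. Function names, the observed count of 20, and the gene labels in the last example are all illustrative assumptions.

```python
from math import comb


def expected_hits(population, annotated, selected):
    """Expected number of annotated genes in a random selection."""
    return selected * annotated / population


def ora_p_value(population, annotated, selected, observed):
    """Upper-tail hypergeometric probability P(X >= observed).

    population: genes measured in the experiment (the background)
    annotated:  background genes that carry the GO term
    selected:   significantly differentially expressed genes
    observed:   selected genes that carry the GO term
    """
    most_possible = min(annotated, selected)
    ways_total = comb(population, selected)
    ways_tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(observed, most_possible + 1)
    )
    return ways_tail / ways_total


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score over a ranked gene list.

    Steps up at each gene in the set and down at each gene outside it,
    then returns the largest deviation from zero (sign preserved).
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_misses = len(hits) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running, extreme = 0.0, 0.0
    for is_hit in hits:
        running += 1 / n_hits if is_hit else -1 / n_misses
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


# ORA: the worked example from the text, with a hypothetical 20 observed hits.
print(expected_hits(1000, 100, 100))      # 10.0
print(ora_p_value(1000, 100, 100, 20))

# FCS: hypothetical gene labels, ranked from most up- to most down-regulated.
print(enrichment_score(["g1", "g2", "g3", "g4", "g5", "g6"], {"g1", "g2"}))
```

A positive score means the gene set clusters near the top of the ranked list and a negative score means it clusters near the bottom. In practice, p-values from many GO terms are also corrected for multiple testing, a step not shown here.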

Commonly Used Tools for Gene Ontology Analysis 

There are a variety of Web-, R-, Python-, and even Java-based tools to conduct ORA, FCS, and other analyses. Below is a table describing some of the commonly used tools in GO analysis, though it is by no means an exhaustive list.

Table: Commonly used GO analysis tools

| Tool | Description | Advantages | Disadvantages |
| --- | --- | --- | --- |
| clusterProfiler10 | R package for analysis of both coding and non-coding genomic data | Supports both ORA and GSEA, and allows comparison of gene lists from different experimental conditions or time points. Outputs include a variety of visually appealing graphs that integrate the tidy interface. Includes databases outside of GO, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG). | Requires knowledge of R programming. |
| g:Profiler11 | Web-based tool for ORA/enrichment analysis. Also allows for retrieval of orthologous genes and single-nucleotide polymorphisms. | Has a variety of visualization options. Includes databases outside of GO, such as KEGG, Human Protein Atlas, and Human Phenotype Ontology. The web server is user-friendly, especially for those without a programming background. Also available as an R package. | A 2019 update addressed previous limitations by expanding the included species and databases. |
| Enrichr12 | Web-based tool for ORA/enrichment analysis | Similar benefits to g:Profiler, but its user interface makes it better for complete beginners. | Enrichr has undergone improvements over the years, notably by enabling the inclusion of background gene lists. Background genes establish a standardized baseline against which differentially expressed genes can be consistently compared. |
| DAVID13 | Web-based tool | Provides functional annotation and pathway information for input gene lists. | Released in 2003, DAVID is one of the older GO analysis tools. While still a valuable resource, it may not be as up to date as other platforms. |
| PANTHER14 | Web-based tool for ORA/enrichment analysis | PANTHER is run by Paul Thomas, one of the principal investigators with the GOC. | Developers have frequently updated PANTHER over the past two decades. One future update that developers have identified is better handling of gene fusion events. |

Drawbacks and GO Research Pitfalls

One of the drawbacks of GO is that it is incomplete. “If you ask most experts to look at the Gene Ontology and the relationship with the terms and see if their domain is covered [they will say] it is not covered that well,” said Sternberg. Despite the considerable effort put into integrating new literature into the GO, there is always more ground to cover. Consequently, the biological snapshot provided by GO annotations and GO-CAMs likely represents only part of the story. Researchers still need to conduct their own investigations and draw on their own knowledge when interpreting the results of a GO analysis. They must also consider what a GO analysis might be leaving out. A single gene that is differentially expressed and responsible for a vital function may be overlooked in a GO analysis if the larger gene set it belongs to is not enriched. Conversely, a pathway may be enriched in a GO analysis, but that does not guarantee that it is contributing to a significant or interesting function.

While GO serves as a valuable tool, it should not stop researchers from applying their expertise to interpret their data and results. Additionally, encouraging investigators to contribute to GO annotations themselves enhances the quality and comprehensiveness of the GO database. Sharing knowledge and contributing to a community resource are critical aspects of promoting the utility of GO.

References

1. Ashburner M, et al. Gene Ontology: tool for the unification of biology. Nat Genet. 2000;25(1):25-29.

2. Aleksander SA, et al. The Gene Ontology knowledgebase in 2023. Genetics. 2023;224(1):iyad031.

3. About the GO. Gene Ontology. Published May 30, 2024. Accessed June 22, 2024. http://geneontology.org/docs/introduction-to-go

4. GO term elements. Gene Ontology Resource. Published May 30, 2024. Accessed June 12, 2024. http://geneontology.org/docs/GO-term-elements

5. Gene Ontology Consortium. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004;32(suppl_1):D258-D261. 

6. Hill DP, et al. Gene Ontology annotations: what they mean and where they come from. BMC Bioinformatics. 2008;9(5):S2.

7. Thomas PD, et al. Gene Ontology Causal Activity Modeling (GO-CAM) moves beyond GO annotations to structured descriptions of biological functions and systems. Nat Genet. 2019;51(10):1429-1433.

8. Gene Ontology Resource. Gene Ontology Resource. Accessed June 12, 2024. http://geneontology.org/

9. Reimand J, et al. Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, Cytoscape and EnrichmentMap. Nat Protoc. 2019;14(2):482-517.

10. Pathways and gene sets: What is functional enrichment analysis? NIH Center for Cancer Research. Accessed May 28, 2024. https://bioinformatics.ccr.cancer.gov/btep/pathways-and-gene-sets-what-is-functional-enrichment-analysis/

11. Yu G, et al. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS J Integr Biol. 2012;16(5):284-287.

12. Reimand J, et al. g:Profiler—a web-based toolset for functional profiling of gene lists from large-scale experiments. Nucleic Acids Res. 2007;35(suppl_2):W193-W200.

13. Chen EY, et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics. 2013;14:128.

14. Huang DW, et al. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4(1):44-57.

15. Thomas PD, et al. PANTHER: A library of protein families and subfamilies indexed by function. Genome Res. 2003;13(9):2129-2141.
