5 Discovering Functional Gene Clusters Using Genome Context Methods
5.1 How are FunGCs Computed?
5.2 BioCyc Ortholog Data and the Reference Genomes
5.3 Computing Pairwise Functional Linkage Scores
5.4 Computation of Functionally Linked Gene Clusters
5.5 How to Use Functional Gene Cluster Information in BioCyc
This document provides an overview of the BioCyc collection of Pathway/Genome Databases.
Although its content is limited at the current time, it will expand over time to cover additional aspects of BioCyc. The information in this document pertains to all BioCyc databases (DBs), and to most other DBs created using the Pathway Tools software. More detailed information about specific members of the BioCyc family is available as follows:
The BioCyc collection of Pathway/Genome Databases (PGDBs) provides electronic reference sources on the pathways and genomes of many organisms. BioCyc databases describe organisms with sequenced genomes. BioCyc is primarily microbial. In addition, BioCyc contains databases for humans; for important model organisms such as yeast, fly, and mouse; and for other eukaryotes whose PGDBs have been curated. One reason for collecting all these PGDBs together within BioCyc is to enable the comparative analyses that become possible when multiple PGDBs are available within one site (see Tools → Analysis → Comparative Analysis). PGDBs have been created for many other organisms, including microbes, fungi, plants (see PlantCyc.org and animals [details].
The databases (DBs) within the BioCyc collection are organized into tiers according to the amount of manual review and updating they have received.
Tier 1 PGDBs have been created through intensive manual efforts, and receive continuous updating [details of Tier 1]. The BioCyc Tier 1 DBs are:
EcoCyc, which describes Escherichia coli K-12.
MetaCyc, which describes experimentally elucidated enzymes and metabolic pathways from more than 2,740 organisms. MetaCyc does not seek to model the complete metabolic network of any one organism, but to provide a comprehensive collection of experimental pathways.
Tier 2 PGDBs were computationally generated by the PathoLogic program, and have undergone moderate amounts of review and updating. There are 71 PGDBs in Tier 2. [details of Tier 2]
We encourage scientists to adopt the Tier 2 and Tier 3 DBs for ongoing curation and updating.
Most microbial PGDBs within BioCyc have been generated computationally by SRI and are regenerated every 6-12 months to take advantage of improvements in our pathway prediction algorithms and in the MetaCyc pathway database. The PGDBs within BioCyc that have been provided by outside groups are updated with variable frequencies. Usually the date on which a PGDB was generated or last updated can be determined by selecting that PGDB as the current PGDB and then viewing the page at Tools → Analysis → Summary Statistics or Tools → Analysis → Reports → History of Updates.
Tier 3 PGDBs are generated using an automated computational pipeline. The genome data source for a specific PGDB can be determined by selecting that PGDB as current database and then executing Tools → Analysis → Summary Statistics. Genome data sources include:
Most Tier 3 genomes are obtained from RefSeq: ftp://ftp.ncbi.nih.gov/genomes/all/GCF
Human Microbiome Project genomes are obtained from: ftp://ftp.ncbi.nih.gov/genomes/HUMAN_MICROBIOM/Bacteria
The processing steps used to create BioCyc from an annotated genome are as follows.
Each genome is converted to files in the PathoLogic format that serves as input to the PathoLogic component of the Pathway Tools software.
PathoLogic generates a BioCyc PGDB for the organism (see further details below). The following PathoLogic inference components are executed: pathway predictor, pathway hole filler, operon predictor, transport inference parser.
A combination of automated and manual quality assurance checks are applied to the generated PGDBs to ensure they are free of errors. Some genomes are not included in BioCyc because their data fail these quality checks, e.g., some genomes have such a small number of proteins with predicted functions that they are not included. These checks are described in detail in the BioCyc PGDB Concepts Guide, section “Quality Checking of PGDB Data”.
Although all processing steps are over seen by our staff, no human curation is applied to the Tier 3 PGDBs.
The specific processing steps used to generate Tier 3 PGDBs are as follows.
Predict metabolic pathways .
Predict operons .
Predict pathway hole fillers .
Predict transporter reactions .
Generate an organism-specific metabolic map diagram (cellular overview) .
Import from UniProt  predicted and experimental protein functions, gene names, protein feature data (e.g., protein phosphorylation sites, active sites, and transmembrane regions), and Gene Ontology terms. Note that UniProt lacks these data for some organisms.
Data extracted from UniProt is copyright of the UniProt Consortium and subject to the Creative Commons Attribution (CC BY 4.0) License and the disclaimers at uniprot.org/help/license. The original data is available from uniprot.org.
Import organism metadata such as relationship to oxygen, geographic location of sample, host for pathogens, and human microbiome site from NCBI BioProject/BioSample, PATRIC, and other available sources such as GOLD.
Compute the presence of Pfam domains in proteins (they are depicted as protein features).
Compute ortholog relationships among proteins within all BioCyc PGDBs.
Create web links from each BioCyc genome and each BioCyc gene/protein, to as many external bioinformatics DBs as possible, including to RefSeq, UniProt, and to additional DBs via the UniProt ID. The full list of DBs to which BioCyc PGDBs can include links are:
Cazy, dip, disprot, interpro, mint, panther, pdb, phosphosite, pride, prints, prodom, proteinmodelportal, prosite, pfam, smart, smr, string
Operationally, BioCyc uses the term “ortholog” not in the strict evolutionary sense defined by Fitch  but in the looser sense of genes that are likely to be counterparts of one another because they are the most closely related in a pair of organisms. Although such genes are likely to be evolutionarily related in the sense of Fitch, we do not perform a detailed evolutionary analysis to compute BioCyc orthologs.
BioCyc ortholog information is generated by running a sequence comparison program pair-wise between all proteomes of all PGDBs. We do not compute sequence similarity information for RNA genes or pseudo genes, therefore BioCyc does not currently contain orthology information for such genes.
In the past, we used NCBI BLAST (“legacy” BLAST version 2.2.23) for the sequence comparisons, and some ortholog information is still retained that was computed by BLAST. However, since 2021, we have switched to a much faster program called Diamond (version 2.0.4), which in its “sensitive mode” still produces comparable results. New orthologs are now computed by Diamond, as well as for older PGDBs, if they are updated with newer versions of genome annotations.
We use an E-value cut-off of 0.001 when running Diamond (or BLAST), with all other parameters left at their default settings. We define two proteins A and B as orthologs if protein A from proteome PA and protein B from proteome PB are bi-directional “best” comparison hits of one another, meaning that protein B is the best hit of protein A within proteome PB, and protein A is the best hit of protein B within proteome PA. In rare cases, protein A might have multiple orthologs in proteome PB, as explained below.
The “best” hit(s) of protein A in proteome PB is defined by finding the minimal E-value among all hits in proteome PB in the sequence comparison output. There could be hits to multiple proteins in proteome PB that share that same minimal E-value. In other words, ties are possible, as in the case of exact gene duplications. We attempt to break ties using two methods: taking the hit with the maximum alignment length; and then taking the hit with the maximum alignment amino acid residue identity. For the first method, we compare the alignment lengths among all the hits of protein A in proteome PB that share the same minimum E-value, and the protein in proteome PB with the maximum alignment length is selected. For the second method, we compare the number of identical amino acid residues in the alignments between protein A and the hits of protein A in proteome PB that share the same minimum E-value, and the protein in proteome PB with the maximum number of identical amino acid residues is selected. In the case that ties still remain (as in the case of exact gene duplications), all ties are included in the final set of orthologs used by BioCyc. Thus, protein A could have multiple orthologs in PB, such as if multiple proteins B1, B2, etc, exist in PB, and have exactly the same regions align against protein A. BioCyc does not calculate paralogs. Although orthologs are defined for most pairs of PGDBs in BioCyc, some PGDBs lack ortholog data. A given pair of organisms might lack orthologs within BioCyc for the following possible reasons.
One of the organism PGDBs might not have sequence data
The two organisms are evolutionarily distant and their genes do not have any bi-directional best comparison hits
The ortholog computation is pending in the future
This aspect of BioCyc is meant to enable the discovery of new pathways, and to provide clues as to the functions of genes of unknown function. A functionally linked gene cluster, or FunGC, refers to a group of genes that are estimated by the system to operate in a common pathway, a common cellular process, or a common protein complex (note the term “cluster” does not imply physical clustering on a chromosome). There is no guarantee as to what type of functional linkages exist among the genes in a FunGC. FunGCs are predicted using comparative-genomics methods called genome-context methods [7, 6, 20, 5, 16] that search for patterns across hundreds of genomes. Details of the methods used by BioCyc are described in the next sections.
BioCyc includes FunGCs for 10 genomes as of October 2014. In later releases we plan to expand the number of genomes for which FunGCs are available.
For example, E. coli genes bioA, bioB, bioD, and bioF, which participate in the metabolic pathway for biotin biosynthesis, are all predicted by our method to form a FunGC. Another E. coli FunGC consists of genes xylF, xylH, xylA, and xylB. The xylF and xylH genes encode a xylose ABC transporter, while xylA and xylB encode a xylose isomerase and a xylulokinase, respectively, which participate in a xylose degradation pathway. The full list of E. coli FunGCs is available at http://biocyc.org/ECOLI/genome-context.
We will use E. coli gene yqeB to illustrate BioCyc FunGCs in more detail. yqeB is a gene of unknown function; it is annotated as a conserved protein with a NAD(P)-binding Rossman fold. yqeB forms a FunGC with genes xdhC (xanthine dehydrogenase, Fe-S subunit), xdhB (xanthine dehydrogenase subunit, FAD-binding domain), xdhA (xanthine dehydrogenase subunit), and paoA (aldehyde dehydrogenase, Fe-S subunit). Thus, one might infer because of its assignment to a common FunGC with these four other genes that yqeB forms a common pathway with these genes.
FunGCs are computed by a two-step process. Given all genes in a given organism,
Step 1: Compute a pairwise functional-linkage score between every pair of genes in the organism.
Step 2: Compute FunGCs by searching for highly connected sets of functionally linked genes from Step 1
In a moment we will consider these steps in more detail. But first we discuss the reliance of these methods on the ortholog data within BioCyc.
Computation of functional linkages between two genes G1 and G2 is based on comparative-genomics analyses of the orthologs of G1 and G2 that exist in many genomes. We will refer to a given such pair of orthologs as G1′ and G2′.
BioCyc computes orthologous genes (e.g., that G1 and G1′ are orthologs) as bidirectional best hits obtained from BLAST outputs among most (but not yet all) pairs of genomes within BioCyc. To determine whether yqeB and xdhA are functionally linked, we examine the orthologs of those two genes across a special set of 1800 BioCyc genomes that we call the reference genomes. The reference genomes are chosen to be taxonomically diverse because selection of multiple closely related species could bias the relationships that we consider during computational of functional relatedness (e.g., because gene order will be highly conserved in highly related genomes).
To estimate whether genes G1 and G2 are functionally linked we consider two factors. First is the co-occurrence of their orthologs in many genomes. Are the orthologs typically found in the same genomes? If so their pairwise functional-linkage score will be higher. But the algorithm also considers how many genomes they are found in. For example, if orthologs of G1 and G2 are found in virtually every reference genome, they would have no choice but to co-occur, thus decreasing their functional linkage score. The genome-occurrence profile for a gene is called its phylogenetic profile, and is represented as a vector of ones and zeros — a one when an ortholog of G is found in a given genome, and a zero when no ortholog of G is found in a given genome. If the phylogenetic profile vectors of G1 and G2 are similar then we infer that the genes are likely to be functionally linked. The exact phylogenetic profile method we use is called pp-mutual-info (znorm) in .
The second factor used to infer functional linkage is the spatial proximity of orthologs of G1 and G2. That is, when orthologs G1′ and G2′ do co-occur in the same genome, are G1′ and G2′ frequently close to one another on the same chromosome (e.g., separated by few genes or adjacent)? If so, this conserved genome proximity also provides evidence of functional relatedness. The genomic proximity method we use is a variation of the one called gn-lnX (znorm) in , where correlation between organisms is taken into account, with a resulting improvement in performance. Given this improvement and the fact that both methods are fused to get an additional boost in performance, the resulting score is significantly better than the best method presented in that paper. To our knowledge, this makes the fused scores used for the system, the best genome context scores in the literature.
Pairwise functional-linkage information alone is useful in providing clues regarding the function of one gene based on the known functions of other genes that it is related to. Even more useful is to find larger sets of genes that are functionally related to one another and that therefore may function in a common pathway or biological process.
The method described in this section computes Functionally Linked Gene Clusters, or FunGCs. The algorithm starts with the set of pairwise functional linkages computed in the previous step. The linkages are encoded in a graph whose nodes are the genes of a given organism, and whose edges are the inferred functional linkages among those genes. The algorithm searches the full graph to find highly interconnected subgraphs (they need not be fully interconnected), that is, sets of genes that have a high fraction of mutual functional linkages. Some genes will be removed from the set if they are not sufficiently highly interconnected to other genes in the set. The algorithm we use for computing FunGCs as highly interconnected graphs is CFinder .
When computing FunGCs we must make empirical choices regarding two thresholds. One is the strength of pairwise functional linkage that we deem sufficient for inclusion in the network. The other one is the clique size CFinder uses as starting point, which, in turn, determines the minimum size of the resulting clusters (currently 4). We may perform some adjustment of these thresholds over time to try to seek a proper balance between finding sufficient numbers of interesting new pairwise linkages and FunGCs, versus limiting the number of incorrect pairwise linkages and FunGCs. The current values were chosen to give a good trade-off on E. coli̇ We have performed extensive evaluation and tuning of our methods [6, 7], and in fact the methods currently used for BioCyc have improved somewhat over the published methods. For example, our use of CFinder is a new addition; it does not require full connectivity among genes within a cluster as did the previous method. And because highly-overlapping cliques are merged by CFinder into a bigger groups, this code results in fewer groups which overlap each other much less and are also more accurate (make better predictions of the pathways or protein complexes that we know of in the target organisms). CFinder is run on a network that was created with a very high threshold on the pairwise scores resulting in relatively few high-confidence groups being found. Inspection of the FunGC listing page for E. coli will show that the FunGCs present in BioCyc for genes of known function overlap very strongly with known pathways.
FunGC information is available on two new BioCyc page types:
Genome-Context Analysis Page
FunGC listing page
This page is available for genes with at least one pairwise functional linkage that exceeds our minimum cutoff. Navigate to this page from the gene page by clicking on “Genome Context Analysis” in the right-sidebar menu of the gene page. Not every gene page will have such a page available.
The Genome-Context Analysis Page has two sections: the first section lists the FunGC(s) to which a gene gene has been assigned. The second section lists the pairwise functional linkages that have been computed for a gene. Not every functionally linked gene will appear in a FunGC.
The first section depicts the FunGC as a graph with edge color indicating the strength of pairwise functional linkages. The tabs to the right of the FunGC diagram produce different views of the genes within a FunGC, such as showing any known pathways and operon structures containing these genes.
The Functional Gene Clusters page lists all FunGCs available for an organism. Navigate to this page from the Genome-Context Analysis Page for any gene, using the link near the top (“See all FunGCs for this organism”).
As genomic information becomes available for an ever growing number of organisms, it becomes essential to have up-to-date knowledge resources for organisms with large experimental communities. For example, how can an experimentalist hope to accurately interpret gene-expression microarray results without having all available knowledge about the functions of each gene product in the organism at their fingertips? We see it as essential that experts on the biology of specific organisms begin curation of organism-specific databases describing the genome and biochemical networks of those organisms.
All of the Tier 3 BioCyc PGDBs, and some of the Tier 2 PGDBs, are available for adoption. The DBs are available under an open license agreement, meaning that they are freely available to all, they may be modified, and they may be freely redistributed.
The Pathway Tools software system on which BioCyc is based provides a package of graphical editing tools that allow biologists to create and update database entities such as genes, proteins, metabolic pathways, metabolites, operons, and transcriptional regulatory interactions. Scientists can also use Pathway Tools to publish their PGDBs on their own website, in a manner analogous to the BioCyc website (which is powered by Pathway Tools).
Suggested steps for adoption of PGDBs are as follows.
Initiate the adoption process by contacting biocyc-support at ai dot sri dot com.
Begin to refine the adopted PGDB by comparing its contents against known experimental knowledge for the organism. Update the PGDB to reflect errors such as false-positive pathway predictions, and to add experimentally known pathways that are not in the PGDB. Update gene names and protein functions to reflect recent discoveries.
Enlist colleagues who are specialists in different aspects of the organism’s biology to help. A group of collaborating curators can all run Pathway Tools on separate computers to update a PGDB stored in a commonly accessed relational database manager such as MySQL.
Publish the PGDB on your website using Pathway Tools, or request that SRI include your PGDB within BioCyc. To facilitate this process, SRI will ask you to deposit your PGDB in the database registry as described in the previous step.
For more details on post-adoption care of a PGDB, see Post-Adoption Care of a BioCyc Database.
BioCyc.org is served by a network-load balancer that dispatches user requests to a collection of Linux-based computers.
The BioCyc.org website is powered by the Pathway Tools software [11, 12]. Pathway Tools runs as a long-lived web server process, with web requests handled by Franz AllegroServe and CWEST . BioCyc.org makes use of additional bioinformatics software including BLAST, PatMatch, Clustal Omega, and MSAviewer.
Users can install Pathway Tools on their own computers, where it can run as both a desktop application and as a local web server. Local installations of Pathway Tools can also be used to query and update BioCyc data via APIs in Python, Java, Lisp, Perl, and R. Pathway Tools is written in the Common Lisp language using the Allegro Common Lisp product from Franz Inc .
BioCyc databases are stored in an object-oriented database system called Ocelot , which stores its databases persistently in disk files and in MySQL databases. Data such as orthologs and SmartTables are stored in MySQL databases.
For more details on the architecture of Pathway Tools see .
BioCyc web service APIs are described here .
Please see the comparison section of the MetaCyc Guide .
BioCyc has a number of advanced operations including a number of comparative genomics tools, programs for analysis of high throughput datasets such as gene-expression data, and metabolic network analysis tools. The following information resources describe BioCyc in more detail.
Read the BioCyc Search Help
Read the BioCyc Glossary
Take the BioCyc guided tour
The Pathway Tools software [download] contains the Pathway Tools User’s Guide, a document that provides extensive coverage of all aspects of the software, including an extensive description of the database schema that underlies PGDBs.
BioCyc is grateful for the following groups:
ChemAxon for use of the Marvin chemoinformatics tool
Allegro Common Lisp.
BioCyc Web Services.
M. J. Cipriano, P. N. Novichkov, A. E. Kazakov, D. A. Rodionov, A. P. Arkin,
M. S. Gelfand, and I. Dubchak.
RegTransBase–a database of regulatory sequences and interactions
based on literature: a resource for investigating transcriptional regulation
BMC Genomics, 14:213, 2013.|
The Universal Protein Resource (UniProt) 2009.
Nuc Acids Res, 37(Database issue):D169–174, 2009.|
A.J. Enright, I. Iliopoulos, N.C. Kyrpides, and C.A. Ouzounis.
Protein interaction maps for complete genomes based on gene fusion
Nature, 402:86–90, 1999.|
L. Ferrer, J.M. Dale, and P. D. Karp.
A systematic study of genome context methods: calibration,
normalization and combination.
BMC Bioinformatics, 11:493, 2010.|
L. Ferrer, A.G. Shearer, and P. D. Karp.
Discovering novel subsystems using comparative genomics.
Bioinformatics, 27:2478–85, 2011.|
Distinguishing homologous from analogous proteins.
Systemic zoology, 19:99–113, 1970.|
M.L. Green and P. D. Karp.
A Bayesian method for identifying missing enzymes in predicted
metabolic pathway databases.
BMC Bioinformatics, 5(1):76, 2004.
P. D. Karp, Vinay K. Chaudhri, and Suzanne M. Paley.
A collaborative environment for authoring large knowledge bases.
J Intelligent Information Systems, 13:155–94, 1999.
P. D. Karp, P.E. Midford, R. Billington, A. Kothari, , M. Krummenacker, W.K.
Ong, P. Subhraveti, R. Caspi, I.M Keseler, and S. M. Paley.
Pathway Tools version 23.0 update: Software for pathway/genome
informatics and systems biology.
Brief Bioinform, 22:109––126, 2019.
P. D. Karp, P.E. Midford, S.M. Paley, M. Krummenacker, R. Billington,
A. Kothari, W.K. Ong, P. Subhraveti, I.M. Keseler, and R. Caspi.
Pathway Tools version 23.0: Integrated software for
pathway/genome informatics and systems biology [v3].
arXiv, pages 1–111, 2019.
P. D. Karp, S. M. Paley, M. Krummenacker, M. Latendresse, J.M. Dale, T. Lee,
P. Kaipa, F. Gilham, A. Spaulding, L. Popescu, T. Altman, I. Paulsen, I.M.
Keseler, and R. Caspi.
Pathway Tools version 13.0: Integrated software for
pathway/genome informatics and systems biology.
Brief Bioinform, 11:40–79, 2010.
M. Latendresse and P. D. Karp.
Web-based metabolic network visualization with a zooming user
BMC Bioinformatics, 12:176–84, 2011.
T.J. Lee, I. Paulsen, and P. D. Karp.
Annotation-based inference of transporter function.
Bioinformatics, 24:i259–67, 2008.
R. Overbeek, M. Fonstein, M. D’Souza, G.D. Pusch, and N. Maltsev.
Use of contiguity on the chromosome to predict functional coupling.
In Silico Biol., 1(2):93–108, 1999.|
S. M. Paley and P. D. Karp.
Adapting EcoCyc for use on the World Wide Web.
Gene, 172:GC43–GC50, 1996.|
G. Palla, I. Derenyi, I. Farkas, and T. Vicsek.
Uncovering the overlapping community structure of complex networks in
nature and society.
Nature, 435:814–8, 2005.|
M. A. Peabody, M. R. Laird, C. Vlasschaert, R. Lo, and F. S. Brinkman.
PSORTdb: expanding the bacteria and archaea protein subcellular
localization database to better reflect diversity in cell envelope
Nuc Acids Res, 44(D1):D663–8, 2016.|
M. Pellegrini, E.M. Marcotte, M.J. Thompson, D. Eisenberg, and T.O. Yeates.
Assigning protein functions by comparative genome analysis: Protein
Proc National Academy of Sciences, USA, 96:4285–8, 1999.|
A. G. Perez, V. E. Angarica, A. T. Vasconcelos, and J. Collado-Vides.
Tractor_DB (version 2.0): A database of regulatory interactions in
Nuc Acids Res, 35(Database issue):D132–6, 2007.|
P. Romero and P. D. Karp.
Using functional and organizational information to improve
genome-wide computational prediction of transcription units on
Bioinformatics, 20:709–17, 2004.|