Research

Overview: Natural and Synthetic Biological Networks

The success of genome sequencing projects makes it feasible to identify the gene and protein components that make up an organism. What remains difficult, however, is to measure or predict how these components are organized into functional units – protein complexes, pathways, organelles, and larger biological systems. Even the emerging field of synthetic biology, which aims to re-engineer cells for specific tasks, requires pathway maps as a framework for design.

My research shows how networks of genes and proteins achieve their function. The biological significance of this work has been to predict gene function based on network context, to identify functionally distinct modules representing protein complexes and pathways, and to establish links between genes and disease. Hundreds of our computational predictions have been validated experimentally (Stuart et al. 2007; Qi et al. 2008).

Background and overview. Biological networks encompass many types of interactions that coexist in the cell. Transient protein-protein and protein-DNA interactions occur in signal transduction and gene regulatory networks, often described as the wiring diagram of a cell. Networks formed by stable physical interactions between proteins define a cell’s structural components. Basic biochemical processes are defined by enzyme-substrate relationships in metabolic networks. Beyond these networks composed of physical interactions, epistatic or genetic interaction networks are defined by logical relationships: whether genes in a regulatory network are upstream or downstream of each other, for example, or located in parallel pathway branches that provide robust back-up systems. My group’s work is in three major areas.

First, we map biological pathways as part of interdisciplinary teams. We generated a map of protein-protein interactions in Drosophila (Giot et al. 2003), the first proteome-scale map for any multi-cellular organism. We developed statistical methods that are now widely used to assess the specificity and sensitivity of interactions identified by high-throughput biological technologies. Our group continues to contribute to ongoing experimental mapping campaigns.

Second, we develop new algorithms to analyze biological networks. Key innovations have been new algorithms for analyzing genetic epistasis networks, algorithms for joint analysis of physical and genetic interactions, and new methods for predicting disease associations based on metabolic networks. Our analysis discovered previously unknown components of the phagosome, an organelle responsible for internalizing and digesting microbial pathogens. We also successfully predicted how multiple hits to partially redundant DNA repair pathways are lethal in yeast; mutations in the corresponding human genes are implicated in cancer.

Finally, we develop computational methods for synthetic biology in order to design new biological systems with desired properties. Our lab provides unique web-based resources for team-based editing of DNA sequences from the resolution of a single basepair to an entire chromosome or genome. We are applying these methods to design and build a yeast cell with fully synthetic DNA and to identify which residues of pleiotropic proteins are responsible for each specific function.

Mapping biological pathways

We develop methods that are essential for processing and analyzing raw data from high-throughput biological interaction screens. Interactions obtained from high-throughput studies are often polluted by false positives or spurious interactions. My group was among the first to address interaction data quality as a statistical classification problem. Interaction confidence scores permit the scale-up of protein interaction screens from protein-at-a-time efforts to full proteome scale, much as DNA sequence quality scores enabled the scale-up of DNA sequencing for the human genome project. Our methods were first tested on literature data from yeast (Bader et al. 2004) and applied to an experimental screen in Drosophila (Giot et al. 2003). These methods are now widely used (Fig. 1).
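To make the classification framing concrete, the following is a minimal sketch and not the published scoring method. It assumes a logistic regression classifier, two hypothetical features per interaction (the number of times the interaction was observed, and a 0/1 flag for whether the partners share a localization annotation), and invented toy training labels. The fitted probability plays the role of a confidence score.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, labels, rate=0.1, epochs=2000):
    """Fit logistic regression by stochastic gradient ascent on the log-likelihood."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = y - p  # gradient of the log-likelihood w.r.t. the linear score
            bias += rate * error
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
    return weights, bias


def confidence(x, weights, bias):
    """Confidence score: modeled probability that the interaction is real."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical toy data: [times observed, partners share localization (0/1)].
TRAIN_X = [[3, 1], [2, 1], [4, 1], [1, 0], [1, 0], [1, 1], [2, 0]]
TRAIN_Y = [1, 1, 1, 0, 0, 0, 0]  # 1 = trusted positive, 0 = likely spurious

WEIGHTS, BIAS = fit_logistic(TRAIN_X, TRAIN_Y)
high = confidence([3, 1], WEIGHTS, BIAS)  # repeatedly observed, co-localized
low = confidence([1, 0], WEIGHTS, BIAS)   # seen once, no supporting annotation
```

In this toy setting an interaction seen repeatedly between co-localized proteins receives a higher score than a singleton observation. A real screen would need curated positive and negative training sets and held-out validation.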

Fig. 1. A protein interaction map for Drosophila shows proteins as circles, color-coded by cellular localization (green = nuclear, blue = cytoplasmic, red = extracellular, etc.), and interactions as lines, color-coded by statistical confidence (from Giot et al. 2003). This was the first proteome-scale interaction map for any multi-cellular organism.

The total number of protein-protein interactions in human and simpler model organisms remains an open question. The answer to this question determines the resources required to generate an interaction map, including the possibility of diminishing returns as a project nears completion. My group developed a mathematical framework to predict the total number of interactions and the approach to full coverage for a mapping project. The method extends capture-recapture theory, which estimates population sizes from individuals recaptured in independent samples, by accounting for measurement error in high-throughput screens (Huang et al. 2007; Huang and Bader 2008). Surprisingly, this is the first time that measurement error has been considered in the capture-recapture setting. Our published work predicts roughly 1 million pairwise protein-protein interactions for the human proteome. As an intermediate result, we also provide Bayesian model selection criteria that distinguish between networks with power-law degree distributions, and networks with distributions that are heavy-tailed but not scale-free. Our results provide a new capability for cost-benefit analysis: how many new interactions will be discovered per dollar invested in an interaction screen.
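The idea behind capture-recapture with measurement error can be sketched with the classical two-sample (Lincoln-Petersen) estimator. This is a simplified illustration and not the framework of Huang et al. It assumes two independent screens, a known fraction of false positives in each screen, and that false positives essentially never recur in both screens, so that every overlapping interaction is real. All numbers are invented.

```python
def lincoln_petersen(n1, n2, overlap):
    """Classical estimate of population size from two independent samples."""
    if overlap <= 0:
        raise ValueError("at least one recaptured interaction is required")
    return n1 * n2 / overlap


def error_adjusted_estimate(n1, n2, overlap, false_frac1, false_frac2):
    """Estimate the number of true interactions when each screen reports errors.

    Assumption: false positives are screen-specific noise and do not appear
    in the overlap, so only the per-screen counts are corrected.
    """
    true1 = n1 * (1.0 - false_frac1)  # expected true interactions in screen 1
    true2 = n2 * (1.0 - false_frac2)  # expected true interactions in screen 2
    return lincoln_petersen(true1, true2, overlap)


def coverage(n, false_frac, total_true):
    """Fraction of all true interactions that one screen has found so far."""
    return n * (1.0 - false_frac) / total_true


naive = lincoln_petersen(1000, 800, 40)
adjusted = error_adjusted_estimate(1000, 800, 40, 0.5, 0.5)
screen1_coverage = coverage(1000, 0.5, adjusted)
```

With these invented counts, ignoring measurement error gives 20,000 interactions, while assuming half of each screen is spurious gives 5,000. Uncorrected false positives inflate the estimated size of the interactome and therefore the projected cost of completing a map.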

Individual technologies can provide complementary information for mapping protein pathways. Low-abundance or weakly associated proteins can be missed by methods that rely on protein purifications. Other technologies, such as two-hybrid screens, may be more effective in detecting their interactions but have less in vivo context. We have developed algorithms that merge networks to predict missing components and as-yet unobserved interactions. As part of a collaborative effort to define the components of the phagosome, an organelle responsible for engulfing both food and microbial pathogens, we developed optimized heuristics to identify likely components missed by large-scale proteomics (Stuart et al. 2007). Our algorithm made the novel prediction that the exocyst complex (previously associated with exocytosis, or delivery of material to the cell exterior) is required for phagocytosis; this prediction was confirmed using RNAi (Fig. 2).

Fig. 2. The phagosome is the organelle responsible for internalizing microbes as part of innate immunity. It matures from the endosome, which is tethered to the cell-surface phagocytic cup by the exocyst complex (red proteins). Our computational analyses predicted the participation of exocyst components that were undetected by initial experiments. More sensitive assays confirmed their presence, and RNAi assays confirmed their functional role (from Stuart et al. 2007).


We are currently developing methods for analyzing synthetic lethal interaction data in joint work with the Boeke lab at Johns Hopkins. The goal of this study is to identify all pairwise synthetic lethal interactions between non-essential genes in yeast. Our role in these studies is to create a data pipeline that analyzes the raw data, predicts novel interactions for testing, and parses a network into logically distinct modules and pathways. We helped generate significant new network data resources, including gene networks relevant to DNA damage repair (Pan et al. 2005) and histone acetylation and deacetylation (Lin et al. 2008).

Computational analysis of networks and pathways

Our group has been a leader in developing effective methods for combining different types of networks into a coherent, predictive picture of biological function. An important property of biological systems is their robustness, the ability to retain cellular fitness despite the mutation of an individual gene. By extension, human diseases may be caused by multiple genetic and environmental hits to distinct pathways rather than a mutation in a single pathway. We developed methods that integrate physical and genetic interactions to show how hits to multiple pathways compromise cellular fitness (Ye et al. 2005a). Using a circuit analogy, physical interactions define serial pathways. Evolution may select for partially redundant, parallel pathways that are robust to loss of any single branch (Fig. 3). Genetic interactions define the outcome of experiments that delete pairs of genes from the network, equivalent to cutting branches. If the two genes reside in a single serial branch, the remaining branches provide backup, and the cell is viable. If the two genes reside in different branches, however, the overall circuit may no longer function, and the cell dies. This phenotype is termed synthetic lethality.
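The circuit analogy can be sketched in a few lines. This is a toy model under stated assumptions: an essential function is served by parallel branches, each branch is a series of genes that fails if any member is deleted, and the cell is viable when at least one branch is intact. The gene names are hypothetical placeholders.

```python
from itertools import combinations


def is_viable(branches, deleted):
    """The cell survives if at least one branch has lost none of its genes."""
    deleted = set(deleted)
    return any(not (set(branch) & deleted) for branch in branches)


def synthetic_lethal_pairs(branches):
    """List gene pairs whose double deletion is lethal although each single deletion is not."""
    genes = sorted({gene for branch in branches for gene in branch})
    pairs = []
    for a, b in combinations(genes, 2):
        singles_viable = is_viable(branches, {a}) and is_viable(branches, {b})
        if singles_viable and not is_viable(branches, {a, b}):
            pairs.append((a, b))
    return pairs


# Two redundant branches, each a serial pathway of two hypothetical genes.
BRANCHES = [["A1", "A2"], ["B1", "B2"]]
SL_PAIRS = synthetic_lethal_pairs(BRANCHES)
# SL_PAIRS == [("A1", "B1"), ("A1", "B2"), ("A2", "B1"), ("A2", "B2")]
```

Every synthetic lethal pair in this model spans the two branches, and no pair within a branch is synthetic lethal. That is the between-pathway signature the paragraph describes.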

Fig. 3. Cancer may be caused by hits to multiple pathways that contribute to DNA stability and repair. These processes can be studied in yeast using genetic interactions, where synthetic lethal interactions (red lines) suggest that two genes function in independent pathways. The Cdc14 early anaphase release pathway (FEAR), mitotic exit network (MEN), and Sin3/Rpd3 histone deacetylase complex (HDAC) have partially redundant function in releasing Cdc14 to permit mitotic exit. The overall network is robust to deleting any single branch but not to pairs of branches (adapted from Ye et al. 2005a).


This biological model was developed into several mathematical and computational approaches for joint analysis of physical and genetic interaction networks. Using data from yeast, we showed that genes that share synthetic lethal partners are more functionally related than genes with direct synthetic lethal interactions, which are hence inferred to lie in different pathway branches (Ye et al. 2005a). Similarly, networks inferred between genes based on shared synthetic lethal partners have enriched subgraphs (network motifs) similar to physical interaction networks, whereas the synthetic lethal networks themselves have very different motifs (Ye et al. 2005b). We used these methods to predict the function of uncharacterized yeast proteins and experimentally confirmed novel members of the dynein/dynactin motor protein complex (Ye et al. 2005a).
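A simple way to see why shared partners indicate a common pathway is to score each gene pair by the overlap of their synthetic lethal partner sets. The sketch below uses the Jaccard index as a stand-in similarity measure. That choice is an assumption made for illustration and is not the statistic used in Ye et al. The edge list and gene names are hypothetical.

```python
from itertools import combinations


def partner_sets(edges):
    """Map each gene to the set of its synthetic lethal partners."""
    partners = {}
    for a, b in edges:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    return partners


def shared_partner_scores(edges):
    """Jaccard overlap of partner sets for every gene pair in the network."""
    partners = partner_sets(edges)
    scores = {}
    for a, b in combinations(sorted(partners), 2):
        shared = partners[a] & partners[b]
        union = partners[a] | partners[b]  # never empty: every gene has a partner
        scores[(a, b)] = len(shared) / len(union)
    return scores


# Hypothetical synthetic lethal edges between two redundant pathways A and B.
SL_EDGES = [("A1", "B1"), ("A1", "B2"), ("A2", "B1"), ("A2", "B2")]
SCORES = shared_partner_scores(SL_EDGES)
```

Here A1 and A2 have no direct synthetic lethal interaction but share all of their partners (score 1.0), whereas the directly interacting pair A1 and B1 shares none (score 0.0).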

One of the goals of computational biology is to suggest or even supplant wet-lab experiments with accurate predictions of experimental results. For biological networks, this can mean predicting which untested interactions actually exist, or more generally predicting which genes and proteins are most closely related to a pre-selected set. The general network search problem has many biological applications: predicting members of a protein complex; identifying members of a signal transduction pathway; or identifying candidate genes contributing to a model organism phenotype or a human disease. We have developed several new algorithms that have improved the ability to search for functionally related genes or proteins in an interaction network. We showed that using data quality scores for interactions improves search results (Bader 2003). Similarly, we have used experimental evidence to bias searches for cancer signal transduction pathways using Bayesian network algorithms (Bose et al. 2006; Guha et al. 2008). We also developed a new expectation-maximization approach for biclustering that predicts protein complexes and pathways from genetic interaction networks (Qi et al. 2005).

For genetic interaction networks in particular, we have developed new search algorithms that outperform previous methods (Qi et al. 2008). Our methods build on graph diffusion kernels, a technical term for counting the number of weighted paths that connect two genes or proteins in a network. Google’s PageRank employs this type of algorithm for Internet search, and we showed that it is effective for predicting new members of protein complexes (Huang et al. 2007b). Unfortunately, when applied to genetic interaction networks (such as the red edges in Fig. 3), this method scrambles within-pathway and between-pathway information. Our insight was to consider even-length and odd-length paths separately. When applied to genetic interaction networks, the even-length paths predict physical interactions, and the odd-length paths predict genetic interactions. We used this method to predict genetic interaction partners of ADA2 and ESA1, two yeast histone acetyltransferases. Roughly half of our predictions were experimentally verified, and we were able to identify true interactions that were missed by a high-throughput screen (Qi et al. 2008). These results demonstrate that computational predictions can augment experimental interaction maps. Our parity-specific kernels may have utility more broadly for analyzing social networks whose connections represent friendship (analogous to protein interactions) and dislike (analogous to genetic interactions).
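The parity idea can be illustrated by counting weighted walks of even and odd length separately. The sketch below is an assumption-laden simplification and not the kernel published in Qi et al. 2008. It truncates the sum at a fixed maximum length, down-weights a walk of length k by a hypothetical factor decay**k, and uses plain lists instead of a numerical library. Powers of the adjacency matrix count walks, which may revisit nodes, rather than simple paths.

```python
def matmul(X, Y):
    """Multiply two square matrices given as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def parity_walk_scores(adjacency, decay=0.5, max_length=6):
    """Return (even, odd) matrices of decay-weighted walk counts between nodes."""
    n = len(adjacency)
    even = [[0.0] * n for _ in range(n)]
    odd = [[0.0] * n for _ in range(n)]
    power = [[float(i == j) for j in range(n)] for i in range(n)]  # identity matrix
    for length in range(1, max_length + 1):
        power = matmul(power, adjacency)  # entry (i, j) counts walks of this length
        target = even if length % 2 == 0 else odd
        for i in range(n):
            for j in range(n):
                target[i][j] += (decay ** length) * power[i][j]
    return even, odd


# Hypothetical synthetic lethal network; node order is A1, A2, B1, B2.
SL_ADJACENCY = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
EVEN, ODD = parity_walk_scores(SL_ADJACENCY)
```

In this toy network A1 and A2 are linked only by even-length walks, consistent with membership in the same pathway, while A1 and B1 are linked only by odd-length walks, consistent with a genetic interaction between pathways. Summing both parities into one score would blur that distinction.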

A new focus in our group is to understand how defects in metabolic enzymes cause disease (Fig. 4). Lesch-Nyhan Syndrome is an example in which the cause is known, a purine salvage gene defect, yet the mechanism leading to this neurological disorder remains cryptic. Our goal is to develop general methods that link mutations in metabolic pathways to other biological networks and to disease, with the hope that revealing the mechanism will suggest possible therapies. Our approach has been to link enzymes based on correlated demands over different cellular states, which can reveal unsuspected relationships between metabolic pathways. We introduced a new technique for predicting enzyme flux correlations based on existing flux balance metabolic reconstructions (Veeramani and Bader 2009). We showed that correlations in metabolic flux are an excellent predictor of genetic interactions in yeast. Furthermore, for Lesch-Nyhan Syndrome, we identified a hitherto unsuspected metabolic correlation between the causative mutation and a metabolic enzyme in the ribose phosphate biosynthesis pathway whose defects also cause neurological disorders.

Fig. 4. Mutations to metabolic enzymes can cause disease by shifting the balance of cellular metabolites. Mutations to the human purine metabolism gene HPRT1 cause the neurological disorder Lesch-Nyhan Syndrome. We investigated the structure of the purine metabolic network using the simpler yeast model to understand why mutations in other purine metabolism genes do not lead to neurological disorders. The yeast ortholog HPT1 (yellow), genes in purine salvage pathways (red), and genes in purine biosynthesis pathways (pink) are connected by edges that indicate shared metabolites (solid lines) and correlated fluxes over different environments (dashed lines). While HPT1 is not strongly connected to any other purine metabolism genes, it is strongly connected to the PRS complex (green). Notably, mutations to the human orthologs of PRS1 do cause neurological disorders (from Veeramani and Bader 2009).

Synthetic biology

Synthetic biology is the engineering counterpart to genomics: designing and building individual gene or protein parts, combining parts to create pathways and devices, and even creating new organisms from the DNA up. DNA sequences are biological programs, and synthesizing new biological programs requires a development environment similar to those for large software projects. My group provides unique resources for designing DNA sequences from the level of individual nucleotides to entire genomes. Genome-scale design is provided by BIOSTUDIO (http://baderlab.bme.jhu.edu/biostudio/), which permits collaborative editing of a single DNA sequence by teams of biologists. This resource was directly motivated by integrated development environments and revision control systems used by software engineers. It is built on top of an existing genome database and visualization platform provided by the Generic Model Organism Database (GMOD) project. Functionality includes the ability to spawn development branches, perform batch edits or substitutions, accept or reject tracked changes, and export sequence files to vendors for ordering physical DNA.

A second public resource, GENEDESIGN (http://baderlab.bme.jhu.edu/gd/), provides fine-grained functionality at the gene level. It provides computer-aided-design modules for synonymous codon substitutions, restriction site design, and generation of tiling oligo order sheets that can be transmitted directly to DNA vendors. We are developing new capabilities for GENEDESIGN, such as the ability to design hypomorphic or temperature-sensitive alleles of essential genes for use in genetic screens.
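As an illustration of how synonymous codon substitution supports restriction site design, the sketch below removes a recognition site from a coding sequence without changing the encoded protein. It is not GENEDESIGN code. The function names are hypothetical, the standard genetic code is assumed, the input is assumed to be an in-frame sequence whose length is a multiple of three, and only a single codon change is attempted.

```python
from itertools import product

# Standard genetic code, built in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate an in-frame DNA sequence into one-letter amino acid codes."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def remove_site(seq, site):
    """Return a synonymous recoding of seq that lacks site, or None if no
    single codon substitution removes every occurrence."""
    pos = seq.find(site)
    if pos == -1:
        return seq  # nothing to remove
    first = (pos // 3) * 3                   # first codon overlapping the site
    last = ((pos + len(site) - 1) // 3) * 3  # last codon overlapping the site
    for start in range(first, last + 1, 3):
        codon = seq[start:start + 3]
        for alt, aa in CODON_TABLE.items():
            if aa == CODON_TABLE[codon] and alt != codon:
                candidate = seq[:start] + alt + seq[start + 3:]
                if site not in candidate:
                    return candidate
    return None


ORIGINAL = "ATGGAATTCTAA"                 # encodes M-E-F-stop, contains GAATTC
RECODED = remove_site(ORIGINAL, "GAATTC")  # GAATTC is the EcoRI recognition site
```

For this short example the glutamate codon GAA is swapped for its synonym GAG, which removes the EcoRI site while leaving the translation unchanged. A production tool would also weigh codon usage and check both strands for non-palindromic sites.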

These resources are being used to generate a synthetic yeast strain that can identify locally minimal genomes (genomes where one additional gene deletion causes lethality). The innovation in this project, a collaboration with the Boeke lab, is to design a yeast strain to lose pieces of DNA as a stochastic process, creating a swarm of genotypes that can be tested in parallel for viability. My lab has been responsible for developing the algorithms that permit insertion of loxP recombination sites and unique sequence tags throughout the yeast genome. We have also provided the workflow software for a student course that has been the source of much of the DNA for this project (Dymond et al. 2009).

Synthetic biology provides new capabilities for reverse genetic screens, in which libraries of gene sequences are pre-synthesized and tested for biological function. Libraries of protein mutants are valuable for pleiotropic proteins where individual residues are responsible for distinct functions. This is true of histone proteins, for which modifications to specific residues are correlated with processes such as chromatin silencing, transcriptional activity, and DNA damage sensing and repair. We have developed a new database for storing and analyzing data from large-scale mutant screens (Huang et al. 2009). An innovative feature of this database is the ability to overlay phenotypes from synthetic library screens with protein structure, identifying protein regions responsible for particular biological functions. The first instance of this database, http://www.histonehits.org/, contains data from a comprehensive reverse genetics study of mutations in yeast histones (Dai et al. 2008).

Design of synthetic genes and proteins requires accurate prediction of function. As a counterpart to our synthetic biology work, we have been developing methods to predict the specificity of protein-DNA interactions using all-atom simulations of transcription factor proteins to calculate binding free energies with DNA sequences. Initial tests with homeodomain proteins suggest that we may be able to identify binding specificities for $1000 to $2000 per protein based on the amortized cost of CPU cycles (Liu and Bader 2006, 2007), a competitive cost compared to wet-lab alternatives.
