Phylogenomics: Foundations, Methods, and Pathogen Analysis
Ebook, 1,462 pages, 15 hours

About this ebook

Phylogenomics: Foundations, Methods, and Pathogen Analysis offers a deep overview of phylogenomics as a field, compelling recent developments, and detailed methods and approaches for conducting new research. Early chapters introduce phylogenomic taxonomies of organisms and pathogens, phylogenomic networks, phylogenomics of virus virulence, and ancient DNA analysis, with a second section offering methods, detailed descriptions and step-by-step instruction in genome assembly and annotation, horizontal gene transfer studies, Bayesian evaluation, phylogenetic tree building, microbial evolution modeling, and molecular epidemiology.

The book's final section offers various examples of phylogenomic analysis across medically significant bacteria and viruses, including Yersinia pestis, Salmonella, Shigella, Vibrio cholerae, and Mycobacterium tuberculosis, among others.

  • Offers a full overview of phylogenetics and phylogenomics, from its foundations to methods and specialized case studies
  • Presents methodologies and algorithms for phylogenomic research studies and analyzes medically significant microorganisms
  • Considers examples of phylogenomic analysis across a range of medically significant pathogens
  • Includes chapter contributions from leading international experts
Language: English
Release date: May 17, 2024
ISBN: 9780323913096

    Book preview

    Phylogenomics - Igor Mokrousov

    Preface

    Igor Mokrousov and Egor Shitikov

    Phylogenetics, viewed in parallel with systematics, can become a decisive factor in proving or rejecting new biological species. The concept of phylogenetic species was once proposed as a new step in the terminological evolution of the concept of biological species. Initially, and for decades, the field primarily focused on using single genes or their portions. It was revolutionized by the advent of microarray analysis, followed by whole genome sequencing and high-throughput next-generation technologies that led to the collection of truly Big Data, impossible to manage without appropriate computer resources and software tools. Phylogenomics has naturally emerged as an amalgamation of phylogenetics, genomics, and bioinformatics. This interdisciplinary approach has allowed for a new level of discrimination and insight within the tree of life, extending from interstrain variation to higher taxonomic units and back to intrastrain genetic diversity, a consequence of within-host microevolution of the same strain. Nonetheless, there is still much to be gained from the systemic analysis of multiomics-based information. These data, when superimposed on phylogenomics and both global and local phylogenies, could ultimately lead to a more refined tree of life and a deeper understanding of the origins of life on Earth.

    An informed understanding is crucial for adequate interpretation of bioinformatics-based findings. Interpretation remains a challenge, one that is even more pronounced at the present time of ever more advanced technology and algorithms. Generating or retrieving a huge amount of technical data from public databases and submitting it to seemingly sophisticated analysis appears relatively easy, but the added value is not always apparent without truly scientific interpretation. Remarkable advances have been made in deciphering genomics and the evolution of some medically important bacterial species, including the development of online tools and molecular databases. However, these advances in genome bioinformatics also exemplify what may be termed click science. Unlike fascinating and sophisticated click languages, click science relies on the uncritical and simplified perception of knowledge and a dogmatic, iconographic view of the indications provided by the increasingly user-friendly online phylogenetics tools. This situation may be counteracted by going beyond a narrow and specialized technical field toward a broader view of the interrelated areas within the field of phylogenomics. Insightful hypothesis-making requires broad knowledge that may be a source of inspiration, which is no less crucial for the advancement of science than technological advances.

    Phylogenomics: Foundations, Methods, and Pathogen Analysis is intended to present, albeit nonexhaustively, a wide range of topics and chapters covering the field of phylogenetics and phylogenomics, from basic concepts to the specialized papers summarizing the particular methodologies and the most relevant pathogenic microorganisms. The diversity of the topics will permit an exchange of ideas and will allow a reader to go beyond the sometimes narrow field of personal expertise to gain a broader knowledge and hopefully inspire new fruitful ideas.

    This book offers a deep overview of phylogenomics as a field, compelling recent developments, and detailed methods and approaches for conducting new research. Early chapters introduce phylogenomic analysis of viruses and bacteria, deciphering bacterial outbreaks, and evolution of drug resistance and virulence. Chapter 1 presents a genome-based perspective on the origin and evolution of viruses, the key features of their phylogenomics, principles of phylogenetic analysis, and the main methodologies for the reconstruction of phylogenomic trees and networks. Different aspects of the application of next-generation sequencing for bacterial classification are discussed in Chapter 2, including general patterns of relationship between genotypes and phenotypes and study of interactions between bacterial and host genomes. Chapter 3 demonstrates how phylogenomics can contribute to the genomic epidemiology of medically significant bacterial pathogens, with a particular focus on outbreak investigation, and the applications of these methodologies in forensic contexts are also discussed. Chapters 4 and 5 present a comprehensive review of the evolution and genomic basis of drug resistance and virulence in bacterial species, respectively.

    Part II of the book is focused on methods offering instruction in modeling evolutionary changes of k-mer patterns of bacterial genomes (Chapter 6), use of the Bayesian approach in molecular epidemiology and assessment of temporal signal (Chapter 7), evolutionary reconstruction in the presence of mosaic sequences (Chapter 8), and use of tools for SNP calling and dealing with big datasets (Chapter 9).

    Part III offers various examples of phylogenomic analysis across medically significant bacteria, such as Salmonella, mycobacteria, Shigella, Leptospira, Yersinia pestis, Staphylococcus aureus, and Corynebacterium diphtheriae (Chapters 10–16, 21, and 24); viruses, such as HIV-1, measles virus, flaviviruses, and RSV (Chapters 17–20); and yeast pathogens, such as Cryptococcus neoformans/gattii (Chapter 23). Last in this list, but not least in interest, is a chapter dedicated to the fascinating field of research on ancient pathogens (Chapter 22).

    We believe that the unique features of this volume are as follows. First, the volume covers a wide range of topics in phylogenetics and phylogenomics, spanning from foundational concepts to specialized papers. Second, specialized chapters delve into specific methodologies and algorithms or focus on the most medically significant microorganisms. Third, the diverse range of topics encourages readers to expand beyond their specific areas of interest, exploring methods and ideas in adjacent fields, thereby gaining broader knowledge and applying this newfound understanding to generate fresh, fruitful ideas and insights.

    Summing up, this book is a comprehensive collection of research chapters that cover foundational concepts as well as recent advancements in technology and algorithms and will be of interest to scientists in various fields across microbiology and genetics who work with similar data and face comparable challenges. This volume primarily focuses on medically significant bacteria and viruses. An update on other relevant organisms as well as the use of artificial intelligence will follow in further editions.

    Part I

    General topics and foundations

    Outline

    Chapter 1 Phylogenomic analysis and the origin and early evolution of viruses

    Chapter 2 Application of next-generation sequencing for genetic and phenotypic studies of bacteria

    Chapter 3 Genomic insights into deciphering bacterial outbreaks

    Chapter 4 Drug resistance in bacteria, molecular mechanisms, and evolution

    Chapter 5 Virulence evolution of bacterial species

    Chapter 1

    Phylogenomic analysis and the origin and early evolution of viruses

    Gustavo Caetano-Anollés, Evolutionary Bioinformatics Laboratory, Department of Crop Sciences and C.R. Woese Institute for Genomic Biology, University of Illinois, Urbana, IL, United States

    Abstract

    Phylogenomics is a wide-encompassing field that explores the evolution of genomes and downstream ome repertoires of cellular organisms and viruses. Here I explore the principles of retrodiction, phylogenetic analysis, the main methodologies for the reconstruction of rooted phylogenomic trees, and the general difficulties of building history with modern algorithmic implementations. Phylogenomic approaches to retrodiction involve alignment-dependent and alignment-free methodologies. To illustrate their power, recent research explorations of the origin and evolution of viruses reveal benefits and opportunities that molecular structure and function provide to evolutionary reconstruction. Phylogenomic data-driven insights support an early cellular origin of viruses, strongly refuting viruses being mere byproducts of cellular evolution or conflict. Instead, viruses and cells have shared a common evolutionary path since their inception, which already manifests in the comparative genomic census of protein structural domains and other biological features that are highly conserved.

    Keywords

    Domains of life; evolutionary genomics; GO terms; last universal common ancestor; molecular function; origin of viruses; protein domains; proteomes; rooting

    1.1 Introduction

    The genomic revolution fueled by next-generation sequencing methodologies changed the landscape of what was possible. The JGI-sponsored Genomes OnLine Database (GOLD) currently lists millions of manually curated genome project entries and over 600 metadata fields organized into different categories [1]. As of July 12, 2022, the sequencing projects category included 27,523 complete genome sequencing projects, 296,489 permanent drafts, and 153,982 ongoing or incomplete projects. This exponentially growing wealth of genomic information, illustrated in the accumulation plots of Fig. 1.1, enables both comparative and evolutionary explorations, which are now part of the emerging scientific field of evolutionary genomics. The genomic data deluge also enhances or makes possible a number of downstream applications that focus on other omes, including the exploration of the transcriptome with RNA-seq methodologies or experimental and computational analyses of the proteome or the metabolome. I here focus on the interface between genomics and evolution, emphasizing the exploration of biological origins and evolution with retrodiction approaches. I will explore the principles of phylogenetic analysis, the main methodologies for the reconstruction of phylogenomic trees and networks, and the general difficulties of building history with modern algorithmic implementations. I end by illustrating the challenges of interpreting phylogenies and understanding evolution with an exploration of the origin and evolution of viruses.

    Figure 1.1 Increase in the number of genome sequencing projects with time. There are currently 484,539 genome and metagenome projects listed in the GOLD database targeting cellular organisms, viruses, and environmental samples. GOLD, Genomes OnLine Database.

    1.2 Retrodiction

    Retrodiction is the ability to travel back in time, either computationally, experimentally, or both. One goal of retrodiction is predicting the present and future, for example, to advance systems and synthetic biology applications. Computational retrodiction strategies generally entail building trees with or without reticulations (phylogenies) from data and models of evolutionary change using advanced tools of computational analysis. These phylogenies describe the history of features of interest in biological data, which are placed within a framework of time and change. In contrast, experimental retrodiction strategies often involve resurrection of computationally reconstructed or preserved (sometimes ancient) macromolecules in the laboratory, or when possible, the preservation of cellular organisms or viruses in time as these are evolving. These procedures include synthesizing nucleic acids or proteins; extracting ancient DNA from amber, ice cores, or other materials; or keeping viruses or other microbes under frozen laboratory conditions for ongoing or future research, including comparative or phylogenetic analysis. The COVID-19 pandemic illustrates how computational and experimental retrodiction strategies can productively interface to support each other. While SARS-CoV-2 viral variants are being collected, preserved, and their genomes sequenced, exhaustive phylogenomic reconstructions describing the evolution of viruses sampled from over 10 million genomic sequences are being analyzed in real time as the pandemic unfolds (Fig. 1.2). This provides a continuously updated phylogenetic view of viral diversity and a unique window into the evolutionary understanding of a pathogen of great planetary significance. 
The continuous monitoring of mutations, as these unfold in the viral quasispecies responsible for the COVID-19 pandemic, reveals how seemingly random mutations are tailoring the structure and associated functionalities of the viral proteins as these interact with the human host to further viral persistence. This process leads to the accumulation of sets of mutations that are often inherited together (haplotypes) and are able to structure the viral quasispecies in response to the environment, the seasons, and human interventions [2,3].

    Figure 1.2 Tracing the evolution of the SARS-CoV-2 virus in real-time. (A) A phylogenetic tree obtained using the maximum likelihood optimality criterion describes the worldwide history of the SARS-CoV-2 genome. The time tree of 2884 genomes randomly sampled from about 10 million sequences deposited between December 2019 and July 2022 was retrieved from the automated Nextstrain system (http://nextstrain.org) on July 14, 2022. The tree unfolds the time of genome collection date from left to right. Its leaves (taxa indicated with symbols) are colored according to clade group (groups of taxa with a common evolutionary origin) and VOC nomenclature in parentheses. (B) The time tree is redrawn and placed in the context of a molecular clock describing a mutation rate of substitutions per year. Note the rise of VOC omicron and its very concerning variants (22B and 22C). VOC, Variant of concern.

    While both computational and experimental retrodiction strategies are central endeavors in biology, computational retrodiction is by far the most utilized methodology, especially because it allows us to classify biological macromolecules and extract information useful to unravel their origin and evolution. This strategy is the only one that can explore deep evolutionary questions. In fact, the field of bioinformatics, in both the algorithmic and the applied bioinformatic flavors, originated with the goal of understanding deep and proper historical relationships of the genes and molecules of the biological world [4,5].

    Retrodiction is part of the ideographic scientific method of process and history. This method attempts to explain how present events have been molded by the past through irreducible individual and unrepeatable occurrences. Together with the nomothetic scientific method that searches for universal statements capable of explaining present events with high predictive power, the ideographic method develops explanatory power capable of describing past, present, and future events in natural systems. This explanatory power is, however, restricted by the passage of time. Reality is a conjectured state of things. Karl R. Popper [6] dissected this reality in three realms or interacting worlds (Box 1.1). A set of observable features, characters, or events in these interacting realms (especially Worlds 1 and 2) describes the world of experience. When reality becomes dependent on time, it can be modeled as a Markov chain, an ordered set of states x1, x2, …, xn, where the probability of evolving from state xi to xj depends on the value |i − j|, the number of time steps. The smaller the difference between i and j the larger the probability that the descendant has state xj, given an ancestor with state xi. This Markovian property describes how time erodes information in the present that can explain the past [7,8]. The information that is left provides historical traces under different models of evolutionary change, which can then be mined to reconstruct history following the ideographic framework of science. Note, however, that if the number of time steps or the time between steps tends to 0, the ability to reconstruct history increases to a point where history becomes irrelevant; our statement of reality can be simply dissected with nomothetic methods. By the same token, if we consider that natural processes that connect the past to the present are not historical but deterministic (directed by a Laplacian demon), then nomothetic methods suffice in their prediction abilities. 
This optimistic view is only possible in a purely deterministic world and if our demon is realistic. However, this is unwarranted for several general reasons [7]. First, regular Markov chains lose information exponentially with the passage of time; information destruction is only delayed in some cases by the presence of drift or balancing selection. Second, the data processing inequality of causal chains guarantees that information of an effect E given its cause D is less than or equal to both the information of E given a proximate cause P and the information of P given D. This inequality, which is an internal property of the chain independent of temporal loss of information, makes sure that information content of a signal cannot be increased via local operators. In other words, given a Markov chain X → Y → Z, no deterministic or random processing of Y can increase the information that Y contains about X, that is, mutual information must satisfy I(X;Y) ≥ I(X;Z). Third, the mapping of D, P, and E may not be one-to-one. Instead, ancient ancestors may represent ancient relatives and many past events could map to the present. The more entangled the path from the past to the present the more historical is the retrodiction enterprise and the more difficult its dissection with statistical approaches such as maximum likelihood or Bayesian analysis. Thus, reconstructing history from events that are recent (such as those described in Fig. 1.1) is significantly less difficult than exploring time events that occurred in our distant past, especially if those occurred perhaps billions of years ago (Gya).
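The first two reasons above can be illustrated numerically. The following is a minimal sketch, not the author's analysis: it assumes a hypothetical symmetric four-state chain (loosely analogous to a Jukes-Cantor nucleotide model) with an illustrative per-step change probability `mu`, and it computes the mutual information between the ancestral state X0 and the descendant state Xt. The function names and parameter values are assumptions made for this example only.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in bits) of a joint probability table p(x, y)."""
    px = joint.sum(axis=1, keepdims=True)   # marginal of the ancestor
    py = joint.sum(axis=0, keepdims=True)   # marginal of the descendant
    mask = joint > 0                        # skip zero cells (0 * log 0 = 0)
    ratio = joint[mask] / (px @ py)[mask]
    return float((joint[mask] * np.log2(ratio)).sum())

def information_decay(mu=0.05, n_states=4, steps=(1, 5, 10, 50, 100)):
    """I(X0; Xt) for a symmetric chain with per-step change probability mu."""
    # One-step transition matrix: stay with probability 1 - mu,
    # otherwise move to any other state with equal probability.
    P = np.full((n_states, n_states), mu / (n_states - 1))
    np.fill_diagonal(P, 1.0 - mu)
    prior = np.full(n_states, 1.0 / n_states)   # uniform ancestral state
    result = {}
    for t in steps:
        Pt = np.linalg.matrix_power(P, t)       # t-step transition matrix
        joint = prior[:, None] * Pt             # p(x0, xt)
        result[t] = mutual_information(joint)
    return result

if __name__ == "__main__":
    for t, bits in information_decay().items():
        print(f"t = {t:3d} steps: I(X0; Xt) = {bits:.4f} bits")
```

Under these assumptions the information that the present holds about the past shrinks with every additional step, and because X0 → X5 → X10 is itself a Markov chain, the values also satisfy the data processing inequality I(X0; X5) ≥ I(X0; X10).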

    Box 1.1

    Popper’s three worlds of reality [6].

    World 1: The physical world of reality. A world of physical and biological entities devoid of self-referential meaning expressing matter/energy and information.

    World 2: The mental world of experience embodied in sensory perception models of the real world that provide objective or subjective meaning.

    World 3: The world of products (artifacts and abstractions) that link reality and experience. This abstract and physical world enables communication.

    Unearthing historical traces in extant biology requires complying with three major evolutionary axioms (starting points of reason) and making them explicit in computations. The axioms were advanced by Edward O. Wiley [9] and should be considered fruitful principles of discovery:

    Axiom 1

    —Continuity: Evolution occurs, with change entailing spatiotemporal continuity. The axiom complies with the principle of continuity embodied in Leibniz’s lex continui or Linnaeus’ "natura non facit saltum," in both its mathematical and metaphysical conceptualizations. Epistemologically, these laws guarantee succession of events, continuity of existence, and gradual change, even in the presence of leaps (saltation events). Saltation has been recently modeled as multimutational leaps relevant under elevated mutation rates or under stress conditions [10]. These modeling exercises can span regimes in which mutations occur consecutively and are fixed or discarded one-by-one or regimes that involve a multiplicity of mutations being fixed per generation. It is interesting to see the progression of mutation rates of SARS-CoV-2 in real-time samples over the course of two and a half years of genomic exploration. Typical rates of about 23 substitutions/year since the common ancestor (with a minimum of 22 substitutions/year in December 2020) quickly expanded to about 32 substitutions/year with the rise of the new Omicron variants of concern (VOC) at the beginning of the year 2022 (Fig. 1.2B). The appearance of a multiplicity of fixed mutations unfolding in the branches of the tree was accompanied by significantly higher mutation rates but also multiple haplotypes involving a number of proteins in a timeframe of only a few months [3]. This usually took the form of mutational recruitment of haplotypes and individual mutation sets that had already appeared during the first year of the pandemic but were brought together by the pressures of immune evasion triggered by vaccines and other interventions. Thus, it appears that gradual mutation accumulation can be accompanied by mutational recruitments favoring the saltation-driven rise of new viral variants. 
In all cases, there is a succession of mutational events and recruitments that can be traced along the history of a multitude of evolving genomes. These mutational changes comply with the principle of continuity and can be computationally modeled.
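Rates expressed in substitutions per year, such as those quoted above, are commonly estimated by regressing the number of substitutions separating each sampled genome from the root against its collection date. The sketch below shows that idea only; it is not a description of how Nextstrain computes its clock. The function name `clock_rate` is hypothetical, and the sampling dates and substitution counts are invented illustration values constructed to lie exactly on a line with a slope of 23 substitutions/year.

```python
import numpy as np

def clock_rate(sampling_years, substitutions):
    """Least-squares clock rate (substitutions/year) and inferred root date.

    The root date is the point where the fitted line crosses zero
    substitutions. Dates are centered before fitting for numerical stability.
    """
    t = np.asarray(sampling_years, dtype=float)
    d = np.asarray(substitutions, dtype=float)
    t0 = t.min()
    slope, intercept = np.polyfit(t - t0, d, deg=1)
    root_date = t0 - intercept / slope
    return float(slope), float(root_date)

# Hypothetical collection dates (decimal years) and root-to-tip counts.
years = np.array([2020.0, 2020.5, 2021.0, 2021.5, 2022.0])
subs = 23.0 * (years - 2019.9)   # synthetic: exact clock of 23 subs/year

rate, root = clock_rate(years, subs)
print(f"rate = {rate:.1f} substitutions/year, inferred root date = {root:.2f}")
```

With real data the points scatter around the line, and a change in slope between sampling periods is one way an acceleration such as the one described for Omicron would show up.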

    Axiom 2

    —Singularity: Only one phylogeny of extant and extinct biological entities exists as a consequence of genealogical descent. A phylogeny is a hypothesis of history and genealogical relationship among a group of evolving entities (taxa or leaves) in the form of a tree or a network with specific connotations of ancestry and an implied time axis. Fig. 1.2 describes a hypothesis of history related to the evolution of the SARS-CoV-2 virus. Epistemologically, observations must be placed within the context of symmetry breaking and irreversibility of time [11], necessary conditions for phylogeny reconstruction. A phylogeny must be rooted (a branch must be pulled down to hold the ancestor) to fully explain the evolutionary processes being studied and the rooting implies a single evolutionary origin and a series of symmetry-breaking (cladogenetic, speciation, or furcation) events [12]. Note that the existence of multiple origins in a phylogeny due to horizontal processes of change along an evolutionary timeline does not detract from its singular nature. In fact, the existence of reticulations in a network structure must be ultimately resolved into leaves through convergent and divergent relationships and symmetry-joining events, an endeavor that is computationally hard [13,14]. Note that a phylogeny with reticulations implies the existence of other evolutionary processes besides genealogical descent that act as noise (homoplasy) and must be optimized during phylogenetic reconstruction.

    Axiom 3

    —Memory: Biological attributes (characters) are transmitted from one generation of biological entities to the next, modified or unmodified. Phylogenetic characters are observable features that hold the memory of the past. They are used as evidence to establish history and must be carefully selected to match the evolutionary questions posed. The high rates of change of amino acid or nucleic acid sequences for example make sequence sites excellent characters for relationships that are evolutionarily shallow. In contrast, the conserved nature of molecular structures makes structural atomic features suitable for untangling deep phylogenetic relationships. Characters have alternative and mutually exclusive manifestations (character states) and are powerful when they are shared and derived in a phylogeny, that is, when these so-called synapomorphies distribute most of the character state changes in the branches located in the crown of the trees or networks when these have been rooted. For example, characters describing the presence of an amino acid in a protein sequence can have 22 character states corresponding to the genetically encoded (proteinogenic) amino acids that make up proteins. Defining characters such as these involves mapping correspondences (through an alignment) between alternative homologs (character states) of the putative homology (character). Here homology is appropriately equated to common ancestry, descent from a common ancestor, or a unique origin for each derived condition. However, in every instance, the homology is putative. Its validity must be confirmed by tracing character change in the phylogenies of Axiom 2 and checking if the models of change of Axiom 1 are appropriate.

    When defining axioms, biological entities can be cellular organisms and viruses, but they can also be component parts of cellular machinery, such as proteins, ribosomal proteins, ribosomal RNA, transfer RNA, regulatory RNA, or even abstractions such as Gene Ontology (GO) definitions of molecular functions, biological processes, or cellular components. A character implies a "transformation series," a model that provides boundary conditions for transformations from one state to another in the branches of the phylogeny. These transformations transmit character state change in the form of modified or unmodified features of the biological entities that are evolving along the branches of the phylogeny. Characters must show at least two character states, one ancestral (plesiomorphic) and the other derived (apomorphic), their ancestral-derived nature made explicit when character states unfold along the branches of trees or networks. Characters embody gradual change as an axiomatic component of retrodiction but also as an analytical method explaining the rise of biological systems by the addition and interaction of their component parts. As I will advance below, the interaction of parts and wholes becomes crucial for the retrodiction enterprise.

    1.3 The analytical basis of the phylogenetic framework

    Phylogenetic reconstruction takes advantage of both inductive (bottom-up logic) and deductive (top-down logic) reasoning. Induction gathers information to propose useful hypotheses of history, which are necessary to predict and infer. Deduction tests the hypotheses generated by induction and either corroborates or refutes them. Induction alone does not suffice. It recursively relies on higher inductive principles leading to infinite regression or apriorism. Without the first-order logic of deduction, which is demonstrably complete, induction methods fail to preserve truth [15].

    The interplay of induction and deduction is manifestly significant when selecting phylogenetic characters, the basic evidential statements (premises) of a phylogenetic study, and putting them to the test. Characters are class properties that are assumed to be homologies and historical identities [16]. They are conjectures of perceived similarities, primary homologies sensu de Pinna [17], which are accepted as empirical facts for the duration of the study. Using inductive reasoning, characters and their character states must be first identified, coded for analysis, and used to make up columns in phylogenetic data matrices for tree or network reconstructions. Characters can be defined in different ways, generally using cladistic, phenetic, classification, deep learning, or other methods. For example, the flavoprotein domain superfamily, indexed by the structural classification of proteins (SCOP) [18] with the concise classification string (ccs) descriptor c.23.5, represents a collective of monomeric proteins with a wide array of redox or transferase functions (Fig. 1.3). These proteins play roles, for example, in photosynthesis and DNA repair, and in the removal of radicals from oxidative stress. Flavodoxins hold the flavodoxin-like fold with a 3-layered α/β/α sandwich structure and a β-sheet of 5 parallel strands with order 21345 [19]. The structure is universally present in all superkingdoms of life but has been selectively lost in many eukaryotic lineages. Its distribution already suggests its evolutionary significance. Machine learning and data mining methodologies such as hidden Markov models (HMMs) of structural recognition can reliably define protein structural domains at different levels of the SCOP hierarchy, identifying with high confidence the structure of the c.23.5 superfamily or that of the 2067 superfamilies that are currently indexed in the extended SCOP database (SCOPe v. 2.08) [20]. 
Algorithmic methodologies and expert curation in SCOP already assume domains in superfamilies hold structures and functions indicative of a common evolutionary origin. Furthermore, the families of these superfamilies also hold domains that are closely related at the sequence level (with pairwise identities >30%), providing further confidence in the use of fold structures in phylogenetic reconstruction. In an alternative example, the amino acid sequences of flavoproteins belonging to the NADPH-dependent FMN reductase family (c.23.5.4) of the flavoprotein domain superfamily known to be present in a number of species (illustrated in Fig. 1.3 by Azobenzene reductase of Bacillus subtilis; PDB entry 1NNI) can be aligned using a progressive iterative multiple alignment methodology (e.g., MAFFT, MUSCLE). The alignment represents a collection of primary homology statements, with every site of the alignment (position in the sequence) constituting a character that can be used as a column in the phylogenetic matrix. Significant matches in character states in conserved sites along the sequence are suggestive of common ancestry supporting characters of the sequence being putative homologies.
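The step from an alignment to a phylogenetic matrix can be sketched in a few lines. The example below assumes a toy, invented alignment of four hypothetical taxa (not real FMN reductase sequences) and treats every aligned position as one character. It also labels each site as constant, variable but uninformative, or parsimony-informative, using the usual convention that an informative site needs at least two states each present in at least two taxa. The function names are illustrative assumptions.

```python
from collections import Counter

def alignment_to_characters(alignment):
    """Transpose {taxon: aligned sequence} into a list of columns (characters)."""
    taxa = sorted(alignment)
    lengths = {len(alignment[t]) for t in taxa}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_sites = lengths.pop()
    columns = ["".join(alignment[t][i] for t in taxa) for i in range(n_sites)]
    return taxa, columns

def classify_site(column, gap="-"):
    """Label one alignment column by how much grouping evidence it carries."""
    counts = Counter(state for state in column if state != gap)
    if len(counts) <= 1:
        return "constant"
    shared = sum(1 for n in counts.values() if n >= 2)
    return "parsimony-informative" if shared >= 2 else "variable-uninformative"

# Toy alignment with made-up residues; "-" marks an alignment gap.
toy_alignment = {
    "taxonA": "MKLAV",
    "taxonB": "MKLSV",
    "taxonC": "MRIAV",
    "taxonD": "MRI-V",
}

taxa, columns = alignment_to_characters(toy_alignment)
for position, column in enumerate(columns, start=1):
    print(position, column, classify_site(column))
```

Each printed column is one putative homology statement: the rows are taxa, the column is the character, and the residues are its character states.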

    Figure 1.3 The hierarchical classification scheme of the SCOPe database illustrates parent–child relationships of domains ending in the structure of the azobenzene reductase enzyme of Bacillus subtilis (highlighted in green). The enzyme is described with a 3-dimensional atomic model (in cartoon representation) typical of the NADPH-dependent FMN reductase (c.23.5.4) family of structural domains that makes up its 3-layered α/β/α sandwich structure; α-helices are colored in green and β-strands in orange. SCOPe, Structural classification of proteins–extended.

    When searching for the type of character most appropriate for the evolutionary problem being addressed, the hypothetico-deductive method (Box 1.2) reveals that the mechanisms by which hypotheses of primary homology are being proposed by induction from the world of experience are not important [23]. Instead, the classes of potential falsifiers of the hypotheses are those that weed out trivial or suboptimal primary homologies. Examples include criteria of ontogenetic or positional-topographical correspondence, which are common operations of primary homology assessment. Structure and function are widely employed potential falsifiers that use statements of similarity or dissimilarity without making reference to phylogeny. Once primary homologies have been defined, their validity must be tested by phylogenetic reconstruction, that is, by discovering historical hypotheses in the form of reconstructed trees or networks with which to trace character states and their transformation along their branches. These hypotheses H embody a disjunction of constituents (phylogenies and character state transformations), given that a phylogenetic matrix is both a 2-way (has rows and columns) and 2-mode matrix (rows and columns indexing different sets of entities) of primary homology statements as well as a tree or network structure explaining them. Please note that this stems from networks being represented with adjacency matrices or phenetic trees using distance matrices. Consequently, hypothesis H of history cannot be derived directly from the matrices of primary homologies if they are to be used as consequences of explanation. H cannot be trees obtained directly from distance matrices derived from alignments. Instead, a framework of optimization must be employed that evaluates competing H in terms of their ability to explain the phylogenetic matrices of primary homologies. 
In other words, similarities in themselves are logically incapable of testing competing phylogenetic hypotheses and cannot be treated as evidence E when formalizing explanatory power, severity of test, or degree of corroboration (Box 1.2).

    Box 1.2

    The hypothetico-deductive method in algorithm format.

    1. Gather data (observations)

    2. Hypothesize an explanation for those observations

    3. Deduce a consequence of that explanation (prediction). Test the prediction by observing the predicted consequence

    4. Wait for corroboration¹. If there is corroboration go to step 3. If not, the hypothesis is falsified. Go to step 2

    ¹Corroboration is the degree to which a hypothesis has been tested and the degree to which it has stood up to those tests [23]. Popper’s corroboration measure C of a conjecture H, given evidence E and background knowledge b expressed as sentences of a first-order language L, is defined by Eq. (1.1),

    $$C(H,E,b) = \frac{p(E|Hb) - p(E|b)}{p(E|Hb) - p(EH|b) + p(E|b)} \tag{1.1}$$

    where p(E|Hb) denotes the probability (and likelihood) of E given H and b, for pairs of sentences in L, p(E|b) is the probability of E given b, and p(EH|b) is the probability of E and H being unified (their intersection) given b, with p(EH|b)=p(E|Hb) p(H|b) from the axioms of probability. A test of the conjecture (hypothesis) must prove its mettle by maximizing the numerator (the denominator only fulfills normalization) and fulfilling nine adequacy conditions defining statistical relevancy within both a logical and statistical framework. Positive values of C indicate corroboration. Negative values indicate falsification. Thus, C is not a probability. Sprenger [21] posits, however, that there is no measure of corroboration that fits Popper’s unification of both statistical and logic-based aspects of testability, suggesting instead a Bayesian philosophy of science [22]. Note that corroboration can only occur if H is falsifiable, makes predictions consistent with E, and demands a minimum number of ad hoc hypotheses fitting evidence with auxiliary assumptions that are maximally explanatory.
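
    As a numerical illustration of this footnote, the sketch below evaluates Popper's measure in the form commonly given for it, C = [p(E|Hb) − p(E|b)] / [p(E|Hb) − p(EH|b) + p(E|b)], which uses exactly the three probabilities defined above. The function name and the probability values are hypothetical and serve only to show how the sign of C behaves.

```python
def corroboration(p_e_given_hb, p_e_given_b, p_h_given_b):
    """Popper-style degree of corroboration C(H, E, b).

    Assumes the commonly cited form
        C = [p(E|Hb) - p(E|b)] / [p(E|Hb) - p(EH|b) + p(E|b)]
    with p(EH|b) = p(E|Hb) * p(H|b) from the axioms of probability.
    """
    p_eh_given_b = p_e_given_hb * p_h_given_b
    denominator = p_e_given_hb - p_eh_given_b + p_e_given_b
    if denominator == 0:
        raise ValueError("C is undefined when the normalizing denominator is zero")
    return (p_e_given_hb - p_e_given_b) / denominator


# Hypothetical case 1: H makes E more probable than background knowledge alone.
supported = corroboration(p_e_given_hb=0.9, p_e_given_b=0.5, p_h_given_b=0.5)
# Hypothetical case 2: H makes E less probable than background knowledge alone.
undermined = corroboration(p_e_given_hb=0.1, p_e_given_b=0.5, p_h_given_b=0.5)
print(round(supported, 3), round(undermined, 3))
```

    The first call returns a positive value and the second a negative one, because only the numerator can change sign while the denominator acts as a normalizer.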

    Building phylogenetic trees with or without reticulations requires formalizing retrodiction within a paradigm of optimization of common ancestry, that is, a paradigm of shared and derived features (synapomorphies). A phylogeny must be selected out of all possible phylogenies using an explicit criterion of optimization, typically parsimony or likelihood. When considering parsimony as the optimality criterion, for example, the goal is to select the trees that increase phylogenetic explanation by minimizing ad hoc hypotheses of character change. The consequence is the maximization of the explanatory power of H, the hypothesis of history. Optimization is an NP-hard problem. The number N of possible trees increases quickly with the number of taxa n, with $N_u = (2n-5)!/[2^{n-3}(n-3)!]$ for unrooted trees and $N_r = (2n-3)!/[2^{n-2}(n-2)!]$ for rooted trees. For example, there are about 3.2×10²³ and 2.7×10⁷⁶ possible rooted trees with 21 and 50 taxa, respectively. These spaces of trees are of magnitude comparable to the dimensionless Avogadro’s number (6.02×10²³) and Eddington’s calculation of the number of electrons in the visible universe (1.6×10⁷⁹), respectively. Reconstructing large trees is thus a computationally challenging proposition for which exact solutions are generally out of reach. Instead, heuristic searches with hill-climbing algorithms can explore only small regions of tree space in search of optimal solutions.
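
    The growth of tree space can be checked directly. The sketch below evaluates the two counting formulas, (2n−5)!/[2^(n−3)(n−3)!] for unrooted and (2n−3)!/[2^(n−2)(n−2)!] for rooted bifurcating trees, using exact integer arithmetic. The function names are illustrative only.

```python
from math import factorial


def num_unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 labeled taxa."""
    if n < 3:
        raise ValueError("at least 3 taxa are needed for an unrooted tree")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def num_rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 labeled taxa."""
    if n < 2:
        raise ValueError("at least 2 taxa are needed for a rooted tree")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for taxa in (6, 21, 50):
    print(taxa, num_unrooted_trees(taxa), f"{num_rooted_trees(taxa):.2e}")
```

    For 6 taxa the functions give 105 unrooted and 945 rooted trees, and for 21 and 50 taxa the rooted counts are on the order of 10²³ and 10⁷⁶, in line with the figures quoted in the text.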

    Once one or more optimal trees are identified, tracing character state changes along branches allows us to evaluate primary homologies with congruence tests, which assess how individual character changes (homologs) agree or disagree with each other along the branches of the most parsimonious trees. Agreements legitimize primary homologies by corroboration, turning them into secondary homologies [17]. In other words, if statements of primary and secondary homologies coincide, there will be a perfect fit and primary homologies will have been fully corroborated. In contrast, disagreements will subdivide primary homologies into several statements of secondary homologies, lowering the level of universality of the observed similarities. This mismatch will require ad hoc explanations of change, homoplasies, which will take the form of reversals, convergences, or parallelisms. The larger the number of homoplasies, the lower the degree of corroboration of primary homologies. This dynamic of explanatory power evaluates how each primary homology statement of character evolution agrees with the favored hypothesis of history obtained from all available and useful data.

    To illustrate the analytical phylogenetic framework, Fig. 1.4 reconstructs the evolutionary history of viral realms, the highest level of the virus classification proposed by the International Committee on Taxonomy of Viruses (ICTV) [24]. Six virus realms are currently recognized by ICTV, each of which is defined by highly conserved features. Adnaviria groups archaeal filamentous viruses encased in a unique capsid protein with genomes made of double-stranded (ds) DNA of the A-form type. Duplodnaviria encompasses all dsDNA viruses encoding capsid proteins with the HK97-like fold. Monodnaviria contains all single-stranded (ss) DNA viruses encoding HUH endonuclease enzymes. Riboviria groups all RNA viruses encoding RNA-dependent RNA polymerases (RdRp) and all RNA and DNA viruses encoding reverse transcriptase enzymes. Ribozyviria contains viruses with circular negative-sense ssRNA genomes. Finally, Varidnaviria contains all dsDNA viruses encoding a capsid protein with a vertical jelly roll fold. Because the most crucial functional feature of a virus is its replication strategy, I used the Baltimore [25] classification of replication strategies to index the strategies present in each of the 6 viral realms. This information was then used to build a phylogenetic tree of viral realms using maximum parsimony as an optimality criterion. This simple and straightforward phylogenetic exercise recovered single most parsimonious unrooted and rooted trees, which suggest that Riboviria and Ribozyviria were ancestral realms and that their time of origin was earlier than the origin of realms holding viruses with DNA genomes. Riboviria includes RNA viruses encoding RdRp and RNA-dependent DNA polymerase (RdDp) retroviral enzymes and Ribozyviria is a realm defined by genomic and antigenomic ribozymes of the delta virus type. The reconstructed tree also supports the sister group relationship of Monodnaviria and Varidnaviria. 
Primary homologies revealed how conjectures drawn from induction led to both their corroboration and that of the history of viral realms. Since the best test of homology is common ancestry [9], the maximum consistency index (CI) and minimum homoplasy index (HI) values confirm there is a strong vertical phylogenetic signal in Baltimore replication strategies, an intuition that was already proposed [26]. The example makes a good case for teaching phylogenetic reconstruction and shows the power of morphological-type characters and alignment-free methodologies, which I will later discuss.
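
    The link between homoplasy and the CI and HI values can be made concrete with a small parsimony calculation. The sketch below counts character state changes on a fixed tree with the Fitch procedure for unordered characters and then derives CI (minimum possible steps divided by observed steps) and HI (1 − CI). The four taxa, the tree, and the two binary characters are hypothetical toy data, not the Baltimore class matrix of Fig. 1.4, and the sketch assumes that at least one character varies among taxa.

```python
def fitch_steps(tree, states):
    """Minimum number of state changes for one unordered character.

    `tree` is a nested tuple of taxon names (a bifurcating tree);
    `states` maps each taxon name to its character state.
    """
    def visit(node):
        if isinstance(node, str):                  # terminal taxon
            return {states[node]}, 0
        (left_set, left_steps), (right_set, right_steps) = map(visit, node)
        shared = left_set & right_set
        if shared:                                 # no change needed at this node
            return shared, left_steps + right_steps
        return left_set | right_set, left_steps + right_steps + 1

    return visit(tree)[1]


def consistency_index(tree, characters):
    """CI = (sum of minimum possible steps) / (sum of observed steps)."""
    minimum = sum(len(set(c.values())) - 1 for c in characters)
    observed = sum(fitch_steps(tree, c) for c in characters)
    return minimum / observed


# Hypothetical toy data.
tree = (("A", "B"), ("C", "D"))
congruent = {"A": 1, "B": 1, "C": 0, "D": 0}     # fits the tree in one step
conflicting = {"A": 1, "B": 0, "C": 1, "D": 0}   # needs two steps on this tree

ci_clean = consistency_index(tree, [congruent])
ci_mixed = consistency_index(tree, [congruent, conflicting])
hi_mixed = 1 - ci_mixed
print(ci_clean, round(ci_mixed, 3), round(hi_mixed, 3))
```

    A matrix made only of congruent characters gives a CI of 1 and an HI of 0, the situation reported for the optimal tree of viral realms, whereas adding a conflicting character lowers CI and raises HI.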

    Figure 1.4 Reconstructing the evolution of viral realms with Baltimore class data. Viruses have been classified into 6 realms by the ICTV and their replication strategies have been grouped into 7 Baltimore classes (labeled with Roman numerals). Unrooted (left) and rooted (right) phylogenetic trees describing the evolution of viral realms were reconstructed from Baltimore class data coded as unordered binary characters with 0 and 1 representing absence or presence of a replication strategy. Four phylogenetically uninformative characters were excluded from analysis. Phylogenetic reconstruction using maximum parsimony resulted in one optimal unrooted tree of 3 steps in length that was selected as the optimal solution out of 105 possible unrooted trees or one optimal rooted tree of 3 steps out of 945 possible rooted trees (see tree statistic tables). The optimal tree showed a perfect agreement of character state tracings, with CI of 1 and HI of 0. Thus, primary homologies have been fully corroborated. All suboptimal trees showed lower CI and higher HI values and Goloboff fits, transferring suboptimal corroboration from characters to trees. An exercise of character state reconstruction traces character state changes (indexed lines) onto the branches of the most parsimonious trees. BS analysis with 1000 replicates assessed branch reliability (in bold) and significant support for individual clades. Trees were rooted using the Lundberg method, which places the root in the most parsimonious branch. BS, Bootstrap support; CI, consistency index; HI, homoplasy index; ICTV, International Committee on Taxonomy of Viruses.

    1.4 Rooting trees

    Modern phylogenetic analysis favors the reconstruction of unrooted trees (with or without reticulations) because the space of these trees is smaller and computationally more tractable than the space of rooted trees (see Fig. 1.4 for an example) and because generating an unrooted tree requires no commitment to a rooting strategy. However, the retrodiction is incomplete for several reasons. First, phylogenies must be rooted to portray history. Without a root, the origin of the phylogeny cannot be unambiguously defined and character state-vectors of ancestors cannot be properly reconstructed. Second, the frustrated interplay of homology and homoplasy cannot be unfolded, including an inability to distinguish the ancestral-derived order of transformational homologs, that is, distinguishing apomorphic from plesiomorphic character states. Third, phylogenies must be rooted to fully explain the evolutionary process.

    Nelson [27] proposed two general types of rooting methods that make use of formal auxiliary hypotheses, indirect and direct methods (Table 1.1). Indirect methods require character information from taxa external to the group of taxa being studied (the ingroup). In contrast, direct methods focus exclusively on ingroup taxa and the ingroup node, the most recent common ancestor of the ingroup. A number of alternative strategies are available that utilize the two general rooting methodologies (reviewed by Caetano-Anollés et al. [12]), which I here briefly describe.

    Table 1.1

    ᵃ Detailed descriptions of rooting methods can be found in Caetano-Anollés et al. [12].

    Indirect methods are the most widely used, especially the outgroup comparison method [28]. Unrooted trees describing the evolution of both the ingroup and an ad hoc external group defined as being ancestral (the outgroup) are rooted by pulling down the external sister group to the base of the tree. Successive expansions of the number of outgroup taxa can be used to enhance the phylogenetic stability of the ingroup and the testability and explanatory power of phylogenetic reconstruction [29]. In a variant of the outgroup method, hypothetical ancestors (e.g., all zero pseudo-outgroups) are used as sister groups. These hypothetical ancestors are treated as extant taxa and should not be combined with direct methods to avoid complications [30]. Indirect methods are limited. They cannot root the Tree of Life (ToL) or groups of organisms that have not been properly surveyed. A final indirect method alternative simply annotates or pulls down an additional branch added to an unrooted tree a posteriori and without phylogenetic optimization, invoking argumentative auxiliary hypotheses. This approach leads to apriorism and should be avoided.

    Direct methods are powerful character polarization methods that only require an appropriate survey of character state distributions in the ingroup. They include the generality criterion, optimization-based polarization, and distance and parametric-based rooting methodologies.

    The generality criterion, in its three implementations, is based on the distribution of homologous character states in branches of optimal trees describing the evolution of ingroup taxa. The stratigraphic (paleontological) criterion establishes that the character states of older fossils are ancestral when compared to those of younger fossils. Either the oldest fossil taxon or a hypothetical ancestor that summarizes character state vectors of fossil taxa root the trees. Note that the stratigraphic method is not exclusive of fossil data as it can be extended to molecular data. Nelson's ontogenetic criterion is restricted to morphological characters that unfold in ontogenetic stages, usually made explicit when comparing distinct life history stages that are developmentally nested. Since the Nelson rule is a special case of the generality criterion, the generality principle can be best illustrated using the third implementation, the Weston rule, which applies to any type of character. Weston proposed that when ancestral character states were retained in descendants (i.e., were evolutionarily stable), they would distribute widely along the growing branches of the trees, more so than character states that were more derived. Focusing on the evolution of species, Weston's rule specifies: Given a distribution of two homologous character (states) in which one, X is possessed by all of the species that possess its homolog, character (state) Y, and by at least one other species that does not, then Y may be postulated to be apomorphous relative to X [31]. The generality criterion can be satisfied by reconstructing optimal unrooted trees describing the evolution of ingroup taxa and then rooting the trees a posteriori using the Lundberg rooting method [32]. The Lundberg optimization strategy attaches a hypothetical ancestor most parsimoniously to the internode of one or more unrooted trees recovered during phylogenetic optimization. 
This attachment is not done a priori by specifying ancestors, as has been wrongly claimed. Instead, the standard implementation of Lundberg sets all character states of the ancestor as being unknown (missing) and proceeds to optimize attachment of all possible ancestors to the optimal unrooted tree. Alternatively, Lundberg attaches arbitrarily defined ancestors most parsimoniously. Comparing the rooting optimality of these arbitrary ancestors to the standard implementation allows us to test the adequacy and empirical support of evolutionary models [33]. In Fig. 1.4, for example, rooting the tree of viral realms with Lundberg using the standard implementation and an all-0 ancestor recovered rooted trees of equal length that were topologically isomorphic. This finding supports the premise that the presence of a replication strategy in a viral realm is a novelty that is not easily lost and that character state 1 should be considered apomorphic, that is, evolutionarily derived, relative to character state 0.
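
    Weston's rule, quoted above, reduces to a nested-set comparison of character state distributions: Y is postulated to be apomorphic relative to X when the species possessing Y form a proper subset of the species possessing X. The sketch below expresses only that comparison. The function name and the species labels are hypothetical, and the sketch leaves out the tree reconstruction and Lundberg optimization steps described in the text.

```python
def weston_apomorphic(species_with_x, species_with_y):
    """Return True if Y may be postulated apomorphic relative to X.

    Encodes the nested-distribution reading of the rule: every species with Y
    also has X, and at least one further species has X but not Y.
    """
    x, y = set(species_with_x), set(species_with_y)
    return bool(y) and y < x          # non-empty proper subset


# Hypothetical distributions of two homologous character states.
general_state = {"sp1", "sp2", "sp3", "sp4"}    # X, widely distributed
restricted_state = {"sp3", "sp4"}               # Y, nested within X

print(weston_apomorphic(general_state, restricted_state))   # Y nested in X
print(weston_apomorphic(restricted_state, general_state))   # roles reversed
```

    The first call evaluates to True and the second to False, since the more widely distributed state cannot be derived relative to a state nested within it under this criterion.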

    Optimization-based polarization offers a second direct rooting methodology. Characters are directly polarized during optimization by spelling out an asymmetric character state matrix of transformation costs (a step matrix) that specifies the costs (distances usually measured in steps) of all possible transformations between character states. In these step matrices, the number of steps between any two character states in one direction does not necessarily match the number in the opposite direction. This is in sharp contrast to undirected (static) step matrices widely used in phylogenetic analysis such as those of typically ordered (additive Wagner optimization) or unordered (nonadditive Fitch optimization) characters used to describe morphological and sequence data, respectively. These step matrices are symmetric. When directionality of character state transformation is allowed, differences in transformation costs impose a directionality of change (a polarization) that in itself roots the trees during phylogenetic optimization. Because optimization-based polarization can violate the triangle inequality condition of distances, which affects the validity of phylogenetic reconstruction, the use of asymmetric step matrices must be carefully justified, especially because transformations embody auxiliary hypotheses and require dynamic programming computational methodologies during phylogenetic reconstruction.
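
    The polarizing effect of an asymmetric step matrix can be seen in a small dynamic programming example. The sketch below uses a Sankoff-style recursion, a standard way of optimizing characters under arbitrary step matrices, to compute the cost of assigning each state to the root of a fixed tree. The tree, the tip states, and the transformation costs (a gain costing 1 step and a loss costing 3) are hypothetical values chosen for illustration.

```python
INF = float("inf")


def sankoff_root_costs(tree, tip_states, cost):
    """Cost of each possible root state under a step matrix.

    `tree` is a nested tuple of taxon names, `tip_states` maps taxa to states,
    and cost[s][t] is the cost of changing from ancestral state s to
    descendant state t (it need not equal cost[t][s]).
    """
    states = list(cost)

    def visit(node):
        if isinstance(node, str):                      # terminal taxon
            return {s: (0 if s == tip_states[node] else INF) for s in states}
        child_tables = [visit(child) for child in node]
        return {
            s: sum(min(cost[s][t] + table[t] for t in states)
                   for table in child_tables)
            for s in states
        }

    return visit(tree)


tree = (("A", "B"), ("C", "D"))
tips = {"A": 1, "B": 1, "C": 0, "D": 0}

symmetric = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}     # undirected costs
asymmetric = {0: {0: 0, 1: 1}, 1: {0: 3, 1: 0}}    # hypothetical: losses cost more

undirected = sankoff_root_costs(tree, tips, symmetric)
directed = sankoff_root_costs(tree, tips, asymmetric)
print(undirected, directed)
```

    With symmetric costs both root states are equally costly, so the character says nothing about polarity. With the asymmetric matrix, state 0 at the root is cheaper than state 1, which is the sense in which directional costs root the tree during optimization.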

    Finally, distance and parametric-based rooting offers a third type of direct rooting method. These strategies include: (1) midpoint rooting obtained by calculating all leaf-to-leaf distances and then placing the root half-way between the most distantly separated leaves, and (2) parametric-based rooting methods that make use of either assumptions of strict or relaxed molecular clocks or nonreversible sequence substitution models. The validity of all of these methods must be carefully justified, especially because they involve probabilities of edge transformations and time parameters in maximum likelihood applications or the integration of model and time parameters in Bayesian methods of phylogenetic reconstruction.
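
    Midpoint rooting, the first of these strategies, is simple enough to sketch. The code below finds the two most distant leaves of an unrooted tree with branch lengths and reports the branch, and the position along it, that lies halfway between them. The edge-list representation, the function name, and the example tree are assumptions made for illustration.

```python
def midpoint_root(edges):
    """Locate the midpoint root of an unrooted tree.

    `edges` is a list of (node_a, node_b, branch_length) tuples.
    Returns (u, v, offset): the root sits on branch u-v, `offset` away from u.
    """
    adjacency = {}
    for a, b, length in edges:
        adjacency.setdefault(a, []).append((b, length))
        adjacency.setdefault(b, []).append((a, length))
    leaves = [node for node, links in adjacency.items() if len(links) == 1]

    def paths_from(start):
        # Depth-first walk recording distance and path to every node.
        reached = {start: (0.0, [start])}
        stack = [start]
        while stack:
            node = stack.pop()
            distance, path = reached[node]
            for neighbor, length in adjacency[node]:
                if neighbor not in reached:
                    reached[neighbor] = (distance + length, path + [neighbor])
                    stack.append(neighbor)
        return reached

    # Longest leaf-to-leaf path.
    longest, longest_path = -1.0, None
    for leaf in leaves:
        reached = paths_from(leaf)
        for other in leaves:
            if reached[other][0] > longest:
                longest, longest_path = reached[other]

    # Walk along that path until half of its length is covered.
    branch_length = {frozenset((a, b)): length for a, b, length in edges}
    half, walked = longest / 2, 0.0
    for u, v in zip(longest_path, longest_path[1:]):
        step = branch_length[frozenset((u, v))]
        if walked + step >= half:
            return u, v, half - walked
        walked += step


# Hypothetical tree: leaves A-D, internal nodes x and y.
example_edges = [("A", "x", 1), ("B", "x", 2), ("x", "y", 1),
                 ("C", "y", 5), ("D", "y", 1)]
root = midpoint_root(example_edges)
print(root)
```

    In this example the most distant leaves are B and C, 8 units apart, so the root falls on the long branch leading to C, 1 unit from node y. The result depends entirely on branch lengths, which is why the method needs the justification the text calls for when rates differ among lineages.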

    1.5 Phylogenomic analysis

    Phylogenomics extends phylogenetic analysis and evolutionary thinking to the entire genome and genome-dependent omic repertoires. This includes the genome, the transcriptome, the proteome, the transposome, the metabolome, and the functionome. Phylogenomics, therefore, changes the initial focus of retrodiction, from genetic information in only one or few genes or genetic elements, to exhaustive information in genetic sets. The need for such a change of focus is the recognition that genes and genomes are historical patchworks, that is, that distinct genes carry distinct evolutionary histories and that individual genes are by themselves historically heterogeneous. These heterogeneities arise from pervasive horizontal exchange of genetic information and widespread recruitment in biology [34]. They result in phylogenies built from different genes being seldom congruent with each other. Two initial strategies have been used to collectively capture genomic information from individual gene sequences. In the supermatrix approach, alignments of individual genes or protein sequences are concatenated by stitching together phylogenetic data matrices into a supermatrix, which is then used to build a phylogenomic tree using the standard methods of maximum parsimony, maximum likelihood, or Bayesian inference [35]. Alternatively, the supertree approach builds individual trees from individual data matrices, which are then joined together into a single phylogeny [36] using, for example, coalescent methods [37]. These approaches are demanding, often require manual curation, and are dependent on the accuracy and assumptions of sequence alignments. They fall within the broad category of alignment-dependent methodologies, which include the most widely used bioinformatic methods of BLAST and CLUSTAL [38,39].
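
    The concatenation step of the supermatrix approach can be sketched in a few lines. The function below stitches per-gene alignments into one matrix and pads taxa that lack a gene with a missing-data symbol. The function name, the choice of "?" for missing data, and the toy alignments are assumptions for illustration, not part of any specific published pipeline.

```python
def build_supermatrix(gene_alignments, missing="?"):
    """Concatenate per-gene alignments into one supermatrix.

    `gene_alignments` is a list of dicts mapping taxon name to aligned sequence.
    Taxa absent from a gene receive a run of `missing` symbols of that gene's width.
    """
    taxa = sorted({taxon for alignment in gene_alignments for taxon in alignment})
    rows = {taxon: [] for taxon in taxa}
    for alignment in gene_alignments:
        widths = {len(sequence) for sequence in alignment.values()}
        if len(widths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = widths.pop()
        for taxon in taxa:
            rows[taxon].append(alignment.get(taxon, missing * width))
    return {taxon: "".join(parts) for taxon, parts in rows.items()}


# Hypothetical alignments of two genes; taxon C lacks the second gene.
gene1 = {"A": "ACGT", "B": "ACGA", "C": "ATGA"}
gene2 = {"A": "GG-C", "B": "GGTC"}

supermatrix = build_supermatrix([gene1, gene2])
for taxon, sequence in supermatrix.items():
    print(taxon, sequence)
```

    Every row of the result has the same length, so the matrix can be passed to a standard tree-building method, but each column still inherits whatever alignment errors were present in its source gene.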

    Alignment-dependent methodologies are by far the most widely used but are error-prone [40,41]. Homology errors include the alignment of different paralogues to each other, alignment of different exons, and even alignment of exons to introns, all of which weaken the power of retrodiction. More troubling is the problem that gene sequences embody patchworks of segments with different evolutionary histories [42]. This historical segmental heterogeneity affects the validity of using sequence alignments in general, even if alignments are from concatenated gene sets. Typically, sequence alignments are built without recognizing that the structure of proteins and nucleic acids is modular. For example, structural domains are the structural, functional, and evolutionary units of proteins. Many proteins harbor more than one domain, which often rearranges in the course of evolution [43]. Domains have gradually appeared and accumulated since the origin of proteins 3.8 billion years ago (Gya). However, the evolutionary combination and rearrangement of domains was massive about 1.5 Gya following a combinatorial big bang [44]. Consequently, each domain carries its own history making multidomain proteins evolutionary patchworks. Similarly, structural domains are made up of elementary functional loops, each of which carries its own evolutionary origin and history [45]. Remarkably, the existence of distinct domains or loops is for the most part neither considered in sequence alignments nor in evolutionary models for alignment and phylogeny reconstruction. For example, concatenated sequence alignments of highly conserved gene sets were used to build an unrooted ToL and support a two-superkingdom model of organismal diversification [46]. The concatenated alignment did not consider the existence of domains, which in many instances had been transposed in evolution [12]. 
In addition, trimming of amino acid positions with more than 50% gaps (partial deletions) introduced considerable uncertainties, including the elimination of variable but central segments that could carry significant phylogenetic information. Other problems such as sampling of the expanding diversity of Archaea and their proteins, fast-evolving lineages, anomalous behavior of Asgard proteins, and effects of horizontal genetic exchange added uncertainties to phylogenetic reconstructions [47]. Thus, irreconcilable misalignments from domain rearrangements, biases introduced by trimming, and other technical difficulties have challenged the entire retrodiction exercise.
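
    The trimming operation mentioned above, removal of alignment positions with more than 50% gaps, can be illustrated as follows. The function name, the default threshold argument, and the toy protein alignment are hypothetical. The example shows how a column that carries a residue in only one taxon disappears from the trimmed matrix.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """Drop alignment columns whose gap fraction exceeds `max_gap_fraction`.

    `alignment` maps taxon names to equal-length aligned sequences.
    """
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [column for column in columns
            if column.count(gap) / len(column) <= max_gap_fraction]
    return {name: "".join(column[i] for column in kept)
            for i, name in enumerate(names)}


# Hypothetical alignment: the third column is a gap in 3 of 4 taxa.
toy_alignment = {
    "t1": "MK-LV",
    "t2": "M--LV",
    "t3": "MR-LI",
    "t4": "M-ALV",
}
trimmed = trim_gappy_columns(toy_alignment)
for name, sequence in trimmed.items():
    print(name, sequence)
```

    The second column, with exactly 50% gaps, is retained, while the third column is removed together with the only residue it contained. In a real alignment such removed positions may belong to variable but informative segments, which is the concern raised in the text.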

    Another problem of alignment-dependent retrodiction is the violation of character independence, the idea that characters must serve as independent evolutionary hypotheses of history (homology). If characters do not evolve independently from one another, interacting characters are overweighted in the analysis and the resulting phylogeny fails to represent true history [48]. Character dependencies can arise from biases introduced by structure, function, physiology, development, and even ecological influences on the molecules. Dependencies obscure phylogenetic signals and must be either encoded into the phylogenetic model of change via parameters or differential weighting schemes or avoided by excluding one or both interacting offenders. To illustrate the significance of character independence, we take advantage of the simple thought experiment proposed by Penny and Collins [49] some years ago:

    1. Take into consideration a protein sequence alignment, with columns representing sequence sites (characters) and rows representing proteins (taxa);

    2. Shuffle columns (characters) in the phylogenetic data matrix, randomizing the order of sequence sites in the alignment;

    3. Ask if the phylogeny recovered from the randomized alignment has changed; and

    4. Finally, ask if anything has been lost in the process.

    The answer to question (3) is that randomization produces exactly the same phylogeny because each character (column) is an independent statement of homology in a sequence alignment and follows its own model of evolution. Exchanging characters in the data matrix of Fig. 1.4 will not change the results of the analysis (the reader is welcome to experimentally confirm this fact). However, the answer to question (4) is tantalizing and reveals a significant problem for sequence analysis in general. Randomization of protein sequence sites in a data matrix effectively destroys protein structure and its associated functions. A similar thought experiment can be proposed for RNA or DNA, but also for other characters that require alignment. Modeling the effect of macromolecular structure on phylogenetic inference can quantify the impact of sequence site dependencies [50]. In these studies, failure to account for even small amounts of character dependencies due to secondary or tertiary structure in proteins and RNA led to inaccurate tree topologies and errors in phylogenetic estimation, especially if dependencies were strong and tree lengths large. Furthermore, these dependencies are complicated by the existence of multiple interactions between sites and, in cases of deep evolutionary relationships, by phylogenomic information residing in small subsets of sequence sites present in only a handful of highly conserved genes that drive the retrodiction exercise [51]. This challenges the suitability of sequences for deep evolutionary studies and questions the insistence on using this information to propose ToL diversification scenarios.
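
    The four steps of the thought experiment can be tried on a toy alignment. The sketch below shuffles alignment columns with a fixed random seed and compares pairwise site differences before and after shuffling. Pairwise differences are used here only as a simple stand-in for any method that treats sites independently, and the sequences are hypothetical.

```python
import random
from itertools import combinations


def shuffle_columns(alignment, seed=0):
    """Return a copy of the alignment with its columns in random order."""
    names = list(alignment)
    order = list(range(len(alignment[names[0]])))
    random.Random(seed).shuffle(order)
    return {name: "".join(alignment[name][i] for i in order) for name in names}


def pairwise_differences(alignment):
    """Number of differing sites for every pair of taxa."""
    return {
        (a, b): sum(x != y for x, y in zip(alignment[a], alignment[b]))
        for a, b in combinations(sorted(alignment), 2)
    }


# Hypothetical protein alignment (rows are taxa, columns are sites).
original = {
    "prot1": "MKVLAAGI",
    "prot2": "MKVLSAGI",
    "prot3": "MRVLSAGV",
    "prot4": "MRILSTGV",
}
shuffled = shuffle_columns(original)

print(pairwise_differences(original) == pairwise_differences(shuffled))
print(shuffled["prot1"])
```

    The comparison prints True: site-by-site information is untouched by the shuffle. The shuffled sequences, however, no longer preserve the order of residues along the chain, so any signal that depends on neighboring or interacting sites is gone, which is the loss that question (4) points to.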

    Extending alignment-dependent methodologies to entire genomic sequences is by itself a daunting task and is currently impractical given significant levels of alignment uncertainty. Currently, ToLs are being reconstructed using concatenated gene repertoires of presumably highly conserved gene sets associated with translation with a goal of covering as much diversity as possible. The ToL of Hug et al. [52], for example, was built from 16 ribosomal proteins (r-proteins) and covers 92 named bacterial phyla, 26 archaeal phyla, and all 5 of the eukaryotic supergroups. One general problem is that these gene sets carry only a minute fraction of the entire genomic information of an organism. This type of sampling bias can be highly misleading and should preclude making broad concluding inferences. One alternative is the use of methods of phylogenetic reconstruction that are alignment-independent. Alignment-free methodologies have been used in genome biology since the dawn of this field. Examples include building phylogenomic trees from gene content [53] or protein fold occurrence and abundance [54]. The utility of fold structure in phylogenomic analysis has been recently reviewed [43] and has been used to build ToLs describing the evolution of proteomes or trees of structural domains describing the evolution of the protein world. Many alignment-free methods of sequence comparison have been developed since the mid-1970s. Some of them have been recently benchmarked [55,56]. They include methods based on word counting such as exact and inexact k-mer counts, word matching, variable length word counts, and single nucleotide polymorphism counts. Microalignments, common substring lengths, Fourier transformations, split from common subsequences, and methods based on information theory that discriminate signal from noise have also been developed and benchmarked. 
Some tools are generic enough to be applied to several types of benchmarking datasets, including the alignment-free k-mer statistics [57], the word-match MinHash approach [58], and the distance-based feature frequency profiles (FFP) from k-mer counts [59].

    1.6 Deep evolutionary explorations with alignment-free methods

    Counting words or string alphabets such as n-grams in linguistics or k-mers in genomics has the potential to access global phylogenetic information embedded in genomes and downstream ome repertoires without the complications and limitations of alignment-dependent methodologies. For example, the FFP method surveys k-mer string alphabets for individual genomes, proteomes, or transcriptomes and builds divergence distance matrices using a divergence statistic (Jensen-Shannon) that describes (within a scale between 0 and 1) how close or distant features are from each other [58,59]. While the algorithmic methodology is greedy and straightforward, homology in distance methods of these types cannot be evaluated, short-circuiting the retrodiction enterprise. Similarly, the evolutionary significance of string alphabets cannot be evaluated either. There is no clear functional or structural rationale that would link k-mer statistics to molecular functions. Despite these difficulties, k-mers seem to extract significant evolutionary information.
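
    The core of a k-mer profile comparison can be sketched as follows: count overlapping k-mers, normalize the counts to frequencies, and compare two profiles with the Jensen-Shannon divergence computed with base-2 logarithms so that values fall between 0 and 1. This is a simplified illustration and not the published FFP implementation, which involves further steps such as choosing an optimal feature length. The sequences and the choice of k are hypothetical.

```python
from collections import Counter
from math import log2


def kmer_profile(sequence, k):
    """Relative frequencies of all overlapping k-mers in a sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than k")
    return {kmer: count / total for kmer, count in counts.items()}


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two frequency profiles."""
    divergence = 0.0
    for kmer in set(p) | set(q):
        pi, qi = p.get(kmer, 0.0), q.get(kmer, 0.0)
        mi = (pi + qi) / 2
        if pi:
            divergence += 0.5 * pi * log2(pi / mi)
        if qi:
            divergence += 0.5 * qi * log2(qi / mi)
    return divergence


# Hypothetical sequences.
first = kmer_profile("MKVLAAGIVKVLAAG", k=3)
second = kmer_profile("MKVLSAGIVRVLSAG", k=3)
unrelated = kmer_profile("WWWWWWWWWW", k=3)

same = jensen_shannon(first, first)
partial = jensen_shannon(first, second)
disjoint = jensen_shannon(first, unrelated)
print(same, round(partial, 3), disjoint)
```

    Identical profiles give a divergence of 0 and profiles with no shared k-mers give 1. Note that nothing in the calculation tests whether a shared k-mer is homologous, which is the limitation the text raises for distance methods of this type.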

    Fig. 1.5A shows a whole-proteome ToL reconstructed from k-mer alphabets with the FFP method [59]. The ToL was rooted using the outgroup comparison method with hypothetical ancestors, sets of extant proteomes with sequences that have been shuffled. The assumption is that the ancestor of diversified life had a stochastic makeup, which is highly unlikely given significant biases already present in the prebiotic pool that gave rise to polypeptides and primordial nucleic acid molecules. In addition, the possibility of an early cellular origin of viruses pushes the universal cellular ancestor back in time, further dispelling the stochastic makeup of the primordial lineage that would hold the outgroup of Fig. 1.5A (see discussion below). These considerations question the validity of the most ancient splits of the tree, especially the sisterhood of Archaea and Bacteria. The rooted tree shows splits of major lineages appearing at its base, indicating an evolutionarily deep burst of organismal diversity. The tree also shows particularly long branches in Eukarya, suggesting remarkable accumulation of diverse k-mer vocabularies in the evolution of higher eukaryotic taxa. One notable topology is the sisterhood of plants and animals, which runs counter to the canonical opisthokont topology that traditionally unifies fungi and animals. Protists also emerge as seven smaller groups associated with other kingdoms, which contrasts with their placement in other ToL reconstructions. Despite these difficulties, it is notable that the optimal feature length for proteome sequence was 10 amino acids or longer in these studies (13 was used to build the ToL), suggesting that protein segments of the typical size of a protein loop were able to harvest the most significant historical information with the k-mer distance method.

    Figure 1.5 Phylogenomic ToLs reconstructed using alignment-free methodologies. (A) A whole-proteome tree describing k-mer string alphabets of 4023 proteomes suggests both an evolutionarily deep burst of organismal diversity and the generation of remarkably diverse vocabularies in higher eukaryotic taxa [59]. The ToL is rooted in alphabets of randomized genomes representative of extant genome diversity. Circles portray collapsed clades, which are colored according to kingdoms. (B) A rooted tree of cellular life built from a genomic census of 1924 terminal GO terms of molecular function in 248 free-living organisms supports the early diversification of Archaea and a 3-superkingdom ToL [61]. The tree of functionomes was rooted using the generality criterion with Lundberg. The phylogenetic data matrix used to build the phylogenomic tree describes the genomic abundance of GO terms retained after exclusion of a set of 115 terms enriched in horizontal genetic transfer identified with the hypergeometric distribution (P<.05). The unrooted phylogenetic network on the right was reconstructed with the neighbor-net algorithm of SplitsTree. Terminal nodes of Archaea, Bacteria, and Eukarya are labeled in red, blue, and green, respectively. Reticulations reveal homoplasy-driven conflicts in phylogenetic reconstruction. GO, Gene ontology; ToL, Tree of Life.

    The availability of structural and functional data associated with genomic sequences offers the opportunity to use biologically explicit, more conserved, and more reliable molecular features for deep evolutionary studies. In this regard, a survey of GO terms of molecular function has been already used to explore the evolution of functionomes [60–63]. Fig. 1.5B shows a ToL and its corresponding network built using standard methods of phylogenetic reconstruction [61]. This includes using multistate taxa to represent serial homologies depicting an abundance of terminal GO terms, building unrooted trees with maximum parsimony and rooting them a posteriori, and using the generality criterion of rooting with Lundberg. The reconstructions show GO data holds significant evolutionary information, especially when parasitic and obligate parasitic organisms and GO terms exhibiting significant horizontal gene transfer (HGT) were excluded from the analysis. The ToL described a tripartite cellular world that was rooted paraphyletically in Archaea in lineages that were thermophilic, a result supported by ToLs built from domain structure and other phylogenetic reconstructions conducted over a decade of research exploration (reviewed by Caetano-Anollés et al. [64]). Monophyletic relationships of major eukaryal groups were strong and revealed a close relationship between plants and animals similar to that of the ToL of Fig. 1.5A. The use of the genealogical sorting index to measure the degree of monophyly showed significantly high monophyly degrees for half of all organismal groups analyzed (indexed in Fig. 1.5B), spanning the entire tree and supporting the validity of the reconstructions. A network reconstructed with the neighbor-net algorithm uncovered patterns of reticulation in the data with this distance approach (inset of Fig. 1.5B). Measures of evolutionary reticulation were minimum in Eukarya and maximum in Bacteria. 
However, the massive role of HGT in microbes did not materialize in the reconstruction. Delta (δ)-scores for individual taxa, which measure reticulation levels on a scale from 0 (absence of reticulations) to 1 (complete absence of vertical signal), ranged from 0.16 for mammals to 0.39 for Acidobacteria, Bacteroidetes, Gemmatimonadetes, and Verrucomicrobia. While bacteria were, as expected, the largest contributors to genetic exchange, δ-scores in general showed that GO terms exhibited limited HGT. The network also showed long branches associated with Eukarya, suggesting a significant generation of functional novelty in these lineages, matching the results obtained with k-mers in Fig. 1.5A.
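
    To make the 0-to-1 scale of the δ-scores concrete, the following minimal Python sketch computes per-taxon scores from a pairwise distance matrix. It assumes the commonly used quartet-based definition of δ: for each quartet of taxa, the three pairwise distance sums are ranked and δ is (largest − middle) / (largest − smallest), and a taxon's score is the mean over all quartets that contain it. The text does not state how the reported scores were computed, so this definition, the function names `quartet_delta` and `taxon_delta_scores`, and the two toy matrices are illustrative assumptions rather than the method of the original study.

```python
from itertools import combinations


def quartet_delta(d, x, y, u, v):
    """Delta for one quartet (assumed quartet-based definition).

    The three possible pairings of four taxa give three distance sums.
    On a perfectly tree-like (additive) matrix the two largest sums are
    equal and delta is 0; delta approaches 1 as conflict increases.
    """
    sums = sorted(
        (d[x][y] + d[u][v], d[x][u] + d[y][v], d[x][v] + d[y][u]),
        reverse=True,
    )
    spread = sums[0] - sums[2]
    if spread == 0:  # all three sums equal: no resolvable signal, score as 0
        return 0.0
    return (sums[0] - sums[1]) / spread


def taxon_delta_scores(d):
    """Mean quartet delta for every taxon in a square distance matrix."""
    n = len(d)
    totals = [0.0] * n
    counts = [0] * n
    for quartet in combinations(range(n), 4):
        delta = quartet_delta(d, *quartet)
        for taxon in quartet:
            totals[taxon] += delta
            counts[taxon] += 1
    return [totals[i] / counts[i] if counts[i] else 0.0 for i in range(n)]


# Toy additive distances for the tree ((0,1),(2,3)) with unit branch lengths.
tree_like = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]

# Toy distances with maximal conflict: two pairings are equally good.
conflicting = [
    [0, 1, 1, 2],
    [1, 0, 2, 1],
    [1, 2, 0, 1],
    [2, 1, 1, 0],
]

tree_scores = taxon_delta_scores(tree_like)
conflict_scores = taxon_delta_scores(conflicting)
```

    The loop visits every quartet, so the cost grows with the fourth power of the number of taxa; for data sets with hundreds of taxa, a dedicated implementation would be preferable to this sketch.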

    The structure of proteins and nucleic acids also carries deep evolutionary information, as our research has repeatedly shown (beginning with Caetano-Anollés [65,66] and Caetano-Anollés and Caetano-Anollés [54]). In the case of proteins, a census of structural domains identified with HMMs of structural recognition can be used to generate phylogenetic data matrices of occurrence or abundance of domains in proteomes, with which to build ToLs describing the evolution of proteomes. Fig. 1.6 describes a ToL reconstructed from protein domain structures defined at the homologous superfamily (H) level of CATH and rooted using the generality criterion and Lundberg [67]. Together with SCOP [18], CATH is a gold standard of protein domain classification [68]. The study analyzed 41 archaeal, 189 bacterial, and 65 eukaryal genomes from free-living organisms to avoid the effects of reductive evolution induced by parasitic and obligate parasitic relationships. The ToL of Fig. 1.6 is one of many built with an equal number of taxa randomly sampled from the initial proteome set. This approach avoided the effects of unequal taxon sampling, which are known to alter phylogenetic inference [69]. Collectively, ToLs again described a tripartite cellular world that was rooted paraphyletically in Archaea, made plants and animals sister taxa, and placed fungi and other eukaryotes in more basal positions of the tree. To determine whether the time of origin (evolutionary age) of domain structures affected phylogenetic signal, ToLs were also built from character sets that included either very ancient domains common to all superkingdoms (appearing 3.8–3.1 Gya), ancient characters appearing during the early rise of superkingdoms (3.1–2.5 Gya), or younger characters (2.5–0 Gya). The time of origin of characters was derived from a phylogenomic tree of structural domains that was placed within a geological time scale with a clock of fold structures. 
Remarkably, the different character sets produced ToLs with significantly rearranged topologies, with the very ancient and ancient character sets rooting the trees in Archaea and the younger character set rooting the tree in Bacteria. The exercise indicates that historical heterogeneity strongly affects ToL reconstruction and that this historical effect must be identified and removed. One solution is to avoid building ToLs altogether and instead reconstruct trees of parts (e.g., trees of structural domains), which are less affected by violations of character independence, and derive phylogenetic statements about the diversification of wholes (e.g., proteomes) from them [12]. We pursued this approach when placing viruses in a ToL and when exploring their origin and evolution.
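
    The equal taxon sampling described above can be sketched in a few lines of Python. The group sizes (41, 189, and 65) come from the text; the caption of Figure 1.6 mentions 123 equally sampled proteomes, which would correspond to 41 per superkingdom if each sample is capped at the size of the smallest group. That cap, the placeholder proteome labels, the number of replicates, and the function names `sample_equal_taxa` and `replicate_samples` are assumptions made for illustration, not details of the original study.

```python
import random


def sample_equal_taxa(groups, rng):
    """Draw the same number of taxa, without replacement, from every group.

    The per-group sample size is assumed to be the size of the smallest
    group, so no group is over-represented in the resulting taxon set.
    """
    k = min(len(members) for members in groups.values())
    return {name: rng.sample(members, k) for name, members in groups.items()}


def replicate_samples(groups, n_replicates, seed=0):
    """Generate several independent, equally sampled taxon sets."""
    rng = random.Random(seed)  # fixed seed so the replicates are reproducible
    return [sample_equal_taxa(groups, rng) for _ in range(n_replicates)]


# Placeholder proteome labels; only the group sizes are taken from the text.
groups = {
    "Archaea": [f"A{i}" for i in range(41)],
    "Bacteria": [f"B{i}" for i in range(189)],
    "Eukarya": [f"E{i}" for i in range(65)],
}

# Hypothetical number of replicates; each would be used to build one ToL.
replicates = replicate_samples(groups, n_replicates=5)
total_per_replicate = [
    sum(len(taxa) for taxa in replicate.values()) for replicate in replicates
]
```

    Building a tree from each replicate and comparing the resulting topologies is one way to check that a conclusion, such as the position of the root, does not depend on a particular draw of taxa.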

    Figure 1.6 Trees of proteomes reconstructed from the abundance of structural domains defined at the homologous superfamily (H) level of the CATH protein classification reveal the historical heterogeneity of phylogenetic characters. Circular cladograms describe the evolution of 123 equally sampled proteomes from superkingdoms Archaea, Bacteria, and Eukarya. The rooted trees were generated from abundance counts of 2221 H fold structures or subsets representing structures with different times of origin (most ancient, ancient, and younger), with time expressed in billions of years ago (Gya). Arrows indicate the most ancient supergroup in the rooted tree. Source: Data from Bukhari SA, Caetano-Anollés G. Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes. PLoS Comput Biol 2013;9(3):e1003009.

    1.7 Untangling the origin and evolution of viruses with structural phylogenomics

    The origin and evolution of viruses remain a vexing problem for scientific inquiry. The current COVID-19 pandemic and its widespread effects on human health and the global economy have raised both public and scientific interest as well as many foundational questions related to viruses. Are they living or nonliving? Are
