Friday, 2 March 2012

OrthoMaM: A database of orthologous genomic markers for placental mammal phylogenetics. (Database)

Authors: Vincent Ranwez (corresponding author) [1,2]; Frédéric Delsuc [1,2]; Sylvie Ranwez [3]; Khalid Belkhir [1,2]; Marie-Ka Tilak [1,2]; Emmanuel JP Douzery [1,2]

Background

Mammalian systematics has been a pioneering field in the use of molecular sequence data for inferring phylogenetic relationships. Molecular phylogenies at different levels of the mammalian evolutionary tree have accumulated since the seminal studies published in the early 1990s. Among the first genes to be used was the mitochondrial Cytochrome b (MT-CYB) [1], which has since become the "barcode" marker for mammals, with more than 20,000 sequences currently available representing about 2,000 species. The mitochondrial 12S rRNA gene was also considered early on, but its use was limited by its less straightforward alignment [2]. As the limits of single-gene phylogenies were acknowledged, the potential of complete mitochondrial genomes for reconstructing relationships among placental orders was explored early, but with somewhat limited success as judged a posteriori [3]. In parallel, the first efforts to identify single-copy orthologous nuclear markers for mammalian phylogenetics were based on conserved, large exons. Exon 1 of the Retinol Binding Protein 3 (RBP3), also known as the Interphotoreceptor Retinoid-Binding Protein (IRBP), was the first to be developed [4], later followed by exon 28 of the von Willebrand Factor gene (VWF) [5]. Since then, other nuclear genes have acquired the status of "standard" mammalian phylogenetic markers. This is the case, for example, of the intronless Recombination Activating Gene 1 (RAG1) and α-2B Adrenergic Receptor (ADRA2B) genes, the Growth Hormone Receptor (GHR) exon 10, the c-myc proto-oncogene (MYC) exon 3, and the Breast Cancer Associated protein 1 (BRCA1) exon 11. Whether used in single-gene phylogenies at first or in combination later [6, 7], this handful of markers has proven useful for revealing unsuspected clades at different levels of mammalian taxonomy, such as Afrotheria [8], Cetacea + Hippopotamidae [9], and the grouping of shrews and hedgehogs to the exclusion of moles within Eulipotyphla [10].

The choice of these useful markers has nevertheless been mainly empirical. In fact, their initial development depended almost entirely on the availability in public databases of human, murine, bovine, and canine sequences for primer design. This historical constraint on marker choice means that the phylogenetic informativeness of these genes is likely to be suboptimal for many of the phylogenetic studies in which they have been used [11]. Selecting genes with the appropriate resolving power for a given phylogenetic problem is a difficult task, and theoretical work has so far provided only limited guidance for this choice [12, 13]. In practice, however, it has long been recognized that there is an optimal evolutionary rate associated with a given phylogenetic question [14], and empirical procedures such as saturation plots [15] have been designed to evaluate the limits of the resolving power of a given molecular marker.
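As an illustration of the saturation-plot idea mentioned above, the following minimal sketch computes, for every pair of aligned sequences, the observed proportion of differing sites (p-distance) and a model-corrected distance. The Jukes-Cantor (JC69) correction is used here purely as an assumption for simplicity; it is not necessarily the procedure of reference [15], and the function names are hypothetical. Plotting the observed against the corrected distances gives one common form of saturation plot: a marker is saturated where the observed distance stops increasing while the corrected distance keeps growing.

```python
import math
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites, using only columns where both bases are A/C/G/T."""
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(p: float) -> float:
    """Jukes-Cantor corrected distance; infinite once p reaches the 0.75 plateau."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def saturation_points(alignment: dict) -> list:
    """Return (corrected, observed) distance pairs for all sequence pairs.

    `alignment` maps a taxon name to its aligned sequence (illustrative format).
    """
    points = []
    for (_, seq_1), (_, seq_2) in combinations(alignment.items(), 2):
        observed = p_distance(seq_1, seq_2)
        points.append((jc69_distance(observed), observed))
    return points
```

The resulting points can be passed to any plotting tool. With a richer substitution model the corrected distances would differ, but the reading of the plot is the same.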

In mammals, the first attempt at selecting multiple nuclear genes for tackling a circumscribed phylogenetic question was made by Murphy and co-workers, who specifically targeted genes scattered throughout the mammalian genome to resolve the earliest placental divergences [16]. Their pragmatic approach was based on the initial selection of exons long enough for easy PCR amplification from whole genomic DNA (> 200 bp) and for which the nucleotide identity between human and mouse ranged between 80 and 95%. This simple procedure succeeded in identifying a dozen new phylogenetically informative nuclear markers for resolving the long-standing question of the evolutionary relationships among placental orders [16].
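The two selection criteria described above (exon length above 200 bp and human-mouse nucleotide identity between 80 and 95%) can be sketched as a simple filter. This is only an illustration under stated assumptions: the function names are hypothetical, the two sequences are assumed to be already aligned with "-" as the gap character, and identity is computed over ungapped columns, which may differ from how reference [16] measured it.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def passes_length_identity_filter(human: str, mouse: str,
                                  min_length: int = 200,
                                  min_identity: float = 80.0,
                                  max_identity: float = 95.0) -> bool:
    """Keep an exon if it is longer than min_length and its identity is in range."""
    exon_length = len(human.replace("-", ""))  # ungapped human exon length
    if exon_length <= min_length:
        return False
    return min_identity <= percent_identity(human, mouse) <= max_identity
```

For example, a 250 bp exon with one mismatch every ten sites (90% identity) passes the filter, whereas an identical pair (100% identity) or a 100 bp exon does not.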

Mammalian phylogenetics is now turning into phylogenomics [17] with large-scale sequencing initiatives such as the ENCODE project [18, 19] and the NISC Comparative Vertebrate Sequencing Program [20]. The availability of mammalian whole genome sequences provides a gold mine for the identification of new phylogenetic markers to further resolve the mammalian tree at different taxonomic levels. However, in this new genomic era, the main problem perhaps lies in determining orthology relationships among the different genomes. Bioinformatic tools such as INPARANOID [21] and OrthoMCL [22] have been developed for processing whole genome sequences, resulting in dedicated databases of clusters of orthologous groups for eukaryotes [23, 24]. A recent comparison of different orthology detection strategies has shown that phylogenetically based methods perform better than classical methods based on similarity searches [25]. Accordingly, the 2007 version of the EnsEMBL database implements such a phylogenetically based strategy, using maximum likelihood and tree reconciliation methods for orthology assignment among vertebrate genomes [26].

In an effort to synthesize all this genomic information, we built upon the EnsEMBL database to construct a mammal-centred database called OrthoMaM (Orthologous Mammalian Markers). Our aim is to provide a flexible resource for identifying new candidate markers for future use in mammalian phylogenetic studies. Similar approaches based on available genomic data have recently been conducted in plants [27] and ray-finned fishes [28], but they include their own determination of orthology in the corresponding bioinformatic pipelines. By relying directly on the EnsEMBL orthology assessment procedure, our approach has the advantage of being relatively easy to update as more complete mammalian genomes become available and annotated.

We focused on orthologous exons rather than on full-length transcripts in order to provide biologists with single continuous fragments potentially amplifiable from genomic DNA. Working with RNA extraction followed by RT-PCR would require a quality of tissue preservation that is not achieved in the vast majority of cases. Moreover, working with genomic DNA avoids the practical problems caused by potential differences in intron length among taxa during PCR amplification, provided that exons are specifically targeted. We selected individual exons more than 400 bp long. Increasing this arbitrary threshold might preclude the use of old tissue samples or museum specimens, which often contain degraded DNA. Conversely, lowering this threshold would involve keeping a total of 7,206 human, murine, and canine exons, the shortest of which is only 84 bp long. The minimum length for an exon to be included in the database was thus set to 400 bp because it offers a reasonable compromise between technical (PCR) constraints, the number of selected candidates, and subsequent sequencing efforts.
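The length criterion described above can be sketched as a small filter, together with a helper showing how the number of candidates changes with the threshold. This is an illustrative sketch only: the dictionary format, the function names, and the strict "longer than 400 bp" comparison are assumptions made for the example, not a description of the actual OrthoMaM pipeline.

```python
from typing import Dict, Iterable

MIN_EXON_LENGTH_BP = 400  # threshold stated in the text


def select_long_exons(exons: Dict[str, str],
                      min_length: int = MIN_EXON_LENGTH_BP) -> Dict[str, str]:
    """Keep exons whose sequence is strictly longer than min_length (assumed cutoff)."""
    return {exon_id: seq for exon_id, seq in exons.items()
            if len(seq) > min_length}


def count_by_threshold(exons: Dict[str, str],
                       thresholds: Iterable[int]) -> Dict[int, int]:
    """Count how many exons would be retained at each candidate threshold."""
    return {t: sum(len(seq) > t for seq in exons.values())
            for t in thresholds}
```

Calling `count_by_threshold` over a range of values makes the trade-off explicit: a lower threshold retains more candidates but includes very short exons, while a higher one yields fewer, longer fragments that are harder to amplify from degraded samples.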

Until now, the choice of phylogenetic markers for mammalian systematics has been governed more by historical constraints than by explicit criteria. This is why we developed a bioinformatics pipeline to derive evolutionary descriptors related to the potential phylogenetic informativeness of each exon. Quantifying the substitution pattern of genes is important for understanding the potential biases that might affect phylogenetic inferences [29]. Ideally, a good marker would have an optimal evolutionary rate for the given phylogenetic question, equilibrated and homogeneous base frequencies [30, 31], and a homogeneous distribution of site variability [11]. Yet the characteristics of a valuable marker depend on the goals of the study it will be used for, and certainly also vary from one investigator to another. Therefore, rather than subjectively selecting a subset of these candidate exons, we provide evolutionary descriptors for all of them. The values of our evolutionary descriptors are indicated for each exon on individual web pages with links to EnsEMBL for a full description of the loci. The corresponding sequence alignment and its associated maximum likelihood phylogenetic tree with model parameter estimates are also presented. Note that this phylogeny should be considered cautiously: it might not be optimal in terms of topology because of the use of a suboptimal model (see below), but it should nevertheless provide reliable estimates of model parameters [32]. In any case, markers cannot be selected directly on the basis of the ML topology they produce; this restriction is meant to prevent misuse of the database driven by a priori phylogenetic beliefs.

A number of these evolutionary characteristics can be queried directly through the web interface. The values and reliability of some of these descriptors depend strongly on one another. For example, the global GC content is strongly related to the GC percentage at the third codon position. Moreover, the variance of model parameter estimates will be reduced with longer …
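To make the two GC descriptors mentioned above concrete, here is a minimal sketch of how the global GC content and the GC content at third codon positions (GC3) could be computed for a coding sequence. The function names and the `frame` parameter are hypothetical, and the sketch assumes the caller knows the reading frame, since an exon does not necessarily start at the first position of a codon.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    return sum(b in "GC" for b in bases) / len(bases)


def gc3_fraction(coding_seq: str, frame: int = 0) -> float:
    """GC fraction at third codon positions.

    `frame` is the 0-based offset of the first complete codon (assumed known).
    """
    third_positions = coding_seq[frame + 2::3]
    return gc_fraction(third_positions)
```

For the toy sequence `ATGGCCAAG` (codons ATG, GCC, AAG), the global GC fraction is 5/9 while GC3 is 1.0, which shows how the two descriptors are computed from the same data yet can take different values.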
