Phytozome v10: Info

Eucalyptus grandis v1.1 (Eucalyptus)

Overview

A major challenge in achieving a sustainable energy future is understanding the molecular basis of superior growth and adaptation in woody plants suitable for biomass production. Eucalyptus species are among the fastest-growing woody plants in the world, with mean annual increments of up to 100 cubic meters per hectare. Eucalyptus is the most valuable and most widely planted genus of plantation forest trees in the world (ca. 18 million hectares) due to its wide adaptability, extremely fast growth rate, good form, and excellent wood and fiber properties.

Eucalyptus is also listed as one of the U.S. Department of Energy's candidate biomass energy crops. Genome sequencing is essential for understanding the basis of its superior properties and for extending these attributes to other species. Genomics will also allow us to adapt Eucalyptus trees for green energy production in regions (such as the Southeastern USA) where it cannot currently be grown. The unique evolutionary history, keystone ecological status, and adaptation to marginal sites make Eucalyptus an excellent focus for expanding our knowledge of the evolution and adaptive biology of perennial plants.

(from JGI, the Joint Genome Institute)

Statistics

This is a release of the initial 8X mapped Eucalyptus grandis BRASUZ1 genome assembly and a version 1.1 annotation.

Genome
        - The main genome assembly is approximately 691 Mb arranged in 4,952 scaffolds
        - Approximately 641 Mb arranged in 32,762 contigs (~7.3% gap)
        - Scaffold N50 (L50) = 5 (53.9 Mb)
        - Contig N50 (L50) = 2,261 (67.2 kb)
        - 300 scaffolds are > 50 kb in size, representing approximately 94.2% of the genome
Loci
        - 36,376 total loci containing protein-coding transcripts
        - 33,917 loci containing protein-coding transcripts on the 11 main linkage groups/chromosome assemblies (93% of total above)
Alternative Transcripts
        - 9,939 total alternatively spliced transcripts
        - 9,741 alternatively spliced transcripts on the 11 main linkage groups/chromosome assemblies (98% of total above)
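
The N50 (L50) figures above are reported as a count followed by a length: the number of sequences needed to cover at least half of the assembly, and the length of the shortest sequence in that set. A minimal sketch of that calculation, assuming only a plain list of sequence lengths (the function name `n50_l50` and the toy lengths are illustrative and are not part of the release):

```python
def n50_l50(lengths):
    """Return (count, length) in the order used on this page.

    count  : number of longest sequences needed to reach half the total length
    length : length of the shortest sequence in that set
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return count, length
    raise ValueError("lengths must be non-empty")


# Toy lengths, not real scaffold sizes.
toy = [50, 40, 30, 20, 10]
print(n50_l50(toy))  # (2, 40)
```

Note that other sources swap the two labels, so it is worth checking which convention a given report uses before comparing assemblies.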

This new annotation was produced by manually filtering 8620 low-confidence gene models from the original v1.0 annotation. Stricter c-score and protein homology coverage thresholds were employed in this case, especially when considering partial transcripts missing a modeled start or stop codon. EST support was also examined to check that aligned coverage followed the same intron splicing pattern as the gene model. Filtered gene models were removed from consideration in the Phytozome v8 gene family generation, but remain searchable in GBrowse and can be displayed as an additional transcript track. Associated FASTA and annotation info files are available on the FTP site.

Note: As of August 22, 2011, accession IDs and transcript names have been updated to better reflect gene locus location in the current v1.0 assembly:

  1. for gene loci as defined by the primary transcript dataset on the 11 main chromosome linkage groups
        - scaffold_1 is designated A, scaffold_2 is designated B, ... scaffold_11 is designated K
        - loci are numbered sequentially on each linkage group, beginning with 00001
        - primary transcripts receive a .1 suffix
        - alternatively spliced transcripts receive the suffix .2, .3, etc. as needed
  2. for gene loci on the remaining scaffolds (12 and above)
        - all scaffolds are designated L to indicate they are not in the main chromosome-level assembly
        - all loci are numbered sequentially, beginning with 00001
        - primary transcripts receive a .1 suffix
        - alternatively spliced transcripts receive the suffix .2, .3, etc. as needed
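
The naming rules above can be sketched as a small helper. This is an illustration under stated assumptions, not the code used to generate the release: the `Eucgr.` prefix and the exact way the letter, locus number, and suffix are joined are assumptions, since the page describes the components but does not spell out the full ID layout. The function names are hypothetical.

```python
import string

MAIN_LINKAGE_GROUPS = 11  # scaffold_1 .. scaffold_11 map to A .. K


def linkage_letter(scaffold_number):
    """Map a scaffold number to its letter: 1-11 -> A-K, 12 and above -> L."""
    if scaffold_number < 1:
        raise ValueError("scaffold numbers start at 1")
    if scaffold_number <= MAIN_LINKAGE_GROUPS:
        return string.ascii_uppercase[scaffold_number - 1]
    return "L"


def transcript_name(scaffold_number, locus_number, transcript_index=1,
                    prefix="Eucgr."):
    """Build a transcript name; prefix and layout are assumptions.

    locus_number is zero-padded to five digits (00001, 00002, ...).
    transcript_index 1 is the primary transcript; 2, 3, ... are
    alternatively spliced transcripts.
    """
    letter = linkage_letter(scaffold_number)
    return f"{prefix}{letter}{locus_number:05d}.{transcript_index}"


print(transcript_name(2, 1))       # primary transcript of the first locus on B
print(transcript_name(40, 123, 3)) # third transcript of a locus on a minor scaffold
```

The helper only formats a name. Assigning the sequential locus numbers, which run per linkage group for A to K and across all remaining scaffolds for L, would have to happen before it is called.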

The older accession IDs (e.g. Egrandis_v1_0.052539m) remain available for keyword searching and are displayed throughout Phytozome as E. grandis gene aliases. A comprehensive list of these old-to-new accession ID mappings is available on the Eucalyptus grandis FTP site as a file named Egrandis_201_synonym.txt.
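
A lookup from old to new accession IDs could be built from the synonym file along these lines. The file layout is an assumption: this sketch treats it as tab-delimited with the old ID in one column and the new ID in another, and the column positions are parameters precisely because the page does not document them. The function name `load_synonyms` and the placeholder new ID are hypothetical.

```python
import csv
import io


def load_synonyms(handle, old_col=0, new_col=1):
    """Read a tab-delimited synonym table into an {old_id: new_id} dict.

    Column positions are assumptions; adjust them after inspecting the file.
    Rows that are too short to hold both columns are skipped.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) <= max(old_col, new_col):
            continue
        mapping[row[old_col]] = row[new_col]
    return mapping


# In-memory stand-in for the real file; the new ID is a placeholder.
sample = io.StringIO("Egrandis_v1_0.052539m\tNEW_ID_PLACEHOLDER\n")
synonyms = load_synonyms(sample)
print(synonyms["Egrandis_v1_0.052539m"])
```

For the real file, the same function could be called on `open("Egrandis_201_synonym.txt", newline="")` once the column order has been confirmed.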

Sequencing, Assembly, and Annotation
How was the assembly generated?
The genome was assembled with Arachne by Jeremy Schmutz at HudsonAlpha.
How were repeats identified?
A de novo repeat library was made by running RepeatModeler (Arian Smit, Robert Hubley) on the genome to produce a library of repeat sequences. Sequences with Pfam domains associated with non-TE functions were removed from the library of repeat sequences and the library was then used to mask ~38% of the genome with RepeatMasker.
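
One way to sanity-check a masking figure such as the ~38% above is to count masked bases directly. The sketch below assumes the genome FASTA is soft-masked, with repeats written in lowercase, and that every base (including any N) counts toward the total. Both are assumptions about the file, and the function name `masked_fraction` is illustrative.

```python
import io


def masked_fraction(fasta_handle):
    """Return the fraction of sequence characters that are lowercase."""
    masked = 0
    total = 0
    for line in fasta_handle:
        if line.startswith(">"):
            continue  # skip FASTA header lines
        seq = line.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return masked / total if total else 0.0


# Toy record: 4 of 10 bases are lowercase.
demo = io.StringIO(">toy\nACGTacgtAC\n")
print(masked_fraction(demo))  # 0.4
```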
How were ESTs aligned?
We aligned ~2.9M E. grandis EST sequences and ~2.4M EST sequences from sister Eucalyptus species using Brian Haas's PASA pipeline, which aligns each EST to its best location in the genome via GMAP and then filters hits to ensure proper splice boundaries.
How were plant proteins aligned?
Rice, Arabidopsis and grapevine proteins were downloaded from MSU, TAIR and Genoscope, respectively. Soybean proteins were generated in our internal annotation pipeline at the JGI. All proteins were aligned to the soft-masked genome using gapped BLASTX; high-scoring segment pairs (HSPs) are shown. Note that gapped BLAST was used to increase sensitivity, so that in many cases the HSP (shown in orange) spans adjacent exons and the intervening intron(s). Also, small exons are often missed.
Gene prediction
To produce the current "egrandis1.0" gene set, we used the homology-based FgenesH and GenomeScan predictions. The best gene prediction at each locus was picked and integrated with EST assemblies using the PASA program. The gene set shown on the browser was generated from the above input gene models by Richard D. Hayes at JGI.
The gene prediction pipeline has the following components: proteins from diverse angiosperms and ~260,000 EST assemblies (from ~2.9M filtered E. grandis ESTs and ~2.4M EST sequences from sister Eucalyptus species, assembled with PASA) were aligned to the genome, and their overlaps were used to define putative protein-coding gene loci. The corresponding genomic regions were extended by 1 kb in each direction and submitted to FgenesH (provided by Asaf Salamov at JGI) and GenomeScan, along with related angiosperm proteins and/or ORFs from the overlapping EST assemblies. FgenesH identifies likely protein-coding exons, favoring regions that align well to the given homologous proteins.
These two sets of predictions were integrated with expressed sequence information using PASA (Haas et al. 2003) against ~260,000 Eucalyptus EST assemblies. The results were filtered to remove genes identified as transposon-related.
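
The locus-extension step described above, in which each putative locus is widened by 1 kb on both sides before gene prediction, can be sketched as follows. The 1-based inclusive coordinates and the clamping at scaffold ends are assumptions; the page does not say how loci near a scaffold boundary were handled. The function name `extend_locus` is hypothetical.

```python
def extend_locus(start, end, scaffold_length, flank=1000):
    """Extend a locus by `flank` bp on each side, clamped to the scaffold.

    Coordinates are treated as 1-based and inclusive (an assumption).
    """
    if not (1 <= start <= end <= scaffold_length):
        raise ValueError("expected 1 <= start <= end <= scaffold_length")
    new_start = max(1, start - flank)
    new_end = min(scaffold_length, end + flank)
    return new_start, new_end


print(extend_locus(5000, 6000, 100000))  # (4000, 7000)
print(extend_locus(500, 2500, 3000))     # (1, 3000), clamped at both ends
```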
Contacts
Zander Myburg (University of Pretoria) (email: zander DOT myburg AT up DOT ac DOT za)
Dario Grattapaglia (EMBRAPA and Catholic University of Brasilia) (email: dario AT cenargen DOT embrapa DOT br)
Jerry Tuskan (Oak Ridge National Laboratory / JGI) (email: tuskanga AT ornl DOT gov)
Associated Publications
Myburg AA, Grattapaglia D, Tuskan GA, Hellsten U, Hayes RD, Grimwood J, Jenkins J, Lindquist E, Tice H, Bauer D, Goodstein DM, Dubchak I, Poliakov A, Mizrachi E, Kullan AR, Hussey SG, Pinard D, van der Merwe K, Singh P, van Jaarsveld I, Silva-Junior OB, Togawa RC, Pappas MR, Faria DA, Sansaloni CP, Petroli CD, Yang X, Ranjan P, Tschaplinski TJ, Ye CY, Li T, Sterck L, Vanneste K, Murat F, Soler M, Clemente HS, Saidi N, Cassan-Wang H, Dunand C, Hefer CA, Bornberg-Bauer E, Kersting AR, Vining K, Amarasinghe V, Ranik M, Naithani S, Elser J, Boyd AE, Liston A, Spatafora JW, Dharmwardhana P, Raja R, Sullivan C, Romanel E, Alves-Ferreira M, Külheim C, Foley W, Carocha V, Paiva J, Kudrna D, Brommonschenkel SH, Pasquali G, Byrne M, Rigault P, Tibbits J, Spokevicius A, Jones RC, Steane DA, Vaillancourt RE, Potts BM, Joubert F, Barry K, Pappas GJ, Strauss SH, Jaiswal P, Grima-Pettenati J, Salse J, Van de Peer Y, Rokhsar DS, Schmutz J. The genome of Eucalyptus grandis. Nature. 2014 Jun 19;510(7505):356-62.