I have recently started working on a sugarcane genome assembly project, so I figured I would share some interesting points from my literature review. Sugarcane is obviously a very economically important crop, having applications in both food and energy production. Despite this economic importance, genomic resources are lacking for sugarcane due to the complexity of the genomes of modern cultivars. Let's look into the specifics of why that is.
Sugarcane is in the Poaceae family, the same mega-grass family as major crops such as rice, barley, maize, and wheat. When broadly referring to sugarcane, people are often talking about the Saccharum genus, in which there are a few prominent species. From my reading, it seems that the two key species that have influenced the modern production of sugarcane are S. officinarum and S. spontaneum, with many cultivars being interspecific hybrids between them. Specifically, around the late 19th century, important traits from S. spontaneum such as disease resistance and high yield were introgressed into S. officinarum varieties which already had high sugar content. Given that sugarcane has relatively long breeding cycles (10-15 years), there are relatively few generations separating modern varieties from these interspecific hybridization events 1,2,3.
In trying to anticipate potential challenges for genome assembly, this taxonomy raises some red flags. Interspecific hybrids can often be challenging to assemble given the relatively high degree of structural variation (and heterozygosity in general). Sugarcane takes this notion to the extreme.
Ploidy and Genome Size
Here are some numbers off the bat:
| S. spontaneum | 2n = 5x-16x = 40-128 | Octoploid |
Modern interspecific hybrids between S. officinarum and S. spontaneum exhibit allopolyploidy as well as aneuploidy. They are allopolyploids in that distinct portions of the genome are contributed by these two ancestors. Specifically, about 70-80% (I've also read 75-85%) of a modern cultivar's genome is of S. officinarum ancestry, while the rest comes from S. spontaneum 1. Perhaps most interesting is the aneuploidy. For lack of a better way of putting it, this means that the number of homologous/homeologous copies differs from one chromosome group to the next (e.g. chromosome 1 may have 10 homologous/homeologous copies, while chromosome 2 has 13). This is best visualized in the diagram below provided by Grivet et al., where each row represents a different chromosome group.
"Schematic organization of the genome of a current sugarcane cultivar. Each bar represents a chromosome; yellow coloring represents regions originating from S. officinarum and green coloring from S. spontaneum. Chromosomes aligned within the same row are homologous (or homoeologous). The key characteristics of this genome are the high level of ploidy, the aneuploidy, the bispecific origin of the chromosomes (with at least three-quarters inherited from S. officinarum), the existence of structural differences between chromosomes of the two origins, and the occurrence of interspecific chromosome recombinants." 1
The above diagram represents one particular cultivar, though the exact structure may change from individual to individual.
Naturally, along with this ploidy comes a large genome size. The rough genome size estimates I have read for modern cultivars are around 10 Gbp (for context, the human genome is roughly 3.3 Gbp). This puts sugarcane in the same class as other monster genomes that have recently been assembled by some of my colleagues here at Johns Hopkins, including loblolly pine and hexaploid wheat.
Genome Assembly Strategy
Given the immense genome size of modern sugarcane cultivars as well as the aneuploidy/polyploidy, some refer to sugarcane as the hardest economically important genome to assemble. A quick NCBI search yields one assembly for the Saccharum genus with a contig N50 of 8.4 kbp and a total assembly size of ~1 Gbp. Though this is an almost comically low N50, I think it speaks to the difficulty of assembling this genome more than anything else.
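For readers who have not met the N50 statistic before, here is a minimal sketch of how it is usually computed: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The function name and the toy contig lengths are my own illustration, not values from the NCBI assembly.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the largest length L such that contigs of length >= L
    together account for at least half of the total assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical toy assembly (lengths in arbitrary units), for illustration only.
toy_contigs = [2, 3, 4, 5, 6, 10]
print(n50(toy_contigs))
```

A low N50 therefore means that most of the assembled sequence sits in short contigs, which is why 8.4 kbp reads as heavily fragmented for a genome of this size.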
Obtaining long-read data (such as PacBio) is by itself quite expensive and time-consuming. To achieve 40X coverage of a ~10 Gbp genome, a decent starting place for noisy-read assemblers such as FALCON and Canu, one must obtain 400 Gbp of sequencing data! Then, one must deal with the unexpected behavior of genome assemblers given this genome size. I have already had some issues with k-mer counting using Canu (perhaps more on this later). The aforementioned pine and wheat genomes are great starting places for me to research how to work with exceptionally large genomes.
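The 400 Gbp figure is just genome size multiplied by target coverage. A quick sketch of that arithmetic follows; the 10 kbp mean read length is purely an assumption I picked to get a feel for the number of reads involved, not a property of any particular sequencing run.

```python
def required_bases(genome_size_bp, coverage):
    """Total sequenced bases needed to reach a given average coverage."""
    return genome_size_bp * coverage


def required_reads(total_bases, mean_read_length_bp):
    """Number of reads needed, rounded up (ceiling division on integers)."""
    return -(-total_bases // mean_read_length_bp)


GENOME_SIZE_BP = 10_000_000_000   # ~10 Gbp estimate for modern cultivars
TARGET_COVERAGE = 40              # 40X
MEAN_READ_LENGTH_BP = 10_000      # assumed mean read length (hypothetical)

total = required_bases(GENOME_SIZE_BP, TARGET_COVERAGE)
reads = required_reads(total, MEAN_READ_LENGTH_BP)

print(f"{total // 10**9} Gbp of sequence")
print(f"{reads:,} reads at {MEAN_READ_LENGTH_BP:,} bp each")
```

Under that read-length assumption, this comes to tens of millions of long reads, which helps explain why tools that behave well on smaller genomes can start to struggle.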
The most interesting and challenging aspect of this genome assembly project is the ploidy. For those not familiar with genome assembly, I won't get into it here, but polyploid genome assembly presents many challenges compared to diploid genome assembly. To mitigate this, a common strategy when working with polyploids is to sequence a diploid progenitor of the species in question. For example, genomic resources for such a progenitor species in wheat were used for the hexaploid wheat assembly. However, such a resource is not currently available for the progenitors of modern sugarcane cultivars. This is presumably because both progenitor species are polyploids (see Table 1) and would be difficult to assemble in their own right.
Sugarcane, however, is not completely devoid of genetic resources. There is an abundance of linkage maps available for both progenitor Saccharum species as well as a few modern cultivars 4,5. Here is a map of Saccharum spontaneum 'SES 208' from Al-Janabi, Salah M., et al., published in 1993.
And here are just a few linkage groups of the Louisiana cultivar from a more recent (2016) study from Liu, Pingwu, et al.
I will certainly be investigating how we might be able to leverage the abundance of linkage maps available in the Saccharum genus to aid with the assembly process, specifically for anchoring whatever contigs we might produce. Of course, this will still be pretty challenging since any assembly we produce will probably be fairly fragmented, and a short contig is less likely to contain a mapped marker, which means fewer contigs will be anchored using linkage maps.
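To make the anchoring idea concrete, here is a rough sketch of one simple way it could work, not a description of any existing tool. It assumes we already know which map markers align to which contigs; each contig is then assigned to the linkage group that most of its markers belong to and ordered by the median genetic position (cM) of those markers. All names and data below are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import median


def anchor_contigs(marker_positions, contig_markers):
    """Assign contigs to linkage groups and order them by genetic position.

    marker_positions: marker name -> (linkage group, position in cM)
    contig_markers:   contig name -> list of marker names aligned to it
    Returns:          linkage group -> contigs ordered by median cM
    """
    placed = defaultdict(list)
    for contig, markers in contig_markers.items():
        hits = [marker_positions[m] for m in markers if m in marker_positions]
        if not hits:
            continue  # no mapped markers: the contig stays unanchored
        # Majority vote over the linkage groups of the contig's markers.
        group, _ = Counter(g for g, _ in hits).most_common(1)[0]
        position = median(cm for g, cm in hits if g == group)
        placed[group].append((position, contig))
    return {g: [c for _, c in sorted(items)] for g, items in placed.items()}


# Hypothetical markers and contigs, for illustration only.
marker_positions = {
    "m1": ("LG1", 5.0),
    "m2": ("LG1", 12.5),
    "m3": ("LG1", 40.0),
    "m4": ("LG2", 3.0),
}
contig_markers = {
    "contigA": ["m3"],
    "contigB": ["m1", "m2"],
    "contigC": ["m4"],
    "contigD": [],  # carries no marker, so it cannot be anchored
}

anchored = anchor_contigs(marker_positions, contig_markers)
print(anchored)
```

Note how contigD drops out entirely: with a fragmented assembly, many contigs will look like it, which is exactly the limitation described above. A real pipeline would also have to cope with markers that align to several homologous/homeologous copies, which this sketch ignores.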
1. Grivet, Laurent, and Paulo Arruda. "Sugarcane genomics: depicting the complex genome of an important tropical crop." Current Opinion in Plant Biology 5.2 (2002): 122-127.
2. Butterfield, M. K., A. D’hont, and N. Berding. "The sugarcane genome: a synthesis of current understanding, and lessons for breeding and biotechnology." Proc S Afr Sug Technol Ass. Vol. 75. 2001.
3. Raboin, Louis-Marie, et al. "Analysis of genome-wide linkage disequilibrium in the highly polyploid sugarcane." Theoretical and Applied Genetics 116.5 (2008): 701-714.
4. Al-Janabi, Salah M., et al. "A genetic linkage map of Saccharum spontaneum L. ‘SES 208’." Genetics 134.4 (1993): 1249-1260.
5. Liu, Pingwu, et al. "Identification of quantitative trait loci controlling sucrose content based on an enriched genetic linkage map of sugarcane (Saccharum spp. hybrids) cultivar ‘LCP 85-384’." Euphytica 207.3 (2016): 527-549.