Nowadays, assembling large eukaryotic genomes is more accessible than ever. This is largely because Oxford Nanopore Technologies (ONT) has made tremendous strides in improving the throughput of their sequencers.
For example, Michael et al. recently described the ability to obtain highly contiguous Arabidopsis thaliana assemblies from a single MinION flow cell. It's also becoming clear that the PromethION is reaching insanely high levels of throughput, producing terabases of data in just days.
In addition to this, recent advances in genome assemblers may allow researchers to keep up (at least somewhat) with these large volumes of data. Though I generally prefer the Canu assembler for my ONT data, when I am short on time I use wtdbg2, a super fast tool that can assemble large genomes in just hours, though it does not error-correct reads.
All of this is to say that highly contiguous long-read de novo draft assemblies are becoming more abundant, and researchers are faced with decisions about how to further scaffold these assemblies into pseudomolecules. In our recent preprint, we propose RaGOO as a way to quickly scaffold assembly contigs according to Minimap2 alignments to a closely related reference genome. Because RaGOO uses Minimap2 instead of nucmer or BLAST, it is very fast (Arabidopsis contigs in ~1 minute) and can handle large eukaryotic genomes (largest I have done is human). For your convenience, RaGOO will also align pseudomolecules back to the reference genome and call structural variants with an integrated version of Assemblytics.
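To make the idea of reference-guided ordering and orienting more concrete, here is a minimal Python sketch. It is not RaGOO's actual code. It assumes the alignments have already been parsed (for example, from Minimap2 output) into simple records, and it uses simplified rules that are my own assumptions for illustration: each contig goes to the reference chromosome it covers with the most aligned bases, and it takes its position and strand from its longest alignment on that chromosome. The names `Alignment` and `layout_contigs` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    """One contig-to-reference alignment (hypothetical record, not a RaGOO type)."""
    contig: str
    ref_chrom: str
    ref_start: int
    ref_end: int
    strand: str  # "+" or "-"


def layout_contigs(alignments):
    """Return {ref_chrom: [(contig, strand), ...]} ordered along the reference.

    Simplified, assumed rules: most aligned bases picks the chromosome;
    the longest alignment on that chromosome picks position and strand.
    """
    by_contig = defaultdict(list)
    for aln in alignments:
        by_contig[aln.contig].append(aln)

    placements = defaultdict(list)
    for contig, alns in by_contig.items():
        # Total aligned reference bases per chromosome for this contig.
        bases_per_chrom = defaultdict(int)
        for aln in alns:
            bases_per_chrom[aln.ref_chrom] += aln.ref_end - aln.ref_start
        chrom = max(bases_per_chrom, key=bases_per_chrom.get)

        # Use the longest alignment on the chosen chromosome as the anchor.
        on_chrom = [a for a in alns if a.ref_chrom == chrom]
        longest = max(on_chrom, key=lambda a: a.ref_end - a.ref_start)
        placements[chrom].append((longest.ref_start, contig, longest.strand))

    # Sort contigs by anchor position; the contig sequences are never modified.
    return {
        chrom: [(contig, strand) for _, contig, strand in sorted(items)]
        for chrom, items in placements.items()
    }


example = [
    Alignment("ctgA", "chr1", 5000, 9000, "-"),
    Alignment("ctgA", "chr2", 100, 600, "+"),
    Alignment("ctgB", "chr1", 0, 4000, "+"),
    Alignment("ctgC", "chr2", 1000, 3000, "+"),
]
layout = layout_contigs(example)
```

Note that the sketch only decides where each contig goes and which way it faces. The contigs themselves are left untouched, which is the property that matters for the reference-bias discussion.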
What about reference bias?
In my experience, people have been wary about using a reference to inform any stage of assembly. Reference-guided assembly/scaffolding could mask true biological variation between the genotype of the assembly and that of the reference, or it could pass errors in the reference on to the new assembly. Though this is a valid concern, I would argue that since current de novo assembly contigs are so contiguous, using a reference to order and orient them could be a solid option, especially if using a closely related reference of the same species. This is because RaGOO leaves de novo assembly contigs intact, and with highly contiguous assemblies, most biological variation, even structural, will be contained within these contigs.
To demonstrate this, we used our reference-guided scaffolding approach and a popular de novo Hi-C based scaffolding approach to scaffold a highly contiguous tomato genome assembly (Fig. 1). On the left, we see dotplots made by aligning the reference-guided (RaGOO) and de novo (SALSA2) scaffolds to the tomato reference genome. Our RaGOO pseudomolecules are perfectly syntenic with the reference genome, but that is, by construction, the expected result, since we used the same reference to inform scaffolding. On the other hand, the de novo scaffolds (not quite chromosome-scale) show some pretty wacky structural variation with respect to the reference genome. To help us see whether these events are indicative of misassemblies, we aligned the same Hi-C data back to the pseudomolecules/scaffolds. The Hi-C heatmaps on the right show that the RaGOO chromosome 12 seems structurally accurate, while the 12th longest de novo scaffold seems to have a bunch of misassemblies.
The misassemblies in the Hi-C de novo scaffolds may not come as a surprise to some of you. This technique is susceptible to these types of misassemblies, especially false inversions. What may be more surprising is how structurally accurate the reference-guided pseudomolecule is. However, as before, I would argue that since our input assembly was highly contiguous, and since these two genotypes are closely related, it is less surprising that we see these positive results.
Of course, there are many applications where de novo scaffolding is not only ideal, but the only option, such as in cases where there is no reference genome available. Though I still think Hi-C based scaffolding is a tremendous tool, one that I will be using for many future projects, I also think there are now many applications where a reference-guided approach may be just as good, if not better.
Update (June 4, 2019)
In version 1.1 of RaGOO, we have added a new misassembly correction module. Misassembly correction consists of two steps. First, we identify potential misassemblies by finding suspicious discrepancies in the alignments between the query and reference genomes. From these, we establish candidate contig breakpoints that would correct these discrepancies. Second, given these candidate breakpoints, we align sequencing reads (short or error-corrected long reads) to the contigs in order to validate the presence of a misassembly. Misassemblies should show exceptionally high or low coverage at breakpoints, whereas non-misassemblies should show normal coverage. This allows us to leverage the reference genome to identify potential misassemblies while having some validation that we are not simply masking true structural variation.
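The coverage check in the second step can be illustrated with a small Python sketch. This is not the RaGOO or MaSuRCA implementation. It assumes per-base read depth for a contig is already available as a non-empty list, and the function name `validate_breakpoints`, the window size, and the `low`/`high` thresholds are all hypothetical values chosen for illustration.

```python
from statistics import median


def validate_breakpoints(coverage, candidates, window=2, low=0.25, high=3.0):
    """Keep only candidate breakpoints whose local read depth looks abnormal.

    coverage:   per-base read depth along one contig (assumed non-empty)
    candidates: 0-based candidate breakpoint positions from step one
    window:     bases on each side of a breakpoint to average over (assumed)
    low, high:  hypothetical thresholds, as fractions/multiples of the median
    """
    typical = median(coverage)
    confirmed = []
    for pos in candidates:
        start = max(0, pos - window)
        end = min(len(coverage), pos + window + 1)
        local = sum(coverage[start:end]) / (end - start)
        # Exceptionally low or high depth supports a real misassembly;
        # normal depth suggests the contig is fine and should stay intact.
        if local < low * typical or local > high * typical:
            confirmed.append(pos)
    return confirmed


# A contig with typical depth 30 and a sharp dip around positions 10-14.
depth = [30] * 10 + [2] * 5 + [30] * 10
confirmed = validate_breakpoints(depth, candidates=[5, 12])
```

In this toy example, the candidate at position 5 sits in normally covered sequence and is rejected, while the candidate at position 12 falls in the low-coverage dip and is kept. A candidate flagged by the reference alignments alone is therefore never enough to break a contig.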
I would like to acknowledge Steven Salzberg and Aleksey Zimin, who helped with this feature of RaGOO. The core algorithms for misassembly correction are from components of Aleksey's MaSuRCA assembler.