Posted on Fri 06 September 2019 in genomics

"Polishing" is the process of using short accurate Illumina sequencing reads to correct errors in noisy draft genome assemblies. Here, I will try out a few popular short-read genome assembly polishing tools/techniques and compare their efficacy by looking at assembly k-mer content. We will start with a draft genome assembly of a modern cultivated tomato line. This line was sequenced with ONT and the reads were assembled with the wtdbg2 assembler (v2.3). We will then polish this input assembly with the following polishing tools:

1. freebayes/bcftools (v1.2.0-10-g8a0dbee/v1.9)
2. Racon (v1.3.3)
3. Pilon (v1.22)

Key to understanding the success of these polishers is establishing an independent dataset to measure assembly accuracy. For this particular accession, I have randomly divided a set of Illumina sequencing reads into two distinct subsets ("polishing" and "accuracy"), each amounting to roughly 40X coverage. We feed the "polishing set" into our polishing tools in order to correct our assemblies. We can then use the "accuracy set" to measure the relative accuracy of each polished genome. I will describe our accuracy measurement once we get there. Let's get started!

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid", palette="muted", color_codes=True)
MED_FONT = 16

%matplotlib inline


# Background¶

## The Tomato Genome¶

This particular accession is a modern inbred Solanum lycoperiscum cultivated line. For the purposes of this notebook, it is important to note that the genome should be very homozygous. Also, the genome size is around 950 Mbp. We sequenced this accession with a combination of MinION and GridION flowcells.

## The assembly¶

For those who are interested, the assembly was produced with the following commands:

wtdbg2 -f -t -g 950m -L 16873 -x ont -i reads.fasta -o asm
wtpoa-cns -i asm.ctg.lay.gz -fo asm.ctg.fasta
minimap2 -ax map-ont asm.ctg.fasta reads.fasta > alns.sam
samtools view -bS alns.sam -o alns.bam
samtools sort -@ 4 alns.bam -o alns.sorted.bam
samtools view alns.sorted.bam -o alns.sorted.sam
wtpoa-cns -d asm.ctg.fasta -i alns.sorted.sam -fo asm.ctg.cns.fasta


We can see that this assembly went through an extra consensus round with wtpoa-cns. This yielded an N50 of 1,408,448 bp and a total assembly size of 731,724,146 bp.

# Polishing¶

As stated before, we polished this wtdbg2 assembly with freebayes/bcftools, Pilon and Racon. The assembly was polished with our "polishing set" of Illumina reads, which were aligned to our assembly with bowtie2 using the following commands:

bowtie2 --local -p 8 -x asm.ctg.cns.fasta -U trimmed.both.renamed.fastq -S alns.sam


Freebayes/bcftools polishing was run with the same parameters as described in the VGP GitHub repo. For pilon, I used the "--fix bases" and "--diploid" flags. Finally, I ran Racon with default parameters (including "-u").

# Polishing Performance Comparison¶

Finally, we get to the good part: comparing the polished assemblies to see which polisher performed best. For this, we will use KAT. Given our independent "accuracy set" of Illumina reads, KAT will establish a database of k-mers. We can then compare k-mers in our assemblies with k-mers in our short reads to observe general trends in the assembly.

Using kat comp, we establish the following matrix for each of our polished assemblies.

Row index $i$ = k-mer frequency in the reads

Column index $j$ = k-mer frequency in the assembly

$e_{i, j} =$ the number of distinct k-mers with frequency $i$ in the reads and frequency $j$ in the assembly.

With this matrix, we can do cool things like look for k-mers in the assembly to that do not appear in the short reads. Or we can look for k-mers in the short reads that do not appear in the assembly. Let's take a closer look.

In [2]:
df_fb = pd.read_csv("fb_kat_comp-main.mx", sep=" ", skiprows=12, names=["V" + str(i) for i in range(1001)])
df_pi = pd.read_csv("pi_kat_comp-main.mx", sep=" ", skiprows=12, names=["V" + str(i) for i in range(1001)])
df_ra = pd.read_csv("ra_kat_comp-main.mx", sep=" ", skiprows=12, names=["V" + str(i) for i in range(1001)])
df_wt = pd.read_csv("wtdbg2_no_polish_comp-main.mx", sep=" ", skiprows=12, names=["V" + str(i) for i in range(1001)])


## Missing k-mers¶

First, we will check how (likely) true k-mers are missing from the assembly.

In [7]:
plt.figure(figsize=(15,7))
plt.plot(df_fb["V0"], label="wtdbg2 + Freebayes")
plt.plot(df_pi["V0"], label="wtdbg2 + Pilon")
plt.plot(df_wt["V0"], label="wtdbg2")
plt.plot(df_ra["V0"], label="wtdbg2 + racon")
plt.xlim(0,60)
plt.ylim(0,14000000)
plt.legend(prop={'size': 20})
plt.rc('xtick', labelsize=MED_FONT)
plt.rc('ytick', labelsize=MED_FONT)
plt.ylabel("Total k-mers")
plt.title("k-mers Missing From Assemblies")
plt.show()


To find the number of assembly k-mers that are missing from the reads, all we have to do is plot the first column of our matrix. Remember that the first column ( j=0 ) refers to k-mers that appear with frequency=0 in the assembly. Then, each element index in this vector will correspond to a frequency in the reads. Above, we plot this vector for each assembly. We also note that, given these curves, we can informally say that k-mers in the reads with frequency less than ~15 are probably error k-mers. However, k-mers that have a frequency >= 15 are expected to be present in an accurate assembled representation of our genome.

Firstly, we see that our unpolished assembly is missing the most k-mers. That is expected since the unpolished assembly will have the most assembly errors, and those errors will yield incorrect k-mers (k-mers that span the error) in place of correct k-mers. Next, we see that, of the polished assemblies, our Pilon assembly is missing the most k-mers. Finally, it is close, but the Racon assembly is missing a few more k-mers than the Freebayes assembly. Let's make the same plot, but zoom in again.

In [4]:
plt.figure(figsize=(15,7))
plt.plot(df_fb["V0"], label="wtdbg2 + Freebayes")
plt.plot(df_pi["V0"], label="wtdbg2 + Pilon")
plt.plot(df_wt["V0"], label="wtdbg2")
plt.plot(df_ra["V0"], label="wtdbg2 + racon")
plt.xlim(0,60)
plt.ylim(0,4000000)
plt.legend(prop={'size': 12})
plt.rc('xtick', labelsize=MED_FONT)
plt.rc('ytick', labelsize=MED_FONT)
plt.ylabel("Total k-mers")
plt.title("k-mers Missing From Assemblies")
plt.show()


## Present k-mers¶

Similarly, we can observe how many k-mers in each assembly are present in the short-reads k-mers. For this, we just sum columns 1 through n, since we don't care how frequent a k-mer is in the assembly.

In [8]:
plt.figure(figsize=(15,7))
plt.plot(np.sum(df_fb.values[:, 1:], axis=1), label="wtdbg2 + Freebayes")
plt.plot(np.sum(df_pi.values[:, 1:], axis=1), label="wtdbg2 + Pilon")
plt.plot(np.sum(df_ra.values[:, 1:], axis=1), label="wtdbg2 + Racon")
plt.plot(np.sum(df_wt.values[:, 1:], axis=1), label="wtdbg2")
plt.xlim(0,60)
plt.ylim(0,26000000)
plt.rc('xtick', labelsize=MED_FONT)
plt.rc('ytick', labelsize=MED_FONT)
plt.legend(prop={'size': 20})
plt.ylabel("Total k-mers")
plt.title("k-mers Present In Assemblies")
plt.show()


With regards to relative performance, the tools rank the same as before. In our unpolished assemblies, we have the fewest assembly k-mers present in the short-read k-mer database, whereas our Freebayes polished assembly has the most k-mers present in the short-read k-mer database. Let's zoom in again.

In [10]:
plt.figure(figsize=(15,7))
plt.plot(np.sum(df_fb.values[:, 1:], axis=1), label="wtdbg2 + Freebayes")
plt.plot(np.sum(df_pi.values[:, 1:], axis=1), label="wtdbg2 + Pilon")
plt.plot(np.sum(df_ra.values[:, 1:], axis=1), label="wtdbg2 + Racon")
plt.plot(np.sum(df_wt.values[:, 1:], axis=1), label="wtdbg2")
plt.xlim(0,60)
plt.ylim(15000000,26000000)
plt.rc('xtick', labelsize=MED_FONT)
plt.rc('ytick', labelsize=MED_FONT)
plt.legend(prop={'size': 20})