Assembly-Based Inversion Calling

Posted on Tue 11 September 2018 in genomics

Both WGS alignment methods and whole genome alignment methods are used to computationally identify structural variants (SVs). Assemblytics is an example of a whole genome alignment method, as it scans nucmer alignments of a query genome to a reference genome to call variants. However, Assemblytics only calls insertions and deletions (and expansions and contractions, but we can consider those insertions and deletions respectively). In an effort to expand this software, I want to add the ability to call more types of variants such as inversions and translocations. Here, I focus on inversions and begin to investigate how very simple inversions appear in genome-genome alignments.

I will be using Minimap2 to perform genome-genome alignments. In general, as is done by Assemblytics, one can observe CIGAR strings within alignments, or patterns between alignments to call SVs. For example, insertions and deletions may be traversed in a single alignment and would be indicated by "I" and "D" codes respectively in the CIGAR string. But indels can also be inferred by observing discrepancies in the "start" and "end" positions of alignments with respect to the reference and query genomes.

For inversions, however, there is no way to directly encode such a variant in a CIGAR string. Therefore, one would expect that any inversion detectable using this method should be "between" alignments. For example, if there are 3 sorted alignments with "+", "-", and "+" strands, we might call the sequence implicated in the middle alignment an inversion. With this in mind, I want to answer the following questions:

  1. How long must an inversion be to induce its own alignment?
  2. If a single alignment of flanking sequence traverses an inversion, how is that reflected in the CIGAR string?

To answer these questions, I will set up a very simple simulation. First, I will cut out the first 1 Mbp of sequence from chromosome 1 of the SL3.0 Heinz tomato reference genome. Next, half-way between this sequence, I will introduce inversions of varying sizes. Finally, I will align the sequences containing inversions back to the original sequence using Minimap2.

For this notebook, we will only need Minimap2 and Samtools.

Simulating Data

First, let's grab the tomato genome we will be using.

In [3]:
!wget ftp://ftp.solgenomics.net/tomato_genome/assembly/build_3.00/S_lycopersicum_chromosomes.3.00.fa
!samtools faidx S_lycopersicum_chromosomes.3.00.fa
!samtools faidx S_lycopersicum_chromosomes.3.00.fa SL3.0ch01:0-1000000 > chr1_1mb.fasta

The above code downloaded the tomato reference genome, indexed it, and created a new fasta file containing the first 1 Mbp of chromosome 1. Now that we have our template sequence, I will define a utility to induce an inversion at a specified location.

In [4]:
def reverse_complement(seq):
    complements = str.maketrans("ACGTNURYSWKMBVDH", "TGCANAYRSWMKVBHD")
    return seq.translate(complements)[::-1]


def add_inv(in_file, start, size, out_file, out_header):
    # Read the input sequence
    with open(in_file, 'r') as f:
        header = f.readline()
        seq = ''
        for line in f:
            seq += line.rstrip().upper()
    
    x = seq[:start]
    y = seq[start:start+size]
    z = seq[start+size:]
    
    with open(out_file, 'w') as f:
        f.write(out_header + '\n')
        f.write(x + reverse_complement(y) + z)

Now, lets make 3 new sequences. Each sequence will have a single inversion introduced half-way between the template (500 kbp). We will give one a 20 bp inversion, one a 500 bp inversion, and one a 10 kbp inversion.

In [5]:
add_inv('chr1_1mb.fasta', 500000, 20, 'chr1_1mb_small.fasta', '>inv_500k_20') 
add_inv('chr1_1mb.fasta', 500000, 500, 'chr1_1mb_medium.fasta', '>inv_500k_500') 
add_inv('chr1_1mb.fasta', 500000, 10000, 'chr1_1mb_large.fasta', '>inv_500k_10000') 

Now that we have our simulated data, we can align the sequences back to the original templates with Minimap2.

It is important to note that there are gaps within the template sequence, and accordingly, in the 3 simulated sequences. None of the inverted sequence contains gaps, but the gaps may cause some minor inconsistencies in the final alignments.

Aligning Simulated Sequences

Let's align the simulated sequences back to the template.

In [6]:
!/usr/local/src/minimap2/minimap2 -ax asm5 data/chr1_1mb.fasta chr1_1mb_small.fasta > small.sam
!/usr/local/src/minimap2/minimap2 -ax asm5 data/chr1_1mb.fasta chr1_1mb_medium.fasta > medium.sam
!/usr/local/src/minimap2/minimap2 -ax asm5 data/chr1_1mb.fasta chr1_1mb_large.fasta > large.sam

Small Inversion

Let's look at the SAM file produced when aligning the simulated sequence with a 20 bp inversion against the template.

Here we see only 2 alignments. The second alignment roughly spans the first quarter of the reference. I assume there are two alignments instead of one as a result of gap sequence. The first alignment picks up where the second alignment leaves off, soft clipping the first quarter of the reference and consuming the rest of the sequence. This means that our 20 bp inversion has been traversed by the first alignment. Let's pick apart the CIGAR string to see what is going on:

  • 255601S244399M14D6M14I499980M

The first half of the template is consumed by soft clipping and matches/mismatches. This brings us exactly to the 500 kbp mark where the inversion begins. As we noted, the soft clipped portion is accounted for by the second alignment in the same file.

  • 255601S244399M14D6M14I499980M

We know that the next portion of the CIGAR string must account for the inversion, and here we see exactly that. The inversion appears as a 14 bp deletion, followed by 6 match/mismatches, followed by a 14 bp insertion.

  • 255601S244399M14D6M14I499980M

The second half of the template, excluding the 20 bp inversion, is consumed by match/mismatches.

So the inversion was sufficiently small such that it was traversed in a single alignment of flanking sequence. Furthermore, the inversion was characterized as a 14 bp indel with 6 bp of match/mismatches. Clearly, under these conditions, the inversion was too small to be detected. In an uncontrolled environment, there would be no way to tell that these modifications are indicative of an inversion rather than the indicated indel/match/mismatch.

Medium Inversion

Now, lets repeat the process for the 500 bp inversion. Here is the SAM file:

Now, we see four alignments as opposed to two. Alignments two and three consume the first half of the template, again, leading right up to the start of the inversion. The fourth alignment spans the exact 500 bp that we inverted and is indicated by the SAM flag as being on the reverse strand. In other words, this fourth alignment exactly covers the inverted sequence in a single alignment. Finally, the first alignment consumes the remaining half of the template, starting at the end of the inversion.

Large Inversion

Finally, here is SAM file associated with the 10 kbp inversion.

Here, we see roughly the same story as the 500 bp "medium" inversion.

Switching Point

We observed that the 20 bp inversion was so small that it was consumed within a single alignment of flanking sequence, whereas the 500 bp and the 10 kbp inversion were large enough to garner their own alignments. Since it is only really feasible to flag these inversions if they are covered in their own alignments, I want to know how large an inversion must be to be implicated in its own alignment. To find this, I repeated the same steps as before using a binary search to narrow down the exact length of consequence.

Inversion Length (bp) Receives Alignment
20 No
500 Yes
260 Yes
140 No
200 No
230 Yes
215 Yes
207 Yes
203 Yes
201 Yes

So we can conclude that under these conditions, inversions larger than 200 bp receive their own alignment, while inversions less than or equal to 200 bp are traversed by other alignments of flanking sequence.

Since I have more examples of inversions that are traversed by alignments of flanking sequence, I wanted to see if there were other ways by which CIGAR strings accounted for these inversions. We saw in the first test that a 20 bp inversion was represented by a 14 bp indel and 6 match/mismatches. However, if we look at the 140 bp inversion, we see something else:

Here we see that the 140 bp inversion is actually soft clipped out of the alignments. In other words, instead of the inversion being represented by some combination of indels and match/mismatches, it simply doesn't align at all.

Conclusions

I will start by answering the questions I posed at the beginning:

  1. How long must an inversion be to induce a new alignment?

Under these conditions, at least 201 bp.

  1. If a single alignment traverses an inversion, how is that reflected in the CIGAR string?

Either by a combination of indels and match/mismatches or by clipping.

It is safe to say that inversion calling will be done by analyzing the strands of sorted alignments. That said, a more realistic simulation is needed to get a better sense of what to expect with real data. To do this, more SVs, as well as SNPs, will need to be introduced into longer simulated sequences.