k-mer counting, part I: Introduction

Bernardo J. Clavijo

2018/09/17

Motivation

For the last 5 years, I have been advocating the analysis of k-mer spectra (in short, the frequencies at which different k-mers appear on different datasets) to evaluate genome assemblies. I have been using k-mers for many other analyses too, as do most people doing Bioinformatics. I also train others to perform k-mer analyses and have a long overdue debt to them to write down a more coherent version of the training. So here it goes, with the hope that it gets more people introduced to this simple and straightforward way to analyse sequence data. It is just counting, and everyone can count.

keep calm and count k-mers
This will be the first of a (yet undetermined) number of posts about k-mer counting on sequences, reads, assemblies, and more. They will be part conceptual, part practical. I will show practical bits using KAT, the k-mer analysis toolkit. Among other things, KAT introduced the spectra comparison as a way to assess genome assemblies vs WGS reads, and we use it for all kinds of tricks.




k-mers from a sequence

What is a k-mer anyway? A k-mer is just a sequence of k characters in a string (or nucleotides in a DNA sequence). Now, it is important to remember that to get all k-mers from a sequence you need to get the first k characters, then move just a single character for the start of the next k-mer and so on. Effectively, this will create sequences that overlap in k-1 positions.

So, by way of example, the sequence ATCGATCAC contains the following 3-mers (k-mer of size 3):

Sequence: ATCGATCAC
3-mer #0: ATC
3-mer #1:  TCG
3-mer #2:   CGA
3-mer #3:    GAT
3-mer #4:     ATC
3-mer #5:      TCA
3-mer #6:       CAC

Or, in a nice looking table which we’ll expand later:

Offset 0 1 2 3 4 5 6
3-mer ATC TCG CGA GAT ATC TCA CAC

Obviously, we don’t need to use k=3 every time, so here you have the same example with k=4:

Sequence: ATCGATCAC
4-mer #0: ATCG
4-mer #1:  TCGA
4-mer #2:   CGAT
4-mer #3:    GATC
4-mer #4:     ATCA
4-mer #5:      TCAC

So far, so good, so easy. Notice how there are 1 fewer 4-mer than 3-mers in the same sequence. Any sequence of length L will contain L - k + 1 k-mers, which has implications for k-mer analyses when k is close to the length of the sequence. When k << L, as typically happens when we chop the millions or billions of bp of a genome into k-mers with k in the low hundreds or even tens, you can pretend there’s a k-mer at every position in the genome and forget about it.

I hope you can easily imagine the many analyses that can be performed with k-mers, but if not, don’t worry, I will show some along the way).

Decomposing a sequence into its k-mers for analysis allows this set of fixed-size chunks to be analysed rather than the sequence, and this can be more efficient. K-mers are very useful in sequence matching (string matching with n-grams has a rich history), and set operations are faster, easier, and there are a lot of readily-available algorithms and techniques to work with them. A simple example: to check if a sequence S comes from organism A or from organism B, assuming the genomes of A and B are known and sufficiently different, we can check if S contains more k-mers present in A or in B. Yes, there are many tools that do just that.

Basically, using k-mers simplifies bioinformatics to counting and comparing whether things are there or not.

Reverse complement and canonical k-mers

*Note: for this section I assume double-strandness is the typical use case. This is a fair assumption, but there will be exceptions

DNA is normally double stranded, with bases paired on the opposite strands and we normally read (or sequence) on either of the two strands. However, we would like to consider every location of the genome once, no matter on which strand we happened to have landed.

In short: if we read the sequence ATCGAC that is an observation for that sequence and its reverse complement GTCGAT to exist in the genome. One appears when reading the genome in one direction and the other on its opposite, we could have sequenced any of them. So for the sake of completeness we should perform all analyses by considering this sequence ATCGAC/GTCGAT.

Luckily, there is a method in k-mer analysis (or in sequence analysis in general) to be sure we always talk about the same sequence in the same way: we can use only the canonical sequence of a k-mer pair, this is the lexicographically smaller of the two reverse complementary sequences. In other words, the one that comes earliest in alphabetical order.

Most k-mer counting tools will either count in canonical k-mers or have an option to switch between canonical and non-canonical. Most of the time canonical counting will be the default. This means if GTCGAT appears it will be counted as ATCGAC. Most of the time this is exactly what we want.

Here are the 3-mers of ATCGATCAC, this time including the reverse complement and the canonical of each 3-mer too.

Offset 0 1 2 3 4 5 6
3-mer ATC TCG CGA GAT ATC TCA CAC
Reverse Complement GAT CGA TCG ATC GAT TGA GTG
Canonical ATC CGA CGA ATC ATC TCA CAC

As you can see, in some cases the canonical k-mer appeared in the original sequence, but in other cases it was the reverse complement.

did_you_know Why do many tools use odd values for k ?

To write a k-mer based tool, we need to account for the relationship within a k-mer and its reverse complement because we don’t know which direction we are reading from. In many tools this relationship is used to detect sequence orientation: if sequence A has the same k-mers as sequence B, in opposite order and in reverse complement, B is in opposite orientation than A. But when using even values of k, a k-mer can be its own reverse complement, which makes it difficult to handle the reverse complements and orientations. This happens when the first half of the sequence is the reverse complement of the second half, as in TCGCGA. A k-mer of odd k (an odd-mer, if you will) can never be its own reverse complement, because the base in the middle of its sequence should be its own complement. Hence, many tools stick to odd values for k.

Distinct, unique, and total k-mers

People that count k-mers talk about unique k-mers, and there is also the total and the distinct counts to keep in mind. All of these can be also done in canonical or non-canonical ways. Sounds confusing? It’s actually not.

Total k-mers are the simplest of all. As we have shown with the example of ATCGATCAC, the amount of k-mers from a sequence varies with k, and that is basically it. ATCGATCAC has 7 total 3-mers and 6 total 4-mers.

Distinct k-mers are counted only once, even if they appear more times, while Unique k-mers are those that appear only once. These two categories of k-mers are greatly affected by counting canonically or not-canonically, because reverse complements in canonical counting are considered the same k-mer.

This is the 3-mer composition for ATCGATCAC again, in non-canonical form, grouping its k-mer occurrences:

k-mer Total count Distinct count Unique count
ATC 2 1 0
TCG 1 1 1
CGA 1 1 1
GAT 1 1 1
TCA 1 1 1
CAC 1 1 1
Sum 7 6 5

And this is the 3-mer composition for ATCGATCAC again, but in canonical form:

Canonical k-mer Total count Distinct count Unique count
ATC (incl. GAT) 3 1 0
CGA (incl. TCG) 2 1 0
TCA 1 1 1
CAC 1 1 1
Sum 7 4 2

As expected, in this canonical counting version TCG has been grouped into CGA, and GAT has been grouped into ATC. So the count has fewer distinct k-mers, and fewer unique k-mers. Canonical counting will always yield fewer (or at most the same), distinct k-mers and unique k-mers.

did_you_know What makes unique k-mers so special ?

Unique k-mers make analyses simple. For instance, if a k-mer appears only once in a genome, it can be used to check if a particular region of the genome is present or not in a sample, or to check for variation, or as anchor to align other genome’s content, or reads. Unique content within a genome is a good thing, enabling independent analysis of each part of the genome. Genes being usually made up of quite different content than the “repeats” of a genome is the reason why it is in general easier to assemble the genic part of a genome.

Sadly, repetitive content as a concept is poorly defined between biological and computational terms. While in biological terms a loose definition of repeat could be “the stuff between the genes, which is assumed to be similar and low-complexity”, in computational terms it always means “a portion of sequence that appears exactly the same more than one time”. This is why sometimes gene families (groups of very similar genes) are very hard to analyse as they appear multiple times in the genome. At the same time, this dissonance also gives some light at the end of the tunnel in unexpected scenarios: a biologically important repeat may turn out to have a few unique k-mers every time it appears, which in turn may make it a relatively simple region to analyse with the right choice of sequencing technologies and algorithms.




terminal Counting k-mers in a (small) genome

We will start with an easy example first: the phi-X174 genome has 5386 bp and is a simple non-repetitive genome.

We can use kat hist to count 27-mers on the genome and check how many times each 27-mer appears (we start with k = 27 because KAT uses that as default):

$ kat hist -o phiX.hist phiX.fasta

Checking the phiX.hist histogram (A.K.A. kmer spectrum) file, every 27-mer in the genome appears only once. After the header lines starting with #, every line has a copy number (A.K.A. frequency) and a number of k-mers.

# Title:27-mer spectra for: phiX.fasta
# XLabel:27-mer frequency
# YLabel:# distinct 27-mers
# Kmer value:27
# Input 1:../genomes/phiX.fasta
###
1 5360
2 0
3 0
4 0
...

It is easy to see all 5360 27-mers appear once. Just to be sure we understand the logic, let’s create a file where all the phi-X genome appears twice, count the 27-mers and check the spectrum file.

$ cat phiX.fasta phiX.fasta > phiX_twice.fasta
$ kat hist -o phiX_twice.hist phiX_twice.fasta
# Title:27-mer spectra for: phiX_twice.fasta
# XLabel:27-mer frequency
# YLabel:# distinct 27-mers
# Kmer value:27
# Input 1:phiX_twice.fasta
###
1 0
2 5360
3 0
4 0
...

Now all 5360 27-mers appear twice. With the basics done, it is time to get into the most popular question on k-mer based analysis.

Choosing k: the size of the universe and specificity vs. sensitivity

Choosing k is not a trivial topic on k-mer analysis. Using a k that is too small will result in many unrelated sequences being composed of the same k-mers, in a textbook case of specificity loss because there being very few possible k-mers of that size. In the extreme of the small k, k=1 only distinguishes two canonical k-mers: A and C. 1-mer analysis is incredibly popular in biology, but it is best known by the name of GC content analysis. On the other hand, using extremely large k values would sacrifice many of the benefits and sensitivity of k-mer analyses in the first place, such as decomposing the larger problem into small easy to solve exact problems that can produce an aggregated solution. This leads to problems in inexact matching which I will describe in future posts.

Knowing how many different k-mers can be constructed at a certain k is a good starting point to make an informed choice of k. We call all possible k-mers the universe, and their count is our size of the universe for k. Since each of the k nucleotides in a k-mer can take any of the A, C, G or T values, the possible combinations of k positions are computed as 4^k. When counting canonical k-mers not every value will be considered, but only those that are canonical. For odd values of k that are half of the k-mers, and the size of the universe becomes (4^k)/2. For even values, due to some k-mers being their own reverse complement, the size is slightly higher at (4^k + 4^(k/2) ) / 2, because these k-mers do not group with others under the same canonical representation. Again, sticking to odd k values makes life simpler, but don’t worry about the formulae too much.

Let’s check now what happens if we change that k value between two carefully chosen values of 9 and 8 in the Phi-X genome (we will discuss how these were chosen soon). For historical reasons (due the to options from Jellyfish, on which KAT is based), KAT uses the -m parameter to set k. If k is small enough, the k-mers will appear more than once in the genome, even for this very simple genome.

So counting with k=9 is done like this:

$ kat hist -o phiX_9mer.hist -m 9 phiX.fasta

Then the phiX_9mer.hist file looks like this:

# Title:9-mer spectra for: phiX.fasta
# XLabel:9-mer frequency
# YLabel:# distinct 9-mers
# Kmer value:9
# Input 1:phiX.fasta
###
1 4972
2 189
3 8
4 1
5 0
6 0
7 0
8 0
9 0
...

Not every 9-mer in the genome appear only once: there are 189 9-mers appearing twice, 8 9-mers appearing trice, and 1 9-mer that appears four times.

So, the number of distinct 9-mers in the phi-X genome is 4972 + 189 + 8 + 1 = 5170, of these only 4972 are unique 9-mers (they only appear once in the genome), and the total number of 9-mers is 4972 * 1 + 189 * 2 + 8 * 3 + 1 * 4 = 5378.

The relationship between total, distinct and unique k-mers in a genome gives insight into its composition and the expected performance of different analyses, and depends on the choice of k , as illustrated by the effect of using 8-mers on phi-X for the same analysis.

$ kat hist -o phiX_8mer.hist -m 8 phiX.fasta

Now the histogram file looks like this:

# Title:8-mer spectra for: phiX.fasta
# XLabel:8-mer frequency
# YLabel:# distinct 8-mers
# Kmer value:8
# Input 1:phiX.fasta
###
1 4159
2 491
3 67
4 8
5 1
6 0
7 0
8 0
9 0

Here, only 4159 8-mers are unique, out of 4726 distinct 8-mers, that are present in the genome’s 5377 total 8-mers.


The full response of the unique, distinct and total k-mer counts for phi-X can be computed by running successive KAT hist analyses, and then a bit of clever awk.

#!/bin/bash
#Counting k-mers for different k on phiX
echo "k,unique,distinct,total"
for k in {1..15}; do
	kat hist -o phiX_$k.hist -m $k phiX.fasta >/dev/null 2>&1
	egrep -v '^#' phiX_$k.hist|awk '{if ($1==1) u=$2; d+=$2; t+=$2*$1;}\
	END{print "'$k',"u","d","t}'
done

Executing this script gives a csv version of the following table:

k unique distinct total
1 0 2 5386
2 0 10 5385
3 0 32 5384
4 0 135 5383
5 10 506 5382
6 434 1727 5381
7 2295 3547 5380
8 4159 4726 5379
9 4972 5170 5378
10 5254 5315 5377
11 5346 5361 5376
12 5369 5372 5375
13 5374 5374 5374
14 5373 5373 5373
15 5372 5372 5372

Which plotted alongside the corresponding sizes of the universe is:

phix_k_response

From these results, it is clear there is not a lot to look at for k<5 (see how the unique k-mers are 0), these universes are so small they get completely saturated with the same k-mers over and over again. While frequency analyses like GC content and codon usage can be done there, it is slightly tangential to our topic today.

In the range 5 <= k <= 10, there is a progressive growth on unique and distinct k-mers. Here, the k value makes a difference and will affect the results as the shown in the cherry-picked 9-mer and 8-mer cases earlier.

For k > 10, pretty much every k-mer is unique. As a matter of fact for k >= 13, every k-mer is a unique k-mer.


To make things more interesting, the same analysis can be done for the E. coli genome:

#!/bin/bash
#Counting k-mers for different k on ecoli
echo "k,unique,distinct,total"
for k in {1..31}; do
	kat hist -o ecoli_$k.hist -m $k -h 3000000 ecoli.fasta >/dev/null 2>&1
	egrep -v '^#| 0$' ecoli_$k.hist|awk '{if ($1==1) u=$2; d+=$2; t+=$2*$1;}\
	END{print "'$k',"0+u","d","t}'
done

Note: notice we adjusted the maximum count in the KAT histogram and modified the grep/awk to just skip 0 counts, and there area lot of 0 counts on these files as we are slightly abusing KAT beyond its intended purpose for these examples.

And the corresponding plot looks like this:

ecoli_k_response

We notice two things straight away: in this larger, relatively more complex genome, there is high saturation of the small universes up to k=10, with all their k-mers fully occupied and no unique k-mers ; and for the higher values of k, the unique and distinct counts do not fully reach the total k-mer number. Even well past desaturation, when the numbers of distinct and unique k-mers separate from the fast-growing size of the universe and become closer to the total k-mers, and up to the last point where k=20, there are still some k-mers that are not unique: these are computational repeats.


did_you_know Why do many tools use k = 31 ?

31 is the largest odd value of k such that a k-mer can be translated into a 64-bit integer. Most computers nowadays can manipulate 64-bit integers very well and most programming languages will make it easy to work efficiently with 64-bit integers (yes, this has to do with computers using 64-bit architectures). With 31-mers being specific enough that a large number of them are unique both in mammalian-sized genomes (and some more complex genomes too) and in bacterial genome databases, bioinformatics tools just stick to that.




Coming soon…

In future posts we will go over k-mer counting in WGS reads and the read spectrum, onwards to spectra comparisons, and all kinds of k-mer intersection based analyses.


Version History

2018-09-17: initial post.

2018-09-18: corrected small typos and readability.

2018-09-19: more small typos and readability; and clearer explanation of saturation/desaturation on E. Coli genome counting (thanks James Mallet!).

2020-03-04: corrected the number of distinct 8-mers in Phi-X. Thanks @Erdos!!!