Example: tagged paired-end reads

This is an example generated from this source file: tg-example.jl You are seeing the online documentation version. The corresponding notebook can be found here: tg-example.ipynb and the script can be found here: tg-example.jl

Let's see how we might simulate something like an 10x sequencing experiment.

For this simulation script we will:

Create a pool of 5000 copies of a reference genome.
Fragment the DNA molecules in the pool, to an average length of 40,000bp.
Tag the long molecules in the pool randomly with a set of 1,000,000 tags.
Fragment the molecules in the pool to an average length of 700bp.
Subsample the molecules in the pool to achieve approximatly 50x coverage.
Create a set of 250bp paired-end reads.
Apply errors to the paired-end reads at a rate of 0.001 (.1%).
Generate an output FASTQ file.

using Pseudoseq

Using the `sequence` method

First, let's see how we do this with the sequence method. The first two parameters we give to the function will be the input genome we want to sequence, and the destination FASTQ file for output reads. Here we are setting:

The number of genome copies in the molecule pool to 5,000.
The number of possible tags to one million.
The average fragment size, prior to tagging, to 40,000bp.
The average fragment size after tagging, to 700bp.
The sampling coverage to 50x.
The read length to 250bp.
The per base read error rate to 0.001.
The fact we want paired-ends of fragments to be read (paired) to true, which is the default.

sequence("ecoli-ref.fasta", "tagged_reads.fastq"; ng = 5000, tusize = 1000000, taggedflen = 40000, flen = 700, cov = 50, rdlen = 250, err = 0.1)

[ Info: - ✔ Created pool of 5000 copies of a 4639675bp genome
[ Info: - ✔ Created pool of fragments with an average length of 40000bp
[ Info: Attaching 1000000 tags randomly to each molecule in pool...
[ Info: Done
[ Info: - ✔ Created pool of tagged fragments with an average length of 40000bp
[ Info: - ✔ Created pool of fragments with an average length of 700bp
[ Info: - ✔ Subsampled pool at 50X coverage (463967 molecules)
[ Info: - ✔ Created set of 250bp paired-end reads
[ Info: - ✔ Applied sequencing errors at a per-base rate of 0.1
[ Info: - ✔ Wrote 655810 paired end reads to FASTQ file

Using the Pseudoseq API

Here's how to achieve the same thing, using the Pseudoseq API. It is nessecery to use the API if you want to do something that is not anticipated by the available functionality of the sequence method: the cost of conveinience is fewer options.

Let's start with a pool of 5000 copies of a genome contained in a FASTA file:

dnapool = makepool("ecoli-ref.fasta", 5000)

Pool of 5000 molecules:
 All molecules are of the same size: 4639675

Now let's cut up the molecules to an average length of 40,000bp

cutpool = fragment(dnapool, 40000)

Pool of 575000 molecules:
 Maximum molecule size: 509901
 Average molecule size: 40345
 Minimum molecule size: 1

Ok, now we will tag these large fragments randomly. Once you tag a fragment in a universe, any other fragments that are derived from that tagged fragment will inherit the same tag.

taggedpool = tag(cutpool, 1000000)

[ Info: Attaching 1000000 tags randomly to each molecule in pool...
[ Info: Done
Pool of 575000 molecules:
 Maximum molecule size: 509901
 Average molecule size: 40345
 Minimum molecule size: 1
 Number of distinct tags: 437358

Here I'm going to use a pool of 1,000,000 distinct tags. Which fragment gets a certain tag is random. The size of the tag pool, and the number of fragments in your universe will determine how likely it is that any two fragments get the same tag. Now we'll fragment the pool again

taggedcutpool = fragment(taggedpool, 700)

Pool of 32863532 molecules:
 Maximum molecule size: 10734
 Average molecule size: 705
 Minimum molecule size: 1
 Number of distinct tags: 437358

Subsample the pool of tagged molecules.

genome_size = 4639675
expected_coverage = 50
read_length = 250

N = needed_sample_size(expected_coverage, genome_size, read_length)
N = div(N, 2) # Divide by 2 as we're doing paired end sequencing.

sampledpool = subsample(taggedcutpool, N)

Pool of 463967 molecules:
 Maximum molecule size: 8545
 Average molecule size: 704
 Minimum molecule size: 1
 Number of distinct tags: 227219

Now let's make some 250bp tagged paired reads and generate some erroneous positions.

tagged_reads = make_reads(TaggedPairs, sampledpool, 250)
tagged_w_errs = mark_errors(tagged_reads, 0.001)

Output to FASTQ:

generate("tagged_reads.fastq", tagged_w_errs)#-

[ Info: - ✔ Wrote 653570 tagged paired end reads to FASTQ file

This page was generated using Literate.jl.

Example: tagged paired-end reads

Using the sequence method

Using the Pseudoseq API

Using the `sequence` method