PuzzleMaker: Core concepts & basic workflow
Pseudoseq allows you to create puzzles: planned genomes and chromosomes that have a certain set of features and peculiarities of interest.
The purpose for creating such puzzle genomes is not to recreate biology perfectly. The purpose is to create problems you understand fully (where the repeated content is, which positions are heterozygous and so on).
Using such genomes can help you understand, and develop an intuition for, what current genome assembly tools are doing. It can also help you design assembly tools, and perhaps even plan sequencing experiments and form hypotheses.
PuzzleMaker is the Pseudoseq submodule that contains the functionality for doing this. This manual includes several examples showing how to create genomes with certain characteristics, but the core workflow and important concepts are explained below.
The MotifStitcher
The MotifStitcher is the type that is central to PuzzleMaker. By creating and interacting with a MotifStitcher, you create the plan for your puzzle in three simple steps. First you define motifs, then you decide on the order of motifs, and then you generate one or more instances of the puzzle by calling make_puzzle on your MotifStitcher.
Let's take a look at step 1.
1. Define Motifs
The MotifStitcher allows you to programmatically create puzzle haplotype sequences by first defining a set of motifs. Motifs are shorter chunks of user-specified or randomly generated sequence that will be "stitched" together to form the final haplotype sequences.
There are three kinds of motif that can be added to the MotifStitcher.
- A random motif.
- A fixed motif.
- A sibling motif.
Random motifs
A random motif is a motif whose sequence will change between calls to make_puzzle. This lets you trivially make multiple replicate puzzles with the same properties, but with different specific sequences.
You can add a random motif to a MotifStitcher by using the add_motif! method, and specifying only a length (in bp) for the random motif, for example:
julia> ms = MotifStitcher()
Pseudoseq.PuzzleMaker.MotifStitcher(0, Dict{Int64,Pseudoseq.PuzzleMaker.RandomMotif}(), Dict{Int64,BioSequences.LongSequence{BioSequences.DNAAlphabet{2}}}(), Dict{Int64,Pseudoseq.PuzzleMaker.SiblingMotif}(), Array{Int64,1}[])
julia> add_motif!(ms, 10_000)
Pseudoseq.PuzzleMaker.MotifStitcher(1, Dict{Int64,Pseudoseq.PuzzleMaker.RandomMotif}(1 => Pseudoseq.PuzzleMaker.RandomMotif(10000, BioSequences.SamplerWeighted{BioSymbols.DNA}(BioSymbols.DNA[DNA_A, DNA_T, DNA_C, DNA_G], [0.25, 0.25, 0.25, 0.25]))), Dict{Int64,BioSequences.LongSequence{BioSequences.DNAAlphabet{2}}}(), Dict{Int64,Pseudoseq.PuzzleMaker.SiblingMotif}(), Array{Int64,1}[])
By default, a random motif is constructed using a sampler that gives equal weighting to the four nucleotides. If you want more control over the nucleotide biases, you can construct a RandomMotif yourself, and provide it with your own nucleotide sampler:
julia> ms = MotifStitcher()
Pseudoseq.PuzzleMaker.MotifStitcher(0, Dict{Int64,Pseudoseq.PuzzleMaker.RandomMotif}(), Dict{Int64,BioSequences.LongSequence{BioSequences.DNAAlphabet{2}}}(), Dict{Int64,Pseudoseq.PuzzleMaker.SiblingMotif}(), Array{Int64,1}[])
julia> smp = SamplerWeighted(dna"ACGT", [0.2, 0.3, 0.3]) # Sampler biased toward GC; the weight for the last symbol (T) is the remainder.
BioSequences.SamplerWeighted{BioSymbols.DNA}(BioSymbols.DNA[DNA_A, DNA_C, DNA_G, DNA_T], [0.2, 0.3, 0.3, 0.19999999999999996])
julia> add_motif!(ms, RandomMotif(10_000, smp)) # RandomMotif of 10_000 bp in length, using custom sampler.
Pseudoseq.PuzzleMaker.MotifStitcher(1, Dict{Int64,Pseudoseq.PuzzleMaker.RandomMotif}(1 => Pseudoseq.PuzzleMaker.RandomMotif(10000, BioSequences.SamplerWeighted{BioSymbols.DNA}(BioSymbols.DNA[DNA_A, DNA_C, DNA_G, DNA_T], [0.2, 0.3, 0.3, 0.19999999999999996]))), Dict{Int64,BioSequences.LongSequence{BioSequences.DNAAlphabet{2}}}(), Dict{Int64,Pseudoseq.PuzzleMaker.SiblingMotif}(), Array{Int64,1}[])
Only SamplerWeighted{DNA} types are accepted.
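To build some intuition for what a weighted nucleotide sampler does, here is a minimal sketch that uses only the Julia standard library. It is not Pseudoseq's or BioSequences' implementation. The helper name sample_weighted is hypothetical, and plain strings stand in for DNA sequence types. It mirrors the convention visible in the REPL output above, where weights are given for every symbol except the last and the last weight is the remainder.

```julia
using Random

# Hypothetical helper, not part of Pseudoseq or BioSequences.
# `weights` holds one weight per symbol except the last symbol,
# whose weight is whatever remains up to 1.
function sample_weighted(rng::AbstractRNG, symbols::Vector{Char},
                         weights::Vector{Float64}, n::Int)
    length(weights) == length(symbols) - 1 ||
        throw(ArgumentError("need one weight fewer than symbols"))
    cumulative = cumsum(weights)  # upper bound of each symbol's interval in [0, 1)
    out = Vector{Char}(undef, n)
    for i in 1:n
        r = rand(rng)
        # Index of the first bound that is >= r. When r exceeds every bound,
        # this is length(symbols), which selects the last symbol.
        out[i] = symbols[searchsortedfirst(cumulative, r)]
    end
    return String(out)
end

rng = MersenneTwister(1)
motif = sample_weighted(rng, ['A', 'C', 'G', 'T'], [0.2, 0.3, 0.3], 10_000)
gc_fraction = count(c -> c == 'G' || c == 'C', motif) / length(motif)
```

With these weights, gc_fraction should come out close to 0.6, which is the GC bias the custom sampler above is meant to produce.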
Fixed motifs
Unlike a random motif, a fixed motif has its sequence defined up front, and that sequence stays constant across calls to make_puzzle.
This is useful when your puzzles must always include a certain DNA sequence, perhaps one of biological interest or one known to confuse a heuristic.
You add a fixed motif to a MotifStitcher simply by passing it a DNA sequence:
julia> ms = MotifStitcher()
Pseudoseq.PuzzleMaker.MotifStitcher(0, Dict{Int64,Pseudoseq.PuzzleMaker.RandomMotif}(), Dict{Int64,BioSequences.LongSequence{BioSequences.DNAAlphabet{2}}}(), Dict{Int64,Pseudoseq.PuzzleMaker.SiblingMotif}(), Array{Int64,1}[])
julia> add_motif!(ms, dna"ATCGATCG")
Pseudoseq.PuzzleMaker.MotifStitcher(1, Dict{Int64,Pseudoseq.PuzzleMaker.RandomMotif}(), Dict{Int64,BioSequences.LongSequence{BioSequences.DNAAlphabet{2}}}(1 => ATCGATCG), Dict{Int64,Pseudoseq.PuzzleMaker.SiblingMotif}(), Array{Int64,1}[])
Sibling motifs
A sibling motif is a motif that is randomly generated for each call of make_puzzle, just like random motifs. However, unlike random motifs, a sibling motif is defined in terms of another motif already defined in the MotifStitcher.
To define a sibling motif, you specify an already existing motif, whose sequence forms the base sequence of the new sibling motif. You also provide a value that specifies the proportion of bases in the new sibling motif's sequence that should differ from the base motif's sequence.
Sibling motifs make it simple to define motifs that have a certain level of sequence similarity / homology / shared ancestry with another motif. Creating portions of a simulated diploid genome might be one practical application of sibling motifs.
You add a sibling motif to the MotifStitcher by providing the add_motif! method with a Pair{Int,Float64}, where the integer is the ID of the chosen base motif already defined in the MotifStitcher, and the floating point number specifies the proportion of differing bases in the new motif's sequence:
julia> ms = MotifStitcher()
Pseudoseq.PuzzleMaker.MotifStitcher(0, Dict{Int64,Pseudoseq.PuzzleMaker.RandomMotif}(), Dict{Int64,BioSequences.LongSequence{BioSequences.DNAAlphabet{2}}}(), Dict{Int64,Pseudoseq.PuzzleMaker.SiblingMotif}(), Array{Int64,1}[])
julia> add_motif!(ms, 10_000) # A random first 10,000bp motif. Has ID = 1.
Pseudoseq.PuzzleMaker.MotifStitcher(1, Dict{Int64,Pseudoseq.PuzzleMaker.RandomMotif}(1 => Pseudoseq.PuzzleMaker.RandomMotif(10000, BioSequences.SamplerWeighted{BioSymbols.DNA}(BioSymbols.DNA[DNA_A, DNA_T, DNA_C, DNA_G], [0.25, 0.25, 0.25, 0.25]))), Dict{Int64,BioSequences.LongSequence{BioSequences.DNAAlphabet{2}}}(), Dict{Int64,Pseudoseq.PuzzleMaker.SiblingMotif}(), Array{Int64,1}[])
julia> add_motif!(ms, 1 => 0.01) # Add a sibling motif that will have ~100 bases (1% of 10,000) which differ from motif #1. Will have ID = 2.
Pseudoseq.PuzzleMaker.MotifStitcher(2, Dict{Int64,Pseudoseq.PuzzleMaker.RandomMotif}(1 => Pseudoseq.PuzzleMaker.RandomMotif(10000, BioSequences.SamplerWeighted{BioSymbols.DNA}(BioSymbols.DNA[DNA_A, DNA_T, DNA_C, DNA_G], [0.25, 0.25, 0.25, 0.25]))), Dict{Int64,BioSequences.LongSequence{BioSequences.DNAAlphabet{2}}}(), Dict{Int64,Pseudoseq.PuzzleMaker.SiblingMotif}(2 => Pseudoseq.PuzzleMaker.SiblingMotif(1, 0.01)), Array{Int64,1}[])
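The idea of "a proportion of differing bases" can be sketched with the standard library alone. The helper make_sibling below is hypothetical and is not how Pseudoseq implements sibling motifs. In particular, it assumes that exactly round(p * length) distinct positions are changed, whereas the real implementation might instead mutate each base independently with probability p.

```julia
using Random

# Hypothetical helper, not part of Pseudoseq.
# Returns a copy of `base` in which round(Int, p * length(base)) randomly
# chosen, distinct positions are replaced by a different nucleotide.
function make_sibling(rng::AbstractRNG, base::String, p::Float64)
    0.0 <= p <= 1.0 || throw(ArgumentError("p must lie in [0, 1]"))
    seq = collect(base)
    n_diff = round(Int, p * length(seq))
    positions = randperm(rng, length(seq))[1:n_diff]  # distinct positions to mutate
    for pos in positions
        # Choose among the three nucleotides that differ from the current one.
        alternatives = filter(!=(seq[pos]), ['A', 'C', 'G', 'T'])
        seq[pos] = rand(rng, alternatives)
    end
    return String(seq)
end

rng = MersenneTwister(42)
base = String(rand(rng, ['A', 'C', 'G', 'T'], 10_000))
sibling = make_sibling(rng, base, 0.01)
n_differences = count(i -> base[i] != sibling[i], 1:length(base))
```

For a 10,000 bp base motif and a proportion of 0.01, this sketch changes 0.01 × 10,000 = 100 positions.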
2. Specify haplotypes
Once you have a set of motifs defined, you can build a set of haplotypes by specifying sequences of motifs.
You specify the motifs using a vector of ID numbers. If you use a negative ID number -N then it means the reverse complement of the sequence of motif N.
A motif's ID can be repeated in such a vector any number of times, so you can create repeat structures in a haplotype.
You add a haplotype by using the add_motif_arrangement! method with a MotifStitcher and a vector of motif IDs.
For example, let's make a sequence that would form a hairpin-like structure, with repeats, when turned into a de Bruijn graph:
julia> ms = MotifStitcher()
Pseudoseq.PuzzleMaker.MotifStitcher(0, Dict{Int64,Pseudoseq.PuzzleMaker.RandomMotif}(), Dict{Int64,BioSequences.LongSequence{BioSequences.DNAAlphabet{2}}}(), Dict{Int64,Pseudoseq.PuzzleMaker.SiblingMotif}(), Array{Int64,1}[])
julia> add_motifs!(ms, 10000, 600, 10000, 600, 10000, 10000, 10000) # Use add_motifs! to add multiple random motifs at once.
Pseudoseq.PuzzleMaker.MotifStitcher(7, Dict{Int64,Pseudoseq.PuzzleMaker.RandomMotif}(7 => Pseudoseq.PuzzleMaker.RandomMotif(10000, BioSequences.SamplerWeighted{BioSymbols.DNA}(BioSymbols.DNA[DNA_A, DNA_T, DNA_C, DNA_G], [0.25, 0.25, 0.25, 0.25])),4 => Pseudoseq.PuzzleMaker.RandomMotif(600, BioSequences.SamplerWeighted{BioSymbols.DNA}(BioSymbols.DNA[DNA_A, DNA_T, DNA_C, DNA_G], [0.25, 0.25, 0.25, 0.25])),2 => Pseudoseq.PuzzleMaker.RandomMotif(600, BioSequences.SamplerWeighted{BioSymbols.DNA}(BioSymbols.DNA[DNA_A, DNA_T, DNA_C, DNA_G], [0.25, 0.25, 0.25, 0.25])),3 => Pseudoseq.PuzzleMaker.RandomMotif(10000, BioSequences.SamplerWeighted{BioSymbols.DNA}(BioSymbols.DNA[DNA_A, DNA_T, DNA_C, DNA_G], [0.25, 0.25, 0.25, 0.25])),5 => Pseudoseq.PuzzleMaker.RandomMotif(10000, BioSequences.SamplerWeighted{BioSymbols.DNA}(BioSymbols.DNA[DNA_A, DNA_T, DNA_C, DNA_G], [0.25, 0.25, 0.25, 0.25])),6 => Pseudoseq.PuzzleMaker.RandomMotif(10000, BioSequences.SamplerWeighted{BioSymbols.DNA}(BioSymbols.DNA[DNA_A, DNA_T, DNA_C, DNA_G], [0.25, 0.25, 0.25, 0.25])),1 => Pseudoseq.PuzzleMaker.RandomMotif(10000, BioSequences.SamplerWeighted{BioSymbols.DNA}(BioSymbols.DNA[DNA_A, DNA_T, DNA_C, DNA_G], [0.25, 0.25, 0.25, 0.25]))), Dict{Int64,BioSequences.LongSequence{BioSequences.DNAAlphabet{2}}}(), Dict{Int64,Pseudoseq.PuzzleMaker.SiblingMotif}(), Array{Int64,1}[])
julia> add_motif_arrangement!(ms, [1, 2, 3, 4, 5, -4, -6, -2, 7])
Pseudoseq.PuzzleMaker.MotifStitcher(7, Dict{Int64,Pseudoseq.PuzzleMaker.RandomMotif}(7 => Pseudoseq.PuzzleMaker.RandomMotif(10000, BioSequences.SamplerWeighted{BioSymbols.DNA}(BioSymbols.DNA[DNA_A, DNA_T, DNA_C, DNA_G], [0.25, 0.25, 0.25, 0.25])),4 => Pseudoseq.PuzzleMaker.RandomMotif(600, BioSequences.SamplerWeighted{BioSymbols.DNA}(BioSymbols.DNA[DNA_A, DNA_T, DNA_C, DNA_G], [0.25, 0.25, 0.25, 0.25])),2 => Pseudoseq.PuzzleMaker.RandomMotif(600, BioSequences.SamplerWeighted{BioSymbols.DNA}(BioSymbols.DNA[DNA_A, DNA_T, DNA_C, DNA_G], [0.25, 0.25, 0.25, 0.25])),3 => Pseudoseq.PuzzleMaker.RandomMotif(10000, BioSequences.SamplerWeighted{BioSymbols.DNA}(BioSymbols.DNA[DNA_A, DNA_T, DNA_C, DNA_G], [0.25, 0.25, 0.25, 0.25])),5 => Pseudoseq.PuzzleMaker.RandomMotif(10000, BioSequences.SamplerWeighted{BioSymbols.DNA}(BioSymbols.DNA[DNA_A, DNA_T, DNA_C, DNA_G], [0.25, 0.25, 0.25, 0.25])),6 => Pseudoseq.PuzzleMaker.RandomMotif(10000, BioSequences.SamplerWeighted{BioSymbols.DNA}(BioSymbols.DNA[DNA_A, DNA_T, DNA_C, DNA_G], [0.25, 0.25, 0.25, 0.25])),1 => Pseudoseq.PuzzleMaker.RandomMotif(10000, BioSequences.SamplerWeighted{BioSymbols.DNA}(BioSymbols.DNA[DNA_A, DNA_T, DNA_C, DNA_G], [0.25, 0.25, 0.25, 0.25]))), Dict{Int64,BioSequences.LongSequence{BioSequences.DNAAlphabet{2}}}(), Dict{Int64,Pseudoseq.PuzzleMaker.SiblingMotif}(), [[1, 2, 3, 4, 5, -4, -6, -2, 7]])
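To make the meaning of signed motif IDs concrete, the following is a small standard-library sketch of stitching an arrangement into one sequence. The helpers reverse_complement and stitch are hypothetical illustrations that operate on plain strings. They are not Pseudoseq functions.

```julia
# Hypothetical helper: reverse complement of a DNA string over A, C, G, T.
function reverse_complement(seq::String)
    complement = Dict('A' => 'T', 'T' => 'A', 'C' => 'G', 'G' => 'C')
    return String([complement[c] for c in reverse(seq)])
end

# Hypothetical helper: concatenate motifs in the order given by `arrangement`.
# A negative ID -N contributes the reverse complement of motif N.
function stitch(motifs::Dict{Int,String}, arrangement::Vector{Int})
    parts = String[]
    for id in arrangement
        seq = motifs[abs(id)]
        push!(parts, id < 0 ? reverse_complement(seq) : seq)
    end
    return join(parts)
end

motifs = Dict(1 => "AAAC", 2 => "GGT", 3 => "CAT")
haplotype = stitch(motifs, [1, 2, 3, -2])
# haplotype == "AAACGGTCATACC": motif 2 ("GGT") reappears at the end as "ACC".
```

A motif followed later by its own reverse complement, as with IDs 2 and -2 here, is the kind of inverted repeat that the hairpin-like arrangement above is built from.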
3. Generate haplotype sequences
With the motifs defined and the haplotypes defined you can now generate sequences!
Simply call the make_puzzle method on the MotifStitcher to get a vector of haplotype sequences. Repeatedly call make_puzzle to get independently generated sequences from the same specification:
julia> make_puzzle(ms)
1-element Array{BioSequences.LongSequence{BioSequences.DNAAlphabet{2}},1}:
CCTCGTGAAGTAAATGACTCACCACTTTTATGGGACAGT…GGTTGGCCACATACATCTGCTACGGTACAACAAGGAAGC
julia> make_puzzle(ms)
1-element Array{BioSequences.LongSequence{BioSequences.DNAAlphabet{2}},1}:
GCACCACGATAAAACACGAGTGGACTAGAGACTATCTAT…TCAGGGCTTCTTGAGCATGGTGGGCGAAACAAAACGTGT
If you provide a filename, the haplotypes will be written to that file in FASTA format. Judging by the REPL output below, the call also returns the sequences:
julia> make_puzzle(ms, "myhaplos.fasta")
1-element Array{BioSequences.LongSequence{BioSequences.DNAAlphabet{2}},1}:
CTGCCGAAACACAACCAATCACCCTGAAGGCTGAAAAAA…GCTAGGCATCGGTAATCTCCCGAACGTTCCTTGTCAACC
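For readers unfamiliar with FASTA, here is a minimal standard-library sketch of writing sequences in that format. It is not how Pseudoseq writes its output. The helper write_fasta, the record names haplotype_1, haplotype_2, and so on, and the line width are all assumptions made for illustration, so the headers Pseudoseq actually writes may look different.

```julia
# Hypothetical helper, not part of Pseudoseq.
# Writes each sequence as a FASTA record: a header line starting with '>',
# followed by the sequence wrapped to `width` characters per line.
function write_fasta(path::String, seqs::Vector{String}; width::Int = 60)
    open(path, "w") do io
        for (i, seq) in enumerate(seqs)
            println(io, ">haplotype_", i)  # assumed record name
            for start in 1:width:length(seq)
                stop = min(start + width - 1, length(seq))
                println(io, seq[start:stop])
            end
        end
    end
    return path
end

path = write_fasta(tempname(), ["ACGTACGT", "TTTT"]; width = 5)
lines = readlines(path)
```

Here lines holds two records, with the first sequence wrapped over two lines because it is longer than the chosen width.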
That's all there is to it. Now you can try a simulated sequencing experiment on your haplotypes.