Build-a-Genome: Core concepts & basic workflow

Pseudoseq allows you to plan and build genomes and chromosomes that have a certain set of features and peculiarities. The purpose for doing this is not to recreate biology perfectly. The purpose is to create genomes you understand fully (where the repeated content is, which positions are heterozygous and so on).

Using such genomes can help you both understand and develop an intuition of what current genome assembly tools are doing, and also to help design assembly tools, and perhaps even plan sequencing experiments and form hypotheses.

This manual includes several examples showing how to plan genomes with certain characteristics. But the core workflow, and important concepts are explained below, in 3 steps.

1. Creating chromosome blueprints

Chromosome blueprints are the backbone of simulating genomes with Pseudoseq.

Chromosome blueprints determine the nature of one chromosome in a genome.

You can think of chromosome blueprints in Pseudoseq as: a collection of operations which, when applied to some seed sequence, result in a set of N homologous sequences.

Note

Chromosome blueprints are immutable: Adding an operation to a chromosome blueprint creates a new blueprint, which is a copy of the old blueprint with the addition of the new operation.

Note

Ideally, for any chromosome blueprint you construct with Pseudoseq, for each operation is possesses, it must be possible (at least in principle) to be able to intuit what the effect of that operation will be on:

The Khmer distribution produced by sequencing reads of the fabricated chromosome.
The structure of a sequence graph produced by sequencing reads of the fabricated chromosome.

Therefore, we have made the design decision that no two operations in a chromosome blueprint may affect the same position(s) of the genome in a conflicting manner. To meet this requirement, certain operations "consume" a region of the chromosome planned in a blueprint. If a region is consumed, another operation that would also affect that region cannot be added to the blueprint.

Depending on the genome, any given chromosome may be present in multiple copies. Diploids, for example have two copies of every chromosome.

The first step in simulating any artificial genome is to create one or more blank chromosome blueprints. The plan_chrom function is used for this.

For example, this:

julia> using Pseudoseq

julia> c = plan_chrom(100, 2)
Specification for a chromosome:
 Number of copies: 2
 Length of chromosomes: 100
 100bp available in the following stretches:

will create a blank blueprint for 2 copies of a chromosome of 100bp length. From the output you can see that it prints for you the number of chromosome copies, the length of the chromosomes, and a list of available, unconsumed regions of the chromosome (see note above).

2. Adding planned features to a chromosome blueprint

After creating a fresh chromosome blueprint, no plans (operations) have been added yet.

If you were to fabricate this blank blueprint, you would get N identical DNA sequences as output, where N is the number of copies the blueprint was planning.

Once you have one or more chromosome blueprints, you can add features to them.

This is done with a series of consistently named plan_* functions.

Note

Remember; blueprints are immutable, so every time one of these plan_* functions is used to add a feature to a chromosome blueprint, a new chromosome blueprint is created.

Repetitions

A repetition is a segment of a sequence, that has the exact same motif, as another segment of the sequence.

In Pseudoseq, to plan a repetition, you specify a region the repetition will copy, sometimes called the from region. You also specify a region where the motif in the from region will be replicated, called the to region. So you might find it helpful to imagine a planned repetition as a kind of copy-paste operation that occurs during fabricate.

Note

Repetitions consume the to region of the chromosome blueprint to which they are added. Repetitions do not consume the from region, so other operations are free to affect the motif in the from region. Just remember that the repetition will replicate anything in the from region, including other features such as heterozygosity that occur in from.

Repetitions are added to a genome blueprint using the plan_repetition function.

Heterozygosity

A heterozygosity describes a base position at which the copies of the chromosome in the blueprint differ.

For a blueprint with 2 copies, both copies will differ at a given position.

For a triploid, at a heterozygous position, all 3 copies might differ from each other. Alternatively, it is possible that 2 copies are the same, but they differ from a 3rd copy. This applies for blueprints with higher copy numbers too: at a heterozygous position some copies will differ from each other at that position, but some of the copies might be the same.

Note

Heterozygosity operations consume the position at which they are defined.

Heterozygous positions are planned using the plan_het function.

For example, below will make it so as about 50% of the two chromosome copies are heterozygous:

julia> using Pseudoseq

julia> c = plan_chrom(100, 2)
Specification for a chromosome:
 Number of copies: 2
 Length of chromosomes: 100
 100bp available in the following stretches:

julia> chet = plan_het(c, .50, 2)
Specification for a chromosome:
 Number of copies: 2
 Length of chromosomes: 100
 50bp available in the following stretches:

The above use of plan_het allows the function to choose which sites in the blueprint are heterozygous, and how to allocate the bases at the heterozygous sites. We only instruct the function that 50% of the genome should be heterozygous, and that there should be 2 possible bases at each position (the only option really: the blueprint plans for a diploid - 2 chromosome copies).

You can also take more fine-grained control:

julia> using Pseudoseq, BioSequences

julia> c = plan_chrom(100, 2)
Specification for a chromosome:
 Number of copies: 2
 Length of chromosomes: 100
 100bp available in the following stretches:

julia> chet = plan_het(c, 20, [DNA_T, DNA_A])
Specification for a chromosome:
 Number of copies: 2
 Length of chromosomes: 100
 99bp available in the following stretches:

Here I planned a heterozygous position, at one site in the chromosome (20), and I set the state of the first copy to DNA_T, and the state of the second copy to DNA_A.

So as you can see, plan_het is very flexible. Check it's API documentation, and the "Build-A-Yeast" example walkthrough to see more examples of plan_het use.

Utility functions

There is a utility function suggest_regions available to help you plan where to place features.

Say you wanted to see where you could place 3 5bp repetitions, you could do the following:

julia> r = suggest_regions(chet, 5, 6)
6-element Array{UnitRange{Int64},1}:
 4:8
 38:42
 73:77
 22:26
 14:18
 46:50

So now you have 6 5bp regions, every 2 regions defining a from and a to region for a single repetition.

You can provide r as an input to plan_repetition.

julia> crep = plan_repetition(chet, r)
Specification for a chromosome:
 Number of copies: 2
 Length of chromosomes: 100
 84bp available in the following stretches:

Another utility function suggest_alleles is available to help you plan nucleotide patterns at heterozygous sites.

Say you wanted to plan a pattern in which two of three chromosome copies had the same base, and a third differed.

julia> suggest_alleles(3, 2)
3-element Array{BioSymbols.DNA,1}:
 DNA_T
 DNA_C
 DNA_C

The above asks for an allele pattern for 3 copies of a chromosome, with 2 possible alleles, which chromosome copy gets which allele is random.

You can also take more control and specify which chromosome copy gets which state:

julia> suggest_alleles(chet, [1, 1, 2])
ERROR: MethodError: no method matching suggest_alleles(::Pseudoseq.ChromosomeBlueprint, ::Array{Int64,1})
Closest candidates are:
  suggest_alleles(!Matched::Array{Int64,1}...) at /home/travis/build/bioinfologics/Pseudoseq.jl/src/build_a_genome/plan/polymorphism.jl:98
  suggest_alleles(!Matched::Int64, ::Any...) at /home/travis/build/bioinfologics/Pseudoseq.jl/src/build_a_genome/plan/polymorphism.jl:140

julia> # Or alternatively put:
       suggest_alleles(chet, [1, 2], [3])
ERROR: MethodError: no method matching suggest_alleles(::Pseudoseq.ChromosomeBlueprint, ::Array{Int64,1}, ::Array{Int64,1})
Closest candidates are:
  suggest_alleles(!Matched::Array{Int64,1}...) at /home/travis/build/bioinfologics/Pseudoseq.jl/src/build_a_genome/plan/polymorphism.jl:98
  suggest_alleles(!Matched::Int64, ::Any...) at /home/travis/build/bioinfologics/Pseudoseq.jl/src/build_a_genome/plan/polymorphism.jl:140

In the above, which chromosomes get the same allele is user-specified, but which bases those alleles are, is randomly determined.

You can get many suggestions at once:

julia> suggest_alleles(5, [1, 1, 2])
5-element Array{Array{BioSymbols.DNA,1},1}:
 [DNA_C, DNA_C, DNA_G]
 [DNA_C, DNA_C, DNA_A]
 [DNA_T, DNA_T, DNA_G]
 [DNA_C, DNA_C, DNA_G]
 [DNA_G, DNA_G, DNA_A]

See the API docs for suggest_alleles for more details on the arguments permitted by the different methods.

3. Fabricate a FASTA from chromosome blueprints

Once you have a set of chromosome blueprints with the features planned that you desire, you can fabricate the sequences of these chromosomes.

To do that, use the fabricate method.

You can simply fabricate the sequences, for use in the interactive session:

julia> fabricate(chet)
2-element Array{BioSequences.BioSequence{BioSequences.DNAAlphabet{2}},1}:
 ATTGAAAAAGATGACGAACTGTTCCAGTCTTTTCCCGCA…CGTGCGCGTATTAAACTGTTGGCCTGAGATATGCGGGCA
 ATTGAAAAAGATGACGAACAGTTCCAGTCTTTTCCCGCA…CGTGCGCGTATTAAACTGTTGGCCTGAGATATGCGGGCA

Or have the sequences output to a FASTA formatted file.

julia> fabricate("mychrom.fasta", chet)
[ Info: Opening FASTA file at: mychrom.fasta
[ Info: Fabricated 1 chromosomes and written to file

A randomly generated seed sequence is used to start the fabrication process unless you provide one. See the API docs for fabricate for more details.

And that's all there is to building a chromosome with Pseudoseq. To build a genome with more than one chromosome, simply build a set of chromosome blueprints.

Now check out the examples to see how various genomes can be built with Pseudoseq.