Example: Build-a-Yeast
This is an example generated from this source file: build-a-yeast.jl
You are seeing the online documentation version. The corresponding notebook can be found here: build-a-yeast.ipynb
and the script can be found here: build-a-yeast.jl
We are going to use Pseudoseq to create a simple set of fake genomes, based on chromosome 1 of the reference genome for yeast.
- Load the sequence of chromosome 1 of the yeast reference from FASTA file.
- Create blank chromosome blueprints for different genome designs.
- Edit the blueprints to achive a: a. Diploid chromosome b. Triploid chromosome c. Tetraploid chromosome
using BioSequences, Pseudoseq
Load the seed sequence
First load the sequence from FASTA file. This uses tools from BioSequences, which is a dependency of Pseudoseq and so the user should have it available, as julia's package manager would have installed any dependencies. The lines below open a fasta file, with a fasta reader, and load a single FASTA record, and get its sequence.
refseq = open(FASTA.Reader, "yeast-chr1.fasta") do rdr
FASTA.sequence(BioSequence{DNAAlphabet{2}}, read(rdr))
end
reflen = length(refseq)
230218
This results in a 230,218nt long sequence called refseq
.
Designing a diploid
First lets make a plan for a diploid genome with 1% heterozygosity. Pseudoseq has a lot of flexibilty in how you can define the heterozygosity, but for ease here I'm going to let it decide where the sites should be, and what the alleles at each site should be, by calling plan_het
, providing just the proportion of sites I want to be heterozygous, and the number of alleles I want at each of these heterozygous sites (2).
dploid = plan_chrom(reflen, 2)
dploidhet = plan_het(dploid, .01, 2)
Specification for a chromosome:
Number of copies: 2
Length of chromosomes: 230218
227916bp available in the following stretches:
Now fabricate the FASTA, with fabricate
, providing a filename, and then a series of blueprint => seed
pairs. In this case we just give 1 blueprint-seed
# pair.
fabricate("build-a-yeast-di.fasta", dploidhet => refseq)
[ Info: Opening FASTA file at: build-a-yeast-di.fasta
[ Info: Fabricated 1 chromosomes and written to file
Thats it for the diploid genome.
Designing a triploid
Now let's make a triploid genome. This should also be simple enough. It's much like the process for making the diploid genome, but specifying 3 chromosome copies instead of 2.
trploid = plan_chrom(reflen, 3)
Specification for a chromosome:
Number of copies: 3
Length of chromosomes: 230218
230218bp available in the following stretches:
When I add some heterozygous sites, just like with the diploid case, I'm going to simply ask for a proportion of heterozygous sites (and let Pseudoseq decide where they should occur. Like last time also I'm going to make all the sites have 2 alleles, so for every heterozygous site, two of the three chromosome copies willl have the same base, and one of the copies will be different. There are other ways to define some heterozygous sites for example all 3 chromosome copies could have different bases
trploidhet = plan_het(trploid, .01, 2)
Specification for a chromosome:
Number of copies: 3
Length of chromosomes: 230218
227916bp available in the following stretches:
Now fabricate the FASTA, with fabricate
, providing a filename, and then a series of blueprint => seed
pairs. In this case we just give 1 blueprint-seed
# pair.
fabricate("build-a-yeast-tri.fasta", trploidhet => refseq)
[ Info: Opening FASTA file at: build-a-yeast-tri.fasta
[ Info: Fabricated 1 chromosomes and written to file
That's it for the triploid genome.
Designing a tetraploid genome
Lets make a tetraploid genome, that consists of two diploid genomes, an A genome and a B genome. First I'm going to create a chromosome blueprint with 4 copies:
tetploid = plan_chrom(reflen, 4)
Specification for a chromosome:
Number of copies: 4
Length of chromosomes: 230218
230218bp available in the following stretches:
We want to plan the heterozygosity in this tetraploid such that the number of heterozygous sites within the A and B genomes is ~1%. But we want ~3% of sites to be heterozygous between the A and B genomes.
First we will say copies 1 + 2 in our blueprint are the A genome, and copies 3 + 4 are the B genome. Next consider heterozygous sites with 2 different alleles, A and B, 6 different patterns are possible:
Copy 1 | Copy 2 | Copy 3 | Copy 4 |
---|---|---|---|
A | A | B | B |
B | B | A | A |
A | B | A | B |
B | A | B | A |
A | B | B | A |
B | A | A | B |
The first 2 patterns in the table constitute fixed mutations between genomes A and B, but not polymorphism within genomes A and B. This is because Copy 1 and 3 differ, as do copies 2 and 4, yet copies 1 and 2 are the same, as are copies 3 and 4. To achieve a ~3% divergence between genomes A and B, let's plan some divergent mutations by grouping the copy numbers together when planning some heterozygous sites:
thet = plan_het(tetploid, .03, [1, 2], [3, 4])
Specification for a chromosome:
Number of copies: 4
Length of chromosomes: 230218
223312bp available in the following stretches:
By using the groups [1, 2]
and [3, 4]
we ensure chromosome copy 1 and 2 will have the same allele, as will chromosomes 3 and 4
Next we'll just add some of the other heterozygosity patterns to achieve the 1% internal heterozygosity desired. We'll use a heterozygosity pattern with two states.
thet = plan_het(thet, .01, [1, 3], [2, 4])
fabricate("build-a-yeast-tet.fasta", thet => refseq)
[ Info: Opening FASTA file at: build-a-yeast-tet.fasta
[ Info: Fabricated 1 chromosomes and written to file
This page was generated using Literate.jl.