Tutorial
Pre-defined tuples
First, we need to load the required packages:
using GCATBase, BioSequences
We can access predefined sets of k-mers like codons, dinucleotides, and tetranucleotides:
GCATBase.codons
64-element Vector{Union{BioSequences.LongDNA, BioSequences.LongRNA}}:
AAA
TAA
CAA
GAA
ATA
TTA
CTA
GTA
ACA
TCA
⋮
GTG
ACG
TCG
CCG
GCG
AGG
TGG
CGG
GGG
GCATBase.dinucs
16-element Vector{Union{BioSequences.LongDNA, BioSequences.LongRNA}}:
AA
TA
CA
GA
AT
TT
CT
GT
AC
TC
CC
GC
AG
TG
CG
GG
GCATBase.tetranucs
256-element Vector{Union{BioSequences.LongDNA, BioSequences.LongRNA}}:
AAAA
TAAA
CAAA
GAAA
ATAA
TTAA
CTAA
GTAA
ACAA
TCAA
⋮
GTGG
ACGG
TCGG
CCGG
GCGG
AGGG
TGGG
CGGG
GGGG
It is also possible to create other sets, e.g. all dinucleotides over the alphabet {A, T}:
alltuples((DNA_A, DNA_T), 2)
4-element Vector{Union{BioSequences.LongDNA, BioSequences.LongRNA}}:
AA
TA
AT
TT
Tuples from sequences
A sequence can be split into non-overlapping tuples of a given length using the Base.split function. Here, we generate a random DNA sequence of length 10 and split it into codons (tuples of length 3); as the output shows, the incomplete trailing tuple is dropped:
seq = randseq(DNAAlphabet{4}(), 10)
println(seq)
codons = split(seq; l=3)
println(codons)
CGACATTATA
BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}[CGA, CAT, TAT]
Shifting sequences
circshift(dna"AGCT")
4nt DNA Sequence:
GCTA
Let's shift some RNA in the other direction:
circshift(rna"AGCU"; k=-1)
4nt RNA Sequence:
UAGC
We can also combine circshift with Base.split to cyclically shift every k-mer in a sequence:
seq = rna"CAGCUUGAG"
join(circshift.(split(seq, l=3)))
"AGCUUCAGG"
Genetic code tables
GCATBase provides mappings for genetic code tables. GCATBase.translateAA2Codons creates a dictionary which maps each amino acid to the set of codons that encode it. This is useful when the mapping has to be looked up frequently.
a2c = translateAA2Codons()
println(a2c)
println(a2c[AA_G])
println(a2c[AA_Term])
Dict{BioSymbols.AminoAcid, Set{BioSequences.LongDNA}}(AA_V => Set([GTT, GTA, GTG, GTC]), AA_A => Set([GCG, GCC, GCT, GCA]), AA_E => Set([GAA, GAG]), AA_K => Set([AAA, AAG]), AA_I => Set([ATA, ATC, ATT]), AA_P => Set([CCT, CCA, CCG, CCC]), AA_D => Set([GAC, GAT]), AA_G => Set([GGA, GGG, GGC, GGT]), AA_F => Set([TTC, TTT]), AA_S => Set([AGC, TCT, TCA, AGT, TCG, TCC]), AA_C => Set([TGT, TGC]), AA_N => Set([AAT, AAC]), AA_L => Set([TTA, CTG, CTC, TTG, CTT, CTA]), AA_Y => Set([TAC, TAT]), AA_Term => Set([TAG, TGA, TAA]), AA_Q => Set([CAA, CAG]), AA_T => Set([ACA, ACG, ACC, ACT]), AA_M => Set([ATG]), AA_H => Set([CAT, CAC]), AA_W => Set([TGG]), AA_R => Set([CGA, CGG, CGC, AGA, CGT, AGG]))
Set(BioSequences.LongDNA[GGA, GGG, GGC, GGT])
Set(BioSequences.LongDNA[TAG, TGA, TAA])
We can also choose one of the genetic codes listed by NCBI (https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi). BioSequences.ncbi_trans_table provides a subset of them as a list:
using BioSequences
ncbi_trans_table
Translation Tables:
1. The Standard Code (standard_genetic_code)
2. The Vertebrate Mitochondrial Code (vertebrate_mitochondrial_genetic_code)
3. The Yeast Mitochondrial Code (yeast_mitochondrial_genetic_code)
4. The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (mold_mitochondrial_genetic_code)
5. The Invertebrate Mitochondrial Code (invertebrate_mitochondrial_genetic_code)
6. The Ciliate, Dasycladacean and Hexamita Nuclear Code (ciliate_nuclear_genetic_code)
9. The Echinoderm and Flatworm Mitochondrial Code (echinoderm_mitochondrial_genetic_code)
10. The Euplotid Nuclear Code (euplotid_nuclear_genetic_code)
11. The Bacterial, Archaeal and Plant Plastid Code (bacterial_plastid_genetic_code)
12. The Alternative Yeast Nuclear Code (alternative_yeast_nuclear_genetic_code)
13. The Ascidian Mitochondrial Code (ascidian_mitochondrial_genetic_code)
14. The Alternative Flatworm Mitochondrial Code (alternative_flatworm_mitochondrial_genetic_code)
15. Blepharisma Macronuclear Code (blepharisma_macronuclear_genetic_code)
16. Chlorophycean Mitochondrial Code (chlorophycean_mitochondrial_genetic_code)
21. Trematode Mitochondrial Code (trematode_mitochondrial_genetic_code)
22. Scenedesmus obliquus Mitochondrial Code (scenedesmus_obliquus_mitochondrial_genetic_code)
23. Thraustochytrium Mitochondrial Code (thraustochytrium_mitochondrial_genetic_code)
24. Pterobranchia Mitochondrial Code (pterobrachia_mitochondrial_genetic_code)
25. Candidate Division SR1 and Gracilibacteria Code (candidate_division_sr1_genetic_code)
Next, we create the mapping for the vertebrate mitochondrial code (index 2), which has four stop codons. translateAA2Codons expects two parameters: 1) the genetic code table and 2) which nucleotides should be used (i.e. BioSequences.DNA or BioSequences.RNA).
a2c = translateAA2Codons(ncbi_trans_table[2], DNA)
println(a2c[AA_V])
println(a2c[AA_Term]) # 4 stop codons
Set(BioSequences.LongDNA[GTT, GTA, GTG, GTC])
Set(BioSequences.LongDNA[TAG, AGA, TAA, AGG])
We could also create codons based on RNA.
a2cRNA = translateAA2Codons(ncbi_trans_table[2], RNA)
println(a2cRNA[AA_V])
Set(BioSequences.LongRNA[GUU, GUA, GUG, GUC])
It is also possible to obtain the amino acid encoded by a codon. GCATBase.translateCodon2AA creates a dictionary which maps a codon to its amino acid. This is similar to BioSequences.translate, except that the result is not a sequence but the amino acid itself (of type AminoAcid).
c2a = translateCodon2AA() # Standard genetic code / DNA
println(c2a[dna"ATG"])
M
API
Base.circshift — Method
circshift(seq::BioSequences.BioSequence; k) -> Any
Cyclically shift the elements of a sequence. By default the shift is one position to the left (AUGC -> UGCA); a negative k, such as k=-1, shifts to the right.
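To make the direction of the shift concrete, here is a minimal sketch on plain strings using only Julia's Base. It is not the GCATBase implementation: `cyclic_shift` is a hypothetical helper, and taking k as a positional argument with a default of 1 is an assumption made for the illustration.

```julia
# Hypothetical helper (not part of GCATBase): cyclic shift of a string.
# Positive k shifts to the left, negative k shifts to the right.
function cyclic_shift(s::AbstractString, k::Integer=1)
    chars = collect(s)
    isempty(chars) && return String(s)
    # Base.circshift rotates arrays to the right for positive shifts,
    # so the sign is flipped to obtain a left shift.
    return String(circshift(chars, -k))
end

cyclic_shift("AUGC")       # "UGCA"
cyclic_shift("AGCU", -1)   # "UAGC"
```

The two calls reproduce the left shift from the description above and the k=-1 example from the tutorial.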
Base.split — Method
split(seq::BioSequences.BioSequence; l) -> Vector
Create tuples of length l by splitting a sequence seq. seq must implement the collect method.
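The splitting step can be sketched on plain strings with Base Julia alone. `split_tuples` is a hypothetical helper and not the GCATBase method; dropping an incomplete trailing tuple mirrors the tutorial output, where a sequence of length 10 yields three codons.

```julia
# Hypothetical helper (not part of GCATBase): split a string into
# non-overlapping tuples of length l, dropping an incomplete remainder.
function split_tuples(s::AbstractString, l::Integer)
    l > 0 || throw(ArgumentError("tuple length must be positive"))
    chars = collect(s)
    n = length(chars) ÷ l   # number of complete tuples
    return String[String(chars[(i - 1) * l + 1 : i * l]) for i in 1:n]
end

split_tuples("CGACATTATA", 3)   # ["CGA", "CAT", "TAT"]
```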
GCATBase.alltuples — Method
alltuples(
alphabet::NTuple{N, Union{BioSymbols.DNA, BioSymbols.RNA}},
l::Int64
) -> Vector{Union{BioSequences.LongDNA, BioSequences.LongRNA}}
Create all possible tuples of length l from the given alphabet. The tuples are returned as a vector of LongDNA or LongRNA sequences, depending on the type of the alphabet (DNA or RNA).
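The enumeration can be illustrated with a Cartesian product over characters. This is a sketch under stated assumptions rather than the package's code: `alltuples_sketch` is a hypothetical stand-in that takes Chars and returns Strings, and it is written so that the first position varies fastest, which matches the ordering of the predefined sets shown in the tutorial (AA, TA, AT, TT).

```julia
# Hypothetical stand-in (not part of GCATBase): all tuples of length l
# over an alphabet of Chars, returned as Strings.
function alltuples_sketch(alphabet, l::Integer)
    l > 0 || throw(ArgumentError("tuple length must be positive"))
    # One copy of the alphabet per position; Iterators.product varies
    # the first position fastest.
    positions = ntuple(_ -> alphabet, l)
    return vec([join(t) for t in Iterators.product(positions...)])
end

alltuples_sketch(('A', 'T'), 2)                     # ["AA", "TA", "AT", "TT"]
length(alltuples_sketch(('A', 'T', 'C', 'G'), 3))   # 64
```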
GCATBase.translateAA2Codons — Method
translateAA2Codons(
code::BioSequences.GeneticCode,
S::Union{Type{BioSymbols.DNA}, Type{BioSymbols.RNA}}
) -> Union{Dict{BioSymbols.AminoAcid, Set{BioSequences.LongDNA}}, Dict{BioSymbols.AminoAcid, Set{BioSequences.LongRNA}}}
Dictionary which maps an amino acid to a set of corresponding codons for a given genetic code. The codons can either be BioSequences.DNA or BioSequences.RNA as specified in parameter S. See BioSequences.ncbi_trans_table for a list of all known genetic codes.
Examples
The Vertebrate Mitochondrial Code (index 2) is used. This code has four stop codons.
using GCATBase, BioSequences
a2c = translateAA2Codons(ncbi_trans_table[2], DNA)
sort([a2c[AA_Term]...]) # Stop signal (set is sorted)
# output
4-element Vector{LongSequence{DNAAlphabet{4}}}:
AGA
AGG
TAA
TAG
GCATBase.translateCodon2AA — Method
translateCodon2AA(
code::BioSequences.GeneticCode,
S::Union{Type{BioSymbols.DNA}, Type{BioSymbols.RNA}}
) -> Union{Dict{BioSequences.LongDNA, BioSymbols.AminoAcid}, Dict{BioSequences.LongRNA, BioSymbols.AminoAcid}}
Dictionary which maps a codon to its encoded amino acid for a given genetic code. See BioSequences.ncbi_trans_table for a list of all known genetic codes.
Examples
The Vertebrate Mitochondrial Code (index 2) is used. This code has four stop codons.
using GCATBase, BioSequences
c2a = translateCodon2AA(ncbi_trans_table[2], DNA)
c2a[dna"ATG"]
# output
AA_M
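The two dictionaries are inverses of each other in the following sense: grouping the codon-to-amino-acid mapping by amino acid gives the amino-acid-to-codons mapping. The sketch below shows this on a small excerpt of the standard genetic code, using plain Strings and Chars instead of BioSequences types. `invert_code` and the toy table are hypothetical and only illustrate the idea; they say nothing about how GCATBase builds its dictionaries.

```julia
# Toy excerpt of the standard genetic code: codon => one-letter amino acid,
# with '*' standing for the stop signal.
codon_to_aa = Dict(
    "ATG" => 'M', "TGG" => 'W',
    "GGA" => 'G', "GGC" => 'G', "GGG" => 'G', "GGT" => 'G',
    "TAA" => '*', "TAG" => '*', "TGA" => '*',
)

# Hypothetical helper: group the keys of a dictionary by their value.
function invert_code(c2a::AbstractDict{K,V}) where {K,V}
    a2c = Dict{V,Set{K}}()
    for (codon, aa) in c2a
        # Create the set on first use, then add the codon to it.
        push!(get!(() -> Set{K}(), a2c, aa), codon)
    end
    return a2c
end

aa_to_codons = invert_code(codon_to_aa)
aa_to_codons['*']   # the three stop codons TAA, TAG, TGA
```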