Tutorial

Pre-defined tuples

First, we need to load the required packages:

using GCATBase, BioSequences

We can access predefined sets of k-mers like codons, dinucleotides, and tetranucleotides:

GCATBase.codons
64-element Vector{Union{BioSequences.LongDNA, BioSequences.LongRNA}}:
 AAA
 TAA
 CAA
 GAA
 ATA
 TTA
 CTA
 GTA
 ACA
 TCA
 ⋮
 GTG
 ACG
 TCG
 CCG
 GCG
 AGG
 TGG
 CGG
 GGG

GCATBase.dinucs
16-element Vector{Union{BioSequences.LongDNA, BioSequences.LongRNA}}:
 AA
 TA
 CA
 GA
 AT
 TT
 CT
 GT
 AC
 TC
 CC
 GC
 AG
 TG
 CG
 GG

GCATBase.tetranucs
256-element Vector{Union{BioSequences.LongDNA, BioSequences.LongRNA}}:
 AAAA
 TAAA
 CAAA
 GAAA
 ATAA
 TTAA
 CTAA
 GTAA
 ACAA
 TCAA
 ⋮
 GTGG
 ACGG
 TCGG
 CCGG
 GCGG
 AGGG
 TGGG
 CGGG
 GGGG

It is also possible to create other sets, e.g. all dinucleotides over the alphabet {A, T}:

alltuples((DNA_A, DNA_T), 2)
4-element Vector{Union{BioSequences.LongDNA, BioSequences.LongRNA}}:
 AA
 TA
 AT
 TT
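
The ordering in the output above (first position varying fastest) can be reproduced with plain Julia. The following is a minimal sketch that uses only BioSequences and Iterators.product; the helper name enumerate_tuples is hypothetical and this is not necessarily how alltuples is implemented.

```julia
using BioSequences

# Hypothetical helper, not part of GCATBase: enumerate all tuples of length l
# over a DNA alphabet, with the first position varying fastest.
function enumerate_tuples(alphabet, l::Integer)
    # Cartesian product of l copies of the alphabet
    combos = Iterators.product(ntuple(_ -> alphabet, l)...)
    # Turn every tuple of symbols into a sequence and flatten to a vector
    return vec([LongDNA{4}(collect(t)) for t in combos])
end

tuples = enumerate_tuples((DNA_A, DNA_T), 2)
println(join(string.(tuples), " "))  # AA TA AT TT
```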

Tuples from sequences

A sequence can be split into non-overlapping tuples of a given length using the Base.split function. Here, we generate a random DNA sequence of length 10 and split it into codons (tuples of length 3). As the output shows, trailing nucleotides that do not fill a complete tuple are dropped:

seq = randseq(DNAAlphabet{4}(), 10)
println(seq)
codons = split(seq; l=3)
println(codons)
CGACATTATA
BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}[CGA, CAT, TAT]
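
To make the splitting behavior concrete, here is a rough equivalent written with plain index ranges. It is a sketch under the assumption that tuples start at position 1 and an incomplete trailing tuple is discarded, as in the output above; the helper name chunk is hypothetical.

```julia
using BioSequences

# Hypothetical helper: cut seq into consecutive, non-overlapping windows
# of length l; leftover symbols at the end are ignored.
function chunk(seq, l::Integer)
    starts = 1:l:(length(seq) - l + 1)
    return [seq[i:(i + l - 1)] for i in starts]
end

parts = chunk(dna"CGACATTATA", 3)
println(join(string.(parts), " "))  # CGA CAT TAT
```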

Shifting sequences

Without the keyword argument k, circshift shifts a sequence cyclically by one position to the left:

circshift(dna"AGCT")
4nt DNA Sequence:
GCTA

Let's shift some RNA in the other direction:

circshift(rna"AGCU"; k=-1)
4nt RNA Sequence:
UAGC
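
The two shifts above can be expressed with slicing and concatenation. This is only an illustrative sketch using BioSequences alone; the name rotate_left is hypothetical, and the sign convention (positive k shifts left) is an assumption read off the examples above.

```julia
using BioSequences

# Hypothetical helper: rotate seq by k positions to the left;
# negative k rotates to the right.
function rotate_left(seq, k::Integer = 1)
    n = length(seq)
    n == 0 && return seq
    k = mod(k, n)                      # normalise k into 0:(n - 1)
    return seq[(k + 1):n] * seq[1:k]   # tail first, then the wrapped head
end

println(rotate_left(dna"AGCT"))      # GCTA
println(rotate_left(rna"AGCU", -1))  # UAGC
```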

We can also combine circshift with Base.split to cyclically shift every k-mer of a sequence. Here, each codon is shifted by one position to the left and the results are joined into a string:

seq = rna"CAGCUUGAG"
join(circshift.(split(seq, l=3)))
"AGCUUCAGG"

Genetic code tables

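GCATBase provides mappings for genetic code tables. GCATBase.translateAA2Codons creates a dictionary which maps each amino acid to the set of codons that encode it. Building the dictionary once is useful when the mapping has to be looked up frequently. Called without arguments, it returns DNA codons of the standard genetic code, as the output below shows.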

a2c = translateAA2Codons()
println(a2c)

println(a2c[AA_G])
println(a2c[AA_Term])
Dict{BioSymbols.AminoAcid, Set{BioSequences.LongDNA}}(AA_V => Set([GTT, GTA, GTG, GTC]), AA_A => Set([GCG, GCC, GCT, GCA]), AA_E => Set([GAA, GAG]), AA_K => Set([AAA, AAG]), AA_I => Set([ATA, ATC, ATT]), AA_P => Set([CCT, CCA, CCG, CCC]), AA_D => Set([GAC, GAT]), AA_G => Set([GGA, GGG, GGC, GGT]), AA_F => Set([TTC, TTT]), AA_S => Set([AGC, TCT, TCA, AGT, TCG, TCC]), AA_C => Set([TGT, TGC]), AA_N => Set([AAT, AAC]), AA_L => Set([TTA, CTG, CTC, TTG, CTT, CTA]), AA_Y => Set([TAC, TAT]), AA_Term => Set([TAG, TGA, TAA]), AA_Q => Set([CAA, CAG]), AA_T => Set([ACA, ACG, ACC, ACT]), AA_M => Set([ATG]), AA_H => Set([CAT, CAC]), AA_W => Set([TGG]), AA_R => Set([CGA, CGG, CGC, AGA, CGT, AGG]))
Set(BioSequences.LongDNA[GGA, GGG, GGC, GGT])
Set(BioSequences.LongDNA[TAG, TGA, TAA])

We can also select a different genetic code from those listed by NCBI at https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi. BioSequences.ncbi_trans_table provides a subset of these tables as a list:

using BioSequences

ncbi_trans_table
Translation Tables:
  1. The Standard Code (standard_genetic_code)
  2. The Vertebrate Mitochondrial Code (vertebrate_mitochondrial_genetic_code)
  3. The Yeast Mitochondrial Code (yeast_mitochondrial_genetic_code)
  4. The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (mold_mitochondrial_genetic_code)
  5. The Invertebrate Mitochondrial Code (invertebrate_mitochondrial_genetic_code)
  6. The Ciliate, Dasycladacean and Hexamita Nuclear Code (ciliate_nuclear_genetic_code)
  9. The Echinoderm and Flatworm Mitochondrial Code (echinoderm_mitochondrial_genetic_code)
 10. The Euplotid Nuclear Code (euplotid_nuclear_genetic_code)
 11. The Bacterial, Archaeal and Plant Plastid Code (bacterial_plastid_genetic_code)
 12. The Alternative Yeast Nuclear Code (alternative_yeast_nuclear_genetic_code)
 13. The Ascidian Mitochondrial Code (ascidian_mitochondrial_genetic_code)
 14. The Alternative Flatworm Mitochondrial Code (alternative_flatworm_mitochondrial_genetic_code)
 15. Blepharisma Macronuclear Code (blepharisma_macronuclear_genetic_code)
 16. Chlorophycean Mitochondrial Code (chlorophycean_mitochondrial_genetic_code)
 21. Trematode Mitochondrial Code (trematode_mitochondrial_genetic_code)
 22. Scenedesmus obliquus Mitochondrial Code (scenedesmus_obliquus_mitochondrial_genetic_code)
 23. Thraustochytrium Mitochondrial Code (thraustochytrium_mitochondrial_genetic_code)
 24. Pterobranchia Mitochondrial Code (pterobrachia_mitochondrial_genetic_code)
 25. Candidate Division SR1 and Gracilibacteria Code (candidate_division_sr1_genetic_code)

Now a translation is created for the vertebrate mitochondrial code (index 2), which has four stop codons. translateAA2Codons expects two parameters: 1) the genetic code table and 2) which nucleotides should be used (i.e. BioSequences.DNA or BioSequences.RNA).

a2c = translateAA2Codons(ncbi_trans_table[2], DNA)

println(a2c[AA_V])
println(a2c[AA_Term]) # 4 stop codons
Set(BioSequences.LongDNA[GTT, GTA, GTG, GTC])
Set(BioSequences.LongDNA[TAG, AGA, TAA, AGG])

We could also create codons based on RNA.

a2cRNA = translateAA2Codons(ncbi_trans_table[2], RNA)

println(a2cRNA[AA_V])
Set(BioSequences.LongRNA[GUU, GUA, GUG, GUC])
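
For readers curious how such a mapping can be derived, the sketch below groups all 64 RNA codons by the amino acid that BioSequences.translate assigns to them. It relies only on BioSequences; the helper name codons_by_amino_acid is hypothetical and this is not presented as GCATBase's implementation.

```julia
using BioSequences

# Hypothetical helper, not GCATBase's implementation: group all 64 RNA codons
# by the amino acid they encode under the given genetic code.
function codons_by_amino_acid(code = BioSequences.standard_genetic_code)
    nts = (RNA_A, RNA_C, RNA_G, RNA_U)
    table = Dict{AminoAcid, Set{LongRNA{4}}}()
    for a in nts, b in nts, c in nts
        codon = LongRNA{4}([a, b, c])
        aa = translate(codon; code = code)[1]  # one codon -> one amino acid
        push!(get!(table, aa, Set{LongRNA{4}}()), codon)
    end
    return table
end

standard = codons_by_amino_acid()
mito = codons_by_amino_acid(ncbi_trans_table[2])
println(length(standard[AA_Term]))  # 3
println(length(mito[AA_Term]))      # 4
```

The stop codon counts agree with the outputs shown above for the standard code and the vertebrate mitochondrial code.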

It is also possible to obtain the amino acid which is encoded by a codon. GCATBase.translateCodon2AA creates a dictionary which maps a codon to its amino acid. This is similar to BioSequences.translate, with the difference that the result is not an amino acid sequence but a single amino acid (of type AminoAcid).

c2a = translateCodon2AA() # Standard genetic code / DNA

println(c2a[dna"ATG"])
M
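
The difference to BioSequences.translate mentioned above can be seen directly. The following short sketch uses only BioSequences: translate returns a sequence, and indexing it yields the AminoAcid element that a codon-to-amino-acid dictionary would store as its value.

```julia
using BioSequences

protein = translate(rna"AUG")  # an amino acid sequence of length 1
aa = protein[1]                # the AminoAcid element itself

println(protein isa LongAA)    # true
println(aa isa AminoAcid)      # true
println(aa == AA_M)            # true
```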

API

Base.circshift - Method
circshift(seq::BioSequences.BioSequence; k) -> Any

Cyclically shift the elements of a sequence. Without k, the shift is one position to the left (AUGC -> UGCA); a negative k shifts to the right.

Base.split - Method
split(seq::BioSequences.BioSequence; l) -> Vector

Create tuples of length l by splitting a sequence seq. seq must implement the collect method.

GCATBase.alltuples - Method
alltuples(
    alphabet::NTuple{N, Union{BioSymbols.DNA, BioSymbols.RNA}},
    l::Int64
) -> Vector{Union{BioSequences.LongDNA, BioSequences.LongRNA}}

Create all possible tuples of length l from the given alphabet. The tuples are returned as a vector of LongDNA or LongRNA sequences, depending on the type of the alphabet (DNA or RNA).

GCATBase.translateAA2Codons - Method
translateAA2Codons(
    code::BioSequences.GeneticCode,
    S::Union{Type{BioSymbols.DNA}, Type{BioSymbols.RNA}}
) -> Union{Dict{BioSymbols.AminoAcid, Set{BioSequences.LongDNA}}, Dict{BioSymbols.AminoAcid, Set{BioSequences.LongRNA}}}

Dictionary which maps an amino acid to a set of corresponding codons for a given genetic code. The codons can either be BioSequences.DNA or BioSequences.RNA as specified in parameter S. See BioSequences.ncbi_trans_table for a list of all known genetic codes.

Examples

The Vertebrate Mitochondrial Code (index 2) is used. This code has four stop codons.

using GCATBase, BioSequences
a2c = translateAA2Codons(ncbi_trans_table[2], DNA)
sort([a2c[AA_Term]...]) # Stop signal (set is sorted)
# output
4-element Vector{LongSequence{DNAAlphabet{4}}}:
 AGA
 AGG
 TAA
 TAG

GCATBase.translateCodon2AA - Method
translateCodon2AA(
    code::BioSequences.GeneticCode,
    S::Union{Type{BioSymbols.DNA}, Type{BioSymbols.RNA}}
) -> Union{Dict{BioSequences.LongDNA, BioSymbols.AminoAcid}, Dict{BioSequences.LongRNA, BioSymbols.AminoAcid}}

Dictionary which maps a codon to its encoded amino acid for a given genetic code. See BioSequences.ncbi_trans_table for a list of all known genetic codes.

Examples

The Vertebrate Mitochondrial Code (index 2) is used. This code has four stop codons.

using GCATBase, BioSequences
c2a = translateCodon2AA(ncbi_trans_table[2], DNA)
c2a[dna"ATG"]
# output
AA_M