Non-standard genetic code
Source:vignettes/non_standard_genetic_code.Rmd
non_standard_genetic_code.Rmd
Introduction
cubar
supports the codon usage bias analysis of
sequences utilizing non-standard genetic codes, such as those found in
mitochondrial or chloroplast protein-coding sequences. To illustrate its
application, we demonstrate the calculation of effective number of
codons (ENC) for human mitochondrial CDS sequences.
Main analysis
First, Load sequences and get the corresponding codon table.
human_mt
#> DNAStringSet object of length 13:
#> width seq names
#> [1] 681 ATGAACGAAAATCTGTTCGCTTC...CTACCTGCACGACAACACATAA MT-ATP6
#> [2] 346 ATAAACTTCGCCTTAATTTTAAT...AAAGGATTAGACTGAACCGAAT MT-ND3
#> [3] 956 ATACCCATGGCCAACCTCCTACT...CCAGCATTCCCCCTCAAACCTA MT-ND1
#> [4] 207 ATGCCCCAACTAAATACTACCGT...TTCATTGCCCCCACAATCCTAG MT-ATP8
#> [5] 1141 ATGACCCCAATACGCAAAACTAA...AACAAAATACTCAAATGGGCCT MT-CYB
#> ... ... ...
#> [9] 1042 ATTAATCCCCTGGCCCAACCCGT...CCTTTTATACTAATAATCTTAT MT-ND2
#> [10] 525 ATGATGTATGCTTTGTTTCTGTT...TGAGATTGCTCGGGGGAATAGG MT-ND6
#> [11] 1542 ATGTTCGCCGACCGTTGACTATT...ACCCGTATACATAAAATCTAGA MT-CO1
#> [12] 684 ATGGCACATGCAGCGCAAGTAGG...AGGGCCCGTATTTACCCTATAG MT-CO2
#> [13] 784 ATGACCCACCAATCACATGCCTA...TCCATCTATTGATGAGGGTCTT MT-CO3
ctab <- get_codon_table(gcid = '2')
head(ctab)
#> aa_code amino_acid codon subfam
#> <char> <char> <char> <char>
#> 1: F Phe TTT Phe_TT
#> 2: F Phe TTC Phe_TT
#> 3: L Leu TTA Leu_TT
#> 4: L Leu TTG Leu_TT
#> 5: S Ser TCT Ser_TC
#> 6: S Ser TCC Ser_TC
We do not check CDS length and stop codons as incomplete stop codons are prevalent among MT CDSs.
human_mt_qc <- check_cds(
human_mt,
codon_table = ctab,
check_stop = FALSE,
rm_stop = FALSE,
check_len = FALSE,
start_codons = c('ATG', 'ATA', 'ATT'))
human_mt_qc
#> DNAStringSet object of length 13:
#> width seq names
#> [1] 678 AACGAAAATCTGTTCGCTTCATT...CTACCTGCACGACAACACATAA MT-ATP6
#> [2] 343 AACTTCGCCTTAATTTTAATAAT...AAAGGATTAGACTGAACCGAAT MT-ND3
#> [3] 953 CCCATGGCCAACCTCCTACTCCT...CCAGCATTCCCCCTCAAACCTA MT-ND1
#> [4] 204 CCCCAACTAAATACTACCGTATG...TTCATTGCCCCCACAATCCTAG MT-ATP8
#> [5] 1138 ACCCCAATACGCAAAACTAACCC...AACAAAATACTCAAATGGGCCT MT-CYB
#> ... ... ...
#> [9] 1039 AATCCCCTGGCCCAACCCGTCAT...CCTTTTATACTAATAATCTTAT MT-ND2
#> [10] 522 ATGTATGCTTTGTTTCTGTTGAG...TGAGATTGCTCGGGGGAATAGG MT-ND6
#> [11] 1539 TTCGCCGACCGTTGACTATTCTC...ACCCGTATACATAAAATCTAGA MT-CO1
#> [12] 681 GCACATGCAGCGCAAGTAGGTCT...AGGGCCCGTATTTACCCTATAG MT-CO2
#> [13] 781 ACCCACCAATCACATGCCTATCA...TCCATCTATTGATGAGGGTCTT MT-CO3
As stop codons are present, now we manually remove them.
len_trim <- width(human_mt_qc) %% 3
len_trim <- ifelse(len_trim == 0, 3, len_trim)
human_mt_qc <- subseq(human_mt_qc, start = 1, end = width(human_mt_qc) - len_trim)
human_mt_qc
#> DNAStringSet object of length 13:
#> width seq names
#> [1] 675 AACGAAAATCTGTTCGCTTCATT...CCTCTACCTGCACGACAACACA MT-ATP6
#> [2] 342 AACTTCGCCTTAATTTTAATAAT...AAAAGGATTAGACTGAACCGAA MT-ND3
#> [3] 951 CCCATGGCCAACCTCCTACTCCT...CTCCAGCATTCCCCCTCAAACC MT-ND1
#> [4] 201 CCCCAACTAAATACTACCGTATG...TCATTCATTGCCCCCACAATCC MT-ATP8
#> [5] 1137 ACCCCAATACGCAAAACTAACCC...AAACAAAATACTCAAATGGGCC MT-CYB
#> ... ... ...
#> [9] 1038 AATCCCCTGGCCCAACCCGTCAT...CCCTTTTATACTAATAATCTTA MT-ND2
#> [10] 519 ATGTATGCTTTGTTTCTGTTGAG...AATTGAGATTGCTCGGGGGAAT MT-ND6
#> [11] 1536 TTCGCCGACCGTTGACTATTCTC...AGAACCCGTATACATAAAATCT MT-CO1
#> [12] 678 GCACATGCAGCGCAAGTAGGTCT...AATAGGGCCCGTATTTACCCTA MT-CO2
#> [13] 780 ACCCACCAATCACATGCCTATCA...CTCCATCTATTGATGAGGGTCT MT-CO3
Finally, we calculate codon frequencies and ENC.
# calculate codon frequency
mt_cf <- count_codons(human_mt_qc)
# calculate ENC
get_enc(mt_cf, codon_table = ctab)
#> MT-ATP6 MT-ND3 MT-ND1 MT-ATP8 MT-CYB MT-ND4L MT-ND4 MT-ND5
#> 46.44871 44.93068 42.42264 50.54562 42.97931 47.29975 42.65228 43.53521
#> MT-ND2 MT-ND6 MT-CO1 MT-CO2 MT-CO3
#> 45.18387 45.56454 44.83277 49.19525 47.07683
It is important to note that the check_cds
function and
stop codon trimming are optional steps, and you can implement your own
quality control procedures. However, it is crucial to ensure that your
input sequences are suitable for codon usage bias analysis. Failure to
do so may lead to ambiguous and misleading results from problematic
sequences.