
The codon_optimize function supports three optimization methods: replacing codons with synonymous optimal codons (“naive”), an approach modeled on IDT’s codon optimization tool (“IDT”), and the CodonTransformer deep learning model (“CodonTransformer”). It can also run SpliceAI to flag potential splice sites in the optimized sequences. Before using the CodonTransformer method or SpliceAI in cubar, set up an environment from the command line with conda or mamba:

conda create -n cubar_env python=3.12 r-base blas=*=netlib r-reticulate
conda activate cubar_env
# install CodonTransformer and SpliceAI
pip install CodonTransformer tensorflow spliceai

The default “naive” method replaces each non-optimal codon with a synonymous optimal codon from the same codon family or subfamily, if one is available.

library(cubar)

seq <- 'ATGCTACGA'
cf_all <- count_codons(yeast_cds)
#> Loading required namespace: Biostrings
optimal_codons <- est_optimal_codons(cf_all)
seq_opt <- codon_optimize(seq, optimal_codons)
print(seq_opt)
#> 9-letter DNAString object
#> seq: ATGCTACGT
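The substitution rule described above can be sketched in base R. This is a minimal illustration, not cubar's implementation: the lookup tables `codon_to_aa` and `optimal` and the helper `naive_optimize` are hypothetical, the optimal codons are invented rather than estimated from data, and families are keyed by amino acid only, so subfamily splitting is not modeled.

```r
# Hypothetical optimal-codon lookup: amino acid -> preferred codon.
# These choices are invented for illustration, not estimated from data.
optimal <- c(M = "ATG", L = "TTG", R = "AGA")

# Minimal codon -> amino acid table covering only the codons used here.
codon_to_aa <- c(ATG = "M", CTA = "L", TTG = "L", CGA = "R", AGA = "R")

naive_optimize <- function(dna, codon_to_aa, optimal) {
  # Split the sequence into consecutive triplets.
  starts <- seq(1, nchar(dna), by = 3)
  codons <- substring(dna, starts, starts + 2)
  aa <- codon_to_aa[codons]
  # Replace a codon only when an optimal codon is known for its amino acid;
  # codons without an entry are left unchanged.
  has_opt <- !is.na(aa) & aa %in% names(optimal)
  codons[has_opt] <- optimal[aa[has_opt]]
  paste(codons, collapse = "")
}

naive_result <- naive_optimize("ATGCTACGA", codon_to_aa, optimal)
print(naive_result)
```

Because the toy table differs from the codons estimated from yeast_cds, the result is not expected to match the codon_optimize output shown above.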

The “IDT” method follows the principle behind the codon optimization tool of Integrated DNA Technologies. It samples codons at random in proportion to their frequency within each family or subfamily, excluding rare codons whose frequency is below 10%.

seq_opt <- codon_optimize(seq, cf = cf_all, method = "IDT")
print(seq_opt)
#> 9-letter DNAString object
#> seq: ATGCTGCGA
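The sampling idea can be sketched for a single codon family as follows. The counts in `leu_counts` are invented, `sample_codon` is a hypothetical helper, and renormalizing the frequencies after dropping rare codons is an assumption about how such a filter might work rather than a description of cubar's internals.

```r
# Hypothetical codon counts for one family (leucine); the numbers are invented.
leu_counts <- c(TTA = 30, TTG = 40, CTT = 12, CTC = 5, CTA = 8, CTG = 5)

sample_codon <- function(counts, min_freq = 0.1, n = 1) {
  freq <- counts / sum(counts)
  # Drop rare codons whose within-family frequency falls below the cutoff.
  kept <- freq[freq >= min_freq]
  # Sample among the remaining codons in proportion to their frequency
  # (assumption: frequencies are renormalized after filtering).
  sample(names(kept), size = n, replace = TRUE, prob = kept / sum(kept))
}

set.seed(1)
draws <- sample_codon(leu_counts, n = 1000)
print(table(draws))
```

With these counts only TTA, TTG and CTT pass the 10% cutoff, so CTC, CTA and CTG are never drawn. Since the choice is random, repeated calls to the real function can return different sequences.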

The “CodonTransformer” method wraps the machine learning tool CodonTransformer: it calls CodonTransformer’s predict_rna_Sequence function and uses its pretrained model to optimize sequences.

seq_opt <- codon_optimize(seq, method = "CodonTransformer", organism = "Saccharomyces cerevisiae")
print(seq_opt)

cubar can generate several optimized sequences at once through the num_sequences argument, which is available for the “IDT” and “CodonTransformer” methods. When num_sequences is greater than 1, duplicate sequences are collapsed into a single copy, so the number of sequences returned may be smaller than the number requested.

seqs_opt <- codon_optimize(seq, cf = cf_all, method = "IDT", num_sequences = 10)
print(seqs_opt)
seqs_opt <- codon_optimize(seq, method = "CodonTransformer", organism = "Saccharomyces cerevisiae",
                           num_sequences = 10, deterministic = FALSE, temperature = 0.4)
print(seqs_opt)
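The effect of collapsing duplicates is easy to see on a short input, where few distinct synonymous variants exist. The vector `candidates` below is a made-up stand-in for five sampled sequences; it is not output from codon_optimize.

```r
# Hypothetical batch of five sampled sequences, two of which repeat.
candidates <- c("ATGCTGCGA", "ATGTTGAGA", "ATGCTGCGA", "ATGTTGCGT", "ATGTTGAGA")

# Keep one copy of each distinct sequence, preserving first-seen order.
distinct <- unique(candidates)

# Five sequences were requested, but only three distinct ones remain.
n_requested <- length(candidates)
n_returned <- length(distinct)
print(c(requested = n_requested, returned = n_returned))
```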

In addition, cubar integrates the deep learning tool SpliceAI, enabled with the spliceai argument, to identify potential splice sites. If the non-splice-site probability is greater than 0.5 for every base, the sequence is considered free of potential splice junctions and Possible_splice_junction in the output is FALSE; otherwise it is TRUE.

seqs_opt <- codon_optimize(seq, cf = cf_all, method = "IDT", num_sequences = 10, spliceai = TRUE)
print(seqs_opt)
seq_opt <- codon_optimize(seq, method = "CodonTransformer", organism = "Saccharomyces cerevisiae", spliceai = TRUE)
print(seq_opt)
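The decision rule for Possible_splice_junction can be sketched on a small probability matrix. The numbers, the column names, and the helper `has_possible_junction` are hypothetical; the layout of one row per base with non-splice, acceptor and donor columns is an assumption made for this illustration.

```r
# Hypothetical per-base probabilities: one row per base, columns are
# non-splice, acceptor, donor (invented numbers; each row sums to 1).
probs <- rbind(
  c(0.98, 0.01, 0.01),
  c(0.40, 0.55, 0.05),
  c(0.95, 0.02, 0.03)
)
colnames(probs) <- c("non_splice", "acceptor", "donor")

has_possible_junction <- function(probs, cutoff = 0.5) {
  # TRUE if any base fails the "non-splice probability > cutoff" test.
  any(probs[, "non_splice"] <= cutoff)
}

# The second base has a non-splice probability of 0.40, so the flag is raised.
flag <- has_possible_junction(probs)

# Without that base, every non-splice probability exceeds 0.5.
clean <- has_possible_junction(probs[c(1, 3), , drop = FALSE])
print(c(flag = flag, clean = clean))
```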