Optimize codons — codon_optimize • cubar

codon_optimize takes a coding sequence (without a stop codon) and replaces each codon with the corresponding synonymous optimal codon.
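The idea of codon-by-codon replacement can be sketched in base R. This is only an illustration, not cubar's implementation: the helper naive_optimize and the two lookup vectors are hypothetical, and the "optimal" codons are toy values rather than estimates from real data.

```r
# Toy lookups (assumed values for illustration only):
# the amino acid encoded by each codon, and one "optimal" codon per amino acid.
codon_to_aa <- c(ATG = "M", CTA = "L", CTT = "L", CTG = "L",
                 CGA = "R", CGT = "R", CGC = "R")
optimal_by_aa <- c(M = "ATG", L = "CTA", R = "CGT")

# Hypothetical helper: replace every codon with the optimal synonymous codon.
naive_optimize <- function(dna) {
  n <- nchar(dna)
  stopifnot(n > 0, n %% 3 == 0)
  # Split the sequence into consecutive 3-letter codons
  starts <- seq(1, n, by = 3)
  codons <- substring(dna, starts, starts + 2)
  # Translate each codon, then look up the optimal codon for that amino acid
  aa <- codon_to_aa[codons]
  stopifnot(!anyNA(aa))
  paste(optimal_by_aa[aa], collapse = "")
}

naive_optimize("ATGCTACGA")
#> [1] "ATGCTACGT"
```

Only the last codon changes here, because ATG and CTA are already the toy optimal codons for their amino acids.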

Usage

codon_optimize(
  seq,
  optimal_codons = optimal_codons,
  cf = NULL,
  codon_table = get_codon_table(),
  level = "subfam",
  method = "naive",
  num_sequences = 1,
  organism = NULL,
  envname = "cubar_env",
  attention_type = "original_full",
  deterministic = TRUE,
  temperature = 0.2,
  top_p = 0.95,
  match_protein = FALSE,
  spliceai = FALSE
)

Arguments

seq: DNAString, or an object that can be coerced to a DNAString.
optimal_codons: table of optimal codons as generated by est_optimal_codons.
cf: matrix of codon frequencies as calculated by count_codons(). Required when method is set to "IDT".
codon_table: a table of genetic code derived from get_codon_table or create_codon_table.
level: "subfam" (default) or "amino_acid". The level at which codon usage is optimized. Required when method is set to "naive" or "IDT".
method: "naive" (default), "IDT" or "CodonTransformer". The method used to optimize codons. The "IDT" method derives from the Codon Optimization Tool of Integrated DNA Technologies. The "CodonTransformer" method derives from the tool CodonTransformer.
num_sequences: number of different DNA sequences to generate. Default is 1. Required when method is set to "IDT" or "CodonTransformer". When greater than 1, identical duplicate sequences are retained as a single copy, so the final sequence count may be less than the specified value. With the method "CodonTransformer", this only works when deterministic = FALSE, and each sequence is sampled based on the temperature and top_p parameters.
organism: organism ID (integer) or name (string) (e.g., "Escherichia coli general", must be from ORGANISM2ID in CodonUtils). Required when method is set to "CodonTransformer".
envname: the name of the conda environment to use when method is "CodonTransformer" or when spliceai is TRUE. It must match the user-defined conda environment name (default: cubar_env).
attention_type: type of attention mechanism to use in model - 'block_sparse' for memory efficient or 'original_full' (default) for standard attention. Required when method is set to "CodonTransformer".
deterministic: if TRUE (default), uses deterministic decoding (picks the most likely tokens). If FALSE, samples tokens based on probabilities adjusted by temperature. Required when method is set to "CodonTransformer".
temperature: controls randomness in non-deterministic mode. Lower values (0.2) are conservative and pick high probability tokens, while higher values (0.8) allow more diversity. Must be positive. Required when method is set to "CodonTransformer". Default is 0.2.
top_p: nucleus sampling threshold - only tokens with cumulative probability up to this value are considered. Balances diversity and quality of predictions. Must be between 0 and 1. Required when method is set to "CodonTransformer". Default is 0.95.
match_protein: constrains predictions to only use codons that translate back to the exact input protein sequence. Only recommended when using high temperatures or error prone input proteins (e.g. not starting with methionine or having numerous repetitions). Default is FALSE.
spliceai: TRUE or FALSE (default). Whether to run SpliceAI to predict possible splice junction sites. This option derives from the tool SpliceAI.
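How deterministic, temperature and top_p interact can be sketched in base R. This is a generic illustration of temperature-scaled nucleus sampling, not CodonTransformer's actual code: the helper sample_codon and the scores in logits are hypothetical, and the nucleus is taken here as the smallest set of top-ranked tokens whose cumulative probability reaches top_p.

```r
# Hypothetical helper: pick one token index from a vector of model scores.
sample_codon <- function(logits, temperature = 0.2, top_p = 0.95,
                         deterministic = TRUE) {
  stopifnot(temperature > 0, top_p > 0, top_p <= 1)
  # Deterministic decoding: always take the highest-scoring token
  if (deterministic) return(unname(which.max(logits)))
  # Temperature-scaled softmax (subtract the max for numerical stability)
  z <- logits / temperature
  p <- exp(z - max(z))
  p <- p / sum(p)
  # Nucleus: smallest set of top-ranked tokens whose cumulative mass reaches top_p
  ord <- order(p, decreasing = TRUE)
  n_keep <- which(cumsum(p[ord]) >= top_p)[1]
  if (is.na(n_keep)) n_keep <- length(p)  # guard against rounding when top_p = 1
  keep <- ord[seq_len(n_keep)]
  # A single surviving token needs no sampling (and sample() treats a
  # length-one numeric as 1:x, so it must be avoided here)
  if (length(keep) == 1) return(keep)
  sample(keep, size = 1, prob = p[keep])
}

# Assumed scores for four synonymous arginine codons
logits <- c(CGT = 2.0, CGC = 1.0, CGA = 0.5, CGG = 0.1)

names(logits)[sample_codon(logits)]
#> [1] "CGT"

# Higher temperature flattens the distribution, so repeated draws can differ;
# duplicates are then collapsed, as described for num_sequences.
set.seed(1)
draws <- replicate(10, names(logits)[sample_codon(logits, temperature = 0.8,
                                                  deterministic = FALSE)])
unique(draws)
```

With these scores and the default temperature of 0.2, the top token alone already exceeds the 0.95 threshold, so sampling returns the same codon as deterministic decoding. This is one reason a low temperature can yield fewer distinct sequences than requested.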

Value

a DNAString of the optimized coding sequence when num_sequences is set to 1 and spliceai is FALSE; a DNAStringSet of the optimized coding sequences when num_sequences is greater than 1 and spliceai is FALSE; or a data.table object, with columns for the candidate optimized sequences and columns indicating possible splice sites, when spliceai is TRUE.
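Because the return type depends on the arguments, downstream code may want to branch on the class of the result. The following is a minimal sketch with a hypothetical helper, describe_result, which relies only on the three class names listed above and is not part of cubar.

```r
# Hypothetical helper: report which of the three documented return types
# was received from codon_optimize().
describe_result <- function(res) {
  if (inherits(res, "DNAString")) {
    "single optimized sequence"
  } else if (inherits(res, "DNAStringSet")) {
    sprintf("%d candidate sequences", length(res))
  } else if (inherits(res, "data.table")) {
    sprintf("table with %d rows (SpliceAI output)", nrow(res))
  } else {
    stop("unexpected result class: ", class(res)[1])
  }
}
```

For example, describe_result(codon_optimize(seq, optimal_codons)) would take the first branch, since that call uses the default num_sequences = 1 and spliceai = FALSE.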

References

Fallahpour A, Gureghian V, Filion GJ, Lindner AB, Pandi A. CodonTransformer: a multispecies codon optimizer using context-aware neural networks. Nat Commun. 2025 Apr 3;16(1):3205.

Jaganathan K, Panagiotopoulou SK, McRae JF, et al. Predicting splicing from primary sequence with deep learning. Cell. 2019;176(3):535-548.e24.

Examples

cf_all <- count_codons(yeast_cds)
optimal_codons <- est_optimal_codons(cf_all)
seq <- 'ATGCTACGA'
# method "naive":
codon_optimize(seq, optimal_codons)
#> 9-letter DNAString object
#> seq: ATGCTACGT
# method "IDT":
codon_optimize(seq, cf = cf_all, method = "IDT")
#> 9-letter DNAString object
#> seq: ATGCTACGT
codon_optimize(seq, cf = cf_all, method = "IDT", num_sequences = 10)
#> DNAStringSet object of length 8:
#>     width seq
#> [1]     9 ATGCTTCGT
#> [2]     9 ATGCTACGT
#> [3]     9 ATGCTACGC
#> [4]     9 ATGCTACGG
#> [5]     9 ATGCTACGA
#> [6]     9 ATGCTGCGT
#> [7]     9 ATGCTTCGC
#> [8]     9 ATGCTTCGA
# method "CodonTransformer" (requires the conda environment named by envname, "cubar_env" by default):
seq_opt <- codon_optimize(seq, method = "CodonTransformer", organism = "Saccharomyces cerevisiae")
#> Error in reticulate::use_condaenv(envname): Unable to locate conda environment 'cubar_env'.
print(seq_opt)
#> Error: object 'seq_opt' not found
seqs_opt <- codon_optimize(seq, method = "CodonTransformer", organism = "Saccharomyces cerevisiae",
                           num_sequences = 10, deterministic = FALSE, temperature = 0.4)
#> Error in reticulate::use_condaenv(envname): Unable to locate conda environment 'cubar_env'.
print(seqs_opt)
#> Error: object 'seqs_opt' not found
seqs_opt <- codon_optimize(seq, cf = cf_all, method = "IDT", num_sequences = 10, spliceai = TRUE)
#> Error in reticulate::use_condaenv(envname): Unable to locate conda environment 'cubar_env'.
print(seqs_opt)
#> Error: object 'seqs_opt' not found
seq_opt <- codon_optimize(seq, method = "CodonTransformer", organism = "Saccharomyces cerevisiae",
                          spliceai = TRUE)
#> Error in reticulate::use_condaenv(envname): Unable to locate conda environment 'cubar_env'.
print(seq_opt)
#> Error: object 'seq_opt' not found