est_optimal_codons identifies optimal codons within each codon family or amino acid group using binomial regression. Optimal codons are those whose usage correlates positively with gene expression or, when no gene-level score is supplied, negatively with the effective number of codons (ENC; a lower ENC indicates stronger codon usage bias), suggesting they are preferred for efficient translation.
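
To make the idea of binomial regression on codon usage concrete, here is a minimal base R sketch for one codon in a two-codon family. It is an illustration on simulated data, not the package's internal code: the variable names, the simulated counts, and the exact model formula are all assumptions.

```r
# Simulated data for a single two-codon family (illustrative only)
set.seed(1)
n_genes <- 500
gene_score <- rnorm(n_genes)              # stand-in for e.g. log expression
family_total <- rpois(n_genes, 30) + 1    # codons of this family per gene

# Simulate a codon whose usage probability rises with the gene score
p_true <- plogis(0.2 + 0.5 * gene_score)
codon_count <- rbinom(n_genes, family_total, p_true)

# Binomial regression: successes are uses of the focal codon,
# failures are uses of the other codons in the same family
fit <- glm(cbind(codon_count, family_total - codon_count) ~ gene_score,
           family = binomial)

coef_score <- coef(summary(fit))["gene_score", "Estimate"]
p_value <- coef(summary(fit))["gene_score", "Pr(>|z|)"]
```

A positive and significant coefficient in such a model indicates that the codon is used more often in genes with higher scores, which is the pattern the description associates with optimal codons.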

Usage

est_optimal_codons(
  cf,
  codon_table = get_codon_table(),
  level = "subfam",
  gene_score = NULL,
  fdr = 0.001
)

Arguments

cf

A matrix of codon frequencies as calculated by count_codons(). Rows represent sequences and columns represent codons.

codon_table

A codon table defining the genetic code, derived from get_codon_table() or create_codon_table().

level

Character string specifying the analysis level: "subfam" (default, analyzes codon subfamilies) or "amino_acid" (analyzes at amino acid level).

gene_score

A numeric vector of gene-level scores used to identify optimal codons. Length must equal the number of rows in cf. Common choices include:

  • Gene expression levels (RPKM, TPM, FPKM) - optionally log-transformed

  • Protein abundance measurements

  • Custom gene importance scores

If NULL (the default), the negative of each gene's ENC value is used, so that genes with stronger codon usage bias (lower ENC) receive higher scores.

fdr

Numeric value specifying the false discovery rate threshold for determining statistical significance of codon optimality (default: 0.001).
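
Because gene_score must have one entry per row of cf, it can help to align and transform expression values before calling the function. The sketch below shows one way to do that in base R. The toy matrix, the TPM values, and the assumption that the rows of cf carry sequence names are all hypothetical and used only for illustration.

```r
# Toy stand-in for a matrix from count_codons(): 3 sequences x 2 codons.
# Assumption: row names hold the sequence (gene) identifiers.
cf <- matrix(c(10, 4, 7, 2, 9, 5), nrow = 3,
             dimnames = list(c("geneA", "geneB", "geneC"), c("GCT", "GCC")))

# Hypothetical TPM values, named by gene and in a different order than cf
tpm <- c(geneC = 0, geneA = 120.5, geneB = 8.2)

# Reorder to match the rows of cf, then log-transform with a pseudocount
gene_score <- log2(tpm[rownames(cf)] + 1)

# Guard against length mismatches and genes missing from the expression data
stopifnot(length(gene_score) == nrow(cf), !anyNA(gene_score))

# The aligned vector could then be supplied as:
# est_optimal_codons(cf, gene_score = gene_score)
```

The pseudocount of 1 keeps genes with zero expression finite after the log transform; other choices are equally valid.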

Value

A data.table containing the input codon table with additional columns reporting the regression results. The columns are the single-letter amino acid abbreviation (aa_code), the three-letter abbreviation (amino_acid), the codon (codon), the codon subfamily (subfam), the regression coefficient (coef), the regression P-value (pvalue), the Benjamini-Hochberg corrected Q-value (qvalue), and a logical flag indicating whether the codon is optimal (optimal).
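
The relationship between the pvalue, qvalue, and optimal columns can be sketched with base R's p.adjust(). The numbers below are made up, and the decision rule (positive coefficient and Q-value below fdr) is an assumption inferred from the column descriptions, not a statement of the function's exact logic.

```r
# Made-up regression results for four codons (illustrative only)
coefs    <- c(0.12, 0.03, 0.05, -0.01)
p_values <- c(1e-10, 0.0004, 0.02, 0.5)

# Benjamini-Hochberg correction of the P-values
q_values <- p.adjust(p_values, method = "BH")

# Assumed rule: a codon is flagged when its coefficient is positive
# and its Q-value falls below the FDR threshold
fdr <- 0.001
optimal <- coefs > 0 & q_values < fdr
```

With these inputs the first two codons pass the threshold, while the third has a positive coefficient but a Q-value above 0.001.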

References

Presnyak V, Alhusaini N, Chen YH, Martin S, Morris N, Kline N, Olson S, Weinberg D, Baker KE, Graveley BR, et al. 2015. Codon optimality is a major determinant of mRNA stability. Cell 160:1111-1124.

Examples

# perform binomial regression for optimal codon estimation
cf_all <- count_codons(yeast_cds)
codons_opt <- est_optimal_codons(cf_all)
codons_opt <- codons_opt[optimal == TRUE]
codons_opt
#>     aa_code amino_acid  codon subfam       coef        pvalue        qvalue
#>      <char>     <char> <char> <char>      <num>         <num>         <num>
#>  1:       A        Ala    GCT Ala_GC 0.08454964  0.000000e+00  0.000000e+00
#>  2:       A        Ala    GCC Ala_GC 0.01621930  2.127082e-32  2.359128e-32
#>  3:       R        Arg    AGA Arg_AG 0.12902657  0.000000e+00  0.000000e+00
#>  4:       R        Arg    CGT Arg_CG 0.20090361  0.000000e+00  0.000000e+00
#>  5:       N        Asn    AAC Asn_AA 0.04208269 8.024342e-185 1.223712e-184
#>  6:       D        Asp    GAC Asp_GA 0.01574961  3.398292e-28  3.636768e-28
#>  7:       C        Cys    TGT Cys_TG 0.09889375 4.697718e-150 6.512746e-150
#>  8:       Q        Gln    CAA Gln_CA 0.11196536  0.000000e+00  0.000000e+00
#>  9:       E        Glu    GAA Glu_GA 0.08458541  0.000000e+00  0.000000e+00
#> 10:       G        Gly    GGT Gly_GG 0.16530194  0.000000e+00  0.000000e+00
#> 11:       H        His    CAC His_CA 0.03127977  7.294628e-42  8.240228e-42
#> 12:       I        Ile    ATT Ile_AT 0.03956734 1.625599e-208 2.754487e-208
#> 13:       I        Ile    ATC Ile_AT 0.03975891 1.099697e-188 1.765303e-188
#> 14:       L        Leu    CTT Leu_CT 0.02178829  6.897132e-23  7.253880e-23
#> 15:       L        Leu    CTA Leu_CT 0.05101078 7.732994e-124 1.025462e-123
#> 16:       L        Leu    TTG Leu_TT 0.03514392 7.751784e-158 1.125854e-157
#> 17:       K        Lys    AAG Lys_AA 0.05853116  0.000000e+00  0.000000e+00
#> 18:       F        Phe    TTC Phe_TT 0.05451940 3.720900e-254 7.092965e-254
#> 19:       P        Pro    CCA Pro_CC 0.10328272  0.000000e+00  0.000000e+00
#> 20:       S        Ser    AGT Ser_AG 0.02452355  2.109510e-19  2.144669e-19
#> 21:       S        Ser    TCT Ser_TC 0.06070916  0.000000e+00  0.000000e+00
#> 22:       S        Ser    TCC Ser_TC 0.02605206  1.324126e-70  1.583759e-70
#> 23:       T        Thr    ACT Thr_AC 0.04838553 2.506592e-292 5.272486e-292
#> 24:       T        Thr    ACC Thr_AC 0.04684950 2.157821e-230 3.760774e-230
#> 25:       Y        Tyr    TAC Tyr_TA 0.04206093 1.244976e-121 1.582157e-121
#> 26:       V        Val    GTT Val_GT 0.05787243  0.000000e+00  0.000000e+00
#> 27:       V        Val    GTC Val_GT 0.04995247 1.700719e-281 3.458128e-281
#>     aa_code amino_acid  codon subfam       coef        pvalue        qvalue
#>     optimal
#>      <lgcl>
#>  1:    TRUE
#>  2:    TRUE
#>  3:    TRUE
#>  4:    TRUE
#>  5:    TRUE
#>  6:    TRUE
#>  7:    TRUE
#>  8:    TRUE
#>  9:    TRUE
#> 10:    TRUE
#> 11:    TRUE
#> 12:    TRUE
#> 13:    TRUE
#> 14:    TRUE
#> 15:    TRUE
#> 16:    TRUE
#> 17:    TRUE
#> 18:    TRUE
#> 19:    TRUE
#> 20:    TRUE
#> 21:    TRUE
#> 22:    TRUE
#> 23:    TRUE
#> 24:    TRUE
#> 25:    TRUE
#> 26:    TRUE
#> 27:    TRUE
#>     optimal