# Theories behind cubar

#### Bu Zi (draft), Hong Zhang (revise)

Source:`vignettes/articles/theory.Rmd`

`theory.Rmd`

#### The Codon Adaptation Index (CAI)

CAI measures the similarity of codon usage between a coding sequence (CDS) and a set of highly expressed genes (Sharp and Li 1987). To quantify the relative synonymous codon usage (RSCU) values among the set of highly expressed genes, the observed frequency of each codon is divided by the frequency expected under the assumption of equal usage of the synonymous codons for an amino acid (referred to as codon family hereafter):

\[ \text{RSCU}_{ij} = \frac{X_{ij}}{\frac{1}{n_i}\sum_{k=1}^{n_i}{X_{ik}}}=\frac{X_{ij}}{\frac{1}{n_i}X_i} \]

where \(X_{ij}\) is the number of occurrences of the \(j\)th codon for the \(i\)th codon family in CDSs of highly expressed genes, and \(n_i\) is the total number of alternative codons for the \(i\)th codon family. \(X_i\) is the number of occurrence of codons in the \(i\)th codon family. The relative adaptiveness of a codon, \(w\), is defined as the RSCU of that codon divided by the maximum RSCU of that codon family:

\[ w_{ij}=\frac{\text{RSCU}_{ij}}{\max_{k=1,\dots,n_i}{\text{RSCU}_{ik}}} \]

The CAI for a CDS is then calculated as the geometric mean of the relative adaptiveness of codons used in the CDS of that gene:

\[ \text{CAI} = \bigg(\prod_{k=1}^L w_k\bigg)^{\frac{1}{L}} =\text{exp}\bigg(\frac{1}{L}\sum_{k=1}^L \ln w_k\bigg) = \text{exp}\bigg( \frac{\sum_i\sum_j {X_{ij}\ln w_{ij}}}{\sum_i\sum_j X_{ij}} \bigg) \]

where \(L\) is the total number of codons in CDS.

#### The effective number of codons (ENC)

ENC reflects the unequal usage of codons in the CDS of a gene (Wright 1990). The lower the ENC, the larger the overall bias in codon usage. The original implementation of ENC calculates the homozygosity of codon usage in codon family \(i\) as follows:

\[ F_i^{\text{ori}} = \frac{X_i \sum_{j=1}^{n_i}{(\frac{X_{ij}}{X_i})^2} - 1}{X_i - 1} \]

where \(X_i\) is the number of occurrence of codons in the \(i\)th codon family as used above. In cubar, codon family homozygosity is calculated with an improved implementation that is more robust to bias due to small \(n_i\) (Sun, et al. 2013):

\[ F_i = \sum_{j=1}^{n_i}(\frac{X_{ij} + 1}{X_i + n_i})^2 \]

Then we can calculate the final ENC of the CDS of a gene as follows:

\[ \text{ENC}=K_1 + K_2\frac{\sum_{i=1}^{K_2}{X_i}}{\sum_{i=1}^{K_2}{X_iF_i}} + K_3\frac{\sum_{i=1}^{K_3}{X_i}}{\sum_{i=1}^{K_3}{X_iF_i}} + K_4\frac{\sum_{i=1}^{K_4}{X_i}}{\sum_{i=1}^{K_4}{X_iF_i}} \]

where \(K_m\) denotes the number of codon families with \(m\) synonymous codons:

\[ K_m=\sum_{i}{\delta(n_i-m)} \]

It should be noted that throughout cubar, codon families with more than four synomymous codons are divided into different subfamilies based on the first two nucleotides of codons.

#### The fraction of optimal codons (\(F_{op}\))

\(F_{op}\) measures the fraction of optimal codons in the CDS of a gene given a list of optimal codons (Ikemura 1981). It is calculated as follows:

\[ F_{\text{op}}=\frac{\sum_{k=1}^L{I(k\text{-th codon is optimal})}}{L} \]

where \(I\) is the indicator function. In case the optimal codons are unknown, cubar automatically determines the optimal codons in each codon family based on the rationale that optimal codons tend to be used more frequently in genes showing stronger codon usage bias. Specifically, as the number of occurrence of codon \(j\) among codon family \(i\) in gene \(k\) follows a Binomial distribution:

\[ X_{ij}^k \sim \text{Binomial}(X_i^k, p_k) \]

the influence of codon usage bias on the tendency to use codon \(j\) can be estimated with binomial regression using the following link function:

\[ \ln{\frac{p}{1-p}} \sim \beta_0 + \beta_1 \cdot \text{ENC} + \epsilon \]

If codon \(j\) is more likely to be
used in genes with higher overall codon bias (i.e., smaller ENC), then
the regression coefficient should be negative and significantly differs
from zero. cubar implements the binomial regression using the
`glm`

function with `binomial`

family in R.
Optimal codons are determined with a false discovery rate of 0.001 after
multiple testing correction with the Benjamini-Hochberg procedure
(Benjamini and Hochberg 1995) by default.

#### tRNA Adaptation Index (tAI)

tAI quantifies how much the usage of codons in CDS of a gene resembles the abundance of tRNAs (dos Reis, et al. 2004), which is often approximated by tRNA gene copy numbers. To determine tAI, the absolute tRNA adaptiveness value \(W_i\) for each codon \(i\) is defined as

\[ W_i = \sum_{j=1}^{t_i}{(1 - s_{ij}) T_{ij}} \]

where \(t_i\) is the number of tRNA isoacceptors recognizing the \(i\)th codon and \(T_i\) is the abundance or gene copy number of the \(j\)th tRNA recognizing this codon. \(s_{ij}\) is a panelty on non-canonical codon–anticodon pairings and differs among different species (Sabi and Tuller 2014). Cubar uses average \(s_{ij}\) values for eukaryotes (Sabi and Tuller 2014) by default. Absolute adaptiveness values are then normalized by maximum as follows:

\[ w_i = \begin{cases} \frac{W_i}{\max_{j}W_j},& \text{if } W_i > 0 \\ \frac{\bar W|_{W_j \neq 0}}{\max_j{W_j}}, & \text{if } W_i=0\end{cases} \]

where \(\bar W|_{W_j \neq 0}\) is the geometric mean of non-zero absolute adaptiveness values. Then tAI of codons in a gene can be calculated as follows, similar to CAI:

\[ \text{TAI} = \bigg(\prod_{k=1}^L w_k\bigg)^{\frac{1}{L}} =\text{exp}\bigg(\frac{1}{L}\sum_{k=1}^L \ln w_k\bigg) = \text{exp}\bigg( \frac{\sum_i\sum_j {X_{ij}\ln w_{ij}}}{\sum_i\sum_j X_{ij}} \bigg) \]

#### The mean codon stabilization coefficients (CSCg)

CSC of a codon is the Pearson correlation coefficient between the frequency of that codon and the mRNA half-lives across different genes (Presnyak, et al. 2015). CSCg is the average codon stabilization coefficient (CSC) of codons in the CDS of a gene (Carneiro, et al. 2019):

\[ \text{CSCg} = \frac{1}{L}\sum_{k=1}^L{\text{CSC}_k}= \frac{\sum_i\sum_j {X_{ij} \text{CSC}_{ij}}}{\sum_i{\sum_j{X_{ij}}}} \]

#### The deviation from proportionality (\(D_p\))

\(D_p\) measures the departure of the codon usage of an exogenous CDS from the tRNA pool of the host organism (Chen, et al. 2020; Chen and Yang 2022). For a codon family \(i\) with \(n_i\) synonymous codons (\(n_i>1\)), the fraction of codon \(j\) among all occurrences of this family in the exogenous CDS is:

\[ Y_{ij} = \frac{X_{ij}}{X_i} \]

The relative tRNA availability of this codon in the codon family in the host organism is calculated as:

\[ R_{ij} = \frac{w_{ij}}{\sum_{i=1}^{n_i}{w_{ij}}} \]

where \(w_{ij}\) can be relative tRNA adaptiveness values as in the calculation of tAI or the RSCU of host protein-coding genes as in the calculation of CAI. The discrepancy between codon proportions of exogenous CDS and host tRNA availability is then calculated as the Euclidean distance:

\[ D_i = \sqrt{\sum_j(Y_{ij} - R_{ij})^2} \]

Finally, \(D_p\) is calculated as the geometric mean of distances for all codon families with at least two synonymous codons:

\[ D_p = (\prod_{i=1}^K{D_i})^{\frac{1}{K}} = \exp (\frac{1}{K}\sum_{i=1}^K{\ln D_i}) \]

where \(K=\sum_{i}{I(n_i>1)}\) and \(I\) is the indicator function.

#### References

- Benjamini Y, Hochberg Y. 1995. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological) 57:289-300.
- Carneiro RL, Requião RD, Rossetto S, Domitrovic T, Palhano FL. 2019. Codon stabilization coefficient as a metric to gain insights into mRNA stability and codon bias and their relationships with translation. Nucleic Acids Res 47:2216-2228.
- Chen F, Wu P, Deng S, Zhang H, Hou Y, Hu Z, Zhang J, Chen X, Yang JR. 2020. Dissimilation of synonymous codon usage bias in virus-host coevolution due to translational selection. Nat Ecol Evol 4:589-600.
- Chen F, Yang JR. 2022. Distinct codon usage bias evolutionary patterns between weakly and strongly virulent respiratory viruses. iScience 25:103682.
- dos Reis M, Savva R, Wernisch L. 2004. Solving the riddle of codon usage preferences: a test for translational selection. Nucleic Acids Res 32:5036-5044.
- Ikemura T. 1981. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J Mol Biol 151:389-409.
- Presnyak V, Alhusaini N, Chen YH, Martin S, Morris N, Kline N, Olson S, Weinberg D, Baker KE, Graveley BR, et al. 2015. Codon optimality is a major determinant of mRNA stability. Cell 160:1111-1124.
- Sabi R, Tuller T. 2014. Modelling the efficiency of codon-tRNA interactions based on codon usage bias. DNA Res 21:511-526.
- Sharp PM, Li WH. 1987. The codon Adaptation Index–a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 15:1281-1295.
- Sun X, Yang Q, Xia X. 2013. An improved implementation of effective number of codons (nc). Mol Biol Evol 30:191-196.
- Wright F. 1990. The ‘effective number of codons’ used in a gene. Gene 87:23-29.