Comprehensive Codon Usage Bias Analysis in R
Overview
Codon usage bias refers to the non-uniform usage of synonymous codons (codons that encode the same amino acid) across different organisms, genes, and functional categories. cubar is a comprehensive R package for analyzing codon usage bias in coding sequences. It provides a unified framework for calculating established codon usage metrics, conducting sliding-window analyses or differential usage analyses, and optimizing sequences for heterologous expression.
Features
Codon-Level Analysis
- RSCU calculation: Relative synonymous codon usage analysis
- Amino acid usage: Frequency of each amino acid in sequences
- Codon weights: Calculate weights based on gene expression, tRNA availability, and mRNA stability
- Optimal codon inference: Machine learning-based identification of optimal codons
- Codon-anticodon visualization: Visualization of codon-tRNA pairing relationships
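
To make the RSCU idea concrete, here is a minimal base R sketch that does not use cubar itself. It assumes a made-up toy sequence and looks only at the four alanine codons. RSCU is computed as the observed count of a codon divided by the mean count across its synonymous family, so a value of 1 means no bias.

```r
# Toy coding sequence (hypothetical, for illustration only)
cds <- "ATGGCTGCTGCCGCAGCTTAA"

# Split the sequence into non-overlapping codons
starts <- seq(1, nchar(cds) - 2, by = 3)
codons <- substring(cds, starts, starts + 2)

# Count the four synonymous alanine codons, keeping zero counts
ala_codons <- c("GCT", "GCC", "GCA", "GCG")
ala_counts <- as.numeric(table(factor(codons, levels = ala_codons)))
names(ala_counts) <- ala_codons

# RSCU = observed count / mean count across the synonymous family
rscu <- ala_counts / mean(ala_counts)
print(rscu)
```

In this toy case GCT is used three times out of five alanine codons, so its RSCU is well above 1, while the unused GCG gets 0. The values for a family always sum to the number of synonymous codons.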
 
Gene-Level Metrics
- Codon frequency tabulation: Count codon occurrences across sequences
- CAI (Codon Adaptation Index): Measure similarity to highly expressed genes
- ENC (Effective Number of Codons): Assess codon usage bias strength
- Fop (Fraction of Optimal codons): Calculate proportion of optimal codons
- tAI (tRNA Adaptation Index): Match codon usage to tRNA availability
- CSCg (Codon Stabilization Coefficients): Quantify mRNA stability effects
- Dp (Deviation from Proportionality): Analyze virus-host codon usage relationships
- GC content metrics: Overall GC, GC3s (3rd codon positions), GC4d (4-fold degenerate sites)
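
As a rough illustration of how a gene-level metric is built from codon-level weights, the base R sketch below computes a CAI-style score as the geometric mean of per-codon weights, plus a plain third-position GC fraction. The weights and the codon vector are hypothetical, and this is not cubar's implementation. In particular, the GC fraction here is computed over all third positions, which is simpler than the GC3s metric listed above.

```r
# Hypothetical relative adaptiveness weights (w) for alanine codons;
# in practice these would be derived from highly expressed genes
w <- c(GCT = 1.0, GCC = 0.5, GCA = 0.25, GCG = 0.1)

# Codons of a toy gene (hypothetical)
gene_codons <- c("GCT", "GCT", "GCC", "GCA")

# CAI-style score: geometric mean of the weights of the gene's codons
cai <- exp(mean(log(w[gene_codons])))

# Simple GC fraction at the third codon position
third_base <- substr(gene_codons, 3, 3)
gc3 <- mean(third_base %in% c("G", "C"))

print(c(CAI = cai, GC3 = gc3))
```

Working on the log scale avoids numerical underflow when a gene has thousands of codons, which is why the geometric mean is written as `exp(mean(log(...)))` rather than as a product.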
 
Why Choose cubar?
- High Performance: Process large datasets (>100,000 sequences) efficiently using optimized Biostrings and data.table backends
- Flexible Genetic Codes: Support for all NCBI genetic codes plus custom genetic code tables
- R Ecosystem Integration: Seamlessly integrate with other bioinformatics and data analysis packages
- Comprehensive Documentation: Extensive tutorials, examples, and theoretical background
- Research Ready: Implements established metrics with proper citations and validation
 
Installation
Development Version
Install the latest development version from GitHub:
# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
}
# Install cubar from GitHub
devtools::install_github("mt1022/cubar", dependencies = TRUE)
Dependencies
System Requirements:
- R (≥ 4.1.0)
Required Packages:
- Biostrings (≥ 2.60.0) - Bioconductor package for sequence manipulation
- IRanges (≥ 2.34.0) - Bioconductor infrastructure for range operations
- data.table (≥ 1.14.0) - High-performance data manipulation
- ggplot2 (≥ 3.3.5) - Data visualization
- rlang (≥ 0.4.11) - Language tools
Note: Bioconductor packages will be installed automatically, but you may need to update your R installation if you encounter compatibility issues.
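
If you want to check whether an existing R library already meets the minimum versions listed above, a small base R sketch like the following may help. `check_min_versions` is a hypothetical helper written for this example and is not part of cubar. The version numbers are copied from the list above.

```r
# Minimum versions as listed in the Dependencies section
required <- c(
  Biostrings = "2.60.0",
  IRanges    = "2.34.0",
  data.table = "1.14.0",
  ggplot2    = "3.3.5",
  rlang      = "0.4.11"
)

# Hypothetical helper: TRUE if a package is installed and recent enough
check_min_versions <- function(required) {
  vapply(names(required), function(pkg) {
    requireNamespace(pkg, quietly = TRUE) &&
      utils::packageVersion(pkg) >= required[[pkg]]
  }, logical(1))
}

status <- check_min_versions(required)
print(status)
```

Packages reported as `FALSE` are either missing or too old. CRAN packages can be updated with `install.packages()`, and Bioconductor packages with `BiocManager::install()`.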
Documentation & Tutorials
Complete documentation is available within R (?function_name) and on our package website.
Getting Started
- Introduction to cubar - Basic usage and core functionality
- Non-standard Genetic Codes - Working with alternative genetic codes
- Codon Optimization - Sequence optimization strategies

Advanced Topics
- Mathematical Foundations - Detailed theory behind the metrics
- Function Reference - Complete function documentation

Example Workflow
Here's a toy example demonstrating key functionality:
library(cubar)
library(ggplot2)
# 1. Load and quality-check sequences
data(yeast_cds)
clean_cds <- check_cds(yeast_cds)
# 2. Calculate codon frequencies
codon_freq <- count_codons(clean_cds)
# 3. Calculate multiple metrics
enc <- get_enc(codon_freq)           # Effective number of codons
gc3s <- get_gc3s(codon_freq)         # GC content at 3rd positions
# 4. Calculate CAI with RSCU of highly expressed genes
data(yeast_exp)
yeast_exp <- yeast_exp[yeast_exp$gene_id %in% rownames(codon_freq), ]
high_expr <- head(yeast_exp[order(-yeast_exp$fpkm), ], 500)
rscu_high <- est_rscu(codon_freq[high_expr$gene_id, ])
cai <- get_cai(codon_freq, rscu_high)
# 5. Visualize results
df <- data.frame(ENC = enc, CAI = cai, GC3s = gc3s)
ggplot(df, aes(color = GC3s, x = ENC, y = CAI)) + 
  geom_point(alpha = 0.6) + 
  scale_color_viridis_c() +
  labs(title = "Codon Usage Bias Relationships",
       x = "Effective Number of Codons", y = "Codon Adaptation Index")
Getting Help
- GitHub Issues: Report bugs, request features, or ask questions
- Documentation: Check function help (?function_name) and online docs
Related Packages
For complementary analysis, consider these R packages:
- Biostrings - Sequence input/output and manipulation
 - Peptides - Peptide and protein property calculations
 
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- The R and Bioconductor communities for excellent foundational packages
 - Contributors and users who have provided feedback and improvements
 - GitHub Education for providing free access to development tools
 - GitHub Copilot was used to suggest code snippets during development
 
Citation
If you use cubar in your research, please cite:
Mengyue Liu, Bu Zi, Hebin Zhang, Hong Zhang, cubar: a versatile package for codon usage bias analysis in R, Genetics, 2025, iyaf191, https://doi.org/10.1093/genetics/iyaf191
Please also cite the original studies associated with any codon usage metrics or third-party software you use. You can find the relevant references in the documentation of the corresponding functions (for example, type ?cubar::get_enc in the R console and check the "References" section in the help page).