Codon Usage Bias Analysis • cubar

Comprehensive Codon Usage Bias Analysis in R

Overview

Codon usage bias refers to the non-uniform usage of synonymous codons (codons that encode the same amino acid) across different organisms, genes, and functional categories. cubar is a comprehensive R package for analyzing codon usage bias in coding sequences. It provides a unified framework for calculating established codon usage metrics, conducting sliding-window analyses or differential usage analyses, and optimizing sequences for heterologous expression.

Features

🧬 Codon-Level Analysis

RSCU calculation: Relative synonymous codon usage analysis
Amino acid usage: Frequency of each amino acid in sequences
Codon weights: Calculate weights based on gene expression, tRNA availability, and mRNA stability
Optimal codon inference: Machine learning-based identification of optimal codons
Codon-anticodon visualization: Visualization of codon-tRNA pairing relationships
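
To make the RSCU definition concrete, here is a minimal base R sketch that does not call cubar. It assumes the usual definition (a codon's observed count divided by the mean count of its synonymous family). The `codon_family` table, `split_codons()`, and `toy_rscu()` are hypothetical names introduced for illustration only, and the toy table covers just two amino acids.

```r
# Toy synonymous families (standard genetic code, two amino acids only).
codon_family <- c(
  GCT = "Ala", GCC = "Ala", GCA = "Ala", GCG = "Ala",
  AAA = "Lys", AAG = "Lys"
)

# Split an in-frame coding sequence into consecutive codons.
split_codons <- function(cds) {
  starts <- seq(1, nchar(cds) - 2, by = 3)
  substring(cds, starts, starts + 2)
}

# RSCU = observed codon count / mean count within its synonymous family.
# Codons absent from `family` are ignored; a family with no observed codons
# yields NaN.
toy_rscu <- function(cds, family = codon_family) {
  codons <- split_codons(cds)
  counts <- table(factor(codons, levels = names(family)))
  counts <- setNames(as.numeric(counts), names(family))
  family_mean <- ave(counts, family, FUN = mean)
  counts / family_mean
}

rscu <- toy_rscu("GCTGCTGCCAAGAAGAAGAAA")
print(round(rscu, 3))
```

Within each family the RSCU values sum to the number of synonymous codons, so a value above 1 marks a codon used more often than expected under uniform usage.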

📊 Gene-Level Metrics

Codon frequency tabulation: Count codon occurrences across sequences
CAI (Codon Adaptation Index): Measure similarity to highly expressed genes
ENC (Effective Number of Codons): Assess codon usage bias strength
Fop (Fraction of Optimal codons): Calculate proportion of optimal codons
tAI (tRNA Adaptation Index): Match codon usage to tRNA availability
CSCg (Codon Stabilization Coefficients): Quantify mRNA stability effects
Dp (Deviation from Proportionality): Analyze virus-host codon usage relationships
GC content metrics: Overall GC, GC3s (synonymous 3rd codon positions), GC4d (4-fold degenerate sites)
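
As an illustration of how a gene-level metric is built from codon-level weights, the sketch below computes a CAI-style score in base R as the geometric mean of per-codon relative adaptiveness values. The weights in `w` are made-up numbers and `toy_cai()` is a hypothetical helper; cubar's own implementation may treat edge cases (for example zero counts or codons without synonyms) differently.

```r
# Hypothetical relative adaptiveness weights in (0, 1]: each codon's usage in a
# reference gene set divided by that of the most used synonymous codon.
w <- c(GCT = 1.0, GCC = 0.5, AAG = 1.0, AAA = 0.25)

# CAI-style score: geometric mean of the weights of the codons in a gene.
toy_cai <- function(codons, w) {
  w_gene <- w[codons]
  w_gene <- w_gene[!is.na(w_gene)]  # skip codons that have no weight
  exp(mean(log(w_gene)))
}

cai <- toy_cai(c("GCT", "GCC", "AAG", "AAA"), w)
cai_optimal <- toy_cai(c("GCT", "AAG", "AAG"), w)
print(c(mixed = cai, optimal = cai_optimal))
```

A gene made up only of the most used codon in each family scores 1, and the score drops as less favoured codons appear.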

🛠️ Utilities & Tools

Sliding window analysis: Positional codon usage patterns within genes
Sequence optimization: Redesign sequences for optimal expression
Differential codon usage: Statistical comparison between sequence sets
Quality control: Comprehensive CDS validation and preprocessing
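
The idea behind sliding window analysis can be sketched in a few lines of base R. The example below is an assumption-laden toy, not cubar's interface: `sliding_gc3()` is a hypothetical helper that reports the fraction of codons ending in G or C within each window. Note that it computes plain GC3, not GC3s, because it does not exclude codons without synonymous alternatives.

```r
# Fraction of codons ending in G or C within sliding windows of `width` codons,
# advancing `step` codons at a time.
sliding_gc3 <- function(codons, width = 3, step = 1) {
  stopifnot(length(codons) >= width)
  is_gc <- substr(codons, 3, 3) %in% c("G", "C")
  starts <- seq(1, length(codons) - width + 1, by = step)
  data.frame(
    start = starts,
    end = starts + width - 1,
    gc3 = vapply(starts, function(s) mean(is_gc[s:(s + width - 1)]), numeric(1))
  )
}

codons <- c("ATG", "GCT", "GCC", "AAG", "AAA", "GGC", "TTT")
win <- sliding_gc3(codons, width = 3, step = 2)
print(win)
```

Any per-window statistic could replace the `mean(is_gc[...])` call; the window bookkeeping stays the same.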

Why Choose cubar?

🚀 High Performance: Process large datasets (>100,000 sequences) efficiently using optimized Biostrings and data.table backends
🧬 Flexible Genetic Codes: Support for all NCBI genetic codes plus custom genetic code tables
🔗 R Ecosystem Integration: Seamlessly integrate with other bioinformatics and data analysis packages
📚 Comprehensive Documentation: Extensive tutorials, examples, and theoretical background
🔬 Research Ready: Implements established metrics with proper citations and validation

Installation

Stable Release (Recommended)

Install the latest stable version from CRAN:

install.packages("cubar")

Development Version

Install the latest development version from GitHub:

# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
}

# Install cubar from GitHub
devtools::install_github("mt1022/cubar", dependencies = TRUE)

Dependencies

System Requirements:
R (≥ 4.1.0)

Required Packages:
Biostrings (≥ 2.60.0) - Bioconductor package for sequence manipulation
IRanges (≥ 2.34.0) - Bioconductor infrastructure for range operations
data.table (≥ 1.14.0) - High-performance data manipulation
ggplot2 (≥ 3.3.5) - Data visualization
rlang (≥ 0.4.11) - Language tools

Note: Bioconductor packages will be installed automatically, but you may need to update your R installation if you encounter compatibility issues.
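
If installation fails, it can help to check which requirements are unmet before retrying. The following is a small sketch using only base R and utils; `unmet_deps()` is a hypothetical helper, and the version numbers are copied from the list above.

```r
# Minimum versions as listed in the Dependencies section.
required <- c(Biostrings = "2.60.0", IRanges = "2.34.0",
              data.table = "1.14.0", ggplot2 = "3.3.5", rlang = "0.4.11")

# Return the names of packages that are missing or older than required.
unmet_deps <- function(required) {
  ok <- vapply(names(required), function(pkg) {
    requireNamespace(pkg, quietly = TRUE) &&
      utils::packageVersion(pkg) >= required[[pkg]]
  }, logical(1))
  names(required)[!ok]
}

r_ok <- getRversion() >= "4.1.0"
unmet <- unmet_deps(required)
if (!r_ok) message("R 4.1.0 or later is required")
if (length(unmet) > 0) {
  message("Missing or outdated: ", paste(unmet, collapse = ", "))
}
# Bioconductor packages can be installed manually if needed, e.g.:
# BiocManager::install(c("Biostrings", "IRanges"))
```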

Documentation & Tutorials

📖 Complete documentation is available within R (?function_name) and on our package website.

🎯 Getting Started

Introduction to cubar - Basic usage and core functionality
Non-standard Genetic Codes - Working with alternative genetic codes
Codon Optimization - Sequence optimization strategies

📚 Advanced Topics

Mathematical Foundations - Detailed theory behind the metrics
Function Reference - Complete function documentation

Example Workflow

Here’s a short example demonstrating key functionality with the yeast datasets loaded via data():

library(cubar)
library(ggplot2)

# 1. Load and quality-check sequences
data(yeast_cds)
clean_cds <- check_cds(yeast_cds)

# 2. Calculate codon frequencies
codon_freq <- count_codons(clean_cds)

# 3. Calculate multiple metrics
enc <- get_enc(codon_freq)           # Effective number of codons
gc3s <- get_gc3s(codon_freq)         # GC content at synonymous 3rd codon positions

# 4. Calculate CAI with RSCU of highly expressed genes
data(yeast_exp)
yeast_exp <- yeast_exp[yeast_exp$gene_id %in% rownames(codon_freq), ]
high_expr <- head(yeast_exp[order(-yeast_exp$fpkm), ], 500)
rscu_high <- est_rscu(codon_freq[high_expr$gene_id, ])
cai <- get_cai(codon_freq, rscu_high)

# 5. Visualize results
df <- data.frame(ENC = enc, CAI = cai, GC3s = gc3s)
ggplot(df, aes(color = GC3s, x = ENC, y = CAI)) + 
  geom_point(alpha = 0.6) + 
  scale_color_viridis_c() +
  labs(title = "Codon Usage Bias Relationships",
       x = "Effective Number of Codons", y = "Codon Adaptation Index")

🆘 Getting Help

📋 GitHub Issues: Report bugs, request features, or ask questions
📖 Documentation: Check function help (?function_name) and online docs

For complementary analysis, consider these R packages:

Biostrings - Sequence input/output and manipulation
Peptides - Peptide and protein property calculations

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

The R and Bioconductor communities for excellent foundational packages
Contributors and users who have provided feedback and improvements
GitHub Education for providing free access to development tools
GitHub Copilot was used to suggest code snippets during development

Citation

If you use cubar in your research, please cite:

Mengyue Liu, Bu Zi, Hebin Zhang, Hong Zhang, cubar: a versatile package for codon usage bias analysis in R, Genetics, 2025, iyaf191, https://doi.org/10.1093/genetics/iyaf191

Please also cite the original studies associated with any codon usage metrics or third-party software you use. You can find the relevant references in the documentation of the corresponding functions (for example, type ?cubar::get_enc in the R console and check the “References” section in the help page).

