Codon Usage Bias Analysis • cubar

Comprehensive Codon Usage Bias Analysis in R

Table of Contents

- Overview
- Features
- Why Choose cubar?
- Installation
- Documentation & Tutorials
  - 🎯 Getting Started
  - 📚 Advanced Topics
- Example Workflow
- 🆘 Getting Help
- Related Packages
- License
- Acknowledgments

Overview

Codon usage bias refers to the non-uniform usage of synonymous codons (codons that encode the same amino acid) across different organisms, genes, and functional categories. cubar is a comprehensive R package for analyzing codon usage bias in coding sequences. It provides a unified framework for calculating established codon usage metrics, running sliding-window and differential codon usage analyses, and optimizing sequences for heterologous expression.

Features

🧬 Codon-Level Analysis

RSCU calculation: Relative synonymous codon usage analysis
Amino acid usage: Frequency of each amino acid in sequences
Codon weights: Calculate weights based on gene expression, tRNA availability, and mRNA stability
Optimal codon inference: Machine learning-based identification of optimal codons
Codon-anticodon visualization: Visualization of codon-tRNA pairing relationships
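
To make the RSCU definition concrete, here is a minimal base R sketch. It is not cubar's implementation (the package provides est_rscu() for this); the codon counts and the variable names counts, aa, and rscu are made up for illustration. RSCU is taken here as a codon's observed count divided by the mean count of all codons encoding the same amino acid.

```r
# Toy codon counts for two amino acids (illustrative numbers, not real data)
counts <- c(GCT = 30, GCC = 10, GCA = 15, GCG = 5,  # Ala: 4 synonymous codons
            AAA = 30, AAG = 10)                     # Lys: 2 synonymous codons

# Amino acid encoded by each codon, in the same order as `counts`
aa <- c(GCT = "Ala", GCC = "Ala", GCA = "Ala", GCG = "Ala",
        AAA = "Lys", AAG = "Lys")

# RSCU = observed count / mean count within the synonymous codon family
rscu <- counts / ave(counts, aa, FUN = mean)

print(round(rscu, 2))
```

Values above 1 mark codons used more often than expected under uniform synonymous usage, and values below 1 mark codons used less often.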

📊 Gene-Level Metrics

Codon frequency tabulation: Count codon occurrences across sequences
CAI (Codon Adaptation Index): Measure similarity to highly expressed genes
ENC (Effective Number of Codons): Assess codon usage bias strength
Fop (Fraction of Optimal codons): Calculate proportion of optimal codons
tAI (tRNA Adaptation Index): Match codon usage to tRNA availability
CSCg (Codon Stabilization Coefficients): Quantify mRNA stability effects
Dp (Deviation from Proportionality): Analyze virus-host codon usage relationships
GC content metrics: Overall GC, GC3s (GC content at synonymous 3rd codon positions), GC4d (GC content at 4-fold degenerate sites)
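
As an illustration of how a gene-level metric builds on codon-level weights, the sketch below computes a CAI-style score in base R. It is a simplified illustration under stated assumptions, not cubar's get_cai(): the reference counts, the toy gene, and the names ref_counts, w, and gene are hypothetical. Each codon's weight is its count relative to the most used synonymous codon in the reference set, and the score is the geometric mean of the weights of the gene's codons.

```r
# Codon counts from a hypothetical set of highly expressed reference genes
ref_counts <- c(GCT = 30, GCC = 10, GCA = 15, GCG = 5,  # Ala
                AAA = 30, AAG = 10)                     # Lys
aa <- c(GCT = "Ala", GCC = "Ala", GCA = "Ala", GCG = "Ala",
        AAA = "Lys", AAG = "Lys")

# Relative adaptiveness: count relative to the most frequent synonymous codon
w <- ref_counts / ave(ref_counts, aa, FUN = max)

# Codons of a toy gene (hypothetical)
gene <- c("GCT", "AAA", "GCG", "AAG")

# CAI-style score: geometric mean of the weights of the gene's codons
cai <- exp(mean(log(w[gene])))
print(cai)
```

A gene made only of the most frequent reference codons would score 1; rarer codons pull the score toward 0.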

🛠️ Utilities & Tools

Sliding window analysis: Positional codon usage patterns within genes
Sequence optimization: Redesign sequences for optimal expression
Differential codon usage: Statistical comparison between sequence sets
Quality control: Comprehensive CDS validation and preprocessing
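
The idea behind sliding-window analysis can be sketched in a few lines of base R. This is an illustration only, not cubar's own routine: the sequence, window size, and step are hypothetical, and the statistic is plain GC content at third codon positions (simpler than GC3s, which considers only synonymous positions).

```r
# Toy coding sequence of 12 codons (hypothetical)
cds <- "ATGGCTGCCAAAGCGGGCTTTCCCGGGAAATGCTAA"

# Split the sequence into codons
starts <- seq(1, nchar(cds), by = 3)
codons <- substring(cds, starts, starts + 2)

# TRUE where the third codon position is G or C
gc3 <- substr(codons, 3, 3) %in% c("G", "C")

# Slide a window of 4 codons along the gene in steps of 2 codons
window <- 4
step <- 2
win_starts <- seq(1, length(codons) - window + 1, by = step)
gc3_window <- vapply(
  win_starts,
  function(i) mean(gc3[i:(i + window - 1)]),
  numeric(1)
)

print(data.frame(start_codon = win_starts, gc3 = gc3_window))
```

Any per-window statistic can be substituted for mean(gc3[...]) to profile other metrics along a gene.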

Why Choose cubar?

🚀 High Performance: Process large datasets (>100,000 sequences) efficiently using optimized Biostrings and data.table backends
🧬 Flexible Genetic Codes: Support for all NCBI genetic codes plus custom genetic code tables
🔗 R Ecosystem Integration: Seamlessly integrate with other bioinformatics and data analysis packages
📚 Comprehensive Documentation: Extensive tutorials, examples, and theoretical background
🔬 Research Ready: Implements established metrics with proper citations and validation

Installation

Stable Release (Recommended)

Install the latest stable version from CRAN:

install.packages("cubar")

Development Version

Install the latest development version from GitHub:

# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
}

# Install cubar from GitHub
devtools::install_github("mt1022/cubar", dependencies = TRUE)

Dependencies

System Requirements:
- R (≥ 4.1.0)

Required Packages:
- Biostrings (≥ 2.60.0) - Bioconductor package for sequence manipulation
- IRanges (≥ 2.34.0) - Bioconductor infrastructure for range operations
- data.table (≥ 1.14.0) - High-performance data manipulation
- ggplot2 (≥ 3.3.5) - Data visualization
- rlang (≥ 0.4.11) - Language tools

Note: Bioconductor packages will be installed automatically, but you may need to update your R installation if you encounter compatibility issues.
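
If installation fails, a quick check of which dependencies are already present can help narrow down the problem. The snippet below is a minimal sketch using only base R; the variable names deps, installed, and missing_pkgs are illustrative. The BiocManager commands in the comment are one common way to install Bioconductor packages manually and are given as a suggestion, not as a required step.

```r
# Dependencies listed above
deps <- c("Biostrings", "IRanges", "data.table", "ggplot2", "rlang")

# TRUE for each package whose namespace can be loaded
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- deps[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Bioconductor packages can be installed manually if needed, for example:
  #   install.packages("BiocManager")
  #   BiocManager::install(c("Biostrings", "IRanges"))
} else {
  message("All listed dependencies are available.")
}
```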

Documentation & Tutorials

📖 Complete documentation is available within R (?function_name) and on our package website.

🎯 Getting Started

Introduction to cubar - Basic usage and core functionality
Non-standard Genetic Codes - Working with alternative genetic codes
Codon Optimization - Sequence optimization strategies

📚 Advanced Topics

Mathematical Foundations - Detailed theory behind the metrics
Function Reference - Complete function documentation

Example Workflow

Here’s a typical analysis workflow demonstrating key functionality:

library(cubar)
library(ggplot2)

# 1. Load and quality-check sequences
data(yeast_cds)
clean_cds <- check_cds(yeast_cds)

# 2. Calculate codon frequencies
codon_freq <- count_codons(clean_cds)

# 3. Calculate multiple metrics
enc <- get_enc(codon_freq)           # Effective number of codons
gc3s <- get_gc3s(codon_freq)         # GC content at 3rd positions

# 4. Analyze highly expressed genes
data(yeast_exp)
yeast_exp <- yeast_exp[yeast_exp$gene_id %in% rownames(codon_freq), ]
high_expr <- head(yeast_exp[order(-yeast_exp$fpkm), ], 500)
rscu_high <- est_rscu(codon_freq[high_expr$gene_id, ])
cai <- get_cai(codon_freq, rscu_high)

# 5. Visualize results
df <- data.frame(ENC = enc, CAI = cai, GC3s = gc3s)
ggplot(df, aes(color = GC3s, x = ENC, y = CAI)) + 
  geom_point(alpha = 0.6) + 
  scale_color_viridis_c() +
  labs(title = "Codon Usage Bias Relationships",
       x = "Effective Number of Codons", y = "Codon Adaptation Index")

🆘 Getting Help

📋 GitHub Issues: Report bugs, request features, or ask questions
📖 Documentation: Check function help (?function_name) and online docs

Related Packages

For complementary analysis, consider these R packages:

Biostrings - Sequence input/output and manipulation
Peptides - Peptide and protein property calculations

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

GitHub Copilot was used to suggest code snippets during development
GitHub Education for providing free access to development tools
The R and Bioconductor communities for excellent foundational packages
Contributors and users who have provided feedback and improvements
