check_cds performs quality control on coding sequences (CDSs). It filters out sequences that fail the length, start-codon, stop-codon, or internal-stop-codon checks, and optionally removes the start and stop codons from the sequences that pass. This ensures that the sequences meet the requirements for downstream codon usage analysis.

Usage

check_cds(
  seqs,
  codon_table = get_codon_table(),
  min_len = 6,
  check_len = TRUE,
  check_start = TRUE,
  check_stop = TRUE,
  check_istop = TRUE,
  rm_start = TRUE,
  rm_stop = TRUE,
  start_codons = c("ATG")
)

Arguments

seqs

Input CDS sequences as a DNAStringSet or compatible object.

codon_table

Codon table matching the genetic code of the input sequences. Generated using get_codon_table() or create_codon_table().

min_len

Minimum CDS length in nucleotides (default: 6).

check_len

Logical. Check whether CDS length is divisible by 3 (default: TRUE).

check_start

Logical. Check whether CDSs begin with valid start codons (default: TRUE).

check_stop

Logical. Check whether CDSs end with valid stop codons (default: TRUE).

check_istop

Logical. Check for internal stop codons (default: TRUE).

rm_start

Logical. Remove start codons from the sequences (default: TRUE).

rm_stop

Logical. Remove stop codons from the sequences (default: TRUE).

start_codons

Character vector specifying valid start codons (default: "ATG").

Value

A DNAStringSet containing filtered and optionally trimmed CDS sequences that pass all quality control checks.
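
The checks described under Arguments can be illustrated with a small base R sketch that works on plain character strings. This is not the package's implementation: the helper name check_cds_sketch is hypothetical, the stop codons are assumed to be those of the standard genetic code (TAA, TAG, TGA) rather than taken from a codon table, and all checks are always applied.

```r
# Hypothetical helper for illustration only; not part of the package.
# Assumes the standard genetic code's stop codons instead of a codon table.
check_cds_sketch <- function(seqs,
                             min_len = 6,
                             start_codons = c("ATG"),
                             stop_codons = c("TAA", "TAG", "TGA"),
                             rm_start = TRUE,
                             rm_stop = TRUE) {
  seqs <- toupper(seqs)

  # Decide for each sequence whether it passes every check
  keep <- vapply(seqs, function(s) {
    n <- nchar(s)
    # Length checks: minimum length and divisibility by 3
    if (n < min_len || n %% 3 != 0) return(FALSE)
    # Split the sequence into consecutive codons
    starts <- seq(1, n, by = 3)
    codons <- substring(s, starts, starts + 2)
    k <- length(codons)
    # Start codon, terminal stop codon, and no stop codon before the last one
    codons[1] %in% start_codons &&
      codons[k] %in% stop_codons &&
      !any(codons[-k] %in% stop_codons)
  }, logical(1))

  out <- seqs[keep]
  if (length(out) == 0) return(out)

  # Optional trimming of the first and last codon
  if (rm_start) out <- substr(out, 4, nchar(out))
  if (rm_stop) out <- substr(out, 1, nchar(out) - 3)
  out
}

toy <- c(ok      = "ATGAAATAA",     # passes all checks
         nostart = "CTGAAATAA",     # does not begin with ATG
         istop   = "ATGTAAAAATAA",  # internal stop codon
         badlen  = "ATGAATAA")      # length not divisible by 3
check_cds_sketch(toy)
```

Only the sequence named ok survives, and with both trimming options left at TRUE it is reduced to its single internal codon, AAA.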

Examples

# Perform CDS sequence quality control for a sample of yeast genes
s <- head(yeast_cds, 10)
print(s)
#> DNAStringSet object of length 10:
#>      width seq                                              names               
#>  [1]   471 ATGAGTTCCCGGTTTGCAAGAAG...TGATGTGGATATGGATGCGTAA YPL071C
#>  [2]   432 ATGTCTAGATCTGGTGTTGCTGT...CAGAGGCGCTGGTTCTCATTAA YLL050C
#>  [3]  2160 ATGTCTGGAATGGGTATTGCGAT...AGAGAGCCTTGCTGGAATATAG YMR172W
#>  [4]   663 ATGTCAGCACCTGCTCAAAACAA...TGAAGACGATGCTGATTTATAA YOR185C
#>  [5]  2478 ATGGATAACTTCAAAATTTACAG...ATATCAAAATGGCAGAAAATGA YLL032C
#>  [6]  2703 ATGGGCTCCAATAAGGAAGCAAA...AAAGCTGCCATATACCAAATAA YBR225W
#>  [7]  1488 ATGAAAACTGATAGATTACTGAT...TCAGGCTCATTTTGCAATCTAA YEL041W
#>  [8]  1305 ATGTCTCAACACGCAAGCTCATC...GGAGAACGAAATTACTATATAA YOR237W
#>  [9]  1413 ATGACTATCCCTGGAAGATTTAT...CTGCTCTGGTATACATAAATAA YMR027W
#> [10]   195 ATGAAGATTTTCACGCTGTATAC...TGGCACTCACACTACGCACTAG YBR182C-A
check_cds(s)
#> DNAStringSet object of length 10:
#>      width seq                                              names               
#>  [1]   465 AGTTCCCGGTTTGCAAGAAGTAA...TACTGATGTGGATATGGATGCG YPL071C
#>  [2]   426 TCTAGATCTGGTGTTGCTGTTGC...CAGCAGAGGCGCTGGTTCTCAT YLL050C
#>  [3]  2154 TCTGGAATGGGTATTGCGATTCT...GCAAGAGAGCCTTGCTGGAATA YMR172W
#>  [4]   657 TCAGCACCTGCTCAAAACAATGC...TGATGAAGACGATGCTGATTTA YOR185C
#>  [5]  2472 GATAACTTCAAAATTTACAGTAC...TAAATATCAAAATGGCAGAAAA YLL032C
#>  [6]  2697 GGCTCCAATAAGGAAGCAAAAAA...GCCAAAGCTGCCATATACCAAA YBR225W
#>  [7]  1482 AAAACTGATAGATTACTGATTAA...TCGTCAGGCTCATTTTGCAATC YEL041W
#>  [8]  1299 TCTCAACACGCAAGCTCATCTTC...GAGGGAGAACGAAATTACTATA YOR237W
#>  [9]  1407 ACTATCCCTGGAAGATTTATGAC...TTTCTGCTCTGGTATACATAAA YMR027W
#> [10]   189 AAGATTTTCACGCTGTATACCAT...TAGTGGCACTCACACTACGCAC YBR182C-A
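
As a quick sanity check on the output above, the widths can be compared before and after the call. The numbers below are copied by hand from the two printed DNAStringSet objects, and the variable names are illustrative. With rm_start and rm_stop both TRUE, each retained sequence should lose one start codon and one stop codon, that is, 6 nucleotides.

```r
# Widths copied from the printed output above (input, then check_cds result)
width_before <- c(471, 432, 2160, 663, 2478, 2703, 1488, 1305, 1413, 195)
width_after  <- c(465, 426, 2154, 657, 2472, 2697, 1482, 1299, 1407, 189)

# Nucleotides removed from each sequence
trimmed <- width_before - width_after

# One start codon plus one stop codon is 6 nucleotides per sequence
all(trimmed == 6)

# Input and output lengths are both whole numbers of codons
all(width_before %% 3 == 0)
all(width_after %% 3 == 0)
```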