
check_cds performs quality control of CDS sequences: it filters out sequences that fail the enabled checks (length, start codon, stop codon, internal stop codons) and optionally removes start and stop codons.

Usage

check_cds(
  seqs,
  codon_table = get_codon_table(),
  min_len = 6,
  check_len = TRUE,
  check_start = TRUE,
  check_stop = TRUE,
  check_istop = TRUE,
  rm_start = TRUE,
  rm_stop = TRUE,
  start_codons = c("ATG")
)

Arguments

seqs

input CDS sequences

codon_table

codon table matching the genetic code of seqs

min_len

minimum CDS length in nucleotides (nt)

check_len

check whether CDS length is divisible by 3

check_start

check whether CDSs have start codons

check_stop

check whether CDSs have stop codons

check_istop

check whether CDSs have internal stop codons

rm_start

whether to remove start codons

rm_stop

whether to remove stop codons

start_codons

character vector of accepted start codons
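To make the meaning of the individual checks concrete, the following is a minimal base-R sketch of the kind of filtering the arguments describe. It is not the package's implementation: the helper name `check_cds_sketch`, the use of plain character vectors instead of a DNAStringSet, the hard-coded standard-code stop codons, and the order in which the checks run are all assumptions made for illustration.

```r
# Hypothetical illustration only; check_cds itself may differ in
# implementation, order of checks, and handling of edge cases.
check_cds_sketch <- function(seqs, min_len = 6,
                             start_codons = c("ATG"),
                             stop_codons = c("TAA", "TAG", "TGA"), # assumed: standard genetic code
                             rm_start = TRUE, rm_stop = TRUE) {
  seqs <- toupper(seqs)
  n <- nchar(seqs)

  # Length checks: at least min_len nt (and one full codon), divisible by 3
  seqs <- seqs[n >= max(min_len, 3) & n %% 3 == 0]

  # Split each sequence into consecutive, non-overlapping codons
  split_codons <- function(x) {
    starts <- seq(1, nchar(x), by = 3)
    substring(x, starts, starts + 2)
  }
  codons <- lapply(seqs, split_codons)

  # Keep sequences with a start codon, a terminal stop codon,
  # and no stop codon before the last position
  ok <- vapply(codons, function(cd) {
    k <- length(cd)
    cd[1] %in% start_codons &&
      cd[k] %in% stop_codons &&
      !any(cd[-k] %in% stop_codons)
  }, logical(1))
  codons <- codons[ok]

  # Optionally trim the start and stop codons, then rejoin
  vapply(codons, function(cd) {
    if (rm_start) cd <- cd[-1]
    if (rm_stop) cd <- cd[-length(cd)]
    paste(cd, collapse = "")
  }, character(1))
}

x <- c(ok            = "ATGAAATAA",
       no_start      = "CTGAAATAA",
       internal_stop = "ATGTAAAAATAA",
       bad_len       = "ATGAAATA")
res <- check_cds_sketch(x)
print(res)  # only "ok" passes every check; it is trimmed to "AAA"
```

In this toy input, each rejected sequence fails exactly one check, which makes it easy to see which argument of check_cds would govern its removal.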

Value

DNAStringSet of filtered (and trimmed) CDS sequences

Examples

# CDS sequence QC for a sample of yeast genes
s <- head(yeast_cds, 10)
#> Loading required package: Biostrings
#> Loading required package: BiocGenerics
#> 
#> Attaching package: ‘BiocGenerics’
#> The following objects are masked from ‘package:stats’:
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from ‘package:base’:
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> 
#> Attaching package: ‘S4Vectors’
#> The following object is masked from ‘package:utils’:
#> 
#>     findMatches
#> The following objects are masked from ‘package:base’:
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: XVector
#> Loading required package: GenomeInfoDb
#> 
#> Attaching package: ‘Biostrings’
#> The following object is masked from ‘package:base’:
#> 
#>     strsplit
print(s)
#> DNAStringSet object of length 10:
#>      width seq                                              names               
#>  [1]   471 ATGAGTTCCCGGTTTGCAAGAAG...TGATGTGGATATGGATGCGTAA YPL071C
#>  [2]   432 ATGTCTAGATCTGGTGTTGCTGT...CAGAGGCGCTGGTTCTCATTAA YLL050C
#>  [3]  2160 ATGTCTGGAATGGGTATTGCGAT...AGAGAGCCTTGCTGGAATATAG YMR172W
#>  [4]   663 ATGTCAGCACCTGCTCAAAACAA...TGAAGACGATGCTGATTTATAA YOR185C
#>  [5]  2478 ATGGATAACTTCAAAATTTACAG...ATATCAAAATGGCAGAAAATGA YLL032C
#>  [6]  2703 ATGGGCTCCAATAAGGAAGCAAA...AAAGCTGCCATATACCAAATAA YBR225W
#>  [7]  1488 ATGAAAACTGATAGATTACTGAT...TCAGGCTCATTTTGCAATCTAA YEL041W
#>  [8]  1305 ATGTCTCAACACGCAAGCTCATC...GGAGAACGAAATTACTATATAA YOR237W
#>  [9]  1413 ATGACTATCCCTGGAAGATTTAT...CTGCTCTGGTATACATAAATAA YMR027W
#> [10]   195 ATGAAGATTTTCACGCTGTATAC...TGGCACTCACACTACGCACTAG YBR182C-A
check_cds(s)
#> DNAStringSet object of length 10:
#>      width seq                                              names               
#>  [1]   465 AGTTCCCGGTTTGCAAGAAGTAA...TACTGATGTGGATATGGATGCG YPL071C
#>  [2]   426 TCTAGATCTGGTGTTGCTGTTGC...CAGCAGAGGCGCTGGTTCTCAT YLL050C
#>  [3]  2154 TCTGGAATGGGTATTGCGATTCT...GCAAGAGAGCCTTGCTGGAATA YMR172W
#>  [4]   657 TCAGCACCTGCTCAAAACAATGC...TGATGAAGACGATGCTGATTTA YOR185C
#>  [5]  2472 GATAACTTCAAAATTTACAGTAC...TAAATATCAAAATGGCAGAAAA YLL032C
#>  [6]  2697 GGCTCCAATAAGGAAGCAAAAAA...GCCAAAGCTGCCATATACCAAA YBR225W
#>  [7]  1482 AAAACTGATAGATTACTGATTAA...TCGTCAGGCTCATTTTGCAATC YEL041W
#>  [8]  1299 TCTCAACACGCAAGCTCATCTTC...GAGGGAGAACGAAATTACTATA YOR237W
#>  [9]  1407 ACTATCCCTGGAAGATTTATGAC...TTTCTGCTCTGGTATACATAAA YMR027W
#> [10]   189 AAGATTTTCACGCTGTATACCAT...TAGTGGCACTCACACTACGCAC YBR182C-A
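
As a quick sanity check on the output above, the widths printed before and after check_cds(s) can be compared directly. The snippet below simply copies those widths from the printed output into two vectors (the names `before` and `after` are introduced here for illustration) and confirms that, with the default rm_start = TRUE and rm_stop = TRUE, every sequence is shorter by exactly 6 nt, that is, one start codon plus one stop codon.

```r
# Widths copied from the printed DNAStringSet objects above
before <- c(471, 432, 2160, 663, 2478, 2703, 1488, 1305, 1413, 195)
after  <- c(465, 426, 2154, 657, 2472, 2697, 1482, 1299, 1407, 189)

# Each CDS loses one start codon (3 nt) and one stop codon (3 nt)
width_diff <- before - after
print(unique(width_diff))

# The trimmed sequences still consist of whole codons
stopifnot(all(width_diff == 6), all(after %% 3 == 0))
```

All ten sequences are retained in this example, so the input and output have the same length and the comparison is element-wise.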