Command-line tools for easy parsing of biological sequence files

7/15/2015

Want an awesome BioPerl-based suite of command-line tools that slot into Unix pipelines and process FASTA-formatted files record-by-record instead of line-by-line? If you work with FASTA files in any capacity, you do.

FAST tools. They are fanFASTic (sorry). Trust me. Published here earlier this year, FAST is designed to be easy to use for anyone who has ever run a Unix command in a terminal.

For example, if you want to grep sequences for a particular motif, use fasgrep. If you want the lengths of the sequences in a multifasta file, use faslen. Need to quickly translate an entire file of CDS records? Use fasxl. Output pipes easily into other Unix or FAST commands. For the bench scientists I work with, these tools make it much easier to handle large files they would otherwise be uncomfortable with. For me, they make my pipelines cleaner and preliminary investigations of data easier.
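To make the record-by-record point concrete, here is a minimal Python sketch. It is not how FAST is implemented; it only illustrates why searching whole records differs from searching lines. The `read_fasta` helper and the toy records are hypothetical, made up for this example. The motif is split across a line break in the first record, so a line-based search misses it while a record-based search finds it and can report the sequence length at the same time.

```python
import re
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record file with wrapped sequence lines.
example = """\
>seq1 toy record
ATGGCCGA
ATTCTAA
>seq2 toy record
ATGCCCTAG
""".splitlines()

motif = re.compile("GAATTC")  # split across the line break in seq1

# Line-by-line search: each sequence line is checked in isolation.
line_hits = [l for l in example if not l.startswith(">") and motif.search(l)]

# Record-by-record search: the full sequence is checked, length reported.
record_hits = [(h, len(s)) for h, s in read_fasta(example) if motif.search(s)]

print(line_hits)    # []
print(record_hits)  # [('seq1 toy record', 15)]
```

The same idea is what makes record-aware tools safe to use on files where sequences are wrapped at a fixed width.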

Instead of opening a multifasta file and using the 'find' option to search for your sequence of interest, you can do it from the command line. This is especially powerful if you want to pull out tens of sequences: doing them one by one is tedious and likely a waste of your time. Example 4 in the cookbook is a template for this.
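As a rough picture of what pulling out many sequences at once involves, here is a short Python sketch under stated assumptions. It is not the cookbook recipe and not FAST's code: `read_fasta`, `select_by_id`, and the toy records are hypothetical names invented for illustration, and the identifier is assumed to be the first whitespace-delimited word of the header line.

```python
from typing import Iterable, Iterator, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_by_id(
    records: Iterable[Tuple[str, str]], wanted: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Keep records whose identifier (first word of the header) is wanted."""
    for header, seq in records:
        fields = header.split(maxsplit=1)
        identifier = fields[0] if fields else ""
        if identifier in wanted:
            yield header, seq


# Hypothetical three-record file; we want two of the three records.
example = """\
>geneA hypothetical protein
ATGAAA
TTTTAA
>geneB kinase
ATGCCCTAG
>geneC transporter
ATGGGGTGA
""".splitlines()

wanted_ids = {"geneA", "geneC"}
selected = list(select_by_id(read_fasta(example), wanted_ids))

# Write the selected records back out in FASTA format.
for header, seq in selected:
    print(f">{header}\n{seq}")
```

Because the whole record travels together, the header and its full sequence stay paired no matter how the sequence was wrapped in the input.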

If you already have BioPerl on your system, FASTools are simple to install. Just run the following:

perl -MCPAN -e 'install FAST'

Once FASTools are installed, you get a suite of programs designed to make multifasta processing easy. Each tool has its own man page. Here is the list of currently available executables (I have highlighted the ones I use every day in bold):

faslen - annotate sequence lengths
fascodon - tally/annotate codon usage
fascomp - tally/annotate monomer frequencies
fasxl - translate gapped and ungapped sequences and alignments
fasrc - reverse complement nucleotide sequences and alignments

fasgrep - select sequence records by perl regular expressions
fasfilter - select sequence records by numerical values
fastax - select sequence records by NCBI Taxonomy IDs or names
fascut - select/reorder sequence record data by sequential ranges
fasuniq - remove duplicate sequence records from sorted data
fashead - select leading sequence records
fastail - select trailing sequence records
alncut - select sites based on variation and gap content
gbfcut - select sequences by regex match on features in a GenBank file
gbfalncut - select sites by regex match on features in a GenBank file

fasconvert - convert sequences to or from FASTA format
fassort - sort sequence records
fastaxsort - sort sequence records by NCBI Taxonomy IDs or names
faspaste - concatenate sequence record data
fastr - transform sequence records by characters and alphabets
fassub - transform sequence records by regex-based substitutions

faswc - tally sequences and characters
alnpi - tally molecular population genetic statistics

If you want some ideas to inspire you to become a multifasta record ninja, check out the cookbook. You can find it here. The examples in the cookbook are built around tasks we ran constantly during data processing across our projects in the Ardell lab at UC Merced.

If you want to see other examples, ask us! We would like to show how versatile these tools really can be! If you'd like to contribute, head on over to the GitHub repository.

Happy parsing!
