This page is dedicated to a repository containing a collection of scripts used in processing high-throughput DNA sequencing data. The scripts cover the processing of 454, Illumina, SOLiD, and Sanger reads in cluster and traditional computing environments. As I upload the code, please be aware that I am providing it as-is and make no promises regarding its accuracy or reliability. I try to make my scripts as vertical as possible, meaning that each script has few dependencies on other scripts or external libraries. Some dependencies are unavoidable, however, so take note of them before running a script.
Files:
splitReads.py
This script takes a FASTA file and splits it into the desired number of pieces. It does this in a round-robin fashion, i.e. entry 1 goes to file 1, entry 2 goes to file 2, and so on, wrapping back to file 1 after the last file.
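The round-robin scheme described for splitReads.py can be sketched as follows. This is a minimal illustration and not the code from the repository: the function names `read_fasta` and `split_round_robin` are hypothetical, and the sketch assumes every entry starts with a `>` header line.

```python
import io


def read_fasta(handle):
    """Yield (header_line, sequence_lines) for each entry in a FASTA stream."""
    header, lines = None, []
    for line in handle:
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                yield header, lines
            header, lines = line, []
        elif header is not None:
            lines.append(line)
    if header is not None:
        yield header, lines


def split_round_robin(handle, outputs):
    """Write entry i to outputs[i % len(outputs)] (hypothetical helper)."""
    for i, (header, lines) in enumerate(read_fasta(handle)):
        out = outputs[i % len(outputs)]
        out.write(header)
        out.writelines(lines)


if __name__ == "__main__":
    # Three entries split into two pieces, using in-memory files for the demo.
    source = io.StringIO(">a\nAC\n>b\nGT\n>c\nTT\n")
    pieces = [io.StringIO(), io.StringIO()]
    split_round_robin(source, pieces)
    for n, piece in enumerate(pieces, start=1):
        print("piece %d:" % n)
        print(piece.getvalue(), end="")
```

Because entries are dealt out one at a time, the pieces end up with nearly equal entry counts, although not necessarily equal numbers of bases.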
parallelRepeatMasker.py
This script helps you parallelize a RepeatMasker job by splitting the genome into the desired number of pieces and then running RepeatMasker on each piece. This method will not give exactly the same results as running RepeatMasker on the whole genome, because features spanning contigs that end up in different files can never be identified. However, I think this should not be a problem for most people.
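The split-then-run idea behind parallelRepeatMasker.py might look roughly like the sketch below. It is an assumption-laden illustration, not the repository code: `build_commands` and `run_all` are hypothetical names, the pieces are assumed to exist already as FASTA files, and RepeatMasker is invoked with only the input path, whereas a real job would probably need further options.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_commands(piece_paths, executable="RepeatMasker"):
    """Build one command (as an argument list) per genome piece."""
    # Assumption: the executable is on PATH and needs no extra options.
    return [[executable, path] for path in piece_paths]


def run_all(commands, max_workers=4, runner=subprocess.call):
    """Run the commands concurrently and return their results in input order.

    The default runner, subprocess.call, returns each command's exit code.
    Threads are enough here because the real work happens in child processes.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))


if __name__ == "__main__":
    # Hypothetical piece names; this demo only prints the commands it would run.
    for command in build_commands(["piece_1.fa", "piece_2.fa"]):
        print(" ".join(command))
```

Passing the runner as a parameter keeps the sketch easy to adapt, for example by swapping in a function that submits each command to a cluster queue instead of running it locally.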
lastz_target_query.py
This Python script demonstrates how to automate a pairwise comparison between the FASTA files in the target directory and the FASTA files in the query directory. It is written for a cluster environment, with each job submitted in parallel to an LSF queuing system. It should be easy to adapt to other queuing systems such as SGE, or even to an SMP or traditional computing environment. This particular example uses lastz to perform a local alignment of each query against each target.
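As a rough sketch of the pairing logic described for lastz_target_query.py, the code below builds one LSF submission command per target/query pair without actually submitting anything. The helper names, the file extensions, the `normal` queue name, and the output naming scheme are all assumptions for illustration; the `bsub` and lastz options shown (`-q`, `-o`, `--format=maf`, `--output=`) should be checked against your local installation.

```python
import itertools
import os


def list_fasta(directory, extensions=(".fa", ".fasta")):
    """Return the sorted FASTA paths in a directory (extensions are assumed)."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(extensions)
    )


def stem(path):
    """File name without directory or extension, e.g. 'dir/chr1.fa' -> 'chr1'."""
    return os.path.splitext(os.path.basename(path))[0]


def build_jobs(targets, queries, out_dir, queue="normal"):
    """Build one bsub command per (target, query) pair."""
    jobs = []
    for target, query in itertools.product(targets, queries):
        name = "%s_vs_%s" % (stem(query), stem(target))
        alignment = os.path.join(out_dir, name + ".maf")
        log = os.path.join(out_dir, name + ".log")
        # lastz takes the target first, then the query.
        lastz = ["lastz", target, query, "--format=maf", "--output=" + alignment]
        jobs.append(["bsub", "-q", queue, "-o", log] + lastz)
    return jobs


if __name__ == "__main__":
    # Hypothetical paths; print the commands instead of submitting them.
    for job in build_jobs(["targets/chr1.fa"], ["queries/reads.fa"], "out"):
        print(" ".join(job))
```

To adapt the sketch to another scheduler or to a single machine, only the `bsub` prefix needs to change; the lastz part of each command stays the same.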
How to Obtain:
You can download or view any file in the git repository here: http://github.com/eljimbo/NextGenScripts