DETools is a set of scripts for extracting a small subset of data from a larger dataset. Version 1 of these scripts was originally written in Scilab, the open-source platform for numerical computation. Later, they were modified and rewritten in C++ to make them faster and more user-friendly. The improved version 2 of these scripts is now available for download as a Windows Installer here. Individual scripts can be downloaded below.
The SeqEx utility extracts a set of sequences from a larger sequence dataset. It is written specifically for extracting sequences that are no more than 7682 characters in length, and can thus be used with the NAST aligner of the Greengenes database. SeqEx requires two input files: the big sequence file in FASTA format and a sequence list file in plain-text format. The list file contains the sequence identifiers (one ID per line) of the sequences you wish to extract. Make sure that all the IDs in the list file are present in the big dataset.
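To illustrate the kind of extraction SeqEx performs, here is a minimal Python sketch. It is not the DETools code (which is written in C++), and the function names read_fasta and extract_sequences are hypothetical. It assumes that a sequence ID is the first whitespace-delimited token of a FASTA header line and that the output follows the order of the list file. Neither assumption is confirmed by the description above, and the 7682-character limit is not modelled.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict mapping sequence ID -> sequence string."""
    records = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the ID is the first whitespace-delimited header token.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = []
        elif current_id is not None:
            # Sequence data may span several lines; collect the pieces.
            records[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def extract_sequences(fasta_lines, wanted_ids):
    """Return (ID, sequence) pairs for wanted_ids, in the order of the list."""
    records = read_fasta(fasta_lines)
    missing = [seq_id for seq_id in wanted_ids if seq_id not in records]
    if missing:
        # Every listed ID must be present in the big dataset.
        raise KeyError(f"IDs not found in the big dataset: {missing}")
    return [(seq_id, records[seq_id]) for seq_id in wanted_ids]
```

With real files, fasta_lines would be the open big FASTA file and wanted_ids the stripped lines of the list file.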
The DistEx utility extracts the distance matrix for a set of sequences that form part of a larger distance matrix. The output follows the order of the input sequence list and is independent of the order in which the sequences appear in the bigger matrix. DistEx generates two output files: Sample and Outfile. The Sample file can be used directly as input to LIBSHUFF, as it has the same formatting as the input sample file used by LIBSHUFF. The Outfile has the same formatting as the input distance matrix. DistEx requires two input files: the big distance matrix file in Phylip format and a sequence list file in plain-text format. The list file contains the sequence identifiers (one ID per line) for which you wish to extract the distance matrix. Make sure that all the IDs in the list file are present in the big matrix.
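The sub-matrix extraction described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the DETools implementation: the function names are hypothetical, the input is assumed to be a square Phylip matrix with whitespace-delimited names (not the strict 10-character name field or a lower-triangular layout), and only an Outfile-style result is produced. The LIBSHUFF sample format is not reproduced here.

```python
def read_phylip_square(lines):
    """Parse a square Phylip distance matrix into (names, matrix)."""
    rows = [line.split() for line in lines if line.strip()]
    n = int(rows[0][0])  # first line holds the number of sequences
    names, matrix = [], []
    for fields in rows[1:n + 1]:
        names.append(fields[0])
        matrix.append([float(x) for x in fields[1:n + 1]])
    return names, matrix


def extract_submatrix(names, matrix, wanted_ids):
    """Return the sub-matrix for wanted_ids, ordered as in wanted_ids."""
    index = {name: i for i, name in enumerate(names)}
    missing = [seq_id for seq_id in wanted_ids if seq_id not in index]
    if missing:
        raise KeyError(f"IDs not found in the big matrix: {missing}")
    picks = [index[seq_id] for seq_id in wanted_ids]
    # Rows and columns both follow the list order, not the big-matrix order.
    return [[matrix[r][c] for c in picks] for r in picks]


def write_phylip_square(names, matrix):
    """Format names and matrix as square Phylip text (4 decimal places)."""
    out = [str(len(names))]
    for name, row in zip(names, matrix):
        out.append(name + " " + " ".join(f"{value:.4f}" for value in row))
    return "\n".join(out)
```

Looking up each ID in a name-to-index dictionary is what makes the output order depend only on the list file.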
The CutOff utility works on either a similarity matrix or a distance matrix. The similarity matrix can be prepared from the Phylip distance matrix using the Jukes-Cantor algorithm. The tool filters out the pairs of sequence identifiers whose values are lower than the given cutoff. Hence, with a similarity matrix as input, the output file contains the most similar sequence pairs; with a distance matrix as input, it contains the most dissimilar pairs. The output is generated in CSV format.
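The filtering step can be sketched in Python as shown below. This is a hedged illustration, not the actual CutOff code: the function names and the CSV column names (id1, id2, value) are hypothetical, values equal to the cutoff are assumed to be kept, and each pair is assumed to be reported once. The Jukes-Cantor conversion is not included.

```python
import csv
import io


def cutoff_pairs(names, matrix, cutoff):
    """Return (id1, id2, value) for each pair whose value is >= cutoff."""
    pairs = []
    for i in range(len(names)):
        # Visit only the upper triangle so each pair is reported once.
        for j in range(i + 1, len(names)):
            if matrix[i][j] >= cutoff:
                pairs.append((names[i], names[j], matrix[i][j]))
    return pairs


def pairs_to_csv(pairs):
    """Serialize the pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id1", "id2", "value"])  # hypothetical column names
    writer.writerows(pairs)
    return buffer.getvalue()
```

The same function serves both cases: with similarities the surviving pairs are the most similar ones, and with distances they are the most dissimilar ones.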