The overall goal of this article is to reconstruct a reliable phylogenetic tree from DNA or protein sequences. This is accomplished by first identifying similar sequences using the BLAST programs at NCBI. The second step is to align the similar sequences.
Next, the best-fit model of evolution is determined from the alignment. The final step is to infer the phylogenetic relationship from the aligned sequences. Ultimately, the step-by-step pipeline is used to show how users can go from sequence data to reliable phylogenies.
This method can help answer key questions in diverse fields by assigning identity and function to novel sequences. To use the online version of the Basic Local Alignment Search Tool, or BLAST, navigate to the National Center for Biotechnology Information, or NCBI, BLAST web server. Click on the appropriate BLAST program.
Input a FASTA-formatted text sequence, such as the one shown here, into the query box. Select the appropriate BLAST program for use in the search, and then click BLAST. The output is in HTML format by default and displays the sequences most similar to the input text sequence.
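For viewers who want to check a FASTA-formatted query before pasting it into the query box, the following is a minimal Python sketch of a FASTA reader. The function name parse_fasta and the two example records are hypothetical and the residues are made up for illustration; it is not part of BLAST itself.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} from FASTA-formatted lines.

    A header line starts with '>'; every following line up to the next
    header is sequence data belonging to that header.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            records[header] += line.upper()
    return records


# Hypothetical two-record example (made-up residues).
example = """>test_protein hypothetical example
MKTAYIAKQR
QISFVKSHFS
>second_protein
MSTNPKPQRK
"""
records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq))
```

A reader like this only confirms that each record has a header and joins wrapped sequence lines; it does not validate the residue alphabet.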
The next section covers using a local BLAST executable on Windows. Mac users can skip to the following section, called BLAST Local Executables for Mac OS X. To run the BLAST command line program on a Windows machine, download the appropriate Windows executable from the NCBI BLAST website.
After installing the BLAST program, configure the PC environment variable as follows: click the PC's Start button and right-click Computer. Then click Properties. In the new window, select Advanced System Settings, and in the Advanced tab of the new pop-up, click the Environment Variables button.
Then, under the User Variables section, click the New button. In the new pop-up, add the variable name path and the variable value shown here. Next, download a pre-formatted BLAST database, which is updated daily, from the NCBI website, or download a genome for a particular organism.
Then open an MS-DOS prompt by clicking Start and typing cmd in the search bar, and change to the NCBI BLAST folder. Create the database using the makeblastdb command shown here. Create a query protein sequence called test by inserting a FASTA-formatted protein text sequence into the db folder.
Then, to identify the sequences most similar to the test protein, interrogate the database via a blastp query command. The following section repeats this information for Mac users. Windows users can skip to section five, Generating Multiple Sequence Alignments.
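Since the exact commands appear only on screen, here is a hedged Python sketch of how the makeblastdb and blastp calls might be assembled. The file names proteome.fasta, mydb, test.fasta, and results.tsv are hypothetical placeholders, and the E value cutoff and tabular output format are illustrative choices, not settings taken from the video.

```python
import shutil
import subprocess
from typing import List


def makeblastdb_cmd(fasta_path: str, db_name: str, dbtype: str = "prot") -> List[str]:
    """Build the command that formats a FASTA file as a BLAST database."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


def blastp_cmd(query_path: str, db_name: str, out_path: str,
               evalue: float = 1e-5) -> List[str]:
    """Build a protein-versus-protein search command with tabular output."""
    return ["blastp", "-query", query_path, "-db", db_name,
            "-out", out_path, "-evalue", str(evalue), "-outfmt", "6"]


def run(cmd: List[str]) -> None:
    """Run a command, failing early if the executable is not on the PATH."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(cmd[0] + " was not found on the PATH")
    subprocess.run(cmd, check=True)


# Hypothetical file names; replace them with your own files.
db_command = makeblastdb_cmd("proteome.fasta", "mydb")
search_command = blastp_cmd("test.fasta", "mydb", "results.tsv")
print(" ".join(db_command))
print(" ".join(search_command))
# To actually execute them: run(db_command); run(search_command)
```

Building the commands as lists keeps file names with spaces intact and makes the PATH check explicit, which is the step the environment variable configuration is meant to satisfy.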
To run the BLAST command line program on a Mac, download the appropriate Mac executable by remotely accessing the NCBI FTP site. To do this, open Finder and search for Terminal. In the Terminal window, type the FTP address for the NCBI FTP site. Type anonymous for name and password, and then type cd blast/executables/LATEST.
List the executables by typing ls, and download the latest version that matches your system requirements by typing the following. Now, decompress the downloaded files. Now add the location of the binaries for the BLAST executable to your path so that the shell can search through this directory when looking for commands.
Download a pre-formatted BLAST database or a genome from the NCBI website. Search the genomes directory by typing cd genomes. Then download the genome or sequence of interest as follows, and then type quit to exit the FTP site.
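The FTP session above can also be scripted. The following is a rough sketch using Python's standard ftplib module, assuming the host name ftp.ncbi.nlm.nih.gov and the directory blast/executables/LATEST; both are assumptions to verify against the NCBI site. The helper pick_archive and the platform tag passed to it are hypothetical, since the real archive names should be read from the directory listing.

```python
from ftplib import FTP
from typing import Iterable, List, Optional

NCBI_FTP_HOST = "ftp.ncbi.nlm.nih.gov"   # assumed host name
BLAST_DIR = "blast/executables/LATEST"   # assumed directory, as narrated above


def list_blast_archives(host: str = NCBI_FTP_HOST,
                        directory: str = BLAST_DIR) -> List[str]:
    """Log in anonymously and return the file names in the directory."""
    with FTP(host) as ftp:
        ftp.login()          # no arguments means anonymous login
        ftp.cwd(directory)
        return ftp.nlst()


def pick_archive(names: Iterable[str], platform_tag: str,
                 suffix: str = ".tar.gz") -> Optional[str]:
    """Return the archive whose name contains platform_tag, or None."""
    matches = sorted(n for n in names
                     if platform_tag in n and n.endswith(suffix))
    return matches[-1] if matches else None


def download(name: str, host: str = NCBI_FTP_HOST,
             directory: str = BLAST_DIR) -> None:
    """Download one file from the directory into the working folder."""
    with FTP(host) as ftp, open(name, "wb") as handle:
        ftp.login()
        ftp.cwd(directory)
        ftp.retrbinary("RETR " + name, handle.write)


# Typical use (requires network access, so it is left commented out):
# names = list_blast_archives()
# archive = pick_archive(names, "macosx")
# if archive is not None:
#     download(archive)
```

The listing and download functions mirror the ls and get steps of the interactive session, and the connection closes automatically in place of typing quit.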
Next, make the database by typing the makeblastdb instruction. Insert a FASTA-formatted query sequence into the bin folder and interrogate the database with the blastp query command to find the sequences most similar to the test sequence. Among the commonly used multiple sequence alignment, or MSA, programs is T-Coffee. After inputting FASTA-formatted sequence data into the query box at the T-Coffee site, the output indicates similar residues by color coding.
Another commonly used MSA program is Clustal, which can be downloaded as a command line version, Clustal W, or a graphical version, Clustal X, for various operating systems. Next, load the data into the Clustal program as FASTA-formatted sequence text by selecting the File tab and then clicking the Load Sequences button.
Now switch to the Align tab and click the Do Complete Alignment button to align the sequences. For a best-fit model of evolution, download the ProtTest program. Once ProtTest is downloaded, double-click on ProtTest.
Once ProtTest is launched, click on Select File in the alignment box to load the sequence data. Then click Start to run the program. After completing the run, the program indicates the best model based on the selection criteria.
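To give a sense of what such selection criteria look like, here is a small Python sketch of the Akaike and Bayesian information criteria, two criteria commonly used for model selection. The log-likelihoods, parameter counts, and number of sites below are invented for illustration and are not output from ProtTest.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical results: model name -> (log-likelihood, free parameters).
candidates = {
    "JTT": (-5230.4, 11),
    "WAG": (-5198.7, 11),
    "LG+G": (-5120.3, 12),
}
n_sites = 250  # hypothetical alignment length

best_by_aic = min(candidates, key=lambda m: aic(*candidates[m]))
best_by_bic = min(candidates, key=lambda m: bic(*candidates[m], n_sites))
print("Best by AIC:", best_by_aic)
print("Best by BIC:", best_by_bic)
```

Both criteria reward a higher likelihood and penalize extra parameters, so a more complex model wins only when its fit improves enough to offset the penalty.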
After downloading and launching PhyML, load the input sequence as a PHYLIP-formatted file by typing the file name ending in .phy. Then launch the program by typing Y. After downloading a Bayesian inference program from the MrBayes website, start the program by clicking on the executable file. Then read NEXUS-formatted sequence data into the program by typing execute followed by the file name ending in .nex.
Next, set the evolutionary model and select the number of generations to run. After running the analysis with the mcmc command, summarize the trees using the sumt command. To view a phylogenetic tree, download the TreeView program.
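The MrBayes steps described above (execute, set the model, mcmc, sumt) can also be collected in a batch file instead of being typed one at a time. The sketch below writes such a block from Python. The file name alignment.nex, the fixed WAG amino acid model, and the generation count are placeholders chosen for illustration, not values from the video; the model should be the one selected earlier.

```python
def mrbayes_batch(nexus_file: str, ngen: int = 100000,
                  samplefreq: int = 100, aamodel: str = "wag") -> str:
    """Return the text of a NEXUS batch file containing a MrBayes block."""
    lines = [
        "#NEXUS",
        "begin mrbayes;",
        "  set autoclose=yes;",                     # do not prompt to continue the run
        f"  execute {nexus_file};",                 # read the aligned sequences
        f"  prset aamodelpr=fixed({aamodel});",     # fix the amino acid model
        f"  mcmc ngen={ngen} samplefreq={samplefreq};",  # run the chains
        "  sumt;",                                  # summarize the sampled trees
        "end;",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical input file name.
batch_text = mrbayes_batch("alignment.nex")
print(batch_text)
# To save it: open("run_mrbayes.nex", "w").write(batch_text)
```

Keeping the commands in a file makes the analysis repeatable, which is useful when comparing runs with different models or generation counts.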
As a final note, there are constant releases of new software aimed at providing better alignments, similarity predictions, or phylogenetic trees. While the overview in this video covered popular programs, the viewer is encouraged to explore additional options. The BLAST algorithm performs local alignments, which search for short stretches of sequence similarity.
After the algorithm has looked up all possible stretches from the query sequence and maximally extended these stretches, it then assembles alignments. For each query-sequence pair, the E value gives an indication of the statistical significance of a match. The lower the E value, the more significant the hit.
For example, a sequence alignment with an E value of 0.05 means that a match this good is expected to occur by chance alone about five times in 100 searches. The bit score, which is derived from the raw alignment score under a specific scoring matrix, provides an indication of how good the alignment is. The higher the bit score, the better the alignment.
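The relationship between the bit score and the E value can be sketched with the standard Karlin-Altschul approximation, E = m × n × 2^(−S'), where S' is the bit score, m is the query length, and n is the database length. The Python sketch below uses raw lengths for simplicity, which is an assumption; BLAST itself uses adjusted effective lengths, so its reported values will differ somewhat. The lengths in the example are hypothetical.

```python
import math


def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E value: chance hits expected at this bit score or better."""
    return query_len * db_len * 2.0 ** (-bit_score)


def prob_at_least_one(e_value: float) -> float:
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    return 1.0 - math.exp(-e_value)


# Hypothetical search: a 250-residue query against 100 million database residues.
e_at_40_bits = expected_hits(40, 250, 10**8)
e_at_41_bits = expected_hits(41, 250, 10**8)
p_for_e_005 = prob_at_least_one(0.05)
print(e_at_40_bits, e_at_41_bits, p_for_e_005)
```

Each additional bit halves the expected number of chance hits, and for small E values the probability of a chance hit is nearly equal to the E value itself, which is why an E value of 0.05 is often read as roughly five in 100.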
A multiple sequence alignment, or MSA, is a sequence alignment of three or more primary sequences composed of amino acids, DNA, or RNA. The output from the MSA program T-Coffee, seen here, color-codes similar residues. A sample alignment of six protein sequences aligned using Clustal X is shown here for amino acid alignments.
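The idea behind highlighting similar residues can be illustrated with a simple per-column conservation score. The Python sketch below is a simplified stand-in, not the scoring scheme used by T-Coffee or Clustal X, and the three aligned sequences are made up for illustration.

```python
from collections import Counter
from typing import List, Sequence


def column_conservation(aligned: Sequence[str]) -> List[float]:
    """For each column, return the fraction of sequences sharing the most
    common residue. Gap characters ('-') never count as a residue."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)  # column made up entirely of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Hypothetical three-sequence alignment.
alignment = ["MKT-AYI", "MKTQAYL", "MRT-AYI"]
scores = column_conservation(alignment)
print([round(s, 2) for s in scores])
```

Columns scoring 1.0 are identical in every sequence, while lower scores flag substitutions or gaps, which is the kind of pattern the color coding makes visible at a glance.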
The program ProtTest is used to select the best-fit model of amino acid replacement within the data. The program lists the models as they are being analyzed and displays the best fit after completion. The program PhyML estimates maximum likelihood phylogenies from alignments of nucleotide or amino acid sequences. It incorporates a large number of substitution models coupled to various options to search tree topology space.
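Because PhyML reads PHYLIP-formatted alignments, as noted earlier, it can help to convert an alignment held in memory to that format. The Python sketch below writes a relaxed sequential PHYLIP layout: a header line with the number of sequences and sites, followed by one line per sequence with the name, whitespace, and the aligned residues. The sequence names and residues are hypothetical, and strict PHYLIP readers may require names padded to exactly ten characters.

```python
from typing import Dict


def to_phylip(records: Dict[str, str]) -> str:
    """Convert {name: aligned_sequence} to relaxed sequential PHYLIP text."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    n_sites = lengths.pop()
    width = max(len(name) for name in records) + 2  # name column plus padding
    lines = [f"{len(records)} {n_sites}"]
    for name, seq in records.items():
        if any(ch.isspace() for ch in name):
            raise ValueError("sequence names must not contain whitespace")
        lines.append(name.ljust(width) + seq)
    return "\n".join(lines) + "\n"


# Hypothetical two-sequence alignment.
phylip_text = to_phylip({"seqA": "MKT-AYI", "seqB": "MKTQAYL"})
print(phylip_text)
```

Checking that every sequence has the same length before writing catches the most common cause of a rejected input file, which is an unaligned or truncated sequence.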
MrBayes uses Bayesian MCMC inference across a number of evolutionary models to reconstruct phylogenetic relationships. Once the program is running, progress can be viewed at specific intervals, as shown here. Once a phylogenetic tree is generated, the topology needs to be visualized.
In this figure, the TreeView window displays a sample tree of proteins from FlyBase. TreeView includes a tree editor that allows the user to move branches and reroot trees. While attempting this procedure, it's important to remember to read the user guides for each program thoroughly.
This protocol provides a practical starting point to introduce the reader to how these programs work. However, I encourage the reader to experiment with and become familiar with the many settings associated with each program.