Puneet Wadhwa's BIOINFORMATICS BLOG

Thursday, October 27, 2005

Multiple Sequence Alignment of DNA and Proteins - An introduction


Introduction:
In some of the previous articles on BLAST, we went through the basic principles of sequence alignment. In this article, we will look at some of the principles of multiple sequence alignment and also explore some of the common software used for multiple sequence alignment.

Why perform Multiple Sequence Alignment:
First, let us look at why you would want to do multiple sequence alignment at all. Multiple alignment can be used to study evolutionary relationships between related genes and proteins. Since the changes between gene sequences due to evolution are incremental, we can take homologous genes, i.e. genes with a common evolutionary origin, from a diverse range of organisms and compare them by aligning identical or similar residues. The comparison can then show which regions of the genes have been conserved over time and which are prone to mutation. This is very useful for designing experiments to test and modify the function of specific proteins, for predicting the function and structure of proteins, and for identifying new members of protein families.
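To make the idea of "conserved regions" concrete, here is a minimal Python sketch that scores each column of an alignment by how many sequences share the most common residue. The three toy sequences are made up for illustration (they are not real proteins), and counting a gap ("-") as just another character is a simplifying assumption of mine.

```python
from collections import Counter

def column_conservation(aligned):
    """For each alignment column, return the fraction of sequences
    that share the most common character in that column."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*aligned):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores

# Hypothetical, already-aligned toy sequences ('-' marks a gap).
toy_alignment = [
    "MKT-LLV",
    "MKTALLV",
    "MRT-LIV",
]
scores = column_conservation(toy_alignment)
# Columns where every sequence has the same residue.
conserved = [i for i, s in enumerate(scores) if s == 1.0]
print(conserved)
```

Columns with a score of 1.0 are fully conserved in this toy set; lower scores mark positions that vary between the sequences.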

Multiple Sequence Alignment programs and techniques:

Progressive strategies for multiple alignment: A common approach to multiple sequence alignment is to build the alignment up progressively from pairwise alignments. First, two sequences are selected and aligned with each other; this alignment is then used as the base against which each subsequent sequence is aligned.

One of the most popular programs for multiple sequence alignment is known as ClustalW. It is a general purpose multiple alignment program for DNA or proteins. It calculates the best match for the selected sequences, and lines them up so that the similarities and differences can be seen. It also generates a cladogram which can be useful for studying the evolutionary relationships between the set of sequences.

You may run the ClustalW program either by downloading and installing it on your local machine, or online at http://www.ebi.ac.uk/clustalw/. To download the software, visit ftp://ftp.ebi.ac.uk/pub/software/unix/clustalw/.

We will look at how to run ClustalW using EBI's online ClustalW server at the above location. To run ClustalW, you need a set of sequences in FASTA format, which is nothing but a header line beginning with ">" followed by the sequence name/description, with the sequence itself on the line(s) after it.
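If you want to check your FASTA input before submitting it, a small parser is enough. The following is a minimal Python sketch, not part of ClustalW; the function name parse_fasta and the two short records are my own, made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            chunks.append(line)  # a sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Hypothetical input, for illustration only.
example = """>seq1 hypothetical first sequence
AGCTATGGGC
AAATTTGG
>seq2 hypothetical second sequence
AGCTATGGAC
"""
records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

Note that the sequence lines of one record are simply joined together, so it does not matter how the sequence is wrapped in the file.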

Let us leave the rest of the parameters at their defaults; if you want, you may enter your email address so that the results can be emailed to you. After ClustalW finishes running, it produces four files: the output file (.output), the alignment file in plain text (.aln), the guide tree file (.dnd), and your input file (.input). ClustalW also shows the alignment in the form of a phylogenetic tree or a cladogram, which can be chosen from the option menu (right-click) of the Java applet.

The difference between a cladogram and a phylogenetic tree is this: a phylogenetic tree is a branching diagram (tree) in which branch lengths are proportional to the amount of inferred evolutionary change. A cladogram is a tree whose branches are all of equal length; thus cladograms show common ancestry, but do not indicate the amount of evolutionary "time".
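One way to see the difference is to look at a tree written as text. The sketch below assumes the tree is in Newick format (the parenthesised notation commonly used for tree files such as the .dnd guide tree) and uses a made-up three-sequence tree. Stripping the branch lengths leaves only the branching order, which is all a cladogram shows.

```python
import re

def strip_branch_lengths(newick):
    """Remove ':<number>' branch lengths from a Newick string,
    keeping only the branching order (topology)."""
    return re.sub(r":[0-9.eE+\-]+", "", newick)

# Hypothetical tree with branch lengths, for illustration only.
with_lengths = "((seq1:0.10,seq2:0.12):0.05,seq3:0.30);"
topology_only = strip_branch_lengths(with_lengths)
print(topology_only)
```

This simple regular expression assumes sequence names contain no colons; a real Newick parser would handle quoting and other details.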

Thursday, October 20, 2005

An introduction to BLAST - Basic Local Alignment Search Tool!

Hey friends:

BLAST is an acronym for Basic Local Alignment Search Tool, and it consists of a set of algorithms for comparing biological sequences such as nucleotide or protein sequences. A nucleotide sequence is nothing but a DNA sequence (or part of one) expressed as a long string of 4 characters: A, T, C and G. They stand for Adenine, Thymine, Cytosine and Guanine. So, every nucleotide sequence consists of only these four characters arranged in different orders.

BLAST allows you to compare your sequence against a database of sequences and informs you if your sequence matches any of the sequences in the database, along with a lot of information like:

  • Percent identity, often loosely called homology (% of characters matched)
  • Alignment length (over what length did the nucleotides match)
  • E-value (Expectation value: the number of different alignments with scores equivalent to or better than S, the score of your match, that are expected to occur in a database search by chance. The lower the E-value, the more significant the score)
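To get a feel for how the E-value behaves, here is a minimal Python sketch of the Karlin-Altschul estimate E = K * m * n * exp(-lambda * S), where m is the query length, n is the database length and S is the alignment score. The values of K and lambda depend on the scoring system; the ones below are placeholders I picked for illustration, not values reported by any BLAST run.

```python
import math

def expected_hits(score, query_len, db_len, k, lam):
    """Karlin-Altschul estimate: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)

# Illustrative placeholder parameters (assumptions, not BLAST output).
K, LAM = 0.7, 1.4

weak = expected_hits(score=15, query_len=100, db_len=1_000_000, k=K, lam=LAM)
strong = expected_hits(score=30, query_len=100, db_len=1_000_000, k=K, lam=LAM)
bigger_db = expected_hits(score=30, query_len=100, db_len=2_000_000, k=K, lam=LAM)

print(f"score 15: E = {weak:.3g}")
print(f"score 30: E = {strong:.3g}")
print(f"score 30, database twice as large: E = {bigger_db:.3g}")
```

The two things to notice are that a higher score gives a much smaller E-value, and that the same score is less significant when the database is larger.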

For a complete BLAST glossary you may visit http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html

So, now that you know BLAST can be used to align two sequences and to study the similarity between two or more sequences, let us look into the principles of sequence alignment briefly.

Sequence alignment refers to arranging two sequences in an order such that their similar portions are highlighted.

For example:

AGCTATGGGCAAATTTGGAACAAACCAAAAAGT
........ ........ ...............
AGCTATGGACAAATTTGCAACAAACCAAAAAGT

Positions where the two sequences do not match are shown by blanks in the middle line of the alignment; matching positions are marked with dots.
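The middle line of the example above can be reproduced with a few lines of Python. This is a minimal sketch for two sequences of equal length that are already lined up; the function names are my own.

```python
def match_line(seq_a, seq_b):
    """Return a string with '.' where the sequences agree and ' ' where they differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return "".join("." if a == b else " " for a, b in zip(seq_a, seq_b))

def percent_identity(seq_a, seq_b):
    """Percentage of positions at which the two sequences are identical."""
    line = match_line(seq_a, seq_b)
    return 100.0 * line.count(".") / len(line)

top = "AGCTATGGGCAAATTTGGAACAAACCAAAAAGT"
bottom = "AGCTATGGACAAATTTGCAACAAACCAAAAAGT"

print(top)
print(match_line(top, bottom))
print(bottom)
print(round(percent_identity(top, bottom), 1))
```

For these two sequences, 31 of the 33 positions match, so the percent identity comes out at about 93.9.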

Global Alignment: It refers to the alignment in which all the characters in both sequences participate in the alignment.

Local Alignment: It refers to finding closely matching regions between sequences. In local alignment the beginning part (say nucleotides 0-100) of one sequence may align with the ending part of another sequence (say 400-500).
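The difference between the two shows up clearly in code. Below is a minimal Python sketch of the classic dynamic-programming scores: Needleman-Wunsch for global alignment and Smith-Waterman for local alignment. This is not how BLAST itself works (BLAST uses a faster heuristic), and the scoring values (match +1, mismatch -1, gap -2) and the two toy sequences are assumptions chosen for illustration.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score of a and b.
    local=False: global (Needleman-Wunsch); local=True: local (Smith-Waterman)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment must pay for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            score[i][j] = cell
    return best if local else score[-1][-1]

# Toy sequences: the end of the first matches the beginning of the second.
first = "TTTTACGTACGT"
second = "ACGTACGTCCCC"
print("global:", alignment_score(first, second))
print("local: ", alignment_score(first, second, local=True))
```

With these toy sequences the local score is 8 (the shared ACGTACGT stretch), while the global score is only 0, because the global alignment has to account for every character in both sequences.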

----------------------------------------------------------------------------------------
Puneet Wadhwa

www.puneetwadhwa.com

Sample fasta file format

>lcl|BOB1|ONE
CATGGATTCAGCAGCAGCGAACTCGCCAATGTAGTGGGTGGCACAGCCAG
GGTCTTGACTCTGGCTCTGCAGTAGCACAGTCTGGAAAAGCTCTGAGGGG
AGAGAGACCCCCACTGGTCCGAGGGTCTGGCACAGAGCCAGAAATGGGGG
GGAAGGTATGAGGCTGGGTCGCCTCTGACCTCTCAGGTACCATCCAGGAG
GCCCTGGCCTCTCACTGAACCCGGCCACTCCTCTTTGGCATGGCCTCTTC

>lcl|BOB2|TWO
CCTGGAAGCTCTTGGGGGGCATATCTGGTGGGGAGAAAGCAGGGGTTGGG
GAGGCCGAAGAAGGTCAGGCCCTCAGCTGCCTTCATCAGTTCCCACCCTC
CAGCCCCCAACTCCTCCTGCAGACAAGCTGGTGTCTAAGAACTACCCGGA
CCTGTCCTTGGGAGACTACTCCCTGCTCTGGAAAGCCCACAAGAAGCTCA
CCCGCTCAGCCCTGCTGCTGGGCATCCGTGACTCCATGGAGCCAGTGGTG
GAGCAGCTGACCCAGGAGTTCTGTGAGCGCATGAGAGCCCAGCCCGGCAC

>lcl|BOB3|THREE
CCTGGAAGCTCTTGGGGGGCATATCTGGTGGGGAGAAAGCAGGGGTTGGG
GAGGCCGAAGAAGGTCAGGCCCTCAGCTGCCTTCATCAGTTCCCACCCTC
CAGCCCCCAACTCCTCCTGCAGACAAGCTGGTGTCTAAGAACTACCCGGA
CCTGTCCTTGGGAGACTACTCCCTGCTCTGGAAAGCCCACAAGAAGCTCA
CCCGCTCAGCCCTGCTGCTGGGCATCCGTGACTCCATGGAGCCAGTGGTG
GAGCAGCTGACCCAGGAGTTCTGTGAGCGCATGAGAGCCCAGCCCGGCAC

Creating and using custom BLAST sequence databases

Hi again!

This post is based on my recent experience dealing with 'blasting' custom nucleotide or protein sequences with NCBI's blast tool.

If you are a beginner in Bioinformatics and would like to know more about BLAST, please read my earlier post, "An Introduction to BLAST", before you delve into deeper and juicier topics such as creating custom BLAST databases and comparing your sequences against them. So, hang on tight and you will be blasting in no time!

First things first: you need to download and install BLAST on your computer or a server. BLAST can be downloaded at http://www.ncbi.nlm.nih.gov/blast/download.shtml. Follow the instructions on the NCBI website to install the BLAST tool after downloading it.

To create a custom BLAST database, you need a simple FASTA file consisting of a header in a particular format (discussed below), followed by the nucleotide or protein sequence.
The header of a FASTA record begins with a ">" character, followed by the header information. The FASTA file of custom sequences is then converted into a BLAST database by a tool called formatdb (which ships with NCBI's blast package, downloadable at http://www.ncbi.nlm.nih.gov/Ftp/).

The formatdb command has several options; the formatdb README documents all of them.

We are going to look at the most common formatdb command options, and the most common header formats for formatting custom databases.

COMMAND FOR FORMATDB:
formatdb -i input_db -p F -o T     (for nucleotide)
formatdb -i input_db -p T -o T     (for protein)

-i option is used to specify the name of the input fasta file
-p option is used to specify type of file (T - protein, F - nucleotide [T/F]; default = T)

Note on -o option from the FORMATDB README: "It is always advantageous to use the '-o' option if the database identifiers are in the format specified at ftp://ftp.ncbi.nih.gov/blast/db. If the database identifiers are in this parseable format, formatdb produces additional indices allowing retrieval from the databases by identifier. The databases on the NCBI FTP site contain parseable identifiers."
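If you build databases from a script, it helps to assemble the command in one place. Here is a minimal Python sketch that builds the two formatdb commands shown above; the helper name formatdb_command is my own, and the sketch only prints the commands rather than running them, since running them requires the NCBI blast package to be installed.

```python
import shlex

def formatdb_command(fasta_path, is_protein, parse_ids=True):
    """Build the formatdb argument list for a FASTA file (does not run it)."""
    return [
        "formatdb",
        "-i", fasta_path,                  # input FASTA file
        "-p", "T" if is_protein else "F",  # T = protein, F = nucleotide
        "-o", "T" if parse_ids else "F",   # T = parse identifiers, build indices
    ]

nucleotide_cmd = formatdb_command("input_db", is_protein=False)
protein_cmd = formatdb_command("input_db", is_protein=True)
print(shlex.join(nucleotide_cmd))
print(shlex.join(protein_cmd))

# To actually run one (assumes formatdb is on your PATH):
# import subprocess
# subprocess.run(nucleotide_cmd, check=True)
```

Passing the command as a list (rather than one long string) avoids quoting problems if your file name contains spaces.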

For constructing custom databases with the above command, there are certain rules about the header format.

1.) ID's of type "local" or "general" should be used. This means that the ID's will have the syntax "lcl|IDENTIFIER" (for "local") or "gnl|DATABASE|IDENTIFIER" (for "general"). The tokens DATABASE and IDENTIFIER should be assigned by the user here. The local ID has only one user provided token, the general ID requires two. The fields are separated by vertical bars ("|").
2.) Letters, numbers, underscores ("_"), dashes, and periods may be used. Uppercase and lowercase letters are treated as being distinct. No spaces are allowed in the ID, this indicates the end of the ID.

3.) All ID's should be unique, if the entire ID is examined. As an example consider the following four ID's:
gnl|H.sapiens|seq1
gnl|H.sapiens|seq2
gnl|M.Mus|seq1
lcl|seq1

All of these ID's are considered unique. The first two might be sequences one and two of a collection of Human sequences; the third might be the first sequence in a collection of mouse sequences; the fourth is simply identified as the first sequence.
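Before running formatdb on a large file, you may want to check your headers against these rules. The following is a minimal Python sketch of such a check, not something that ships with BLAST. It assumes the fields are separated by vertical bars as described above, and it only checks the lcl/gnl prefix, the allowed characters and uniqueness; it does not check how many fields each ID type has. The last two headers in the example are deliberately broken ones I made up.

```python
import re

# Letters, numbers, underscores, dashes and periods are allowed in a field.
ALLOWED_FIELD = re.compile(r"[A-Za-z0-9_.\-]+")

def find_id_problems(headers):
    """Return a list of (id, problem) tuples for headers that break the rules."""
    problems = []
    seen = set()
    for header in headers:
        # The ID is everything after '>' up to the first space.
        seq_id = header.lstrip(">").split(" ", 1)[0]
        prefix, *fields = seq_id.split("|")
        if prefix not in ("lcl", "gnl") or not fields:
            problems.append((seq_id, "must start with lcl| or gnl|"))
        elif not all(ALLOWED_FIELD.fullmatch(f) for f in fields):
            problems.append((seq_id, "field is empty or has a disallowed character"))
        elif seq_id in seen:
            problems.append((seq_id, "duplicate ID"))
        seen.add(seq_id)  # IDs are compared case-sensitively
    return problems

headers = [
    ">gnl|H.sapiens|seq1",
    ">gnl|H.sapiens|seq2",
    ">gnl|M.Mus|seq1",
    ">lcl|seq1",
    ">lcl|seq1 a repeated ID",  # hypothetical bad header: duplicate
    ">lcl|bad$id",              # hypothetical bad header: '$' not allowed
]
problems = find_id_problems(headers)
for seq_id, problem in problems:
    print(seq_id, "->", problem)
```

The four IDs from the README pass the check; only the two made-up headers at the end are reported.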

I recommend using either the gnl or lcl option. Some valid header formats are thus:

>lcl|BOB1|ONE
>lcl|29
and so on..

Once you have your FASTA file ready (see example), use the formatdb command as discussed above to create a BLAST database.

Thursday, October 06, 2005

What is Bioinformatics ?

Bioinformatics is an interdisciplinary science that uses techniques from the life sciences and from computer science to solve complex biological problems. According to the dictionary, Bioinformatics is "Information technology as applied to the life sciences, especially the technology used for the collection, storage, and retrieval of genomic data". The term Bioinformatics has recently become a buzzword at several large and small companies, and major research efforts in the field include sequence alignment, gene prediction, genome assembly etc. It draws on the latest advances in computer science, such as data warehousing and data mining, and in the mathematical sciences.

Welcome!

Hi There,

A warm welcome to you and thanks for visiting my blog. I intend to use this space to speak my mind, post articles about Bioinformatics (a field I have recently developed a great liking for), and other information pertaining to Computers and Internet in general, including career opportunities.

Please feel free to post comments to my blog, or to send suggestions or contributions to me at pwadhwa@gmail.com