Creating and using custom BLAST sequence databases
Hi again!
This post is based on my recent experience dealing with 'blasting' custom nucleotide or protein sequences with NCBI's blast tool.
If you are a beginner in bioinformatics and would like to know more about BLAST, please read my earlier post, "An Introduction to BLAST", before you delve into deeper and juicier topics such as creating custom BLAST databases and comparing your sequences against them. So hang on tight, and you will be blasting in no time!
First things first: you need to download and install BLAST on your computer or a server. BLAST can be downloaded at http://www.ncbi.nlm.nih.gov/blast/download.shtml. Follow the instructions on the NCBI website to install it.
To create a custom BLAST database, you need a simple FASTA file in which each record consists of a header in a particular format (discussed below), followed by the nucleotide or protein sequence.
The header of a FASTA record begins with a ">" character, followed by header information. The FASTA file is then converted into a BLAST database by a tool called formatdb (which ships with NCBI's BLAST package, downloadable at http://www.ncbi.nlm.nih.gov/Ftp/).
The formatdb command has several options; the complete list is in the formatdb README.
We are going to look at the most common formatdb command options, and the most common header formats for formatting custom databases.
COMMAND FOR FORMATDB:
formatdb -i input_db -p F -o T    (for nucleotide)
formatdb -i input_db -p T -o T    (for protein)
-i option is used to specify the name of the input fasta file
-p option is used to specify type of file (T - protein, F - nucleotide [T/F]; default = T)
-o option is used to parse the sequence IDs and build an index on them (T - parse, F - do not parse [T/F]; default = F)
Note on the -o option from the formatdb README: "It is always advantageous to use the '-o' option if the database identifiers are in the format specified at ftp://ftp.ncbi.nih.gov/blast/db. If the database identifiers are in this parseable format, formatdb produces additional indices allowing retrieval from the databases by identifier. The databases on the NCBI FTP site contain parseable identifiers."
For constructing custom databases with the above command, there are certain rules about the header format.
1.) IDs of type "local" or "general" should be used. This means that the IDs will have the syntax "lcl|IDENTIFIER" (for "local") or "gnl|DATABASE|IDENTIFIER" (for "general"). The tokens DATABASE and IDENTIFIER should be assigned by the user here. The local ID has only one user-provided token; the general ID requires two. The fields are separated by vertical bars ("|").
2.) Letters, numbers, underscores ("_"), dashes, and periods may be used. Uppercase and lowercase letters are treated as being distinct. No spaces are allowed in the ID; a space indicates the end of the ID.
3.) All IDs should be unique, if the entire ID is examined. As an example, consider the following four IDs:
gnl|H.sapiens|seq1
gnl|H.sapiens|seq2
gnl|M.Mus|seq1
lcl|seq1
All of these IDs are considered unique. The first two might be sequences one and two of a collection of human sequences; the third might be the first sequence in a collection of mouse sequences; the fourth is simply identified as the first sequence.
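If you want to check a FASTA file against these rules before formatting it, something like the following shell sketch may help. It is only my strict reading of the three rules above, not anything formatdb itself runs; the file name example_ids.fasta, the sequences, and the regular expression are made up for illustration.

```shell
#!/bin/sh
# Hypothetical example file; the four IDs are the ones listed in the rules above,
# the sequences and descriptions are invented.
cat > example_ids.fasta <<'EOF'
>gnl|H.sapiens|seq1 first human sequence
ACGTACGTAC
>gnl|H.sapiens|seq2 second human sequence
GGCCTTAAGG
>gnl|M.Mus|seq1 first mouse sequence
TTGACCATGA
>lcl|seq1 a local sequence
CATGCATGCA
EOF

# Extract the ID of every header: drop the ">" and everything from the
# first whitespace onward (a space ends the ID).
ids=$(grep '^>' example_ids.fasta | sed -e 's/^>//' -e 's/[[:space:]].*$//')

# Rules 1 and 2 (assumed pattern): lcl|IDENTIFIER or gnl|DATABASE|IDENTIFIER,
# with tokens built only from letters, digits, underscores, dashes and periods.
pattern='^(lcl\|[A-Za-z0-9_.-]+|gnl\|[A-Za-z0-9_.-]+\|[A-Za-z0-9_.-]+)$'
bad_ids=$(printf '%s\n' "$ids" | grep -Ev "$pattern" || true)

# Rule 3: the complete ID must not occur more than once.
dup_ids=$(printf '%s\n' "$ids" | sort | uniq -d)

if [ -z "$bad_ids" ] && [ -z "$dup_ids" ]; then
    echo "all headers look parseable"
else
    echo "malformed IDs: $bad_ids"
    echo "duplicate IDs: $dup_ids"
fi
```

The check is case-sensitive on purpose, since uppercase and lowercase letters count as distinct in an ID.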
I recommend using either the gnl or the lcl format. Some valid header formats are thus:
>lcl|BOB1|ONE
>lcl|29
and so on.
Once your FASTA file is ready, use the formatdb command as discussed above to create a BLAST database.
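Putting the pieces together, here is a minimal sketch of the whole workflow for a nucleotide database. The file name my_nucl_db.fasta and the two sequences are made up for illustration. Because newer BLAST+ packages ship makeblastdb rather than formatdb (see the comments below), the sketch falls back to the BLAST+ command when formatdb is not on your PATH.

```shell
#!/bin/sh
# Hypothetical file name and sequences, for illustration only.
cat > my_nucl_db.fasta <<'EOF'
>lcl|seq1 example nucleotide sequence one
ACGTACGTACGTACGTACGT
>lcl|seq2 example nucleotide sequence two
TTGACCATGATTGACCATGA
EOF

if command -v formatdb >/dev/null 2>&1; then
    # Legacy BLAST: -p F marks the input as nucleotide, -o T parses the IDs.
    formatdb -i my_nucl_db.fasta -p F -o T
elif command -v makeblastdb >/dev/null 2>&1; then
    # BLAST+ counterpart of the formatdb command above.
    makeblastdb -in my_nucl_db.fasta -dbtype nucl -parse_seqids
else
    echo "neither formatdb nor makeblastdb was found on PATH" >&2
fi
```

For a protein FASTA file, the same sketch should work with -p T in the formatdb call, or -dbtype prot in the makeblastdb call.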
9 Comments:
Hello Puneet,
It's nice to see a blog on bioinformatics. I just recently started working in the field. I have to upload a custom database from RDP to BLAST to analyze 16S rRNA pyrosequencing reads for my project. I tried to follow all the help files available for uploading custom data, but I haven't been successful. I have downloaded the exe files from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ and my database from http://rdp.cme.msu.edu/misc/resources.jsp. Following the NCBI help, I am trying to run commands in the cmd prompt, but it does not recognize formatdb, so I cannot format the FASTA database from the RDP website.
Hope you will be able to help me out. Looking forward to your response.
Pallavi
By Unknown, at 11:15 AM
Thank you very much for this helpful information!
By Jony Sheynin, at 8:05 PM
Hello Puneet,
I'm currently working on bioinformatics topics, specifically, BLAST. I was wondering, when would we require the use of customized databases?
Regards,
KL
By kltoworld, at 1:19 PM
Hello,
I'm trying to blast some FASTA files. I tried to download the exe files, but there is no formatdb command or any other similar command. I see that other people had this problem.
Can anybody help?
Best,
Nat
By Unknown, at 2:35 AM
Hello pallavi,
I'm also a beginner in this field. I want to blast EST sequences against the GenBank database, so I can cross-reference the EST accession numbers I have with accession numbers of known transcripts. I need to use the blastn and tblastx programs against the nr database.
I also tried to download it but didn't succeed.
I have also downloaded the exe files from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
But when I try to run commands in the cmd prompt, it does not recognize formatdb.
As my problem is pretty similar to yours, maybe we can help each other solve it.
Thanks in regard,
Nataly
By Unknown, at 2:43 AM
The newest versions of BLAST use 'makeblastdb' instead of 'formatdb'.
By Anonymous, at 5:49 PM
Check also the custom blast database server tool. It makes everything easy (uses the newer Blast+ commands).
By Anonymous, at 8:54 AM
I successfully prepared the whole setup to make a BLAST database using the makeblastdb command. The command also runs successfully, but no output files are generated.
Please help if you have any idea.
By IOIB, at 1:24 PM
I want to thank you for this informative post. I really appreciate sharing this great post. Keep up your work. Thanks for sharing this great article. Great information thanks a lot for the detailed article.
database bioinformatics
By Jhon mac, at 8:51 AM