<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-17537533</id><updated>2012-01-30T08:54:59.039-06:00</updated><title type='text'>Puneet Wadhwa's BIOINFORMATICS BLOG</title><subtitle type='html'>Hi There,
A warm welcome to you and thanks for visiting my blog!&lt;br&gt;
I am presently working at Open Biosystems, a Huntsville (AL) based Biotech company as a Bioinformatics Curator. Before this, I got my MS in CS from Univ. of Alabama - Huntsville, in Dec 2004.
Feel free to browse, and read interesting articles about Bioinformatics. Please also visit my websites at www.puneetwadhwa.com and www.betteresolutions.com
&lt;hr&gt;</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://puneetwadhwa.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://puneetwadhwa.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Puneet Wadhwa</name><uri>http://www.blogger.com/profile/03531176644920939774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='22' src='http://www.puneetwadhwa.com/puneet.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>16</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-17537533.post-114322004796116588</id><published>2006-03-24T11:07:00.000-06:00</published><updated>2006-03-27T17:42:18.783-06:00</updated><title type='text'>Extracting ORF (CDS portion) from Sequence using Bioperl</title><content type='html'>Hey Readers:&lt;br /&gt;&lt;br /&gt;I have been recently wrestling with creation of an ORF Database from Refseq Human Sequences, by extracting out the CDS portion of the sequence from the Genbank record. I needed this database for my project since we were analyzing our proprietary Incyte Gene Collection (IGC) for exact matches to Refseq sequences over the ORF (CDS) region.&lt;br /&gt;&lt;br /&gt;Building this database was not easy, and I hope others could learn from this experience of mine...&lt;br /&gt;&lt;br /&gt;I first downloaded the RAW Genbank files (.gbff) and then extracted them into a folder. I then searched these files using unix commands like grep etc. to find out which files contained Human species. 
Next, I wrote a GBFF parser in C#.NET that used regular expressions to extract features from each record, such as gene name, definition, locus, CDS, and species. I then grabbed the sequence from the Genbank record using my parser and cut out the CDS part, from the CDS begin location to the CDS end location on the sequence. This was NOT very easy and certainly not the best way.&lt;br /&gt;&lt;br /&gt;I now have a surefire way of extracting ORFs from the Genbank records. It is elegant, needs very little code, and is written in Bioperl. The code is attached below. However, if you use or distribute this code, please leave its header and author information intact, and provide a link back to my website.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;#!/usr/bin/perl&lt;br /&gt;&lt;br /&gt;#===========================================&lt;br /&gt;# ORF DATABASE BUILDER UTILITY - Puneet Wadhwa #&lt;br /&gt;# http://puneetwadhwa.blogspot.com * pwadhwa@gmail.com #&lt;br /&gt;#-------------------------------------------&lt;br /&gt;# The purpose of this utility is to fetch the Genbank #&lt;br /&gt;# record live from the Genbank website, and strip out #&lt;br /&gt;# and display the CDS sequence from the record. #&lt;br /&gt;# Takes a list of NCBI Accessions as input and outputs a #&lt;br /&gt;# fasta record. #&lt;br /&gt;# USAGE: cat input_file | ./GenerateOrfDatabase.pl #&lt;br /&gt;# Input file contains NCBI accessions (one per line) for #&lt;br /&gt;# which the CDS is to be extracted. 
#&lt;br /&gt;#===============================================&lt;br /&gt;&lt;br /&gt;use strict;&lt;br /&gt;use warnings;&lt;br /&gt;&lt;br /&gt;use Bio::DB::GenBank;&lt;br /&gt;use Bio::SeqIO;&lt;br /&gt;use Bio::SeqIO::swiss;&lt;br /&gt;use Bio::Seq;&lt;br /&gt;use Bio::SeqIO::FTHelper;&lt;br /&gt;use Bio::SeqFeature::Generic;&lt;br /&gt;use Bio::Species;&lt;br /&gt;use Bio::AnnotationI;&lt;br /&gt;use Bio::FeatureHolderI;&lt;br /&gt;use Bio::SeqFeatureI;&lt;br /&gt;&lt;br /&gt;my $gb = new Bio::DB::GenBank;&lt;br /&gt;my $seqout = new Bio::SeqIO(-fh =&gt; \*STDOUT, -format =&gt; 'fasta');&lt;br /&gt;&lt;br /&gt;# Read accessions from STDIN, one per line&lt;br /&gt;while (&lt;&gt;)&lt;br /&gt;{&lt;br /&gt;chomp;&lt;br /&gt;my $acc = $_;&lt;br /&gt;next unless length $acc; # skip blank lines&lt;br /&gt;&lt;br /&gt;# Fetch the full Genbank record for this accession&lt;br /&gt;my $seq_obj = $gb-&gt;get_Seq_by_acc($acc);&lt;br /&gt;&lt;br /&gt;foreach my $feat ( $seq_obj-&gt;top_SeqFeatures )&lt;br /&gt;{&lt;br /&gt;if ( $feat-&gt;primary_tag eq 'CDS' )&lt;br /&gt;{&lt;br /&gt;# spliced_seq joins all segments of the CDS location&lt;br /&gt;my $cds_obj = $feat-&gt;spliced_seq;&lt;br /&gt;print "&gt;".$seq_obj-&gt;display_id()."\n".$cds_obj-&gt;seq."\n";&lt;br /&gt;}&lt;br /&gt;}&lt;br /&gt;}&lt;br /&gt;0;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17537533-114322004796116588?l=puneetwadhwa.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://puneetwadhwa.blogspot.com/feeds/114322004796116588/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17537533&amp;postID=114322004796116588' title='16 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/114322004796116588'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/114322004796116588'/><link rel='alternate' type='text/html' href='http://puneetwadhwa.blogspot.com/2006/03/extracting-orf-cds-portion-from.html' title='Extracting ORF (CDS portion) from 
Sequence using Bioperl'/><author><name>Puneet Wadhwa</name><uri>http://www.blogger.com/profile/03531176644920939774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='22' src='http://www.puneetwadhwa.com/puneet.jpg'/></author><thr:total>16</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17537533.post-114244124219176052</id><published>2006-03-15T10:46:00.000-06:00</published><updated>2006-03-15T10:47:24.686-06:00</updated><title type='text'>A thousand clones...</title><content type='html'>Hey Readers:&lt;br /&gt;&lt;br /&gt;The following is the latest article about my project, which appeared on my company website's blog. We have now finished the complete study and analysis of annotating the Incyte Gene Collection and have discovered 1072 novel genes which are not available in any other commercial collections.&lt;br /&gt;&lt;br /&gt;Here is the article:&lt;br /&gt;&lt;br /&gt;&lt;span style="color:#ff0000;"&gt;&lt;strong&gt;A thousand clones&lt;/strong&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;To be exact, 1072. That is the number of human Incyte Gene Collection (IGC) clones that we found to contain an exact match to an entire RefSeq CDS (January 6, 2006 release), but for which there is no exact match to an MGC clone (February 23, 2006 release). At the risk of sounding self-congratulatory: I told you so! (See previous blog.)&lt;br /&gt;&lt;br /&gt;In this collection, which I will refer to as the IGC Non-MGC Set, you will find both novel clones and NOVEL CLONES. In many cases, there are one or more close MGC counterparts that differ from the IGC/RefSeq sequence by only one or two base pairs. Some may well be legitimate single nucleotide polymorphisms (SNPs) that did not have the good fortune to be included in RefSeq. These may or may not be functionally distinct from the IGC clone. 
In other cases, the closest MGC counterpart is a splice variant of the IGC/RefSeq sequence and likely codes for a polypeptide with a distinct function. In still other cases, there is no MGC counterpart within the same UniGene cluster.&lt;br /&gt;&lt;br /&gt;Never mind all that, you might be thinking with some impatience. How many druggable genes are there? I have not yet attempted grouping into gene families, but I did quickly spot some caspases, cytochrome P450s, and an adenylate cyclase. I encourage you to have a look for yourself by downloading the new spreadsheet from our website by again navigating to Genomics &gt; Mammalian Resources &gt; cDNAs &gt; Incyte Gene Collection and clicking on the &lt;a href="http://www.openbiosystems.com/ProductDataFiles.aspx?AliasPath=/Genomics/Mammalian%20Resources/cDNA%20Clones/The%20Incyte%20Collection&amp;amp;CatalogNumber=IHS1380" target="_blank"&gt;data file&lt;/a&gt; icon under the ordering information for IHS1380. You will note that there are actually 1135 line items, because there are sometimes multiple RefSeq accessions corresponding to the same CDS.&lt;br /&gt;&lt;br /&gt;By the way, all of the IGC clones containing an exact match to an entire RefSeq CDS (a total of 4116 clones) can now be found using our online clone query when searching by RefSeq accession, gene symbol, or UniGene cluster. So next time you search on our website, you are even more likely to turn up a useful clone. 
More to come…&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17537533-114244124219176052?l=puneetwadhwa.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://puneetwadhwa.blogspot.com/feeds/114244124219176052/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17537533&amp;postID=114244124219176052' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/114244124219176052'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/114244124219176052'/><link rel='alternate' type='text/html' href='http://puneetwadhwa.blogspot.com/2006/03/thousand-clones.html' title='A thousand clones...'/><author><name>Puneet Wadhwa</name><uri>http://www.blogger.com/profile/03531176644920939774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='22' src='http://www.puneetwadhwa.com/puneet.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17537533.post-114132767149949817</id><published>2006-03-02T13:25:00.000-06:00</published><updated>2006-03-02T13:27:52.073-06:00</updated><title type='text'>Waking a sleeping giant: annotating the Incyte Gene Collection (IGC)</title><content type='html'>Hey Readers:&lt;br /&gt;&lt;br /&gt;Following is an article about my project which appeared on my company website's blog. We have got some great results from mining the IGC collection, and I am very very excited to share them with you.&lt;br /&gt;&lt;br /&gt;Here it goes:&lt;br /&gt;--------------&lt;br /&gt;With over one million cDNAs for human, rat, monkey, and dog, the IGC is a monster collection that, on statistical grounds, is certain to contain some good stuff.  
Consider this: of the more than 400,000 human cDNAs, Incyte has categorized 11,377 as full length and 16,756 as potentially full length. However, while these latter clone sets have been fully sequenced, they were never submitted to GenBank and no annotations (if they ever existed) were passed along to Open Biosystems. Currently, the only way to mine the IGC is by BLASTing query sequences online against our Incyte clone database. The IGC is a potentially valuable resource, but with largely (and frustratingly) unknown content.&lt;br /&gt;&lt;br /&gt;To enable better exploitation of the IGC, we have begun a high throughput annotation program with these goals: (1) to associate human RefSeq accession numbers with IGC clones when this can be done with high confidence and (2) to discover which human RefSeq-associated IGC sequences are not found in the Mammalian Gene Collection (MGC). I’d like to share with you some preliminary results from our pilot study.&lt;br /&gt;&lt;br /&gt;We began by filtering the ~28,000 full-length and potential full-length human cDNAs by size and selecting the set of 367 sequences that are 3 kilobases or longer. These sequences were then BLASTed against every CDS in human RefSeq and filtered for 100% identity. Even at this high stringency there were 118 “hits”, that is, IGC sequences that contain a complete human RefSeq CDS. The 118 coding sequences were then BLASTed against the MGC, yielding 47 hits at 100% identity. Taking into account a few cases in which the same IGC sequence corresponded to two or more RefSeq accessions, there were 64 RefSeq-certified IGC cDNAs not found in the MGC.&lt;br /&gt;&lt;br /&gt;For 25 of the brave new 64, there was no MGC clone within the same UniGene cluster; for the 39 others, one or more MGC clones were found in the same cluster. However, spot-checking of these MGC sequences confirmed that they are either apparent splice variants of the RefSeq CDS or contain single nucleotide substitutions. So far, so good*. 
If you are curious about these preliminary results, a spreadsheet can be downloaded from our website by navigating to &lt;a href="http://www.OpenBiosystems.com"&gt;www.OpenBiosystems.com&lt;/a&gt; &gt; Genomics &gt; Mammalian Resources &gt; cDNAs &gt; Incyte Gene Collection and clicking on the data file icon under the ordering information for IHS1380.&lt;br /&gt;&lt;br /&gt;We have already begun BLASTing the larger set of 28,000 against the RefSeq coding sequences. Our goal is to identify 1000 RefSeq-certified human IGC cDNAs that are not in the MGC. Eventually we hope to dig deeper by identifying IGC cDNAs that are not perfect matches to any RefSeq CDS, but represent plausible splice variants or SNPs. I am also intrigued by the possibility of identifying IGC cDNAs that are completely outside RefSeq—representing entirely novel genes. We shall see!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17537533-114132767149949817?l=puneetwadhwa.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://puneetwadhwa.blogspot.com/feeds/114132767149949817/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17537533&amp;postID=114132767149949817' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/114132767149949817'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/114132767149949817'/><link rel='alternate' type='text/html' href='http://puneetwadhwa.blogspot.com/2006/03/waking-sleeping-giant-annotating.html' title='Waking a sleeping giant: annotating the Incyte Gene Collection (IGC)'/><author><name>Puneet Wadhwa</name><uri>http://www.blogger.com/profile/03531176644920939774</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='22' src='http://www.puneetwadhwa.com/puneet.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17537533.post-113952328471900146</id><published>2006-02-09T16:14:00.000-06:00</published><updated>2006-02-09T16:14:45.676-06:00</updated><title type='text'>What is the HAPMAP?</title><content type='html'>What Is the HapMap?&lt;br /&gt;&lt;br /&gt;This is an article that recently appeared on The International HapMap project, &lt;a href="http://www.hapmap.org/"&gt;http://www.hapmap.org/&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The HapMap is a catalog of common genetic variants that occur in human beings. It describes what these variants are, where they occur in our DNA, and how they are distributed among people within populations and among populations in different parts of the world. The International HapMap Project is not using the information in the HapMap to establish connections between particular genetic variants and diseases. Rather, the Project is designed to provide information that other researchers can use to link genetic variants to the risk for specific illnesses, which will lead to new methods of preventing, diagnosing, and treating disease.&lt;br /&gt;Figure 1: When DNA sequences on a part of chromosome 7 from two random individuals are compared, two single nucleotide polymorphisms (SNPs) occur in about 2,200 nucleotides.&lt;br /&gt;The DNA in our cells contains long chains of four chemical building blocks -- adenine, thymine, cytosine, and guanine, abbreviated A, T, C, and G. More than 6 billion of these chemical bases, strung together in 23 pairs of chromosomes, exist in a human cell. (See &lt;a href="http://www.dnaftb.org/dnaftb/"&gt;http://www.dnaftb.org/dnaftb/&lt;/a&gt; for basic information about genetics.) 
These genetic sequences contain information that influences our physical traits, our likelihood of suffering from disease, and the responses of our bodies to substances that we encounter in the environment.&lt;br /&gt;The genetic sequences of different people are remarkably similar. When the chromosomes of two humans are compared, their DNA sequences can be identical for hundreds of bases. But at about one in every 1,200 bases, on average, the sequences will differ (Figure 1). One person might have an A at that location, while another person has a G, or a person might have extra bases at a given location or a missing segment of DNA. Each distinct "spelling" of a chromosomal region is called an allele, and a collection of alleles in a person's chromosomes is known as a genotype.&lt;br /&gt;&lt;br /&gt;Differences in individual bases are by far the most common type of genetic variation. These genetic differences are known as single nucleotide polymorphisms, or SNPs (pronounced "snips"). By identifying most of the approximately 10 million SNPs estimated to occur commonly in the human genome, the International HapMap Project is identifying the basis for a large fraction of the genetic diversity in the human species.&lt;br /&gt;&lt;br /&gt;For geneticists, SNPs act as markers to locate genes in DNA sequences. Say that a spelling change in a gene increases the risk of suffering from high blood pressure, but researchers do not know where in our chromosomes that gene is located. They could compare the SNPs in people who have high blood pressure with the SNPs of people who do not. If a particular SNP is more common among people with hypertension, that SNP could be used as a pointer to locate and identify the gene involved in the disease.&lt;br /&gt;&lt;br /&gt;However, testing all of the 10 million common SNPs in a person's chromosomes would be extremely expensive. 
The development of the HapMap will enable geneticists to take advantage of how SNPs and other genetic variants are organized on chromosomes. Genetic variants that are near each other tend to be inherited together. For example, all of the people who have an A rather than a G at a particular location in a chromosome can have identical genetic variants at other SNPs in the chromosomal region surrounding the A. These regions of linked variants are known as haplotypes (Figure 2).&lt;br /&gt;&lt;br /&gt;In many parts of our chromosomes, just a handful of haplotypes are found in humans. [&lt;a href="http://www.hapmap.org/originhaplotype.html"&gt;See The Origins of Haplotypes&lt;/a&gt;.] In a given population, 55 percent of people may have one version of a haplotype, 30 percent may have another, 8 percent may have a third, and the rest may have a variety of less common haplotypes. The International HapMap Project is identifying these common haplotypes in four populations from different parts of the world. It also is identifying "tag" SNPs that uniquely identify these haplotypes. By testing an individual's tag SNPs (a process known as genotyping), researchers will be able to identify the collection of haplotypes in a person's DNA. The number of tag SNPs that contain most of the information about the patterns of genetic variation is estimated to be about 300,000 to 600,000, which is far fewer than the 10 million common SNPs.&lt;br /&gt;&lt;br /&gt;Once the information on tag SNPs from the HapMap is available, researchers will be able to use them to locate genes involved in medically important traits. Consider the researcher trying to find genetic variants associated with high blood pressure. Instead of determining the identity of all SNPs in a person's DNA, the researcher would genotype a much smaller number of tag SNPs to determine the collection of haplotypes present in each subject. 
The researcher could focus on specific candidate genes that may be associated with a disease, or even look across the entire genome to find chromosomal regions that may be associated with a disease. If people with high blood pressure tend to share a particular haplotype, variants contributing to the disease might be somewhere within or near that haplotype.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17537533-113952328471900146?l=puneetwadhwa.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://puneetwadhwa.blogspot.com/feeds/113952328471900146/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17537533&amp;postID=113952328471900146' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/113952328471900146'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/113952328471900146'/><link rel='alternate' type='text/html' href='http://puneetwadhwa.blogspot.com/2006/02/what-is-hapmap.html' title='What is the HAPMAP?'/><author><name>Puneet Wadhwa</name><uri>http://www.blogger.com/profile/03531176644920939774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='22' src='http://www.puneetwadhwa.com/puneet.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17537533.post-113840024540955735</id><published>2006-01-27T16:02:00.000-06:00</published><updated>2006-01-27T16:17:40.156-06:00</updated><title type='text'>Sybase ADO.NET (Internal Error 30002) FIX</title><content type='html'>&lt;a href="http://photos1.blogger.com/blogger/5487/699/1600/error.0.jpg"&gt;&lt;/a&gt;&lt;br /&gt;Hey Friends:&lt;br /&gt;&lt;br /&gt;This post is more relevant to those who are using 
Sybase ADO.NET drivers in their .NET applications to connect to Sybase databases. I recently came across a weird Sybase exception with the description "Internal Error 30002" while I was trying to insert records into my Sybase database tables. I could not find any explanation on the Sybase tech support pages, and the search engines did not turn up much helpful information either.&lt;br /&gt;&lt;br /&gt;I opened a tech support case with Sybase, as my company has a support subscription with them, and they advised us to upgrade the .NET PC Client to the latest EBF (ebf13008). This magically solved all these internal error problems.&lt;br /&gt;&lt;br /&gt;I hope this helps those of you who may be facing the same error.&lt;br /&gt;&lt;br /&gt;Thanks,&lt;br /&gt;Puneet Wadhwa&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17537533-113840024540955735?l=puneetwadhwa.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://puneetwadhwa.blogspot.com/feeds/113840024540955735/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17537533&amp;postID=113840024540955735' title='11 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/113840024540955735'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/113840024540955735'/><link rel='alternate' type='text/html' href='http://puneetwadhwa.blogspot.com/2006/01/sybase-adonet-internal-error-30002-fix.html' title='Sybase ADO.NET (Internal Error 30002) FIX'/><author><name>Puneet Wadhwa</name><uri>http://www.blogger.com/profile/03531176644920939774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='22' 
src='http://www.puneetwadhwa.com/puneet.jpg'/></author><thr:total>11</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17537533.post-113261723739769773</id><published>2005-11-21T17:53:00.000-06:00</published><updated>2006-01-27T16:19:44.436-06:00</updated><title type='text'>An introduction to Data Mining</title><content type='html'>Hey Readers,&lt;br /&gt;&lt;br /&gt;This post is about one of the most exciting fields of research in Computer Science, one that is opening up a plethora of opportunities for scientists and researchers, especially in the field of Bioinformatics.&lt;br /&gt;&lt;br /&gt;Data Mining refers to the extraction of "knowledge" from large amounts of data. It is also commonly referred to as "Knowledge Mining" or KDD (short for Knowledge Discovery in Databases). Common applications of Data Mining include using predictive techniques to unearth interesting patterns in large amounts of data; using previous sales and research data to predict future sales; improving the response rates of direct mail campaigns; and using previous sales data to recommend products to returning or new customers.&lt;br /&gt;&lt;br /&gt;Look out for more information on Data Mining coming soon.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17537533-113261723739769773?l=puneetwadhwa.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://puneetwadhwa.blogspot.com/feeds/113261723739769773/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17537533&amp;postID=113261723739769773' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/113261723739769773'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/113261723739769773'/><link rel='alternate' 
type='text/html' href='http://puneetwadhwa.blogspot.com/2005/11/introduction-to-data-mining.html' title='An introduction to Data Mining'/><author><name>Puneet Wadhwa</name><uri>http://www.blogger.com/profile/03531176644920939774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='22' src='http://www.puneetwadhwa.com/puneet.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17537533.post-113104811086149072</id><published>2005-11-03T13:55:00.000-06:00</published><updated>2005-11-03T14:01:50.860-06:00</updated><title type='text'>Effect of BLAST Low complexity filter on BLAST Results</title><content type='html'>This post is about one of my recent experiences with BLAST. I was trying to blast a big fasta file consisting of sequences over 4000 base pairs in length, and had the low complexity filter turned on (it is ON by default).&lt;br /&gt;&lt;br /&gt;Ideally, when you blast a sequence against itself, it should give you a match over the entire length, but in this case, for a sequence of 4200 bp, it only gave me a match over 1500 bp. This perplexed me for some time, and I decided to research the problem using NCBI's "BLAST 2 Sequences" (bl2seq) tool, with the 4200 bp sequence as both inputs. This gave me 3 hits over lengths ranging from 1200 to 1500 bp. 
This gave me a clue to turn the filter "off", and then I found the hit to be over the entire region.&lt;br /&gt;&lt;br /&gt;I hope someone facing the same problem would benefit from this post :)&lt;br /&gt;&lt;br /&gt;Best.&lt;br /&gt;Puneet Wadhwa&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17537533-113104811086149072?l=puneetwadhwa.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://puneetwadhwa.blogspot.com/feeds/113104811086149072/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17537533&amp;postID=113104811086149072' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/113104811086149072'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/113104811086149072'/><link rel='alternate' type='text/html' href='http://puneetwadhwa.blogspot.com/2005/11/effect-of-blast-low-complexity-filter.html' title='Effect of BLAST Low complexity filter on BLAST Results'/><author><name>Puneet Wadhwa</name><uri>http://www.blogger.com/profile/03531176644920939774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='22' src='http://www.puneetwadhwa.com/puneet.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17537533.post-113096525109660017</id><published>2005-11-02T14:43:00.000-06:00</published><updated>2005-11-09T14:45:18.040-06:00</updated><title type='text'>RNA Interference and Gene silencing</title><content type='html'>RNA interference, or RNAi, is a way for cells to regulate which genes would be expressed. 
This amazing phenomenon was first observed in petunias, when a scientist called Rich Jorgensen introduced a pigment-producing gene under the control of a powerful promoter. Instead of the expected deep purple color in the petunia, the result was a mixture of variegated and white petunias.&lt;br /&gt;&lt;br /&gt;RNAi was named the breakthrough of the year in 2002, opening up new potential for disease treatment and for unraveling the mysteries of how human genes function.&lt;br /&gt;&lt;br /&gt;So, why would we want to shut down the expression of some genes?&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Scientists want to be able to shut down a gene so that they can observe the effect on the organism, which gives clues about the function of the gene.&lt;/li&gt;&lt;li&gt;The ability to shut off genes may also lead to new treatments for diseases, by turning down a gene that produces a harmful protein.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;An excellent article about RNA interference recently appeared on pbs.org, and can be found at &lt;a href="http://www.pbs.org/wgbh/nova/sciencenow/3210/02.html"&gt;http://www.pbs.org/wgbh/nova/sciencenow/3210/02.html&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17537533-113096525109660017?l=puneetwadhwa.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://puneetwadhwa.blogspot.com/feeds/113096525109660017/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17537533&amp;postID=113096525109660017' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/113096525109660017'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/17537533/posts/default/113096525109660017'/><link rel='alternate' type='text/html' href='http://puneetwadhwa.blogspot.com/2005/11/rna-interference-and-gene-silencing.html' title='RNA Interference and Gene silencing'/><author><name>Puneet Wadhwa</name><uri>http://www.blogger.com/profile/03531176644920939774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='22' src='http://www.puneetwadhwa.com/puneet.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17537533.post-113095580601438260</id><published>2005-11-02T12:20:00.000-06:00</published><updated>2005-11-02T12:24:44.166-06:00</updated><title type='text'>Article about careers in Bioinformatics</title><content type='html'>Hey Readers,&lt;br /&gt;&lt;br /&gt;A warm welcome again and wish you all a Happy Diwali! (An Indian festival of lights and Goddess lakshmi - the goddess of wealth)&lt;br /&gt;&lt;br /&gt;I found a pretty encouraging article about career opportunities in Bioinformatics and it is attached below. It is located at &lt;a href="http://sciencecareers.sciencemag.org/feature/cperspec/biosci.shl"&gt;http://sciencecareers.sciencemag.org/feature/cperspec/biosci.shl&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Without further ado, here you go:&lt;br /&gt;-----------------------------------------------------------------------&lt;br /&gt;&lt;br /&gt;Bioinformatics, the use of computer technology to manage biological information, made its spectacular debut a few years ago, as the first trickles of gene sequence information from the Human Genome Program (HGP) and other sequencing projects grew into a deluge. Individuals with the skills to work on the interface between molecular biology and computer science instantly became some of the most sought-after job applicants in the biopharma world. 
With about 3 billion base pairs on its agenda, and a target completion date of 2005, HGP alone should foster a continuing explosion of data and a robust job market for computational biologists.&lt;br /&gt;"Career opportunities in bioinformatics are very, very good," said John M. Greene, senior staff scientist, bioinformatics research, at Gene Logic Inc., Gaithersburg, Maryland. "It seems that every time you turn around a company has decided to set up a bioinformatics group, or expand an existing group. Many scientists are turning their careers in this direction."&lt;br /&gt;But Greene notes that breaking into the field may not be as simple as all the talk about a feeding frenzy for personnel suggests. He cites the common misperception that a person can take a course in C, the programming language, acquire some database knowledge, and be deluged with high-paying job offers. Salaries around the six-figure mark are possible in bioinformatics, but getting them or even an entry-level position requires more planning than was common in the past.&lt;br /&gt;&lt;br /&gt;Not many of today's bioinformatics people planned it. Many started out doing something else, entered the field before it had a name, and learned key skills on the job. Some were computer scientists who learned biology. Others were life scientists who learned computing.&lt;br /&gt;After getting a Ph.D. in genetics from Harvard University, Greene did a postdoc, and worked for almost a year at a start-up antisense company. His career path led to Human Genome Sciences (HGS) and a job that involved substantial Basic Local Alignment Search Tool (BLAST) analysis on expressed sequence tags (ESTs) to identify genes with possible medical applications. BLAST programs are basic tools for searching DNA and protein databases for sequence similarities. Greene liked the work and finally switched into bioinformatics full time at HGS. 
He recently moved up to Gene Logic, which offers pharmaceutical companies technology to speed up development of drug targets. Gene Logic has a proprietary technology that identifies changes in gene expression associated with disease. It is developing a flow-through DNA chip to gauge drug efficacy and toxicity by analyzing gene changes, and an object-oriented database of gene expression patterns to identify new drug targets.&lt;br /&gt;&lt;br /&gt;Strongest demand today exists for individuals with degrees in the life sciences and computer sciences, and multiple years of programming and database development experience, Greene says. Typical combinations include a Ph.D. in molecular biology, cell biology, or biochemistry and a B.S. in computer sciences. Life science Ph.D's, largely self-taught in key computer skills, with industry experience, have good opportunities. People who emerge from the few doctoral programs in bioinformatics also will be "incredibly marketable," especially those with industry experience. This range of individuals, very difficult to find, often wind up heading bioinformatics departments or programs.&lt;br /&gt;&lt;br /&gt;At the staff scientist and senior staff scientist levels, biopharma companies now tend to place emphasis on applicants with computer science skills. That's largely because databases and search tools are still being developed. Greene thinks that emphasis will shift in a few years to interpreting information in databases. Companies will then look for individuals who first and foremost are biologists but have key computational skills.&lt;br /&gt;&lt;br /&gt;What are those skills? Greene's list includes knowledge of UNIX, the operating system used for many computational biology programs; a good grasp of the concept of relational databases, which are the heart of bioinformatics; and skill with Structured Query Language (SQL), a language used to query databases. 
In the future, knowledge of object-oriented databases may be increasingly important. Programming skills also are essential. Skills with C, the programming language, will help individuals learn Perl, the scripting language widely used in bioinformatics. Object-oriented languages, such as Java, will be increasingly important. Expert knowledge of sequence-analysis programs like BLAST and FASTA is critical. Web skills, of course, are necessary, including the ability to write some Hypertext Markup Language (HTML).&lt;br /&gt;&lt;br /&gt;What gives one applicant an edge over another? Recruiters get excited over applicants who have applied computational biology skills in a practical way. The individual who wrote a program, for instance, and used it in thesis or postdoctoral work, might have an advantage over a similar individual who just took programming courses.&lt;br /&gt;&lt;br /&gt;"For individuals who thrill at being on the cutting edge of science, with the skills to excel in two very different worlds, bioinformatics can be an extraordinarily good career," Greene said. 
"For me, the switch was the best step I've taken in the last decade."&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17537533-113095580601438260?l=puneetwadhwa.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://puneetwadhwa.blogspot.com/feeds/113095580601438260/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17537533&amp;postID=113095580601438260' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/113095580601438260'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/113095580601438260'/><link rel='alternate' type='text/html' href='http://puneetwadhwa.blogspot.com/2005/11/article-about-careers-in.html' title='Article about careers in Bioinformatics'/><author><name>Puneet Wadhwa</name><uri>http://www.blogger.com/profile/03531176644920939774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='22' src='http://www.puneetwadhwa.com/puneet.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17537533.post-113172858320210461</id><published>2005-11-01T11:00:00.000-06:00</published><updated>2005-11-11T11:03:55.143-06:00</updated><title type='text'>Syndication of articles from my Bioinformatics Blog</title><content type='html'>Dear Friends.&lt;br /&gt;&lt;br /&gt;I would like to offer syndication of my articles on any website, that may want to display Bioinformatics related articles, tutorials and career information pertaining to Life Sciences.&lt;br /&gt;&lt;br /&gt;If you like the content you see here, you can also show it in your website, provided you link back to my Blog and quote the original source of information. 
If you are interested in link exchange possibilities, feel free to send me an email at &lt;a href="mailto:pwadhwa@gmail.com"&gt;pwadhwa@gmail.com&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17537533-113172858320210461?l=puneetwadhwa.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://puneetwadhwa.blogspot.com/feeds/113172858320210461/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17537533&amp;postID=113172858320210461' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/113172858320210461'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/113172858320210461'/><link rel='alternate' type='text/html' href='http://puneetwadhwa.blogspot.com/2005/11/syndication-of-articles-from-my.html' title='Syndication of articles from my Bioinformatics Blog'/><author><name>Puneet Wadhwa</name><uri>http://www.blogger.com/profile/03531176644920939774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='22' src='http://www.puneetwadhwa.com/puneet.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17537533.post-113045510080738758</id><published>2005-10-27T18:12:00.000-05:00</published><updated>2005-11-01T17:02:29.906-06:00</updated><title type='text'>Multiple Sequence Alignment of DNA and Proteins - An introduction</title><content type='html'>&lt;u&gt;&lt;/u&gt;&lt;br /&gt;&lt;u&gt;Introduction:&lt;/u&gt;&lt;br /&gt;In some of the previous articles on BLAST, we went through the basic principles of sequence alignment. 
In this article, we will look at some of the principles of multiple sequence alignment and also explore some of the common software used for it.&lt;br /&gt;&lt;br /&gt;&lt;u&gt;Why perform Multiple Sequence Alignment:&lt;/u&gt;&lt;br /&gt;First, let us look at why you would want to do multiple sequence alignment at all. Multiple alignment can be used to study evolutionary relationships between related proteins. Since the changes between gene sequences due to evolution are incremental, we can take homologous genes, i.e. genes with a common evolutionary origin, from a diverse range of organisms and then compare them by aligning identical or similar residues. The comparison of these related genes may then be used to study which regions of the genes have been conserved over time and which are more susceptible to mutation. This is very useful for designing experiments to test and modify the function of specific proteins, for predicting the function and structure of proteins, and for identifying new members of protein families.&lt;br /&gt;&lt;br /&gt;&lt;u&gt;Multiple Sequence Alignment programs and techniques:&lt;/u&gt;&lt;br /&gt;&lt;u&gt;&lt;/u&gt;&lt;br /&gt;Progressive strategies for multiple alignment: A common approach for multiple sequence alignment is to progressively align pairs of sequences. First, two sequences are selected and aligned together; this alignment is then used to align each subsequent sequence.&lt;br /&gt;&lt;br /&gt;One of the most popular programs for multiple sequence alignment is ClustalW. It is a general purpose multiple alignment program for DNA or proteins. It calculates the best match for the selected sequences, and lines them up so that the similarities and differences can be seen. 
It also generates a cladogram which can be useful for studying the evolutionary relationships between the set of sequences.&lt;br /&gt;&lt;br /&gt;You may run the ClustalW program either by downloading and installing it on your local machine, or by running it online at &lt;a href="http://www.ebi.ac.uk/clustalw/"&gt;http://www.ebi.ac.uk/clustalw/&lt;/a&gt;. To download the software, you may visit the following location: &lt;a href="ftp://ftp.ebi.ac.uk/pub/software/unix/clustalw/"&gt;ftp://ftp.ebi.ac.uk/pub/software/unix/clustalw/&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;We will look at how to run ClustalW using EBI's online ClustalW server, at the above location. To run ClustalW, you need a set of sequences in FASTA format, which is nothing but a header line beginning with ">", followed by the sequence name/description, with the sequence itself starting on the next line.&lt;br /&gt;&lt;br /&gt;After entering your sequences, leave the rest of the parameters at their defaults; if you want, you may enter your email address so that the results can be emailed to you. After ClustalW finishes running, it produces four files: Output file (.output), Alignment file - plain text version (.aln), Guide tree file (.dnd), and your input file (.input). ClustalW also shows the result in the form of a phylogenetic tree or a cladogram, which can be chosen from the option menu (right-click) of the Java applet.&lt;br /&gt;&lt;br /&gt;The difference between a cladogram and a phylogenetic tree is this: a phylogenetic tree is a branching diagram (tree) in which branch lengths are proportional to the amount of inferred evolutionary change. 
A Cladogram is a tree where the branches are of equal length, thus cladograms show common ancestry, but do not indicate the amount of evolutionary "time".&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17537533-113045510080738758?l=puneetwadhwa.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://puneetwadhwa.blogspot.com/feeds/113045510080738758/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17537533&amp;postID=113045510080738758' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/113045510080738758'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/113045510080738758'/><link rel='alternate' type='text/html' href='http://puneetwadhwa.blogspot.com/2005/10/multiple-sequence-alignment-of-dna-and.html' title='Multiple Sequence Alignment of DNA and Proteins - An introduction'/><author><name>Puneet Wadhwa</name><uri>http://www.blogger.com/profile/03531176644920939774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='22' src='http://www.puneetwadhwa.com/puneet.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17537533.post-112982818785293898</id><published>2005-10-20T19:43:00.000-05:00</published><updated>2005-10-27T18:18:03.660-05:00</updated><title type='text'>An introduction to BLAST - Basic Local Alignment Search Tool!</title><content type='html'>&lt;p&gt;Hey friends:&lt;br /&gt;&lt;br /&gt;BLAST is an acronym for &lt;strong&gt;Basic Local Alignment Search Tool&lt;/strong&gt;, and it consists of a set of algorithms for comparing biological sequences such as nucleotides or protein sequences. 
A nucleotide sequence is nothing but a DNA sequence (or part of one) expressed as a long string of 4 characters: A, T, C and G. They stand for Adenine, Thymine, Cytosine and Guanine. So, every nucleotide sequence consists of only these four characters arranged in different orders.&lt;br /&gt;&lt;br /&gt;BLAST allows you to compare your sequence against a database of sequences and informs you if your sequence matches any of the sequences in the database, along with a lot of information like: &lt;/p&gt;&lt;ul&gt;&lt;li&gt;Identity of match (% of characters matched) &lt;/li&gt;&lt;li&gt;Alignment length (over what length did the nucleotides match)&lt;/li&gt;&lt;li&gt;E-value (Expectation value. The number of different alignments with scores equivalent to or better than the observed score, S, that are expected to occur in a database search by chance. The lower the E-value, the more significant the score)&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;For a complete BLAST glossary you may visit &lt;a href="http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html"&gt;http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html&lt;/a&gt;&lt;/p&gt;&lt;p&gt;So, now that you know BLAST can be used to align two sequences and to study the similarity between two or more sequences, let us look briefly into the principles of sequence alignment.&lt;/p&gt;&lt;p&gt;Sequence alignment refers to arranging two sequences so that their similar portions line up and are highlighted. &lt;/p&gt;&lt;p&gt;For example: &lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;AGCTATGGGCAAATTTGGAACAAACCAAAAAGT&lt;br /&gt;........ ........ 
...............&lt;br /&gt;AGCTATGGACAAATTTGCAACAAACCAAAAAGT&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:Courier New;"&gt;The positions at which the two sequences do not match are shown by gaps in the middle line of the alignment.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:Courier New;"&gt;&lt;em&gt;Global Alignment: &lt;/em&gt;It refers to an alignment in which all the characters in both sequences participate.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:Courier New;"&gt;&lt;em&gt;Local Alignment: &lt;/em&gt;It refers to finding closely matching regions between sequences. In local alignment the beginning part (say nucleotides 1-100) of one sequence may align with the ending part of another sequence (say 400-500).&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:Courier New;"&gt;Links:&lt;/span&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;A very interesting article on Sequence Alignment can be read here: &lt;a href="http://www.answers.com/sequence%20alignment"&gt;http://www.answers.com/sequence%20alignment&lt;/a&gt;&lt;/li&gt;&lt;li&gt;NCBI's BLAST tool can be found at &lt;a href="http://www.ncbi.nlm.nih.gov/blast/"&gt;http://www.ncbi.nlm.nih.gov/blast/&lt;/a&gt;&lt;/li&gt;&lt;li&gt;An article on the methodology behind BLAST: &lt;a href="http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html"&gt;http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;----------------------------------------------------------------------------------------&lt;br /&gt;Puneet Wadhwa&lt;/p&gt;&lt;p&gt;&lt;a href="http://www.puneetwadhwa.com"&gt;www.puneetwadhwa.com&lt;/a&gt; &lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17537533-112982818785293898?l=puneetwadhwa.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://puneetwadhwa.blogspot.com/feeds/112982818785293898/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17537533&amp;postID=112982818785293898' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/112982818785293898'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/112982818785293898'/><link rel='alternate' type='text/html' href='http://puneetwadhwa.blogspot.com/2005/10/introduction-to-blast-basic-local.html' title='An introduction to BLAST - Basic Local Alignment Search Tool!'/><author><name>Puneet Wadhwa</name><uri>http://www.blogger.com/profile/03531176644920939774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='22' src='http://www.puneetwadhwa.com/puneet.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17537533.post-112985047551719911</id><published>2005-10-20T18:19:00.000-05:00</published><updated>2005-10-20T18:21:15.516-05:00</updated><title type='text'>Sample fasta file format</title><content type='html'>&lt;span style="font-family:arial;font-size:78%;"&gt;&gt;lclBOB1ONE&lt;br /&gt;CATGGATTCAGCAGCAGCGAACTCGCCAATGTAGTGGGTGGCACAGCCAG&lt;br /&gt;GGTCTTGACTCTGGCTCTGCAGTAGCACAGTCTGGAAAAGCTCTGAGGGG&lt;br /&gt;AGAGAGACCCCCACTGGTCCGAGGGTCTGGCACAGAGCCAGAAATGGGGG&lt;br /&gt;GGAAGGTATGAGGCTGGGTCGCCTCTGACCTCTCAGGTACCATCCAGGAG&lt;br /&gt;GCCCTGGCCTCTCACTGAACCCGGCCACTCCTCTTTGGCATGGCCTCTTC&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:arial;font-size:78%;"&gt;&gt;lclBOB2TWO&lt;br /&gt;CCTGGAAGCTCTTGGGGGGCATATCTGGTGGGGAGAAAGCAGGGGTTGGG&lt;br /&gt;GAGGCCGAAGAAGGTCAGGCCCTCAGCTGCCTTCATCAGTTCCCACCCTC&lt;br /&gt;CAGCCCCCAACTCCTCCTGCAGACAAGCTGGTGTCTAAGAACTACCCGGA&lt;br /&gt;CCTGTCCTTGGGAGACTACTCCCTGCTCTGGAAAGCCCACAAGAAGCTCA&lt;br 
/&gt;CCCGCTCAGCCCTGCTGCTGGGCATCCGTGACTCCATGGAGCCAGTGGTG&lt;br /&gt;GAGCAGCTGACCCAGGAGTTCTGTGAGCGCATGAGAGCCCAGCCCGGCAC&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:arial;font-size:78%;"&gt;&gt;lclBOB3THREE&lt;br /&gt;CCTGGAAGCTCTTGGGGGGCATATCTGGTGGGGAGAAAGCAGGGGTTGGG&lt;br /&gt;GAGGCCGAAGAAGGTCAGGCCCTCAGCTGCCTTCATCAGTTCCCACCCTC&lt;br /&gt;CAGCCCCCAACTCCTCCTGCAGACAAGCTGGTGTCTAAGAACTACCCGGA&lt;br /&gt;CCTGTCCTTGGGAGACTACTCCCTGCTCTGGAAAGCCCACAAGAAGCTCA&lt;br /&gt;CCCGCTCAGCCCTGCTGCTGGGCATCCGTGACTCCATGGAGCCAGTGGTG&lt;br /&gt;GAGCAGCTGACCCAGGAGTTCTGTGAGCGCATGAGAGCCCAGCCCGGCAC&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17537533-112985047551719911?l=puneetwadhwa.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://puneetwadhwa.blogspot.com/feeds/112985047551719911/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17537533&amp;postID=112985047551719911' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/112985047551719911'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/112985047551719911'/><link rel='alternate' type='text/html' href='http://puneetwadhwa.blogspot.com/2005/10/sample-fasta-file-format.html' title='Sample fasta file format'/><author><name>Puneet Wadhwa</name><uri>http://www.blogger.com/profile/03531176644920939774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='22' src='http://www.puneetwadhwa.com/puneet.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17537533.post-112984877114472594</id><published>2005-10-20T17:50:00.000-05:00</published><updated>2005-10-20T18:28:43.906-05:00</updated><title 
type='text'>Creating and using custom BLAST sequence databases</title><content type='html'>Hi again!&lt;br /&gt;&lt;br /&gt;This post is based on my recent experience dealing with 'blasting' custom nucleotide or protein sequences with &lt;a href="http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html"&gt;NCBI's blast &lt;/a&gt;tool.&lt;br /&gt;&lt;br /&gt;If you are a beginner in Bioinformatics, and would like to know more about BLAST, please read my earlier post on "&lt;a href="http://puneetwadhwa.blogspot.com/2005/10/introduction-to-blast-basic-local.html"&gt;An Introduction to BLAST&lt;/a&gt;", before you delve into deeper and juicier topics such as creating custom BLAST databases and comparing your sequences against those databases. So, hang on tight and you will be blasting in no time!&lt;br /&gt;&lt;br /&gt;First things first: you need to download and install BLAST on your computer or a server. BLAST can be downloaded at &lt;a href="http://www.ncbi.nlm.nih.gov/blast/download.shtml"&gt;http://www.ncbi.nlm.nih.gov/blast/download.shtml&lt;/a&gt;. Follow the instructions on the NCBI website to install the BLAST tool after downloading it.&lt;br /&gt;&lt;br /&gt;To create a custom BLAST database, you need a simple FASTA file consisting of a header in a particular format (discussed below), followed by the nucleotide or protein sequence.&lt;br /&gt;The header of a FASTA file begins with a ">" character, followed by header information. 
The custom sequence's FASTA file is then converted into a BLAST database by a tool called formatdb (which ships with NCBI's blast package, downloadable at &lt;a href="http://www.ncbi.nlm.nih.gov/Ftp/"&gt;http://www.ncbi.nlm.nih.gov/Ftp/&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;The formatdb command has several options, and the entire readme file may be viewed &lt;a href="http://bioinformatics.ubc.ca/resources/tools/?name=formatdb"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;We are going to look at the most common formatdb command options, and the most common header formats for formatting custom databases.&lt;br /&gt;&lt;br /&gt;&lt;u&gt;COMMAND FOR FORMATDB&lt;/u&gt;:&lt;br /&gt;formatdb -i input_db -p F -o T &lt;u&gt;for nucleotide&lt;/u&gt;&lt;br /&gt;formatdb -i input_db -p T -o T &lt;u&gt;for protein&lt;/u&gt;&lt;br /&gt;&lt;br /&gt;-i option is used to specify the name of the input FASTA file&lt;br /&gt;-p option is used to specify type of file (T - protein, F - nucleotide [T/F]; default = T)&lt;br /&gt;&lt;br /&gt;Note on -o option from the FORMATDB README: "It is always advantageous to use the '-o' option if the database identifiers are in the format specified at &lt;a href="ftp://ftp.ncbi.nih.gov/blast/db"&gt;ftp://ftp.ncbi.nih.gov/blast/db&lt;/a&gt;. If the database identifiers are in this parseable format, formatdb produces additional indices allowing retrieval from the databases by identifier. The databases on the NCBI FTP site contain parseable identifiers."&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;For constructing custom databases with the above command, there are certain rules about the header format.&lt;br /&gt;&lt;br /&gt;1.) ID's of type "local" or "general" should be used. This means that the ID's will have the syntax "lcl|IDENTIFIER" (for "local") or "gnl|DATABASE|IDENTIFIER" (for "general"). The tokens DATABASE and IDENTIFIER should be assigned by the user here. 
The local ID has only one user-provided token; the general ID requires two. The fields are separated by vertical bars ("|").&lt;br /&gt;&lt;br /&gt;2.) Letters, numbers, underscores ("_"), dashes, and periods may be used. Uppercase and lowercase letters are treated as being distinct. No spaces are allowed in the ID; a space indicates the end of the ID.&lt;br /&gt;&lt;br /&gt;3.) All ID's should be unique, if the entire ID is examined. As an example consider the following four ID's:&lt;br /&gt;gnl|H.sapiens|seq1&lt;br /&gt;gnl|H.sapiens|seq2&lt;br /&gt;gnl|M.Mus|seq1&lt;br /&gt;lcl|seq1&lt;br /&gt;&lt;br /&gt;All of these ID's are considered unique. The first two might be sequences one and two of a collection of Human sequences; the third might be the first sequence in a collection of mouse sequences; the fourth is simply identified as the first sequence.&lt;br /&gt;&lt;br /&gt;I recommend using either the &lt;em&gt;gnl &lt;/em&gt;or &lt;em&gt;lcl &lt;/em&gt;option. Some valid header formats are:&lt;br /&gt;&lt;br /&gt;>&lt;span style="font-family:arial;"&gt;lcl|BOB1|ONE&lt;/span&gt;&lt;br /&gt;>lcl|29&lt;br /&gt;and so on.&lt;br /&gt;&lt;br /&gt;
Once you have your FASTA file ready (&lt;/span&gt;&lt;a href="http://puneetwadhwa.blogspot.com/2005/10/sample-fasta-file-format.html"&gt;&lt;span style="font-family:arial;"&gt;see example&lt;/span&gt;&lt;/a&gt;&lt;span style="font-family:arial;"&gt;), use the formatdb command as discussed above to create a BLAST database.&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17537533-112984877114472594?l=puneetwadhwa.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://puneetwadhwa.blogspot.com/feeds/112984877114472594/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17537533&amp;postID=112984877114472594' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/112984877114472594'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/112984877114472594'/><link rel='alternate' type='text/html' href='http://puneetwadhwa.blogspot.com/2005/10/creating-and-using-custom-blast.html' title='Creating and using custom BLAST sequence databases'/><author><name>Puneet Wadhwa</name><uri>http://www.blogger.com/profile/03531176644920939774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='22' src='http://www.puneetwadhwa.com/puneet.jpg'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17537533.post-112863170609933569</id><published>2005-10-06T15:48:00.000-05:00</published><updated>2005-10-07T10:34:09.186-05:00</updated><title type='text'>What is Bioinformatics ?</title><content type='html'>Bioinformatics is an interdisciplinary science 
that combines techniques from the life sciences and computer science to solve complex biological problems. According to the dictionary, Bioinformatics is "Information technology as applied to the life sciences, especially the technology used for the collection, storage, and retrieval of genomic data". The term Bioinformatics has recently become a buzzword at companies large and small. Major research efforts in the field include sequence alignment, gene prediction and genome assembly. It draws on recent advances in computer science, such as data warehousing and data mining, and in the mathematical sciences.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17537533-112863170609933569?l=puneetwadhwa.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://puneetwadhwa.blogspot.com/feeds/112863170609933569/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17537533&amp;postID=112863170609933569' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/112863170609933569'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/112863170609933569'/><link rel='alternate' type='text/html' href='http://puneetwadhwa.blogspot.com/2005/10/what-is-bioinformatics.html' title='What is Bioinformatics ?'/><author><name>Puneet Wadhwa</name><uri>http://www.blogger.com/profile/03531176644920939774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='22' 
src='http://www.puneetwadhwa.com/puneet.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17537533.post-112862749746110483</id><published>2005-10-06T14:19:00.000-05:00</published><updated>2005-11-12T15:02:35.266-06:00</updated><title type='text'>Welcome!</title><content type='html'>&lt;a href="http://photos1.blogger.com/blogger/5487/699/1600/puneet.jpg"&gt;&lt;/a&gt;Hi There,&lt;br /&gt;&lt;br /&gt;A warm welcome to you and thanks for visiting my blog. I intend to use this space to speak my mind, post articles about Bioinformatics (a field I have recently developed a great liking for), and other information pertaining to Computers and Internet in general, including career opportunities.&lt;br /&gt;&lt;br /&gt;Please feel free to post comments to my blog, or to send suggestions or contributions to me at &lt;a href="mailto:pwadhwa@gmail.com"&gt;pwadhwa@gmail.com&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17537533-112862749746110483?l=puneetwadhwa.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://puneetwadhwa.blogspot.com/feeds/112862749746110483/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17537533&amp;postID=112862749746110483' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/112862749746110483'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17537533/posts/default/112862749746110483'/><link rel='alternate' type='text/html' href='http://puneetwadhwa.blogspot.com/2005/10/welcome.html' title='Welcome!'/><author><name>Puneet Wadhwa</name><uri>http://www.blogger.com/profile/03531176644920939774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='22' 
src='http://www.puneetwadhwa.com/puneet.jpg'/></author><thr:total>0</thr:total></entry></feed>
