Puneet Wadhwa's BIOINFORMATICS BLOG

Friday, March 24, 2006

Extracting ORF (CDS portion) from Sequence using Bioperl

Hey Readers:

I have been recently wrestling with creation of an ORF Database from Refseq Human Sequences, by extracting out the CDS portion of the sequence from the Genbank record. I needed this database for my project since we were analyzing our proprietary Incyte Gene Collection (IGC) for exact matches to Refseq sequences over the ORF (CDS) region.

Building this database was not easy, and I hope others could learn from this experience of mine...

I first downloaded the RAW Genbank files (.gbff) and then extracted them into a folder. I then searched these files using unix commands like grep etc. to find out which files contained Human species. And then, I wrote a regular expressions based C#.NET GBFF Parser, which used regular expressions to extract out the features from the record like gene name, definition, locus, CDS, species etc. I then grabbed the sequence from the Genbank record using my parser, and then stripped out the CDS part using my program, from the CDS begin to the CDS end locations on the sequence. This was NOT very easy and certainly not the best way.

I now have a surefire way of extracting ORF's from the Genbank records, and it is very elegant, uses very less amount of code, and has been written in Bioperl. Attached below, is the code to do the same. However, is you use or distribute this code, please leave its header and author information intact, and provide a link back to my website.

!/usr/bin/perl

#===========================================
# ORF DATABASE BUILDER UTILITY - Puneet Wadhwa #
# http://puneetwadhwa.blogspot.com * pwadhwa@gmail.com #
#-------------------------------------------
# The purpose of this utility is to fetch the Genbank #
# record live from the Genbank website, and strip out #
# and display the CDS sequence from the record. #
# Takes a list of NCBI Accessions as input and outputs a #
# fasta record. #
# USAGE: cat input_file ./GenerateOrfDatabase.pl #
# Input file contains NCBI accessions (one per line) for #
# which the CDS is to be extracted. #
#===============================================

use Bio::DB::GenBank;
use Bio::SeqIO;
use Bio::SeqIO::swiss;
use Bio::Seq;
use Bio::SeqIO::FTHelper;
use Bio::SeqFeature::Generic;
use Bio::Species;
use Bio::AnnotationI;
use Bio::FeatureHolderI;
use Bio::SeqFeatureI;

my $gb = new Bio::DB::GenBank;
my $seqout = new Bio::SeqIO(-fh => \*STDOUT, -format => 'fasta');

%acc_hash = 0;

while (<>)
{
chomp;
$acc = $_;

my $seq_obj = $gb->get_Seq_by_acc($acc);

foreach my $feat ( $seq_obj->top_SeqFeatures )
{
if ( $feat->primary_tag eq 'CDS' )
{
my $cds_obj = $feat->spliced_seq;
print ">".$seq_obj->display_id()."\n".$cds_obj->seq."\n";
}
}
}
0;

Wednesday, March 15, 2006

A thousand clones...

Hey Readers:

Following is the latest article about my project which appeared on my company website's blog. We have now finished the complete study and analysis of annotating the Incyte Gene Collection and have discovered 1072 novel genes which are not available in any other commercial collections.

Here is the article:

A thousand clones

To be exact,1072. That is the number of human Incyte Gene Collection (IGC) clones that we found to contain an exact match to an entire RefSeq CDS (January 6, 2006 release), but for which there is no exact match to an MGC clone (February 23, 2006 release). At the risk of sounding self-congratulatory: I told you so! (See previous blog.)In this collection, which I will refer to as the IGC Non-MGC Set, you will find both novel clones and NOVEL CLONES. In many cases, there are one or more close MGC counterparts that differ from the IGC/RefSeq sequence by only one or two base pairs. Some may well be legitimate single nucleotide polymorphisms (SNPs) that did not have the good fortune to be included in RefSeq. These may or may not be functionally distinct from the IGC clone. In other cases, the closest MGC counterpart is a splice variant of the IGC/RefSeq sequence and likely codes for a polypeptide with a distinct function. In still other cases, there is no MGC counterpart within the same UniGene cluster.Never mind all that, you might be thinking with some impatience. How many druggable genes are there? I have not yet attempted grouping into gene families, but I did quickly spot some caspases, cytochrome p450s, and an adenylate cyclase. I encourage you to have a look for yourself by downloading the new spreadsheet from our website by again navigating to Genomics > Mammalian Resources > cDNAs > Incyte Gene Collection and clicking on the data file icon under the ordering information for IHS1380. You will note that there are actually 1135 line items, because there are sometimes multiple RefSeq accessions corresponding to the same CDS.By the way, all of the IGC clones containing an exact match to an entire RefSeq CDS (a total of 4116 clones) can now be found using our online clone query when searching by RefSeq accession, gene symbol, or UniGene cluster. So next time you search on our website, you are even more likely to turn up a useful clone. More to come…

Thursday, March 02, 2006

Waking a sleeping giant: annotating the Incyte Gene Collection (IGC)

Hey Readers:

Following is an article about my project which appeared on my company website's blog. We have got some great results from mining the IGC collection, and I am very very excited to share them with you.

Here it goes:
--------------
With over one million cDNAs for human, rat, monkey, and dog, the IGC is a monster collection that, on statistical grounds, is certain to contain some good stuff. Consider this: of the more than 400,000 human cDNAs, Incyte has categorized 11,377 as full length and 16,756 as potentially full length. However, while these latter clone sets have been fully sequenced, they were never submitted to GenBank and no annotations (if they ever existed) were passed along to Open Biosystems. Currently, the only way to mine the IGC is by BLASTing query sequences online against our Incyte clone database. The IGC is a potentially valuable resource, but with largely (and frustratingly) unknown content.

To enable better exploitation of the IGC, we have begun a high throughput annotation program with these goals: (1) to associate human RefSeq accession numbers with IGC clones when this can be done with high confidence and (2) to discover which human RefSeq-associated IGC sequences are not found in the Mammalian Gene Collection (MGC). I’d like to share with you some preliminary results from our pilot study.

We began by filtering the ~28,000 full-length and potential full-length human cDNAs by size and selecting the set of 367 sequences that are 3 kilobases or longer. These sequences were then BLASTed against every CDS in human RefSeq and filtered for 100% identity. Even at this high stringency there were 118 “hits”, that is, IGC sequences that contain a complete human RefSeq CDS. The 118 coding sequences were then BLASTed against the MGC, yielding 47 hits at 100% identity. Taking into account a few cases in which the same IGC sequence corresponded to two or more RefSeq accessions, there were 64 RefSeq-certified IGC cDNAs not found in the MGC.

For 25 of the brave new 64, there was no MGC clone within the same UniGene cluster; for the 39 others, one or more MGC clones were found in the same cluster. However, spot-checking of these MGC sequences confirmed that they are either apparent splice variants of the RefSeq CDS or contain single nucleotide substitutions. So far, so good*. If you are curious about these preliminary results, a spreadsheet can be downloaded from our website by navigating to www.OpenBiosystems.com > Genomics > Mammalian Resources > cDNAs > Incyte Gene Collection and clicking on the data file icon under the ordering information for IHS1380.

We have already begun BLASTing the larger set of 28,000 against the RefSeq coding sequences. Our goal is to identify 1000 RefSeq-certified human IGC cDNAs that are not in the MGC. Eventually we hope to dig deeper by identifying IGC cDNAs that are not perfect matches to any RefSeq CDS, but represent plausible splice variants or SNPs. I am also intrigued by the possibility of identifying IGC cDNAs that are completely outside RefSeq—representing entirely novel genes. We shall see!