Puneet Wadhwa's BIOINFORMATICS BLOG

Friday, March 24, 2006

Extracting ORF (CDS portion) from Sequence using Bioperl

Hey Readers:

I have been recently wrestling with creation of an ORF Database from Refseq Human Sequences, by extracting out the CDS portion of the sequence from the Genbank record. I needed this database for my project since we were analyzing our proprietary Incyte Gene Collection (IGC) for exact matches to Refseq sequences over the ORF (CDS) region.

Building this database was not easy, and I hope others could learn from this experience of mine...

I first downloaded the RAW Genbank files (.gbff) and then extracted them into a folder. I then searched these files using unix commands like grep etc. to find out which files contained Human species. And then, I wrote a regular expressions based C#.NET GBFF Parser, which used regular expressions to extract out the features from the record like gene name, definition, locus, CDS, species etc. I then grabbed the sequence from the Genbank record using my parser, and then stripped out the CDS part using my program, from the CDS begin to the CDS end locations on the sequence. This was NOT very easy and certainly not the best way.

I now have a surefire way of extracting ORF's from the Genbank records, and it is very elegant, uses very less amount of code, and has been written in Bioperl. Attached below, is the code to do the same. However, is you use or distribute this code, please leave its header and author information intact, and provide a link back to my website.

!/usr/bin/perl

#===========================================
# ORF DATABASE BUILDER UTILITY - Puneet Wadhwa #
# http://puneetwadhwa.blogspot.com * pwadhwa@gmail.com #
#-------------------------------------------
# The purpose of this utility is to fetch the Genbank #
# record live from the Genbank website, and strip out #
# and display the CDS sequence from the record. #
# Takes a list of NCBI Accessions as input and outputs a #
# fasta record. #
# USAGE: cat input_file ./GenerateOrfDatabase.pl #
# Input file contains NCBI accessions (one per line) for #
# which the CDS is to be extracted. #
#===============================================

use Bio::DB::GenBank;
use Bio::SeqIO;
use Bio::SeqIO::swiss;
use Bio::Seq;
use Bio::SeqIO::FTHelper;
use Bio::SeqFeature::Generic;
use Bio::Species;
use Bio::AnnotationI;
use Bio::FeatureHolderI;
use Bio::SeqFeatureI;

my $gb = new Bio::DB::GenBank;
my $seqout = new Bio::SeqIO(-fh => \*STDOUT, -format => 'fasta');

%acc_hash = 0;

while (<>)
{
chomp;
$acc = $_;

my $seq_obj = $gb->get_Seq_by_acc($acc);

foreach my $feat ( $seq_obj->top_SeqFeatures )
{
if ( $feat->primary_tag eq 'CDS' )
{
my $cds_obj = $feat->spliced_seq;
print ">".$seq_obj->display_id()."\n".$cds_obj->seq."\n";
}
}
}
0;

7 Comments:

  • Hey William,

    Python is another very actively used language for Bioinformatics tasks. I have never used it so I do not know how simple it is as compared to Perl etc. But I feel Perl is more popular amongst Bioinformaticists.. just my opinion :)

    By Blogger Puneet Wadhwa, at 4:55 PM  

  • thanks for making this script available, saved me a lot of time!

    By Anonymous Anonymous, at 5:59 AM  

  • It's amazing to see how someone working in biotechnology should learn programming as well. You should be at ease in several domains today to be capable; programming, biotechnology, etc...

    Armand Rousso
    http://biotechnology.armandrousso.biz/

    By Blogger Armand Rousso, at 11:59 PM  

  • Python being used in biotechnology? Like Armand said you should be versatile in this job.

    http://corporateman.wordpress.com

    By Anonymous Anonymous, at 12:04 AM  

  • how much the python language will take time an how much it is useful for us i m very curious to know abt it please help me

    By Blogger afzal hussain, at 4:32 PM  

  • python is really very useful for further bioinformatics research???

    By Blogger afzal hussain, at 4:33 PM  

  • I would like to suggest to you all about Sequilab, a sequence profiling portal with web 2.0 features built into it. The main thing here is that you can link your NCBI-BLAST results to a list of sequence analysis tools like translate, ORF, CpG islands, View structure (Cn3D), mfold (for RNA).

    Users can also organize their prveiously carried out projects and discuss with peers in communities if needed.

    The portal can be accessed by visiting http://www.sequilab.org. Registration is free and a one step process

    By Blogger Nattu, at 10:42 AM  

Post a Comment

<< Home